Tag Archives: Amazon Elastic Kubernetes Service

Introducing ACK controller for Amazon EMR on EKS

Post Syndicated from Peter Dalbhanjan original https://aws.amazon.com/blogs/big-data/introducing-ack-controller-for-amazon-emr-on-eks/

AWS Controllers for Kubernetes (ACK) was announced in August, 2020, and now supports 14 AWS service controllers as generally available with an additional 12 in preview. The vision behind this initiative was simple: allow Kubernetes users to use the Kubernetes API to manage the lifecycle of AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets or Amazon Relational Database Service (Amazon RDS) DB instances. For example, you can define an S3 bucket as a custom resource, create this bucket as part of your application deployment, and delete it when your application is retired.

Amazon EMR on EKS is a deployment option for EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS) clusters. With EMR on EKS, the Spark jobs run using the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster and cost less than open source Apache Spark. Also, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same EKS cluster to improve resource utilization and simplify infrastructure management.

Today, we’re excited to announce the ACK controller for Amazon EMR on EKS is generally available. Customers have told us that they like the declarative way of managing Apache Spark applications on EKS clusters. With the ACK controller for EMR on EKS, you can now define and run Amazon EMR jobs directly using the Kubernetes API. This lets you manage EMR on EKS resources directly using Kubernetes-native tools such as kubectl.

The controller pattern has been widely adopted by the Kubernetes community to manage the lifecycle of resources. In fact, Kubernetes has built-in controllers for built-in resources like Jobs or Deployment. These controllers continuously ensure that the observed state of a resource matches the desired state of the resource stored in Kubernetes. For example, if you define a deployment that has NGINX using three replicas, the deployment controller continuously watches and tries to maintain three replicas of NGINX pods. Using the same pattern, the ACK controller for EMR on EKS installs two custom resource definitions (CRDs): VirtualCluster and JobRun. When you create EMR virtual clusters, the controller tracks these as Kubernetes custom resources and calls the EMR on EKS service API (also known as emr-containers) to create and manage these resources. If you want to get a deeper understanding of how ACK works with AWS service APIs, and learn how ACK generates Kubernetes resources like CRDs, see blog post.

If you need a simple getting started tutorial, refer to Run Spark jobs using the ACK EMR on EKS controller. Typically, customers who run Apache Spark jobs on EKS clusters use higher level abstraction such as Argo Workflows, Apache Airflow, or AWS Step Functions, and use workflow-based orchestration in order to run their extract, transform, and load (ETL) jobs. This gives you a consistent experience running jobs while defining job pipelines using Directed Acyclic Graphs (DAGs). DAGs allow you organize your job steps with dependencies and relationships to say how they should run. Argo Workflows is a container-native workflow engine for orchestrating parallel jobs on Kubernetes.

In this post, we show you how to use Argo Workflows with the ACK controller for EMR on EKS to run Apache Spark jobs on EKS clusters.

Solution overview

In the following diagram, we show Argo Workflows submitting a request to the Kubernetes API using its orchestration mechanism.

We’re using Argo to showcase the possibilities with workflow orchestration in this post, but you can also submit jobs directly using kubectl (the Kubernetes command line tool). When Argo Workflows submits these requests to the Kubernetes API, the ACK controller for EMR on EKS reconciles VirtualCluster custom resources by invoking the EMR on EKS APIs.

Let’s go through an exercise of creating custom resources using the ACK controller for EMR on EKS and Argo Workflows.

Prerequisites

Your environment needs the following tools installed:

Install the ACK controller for EMR on EKS

You can either create an EKS cluster or re-use an existing one. We refer to the instructions in Run Spark jobs using the ACK EMR on EKS controller to set up our environment. Complete the following steps:

  1. Install the EKS cluster.
  2. Create IAM Identity mapping.
  3. Install emrcontainers-controller.
  4. Configure IRSA for the EMR on EKS controller.
  5. Create an EMR job execution role and configure IRSA.

At this stage, you should have an EKS cluster with proper role-based access control (RBAC) permissions so that Amazon EMR can run its jobs. You should also have the ACK controller for EMR on EKS installed and the EMR job execution role with IAM Roles for Service Account (IRSA) configurations so that they have the correct permissions to call EMR APIs.

Please note, we’re skipping the step to create an EMR virtual cluster because we want to create a custom resource using Argo Workflows. If you created this resource using the getting started tutorial, you can either delete the virtual cluster or create new IAM identity mapping using a different namespace.

Let’s validate the annotation for the EMR on EKS controller service account before proceeding:

# validate annotation
kubectl get pods -n $ACK_SYSTEM_NAMESPACE
CONTROLLER_POD_NAME=$(kubectl get pods -n $ACK_SYSTEM_NAMESPACE --selector=app.kubernetes.io/name=emrcontainers-chart -o jsonpath='{.items..metadata.name}')
kubectl describe pod -n $ACK_SYSTEM_NAMESPACE $CONTROLLER_POD_NAME | grep "^\s*AWS_"

The following code shows the expected results:

AWS_REGION:                      us-west-2
AWS_ENDPOINT_URL:
AWS_ROLE_ARN:                    arn:aws:iam::012345678910:role/ack-emrcontainers-controller
AWS_WEB_IDENTITY_TOKEN_FILE:     /var/run/secrets/eks.amazonaws.com/serviceaccount/token (http://eks.amazonaws.com/serviceaccount/token)

Check the logs of the controller:

kubectl logs ${CONTROLLER_POD_NAME} -n ${ACK_SYSTEM_NAMESPACE}

The following code is the expected outcome:

2022-11-02T18:52:33.588Z    INFO    controller.virtualcluster    Starting Controller    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "VirtualCluster"}
2022-11-02T18:52:33.588Z    INFO    controller.virtualcluster    Starting EventSource    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "VirtualCluster", "source": "kind source: *v1alpha1.VirtualCluster"}
2022-11-02T18:52:33.589Z    INFO    controller.virtualcluster    Starting Controller    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "VirtualCluster"}
2022-11-02T18:52:33.589Z    INFO    controller.jobrun    Starting EventSource    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "JobRun", "source": "kind source: *v1alpha1.JobRun"}
2022-11-02T18:52:33.589Z    INFO    controller.jobrun    Starting Controller    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "JobRun"}
...
2022-11-02T18:52:33.689Z    INFO    controller.jobrun    Starting workers    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "JobRun", "worker count": 1}
2022-11-02T18:52:33.689Z    INFO    controller.virtualcluster    Starting workers    {"reconciler group": "emrcontainers.services.k8s.aws", "reconciler kind": "VirtualCluster", "worker count": 1}

Now we’re ready to install Argo Workflows and use workflow orchestration to create EMR on EKS virtual clusters and submit jobs.

Install Argo Workflows

The following steps are meant for quick installation with a proof of concept in mind. This is not meant for a production install. We recommend reviewing the Argo documentation, security guidelines, and other considerations for a production install.

We install the argo CLI first. We have provided instructions to install the argo CLI using brew, which is compatible with the Mac operating system. If you use Linux or another OS, refer to Quick Start for installation steps.

brew install argo

Let’s create a namespace and install Argo Workflows on your EMR on EKS cluster:

kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.3/install.yaml

You can access the Argo UI locally by port-forwarding the argo-server deployment:

kubectl -n argo port-forward deploy/argo-server 2746:2746

You can access the web UI at https://localhost:2746. You will get a notice that “Your connection is not private” because Argo is using a self-signed certificate. It’s okay to choose Advanced and then Proceed to localhost.

Please note, you get an Access Denied error because we haven’t configured permissions yet. Let’s set up RBAC so that Argo Workflows has permissions to communicate with the Kubernetes API. We give admin permissions to argo serviceaccount in the argo and emr-ns namespaces.

Open another terminal window and run these commands:

# setup rbac 
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default --namespace=argo
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=argo:default --namespace=emr-ns

# extract bearer token to login into UI
SECRET=$(kubectl get sa default -n argo -o=jsonpath='{.secrets[0].name}')
ARGO_TOKEN="Bearer $(kubectl get secret $SECRET -n argo -o=jsonpath='{.data.token}' | base64 --decode)"
echo $ARGO_TOKEN

You now have a bearer token that we need to enter for client authentication.

You can now navigate to the Workflows tab and change the namespace to emr-ns to see the workflows under this namespace.

Let’s set up RBAC permissions and create a workflow that creates an EMR on EKS virtual cluster:

cat << EOF > argo-emrcontainers-vc-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-emrcontainers-virtualcluster
rules:
  - apiGroups:
      - emrcontainers.services.k8s.aws
    resources:
      - virtualclusters
    verbs:
      - '*'
EOF

cat << EOF > argo-emrcontainers-jr-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-emrcontainers-jobrun
rules:
  - apiGroups:
      - emrcontainers.services.k8s.aws
    resources:
      - jobruns
    verbs:
      - '*'
EOF

Let’s create these roles and a role binding:

# create argo clusterrole with permissions to emrcontainers.services.k8s.aws
kubectl apply -f argo-emrcontainers-vc-role.yaml
kubectl apply -f argo-emrcontainers-jr-role.yaml

# Give permissions for argo to use emr-containers clusterrole
kubectl create rolebinding argo-emrcontainers-virtualcluster --clusterrole=argo-emrcontainers-virtualcluster --serviceaccount=emr-ns:default -n emr-ns
kubectl create rolebinding argo-emrcontainers-jobrun --clusterrole=argo-emrcontainers-jobrun --serviceaccount=emr-ns:default -n emr-ns

Let’s recap what we have done so far. We created an EMR on EKS cluster, installed the ACK controller for EMR on EKS using Helm, installed the Argo CLI, installed Argo Workflows, gained access to the Argo UI, and set up RBAC permissions for Argo. RBAC permissions are required so that the default service account in the Argo namespace can use VirtualCluster and JobRun custom resources via the emrcontainers.services.k8s.aws API.

It’s time to create the EMR virtual cluster. The environment variables used in the following code are from the getting started guide, but you can change these to meet your environment:

export EKS_CLUSTER_NAME=ack-emr-eks
export EMR_NAMESPACE=emr-ns

cat << EOF > argo-emr-virtualcluster.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: emr-virtualcluster
spec:
  arguments: {}
  entrypoint: emr-virtualcluster
  templates:
  - name: emr-virtualcluster
    resource:
      action: create
      manifest: |
        apiVersion: emrcontainers.services.k8s.aws/v1alpha1
        kind: VirtualCluster
        metadata:
          name: my-ack-vc
        spec:
          name: my-ack-vc
          containerProvider:
            id: ${EKS_CLUSTER_NAME}
            type_: EKS
            info:
              eksInfo:
                namespace: ${EMR_NAMESPACE}
EOF

Use the following command to create an Argo Workflow for virtual cluster creation:

kubectl apply -f argo-emr-virtualcluster.yaml -n emr-ns
argo list -n emr-ns

The following code is the expected result from the Argo CLI:

NAME                 STATUS      AGE   DURATION   PRIORITY   MESSAGE
emr-virtualcluster   Succeeded   12m   11s        0 

Check the status of virtualcluster:

kubectl describe virtualcluster/my-ack-vc -n emr-ns

The following code is the expected result from the preceding command:

Name:         my-ack-vc
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  emrcontainers.services.k8s.aws/v1alpha1
Kind:         VirtualCluster
...
Status:
  Ack Resource Metadata:
    Arn:               arn:aws:emr-containers:us-west-2:012345678910:/virtualclusters/dxnqujbxexzri28ph1wspbxo0
    Owner Account ID:  012345678910
    Region:            us-west-2
  Conditions:
    Last Transition Time:  2022-11-03T15:34:10Z
    Message:               Resource synced successfully
    Reason:                
    Status:                True
    Type:                  ACK.ResourceSynced
  Id:                      dxnqujbxexzri28ph1wspbxo0
Events:                    <none>

If you run into issues, you can check Argo logs using the following command or through the console:

argo logs emr-virtualcluster -n emr-ns

You can also check controller logs as mentioned in the troubleshooting guide.

Because we have an EMR virtual cluster ready to accept jobs, we can start working on the prerequisites for job submission.

Create an S3 bucket and Amazon CloudWatch Logs group that are needed for the job (see the following code). If you already created these resources from the getting started tutorial, you can skip this step.

export RANDOM_ID1=$(LC_ALL=C tr -dc a-z0-9 </dev/urandom | head -c 8)

aws logs create-log-group --log-group-name=/emr-on-eks-logs/$EKS_CLUSTER_NAME
aws s3 mb s3://$EKS_CLUSTER_NAME-$RANDOM_ID1

We use the New York Citi Bike dataset, which has rider demographics and trip data information. Run the following command to copy the dataset into your S3 bucket:

export S3BUCKET=$EKS_CLUSTER_NAME-$RANDOM_ID1
aws s3 sync s3://tripdata/ s3://${S3BUCKET}/citibike/csv/

Copy the sample Spark application code to your S3 bucket:

aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2782/citibike-convert-csv-to-parquet.py s3://${S3BUCKET}/application/
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2782/citibike-ridership.py s3://${S3BUCKET}/application/
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2782/citibike-popular-stations.py s3://${S3BUCKET}/application/
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2782/citibike-trips-by-age.py s3://${S3BUCKET}/application/

Now, it’s time to run sample Spark job. Run the following to generate an Argo workflow submission template:

export RANDOM_ID2=$(LC_ALL=C tr -dc a-z0-9 </dev/urandom | head -c 8)

cat << EOF > argo-citibike-steps-jobrun.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: emr-citibike-${RANDOM_ID2}
spec:
  entrypoint: emr-citibike
  templates:
  - name: emr-citibike
    steps:
    - - name: emr-citibike-csv-parquet
        template: emr-citibike-csv-parquet
    - - name: emr-citibike-ridership
        template: emr-citibike-ridership
      - name: emr-citibike-popular-stations
        template: emr-citibike-popular-stations
      - name: emr-citibike-trips-by-age
        template: emr-citibike-trips-by-age

  # This is parent job that converts csv data to parquet
  - name: emr-citibike-csv-parquet
    resource:
      action: create
      successCondition: status.state == COMPLETED
      failureCondition: status.state == FAILED      
      manifest: |
        apiVersion: emrcontainers.services.k8s.aws/v1alpha1
        kind: JobRun
        metadata:
          name: my-ack-jobrun-csv-parquet-${RANDOM_ID2}
        spec:
          name: my-ack-jobrun-csv-parquet-${RANDOM_ID2}
          virtualClusterRef:
            from:
              name: my-ack-vc
          executionRoleARN: "${ACK_JOB_EXECUTION_ROLE_ARN}"
          releaseLabel: "emr-6.7.0-latest"
          jobDriver:
            sparkSubmitJobDriver:
              entryPoint: "s3://${S3BUCKET}/application/citibike-convert-csv-to-parquet.py"
              entryPointArguments: [${S3BUCKET}]
              sparkSubmitParameters: "--conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false"
          configurationOverrides: |
            ApplicationConfiguration: null
            MonitoringConfiguration:
              CloudWatchMonitoringConfiguration:
                LogGroupName: /emr-on-eks-logs/${EKS_CLUSTER_NAME}
                LogStreamNamePrefix: citibike
              S3MonitoringConfiguration:
                LogUri: s3://${S3BUCKET}/logs

  # This is a child job which runs after csv-parquet jobs is complete
  - name: emr-citibike-ridership
    resource:
      action: create
      manifest: |
        apiVersion: emrcontainers.services.k8s.aws/v1alpha1
        kind: JobRun
        metadata:
          name: my-ack-jobrun-ridership-${RANDOM_ID2}
        spec:
          name: my-ack-jobrun-ridership-${RANDOM_ID2}
          virtualClusterRef:
            from:
              name: my-ack-vc
          executionRoleARN: "${ACK_JOB_EXECUTION_ROLE_ARN}"
          releaseLabel: "emr-6.7.0-latest"
          jobDriver:
            sparkSubmitJobDriver:
              entryPoint: "s3://${S3BUCKET}/application/citibike-ridership.py"
              entryPointArguments: [${S3BUCKET}]
              sparkSubmitParameters: "--conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false"
          configurationOverrides: |
            ApplicationConfiguration: null
            MonitoringConfiguration:
              CloudWatchMonitoringConfiguration:
                LogGroupName: /emr-on-eks-logs/${EKS_CLUSTER_NAME}
                LogStreamNamePrefix: citibike
              S3MonitoringConfiguration:
                LogUri: s3://${S3BUCKET}/logs   

  # This is a child job which runs after csv-parquet jobs is complete
  - name: emr-citibike-popular-stations
    resource:
      action: create
      manifest: |
        apiVersion: emrcontainers.services.k8s.aws/v1alpha1
        kind: JobRun
        metadata:
          name: my-ack-jobrun-popular-stations-${RANDOM_ID2}
        spec:
          name: my-ack-jobrun-popular-stations-${RANDOM_ID2}
          virtualClusterRef:
            from:
              name: my-ack-vc
          executionRoleARN: "${ACK_JOB_EXECUTION_ROLE_ARN}"
          releaseLabel: "emr-6.7.0-latest"
          jobDriver:
            sparkSubmitJobDriver:
              entryPoint: "s3://${S3BUCKET}/application/citibike-popular-stations.py"
              entryPointArguments: [${S3BUCKET}]
              sparkSubmitParameters: "--conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false"
          configurationOverrides: |
            ApplicationConfiguration: null
            MonitoringConfiguration:
              CloudWatchMonitoringConfiguration:
                LogGroupName: /emr-on-eks-logs/${EKS_CLUSTER_NAME}
                LogStreamNamePrefix: citibike
              S3MonitoringConfiguration:
                LogUri: s3://${S3BUCKET}/logs             

  # This is a child job which runs after csv-parquet jobs is complete
  - name: emr-citibike-trips-by-age
    resource:
      action: create
      manifest: |
        apiVersion: emrcontainers.services.k8s.aws/v1alpha1
        kind: JobRun
        metadata:
          name: my-ack-jobrun-trips-by-age-${RANDOM_ID2}
        spec:
          name: my-ack-jobrun-trips-by-age-${RANDOM_ID2}
          virtualClusterRef:
            from:
              name: my-ack-vc
          executionRoleARN: "${ACK_JOB_EXECUTION_ROLE_ARN}"
          releaseLabel: "emr-6.7.0-latest"
          jobDriver:
            sparkSubmitJobDriver:
              entryPoint: "s3://${S3BUCKET}/application/citibike-trips-by-age.py"
              entryPointArguments: [${S3BUCKET}]
              sparkSubmitParameters: "--conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=1 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false"
          configurationOverrides: |
            ApplicationConfiguration: null
            MonitoringConfiguration:
              CloudWatchMonitoringConfiguration:
                LogGroupName: /emr-on-eks-logs/${EKS_CLUSTER_NAME}
                LogStreamNamePrefix: citibike
              S3MonitoringConfiguration:
                LogUri: s3://${S3BUCKET}/logs                        
EOF

Let’s run this job:

argo -n emr-ns submit --watch argo-citibike-steps-jobrun.yaml

The following code is the expected result:

Name:                emr-citibike-tp8dlo6c
Namespace:           emr-ns
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Succeeded
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Nov 07 15:29:34 -0500 (20 seconds ago)
Started:             Mon Nov 07 15:29:34 -0500 (20 seconds ago)
Finished:            Mon Nov 07 15:29:54 -0500 (now)
Duration:            20 seconds
Progress:            4/4
ResourcesDuration:   4s*(1 cpu),4s*(100Mi memory)
STEP                                  TEMPLATE                       PODNAME                                                         DURATION  MESSAGE
 ✔ emr-citibike-if32fvjd              emr-citibike                                                                                               
 ├───✔ emr-citibike-csv-parquet       emr-citibike-csv-parquet       emr-citibike-if32fvjd-emr-citibike-csv-parquet-140307921        2m          
 └─┬─✔ emr-citibike-popular-stations  emr-citibike-popular-stations  emr-citibike-if32fvjd-emr-citibike-popular-stations-1670101609  4s          
   ├─✔ emr-citibike-ridership         emr-citibike-ridership         emr-citibike-if32fvjd-emr-citibike-ridership-2463339702         4s          
   └─✔ emr-citibike-trips-by-age      emr-citibike-trips-by-age      emr-citibike-if32fvjd-emr-citibike-trips-by-age-3778285872      4s       

You can open another terminal and run the following command to check on the job status as well:

kubectl -n emr-ns get jobruns -w

You can also check the UI and look at the Argo logs, as shown in the following screenshot.

Clean up

Follow the instructions from the getting started tutorial to clean up the ACK controller for EMR on EKS and its resources. To delete Argo resources, use the following code:

kubectl delete -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.3/install.yaml
kubectl delete -f argo-emrcontainers-vc-role.yaml
kubectl delete -f argo-emrcontainers-jr-role.yaml
kubectl delete rolebinding argo-emrcontainers-virtualcluster -n emr-ns
kubectl delete rolebinding argo-emrcontainers-jobrun -n emr-ns
kubectl delete ns argo

Conclusion

In this post, we went through how to manage your Spark jobs on EKS clusters using the ACK controller for EMR on EKS. You can define Spark jobs in a declarative fashion and manage these resources using Kubernetes custom resources. We also reviewed how to use Argo Workflows to orchestrate these jobs to get a consistent job submission experience. You can take advantage of the rich features from Argo Workflows such as using DAGs to define multi-step workflows and specify dependencies within job steps, using the UI to visualize and manage the jobs, and defining retries and timeouts at the workflow or task level.

You can get started today by installing the ACK controller for EMR on EKS and start managing your Amazon EMR resources using Kubernetes-native methods.


About the authors

Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter is passionate about evangelizing and solving complex business problems using combination of AWS services and open source solutions. At AWS, Peter helps with designing and architecting variety of customer workloads.

Amine Hilaly is a Software Development Engineer at Amazon Web Services working on the Kubernetes and Open source related projects for about two years. Amine is a Go, open-source, and Kubernetes fanatic.

Simplifying Amazon EC2 instance type flexibility with new attribute-based instance type selection features

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/simplifying-amazon-ec2-instance-type-flexibility-with-new-attribute-based-instance-type-selection-features/

This blog is written by Rajesh Kesaraju, Sr. Solution Architect, EC2-Flexible Compute and Peter Manastyrny, Sr. Product Manager, EC2.

Today AWS is adding two new attributes for the attribute-based instance type selection (ABS) feature to make it even easier to create and manage instance type flexible configurations on Amazon EC2. The new network bandwidth attribute allows customers to request instances based on the network requirements of their workload. The new allowed instance types attribute is useful for workloads that have some instance type flexibility but still need more granular control over which instance types to run on.

The two new attributes are supported in EC2 Auto Scaling Groups (ASG), EC2 Fleet, Spot Fleet, and Spot Placement Score.

Before exploring the new attributes in detail, let us review the core ABS capability.

ABS refresher

ABS lets you express your instance type requirements as a set of attributes, such as vCPU, memory, and storage when provisioning EC2 instances with ASG, EC2 Fleet, or Spot Fleet. Your requirements are translated by ABS to all matching EC2 instance types, simplifying the creation and maintenance of instance type flexible configurations. ABS identifies the instance types based on attributes that you set in ASG, EC2 Fleet, or Spot Fleet configurations. When Amazon EC2 releases new instance types, ABS will automatically consider them for provisioning if they match the selected attributes, removing the need to update configurations to include new instance types.

ABS helps you to shift from an infrastructure-first to an application-first paradigm. ABS is ideal for workloads that need generic compute resources and do not necessarily require the hardware differentiation that the Amazon EC2 instance type portfolio delivers. By defining a set of compute attributes instead of specific instance types, you allow ABS to always consider the broadest and newest set of instance types that qualify for your workload. When you use EC2 Spot Instances to optimize your costs and save up to 90% compared to On-Demand prices, instance type diversification is the key to access the highest amount of Spot capacity. ABS provides an easy way to configure and maintain instance type flexible configurations to run fault-tolerant workloads on Spot Instances.

We recommend ABS as the default compute provisioning method for instance type flexible workloads including containerized apps, microservices, web applications, big data, and CI/CD.

Now, let us dive deep on the two new attributes: network bandwidth and allowed instance types.

How network bandwidth attribute for ABS works

Network bandwidth attribute allows customers with network-sensitive workloads to specify their network bandwidth requirements for compute infrastructure. Some of the workloads that depend on network bandwidth include video streaming, networking appliances (e.g., firewalls), and data processing workloads that require faster inter-node communication and high-volume data handling.

The network bandwidth attribute uses the same min/max format as other ABS attributes (e.g., vCPU count or memory) that assume a numeric value or range (e.g., min: ‘10’ or min: ‘15’; max: ‘40’). Note that setting the minimum network bandwidth does not guarantee that your instance will achieve that network bandwidth. ABS will identify instance types that support the specified minimum bandwidth, but the actual bandwidth of your instance might go below the specified minimum at times.

Two important things to remember when using the network bandwidth attribute are:

  • ABS will only take burst bandwidth values into account when evaluating maximum values. When evaluating minimum values, only the baseline bandwidth will be considered.
    • For example, if you specify the minimum bandwidth as 10 Gbps, instances that have burst bandwidth of “up to 10 Gbps” will not be considered, as their baseline bandwidth is lower than the minimum requested value (e.g., m5.4xlarge is burstable up to 10 Gbps with a baseline bandwidth of 5 Gbps).
    • Alternatively, c5n.2xlarge, which is burstable up to 25 Gbps with a baseline bandwidth of 10 Gbps will be considered because its baseline bandwidth meets the minimum requested value.
  • Our recommendation is to only set a value for maximum network bandwidth if you have specific requirements to restrict instances with higher bandwidth. That would help to ensure that ABS considers the broadest possible set of instance types to choose from.

Using the network bandwidth attribute in ASG

In this example, let us look at a high-performance computing (HPC) workload or similar network bandwidth sensitive workload that requires a high volume of inter-node communications. We use ABS to select instances that have at minimum 10 Gpbs of network bandwidth and at least 32 vCPUs and 64 GiB of memory.

To get started, you can create or update an ASG or EC2 Fleet set up with ABS configuration and specify the network bandwidth attribute.

The following example shows an ABS configuration with network bandwidth attribute set to a minimum of 10 Gbps. In this example, we do not set a maximum limit for network bandwidth. This is done to remain flexible and avoid restricting available instance type choices that meet our minimum network bandwidth requirement.

Create the following configuration file and name it: my_asg_network_bandwidth_configuration.json

{
    "AutoScalingGroupName": "network-bandwidth-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 32},
                    "MemoryMiB": {"Min": 65536},
                    "NetworkBandwidthGbps": {"Min": 10} }
                 }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}

Next, let us create an ASG using the following command:

my_asg_network_bandwidth_configuration.json file

aws autoscaling create-auto-scaling-group --cli-input-json file://my_asg_network_bandwidth_configuration.json

As a result, you have created an ASG that may include instance types m5.8xlarge, m5.12xlarge, m5.16xlarge, m5n.8xlarge, and c5.9xlarge, among others. The actual selection at the time of the request is made by capacity optimized Spot allocation strategy. If EC2 releases an instance type in the future that would satisfy the attributes provided in the request, that instance will also be automatically considered for provisioning.

Considered Instances (not an exhaustive list)


Instance Type        Network Bandwidth
m5.8xlarge             “10 Gbps”

m5.12xlarge           “12 Gbps”

m5.16xlarge           “20 Gbps”

m5n.8xlarge          “25 Gbps”

c5.9xlarge               “10 Gbps”

c5.12xlarge             “12 Gbps”

c5.18xlarge             “25 Gbps”

c5n.9xlarge            “50 Gbps”

c5n.18xlarge          “100 Gbps”

Now let us focus our attention on another new attribute – allowed instance types.

How allowed instance types attribute works in ABS

As discussed earlier, ABS lets us provision compute infrastructure based on our application requirements instead of selecting specific EC2 instance types. Although this infrastructure agnostic approach is suitable for many workloads, some workloads, while having some instance type flexibility, still need to limit the selection to specific instance families, and/or generations due to reasons like licensing or compliance requirements, application performance benchmarking, and others. Furthermore, customers have asked us to provide the ability to restrict the auto-consideration of newly released instances types in their ABS configurations to meet their specific hardware qualification requirements before considering them for their workload. To provide this functionality, we added a new allowed instance types attribute to ABS.

The allowed instance types attribute allows ABS customers to narrow down the list of instance types that ABS considers for selection to a specific list of instances, families, or generations. It takes a comma separated list of specific instance types, instance families, and wildcard (*) patterns. Please note, that it does not use the full regular expression syntax.

For example, consider container-based web application that can only run on any 5th generation instances from compute optimized (c), general purpose (m), or memory optimized (r) families. It can be specified as “AllowedInstanceTypes”: [“c5*”, “m5*”,”r5*”].

Another example could be to limit the ABS selection to only memory-optimized instances for big data Spark workloads. It can be specified as “AllowedInstanceTypes”: [“r6*”, “r5*”, “r4*”].

Note that you cannot use both the existing exclude instance types and the new allowed instance types attributes together, because it would lead to a validation error.

Using allowed instance types attribute in ASG

Let us look at the InstanceRequirements section of an ASG configuration file for a sample web application. The AllowedInstanceTypes attribute is configured as [“c5.*”, “m5.*”,”c4.*”, “m4.*”] which means that ABS will limit the instance type consideration set to any instance from 4th and 5th generation of c or m families. Additional attributes are defined to a minimum of 4 vCPUs and 16 GiB RAM and allow both Intel and AMD processors.

Create the following configuration file and name it: my_asg_allow_instance_types_configuration.json

{
    "AutoScalingGroupName": "allow-instance-types-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                "InstanceRequirements": {
                    "VCpuCount": {"Min": 4},
                    "MemoryMiB": {"Min": 16384},
                    "CpuManufacturers": ["intel","amd"],
                    "AllowedInstanceTypes": ["c5.*", "m5.*","c4.*", "m4.*"] }
            }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}

As a result, you have created an ASG that may include instance types like m5.xlarge, m5.2xlarge, c5.xlarge, and c5.2xlarge, among others. The actual selection at the time of the request is made by capacity optimized Spot allocation strategy. Please note that if EC2 will in the future release a new instance type which will satisfy the other attributes provided in the request, but will not be a member of 4th or 5th generation of m or c families specified in the allowed instance types attribute, the instance type will not be considered for provisioning.

Selected Instances (not an exhaustive list)

m5.xlarge

m5.2xlarge

m5.4xlarge

c5.xlarge

c5.2xlarge

m4.xlarge

m4.2xlarge

m4.4xlarge

c4.xlarge

c4.2xlarge

As you can see, ABS considers a broad set of instance types for provisioning, however they all meet the compute attributes that are required for your workload.

Cleanup

To delete both ASGs and terminate all the instances, execute the following commands:

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name network-bandwidth-based-instances-asg --force-delete

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name allow-instance-types-based-instances-asg --force-delete

Conclusion

In this post, we explored the two new ABS attributes – network bandwidth and allowed instance types. Customers can use these attributes to select instances based on network bandwidth and to limit the set of instances that ABS selects from. The two new attributes, as well as the existing set of ABS attributes enable you to save time on creating and maintaining instance type flexible configurations and make it even easier to express the compute requirements of your workload.

ABS represents the paradigm shift in the way that our customers interact with compute, making it easier than ever to request diversified compute resources at scale. We recommend ABS as a tool to help you identify and access the largest amount of EC2 compute capacity for your instance type flexible workloads.

AWS Batch for Amazon Elastic Kubernetes Service

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-batch-for-amazon-elastic-kubernetes-service/

Today I’m pleased to announce AWS Batch for Amazon Elastic Kubernetes Service (Amazon EKS). AWS Batch for Amazon EKS is ideal for customers who no longer want to shoulder the burden of configuring, fine-tuning, and managing Kubernetes clusters and pods to use with their batch processing workflows. Furthermore, there is no charge for this service. You only pay for the resources that your batch jobs launch.

When I’ve previously considered Kubernetes, it appeared to be focused on the management and hosting of microservice workloads. I was therefore surprised to discover that Kubernetes is also used by some customers to run large-scale, compute-intensive batch workloads. The differences between batch and microservice workloads mean that using Kubernetes for batch processing can be difficult and requires you to invest significant time in custom configuration and management to fine-tune a suitable solution.

Microservice and batch workloads on Kubernetes
Before we look further at AWS Batch for Amazon EKS, let’s consider some of the important differences between batch and microservice workloads to help set some context on why running batch workloads on Kubernetes can be difficult:

  • Microservice workloads are assumed to start and not stop—we expect them to be continuously available. In contrast, batch workloads run to completion and then exit—regardless of success or failure.
  • The results from a batch workload might not be available for several minutes—and sometimes hours or even days. Microservice workloads are expected to respond to requests within milliseconds.
  • We usually deploy microservice workloads across several Availability Zones to ensure high availability. This isn’t a requirement for batch workloads. Although we might distribute a batch job to allow it to process different input data in a distributed analysis, we more typically want to prioritize fast and optimal access to resources the job needs within the Availability Zone in which it is running.
  • Microservice and batch workloads scale differently. For microservices, scaling is generally predictable and usually linear as load increases (or decreases). With batch workloads, you might first perform an initial, or infrequently repeated, proof-of-concept run to analyze performance and discover the correct tuning needed for a full production run. The difference in size between the two can be exponential. Furthermore, with batch workloads, we might scale to an extreme level for a run, then scale back to zero instances for long periods of time, sometimes months.

Although third-party frameworks can help with running batch workloads on Kubernetes, you can also roll your own. Whichever approach you take, significant gaps and challenges can remain in handling the undifferentiated heavy lifting of building, configuring, and maintaining custom batch solutions. Then you also need to consider the scheduling, placing, and scaling of batch workloads on Kubernetes in a cost-effective manner. So how does AWS Batch on Amazon EKS help?

AWS Batch for Amazon EKS
AWS Batch for Amazon EKS offers a fully managed service to run batch workloads using clusters hosted on Amazon Elastic Compute Cloud (Amazon EC2) with no need to install and manage complex, custom batch solutions to address the differences highlighted earlier. AWS Batch provides a scheduler that controls and runs high-volume batch jobs, together with an orchestration component that evaluates when, where, and how to place jobs submitted to a queue. There’s no need for you, as the user, to coordinate any of this work—you just submit a job request into the queue.

Job queueing, dependency tracking, retries, prioritization, compute resource provisioning for Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Compute Cloud (EC2) Spot, and pod submission are all handled using a serverless queue. As a managed service, AWS Batch for Amazon EKS enables you to reduce your operational and management overhead and focus instead on your business requirements. It provides integration with other services such as AWS Identity and Access Management (IAM), Amazon EventBridge, and AWS Step Functions and allows you to take advantage of other partners and tools in the Kubernetes ecosystem.

When running batch jobs on Amazon EKS clusters, AWS Batch is the main entry point to submit workload requests. Based on the queued jobs, AWS Batch then launches worker nodes in your cluster to process the jobs. These nodes are kept separate in a distinct namespace from your other node groups in Amazon EKS. Similarly, nodes in other pods are isolated from those used with AWS Batch.

How it works
AWS Batch uses managed Amazon EKS clusters, which need to be registered with AWS Batch, and permissions set so that AWS Batch can launch and manage compute environments in those clusters to process jobs submitted to the queue. You can find instructions on how to launch a managed cluster that AWS Batch can use in this topic in the Amazon EKS User Guide. Instructions for configuring permissions can be found in the AWS Batch User Guide.

Once one or more clusters have been registered, and permissions set, users can submit jobs to the queue. When a job is submitted, the following actions take place to process the request:

  • On receiving a job request, the queue dispatches a request to the configured compute environment for resources. If an AWS Batch managed scaling group does not yet exist, one is created, and AWS Batch then starts launching Amazon Elastic Compute Cloud (EC2) instances in the group. These new instances are added to the AWS Batch Kubernetes namespace of the cluster.
  • The Kubernetes scheduler places any configured DaemonSet on the node.
  • Once the node is ready, AWS Batch starts sending pod placement requests to your cluster, using labels and taints to make the placement choices for the pods, bypassing much of the logic of the k8s scheduler.
  • This process is repeated, scaling as needed across more EC2 instances in the scaling group until the maximum configured capacity is reached.
  • If the job queue has another compute environment defined, such as one configured to use Spot instances, it will launch additional nodes in that compute environment.
  • Once all work is complete, AWS Batch removes the nodes from the cluster, and terminates the instances.

These steps are illustrated in the animation below.

Animation showing the steps AWS Batch takes when processing a request using an Amazon EKS cluster

Start using your clusters with AWS Batch today
AWS Batch for Amazon Elastic Kubernetes Service (Amazon EKS) is available today. As I noted earlier, there is no charge for this service, and you pay only for the resources your jobs consume. To learn more, visit the Getting Started with Amazon EKS topic in the AWS Batch User Guide. There is also a self-guided workshop to help introduce you to AWS Batch on Amazon EKS.

— Steve

How Shiji Group created a global guest profile store on AWS

Post Syndicated from Maximilian Schellhorn original https://aws.amazon.com/blogs/architecture/how-shiji-group-created-a-global-guest-profile-store-on-aws/

Shiji Group provides global software solutions for the hospitality industry. The Shiji Enterprise Platform enables customers to manage large hotel property portfolios using software as a service (SaaS). Among functionalities such as reservations, housekeeping, finance, and integrations with external systems, the guest profile is a key aspect of the system. Besides personal information (such as name and address) and billing details, the guest profile can include room preferences and entertainment options.

A property portfolio can span multiple hotels across the globe, and each hotel location can offer better customer service by consolidating data. Once the guest gives their cross-border data processing consent (CBDPC), profile information can be shared between properties. This provides a centralized and seamless experience for the hotel guest no matter which hotel in the portfolio was chosen.

In the following blog post, you will explore the architecture of the guest profile store that replicates the profile across multiple geographic areas. We will review the single Region design first and its infrastructure components and architectural patterns. We will then show the evolution to a multi-Region architecture.

Single Region architecture with CQRS

The ability to find relevant guest profile data fast is essential in the day-to-day hospitality business. Therefore, the following architecture uses the command query responsibility segregation (CQRS) pattern to provide high scalability and rich full-text search capabilities without sacrificing performance. With CQRS, write requests (commands) are targeting a different service than read requests (queries). This allows systems to store an item (such as a profile) in a search-optimized format for serving reads, while providing a simple schema for writes.

The microservices for the guest profile architecture are operated as containers on Amazon Elastic Kubernetes Service (Amazon EKS). The write model of the guest profile is stored in an Amazon Relational Database Service (Amazon RDS) PostgreSQL database. A separate read model uses Amazon OpenSearch Service. For interservice communication, Shiji runs a self-managed Apache Kafka cluster on Amazon Elastic Compute Cloud (Amazon EC2).

The following diagram provides a walk through the single Region architecture:

Single Region architecture with CQRS

Figure 1. Single Region architecture with CQRS

  1. The front desk employee creates the Guest Profile upon first interaction with the hotel guest (name, address, billing, and room preferences).
  2. The request is routed to the Kong API Management Solution that is running in an Amazon EKS Kubernetes cluster. It acts as the single entry-point to the system. It identifies the type of request by parsing the URL and forwarding write requests to the profile-write-model-service.
  3. The service validates the request. It stores the data and ProfileCreated event in the PostgreSQL database, Amazon RDS.
  4. A change data capture (CDC) mechanism publishes the ProfileCreated event to an Apache Kafka Local Profiles topic.
  5. The profile-read-model-service subscribes to the Local Profiles topic and stores the profile in an optimized read format in Amazon OpenSearch. Whenever the hotel performs a guest profile search, results will now be provided via the profile-read-model-service.

Multi-Region networking setup

Shiji operates in multiple AWS Regions to provide low latency, regulatory requirements, and resilience across the globe. The previously presented single Region architecture can be replicated to multiple AWS Regions (eu-central-1 and ap-southeast-1, for example). Hotels with a given property portfolio that operate in the same Region can reuse the profile store of the Shiji Enterprise Platform. However, hotels that are being operated in a different AWS Region can be interconnected as well.

This is achieved by providing an AWS Transit Gateway in a separate networking account that connects the different Regions with a VPC attachment:

Multi-Region networking setup

Figure 2. Multi-Region networking setup

The account segregation provides an additional layer of flexibility to add further Regions in the future.

Multi-Region event replication

Upon first arrival, guests can choose to sign a cross-border-data processing consent (CBDPC). This permits the hotel to share the profile information globally. If accepted, the profile-write-model-service creates an additional ProfileCreated event that gets published to a GlobalProfilesEU Apache Kafka topic. This topic is accessible for subscribers in the target Region, which replicates relevant profiles into the local database as follows.

A replicator-service in the target Region (ap-souteast-1) is now able to subscribe to the GlobalProfileEU topic in (eu-central-1), via the established network connection from the previous section. It republishes the event to a local ReplicatedProfiles topic that the profile-write-model-service subscribes to and saves to the local database:

Event replication

Figure 3. Event replication

Putting it all together: The multi-Region guest profile store

The following diagram combines all the components from the previous sections. It provides an end-to-end look at the multi-Region guest profile architecture. Due to the event driven nature of the system, the architecture can be extended without changing the initial flow outlined in the single Region design.

Multi-Region guest profile architecture

Figure 4. Multi-Region guest profile architecture

  1. If the hotel guest signed a cross-border data processing consent (CBDPC), the ProfileCreated event will also be published to a Global Profiles topic.
  2. The replicator-service in the target Region (for example, ap-southeast-1) subscribes to the Global Profiles topic of the source Region (for example, eu-central-1). It then publishes the event to its local Replicated Profiles topic.
  3. The profile-write-model-service in the target Region subscribes to the Replicated Profiles topic and records the item in the Amazon RDS PostgreSQL database with information about the source Region. This will initiate the local replication similar to the single Region design, and therefore creates a consistent experience between both Regions.

Conclusion and outlook

In this blog post, we showed how Shiji built a modern multi-Region microservice architecture on AWS. You have learned about patterns such as CQRS, which provide a scalable solution for both read and write traffic. We’ve also shown what is needed to interconnect two physically separated Regions. With cross-border data processing consent (CBDPC), you have seen how the ownership of guest data can be secured and utilized. The single Region architecture already provided a solid baseline for this solution architecture. The event-driven nature of the system permitted us to add additional functionality for the final multi-Region architecture.

The ability to manage a global guest profile within the main system as well as at the property itself is a huge advantage for enterprise hotel companies. It permits hotels to deliver a unified experience to their guests no matter where the guest is within the hotel or on their journey. Food preferences, spa, room, and more, can all be managed from a single guest profile. This centralized information hasn’t been possible within the hotel’s property management system (PMS) until recently.

Visit Shiji Enterprise Platform for more information.

AWS Week In Review — September 26, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-september-26-2022/

It looks like my travel schedule is coupled with this Week In Review series of blog posts. This week, I am traveling to Fort-de-France in the French Caribbean islands to meet our customers and partners. I enjoy the travel time when I am offline. It gives me the opportunity to reflect on the past or plan for the future.

Last Week’s Launches
Here are some of the launches that caught my eye last week:

Amazon SageMaker Autopilothas added a new Ensemble training mode powered by AutoGluon that is 8X faster than the current Hyper parameter Optimization Mode and supports a wide range of algorithms, including LightGBM, CatBoost, XGBoost, Random Forest, Extra Trees, linear models, and neural networks based on PyTorch and FastAI.

AWS Outposts and Amazon EKSYou can now deploy both the worker nodes and the Kubernetes control plane on an Outposts rack. This allows you to maximize your application availability in case of temporary network disconnection on premises. The Kubernetes control plane continues to manage the worker nodes, and no pod eviction happens when on-premises network connectivity is reestablished.

Amazon Corretto 19 – Corretto is a no-cost, multiplatform, production-ready distribution of OpenJDK. Corretto is distributed by Amazon under an open source license. This version supports the latest OpenJDK feature release and is available on Linux, Windows, and macOS. You can download Corretto 19 from our downloads page.

Amazon CloudWatch Evidently – Evidently is a fully-managed service that makes it easier to introduce experiments and launches in your application code. Evidently adds support for Client Side Evaluations (CSE) for AWS Lambda, powered by AWS AppConfig. Evidently CSE allows application developers to generate feature evaluations in single-digit milliseconds from within their own Lambda functions. Check the client-side evaluation documentation to learn more.

Amazon S3 on AWS OutpostsS3 on Outposts now supports object versioning. Versioning helps you to locally preserve, retrieve, and restore each version of every object stored in your buckets. Versioning objects makes it easier to recover from both unintended user actions and application failures.

Amazon PollyAmazon Polly is a service that turns text into lifelike speech. This week, we announced the general availability of Hiujin, Amazon Polly’s first Cantonese-speaking neural text-to-speech (NTTS) voice. With this launch, the Amazon Polly portfolio now includes 96 voices across 34 languages and language variants.

X in Y – We launched existing AWS services in additional Regions:

Other AWS News
Introducing the Smart City Competency program – The AWS Smart City Competency provides best-in-class partner recommendations to our customers and the broader market. With the AWS Smart City Competency, you can quickly and confidently identify AWS Partners to help you address Smart City focused challenges.

An update to IAM role trust policy behavior – This is potentially a breaking change. AWS Identity and Access Management (IAM) is changing an aspect of how role trust policy evaluation behaves when a role assumes itself. Previously, roles implicitly trusted themselves. AWS is changing role assumption behavior to always require self-referential role trust policy grants. This change improves consistency and visibility with regard to role behavior and privileges. This blog post shares the details and explains how to evaluate if your roles are impacted by this change and what to modify. According to our data, only 0.0001 percent of roles are impacted. We notified by email the account owners.

Amazon Music Unifies Music QueuingThe Amazon Music team published a blog post to explain how they created a unified music queue across devices. They used AWS AppSync and AWS Amplify to build a robust solution that scales to millions of music lovers.

Upcoming AWS Events
Check your calendar and sign up for an AWS event in your Region and language:

AWS re:Invent – Learn the latest from AWS and get energized by the community present in Las Vegas, Nevada. Registrations are open for re:Invent 2022 which will be held from Monday, November 28 to Friday, December 2.

AWS Summits – Come together to connect, collaborate, and learn about AWS. Registration is open for the following in-person AWS Summits: Bogotá (October 4), and Singapore (October 6).

Natural Language Processing (NLP) Summit – The AWS NLP Summit 2022 will host over 25 sessions focusing on the latest trends, hottest research, and innovative applications leveraging NLP capabilities on AWS. It is happening at our UK headquarters in London, October 5–6, and you can register now.

AWS Innovate for every app – This regional online conference is designed to inspire and educate executives and IT professionals about AWS. It offers dozens of technical sessions in eight languages (English, Spanish, French, German, Italian, Japanese, Korean, and Indonesian). Register today: Americas, September 28; Europe, Middle-East, and Africa, October 6; Asia Pacific & Japan, October 20.

AWS Innovate for every app

AWS Community DaysAWS Community Day events are community-led conferences to share and learn with one another. In September, the AWS community in the US will run events in Arlington, Virginia (September 30). In Europe, Community Day events will be held in October. Join us in Amersfoort, Netherlands (October 3), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

AWS Tour du Cloud – The AWS Team in France has prepared a roadshow to meet customers and partners with a one-day free conference in seven cities across the country (Aix en Provence, Lille, Toulouse, Bordeaux, Strasbourg, Nantes, and Lyon), and in Fort-de-France, Martinique. Tour du Cloud France

AWS Fest – This third-party event will feature AWS influencers, community heroes, industry leaders, and AWS customers, all sharing AWS optimization secrets (this week on Wednesday, September). You can register for AWS Fest here.

Stay Informed
That is my selection for this week! To better keep up with all of this news, please check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Facteus improved Quantamatics performance by adopting Amazon Aurora Serverless and Amazon EKS

Post Syndicated from Aishwarya Subramaniam original https://aws.amazon.com/blogs/architecture/how-facteus-improved-quantamatics-performance-by-adopting-amazon-aurora-serverless-and-amazon-eks/

Facteus Inc. is a leading provider of actionable insights from sensitive transaction data. Facteus safely transforms raw financial transaction data from legacy technologies into actionable information, without compromising data privacy, through its innovative synthetic data process. Quantamatics is one of Facteus’ core product offering.

Quantamatics accelerates the time it takes a user to go from raw alternative data to insights, by providing a cloud-based, turnkey research platform that handles data from ingestion to analysis. This platform saves the analysts, data researchers, and data scientists time by doing all the preparation and normalization efforts prior to working with the data for insight discovery. The provided cloud environment also allows for easy and flexible analysis of both provided and external data sources. Quantamatics is a SaaS offering with a subscription model that provides access to both the research platform and the associated Facteus datasets.

In June 2021, Facteus re-architected their monolithic Quantamatics application to use microservices. This blog will contrast the before and after states from a performance and management perspective as they migrated from Snowflake to Amazon Aurora Serverless v2 (Postgres) and from Amazon Elastic Compute Cloud (Amazon EC2) to Amazon Elastic Kubernetes Service (Amazon EKS).

A great place to start when evaluating existing workloads for fault tolerance and reliability is the AWS Well-Architected Framework. The Well-Architected Framework is designed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Based on six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability—the Framework provides a consistent approach for customers to evaluate architectures, and implement designs that will scale over time.

The AWS Well-Architected Tool,  available at no charge in the AWS Management Console, lets you create self-assessments to identify and correct gaps in your current architecture. Adhering to Well-Architected principles, Facteus adopted managed services, such as Amazon EKS and Amazon Aurora Serverless, as they reduce efforts on provisioning, configuring, scaling, backing up, and so on. Additionally, using managed services helps to save on the overall costs of maintaining the services.

Facteus’ architecture overview

Before

Users can access Quantamatics for their research either through a Jupyter notebook or a Microsoft Excel plugin. Facteus used EC2 instances to directly host the underlying JupyterHub deployments and AWS Elastic Beanstalk to deploy APIs.

The legacy architecture, while cloud-based, had multiple issues that made it ineffective from a maintenance, scalability, and cost perspective (as demonstrated in Figure 1):

  • JupyterHub does not currently support high availability (HA) natively. This meant an EC2 failover would require relatively long unavailability while a replacement EC2 node spun up or potentially double the cost to keep an idle node on standby.
    • Also, with the EC2 instances being specialized, portions of each EC2 instance will remain unused, resulting in unnecessary costs compared to more modern solutions such as Amazon EKS, which can pool and divide up instances in a more granular fashion.
    • Finally, as the EC2 instances were standalone, solutions would need to be set up to both monitor application health and perform the appropriate actions in case of an outage.
  • Although Elastic Beanstalk was a great way to deploy API instances in an HA and scalable way, to completely modernize and remain consistent throughout application to a microservice-based architecture, Facteus migrated their Elastic Beanstalk instances as well, to better utilize the pooled resources.
Cloud-based legacy architecture

Figure 1. Cloud-based legacy architecture

Quantamatics requires a Data Warehouse solution to constantly run behind an API to allow for acceptable request and response times. While Snowflake is a great data warehousing and big data querying solution, Facteus found it expensive for their deployment. The queries that the Quantamatics APIs run are typically not computationally expensive but do end up returning relatively large amounts of data. This makes transferring the results back to the API over the internet a potential bottleneck.

To address these bottlenecks, Facteus re-architected their application into an Amazon EKS based one, backed with Aurora Serverless v2 (Postgres).

The new architecture resolves the previous problems in two ways (Figure 2):

  • By using Aurora Serverless v2 (Postgres) to store and query the datasets used by the API within the same VPC instead of Snowflake, it kept the query run time relatively the same but drastically decreased both the transfer time and the associated costs due to the locality of the database as well as the cost and scalability of Aurora Serverless v2.
  • By switching to Amazon EKS, the underlying EC2 nodes could easily be pooled and more thoroughly utilized across the various deployments, thus reducing costs. Additionally, as the deployments were now containerized, an outage would result in the quick relocation of those containerized apps (pods) to nodes with capacity, thus reducing downtime and cost.
    • As a side benefit with the move to managed nodes on Amazon EKS, this completely removed the node patching overhead, as Amazon EKS safely handles the patching of the underlying nodes with a single command.
    • Amazon EKS monitors and restarts pods automatically, which eliminated the need to set up and manage a solution that monitors pod health and takes the appropriate actions upon failures.
Contemporary architecture with Amazon EKS and Aurora Serverless v2 (Postgres)

Figure 2. Contemporary architecture with Amazon EKS and Aurora Serverless v2 (Postgres)

Auto scaling with Amazon EKS and Aurora Serverless

  • Amazon EKS helped to greatly reduce the overhead of setting up and managing the auto scaling of Quantamatics in two ways:
    • User compute environments could be spun up as isolated pods, with Amazon EKS spinning nodes up and down automatically based on demand.
    • API instances could also be automatically spun up and down based on network throughput metrics queried by Amazon EKS to handle the requests made by users in a timely fashion.
  • Aurora Serverless v2
    • With Aurora Serverless v2, the needed compute capacity of the database automatically scales based on load generated by the corresponding API requests. This both reduced the cost as the load varies heavily throughout the day, reducing the management overhead of handling spinning up and down of read replicas if other solutions were used instead.

Snowflake vs. Aurora Serverless V2 (Postgres) – Quantamatics query performance and cost comparison

The following steps were performed to migrate data from Snowflake to Aurora Serverless v2:

  • Use the Snowflake COPY INTO <location> command to copy the data from the Snowflake database table into one or more files in an S3 bucket.
  • Create tables in Aurora Serverless. Use the create_s3_uri function to load variables.
  • Use the aws_s3.table_import_from_s3 function to import the data file from an Amazon S3 file name prefix.
  • Verify that the information was loaded.

This blog post describes importing data from Amazon S3 to Amazon Aurora PostgreSQL.

Testing strategy: Run the corresponding CLI database utility for each database (snowsql vs psql) from within the VPC. Run the same query on each dataset. Return and write the results as CSV to a local file.
Data set size: ~178,000,000 rows
Result set size: ~418,000 rows

Data source Configuration Results
Snowflake Snowflake: Medium Warehouse (running), AWS based in same Region as APIs

  • Cost: ~$0.01 per query based on credit usage
  • 21.99 seconds run time
  • 3.36 seconds run time, 18.63 seconds transfer time
Aurora Serverless V2(Postgres) Idling on four Aurora Compute Units (ACU)

  • Cost: ~$0.24 an hour
  • Tables and indexes tuned for Quantamatics use cases
  • 7.00 seconds run time
  • 3.58 seconds run time, 3.42 seconds transfer time

Conclusion

The customer was able to achieve similar run times for the given dataset and query, but faster transfer speeds from Aurora Serverless due to the locality of the database. They also realized up to ~40x runtime cost savings by using Aurora Serverless—1,000 queries in Aurora Serverless vs. ~24 queries in Snowflake for the same cost.

Note: These results are specific to Quantamatics use cases where queries are fixed and well-known, and relatively limited in terms of complexity. This allowed the tables and database in Aurora Serverless v2 to be tuned for those specific purposes.

AWS recommends customers review their workloads using the AWS Well-Architected Tool to help ensure that their workloads are performant, secure, and cost-optimized. Well-Architected Framework Reviews are excellent opportunities to work together with your AWS account team and key stakeholders to discuss how modern infrastructure can help you win in the market.

Deploy your Amazon EKS Clusters Locally on AWS Outposts

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/deploy-your-amazon-eks-clusters-locally-on-aws-outposts/

I am pleased to announce the availability of local clusters for Amazon Elastic Kubernetes Service (Amazon EKS) on AWS Outposts. It means that starting today, you can deploy your Amazon EKS cluster entirely on Outposts: both the Kubernetes control plane and the nodes.

Amazon EKS is a managed Kubernetes service that makes it easy for you to run Kubernetes on AWS and on premises. AWS Outposts is a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location for a truly consistent hybrid experience.

To fully understand the benefits of local clusters for Amazon EKS on Outposts, I need to first share a bit of background.

Some customers use Outposts to deploy Kubernetes cluster nodes and pods close to the rest of their on-premises infrastructure. This allows their applications to benefit from low latency access to on-premises services and data while managing the cluster and the lifecycle of the nodes using the same AWS API, CLI, or AWS console as they do for their cloud-based clusters.

Until today, when you deployed Kubernetes applications on Outposts, you typically started by creating an Amazon EKS cluster in the AWS cloud. Then you deployed the cluster nodes on your Outposts machines. In this hybrid cluster scenario, the Kubernetes control plane runs in the parent Region of your Outposts, and the nodes are running on your on-premises Outposts. The Amazon EKS service communicates through the network with the nodes running on the Outposts machine.

But, remember: everything fails all the time. Customers told us the main challenge they have in this scenario is to manage site disconnections. This is something we cannot control, especially when you deploy Outposts on rough edges: areas with poor or intermittent network connections. When the on-premises facility is temporarily disconnected from the internet, the Amazon EKS control plane running in the cloud is unable to communicate with the nodes and the pods. Although the nodes and pods work perfectly and continue to serve the application on the on-premises local network, Kubernetes may consider them unhealthy and schedule them for replacement when the connection is reestablished (see pod eviction in Kubernetes documentation). This may lead to application downtimes when connectivity is restored.

I talked with Chris, our Kubernetes Product Manager and expert, while preparing this blog post. He told me there are at least seven distinct options to configure how a control plane reconnects to its nodes. Unless you master all these options, the system status at re-connection is unpredictable.

To simplify this, we are giving you the ability to host your entire Amazon EKS cluster on Outposts. In this configuration, both the Kubernetes control plane and your worker nodes run locally on premises on your Outposts machine. That way, your cluster continues to operate even in the event of a temporary drop in your service link connection. You can perform cluster operations such as creating, updating, and scaling applications during network disconnects to the cloud.

EKS Local Cluster DiagramLocal clusters are identical to Amazon EKS in the cloud and automatically deploy the latest security patches to make it easy for you to maintain an up-to-date, secure cluster. You can use the same tooling you use with Amazon EKS in the cloud and the AWS Management Console for a single interface for your clusters running on Outposts and in AWS Cloud.

Let’s See It In Action
Let’s see how we can use this new capability. For this demo, I will deploy the Kubernetes control plane on Amazon Elastic Compute Cloud (Amazon EC2) instances running on premises on an Outposts rack.

I use an Outposts rack already configured. If you want to learn how to get started with Outposts, you can read the steps on the Get Started with AWS Outposts page.

AWS Outposts Configuration

This demo has two parts. First, I create the cluster. Second, I connect to the cluster and create nodes.

Creating Cluster
Before deploying the Amazon EKS local cluster on Outposts, I make sure I created an IAM cluster role and attached the AmazonEKSLocalOutpostClusterPolicy managed policy. This IAM cluster role will be used in cluster creation.

Then, I switch to the Amazon EKS dashboard, and I select Add Cluster, then Create.

Creating Cluster

On the following page, I chose the location of the Kubernetes control plane: the AWS Cloud or AWS Outposts. I select AWS Outposts and specify the Outposts ID.

Configure EKS Cluster to Use AWS Outposts

The Kubernetes control plane on Outposts is deployed on three EC2 instances for high availability. That’s why I see three Replicas. Then, I choose the instance type according to the number of worker nodes needed for workloads. For example, to handle 0–20 worker nodes, it is recommended to use m5d.large EC2 instances.

Setting Instance Type

On the same page, I specify configuration values for the Kubernetes cluster, such as its Name, Kubernetes version, and the Cluster service role that I created earlier.

Cluster Configuration

On the next page, I configure the networking options. Since Outposts is an extension of an AWS Region, I need to use the VPC and Subnets used by Outposts to enable communication between Kubernetes control plane and worker nodes. For Security Groups, Amazon EKS creates a security group for local clusters that enables communication between my cluster and my VPC. I can also define additional security groups according to my application requirements.

Specify Networking

As we run the Kubernetes control plane inside Outposts, the Cluster endpoint access can only be accessed privately. This means I can only access the Kubernetes cluster through machines that are deployed in the same VPC or over the local network via the Outposts local gateway with Direct VPC Routing.

Private Cluster Endoint Access
On the next page, I define logging. Logging is disabled by default, and I may enable it as needed. For more details about logging, you can read the Amazon EKS control plane logging documentation.

Configure Logging

The last screen allows me to review all configuration options. When I’m satisfied with the configuration, I select Create to create the cluster.

Networking

The cluster creation takes a few minutes. To check the cluster creation status, I can use the console or the terminal with the following command:

$ aws eks describe-cluster \ 
--region <REGION_CODE> \ 
--name <CLUSTER_NAME> \ 
--query "cluster.status"

The Status section tells me when the cluster is created and active.

Active Cluster

In addition to using the AWS Management Console, I can also create a local cluster using the AWS CLI. Here is the command snippet to create a local cluster with the AWS CLI:

$ aws eks create-cluster \ 
--region <REGION_CODE> \ 
--name <CLUSTER_NAME> \ 
--resources-vpc-config subnetIds=<SUBNET_ID>\ 
--role-arn <ARN_CLUSTER_ROLE> \ 
--outpost-config controlPlaneInstanceType=<INSTANCE_TYPE> \ 
--outpostArns=<ARN_OUTPOST>

Connecting to the Cluster
The endpoint access for a local cluster is private; therefore, I can access it from a local gateway with Direct VPC Routing or from machines that are in the same VPC. To find out how to use local gateways with Outposts, you can follow the information on the Working with local gateways page. For this demo, I use an EC2 instance as a bastion host, and I manage the Kubernetes cluster using kubectl command.

The first thing I do is edit Security Groups to open traffic access from the bastion host. I go to the detail page of the Kubernetes cluster and select the Networking tab. Then I select the link in Cluster security group.

Networking & Security Group

Then, I add inbound rules, and I provide access for the bastion host by specifying its IP address.

Adding Inbound Rule in Security Group

Once I’ve allowed the access, I create kubeconfig in the bastion host by running the command:

$ aws eks update-kubeconfig --region <REGION_CODE> --name <CLUSTER_NAME>

Finally, I use kubectl to interact with the Kubernetes API server, just like usual.

$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 10h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 10h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 9h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket

Kubernetes local clusters running on AWS Outposts run on three EC2 instances. We see on the output above that the status of three worker nodes is NotReady. This is because they are used by the control plane exclusively, and we cannot use them to schedule pods.

From this stage, you can deploy self-managed node groups using the Amazon EKS local cluster.

Pricing and Availability
Amazon EKS local clusters are charged at the same price as traditional EKS clusters. It starts at $0.10/hour. The EC2 instances required to deploy the Kubernetes control plane and nodes on Outposts are included in the price of the Outposts. As usual, the pricing page has the details.

Amazon EKS local clusters are available in all AWS Regions where Outposts is available.

Go build and create your first EKS local cluster today!

— seb and Donnie.

Choosing an AWS container service to run your modern application

Post Syndicated from Lewis Tang original https://aws.amazon.com/blogs/architecture/choosing-an-aws-container-service-to-run-your-modern-application/

Businesses want to innovate quickly and deliver value even faster. To achieve these goals, the platform needs to enable teams to focus on delivering applications that are reliable, secure, highly available, cost-efficient, and scalable to required sizes.

Consider including containers on AWS in your platform, whether you are trying containers for the first time, spinning out parts of an on-premises solution to microservices in the cloud, or are new to the cloud. Containers can help you achieving a range of business benefits, including increased scalability, agility, flexibility, and cost efficiency.

In this post, we discuss three sets of builder expectations and how AWS container services can help to meet with your application delivery requirements and choose the appropriate container platform service on AWS.

Decrease container platform operations management overhead

If managing a platform is not your business’s strategic focus (for example, if most of your engineers are code developers), it can be preferable to only manage application development.

Amazon Lightsail containers offer a simple way for developers to deploy their containers to the cloud. With a Docker image you provide for your containers, AWS automatically deploys containerized workloads for you.

Lightsail assigns an HTTPS endpoint that is ready to serve your web application running in the cloud container. It automatically sets up a load-balanced Transport Layer Security (TLS) endpoint and takes care of the TLS certificate. This service replaces unresponsive containers for you automatically; by assigning a Domain Name System to your endpoint, Lightsail maintains the old version until the new version is healthy and ready to go live (Figure 1).

Amazon Lightsail containers

Figure 1. Amazon Lightsail containers

Another simple way to build and run your containerized web application in AWS is using AWS App Runner, which provides a fully managed container-native service.

Without orchestrators to configure, build pipelines to set up, or load balancers to optimize, you can bring existing containers or use the integrated container build service to go directly from the code repository to deployed application.

The build service can connect to a GitHub repository, providing a Git push workflow that deploys changes automatically. App Runner orchestration workflow take cares of the build, deployment, and configuration tasks, such as host, runtime patching, monitoring load balancing, and auto scaling (Figure 2). Explore AWS App Runner documentation and workshop for more details about the service.

AWS App Runner

Figure 2. AWS App Runner

When designing an application, you often start with a whiteboard or mental model that has representations of each service and lines for how they interact with each other. When considering an application’s platform architecture, the cloud components are not limited to virtual private cloud (VPC) subnets, load balancers, deployment pipelines, and durable storage for your application’s stateful data. Bringing all underlying cloud components together and making sure the design is well architected can be challenging.

AWS Copilot can provide guided best practices when deploying a microservice architecture that includes multiple services deployed as containers. You can use Copliot to handle cloud component details for you. By providing a container image, Copilot works with App Runner or Amazon Elastic Container Service (Amazon ECS) to provision cloud components, like VPC and having Copilot handle high-availability deployment, load balancer creation, and configuration.

To automate application deployment and new version release, Copilot can create a deployment pipeline so that the latest version of your application is automatically deployed every time you push a new commit to your code repository (as demonstrated in Figure 3).

AWS Copilot pipeline

Figure 3. AWS Copilot pipeline

Full-control application deployment with container orchestration

As business grows, your application portfolio grows. Some applications may require the selection of Microsoft Windows containers or deep customizations on container-resource scheduling, monitoring, and logging. To accommodate this, you need the flexibility of configuring the required underlying container services while still using the efficient container orchestrator to automate the common processes to achieve operation efficiency. This is where Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS) can help.

Using Amazon ECS

As demonstrated in Figure 4, Amazon ECS is a highly scalable, high-performance container management service that supports containers and allows you to easily run applications on a managed cluster of Amazon Elastic Compute Cloud instances with Amazon Fargate (a serverless compute engine for containers). With this, you can launch and stop containerized applications and query the complete state of your cluster. You have the ability to access and configure many familiar features, like security groups and Elastic Load Balancing (ELB), by sending simple API calls.

Amazon ECS can be used to schedule container placement across your cluster based on resource needs and availability requirements. You can also integrate your own scheduler or third-party schedulers to meet business- or application-specific requirements.

Amazon ECS using AWS Fargate

Figure 4. Amazon ECS using AWS Fargate

Using Amazon EKS

Amazon EKS is a managed service that can be used to run Kubernetes on AWS, without installing, operating, and maintaining your own Kubernetes control plane or nodes. For many developers who have experience using Kubernetes, running Amazon EKS for application container workload is a preferred option because Amazon EKS provides the flexibility of Kubernetes with the scalability, security and resiliency of being an AWS managed service.

Amazon EKS runs and automatically scales the Kubernetes control plane across multiple AWS availability zones to ensure high availability, as in Figure 5. The control plane instances are automatically scaled based on load. Amazon EKS detects and replaces unhealthy control plane instances and provides automated version updates and patching. Amazon EKS enables developers to run up-to-date versions of the open-source Kubernetes software, the existing or new third-party plugins, and tooling. This means you can more easily migrate any standard Kubernetes application to Amazon EKS without code modification.

Scalability and security are essential to your business-critical workloads. Amazon EKS is integrated with many AWS services, including Amazon Elastic Container Registry for container images, ELB for load distribution, IAM for authentication, and Amazon Virtual Private Cloud for isolation.

Amazon EKS scales Kubernetes across multiple availability zones

Figure 5. Amazon EKS scales Kubernetes across multiple availability zones

Conclusion

To innovate and respond to changes faster, businesses need to build modern applications quickly and manage them efficiently. AWS provides container services to run your most sensitive, secure, and business-critical workloads reliably and to-scale.

With little-to-no prior container experience, developers can use Lightsail containers to run web application container workloads with easy-to-use interface. App Runner simplifies application deployment and management down into one particular service for running web applications. With Copilot, you can get step-by-step best practice guidance when you need to deploy microservice architecture with multiple services deployed as containers. Amazon ECS and Amazon EKS give the flexibility of configuring container workloads while maintaining the application deployment and operational efficiency.

Further reading

Deploying AWS Lambda functions using AWS Controllers for Kubernetes (ACK)

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/deploying-aws-lambda-functions-using-aws-controllers-for-kubernetes-ack/

This post is written by Rajdeep Saha, Sr. SSA, Containers/Serverless.

AWS Controllers for Kubernetes (ACK) allows you to manage AWS services directly from Kubernetes. With the ACK service controller for AWS Lambda, you can provision and manage Lambda functions with kubectl and custom resources. With ACK, you can have a single consolidated approach to managing container workloads and other AWS services, such as Lambda, directly from Kubernetes without needing additional infrastructure automation tools.

This post walks you through deploying a sample Lambda function from a Kubernetes cluster provided by Amazon EKS.

Use cases

Some of the use cases for provisioning Lambda functions from ACK include:

  • Your organization already has a DevOps process to deploy resources into the Amazon EKS cluster using Kubernetes declarative YAMLs (known as manifest files). With ACK for AWS Lambda, you can now use manifest files to provision Lambda functions without creating separate infrastructure as a code template.
  • Your project has implemented GitOps with Kubernetes. With GitOps, git becomes the single source of truth, and all the changes are done via git repo. In this model, Kubernetes continuously reconciles the git repo (desired state) with the resources running inside the cluster (current state). If any differences are found, the GitOps process automatically implements changes to the cluster from the git repo. Using ACK for AWS Lambda, since you are creating the Lambda function using Kubernetes custom resource, the GitOps model is applied for Lambda.
  • Your organization has established permissions boundaries for different users and groups using role-based access control (RBAC) and IAM roles for service accounts (IRSA). You can reuse this security model for Lambda without having to create new users and policies.

How ACK for AWS Lambda works

  1. The ‘Ops’ team deploys the ACK service controller for Lambda. This controller runs as a pod within the Amazon EKS cluster.
  2. The controller pod needs permission to read the Lambda function code and create the Lambda function. The Lambda function code is stored as a zip file in an S3 bucket for this example. The permissions are granted to the pod using IRSA.
  3. Each AWS service has separate ACK service controllers. This specific controller for AWS Lambda can act on the custom resource type ‘Function’.
  4. The ‘Dev’ team deploys Kubernetes manifest file with custom resource type ‘Function’. This manifest file defines the necessary fields required to create the function, such as S3 bucket name, zip file name, Lambda function IAM role, etc.
  5. The ACK service controller creates the Lambda function using the values from the manifest file.

Prerequisites

You need a few tools before deploying the sample application. Ensure that you have each of the following in your working environment:

This post uses shell variables to make it easier to substitute the actual names for your deployment. When you see placeholders like NAME=<your xyz name>, substitute in the name for your environment.

Setting up the Amazon EKS cluster

  1. Run the following to create an Amazon EKS cluster. The following single command creates a two-node Amazon EKS cluster with a unique name.
    eksctl create cluster
  2. It may take 15–30 minutes to provision the Amazon EKS cluster. When the cluster is ready, run:
    kubectl get nodes
  3. The output shows the following:
    Output
  4. To get the Amazon EKS cluster name to use throughout the walkthrough, run:
    eksctl get cluster
    
    export EKS_CLUSTER_NAME=<provide the name from the previous command>

Setting up the ACK Controller for Lambda

To set up the ACK Controller for Lambda:

  1. Install an ACK Controller with Helm by following these instructions:
    – Change ‘export SERVICE=s3’ to ‘export SERVICE=lambda’.
    – Change ‘export AWS_REGION=us-west-2’ to reflect your Region appropriately.
  2. To configure IAM permissions for the pod running the Lambda ACK Controller to permit it to create Lambda functions, follow these instructions.
    – Replace ‘SERVICE=”s3”’ with ‘SERVICE=”lambda”’.
  3. Validate that the ACK Lambda controller is running:
    kubectl get pods -n ack-system
  4. The output shows the running ACK Lambda controller pod:
    Output

Provisioning a Lambda function from the Kubernetes cluster

In this section, you write a sample “Hello world” Lambda function. You zip up the code and upload the zip file to an S3 bucket. Finally, you deploy that zip file to a Lambda function using the ACK Controller from the EKS cluster you created earlier. For this example, use Python3.9 as your language runtime.

To provision the Lambda function:

  1. Run the following to create the sample “Hello world” Lambda function code, and then zip it up:
    mkdir my-helloworld-function
    cd my-helloworld-function
    cat << EOF > lambda_function.py 
    import json
    
    def lambda_handler(event, context):
        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }
    EOF
    zip my-deployment-package.zip lambda_function.py
    
  2. Create an S3 bucket following the instructions here. Alternatively, you can use an existing S3 bucket in the same Region of the Amazon EKS cluster.
  3. Run the following to upload the zip file into the S3 bucket from the previous step:
    export BUCKET_NAME=<provide the bucket name from step 2>
    aws s3 cp  my-deployment-package.zip s3://${BUCKET_NAME}
  4. The output shows:
    upload: ./my-deployment-package.zip to s3://<BUCKET_NAME>/my-deployment-package.zip
  5. Create your Lambda function using the ACK Controller. The full spec with all the available fields is listed here. First, provide a name for the function:
    export FUNCTION_NAME=hello-world-s3-ack
  6. Create and deploy the Kubernetes manifest file. The command at the end, kubectl create -f function.yaml submits the manifest file, with kind as ‘Function’. The ACK Controller for Lambda identifies this custom ‘Function’ object and deploys the Lambda function based on the manifest file.
    export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    export LAMBDA_ROLE="arn:aws:iam::${AWS_ACCOUNT_ID}:role/lambda_basic_execution"
    
    cat << EOF > lambdamanifest.yaml 
    apiVersion: lambda.services.k8s.aws/v1alpha1
    kind: Function
    metadata:
     name: $FUNCTION_NAME
     annotations:
       services.k8s.aws/region: $AWS_REGION
    spec:
     name: $FUNCTION_NAME
     code:
       s3Bucket: $BUCKET_NAME
       s3Key: my-deployment-package.zip
     role: $LAMBDA_ROLE
     runtime: python3.9
     handler: lambda_function.lambda_handler
     description: function created by ACK lambda-controller e2e tests
    EOF
    kubectl create -f lambdamanifest.yaml
    
  7. The output shows:
    function.lambda.services.k8s.aws/< FUNCTION_NAME> created
  8. To retrieve the details of the function using a Kubernetes command, run:
    kubectl describe function/$FUNCTION_NAME
  9. This Lambda function returns a “Hello world” message. To invoke the function, run:
    aws lambda invoke --function-name $FUNCTION_NAME  response.json
    cat response.json
    
  10. The Lambda function returns the following output:
    {"statusCode": 200, "body": "\"Hello from Lambda!\""}

Congratulations! You created a Lambda function from your Kubernetes cluster.

To learn how to provision the Lambda function using the ACK controller from an OCI container image instead of a zip file in an S3 bucket, follow these instructions.

Cleaning up

This section cleans up all the resources that you have created. To clean up:

  1. Delete the Lambda function:
    kubectl delete function $FUNCTION_NAME
  2. If you have created a new S3 bucket, delete it by running:
    aws s3 rm s3://${BUCKET_NAME} --recursive
    aws s3api delete-bucket --bucket ${BUCKET_NAME}
  3. Delete the EKS cluster:
    eksctl delete cluster --name $EKS_CLUSTER_NAME
  4. Delete the IAM role created for the ACK Controller. Get the IAM role name by running the following command, then delete the role from the IAM console:
    echo $ACK_CONTROLLER_IAM_ROLE

Conclusion

This blog post shows how AWS Controllers for Kubernetes enables you to deploy a Lambda function directly from your Amazon EKS environment. AWS Controllers for Kubernetes provides a convenient way to connect your Kubernetes applications to AWS services directly from Kubernetes.

ACK is open source: you can request new features and report issues on the ACK community GitHub repository.

For more serverless learning resources, visit Serverless Land.

Design patterns to manage Amazon EMR on EKS workloads for Apache Spark

Post Syndicated from Jamal Arif original https://aws.amazon.com/blogs/big-data/design-patterns-to-manage-amazon-emr-on-eks-workloads-for-apache-spark/

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Kubernetes uses namespaces to provide isolation between groups of resources within a single Kubernetes cluster. Amazon EMR creates a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster. Amazon EMR can then run analytics workloads on that namespace.

In EMR on EKS, you can submit your Spark jobs to Amazon EMR virtual clusters using the AWS Command Line Interface (AWS CLI), SDK, or Amazon EMR Studio. Amazon EMR requests the Kubernetes scheduler on Amazon EKS to schedule pods. For every job you run, EMR on EKS creates a container with an Amazon Linux 2 base image, Apache Spark, and associated dependencies. Each Spark job runs in a pod on Amazon EKS worker nodes. If your Amazon EKS cluster has worker nodes in different Availability Zones, the Spark application driver and executor pods can spread across multiple Availability Zones. In this case, data transfer charges apply for cross-AZ communication and increases data processing latency. If you want to reduce data processing latency and avoid cross-AZ data transfer costs, you should configure Spark applications to run only within a single Availability Zone.

In this post, we share four design patterns to manage EMR on EKS workloads for Apache Spark. We then show how to use a pod template to schedule a job with EMR on EKS, and use Karpenter as our autoscaling tool.

Pattern 1: Manage Spark jobs by pod template

Customers often consolidate multiple applications on a shared Amazon EKS cluster to improve utilization and save costs. However, each application may have different requirements. For example, you may want to run performance-intensive workloads such as machine learning model training jobs on SSD-backed instances for better performance, or fault-tolerant and flexible applications on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for lower cost. In EMR on EKS, there are a few ways to configure how your Spark job runs on Amazon EKS worker nodes. You can utilize the Spark configurations on Kubernetes with the EMR on EKS StartJobRun API, or you can use Spark’s pod template feature. Pod templates are specifications that determine how to run each pod on your EKS clusters. With pod templates, you have more flexibility and can use pod template files to define Kubernetes pod configurations that Spark doesn’t support.

You can use pod templates to achieve the following benefits:

  • Reduce costs – You can schedule Spark executor pods to run on EC2 Spot Instances while scheduling Spark driver pods to run on EC2 On-Demand Instances.
  • Improve monitoring – You can enhance your Spark workload’s observability. For example, you can deploy a sidecar container via a pod template to your Spark job that can forward logs to your centralized logging application
  • Improve resource utilization – You can support multiple teams running their Spark workloads on the same shared Amazon EKS cluster

You can implement these patterns using pod templates and Kubernetes labels and selectors. Kubernetes labels are key-value pairs that are attached to objects, such as Kubernetes worker nodes, to identify attributes that are meaningful and relevant to users. You can then choose where Kubernetes schedules pods using nodeSelector or Kubernetes affinity and anti-affinity so that it can only run on specific worker nodes. nodeSelector is the simplest way to constrain pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define.

Autoscaling in Spark workload

Autoscaling is a function that automatically scales your compute resources up or down to changes in demand. For Kubernetes auto scaling, Amazon EKS supports two auto scaling products: the Kubernetes Cluster Autoscaler and the Karpenter open-source auto scaling project. Kubernetes autoscaling ensures your cluster has enough nodes to schedule your pods without wasting resources. If some pods fail to schedule on current worker nodes due to insufficient resources, it increases the size of the cluster and adds additional nodes. It also attempts to remove underutilized nodes when its pods can run elsewhere.

Pattern 2: Turn on Dynamic Resource Allocation (DRA) in Spark

Spark provides a mechanism called Dynamic Resource Allocation (DRA), which dynamically adjusts the resources your application occupies based on the workload. With DRA, the Spark driver spawns the initial number of executors and then scales up the number until the specified maximum number of executors is met to process the pending tasks. Idle executors are deleted when there are no pending tasks. It’s particularly useful if you’re not certain how many executors are needed for your job processing.

You can implement it in EMR on EKS by following the Dynamic Resource Allocation workshop.

Pattern 3: Fully control cluster autoscaling by Cluster Autoscaler

Cluster Autoscaler utilizes the concept of node groups as the element of capacity control and scale. In AWS, node groups are implemented by auto scaling groups. Cluster Autoscaler implements it by controlling the DesiredReplicas field of your auto scaling groups.

To save costs and improve resource utilization, you can use Cluster Autoscaler in your Amazon EKS cluster to automatically scale your Spark pods. The following are recommendations for autoscaling Spark jobs with Amazon EMR on EKS using Cluster Autoscaler:

  • Create Availability Zone bounded auto scaling groups to make sure Cluster Autoscaler only adds worker nodes in the same Availability Zone to avoid cross-AZ data transfer charges and data processing latency.
  • Create separate node groups for EC2 On-Demand and Spot Instances. By doing this, you can add or shrink driver pods and executor pods independently.
  • In Cluster Autoscaler, each node in a node group needs to have identical scheduling properties. That includes EC2 instance types, which should be of similar vCPU to memory ratio to avoid inconsistency and wastage of resources. To learn more about Cluster Autoscaler node groups best practices, refer to Configuring your Node Groups.
  • Adhere to Spot Instance best practices and maximize diversification to take advantages of multiple Spot pools. Create multiple node groups for Spark executor pods with different vCPU to memory ratios. This greatly increases the stability and resiliency of your application.
  • When you have multiple node groups, use pod templates and Kubernetes labels and selectors to manage Spark pod deployment to specific Availability Zones and EC2 instance types.

The following diagram illustrates Availability Zone bounded auto scaling groups.

AZ bounded CA ASG
As multiple node groups are created, Cluster Autoscaler has the concept of expanders, which provide different strategies for selecting which node group to scale. As of this writing, the following strategies are supported: random, most-pods, least-waste, and priority. With multiple node groups of EC2 On-Demand and Spot Instances, you can use the priority expander, which allows Cluster Autoscaler to select the node group that has the highest priority assigned by the user. For configuration details, refer to Priority based expander for Cluster Autoscaler.

Pattern 4: Group-less autoscaling with Karpenter

Karpenter is an open-source, flexible, high-performance Kubernetes cluster auto scaler built with AWS. The overall goal is the same of auto scaling Amazon EKS clusters to adjust un-schedulable pods; however, Karpenter takes a different approach than Cluster Autoscaler, known as group-less provisioning. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch minimal compute resources to fit the un-schedulable pods for efficient binpacking and reducing scheduling latency. It can also delete nodes to reduce infrastructure costs. Karpenter works directly with the Amazon EC2 Fleet.

To configure Karpenter, you create provisioners that define how Karpenter manages un-schedulable pods and expired nodes. You should utilize the concept of layered constraints to manage scheduling constraints. To reduce EMR on EKS costs and improve Amazon EKS cluster utilization, you can use Karpenter with similar constraints of Single-AZ, On-Demand Instances for Spark driver pods, and Spot Instances for executor pods without creating multiple types of node groups. With its group-less approach, Karpenter allows you to be more flexible and diversify better.

The following are recommendations for auto scaling EMR on EKS with Karpenter:

  • Configure Karpenter provisioners to launch nodes in a single Availability Zone to avoid cross-AZ data transfer costs and reduce data processing latency.
  • Create a provisioner for EC2 Spot Instances and EC2 On-Demand Instances. You can reduce costs by scheduling Spark driver pods to run on EC2 On-Demand Instances and schedule Spark executor pods to run on EC2 Spot Instances.
  • Limit the instance types by providing a list of EC2 instances or let Karpenter choose from all the Spot pools available to it. This follows the Spot best practices of diversifying across multiple Spot pools.
  • Use pod templates and Kubernetes labels and selectors to allow Karpenter to spin up right-sized nodes required for un-schedulable pods.

The following diagram illustrates how Karpenter works.

Karpenter How it Works

To summarize the design patterns we discussed:

  1. Pod templates help tailor your Spark workloads. You can configure Spark pods in a single Availability Zone and utilize EC2 Spot Instances for Spark executor pods, resulting in better price-performance.
  2. EMR on EKS supports the DRA feature in Spark. It is useful if you’re not familiar how many Spark executors are needed for your job processing, and use DRA to dynamically adjust the resources your application needs.
  3. Utilizing Cluster Autoscaler enables you to fully control how to autoscale your Amazon EMR on EKS workloads. It improves your Spark application availability and cluster efficiency by rapidly launching right-sized compute resources.
  4. Karpenter simplifies autoscaling with its group-less provisioning of compute resources. The benefits include reduced scheduling latency, and efficient bin-packing to reduce infrastructure costs.

Walkthrough overview

In our example walkthrough, we will show how to use Pod template to schedule a job with EMR on EKS. We use Karpenter as our autoscaling tool.

We complete the following steps to implement the solution:

  1. Create an Amazon EKS cluster.
  2. Prepare the cluster for EMR on EKS.
  3. Register the cluster with Amazon EMR.
  4. For Amazon EKS auto scaling, set up Karpenter auto scaling in Amazon EKS.
  5. Submit a sample Spark job using pod templates to run in single Availability Zone and utilize Spot for Spark executor pods.

The following diagram illustrates this architecture.

Walkthrough Overview

Prerequisites

To follow along with the walkthrough, ensure that you have the following prerequisite resources:

Create an Amazon EKS cluster

There are two ways to create an EKS cluster: you can use AWS Management Console and AWS CLI, or you can install all the required resources for Amazon EKS using eksctl, a simple command line utility for creating and managing Kubernetes clusters on EKS. For this post, we use eksctl to create our cluster.

Let’s start with installing the tools to set up and manage your Kubernetes cluster.

  1. Install the AWS CLI with the following command (Linux OS) and confirm it works:
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
    aws --version

    For other operating systems, see Installing, updating, and uninstalling the AWS CLI version.

  2. Install eksctl, the command line utility for creating and managing Kubernetes clusters on Amazon EKS:
    curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
    sudo mv -v /tmp/eksctl /usr/local/bin
    eksctl version

    eksctl is a tool jointly developed by AWS and Weaveworks that automates much of the experience of creating EKS clusters.

  3. Install the Kubernetes command-line tool, kubectl, which allows you to run commands against Kubernetes clusters:
    curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
    chmod +x ./kubectl
    sudo mv ./kubectl /usr/local/bin

  4. Create a new file called eks-create-cluster.yaml with the following:
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: emr-on-eks-blog-cluster
      region: us-west-2
    
    availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]
    
    managedNodeGroups:#On-demand nodegroups for spark job
    - name: singleaz-ng-ondemand
      instanceType: m5.xlarge
      desiredCapacity: 1
      availabilityZones: ["us-west-2b"]
    

  5. Create an Amazon EKS cluster using the eks-create-cluster.yaml file:
    eksctl create cluster -f eks-create-cluster.yaml

    In this Amazon EKS cluster, we create a single managed node group with a general purpose m5.xlarge EC2 Instance. Launching Amazon EKS cluster, its managed node groups, and all dependencies typically takes 10–15 minutes.

  6. After you create the cluster, you can run the following to confirm all node groups were created:
    eksctl get nodegroups --cluster emr-on-eks-blog-cluster

    You can now use kubectl to interact with the created Amazon EKS cluster.

  7. After you create your Amazon EKS cluster, you must configure your kubeconfig file for your cluster using the AWS CLI:
    aws eks --region us-west-2 update-kubeconfig --name emr-on-eks-blog-cluster
    kubectl cluster-info
    

You can now use kubectl to connect to your Kubernetes cluster.

Prepare your Amazon EKS cluster for EMR on EKS

Now we prepare our Amazon EKS cluster to integrate it with EMR on EKS.

  1. Let’s create the namespace emr-on-eks-blog in our Amazon EKS cluster:
    kubectl create namespace emr-on-eks-blog

  2. We use the automation powered by eksctl to create role-based access control permissions and to add the EMR on EKS service-linked role into the aws-auth configmap:
    eksctl create iamidentitymapping --cluster emr-on-eks-blog-cluster --namespace emr-on-eks-blog --service-name "emr-containers"

  3. The Amazon EKS cluster already has an OpenID Connect provider URL. You enable IAM roles for service accounts by associating IAM with the Amazon EKS cluster OIDC:
    eksctl utils associate-iam-oidc-provider --cluster emr-on-eks-blog-cluster —approve

    Now let’s create the IAM role that Amazon EMR uses to run Spark jobs.

  4. Create the file blog-emr-trust-policy.json:

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Principal": {
    "Service": "elasticmapreduce.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
    }
    ]
    }

    Set up an IAM role:

    aws iam create-role --role-name blog-emrJobExecutionRole —assume-role-policy-document file://blog-emr-trust-policy.json

    This IAM role contains all permissions that the Spark job needs—for instance, we provide access to S3 buckets and Amazon CloudWatch to access necessary files (pod templates) and share logs.

    Next, we need to attach the required IAM policies to the role so it can write logs to Amazon S3 and CloudWatch.

  5. Create the file blog-emr-policy-document with the required IAM policies. Replace the bucket name with your S3 bucket ARN.

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket"
    ],
    "Resource": ["arn:aws:s3:::<bucket-name>"]
    },
    {
    "Effect": "Allow",
    "Action": [
    "logs:PutLogEvents",
    "logs:CreateLogStream",
    "logs:DescribeLogGroups",
    "logs:DescribeLogStreams"
    ],
    "Resource": [
    "arn:aws:logs:::"
    ]
    }
    ]
    }

    Attach it to the IAM role created in the previous step:

    aws iam put-role-policy --role-name blog-emrJobExecutionRole --policy-name blog-EMR-JobExecution-policy —policy-document file://blog-emr-policy-document.json

  6. Now we update the trust relationship between the IAM role we just created with the Amazon EMR service identity. The namespace provided here in the trust policy needs to be same when registering the virtual cluster in next step:
    aws emr-containers update-role-trust-policy --cluster-name emr-on-eks-blog-cluster --namespace emr-on-eks-blog --role-name blog-emrJobExecutionRole --region us-west-2

Register the Amazon EKS cluster with Amazon EMR

Registering your Amazon EKS cluster is the final step to set up EMR on EKS to run workloads.

We create a virtual cluster and map it to the Kubernetes namespace created earlier:

aws emr-containers create-virtual-cluster \
    --region us-west-2 \
    --name emr-on-eks-blog-cluster \
    --container-provider '{
       "id": "emr-on-eks-blog-cluster",
       "type": "EKS",
       "info": {
          "eksInfo": {
              "namespace": "emr-on-eks-blog"
          }
       }
    }'

After you register, you should get confirmation that your EMR virtual cluster is created:

{
"arn": "arn:aws:emr-containers:us-west-2:142939128734:/virtualclusters/lwpylp3kqj061ud7fvh6sjuyk",
"id": "lwpylp3kqj061ud7fvh6sjuyk",
"name": "emr-on-eks-blog-cluster"
}

A virtual cluster is an Amazon EMR concept that means that Amazon EMR registered to a Kubernetes namespace and can run jobs in that namespace. If you navigate to your Amazon EMR console, you can see the virtual cluster listed.

Set up Karpenter in Amazon EKS

To get started with Karpenter, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources. For more information, refer to Getting Started.

Karpenter’s single responsibility is to provision compute for your Kubernetes clusters, which is configured by a custom resource called a provisioner. Once installed in your cluster, the Karpenter provisioner observes incoming Kubernetes pods, which can’t be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.

For our use case, we provision two provisioners.

The first is a Karpenter provisioner for Spark driver pods to run on EC2 On-Demand Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ondemand
spec:
  ttlSecondsUntilExpired: 2592000 

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: on-demand

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider: 
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

The second is a Karpenter provisioner for Spark executor pods to run on EC2 Spot Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 2592000 

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: spot

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider: 
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

Note the highlighted portion of the provisioner config. In the requirements section, we use the well-known labels with Amazon EKS and Karpenter to add constraints for how Karpenter launches nodes. We add constraints that if the pod is looking for a label karpenter.sh/capacity-type: spot, it uses this provisioner to launch an EC2 Spot Instance only in Availability Zone us-west-2b. Similarly, we follow the same constraint for the karpenter.sh/capacity-type: on-demand label. We can also be more granular and provide EC2 instance types in our provisioner, and they can be of different vCPU and memory ratios, giving you more flexibility and adding resiliency to your application. Karpenter launches nodes only when both the provisioner’s and pod’s requirements are met. To learn more about the Karpenter provisioner API, refer to Provisioner API.

In the next step, we define pod requirements and align them with what we have defined in Karpenter’s provisioner.

Submit Spark job using Pod template

In Kubernetes, labels are key-value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users. You can constrain a pod so that it can only run on particular set of nodes. There are several ways to do this, and the recommended approaches all use label selectors to facilitate the selection.

Beginning with Amazon EMR versions 5.33.0 or 6.3.0, EMR on EKS supports Spark’s pod template feature. We use pod templates to add specific labels where Spark driver and executor pods should be launched.

Create a pod template file for a Spark driver pod and save them in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container

Create a pod template file for a Spark executor pod and save them in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor # This will be interpreted as Spark driver container

Pod templates provide different fields to manage job scheduling. For additional details, refer to Pod template fields. Note the nodeSelector for the Spark driver pods and Spark executor pods, which match the labels we defined with the Karpenter provisioner.

For a sample Spark job, we use the following code, which creates multiple parallel threads and waits for a few seconds:

cat << EOF > threadsleep.py
import sys
from time import sleep
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("threadsleep").getOrCreate()
def sleep_for_x_seconds(x):sleep(x*20)
sc=spark.sparkContext
sc.parallelize(range(1,6), 5).foreach(sleep_for_x_seconds)
spark.stop()
EOF

Copy the sample Spark job into your S3 bucket:

aws s3 mb s3://<YourS3Bucket>
aws s3 cp threadsleep.py s3://<YourS3Bucket>

Before we submit the Spark job, let’s get the required values of the EMR virtual cluster and Amazon EMR job execution role ARN:

export S3blogbucket= s3://<YourS3Bucket>
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --region us-west-2 --output text)

export EMR_ROLE_ARN=$(aws iam get-role --role-name blog-emrJobExecutionRole --query Role.Arn --region us-west-2 --output text)

To enable the pod template feature with EMR on EKS, you can use configuration-overrides to specify the Amazon S3 path to the pod template:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-threadsleep-single-az \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-5.33.0-latest \
--region us-west-2 \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "'${S3blogbucket}'/threadsleep.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=2"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"1G",
"spark.kubernetes.driver.podTemplateFile":"'${S3blogbucket}'/spark_driver_podtemplate.yaml", "spark.kubernetes.executor.podTemplateFile":"'${S3blogbucket}'/spark_executor_podtemplate.yaml"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-on-eks/emreksblog", 
        "logStreamNamePrefix": "threadsleep"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3blogbucket"'/logs/"
      }
    }
}'

In the Spark job, we’re requesting two cores for the Spark driver and one core each for Spark executor pod. Because we only had a single EC2 instance in our managed node group, Karpenter looks at the un-schedulable Spark driver pods and utilizes the on-demand provisioner to launch EC2 On-Demand Instances for Spark driver pods in us-west-2b. Similarly, when the Spark executor pods are in pending state, because there are no Spot Instances, Karpenter launches Spot Instances in us-west-2b.

This way, Karpenter optimizes your costs by starting from zero Spot and On-Demand Instances and only creates them dynamically when required. Additionally, Karpenter batches pending pods and then binpacks them based on CPU, memory, and GPUs required, taking into account node overhead, VPC CNI resources required, and daemon sets that will be packed when bringing up a new node. This makes sure you’re efficiently utilizing your resources with least wastage.

Clean up

Don’t forget to clean up the resources you created to avoid any unnecessary charges.

  1. Delete all the virtual clusters that you created:
    #List all the virtual cluster ids
    aws emr-containers list-virtual-clusters#Delete virtual cluster by passing virtual cluster id
    aws emr-containers delete-virtual-cluster —id <virtual-cluster-id>
    

  2. Delete the Amazon EKS cluster:
    eksctl delete cluster emr-on-eks-blog-cluster

  3. Delete the EMR_EKS_Job_Execution_Role role and policies.

Conclusion

In this post, we saw how to create an Amazon EKS cluster, configure Amazon EKS managed node groups, create an EMR virtual cluster on Amazon EKS, and submit Spark jobs. Using pod templates, we saw how to ensure Spark workloads are scheduled in the same Availability Zone and utilize Spot with Karpenter auto scaling to reduce costs and optimize your Spark workloads.

To get started, try out the EMR on EKS workshop. For more resources, refer to the following:


About the author

Jamal Arif is a Solutions Architect at AWS and a containers specialist. He helps AWS customers in their modernization journey to build innovative, resilient, and cost-effective solutions. In his spare time, Jamal enjoys spending time outdoors with his family hiking and mountain biking.

AWS Week in Review – August 1, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-august-1-2022/

AWS re:Inforce returned to Boston last week, kicking off with a keynote from Amazon Chief Security Officer Steve Schmidt and AWS Chief Information Security officer C.J. Moses:

Be sure to take some time to watch this video and the other leadership sessions, and to use what you learn to take some proactive steps to improve your security posture.

Last Week’s Launches
Here are some launches that caught my eye last week:

AWS Wickr uses 256-bit end-to-end encryption to deliver secure messaging, voice, and video calling, including file sharing and screen sharing, across desktop and mobile devices. Each call, message, and file is encrypted with a new random key and can be decrypted only by the intended recipient. AWS Wickr supports logging to a secure, customer-controlled data store for compliance and auditing, and offers full administrative control over data: permissions, ephemeral messaging options, and security groups. You can now sign up for the preview.

AWS Marketplace Vendor Insights helps AWS Marketplace sellers to make security and compliance data available through AWS Marketplace in the form of a unified, web-based dashboard. Designed to support governance, risk, and compliance teams, the dashboard also provides evidence that is backed by AWS Config and AWS Audit Manager assessments, external audit reports, and self-assessments from software vendors. To learn more, read the What’s New post.

GuardDuty Malware Protection protects Amazon Elastic Block Store (EBS) volumes from malware. As Danilo describes in his blog post, a malware scan is initiated when Amazon GuardDuty detects that a workload running on an EC2 instance or in a container appears to be doing something suspicious. The new malware protection feature creates snapshots of the attached EBS volumes, restores them within a service account, and performs an in-depth scan for malware. The scanner supports many types of file systems and file formats and generates actionable security findings when malware is detected.

Amazon Neptune Global Database lets you build graph applications that run across multiple AWS Regions using a single graph database. You can deploy a primary Neptune cluster in one region and replicate its data to up to five secondary read-only database clusters, with up to 16 read replicas each. Clusters can recover in minutes in the result of an (unlikely) regional outage, with a Recovery Point Objective (RPO) of 1 second and a Recovery Time Objective (RTO) of 1 minute. To learn a lot more and see this new feature in action, read Introducing Amazon Neptune Global Database.

Amazon Detective now Supports Kubernetes Workloads, with the ability to scale to thousands of container deployments and millions of configuration changes per second. It ingests EKS audit logs to capture API activity from users, applications, and the EKS control plane, and correlates user activity with information gleaned from Amazon VPC flow logs. As Channy notes in his blog post, you can enable Amazon Detective and take advantage of a free 30 day trial of the EKS capabilities.

AWS SSO is Now AWS IAM Identity Center in order to better represent the full set of workforce and account management capabilities that are part of IAM. You can create user identities directly in IAM Identity Center, or you can connect your existing Active Directory or standards-based identify provider. To learn more, read this post from the AWS Security Blog.

AWS Config Conformance Packs now provide you with percentage-based scores that will help you track resource compliance within the scope of the resources addressed by the pack. Scores are computed based on the product of the number of resources and the number of rules, and are reported to Amazon CloudWatch so that you can track compliance trends over time. To learn more about how scores are computed, read the What’s New post.

Amazon Macie now lets you perform one-click temporary retrieval of sensitive data that Macie has discovered in an S3 bucket. You can retrieve up to ten examples at a time, and use these findings to accelerate your security investigations. All of the data that is retrieved and displayed in the Macie console is encrypted using customer-managed AWS Key Management Service (AWS KMS) keys. To learn more, read the What’s New post.

AWS Control Tower was updated multiple times last week. CloudTrail Organization Logging creates an org-wide trail in your management account to automatically log the actions of all member accounts in your organization. Control Tower now reduces redundant AWS Config items by limiting recording of global resources to home regions. To take advantage of this change you need to update to the latest landing zone version and then re-register each Organizational Unit, as detailed in the What’s New post. Lastly, Control Tower’s region deny guardrail now includes AWS API endpoints for AWS Chatbot, Amazon S3 Storage Lens, and Amazon S3 Multi Region Access Points. This allows you to limit access to AWS services and operations for accounts enrolled in your AWS Control Tower environment.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items and customer stories that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras writes a weekly open source newsletter and highlights new open source projects, tools, and demos from the AWS community. Read installment #122 here.

Growy Case Study – This Netherlands-based company is building fully-automated robot-based vertical farms that grow plants to order. Read the case study to learn how they use AWS IoT and other services to monitor and control light, temperature, CO2, and humidity to maximize yield and quality.

Journey of a Snap on Snapchat – This video shows you how a snapshot flows end-to-end from your camera to AWS, to your friends. With over 300 million daily active users, Snap takes advantage of Amazon Elastic Kubernetes Service (EKS), Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon CloudFront, and many other AWS services, storing over 400 terabytes of data in DynamoDB and managing over 900 EKS clusters.

Cutting Cardboard Waste – Bin packing is almost certainly a part of every computer science curriculum! In the linked article from the Amazon Science site, you can learn how an Amazon Principal Research Scientist developed PackOpt to figure out the optimal set of boxes to use for shipments from Amazon’s global network of fulfillment centers. This is an NP-hard problem and the article describes how they build a parallelized solution that explores a multitude of alternative solutions, all running on AWS.

Upcoming Events
Check your calendar and sign up for these online and in-person AWS events:

AWS SummitAWS Global Summits – AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Registrations are open for the following AWS Summits in August:

Imagine Conference 2022IMAGINE 2022 – The IMAGINE 2022 conference will take place on August 3 at the Seattle Convention Center, Washington, USA. It’s a no-cost event that brings together education, state, and local leaders to learn about the latest innovations and best practices in the cloud. You can register here.

That’s all for this week. Check back next Monday for another Week in Review!

Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Epos Now modernized their data platform by building an end-to-end data lake with the AWS Data Lab

Post Syndicated from Debadatta Mohapatra original https://aws.amazon.com/blogs/big-data/how-epos-now-modernized-their-data-platform-by-building-an-end-to-end-data-lake-with-the-aws-data-lab/

Epos Now provides point of sale and payment solutions to over 40,000 hospitality and retailers across 71 countries. Their mission is to help businesses of all sizes reach their full potential through the power of cloud technology, with solutions that are affordable, efficient, and accessible. Their solutions allow businesses to leverage actionable insights, manage their business from anywhere, and reach customers both in-store and online.

Epos Now currently provides real-time and near-real-time reports and dashboards to their merchants on top of their operational database (Microsoft SQL Server). With a growing customer base and new data needs, the team started to see some issues in the current platform.

First, they observed performance degradation for serving the reporting requirements from the same OLTP database with the current data model. A few metrics that needed to be delivered in real time (seconds after a transaction was complete) and a few metrics that needed to be reflected in the dashboard in near-real-time (minutes) took several attempts to load in the dashboard.

This started to cause operational issues for their merchants. The end consumers of reports couldn’t access the dashboard in a timely manner.

Cost and scalability also became a major problem because one single database instance was trying to serve many different use cases.

Epos Now needed a strategic solution to address these issues. Additionally, they didn’t have a dedicated data platform for doing machine learning and advanced analytics use cases, so they decided on two parallel strategies to resolve their data problems and better serve merchants:

  • The first was to rearchitect the near-real-time reporting feature by moving it to a dedicated Amazon Aurora PostgreSQL-Compatible Edition database, with a specific reporting data model to serve to end consumers. This will improve performance, uptime, and cost.
  • The second was to build out a new data platform for reporting, dashboards, and advanced analytics. This will enable use cases for internal data analysts and data scientists to experiment and create multiple data products, ultimately exposing these insights to end customers.

In this post, we discuss how Epos Now designed the overall solution with support from the AWS Data Lab. Having developed a strong strategic relationship with AWS over the last 3 years, Epos Now opted to take advantage of the AWS Data lab program to speed up the process of building a reliable, performant, and cost-effective data platform. The AWS Data Lab program offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives.

Working with an AWS Data Lab Architect, Epos Now commenced weekly cadence calls to come up with a high-level architecture. After the objective, success criteria, and stretch goals were clearly defined, the final step was to draft a detailed task list for the upcoming 3-day build phase.

Overview of solution

As part of the 3-day build exercise, Epos Now built the following solution with the ongoing support of their AWS Data Lab Architect.

Epos Now Arch Image

The platform consists of an end-to-end data pipeline with three main components:

  • Data lake – As a central source of truth
  • Data warehouse – For analytics and reporting needs
  • Fast access layer – To serve near-real-time reports to merchants

We chose three different storage solutions:

  • Amazon Simple Storage Service (Amazon S3) for raw data landing and a curated data layer to build the foundation of the data lake
  • Amazon Redshift to create a federated data warehouse with conformed dimensions and star schemas for consumption by Microsoft Power BI, running on AWS
  • Aurora PostgreSQL to store all the data for near-real-time reporting as a fast access layer

In the following sections, we go into each component and supporting services in more detail.

Data lake

The first component of the data pipeline involved ingesting the data from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) topic using Amazon MSK Connect to land the data into an S3 bucket (landing zone). The Epos Now team used the Confluent Amazon S3 sink connector to sink the data to Amazon S3. To make the sink process more resilient, Epos Now added the required configuration for dead-letter queues to redirect the bad messages to another topic. The following code is a sample configuration for a dead-letter queue in Amazon MSK Connect:

Because Epos Now was ingesting from multiple data sources, they used Airbyte to transfer the data to a landing zone in batches. A subsequent AWS Glue job reads the data from the landing bucket , performs data transformation, and moves the data to a curated zone of Amazon S3 in optimal format and layout. This curated layer then became the source of truth for all other use cases. Then Epos Now used an AWS Glue crawler to update the AWS Glue Data Catalog. This was augmented by the use of Amazon Athena for doing data analysis. To optimize for cost, Epos Now defined an optimal data retention policy on different layers of the data lake to save money as well as keep the dataset relevant.

Data warehouse

After the data lake foundation was established, Epos Now used a subsequent AWS Glue job to load the data from the S3 curated layer to Amazon Redshift. We used Amazon Redshift to make the data queryable in both Amazon Redshift (internal tables) and Amazon Redshift Spectrum. The team then used dbt as an extract, load, and transform (ELT) engine to create the target data model and store it in target tables and views for internal business intelligence reporting. The Epos Now team wanted to use their SQL knowledge to do all ELT operations in Amazon Redshift, so they chose dbt to perform all the joins, aggregations, and other transformations after the data was loaded into the staging tables in Amazon Redshift. Epos Now is currently using Power BI for reporting, which was migrated to the AWS Cloud and connected to Amazon Redshift clusters running inside Epos Now’s VPC.

Fast access layer

To build the fast access layer to deliver the metrics to Epos Now’s retail and hospitality merchants in near-real time, we decided to create a separate pipeline. This required developing a microservice running a Kafka consumer job to subscribe to the same Kafka topic in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The microservice received the messages, conducted the transformations, and wrote the data to a target data model hosted on Aurora PostgreSQL. This data was delivered to the UI layer through an API also hosted on Amazon EKS, exposed through Amazon API Gateway.

Outcome

The Epos Now team is currently building both the fast access layer and a centralized lakehouse architecture-based data platform on Amazon S3 and Amazon Redshift for advanced analytics use cases. The new data platform is best positioned to address scalability issues and support new use cases. The Epos Now team has also started offloading some of the real-time reporting requirements to the new target data model hosted in Aurora. The team has a clear strategy around the choice of different storage solutions for the right access patterns: Amazon S3 stores all the raw data, and Aurora hosts all the metrics to serve real-time and near-real-time reporting requirements. The Epos Now team will also enhance the overall solution by applying data retention policies in different layers of the data platform. This will address the platform cost without losing any historical datasets. The data model and structure (data partitioning, columnar file format) we designed greatly improved query performance and overall platform stability.

Conclusion

Epos Now revolutionized their data analytics capabilities, taking advantage of the breadth and depth of the AWS Cloud. They’re now able to serve insights to internal business users, and scale their data platform in a reliable, performant, and cost-effective manner.

The AWS Data Lab engagement enabled Epos Now to move from idea to proof of concept in 3 days using several previously unfamiliar AWS analytics services, including AWS Glue, Amazon MSK, Amazon Redshift, and Amazon API Gateway.

Epos Now is currently in the process of implementing the full data lake architecture, with a rollout to customers planned for late 2022. Once live, they will deliver on their strategic goal to provide real-time transactional data and put insights directly in the hands of their merchants.


About the Authors

Jason Downing is VP of Data and Insights at Epos Now. He is responsible for the Epos Now data platform and product direction. He specializes in product management across a range of industries, including POS systems, mobile money, payments, and eWallets.

Debadatta Mohapatra is an AWS Data Lab Architect. He has extensive experience across big data, data science, and IoT, across consulting and industrials. He is an advocate of cloud-native data platforms and the value they can drive for customers across industries.

Amazon Detective Supports Kubernetes Workloads on Amazon EKS for Security Investigations

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-detective-supports-kubernetes-workloads-on-amazon-eks-for-security-investigations/

In March 2020, we introduced Amazon Detective, a fully managed service that makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities.

Amazon Detective continuously extracts temporal events such as login attempts, API calls, and network traffic from Amazon GuardDutyAWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC) Flow Logs into a graph model that summarizes the resource behaviors and interactions observed across your entire AWS environment. We have added new features such as AWS IAM Role session analysis, enhanced IP address analytics, Splunk integration, Amazon S3 and DNS finding types, and the support of AWS Organizations.

Customers are rapidly moving to containers to deploy Kubernetes workloads with Amazon Elastic Kubernetes Service (Amazon EKS). Its highly programmatic nature allows thousands of individual container deployments and millions of configuration changes to occur in seconds. To effectively secure EKS workloads, it is important to monitor container deployments and configurations that are captured in the form of EKS audit logs and to correlate activities to user activity and network traffic happening across AWS accounts.

Today we announce new capabilities in Amazon Detective to expand security investigation coverage for Kubernetes workloads running on Amazon EKS. When you enable this new feature, Amazon Detective automatically starts ingesting EKS audit logs to capture chronological API activity from users, applications, and the control plane in Amazon EKS for clusters, pods, container images, and Kubernetes subjects (Kubernetes users and service accounts).

Detective automatically correlates user activity using CloudTrail, and network activity using Amazon VPC Flow logs, without the need for you to enable, store, or retain logs manually. The service gleans key security information from these logs and retains them in a security behavioral graph database that enables fast cross-referenced access to twelve months of activity. Detective provides a data analysis and visualization layer purpose-built to answer common security questions backed by a behavioral graph database that allows you to quickly investigate potential malicious behavior associated with your EKS workloads.

You can rapidly respond to security issues rather than focusing on log management, operational systems, or ongoing security tooling maintenance. Detective’s EKS capabilities come with a free 30-day trial for all customers that allows you to ensure that the capabilities meet your needs and to fully understand the cost for the service on an ongoing basis.

Getting Started with Security Investigations for EKS Audit Logs
To get started, enable Amazon Detective with just a few clicks in the AWS Management Console. GuardDuty is a prerequisite of Amazon Detective. When you try to enable Detective, Detective checks whether GuardDuty has been enabled for your account. You must either enable GuardDuty or wait for 48 hours. This allows GuardDuty to assess the data volume that your account produces.

You can enable your account by attaching the AWS IAM policy or delegate it to an administrator of your organization. To learn more, refer to Setting up Detective in the AWS documentation.

To enable EKS support in Detective as an existing customer, navigate to the Settings menu in the left panel and select General. Under Optional source packages, enable EKS audit logs.

If you are a new customer of Detective, the EKS protection feature will be enabled by default. If you do not want to trial EKS audit logs right away, you can disable this feature within the first week of enabling Detective and preserve the full 30-day free trial period to use in the future.

Once enabled, Detective will begin monitoring the Kubernetes audit logs that are generated by Amazon EKS, extracting and correlating information for security usage. You do not need to enable any log sources or make any configuration changes to your existing EKS clusters or future deployments.

You can see recent monitoring results of your EKS clusters on the Summary page.

When you choose one of the EKS clusters, you will see the details of containers running in the cluster, Kubernetes API activities, and network activities that occurred on this resource around the scope time.

In the Overview tab, you also see details about all containers running in the cluster, including their pod, image and security context.

In the Kubernetes API activity tab, you can get an overview of the full API activities involving the EKS cluster. You can choose a time range to drill down based on specific API methods within the EKS cluster. When you select a specific time, you can see API subjects, IP addresses, and the number of API calls by the success, failure, unauthorized, or forbidden state.

You can also see details of newly observed Kubernetes API calls  inside this cluster for the first time and subjects with increased volume that happened inside the cluster.

Enabling GuardDuty EKS Protection
In January 2022, Amazon GuardDuty expanded coverage to EKS cluster activity to identify malicious or suspicious behavior that represents potential threats to container workloads.

When the optional GuardDuty EKS Protection is enabled, GuardDuty will continuously monitor your EKS deployments and alert you to threats detected in your workloads. You can view and investigate these security findings in Detective.

With Detective for EKS enabled, you can quickly access information about the resources involved in the finding, such as their CloudTrail and Kubernetes API activity, and netflow information. This can aid in investigation and help you determine root cause, impact, and other related resources that may also be compromised.

To learn more, see How to use new Amazon GuardDuty EKS Protection findings in the AWS Security Blog.

Now Available
You can now use Amazon Detective for EKS protection in all Regions where Amazon Detective is available. This feature is priced based on the volume of audit logs processed and analyzed by Detective.

Detective provides a free 30-day trial to all customers that enable EKS coverage, allowing customers to ensure that Detective’s capabilities meet security needs and to get an estimate of the service’s monthly cost before committing to paid usage. To learn more, see the Detective pricing page.

For technical documentation, visit the Amazon Detective User Guide. Please send feedback to AWS re:Post for Amazon Detective or through your usual AWS support contacts.

Learn all the details about Amazon Detective for EKS protection and get started today.

Channy

Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators

Post Syndicated from Matthew Tan original https://aws.amazon.com/blogs/big-data/stream-amazon-emr-on-eks-logs-to-third-party-providers-like-splunk-amazon-opensearch-service-or-other-log-aggregators/

Spark jobs running on Amazon EMR on EKS generate logs that are very useful in identifying issues with Spark processes and also as a way to see Spark outputs. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. You also have flexibility to push logs into an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs. In each method, these logs are linked to the specific job in question. The common practice of log management in DevOps culture is to centralize logging through the forwarding of logs to an enterprise log aggregation system like Splunk or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). This enables you to see all the applicable log data in one place. You can identify key trends, anomalies, and correlated events, and troubleshoot problems faster and notify the appropriate people in a timely fashion.

EMR on EKS Spark logs are generated by Spark and can be accessed via the Kubernetes API and kubectl CLI. Therefore, although it’s possible to install log forwarding agents in the Amazon Elastic Kubernetes Service (Amazon EKS) cluster to forward all Kubernetes logs, which include Spark logs, this can become quite expensive at scale because you get information that may not be important for Spark users about Kubernetes. In addition, from a security point of view, the EKS cluster logs and access to kubectl may not be available to the Spark user.

To solve this problem, this post proposes using pod templates to create a sidecar container alongside the Spark job pods. The sidecar containers are able to access the logs contained in the Spark pods and forward these logs to the log aggregator. This approach allows the logs to be managed separately from the EKS cluster and uses a small amount of resources because the sidecar container is only launched during the lifetime of the Spark job.

Implementing Fluent Bit as a sidecar container

Fluent Bit is a lightweight, highly scalable, and high-speed logging and metrics processor and log forwarder. It collects event data from any source, enriches that data, and sends it to any destination. Its lightweight and efficient design coupled with its many features makes it very attractive to those working in the cloud and in containerized environments. It has been deployed extensively and trusted by many, even in large and complex environments. Fluent Bit has zero dependencies and requires only 650 KB in memory to operate, as compared to FluentD, which needs about 40 MB in memory. Therefore, it’s an ideal option as a log forwarder to forward logs generated from Spark jobs.

When you submit a job to EMR on EKS, there are at least two Spark containers: the Spark driver and the Spark executor. The number of Spark executor pods depends on your job submission configuration. If you indicate more than one spark.executor.instances, you get the corresponding number of Spark executor pods. What we want to do here is run Fluent Bit as sidecar containers with the Spark driver and executor pods. Diagrammatically, it looks like the following figure. The Fluent Bit sidecar container reads the indicated logs in the Spark driver and executor pods, and forwards these logs to the target log aggregator directly.

Architecture of Fluent Bit sidecar

Pod templates in EMR on EKS

A Kubernetes pod is a group of one or more containers with shared storage, network resources, and a specification for how to run the containers. Pod templates are specifications for creating pods. It’s part of the desired state of the workload resources used to run the application. Pod template files can define the driver or executor pod configurations that aren’t supported in standard Spark configuration. That being said, Spark is opinionated about certain pod configurations and some values in the pod template are always overwritten by Spark. Using a pod template only allows Spark to start with a template pod and not an empty pod during the pod building process. Pod templates are enabled in EMR on EKS when you configure the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile. Spark downloads these pod templates to construct the driver and executor pods.

Forward logs generated by Spark jobs in EMR on EKS

A log aggregating system like Amazon OpenSearch Service or Splunk should always be available that can accept the logs forwarded by the Fluent Bit sidecar containers. If not, we provide the following scripts in this post to help you launch a log aggregating system like Amazon OpenSearch Service or Splunk installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

We use several services to create and configure EMR on EKS. We use an AWS Cloud9 workspace to run all the scripts and to configure the EKS cluster. To prepare to run a job script that requires certain Python libraries absent from the generic EMR images, we use Amazon Elastic Container Registry (Amazon ECR) to store the customized EMR container image.

Create an AWS Cloud9 workspace

The first step is to launch and configure the AWS Cloud9 workspace by following the instructions in Create a Workspace in the EKS Workshop. After you create the workspace, we create AWS Identity and Access Management (IAM) resources. Create an IAM role for the workspace, attach the role to the workspace, and update the workspace IAM settings.

Prepare the AWS Cloud9 workspace

Clone the following GitHub repository and run the following script to prepare the AWS Cloud9 workspace to be ready to install and configure Amazon EKS and EMR on EKS. The shell script prepare_cloud9.sh installs all the necessary components for the AWS Cloud9 workspace to build and manage the EKS cluster. These include the kubectl command line tool, eksctl CLI tool, jq, and to update the AWS Command Line Interface (AWS CLI).

$ sudo yum -y install git
$ cd ~ 
$ git clone https://github.com/aws-samples/aws-emr-eks-log-forwarding.git
$ cd aws-emr-eks-log-forwarding
$ cd emreks
$ bash prepare_cloud9.sh

All the necessary scripts and configuration to run this solution are found in the cloned GitHub repository.

Create a key pair

As part of this particular deployment, you need an EC2 key pair to create an EKS cluster. If you already have an existing EC2 key pair, you may use that key pair. Otherwise, you can create a key pair.

Install Amazon EKS and EMR on EKS

After you configure the AWS Cloud9 workspace, in the same folder (emreks), run the following deployment script:

$ bash deploy_eks_cluster_bash.sh 
Deployment Script -- EMR on EKS
-----------------------------------------------

Please provide the following information before deployment:
1. Region (If your Cloud9 desktop is in the same region as your deployment, you can leave this blank)
2. Account ID (If your Cloud9 desktop is running in the same Account ID as where your deployment will be, you can leave this blank)
3. Name of the S3 bucket to be created for the EMR S3 storage location
Region: [xx-xxxx-x]: < Press enter for default or enter region > 
Account ID [xxxxxxxxxxxx]: < Press enter for default or enter account # > 
EC2 Public Key name: < Provide your key pair name here >
Default S3 bucket name for EMR on EKS (do not add s3://): < bucket name >
Bucket created: XXXXXXXXXXX ...
Deploying CloudFormation stack with the following parameters...
Region: xx-xxxx-x | Account ID: xxxxxxxxxxxx | S3 Bucket: XXXXXXXXXXX

...

EKS Cluster and Virtual EMR Cluster have been installed.

The last line indicates that installation was successful.

Log aggregation options

There are several log aggregation and management tools on the market. This post suggests two of the more popular ones in the industry: Splunk and Amazon OpenSearch Service.

Option 1: Install Splunk Enterprise

To manually install Splunk on an EC2 instance, complete the following steps:

  1. Launch an EC2 instance.
  2. Install Splunk.
  3. Configure the EC2 instance security group to permit access to ports 22, 8000, and 8088.

This post, however, provides an automated way to install Spunk on an EC2 instance:

  1. Download the RPM install file and upload it to an accessible Amazon S3 location.
  2. Upload the following YAML script into AWS CloudFormation.
  3. Provide the necessary parameters, as shown in the screenshots below.
  4. Choose Next and complete the steps to create your stack.

Splunk CloudFormation screen - 1

Splunk CloudFormation screen - 2

Splunk CloudFormation screen - 3

Alternatively, run an AWS CLI script like the following:

aws cloudformation create-stack \
--stack-name "splunk" \
--template-body file://splunk_cf.yaml \
--parameters ParameterKey=KeyName,ParameterValue="< Name of EC2 Key Pair >" \
  ParameterKey=InstanceType,ParameterValue="t3.medium" \
  ParameterKey=LatestAmiId,ParameterValue="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2" \
  ParameterKey=VPCID,ParameterValue="vpc-XXXXXXXXXXX" \
  ParameterKey=PublicSubnet0,ParameterValue="subnet-XXXXXXXXX" \
  ParameterKey=SSHLocation,ParameterValue="< CIDR Range for SSH access >" \
  ParameterKey=VpcCidrRange,ParameterValue="172.20.0.0/16" \
  ParameterKey=RootVolumeSize,ParameterValue="100" \
  ParameterKey=S3BucketName,ParameterValue="< S3 Bucket Name >" \
  ParameterKey=S3Prefix,ParameterValue="splunk/splunk-8.2.5-77015bc7a462-linux-2.6-x86_64.rpm" \
  ParameterKey=S3DownloadLocation,ParameterValue="/tmp" \
--region < region > \
--capabilities CAPABILITY_IAM
  1. After you build the stack, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the internal and external DNS for the Splunk instance.

You use these later to configure the Splunk instance and log forwarding.

Splunk CloudFormation output screen

  1. To configure Splunk, go to the Resources tab for the CloudFormation stack and locate the physical ID of EC2Instance.
  2. Choose that link to go to the specific EC2 instance.
  3. Select the instance and choose Connect.

Connect to Splunk Instance

  1. On the Session Manager tab, choose Connect.

Connect to Instance

You’re redirected to the instance’s shell.

  1. Install and configure Splunk as follows:
$ sudo /opt/splunk/bin/splunk start --accept-license
…
Please enter an administrator username: admin
Password must contain at least:
   * 8 total printable ASCII character(s).
Please enter a new password: 
Please confirm new password:
…
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available......... Done
The Splunk web interface is at http://ip-xx-xxx-xxx-x.us-east-2.compute.internal:8000
  1. Enter the Splunk site using the SplunkPublicDns value from the stack outputs (for example, http://ec2-xx-xxx-xxx-x.us-east-2.compute.amazonaws.com:8000). Note the port number of 8000.
  2. Log in with the user name and password you provided.

Splunk Login

Configure HTTP Event Collector

To configure Splunk to be able to receive logs from Fluent Bit, configure the HTTP Event Collector data input:

  1. Go to Settings and choose Data input.
  2. Choose HTTP Event Collector.
  3. Choose Global Settings.
  4. Select Enabled, keep port number 8088, then choose Save.
  5. Choose New Token.
  6. For Name, enter a name (for example, emreksdemo).
  7. Choose Next.
  8. For Available item(s) for Indexes, add at least the main index.
  9. Choose Review and then Submit.
  10. In the list of HTTP Event Collect tokens, copy the token value for emreksdemo.

You use it when configuring the Fluent Bit output.

splunk-http-collector-list

Option 2: Set up Amazon OpenSearch Service

Your other log aggregation option is to use Amazon OpenSearch Service.

Provision an OpenSearch Service domain

Provisioning an OpenSearch Service domain is very straightforward. In this post, we provide a simple script and configuration to provision a basic domain. To do it yourself, refer to Creating and managing Amazon OpenSearch Service domains.

Before you start, get the ARN of the IAM role that you use to run the Spark jobs. If you created the EKS cluster with the provided script, go to the CloudFormation stack emr-eks-iam-stack. On the Outputs tab, locate the IAMRoleArn output and copy this ARN. We also modify the IAM role later on, after we create the OpenSearch Service domain.

iam_role_emr_eks_job

If you’re using the provided opensearch.sh installer, before you run it, modify the file.

From the root folder of the GitHub repository, cd to opensearch and modify opensearch.sh (you can also use your preferred editor):

[../aws-emr-eks-log-forwarding] $ cd opensearch
[../aws-emr-eks-log-forwarding/opensearch] $ vi opensearch.sh

Configure opensearch.sh to fit your environment, for example:

# name of our Amazon OpenSearch cluster
export ES_DOMAIN_NAME="emreksdemo"

# Elasticsearch version
export ES_VERSION="OpenSearch_1.0"

# Instance Type
export INSTANCE_TYPE="t3.small.search"

# OpenSearch Dashboards admin user
export ES_DOMAIN_USER="emreks"

# OpenSearch Dashboards admin password
export ES_DOMAIN_PASSWORD='< ADD YOUR PASSWORD >'

# Region
export REGION='us-east-1'

Run the script:

[../aws-emr-eks-log-forwarding/opensearch] $ bash opensearch.sh

Configure your OpenSearch Service domain

After you set up your OpenSearch service domain and it’s active, make the following configuration changes to allow logs to be ingested into Amazon OpenSearch Service:

  1. On the Amazon OpenSearch Service console, on the Domains page, choose your domain.

Opensearch Domain Console

  1. On the Security configuration tab, choose Edit.

Opensearch Security Configuration

  1. For Access Policy, select Only use fine-grained access control.
  2. Choose Save changes.

The access policy should look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:xx-xxxx-x:xxxxxxxxxxxx:domain/emreksdemo/*"
    }
  ]
}
  1. When the domain is active again, copy the domain ARN.

We use it to configure the Amazon EMR job IAM role we mentioned earlier.

  1. Choose the link for OpenSearch Dashboards URL to enter Amazon OpenSearch Service Dashboards.

Opensearch Main Console

  1. In Amazon OpenSearch Service Dashboards, use the user name and password that you configured earlier in the opensearch.sh file.
  2. Choose the options icon and choose Security under OpenSearch Plugins.

opensearch menu

  1. Choose Roles.
  2. Choose Create role.

opensearch-create-role-button

  1. Enter the new role’s name, cluster permissions, and index permissions. For this post, name the role fluentbit_role and give cluster permissions to the following:
    1. indices:admin/create
    2. indices:admin/template/get
    3. indices:admin/template/put
    4. cluster:admin/ingest/pipeline/get
    5. cluster:admin/ingest/pipeline/put
    6. indices:data/write/bulk
    7. indices:data/write/bulk*
    8. create_index

opensearch-create-role-button

  1. In the Index permissions section, give write permission to the index fluent-*.
  2. On the Mapped users tab, choose Manage mapping.
  3. For Backend roles, enter the Amazon EMR job execution IAM role ARN to be mapped to the fluentbit_role role.
  4. Choose Map.

opensearch-map-backend

  1. To complete the security configuration, go to the IAM console and add the following inline policy to the EMR on EKS IAM role entered in the backend role. Replace the resource ARN with the ARN of your OpenSearch Service domain.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-2:XXXXXXXXXXXX:domain/emreksdemo"
        }
    ]
}

The configuration of Amazon OpenSearch Service is complete and ready for ingestion of logs from the Fluent Bit sidecar container.

Configure the Fluent Bit sidecar container

We need to write two configuration files to configure a Fluent Bit sidecar container. The first is the Fluent Bit configuration itself, and the second is the Fluent Bit sidecar subprocess configuration that makes sure that the sidecar operation ends when the main Spark job ends. The suggested configuration provided in this post is for Splunk and Amazon OpenSearch Service. However, you can configure Fluent Bit with other third-party log aggregators. For more information about configuring outputs, refer to Outputs.

Fluent Bit ConfigMap

The following sample ConfigMap is from the GitHub repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-sidecar-config
  namespace: sparkns
  labels:
    app.kubernetes.io/name: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-application.conf
    @INCLUDE input-event-logs.conf
    @INCLUDE output-splunk.conf
    @INCLUDE output-opensearch.conf

  input-application.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/user/*/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  input-event-logs.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/apps/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  output-splunk.conf: |
    [OUTPUT]
        Name            splunk
        Match           *
        Host            < INTERNAL DNS of Splunk EC2 Instance >
        Port            8088
        TLS             On
        TLS.Verify      Off
        Splunk_Token    < Token as provided by the HTTP Event Collector in Splunk >

  output-opensearch.conf: |
[OUTPUT]
        Name            es
        Match           *
        Host            < HOST NAME of the OpenSearch Domain | No HTTP protocol >
        Port            443
        TLS             On
        AWS_Auth        On
        AWS_Region      < Region >
        Retry_Limit     6

In your AWS Cloud9 workspace, modify the ConfigMap accordingly. Provide the values for the placeholder text by running the following commands to enter the VI editor mode. If preferred, you can use PICO or a different editor:

[../aws-emr-eks-log-forwarding] $  cd kube/configmaps
[../aws-emr-eks-log-forwarding/kube/configmaps] $ vi emr_configmap.yaml

# Modify the emr_configmap.yaml as above
# Save the file once it is completed

Complete either the Splunk output configuration or the Amazon OpenSearch Service output configuration.

Next, run the following commands to add the two Fluent Bit sidecar and subprocess ConfigMaps:

[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_configmap.yaml
[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_entrypoint_configmap.yaml

You don’t need to modify the second ConfigMap because it’s the subprocess script that runs inside the Fluent Bit sidecar container. To verify that the ConfigMaps have been installed, run the following command:

$ kubectl get cm -n sparkns
NAME                         DATA   AGE
fluent-bit-sidecar-config    6      15s
fluent-bit-sidecar-wrapper   2      15s

Set up a customized EMR container image

To run the sample PySpark script, the script requires the Boto3 package that’s not available in the standard EMR container images. If you want to run your own script and it doesn’t require a customized EMR container image, you may skip this step.

Run the following script:

[../aws-emr-eks-log-forwarding] $ cd ecr
[../aws-emr-eks-log-forwarding/ecr] $ bash create_custom_image.sh <region> <EMR container image account number>

The EMR container image account number can be obtained from How to select a base image URI. This documentation also provides the appropriate ECR registry account number. For example, the registry account number for us-east-1 is 755674844232.

To verify the repository and image, run the following commands:

$ aws ecr describe-repositories --region < region > | grep emr-6.5.0-custom
            "repositoryArn": "arn:aws:ecr:xx-xxxx-x:xxxxxxxxxxxx:repository/emr-6.5.0-custom",
            "repositoryName": "emr-6.5.0-custom",
            "repositoryUri": " xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/emr-6.5.0-custom",

$ aws ecr describe-images --region < region > --repository-name emr-6.5.0-custom | jq .imageDetails[0].imageTags
[
  "latest"
]

Prepare pod templates for Spark jobs

Upload the two Spark driver and Spark executor pod templates to an S3 bucket and prefix. The two pod templates can be found in the GitHub repository:

  • emr_driver_template.yaml – Spark driver pod template
  • emr_executor_template.yaml – Spark executor pod template

The pod templates provided here should not be modified.

Submitting a Spark job with a Fluent Bit sidecar container

This Spark job example uses the bostonproperty.py script. To use this script, upload it to an accessible S3 bucket and prefix and complete the preceding steps to use an EMR customized container image. You also need to upload the CSV file from the GitHub repo, which you need to download and unzip. Upload the unzipped file to the following location: s3://<your chosen bucket>/<first level folder>/data/boston-property-assessment-2021.csv.

The following commands assume that you launched your EKS cluster and virtual EMR cluster with the parameters indicated in the GitHub repo.

Variable Where to Find the Information or the Value Required
EMR_EKS_CLUSTER_ID Amazon EMR console virtual cluster page
EMR_EKS_EXECUTION_ARN IAM role ARN
EMR_RELEASE emr-6.5.0-latest
S3_BUCKET The bucket you create in Amazon S3
S3_FOLDER The preferred prefix you want to use in Amazon S3
CONTAINER_IMAGE The URI in Amazon ECR where your container image is
SCRIPT_NAME emreksdemo-script or a name you prefer

Alternatively, use the provided script to run the job. Change the directory to the scripts folder in emreks and run the script as follows:

[../aws-emr-eks-log-forwarding] cd emreks/scripts
[../aws-emr-eks-log-forwarding/emreks/scripts] bash run_emr_script.sh < S3 bucket name > < ECR container image > < script path>

Example: bash run_emr_script.sh emreksdemo-123456 12345678990.dkr.ecr.us-east-2.amazonaws.com/emr-6.5.0-custom s3://emreksdemo-123456/scripts/scriptname.py

After you submit the Spark job successfully, you get a return JSON response like the following:

{
    "id": "0000000305e814v0bpt",
    "name": "emreksdemo-job",
    "arn": "arn:aws:emr-containers:xx-xxxx-x:XXXXXXXXXXX:/virtualclusters/upobc00wgff5XXXXXXXXXXX/jobruns/0000000305e814v0bpt",
    "virtualClusterId": "upobc00wgff5XXXXXXXXXXX"
}

What happens when you submit a Spark job with a sidecar container

After you submit a Spark job, you can see what is happening by viewing the pods that are generated and the corresponding logs. First, using kubectl, get a list of the pods generated in the namespace where the EMR virtual cluster runs. In this case, it’s sparkns. The first pod in the following code is the job controller for this particular Spark job. The second pod is the Spark executor; there can be more than one pod depending on how many executor instances are asked for in the Spark job setting—we asked for one here. The third pod is the Spark driver pod.

$ kubectl get pods -n sparkns
NAME                                        READY   STATUS    RESTARTS   AGE
0000000305e814v0bpt-hvwjs                   3/3     Running   0          25s
emreksdemo-script-1247bf80ae40b089-exec-1   0/3     Pending   0          0s
spark-0000000305e814v0bpt-driver            3/3     Running   0          11s

To view what happens in the sidecar container, follow the logs in the Spark driver pod and refer to the sidecar. The sidecar container launches with the Spark pods and persists until the file /var/log/fluentd/main-container-terminated is no longer available. For more information about how Amazon EMR controls the pod lifecycle, refer to Using pod templates. The subprocess script ties the sidecar container to this same lifecycle and deletes itself upon the EMR controlled pod lifecycle process.

$ kubectl logs spark-0000000305e814v0bpt-driver -n sparkns  -c custom-side-car-container --follow=true

Waiting for file /var/log/fluentd/main-container-terminated to appear...
AWS for Fluent Bit Container Image Version 2.24.0Start wait: 1652190909
Elapsed Wait: 0
Not found count: 0
Waiting...
Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/05/10 13:55:09] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=11
[2022/05/10 13:55:09] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/10 13:55:09] [ info] [cmetrics] version=0.3.1
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #0 started
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #1 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #0 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #1 started
[2022/05/10 13:55:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/10 13:55:09] [ info] [sp] stream processor started
Waiting for file /var/log/fluentd/main-container-terminated to appear...
Last heartbeat: 1652190914
Elapsed Time since after heartbeat: 0
Found count: 0
list files:
-rw-r--r-- 1 saslauth 65534 0 May 10 13:55 /var/log/fluentd/main-container-terminated
Last heartbeat: 1652190918

…

[2022/05/10 13:56:09] [ info] [input:tail:tail.0] inotify_fs_add(): inode=58834691 watch_fd=6 name=/var/log/spark/user/spark-0000000305e814v0bpt-driver/stdout-s3-container-log-in-tail.pos
[2022/05/10 13:56:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=54644346 watch_fd=1 name=/var/log/spark/apps/spark-0000000305e814v0bpt
Outside of loop, main-container-terminated file no longer exists
ls: cannot access /var/log/fluentd/main-container-terminated: No such file or directory
The file /var/log/fluentd/main-container-terminated doesn't exist anymore;
TERMINATED PROCESS
Fluent-Bit pid: 11
Killing process after sleeping for 15 seconds
root        11     8  0 13:55 ?        00:00:00 /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
root       114     7  0 13:56 ?        00:00:00 grep fluent
Killing process 11
[2022/05/10 13:56:24] [engine] caught signal (SIGTERM)
[2022/05/10 13:56:24] [ info] [input] pausing tail.0
[2022/05/10 13:56:24] [ info] [input] pausing tail.1
[2022/05/10 13:56:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/05/10 13:56:25] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/10 13:56:25] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=54644346 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917120 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917121 watch_fd=2
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834690 watch_fd=3
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834692 watch_fd=4
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834689 watch_fd=5
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834691 watch_fd=6
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopped

View the forwarded logs in Splunk or Amazon OpenSearch Service

To view the forwarded logs, do a search in Splunk or on the Amazon OpenSearch Service console. If you’re using a shared log aggregator, you may have to filter the results. In this configuration, the logs tailed by Fluent Bit are in the /var/log/spark/*. The following screenshots show the logs generated specifically by the Kubernetes Spark driver stdout that were forwarded to the log aggregators. You can compare the results with the logs provided using kubectl:

kubectl logs < Spark Driver Pod > -n < namespace > -c spark-kubernetes-driver --follow=true

…
root
 |-- PID: string (nullable = true)
 |-- CM_ID: string (nullable = true)
 |-- GIS_ID: string (nullable = true)
 |-- ST_NUM: string (nullable = true)
 |-- ST_NAME: string (nullable = true)
 |-- UNIT_NUM: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- ZIPCODE: string (nullable = true)
 |-- BLDG_SEQ: string (nullable = true)
 |-- NUM_BLDGS: string (nullable = true)
 |-- LUC: string (nullable = true)
…

|02108|RETAIL CONDO           |361450.0            |63800.0        |5977500.0      |
|02108|RETAIL STORE DETACH    |2295050.0           |988200.0       |3601900.0      |
|02108|SCHOOL                 |1.20858E7           |1.20858E7      |1.20858E7      |
|02108|SINGLE FAM DWELLING    |5267156.561085973   |1153400.0      |1.57334E7      |
+-----+-----------------------+--------------------+---------------+---------------+
only showing top 50 rows

The following screenshot shows the Splunk logs.

splunk-result-driver-stdout

The following screenshots show the Amazon OpenSearch Service logs.

opensearch-result-driver-stdout

Optional: Include a buffer between Fluent Bit and the log aggregators

If you expect to generate a lot of logs because of high concurrent Spark jobs creating multiple individual connects that may overwhelm your Amazon OpenSearch Service or Splunk log aggregation clusters, consider employing a buffer between the Fluent Bit sidecars and your log aggregator. One option is to use Amazon Kinesis Data Firehose as the buffering service.

Kinesis Data Firehose has built-in delivery to both Amazon OpenSearch Service and Splunk. If using Amazon OpenSearch Service, refer to Loading streaming data from Amazon Kinesis Data Firehose. If using Splunk, refer to Configure Amazon Kinesis Firehose to send data to the Splunk platform and Choose Splunk for Your Destination.

To configure Fluent Bit to Kinesis Data Firehose, add the following to your ConfigMap output. Refer to the GitHub ConfigMap example and add the @INCLUDE under the [SERVICE] section:

     @INCLUDE output-kinesisfirehose.conf
…

  output-kinesisfirehose.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           *
        region          < region >
        delivery_stream < Kinesis Firehose Stream Name >

Optional: Use data streams for Amazon OpenSearch Service

If you’re in a scenario where the number of documents grows rapidly and you don’t need to update older documents, you need to manage the OpenSearch Service cluster. This involves steps like creating a rollover index alias, defining a write index, and defining common mappings and settings for the backing indexes. Consider using data streams to simplify this process and enforce a setup that best suits your time series data. For instructions on implementing data streams, refer to Data streams.

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stacks that were created with this script. This removes the EKS cluster. However, before you do that, remove the EMR virtual cluster first by running the delete-virtual-cluster command. Then delete all the CloudFormation stacks generated by the deployment script.

If you launched an OpenSearch Service domain, you can delete the domain from the OpenSearch Service domain. If you used the script to launch a Splunk instance, you can go to the CloudFormation stack that launched the Splunk instance and delete the CloudFormation stack. This removes remove the Splunk instance and associated resources.

You can also use the following scripts to clean up resources:

Conclusion

EMR on EKS facilitates running Spark jobs on Kubernetes to achieve very fast and cost-efficient Spark operations. This is made possible through scheduling transient pods that are launched and then deleted the jobs are complete. To log all these operations in the same lifecycle of the Spark jobs, this post provides a solution using pod templates and Fluent Bit that is lightweight and powerful. This approach offers a decoupled way of log forwarding based at the Spark application level and not at the Kubernetes cluster level. It also avoids routing through intermediaries like CloudWatch, reducing cost and complexity. In this way, you can address security concerns and DevOps and system administration ease of management while providing Spark users with insights into their Spark jobs in a cost-efficient and functional way.

If you have questions or suggestions, please leave a comment.


About the Author

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing solutions with AWS Analytics services on their analytics workloads.                       

AWS Week in Review – May 9, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-9-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Another week starts, and here’s a collection of the most significant AWS news from the previous seven days. This week is also the one-year anniversary of CloudFront Functions. It’s exciting to see what customers have built during this first year.

Last Week’s Launches
Here are some launches that caught my attention last week:

Amazon RDS supports PostgreSQL 14 with three levels of cascaded read replicas – That’s 5 replicas per instance, supporting a maximum of 155 read replicas per source instance with up to 30X more read capacity. You can now build a more robust disaster recovery architecture with the capability to create Single-AZ or Multi-AZ cascaded read replica DB instances in same or cross Region.

Amazon RDS on AWS Outposts storage auto scalingAWS Outposts extends AWS infrastructure, services, APIs, and tools to virtually any datacenter. With Amazon RDS on AWS Outposts, you can deploy managed DB instances in your on-premises environments. Now, you can turn on storage auto scaling when you create or modify DB instances by selecting a checkbox and specifying the maximum database storage size.

Amazon CodeGuru Reviewer suppression of files and folders in code reviews – With CodeGuru Reviewer, you can use automated reasoning and machine learning to detect potential code defects that are difficult to find and get suggestions for improvements. Now, you can prevent CodeGuru Reviewer from generating unwanted findings on certain files like test files, autogenerated files, or files that have not been recently updated.

Amazon EKS console now supports all standard Kubernetes resources to simplify cluster management – To make it easy to visualize and troubleshoot your applications, you can now use the console to see all standard Kubernetes API resource types (such as service resources, configuration and storage resources, authorization resources, policy resources, and more) running on your Amazon EKS cluster. More info in the blog post Introducing Kubernetes Resource View in Amazon EKS console.

AWS AppConfig feature flag Lambda Extension support for Arm/Graviton2 processors – Using AWS AppConfig, you can create feature flags or other dynamic configuration and safely deploy updates. The AWS AppConfig Lambda Extension allows you to access this feature flag and dynamic configuration data in your Lambda functions. You can now use the AWS AppConfig Lambda Extension from Lambda functions using the Arm/Graviton2 architecture.

AWS Serverless Application Model (SAM) CLI now supports enabling AWS X-Ray tracing – With the AWS SAM CLI you can initialize, build, package, test on local and cloud, and deploy serverless applications. With AWS X-Ray, you have an end-to-end view of requests as they travel through your application, making them easier to monitor and troubleshoot. Now, you can enable tracing by simply adding a flag to the sam init command.

Amazon Kinesis Video Streams image extraction – With Amazon Kinesis Video Streams you can capture, process, and store media streams. Now, you can also request images via API calls or configure automatic image generation based on metadata tags in ingested video. For example, you can use this to generate thumbnails for playback applications or to have more data for your machine learning pipelines.

AWS GameKit supports Android, iOS, and MacOS games developed with Unreal Engine – With AWS GameKit, you can build AWS-powered game features directly from the Unreal Editor with just a few clicks. Now, the AWS GameKit plugin for Unreal Engine supports building games for the Win64, MacOS, Android, and iOS platforms.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates you might have missed:

🎂 One-year anniversary of CloudFront Functions – I can’t believe it’s been one year since we launched CloudFront Functions. Now, we have tens of thousands of developers actively using CloudFront Functions, with trillions of invocations per month. You can use CloudFront Functions for HTTP header manipulation, URL rewrites and redirects, cache key manipulations/normalization, access authorization, and more. See some examples in this repo. Let’s see what customers built with CloudFront Functions:

  • CloudFront Functions enables Formula 1 to authenticate users with more than 500K requests per second. The solution is using CloudFront Functions to evaluate if users have access to view the race livestream by validating a token in the request.
  • Cloudinary is a media management company that helps its customers deliver content such as videos and images to users worldwide. For them, Lambda@Edge remains an excellent solution for applications that require heavy compute operations, but lightweight operations that require high scalability can now be run using CloudFront Functions. With CloudFront Functions, Cloudinary and its customers are seeing significantly increased performance. For example, one of Cloudinary’s customers began using CloudFront Functions, and in about two weeks it was seeing 20–30 percent better response times. The customer also estimates that they will see 75 percent cost savings.
  • Based in Japan, DigitalCube is a web hosting provider for WordPress websites. Previously, DigitalCube spent several hours completing each of its update deployments. Now, they can deploy updates across thousands of distributions quickly. Using CloudFront Functions, they’ve reduced update deployment times from 4 hours to 2 minutes. In addition, faster updates and less maintenance work result in better quality throughout DigitalCube’s offerings. It’s now easier for them to test on AWS because they can run tests that affect thousands of distributions without having to scale internally or introduce downtime.
  • Amazon.com is using CloudFront Functions to change the way it delivers static assets to customers globally. CloudFront Functions allows them to experiment with hyper-personalization at scale and optimal latency performance. They have been working closely with the CloudFront team during product development, and they like how it is easy to create, test, and deploy custom code and implement business logic at the edge.

AWS open-source news and updates – A newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more. Read the latest edition here.

Reduce log-storage costs by automating retention settings in Amazon CloudWatch – By default, CloudWatch Logs stores your log data indefinitely. This blog post shows how you can reduce log-storage costs by establishing a log-retention policy and applying it across all of your log groups.

Observability for AWS App Runner VPC networking – With X-Ray support in App runner, you can quickly deploy web applications and APIs at any scale and take advantage of adding tracing without having to manage sidecars or agents. Here’s an example of how you can instrument your applications with the AWS Distro for OpenTelemetry (ADOT).

Upcoming AWS Events
It’s AWS Summits season and here are some virtual and in-person events that might be close to you:

You can now register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

How to use new Amazon GuardDuty EKS Protection findings

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-use-new-amazon-guardduty-eks-protection-findings/

If you run container workloads that use Amazon Elastic Kubernetes Service (Amazon EKS), Amazon GuardDuty now has added support that will help you better protect these workloads from potential threats. Amazon GuardDuty EKS Protection can help detect threats related to user and application activity that is captured in Kubernetes audit logs. Newly-added Kubernetes threat detections include Amazon EKS clusters that are accessed by known malicious actors or from Tor nodes, API operations performed by anonymous users that might indicate a misconfiguration, and misconfigurations that can result in unauthorized access to Amazon EKS clusters. By using machine learning (ML) models, GuardDuty can identify patterns consistent with privilege-escalation techniques, such as a suspicious launch of a container with root-level access to the underlying Amazon Elastic Compute Cloud (Amazon EC2) host. In this post, we give you an overview of the new GuardDuty EKS Protection feature; show you examples of new finding details; and help you understand, operationalize, and respond to these new findings.

Amazon GuardDuty is an automated threat detection service that continuously monitors for suspicious activity and potentially unauthorized behavior to help protect your AWS accounts, Amazon EC2 workloads, data stored in Amazon Simple Storage Service (S3), and now Amazon EKS workloads.

If you are already a GuardDuty customer, you can enable GuardDuty EKS Protection and efficiently navigate the console to begin to use this feature. Your delegated administrator accounts can enable this for existing member accounts and determine if new AWS accounts in an organization will be automatically enrolled. If you are new to GuardDuty, the EKS Protection feature is included as part of the service’s 30-day trial period. As part of the 30-day trial period, you can take full advantage of this new feature and gain insight into your Amazon EKS workloads.

Overview of GuardDuty EKS Protection

GuardDuty EKS Protection enables GuardDuty to detect suspicious activities and potential compromises of your EKS clusters by analyzing Kubernetes audit logs. Kubernetes audit logs provide a security relevant, chronological set of records documenting the sequence of events from individual users, administrators, or system components that have affected your cluster. Audit logs can help answer questions such as: What happened? When did it happen? Who initiated it? GuardDuty EKS Protection analyzes Kubernetes audit logs from your Amazon EKS clusters, both new and existing, without the need to configure EKS control plane logging in your environment. GuardDuty collects these Kubernetes audit logs in addition to AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, DNS queries, and Amazon S3 data events. GuardDuty EKS Protection performs analysis and looks for suspicious activity without the need for agents or adding resource constraints to your environment.

To detect threats using Kubernetes audit logs, GuardDuty uses a combination of machine learning, anomaly detection, and integrated threat intelligence to identify and prioritize potential threats. These findings primarily align to five root causes including compromised container images, configuration issues, Kubernetes user compromise, pod compromise, and node compromise. An example of a configuration issue is granting unnecessary privileges to the anonymous user by misconfiguring role-based access control (RBAC), which may inadvertently allow anonymous and unauthenticated calls to the Kubernetes API. A Kubernetes user compromise example could be a bad actor using stolen credentials to deploy containers with insecure settings, to use for a variety of activities from command and control to crypto-mining.

After a threat is detected, GuardDuty generates a security finding that includes container details such as the pod ID, container image ID, and tags associated with the Amazon EKS cluster. These finding details assist you with understanding the root cause which you can use to identify basic steps to remediate findings specific to EKS clusters. For example, your response to a finding or group of findings associated with a compromised Kubernetes user might begin with revoking access. For more information, see Remediating Kubernetes security issues discovered by GuardDuty in the Amazon GuardDuty User Guide.

Understanding new GuardDuty EKS Protection findings

As adversaries continue to become more sophisticated, it becomes even more important for you to align to a common framework to understand the tactics, techniques, and procedures (TTPs) behind an individual event. GuardDuty aligns findings using the MITRE ATT&CK framework, which is a globally-accessible knowledge base of adversary tactics and techniques based on real-world observations. GuardDuty findings have a specific finding format that helps you understand details of each finding. If you examine the ThreatPurpose portion in the GuardDuty EKS Protection finding types, you see there are finding types associated with various MITRE ATT&CK tactics, including CredentialAccess, DefenseEvasion, Discovery, Impact, Persistence, and PrivilegeEscalation. This can help you identify and understand the type of activity associated with a finding.

For example, look at two different finding types that seem similar: Impact:Kubernetes/SuccessfulAnonymousAccess and Discovery:Kubernetes/SuccessfulAnonymousAccess. You can see the difference is the ThreatPurpose at the beginning. They are both involved with successful anonymous access, and the difference is the intent of the activity associated with each finding. GuardDuty has determined based on the API or request URI invoked, that in this example, the activity seen on one finding aligns with the Impact tactic whereas the other finding aligns with the Discovery tactic

With GuardDuty EKS Protection, you now have an additional mechanism to gain insight into your EKS clusters across your accounts to look for suspicious activity. You can be alerted to Kubernetes-specific suspicious activity including: allowing administrator access to the default service account, exposing a Kubernetes dashboard, and launching a container with sensitive host paths. With this new feature, GuardDuty is also able to extend support for finding types that you might already be familiar with that also apply to Amazon EKS workloads. These finding types include calls to a Kubernetes cluster API from a Tor node, or calls to a Kubernetes cluster from a known malicious IP address, which can indicate that there are interactions with your Kubernetes clusters from sources that are commonly associated with malicious actors.

Responding to GuardDuty EKS Protection findings

This section gives an overview of three new GuardDuty EKS Protection findings, how to prevent them, and how to investigate and respond if they happen in your environment. The patterns shown can also act as a guide for how to prevent, investigate, and respond to other GuardDuty EKS Protection findings.

Discovery:Kubernetes/SuccessfulAnonymousAccess

Finding documentation: Discovery:Kubernetes/SuccessfulAnonymousAccess

Severity: Medium

Overview: This finding (as shown in Figure 1) informs you that an API operation was successfully invoked by the system:anonymous user. API calls made by system:anonymous are unauthenticated. The observed API is commonly associated with the discovery stage of an attack when an adversary is gathering information on your Kubernetes cluster. This activity indicates that anonymous or unauthenticated access is permitted on the API action reported in the finding, and may be permitted on other actions. These API calls are possible because of a misconfiguration of the system:anonymous user or system:unauthenticated group.

Preventative measures: AWS recommends that you disable unnecessary anonymous authentication. For instructions, see Review and revoke unnecessary anonymous access in the Amazon EKS Best Practices Guides. It is important to note that Kubernetes versions older than 1.14 granted system:discovery and system:basic-user roles to system:anonymous user by default, and these permissions remain in place after updating unless you explicitly change them.

How to remediate: To respond to this finding, it is important to first identify the details of the activity, for example what cluster is involved? Who is the owner of this cluster? This information will assist you with the remediation steps that follow, to review and revoke unnecessary permissions, and also help you determine a root cause.
 

Figure 1: GuardDuty Console showing Discovery:Kubernetes/SuccessfulAnonymousAccess finding type

Figure 1: GuardDuty Console showing Discovery:Kubernetes/SuccessfulAnonymousAccess finding type

Remediation step 1: Examine permissions

The first step is to examine the permissions that have been granted to the system:anonymous user, and determine what permissions are needed. To accomplish this, you need to first understand what permissions the system:anonymous user has. You can use an rbac-lookup tool to list the Kubernetes roles and cluster roles bound to users, service accounts, and groups. An alternative method can be found at this GitHub page.

./rbac-lookup | grep -P 'system:(anonymous)|(unauthenticated)'
system:anonymous               cluster-wide        ClusterRole/system:discovery
system:unauthenticated         cluster-wide        ClusterRole/system:discovery
system:unauthenticated         cluster-wide        ClusterRole/system:public-info-viewer

Remediation step 2: Disassociate groups

Next, you disassociate the system:unauthenticated group from system:discovery and system:basic-user ClusterRoles, which you do by editing the ClusterRoleBinding. Make sure to not remove system:unauthenticated from the system:public-info-viewer cluster role binding, because that will prevent the Network Load Balancer from performing health checks against the API server. For more information, see Network Load Balancer in the AWS Load Balancer Controller Guide and Identity and Access Management in Amazon EKS Best Practices Guide.

To disassociate the appropriate groups

  1. Run the command kubectl edit clusterrolebindings system:discovery. This command will open the current definition of system:discovery ClusterRoleBinding in your editor as shown in the sample .yaml configuration file:
    # Please edit the object below. Lines beginning with a '#' will be ignored,
    # and an empty file will abort the edit. If an error occurs while saving this # file will be reopened with the relevant failures.
    #
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      annotations:
        rbac.authorization.kubernetes.io/autoupdate: "true"
      creationTimestamp: "2021-06-17T20:50:49Z"
      labels:
        kubernetes.io/bootstrapping: rbac-defaults
      name: system:discovery
      resourceVersion: "24502985"
      selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/system%3Adiscovery
      uid: b7936268-5043-431a-a0e1-171a423abeb6
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:discovery
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:authenticated
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:unauthenticated
    

  2. Delete the entry for system:unauthenticated group, which is highlighted in bold in the subjects section.
  3. Repeat the same steps for system:basic-user ClusterRoleBinding.

If there is no reason that the system:anonymous user should be used in your environment, AWS recommends that you set up automatic response and remediation steps 1-3. For more information about the system:anonymous user, see Identity and Access Management in Amazon EKS Best Practices Guide.

PrivilegeEscalation:Kubernetes/PrivilegedContainer

Finding documentation: PrivilegeEscalation:Kubernetes/PrivilegedContainer

Severity: Medium

Overview: This finding (as shown in Figure 2) informs you that a privileged container was launched on your Kubernetes cluster using an image that has never before been used to launch privileged containers in your cluster. A privileged container has root level access on the host. Adversaries commonly launch privileged containers to perform privilege escalation to gain access and compromise the underlying host.

Preventative measures: Create and enforce policy-as-code (PAC) or Pod Security Standards (PSS) that require that pods be created as non-privileged. For more information, see Pod Security in the in Amazon EKS Best Practices Guide.

How to remediate: To respond to this finding, it is important to first identify the details of the activity and begin to answer questions that will help determine what happened. For example, what pod or workload was launched? Who was the user that launched this pod or workload? What cluster is involved?
 

Figure 2: GuardDuty Console showing PrivilegeEscalation:Kubernetes/PrivilegedContainer finding type

Figure 2: GuardDuty Console showing PrivilegeEscalation:Kubernetes/PrivilegedContainer finding type

If this privileged container launch is unexpected, the credentials of the user identity used to launch the container may be compromised. You should then focus on remediating and reviewing access to your cluster, and remediating the user. To do this, follow the procedure in the Remediating a compromised Kubernetes user section of this post. Next, you should identify compromised pods using the procedure in the Identifying and remediating compromised pods section of this post.

If you know what specific circumstances a privileged container can be deployed in your environment, for example only in a specific namespace, it is likely you can automatically remediate any GuardDuty EKS Protection finding associated with a privileged container in any other namespace. For more information about automated response activities, see Incident response and forensics in the Amazon EKS Best Practices Guide.

Persistence:Kubernetes/ContainerWithSensitiveMount

Finding documentation: Persistence:Kubernetes/ContainerWithSensitiveMount

Severity: Medium

Overview: This finding (as shown in Figure 3) informs you that a container was launched with a configuration that included a sensitive host path with write access in the volumeMounts section. This makes the sensitive host path accessible and writable from inside the container. This technique is commonly used by adversaries to gain access to the host’s filesystem.

Preventative Measures: Create and enforce policy-as-code (PAC) or Pod Security Standards (PSS) that use the allowedHostPaths control to only allow required host paths for use in volumes and preferably with read-only access. For more information, see Pod Security in the Amazon EKS Best Practices Guide.

How to remediate: To respond to this finding, it is important to first identify the details of the activity and begin to answer questions that will help determine what happened. For example, what pod or workload was launched? Who was the user that launched this pod or workload? What cluster is involved?
 

Figure 3: GuardDuty Console showing Persistence:Kubernetes/ContainerWithSensitiveMount finding type

Figure 3: GuardDuty Console showing Persistence:Kubernetes/ContainerWithSensitiveMount finding type

If the container launched is unexpected, the credentials of the user identity used to launch the container may be compromised. You should then focus on remediating and reviewing access to your cluster and remediating the user. To do this, follow the procedure in the next section, Remediating a compromised Kubernetes user.

If you can determine what containers should and should not be launched with writable hostPath mounts, then you can create automatic response and remediation for this use case. For example, you might want to revoke temporary security credentials assigned to the pod or worker node. For more information about revoking temporary security credentials and other response and remediation actions, see Incident response and forensics in the Amazon EKS Best Practices Guide.

Remediating a compromised Kubernetes user

If the compromised user has privileges to read secrets of one or more namespaces, rotate all of the affected secrets. For more information about the different types of secrets, see Secrets in the Kubernetes documentation. If the user has write privileges, AWS recommends auditing all changes made by the user in question. You can accomplish this by querying audit logs, if you have enabled EKS control plane logging on your EKS cluster. If you do not currently have logging enabled, follow the instructions for Enabling and disabling control plane logs in the Amazon EKS User Guide. Amazon EKS stores these control plane logs in Amazon CloudWatch Logs in your account. You can use CloudWatch Logs Insights to list all the mutating changes that the compromised user has made.

Remediation step 1: Identify the user

All actions performed on a Kubernetes cluster has an associated identity. GuardDuty EKS Protection findings report details of the Kubernetes user identity that the malicious actor may have compromised. You can find details of the user identity in the GuardDuty console under the Kubernetes user details section in the finding details, or in the finding JSON under the resources.eksClusterDetails.kubernetesDetails.kubernetesUserDetails section. These user details include username, UID, and groups that the user belongs to.

Remediation step 2: Identify changes

  1. Identify the changes made by the attacker associated with the compromised user identity by using the code example below to query CloudWatch Logs Insights, replacing the placeholders with your values.
    fields @timestamp, @message
    | filter user.username == <username> 
    | filter verb == "create" or verb == "update" or verb == "patch"
    | filter responseStatus.code >= 200 and responseStatus.code <= 300
    | filter @timestamp >= <approximate start time of the attack in epoch milliseconds>

    For example:

    fields @timestamp, @message
    | filter user.username == "kubernetes-admin" 
    | filter verb == "create" or verb == "update" or verb == "patch"
    | filter responseStatus.code >= 200 and responseStatus.code <= 300
    | filter @timestamp >= 1628279482312
    

  2. An EKS cluster can have multiple types of user identities, for example the kubernetes-admin user, aws-auth ConfigMap defined user, and so on. You will need to take actions appropriate for the user type to properly revoke its access. For more information, see Remediating compromised Kubernetes users in the Amazon GuardDuty User Guide.
  3. (Optional) If the compromised user identity had extensive privileges and you determine that the attacker made extensive changes to the cluster, you should consider isolating the pod, followed by creating a new clean cluster and redeploying your applications to the new cluster. For instructions to isolate and redeploy EKS pods, see Isolate the Pod by creating a Network Policy that denies all ingress and egress traffic to the pod in the Amazon EKS Best Practices Guide.

Identifying and remediating compromised pods

If a GuardDuty EKS Protection finding is caused by activity related to a specific pod, the value of the finding JSON resource.kubernetesDetails.kubernetesWorkloadDetails.type field is pod. The finding includes the name of the pod and namespace in the resource.kubernetesDetails.kubernetesWorkloadDetails.name and resource.kubernetesDetails.kubernetesWorkloadDetails.namespace fields, which uniquely identify the pod.

In other cases, such as when a service account or a Kubernetes workload name is in the resource.kubernetesDetails.kubernetesUserDetails, you can follow the instructions in the Sample incident response plan to identify compromised pods using different pieces of information available in the GuardDuty EKS Protection findings.

After you have identified compromised pods, to remediate, use the instructions to isolate the pods, rotate the credentials, and gather data for forensic analysis in Isolate the Pod by creating a Network Policy that denies all ingress and egress traffic to the pod in the Amazon EKS Best Practices Guide.

Conclusion

In this post, you learned the details of the new Amazon GuardDuty EKS Protection feature, and Kubernetes audit logs, and you saw examples for how to understand, operationalize, and respond to these new findings. You can enable this feature through the GuardDuty Console or APIs to start monitoring your Amazon EKS clusters today. If you have created Amazon EventBridge Rules to send findings from GuardDuty to a target, then ensure that your rules are configured to deliver these newly added findings.

AWS is committed to continually improving GuardDuty, to make it more efficient for you to operate securely in AWS. At AWS, customer feedback drives change, so we encourage you to continue providing feedback. If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Marshall Jones

Marshall is a worldwide security specialist solutions architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he helps enterprise customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

How SailPoint solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS

Post Syndicated from Richard Li original https://aws.amazon.com/blogs/big-data/how-sailpoint-solved-scaling-issues-by-migrating-legacy-big-data-applications-to-amazon-emr-on-amazon-eks/

This post is co-written with Richard Li from SailPoint.

SailPoint Technologies is an identity security company based in Austin, TX. Its software as a service (SaaS) solutions support identity governance operations in regulated industries such as healthcare, government, and higher education. SailPoint distinguishes multiple aspects of identity as individual identity security services, including cloud governance, SaaS management, access risk governance, file access management, password management, provisioning, recommendations, and separation of duties, as well as access certification, access insights, access modeling, and access requests.

In this post, we share how SailPoint updated its platform for big data operations, and solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS.

The challenge with the legacy data environment

SailPoint acquired a SaaS software platform that processes and analyzes identity, resource, and usage data from multiple cloud providers, and provides access insights, usage analysis, and access risk analysis. The original design criteria of the platform was focused on serving small to medium-sized companies. To quickly process these analytics insights, many of these processing workloads were done inside many microservices through streaming connections.

After acquisition, we set a goal to expand the platform’s capability to handle customers with large cloud footprints over multiple cloud providers, sometime over hundreds or even thousands of accounts producing large amount of cloud event data.

The legacy architecture has a simplistic approach for data processing, as shown in the following diagram. We were processing the vast majority of event data in-service and directly ingested into Amazon Relational Database Service (Amazon RDS), which we then merged with a graph database to form the final view..

We needed to convert this into a scalable process that could handle customers of any size. To address this challenge, we had to quickly introduce a big data processing engine in the platform.

How migrating to Amazon EMR on EKS helped solve this challenge

When evaluating the platform for our big data operations, several factors made Amazon EMR on EKS a top choice.

The amount of event data we receive at any given time is generally unpredictable. To stay cost-effective and efficient, we need a platform that is capable of scaling up automatically when the workload increases to reduce wait time, and can scale down when the capacity is no longer needed to save cost. Because our existing application workloads are already running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the cluster autoscaler enabled, running Amazon EMR on EKS on top of our existing EKS cluster fits this need.

Amazon EMR on EKS can safely coexist on an EKS cluster that is already hosting other workloads, be contained within a specified namespace, and have controlled access through use of Kubernetes role-based access control and AWS Identity and Access Management (IAM) roles for service accounts. Therefore, we didn’t have to build new infrastructures just for Amazon EMR. We simply linked up Amazon EMR on EKS with our existing EKS cluster running our application workloads. This reduced the amount of DevOps support needed, and significantly sped up our implementation and deployment timeline.

Unlike Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), because our EKS cluster spans over multiple Availability Zones, we can control Spark pods placements using Kubernetes’s pod scheduling and placement strategy to achieve higher fault tolerance.

With the ability to create and use custom images in Amazon EMR on EKS, we could also utilize our existing container-based application build and deployment pipeline for our Amazon EMR on EKS workload without any modifications. This also gave us additional benefit in reducing job startup time because we package all job scripts as well as all dependencies with the image, without having to fetch them at runtime.

We also utilize AWS Step Functions as our core workflow engine. The native integration of Amazon EMR on EKS with Step Functions is another bonus where we didn’t have to build custom code for job dispatch. Instead, we could utilize the Step Functions native integration to seamlessly integrate Amazon EMR jobs with our existing workflow, with very little effort.

In merely 5 months, we were able to go from design, to proof of concept, to rolling out phase 1 of the event analytics processing. This vastly improved our event analytics processing capability by extending horizontal scalability, which gave us the ability to take customers with significantly larger cloud footprints than the legacy platform was designed for.

During the development and rollout of the platform, we also found that the Spark History Server provided by Amazon EMR on EKS was very useful in terms of helping us identify performance issues and tune the performance of our jobs.

As of this writing, the phase 1 rollout, which includes the event processing component of the core analytics processing, is complete. We’re now expanding the platform to migrate additional components onto Amazon EMR on EKS. The following diagram depicts our future architecture with Amazon EMR on EKS when all phases are complete.

In addition, to improve performances and reduce costs, we’re currently testing the Spark dynamic resource allocation support of Amazon EMR on EKS. This would automatically scale up and down the job executors based on load, and therefore boost performance when needed and reduce cost when the workload is low. Furthermore, we’re investigating the possibility to reduce the overall cost and increase performance by utilizing the pod template feature that would allow us to seamlessly transition our Amazon EMR job workload to AWS Graviton based instances.

Conclusion

With Amazon EMR on EKS, we can now onboard new customers and process vast amounts of data in a cost-effective manner, which we couldn’t do with our legacy environment. We plan to expand our Amazon EMR on EKS footprint to handle all our transform and load data analytics processes.


About the Authors

Richard Li is a senior staff software engineer on the SailPoint Technologies Cloud Access Management team.

Janak Agarwal is a product manager for Amazon EMR on Amazon EKS at AWS.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data analytics solutions.

Financial Crime Discovery using Amazon EKS and Graph Databases

Post Syndicated from Severin Gassauer-Fleissner original https://aws.amazon.com/blogs/architecture/financial-crime-discovery-using-amazon-eks-and-graph-databases/

Discovering and solving financial crimes has become a challenge due to an increasing amount of financial data. While storing transactional payment data in a structured table format is useful for searching, filtering, and calculations, it is not always an ideal way to represent transactional data. For example, determining if there is a suspicious financial relationship between entity A and entity B is difficult to visualize in a table. Using tables, we would have to do SQL joins for every possible transaction from entity A to every possible subsequent transaction. We would then have to iterate this process until we found a relationship to entity B. Moreover, certain queries are challenging to run on a relational database management system (RDBMS). For example, it can be quite time consuming to discover which account received a minimum amount of $10,000,000 from other accounts.

Graph databases such as Amazon Neptune can be helpful with performing queries, because they can traverse the data and perform calculations simultaneously. Graphs enable us to represent transactions and parties over a multi-connected network, and discover patterns and chains of connections. It is common to use them in anti-money laundering (AML) applications, as they can help find patterns of suspicious transactions.

We needed a solution that could scale and process millions of transactions, by effectively using high memory and CPU configurations to perform complex queries quickly. As part of our customer demonstration to show how graph databases can help discover financial crimes, we also sourced a large dataset on which to test the solution. We used a graph database, Amazon Elastic Kubernetes Service (EKS), and Amazon Neptune, to search for suspicious financial chains across large amounts of transactional data in minutes.

Overview of our conceptual financial crime discovery solution

Figure 1. Workflow for financial crime discovery

Figure 1. Workflow for financial crime discovery

We first needed to find a rules engine that could perform transaction inferencing and reasoning. It had to be able to process various rules on our data, be efficient, and able to scale. Next, we needed a straightforward way to ingest data into the solution. Once we had the data available, we needed to initiate a task to begin our inference job. Finally, we needed a location to store the result for further analysis and persistence, see Figure 1.

Using the AWS Cloud to scale up a graph database

We used multiple AWS services to create a fully automated end-to-end batch-based transaction process, shown in Figure 2. We used RDFox, which is an AWS Marketplace product, created by Oxford Semantic Technologies. RDFox is a high-performance in-memory graph database and semantic reasoner. To orchestrate RDFox, we used Amazon EKS Autoscaler to spin up a cluster to instantiate the RDFox container. Amazon EKS can spin up multiple containers for difference inference jobs and recycle the resources when the job finishes. We also used Amazon Neptune, an Amazon managed distributed graph database that can store up to 64 TiB of the results for diagnosis and long-term retention.

The data is stored in Amazon S3 buckets, which provide a streamlined way to feed a large dataset for processing.

Figure 2. Architecture diagram for financial crime discovery

Figure 2. Architecture diagram for financial crime discovery

Financial crimes rules

The power of graphs can help discover financial crimes that are reflected in relationships and monetary transactions. To demonstrate this, we will write two rules to detect two scenarios:

  1. Given two suspicious parties X and Y, find out if there is a transactional relationship between them, and if so, provide the chain that connects them. This is a common scenario that financial institutions must detect.
  2. Given a chain identifying suspicious behavior, find out if the minimum transaction amount that reached the beneficiary exceeds a threshold of Z dollars.

Generating data

To generate data for testing, we used a synthetic data generator written in Python (see GitHub repo in References). The generator built two sets of graph artifacts – parties and transactions. Every transaction is being paired with two random parties, and this iterative process creates a network of connected transactions and parties.

We created a dataset with a small percentage (0.01%) of parties tagged as “Suspicious Party,” to simulate the preceding business scenario. Note that those parties will have transactions going both in and out. In some cases, this will collide with other suspicious parties and establish a chain. This method enables us to get simulated data without engineering the suspicious chains.
The test dataset used with this solution comprises 1M transactions and thousands of parties.
For more information on generating data, see GitHub: Transaction Chains Data Generator.

Walkthrough of the financial investigator workflow

Once deployed, this solution can assist investigators as follows:

  1. An investigator places the transactions and party data (nt triples) in a subdirectory within the input bucket. Typically, subdirectories can be named as a date or range of dates. In addition, the investigator uploads the particular rules (dlog files) and queries (rq files) that must be processed on the data.
  2. Once the data is ready, the investigator uploads a job spec file (simple JSON, see References section). This contains the description of what resources the job requires (CPUs and memory), along with other configurations for the job.
  3. Once the job spec has been uploaded to the bucket’s subdirectory, the job is automatically initiated. The Kubernetes scheduler will allocate enough resources to initiate the RDFox pod. The containers in the pod will then load, process, query results, and upload them to the output bucket.
  4. Once the data reaches the output bucket, an AWS Lambda function is initiated. This invokes the Amazon Neptune Bulk Loader, which asynchronously loads the results to the Neptune cluster in a parallelized manner.
  5. Once the load completes, the investigator gets an email notification that the job has been completed, and the results are ready for view.

Additionally:

  • The investigator can upload multiple rules and queries, they will all be processed automatically.
  • The investigator can launch multiple jobs with different/same data, and with different rules and queries at the same time.
  • All jobs outputs are saved in a unique job ID subdirectory in the output bucket.

To create the solution in your account, follow the instruction here: GitHub repo

Prerequisites

For this walkthrough, you should have the following prerequisites:

Rules

We create two materialization rule sets to fetch the two scenarios described.

1. detect-suspicious-parties-pair.dlog

The purpose of this first set of rules is to detect chains that might exist between two suspicious parties. The idea of these chains is to represent all the possible relationships that contain monetary transfers between a suspicious originator and the beneficiary. This will include non-suspicious parties in the chain. The rule tags these chains with the “SuspiciousChain” flag.

2. detect-chains-exceed-100-dollars.dlog

This set of rules is designed to tag the chains identified by the first set of rules. It also contains a minimum amount of $100 passed to the beneficiary. We can change the amount to check for different compliance requirements. We tag those chains as “HighValueChain.”

Run the job and check your results

Now we can run our job, with the given data, rules, and two additional queries (to extract “SuspiciousChain” and “HighValueChain” respectively). The result of the queries will be loaded to Amazon Neptune automatically for persistent storage, and is made durably available for further analysis.

Let’s look at the results. The following query can be initiated against RDFox console or Amazon Neptune.

Visualization

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>

PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>

PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>

PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT ?S ?P ?O WHERE {

            ?S a type:SuspiciousChain .

            ?S ?P ?O .

}

Figure 3. Visualizing suspicious chains

Figure 3. Visualizing suspicious chains

Whoa! Figure 3 might look complicated at first, but this is because we are visualizing every pair of suspicious parties that have a relationship with another suspicious party. Let’s filter the query to look only at a single particular chain, which exceeds a minimum of $100 to the beneficiary. The following query can be executed against RDFox console or Amazon Neptune.

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>
PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>
PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>
PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT * WHERE {
?S ?P ?O
{
SELECT ?S WHERE {
?S a type:HighValueChain .

} Limit 1

}

Figure 4. Visualizing a particular chain

Figure 4. Visualizing a particular chain

In Figure 4, we can see that Allison, the suspicious originator of the chain, has sent a transaction to Troy. Troy, who is not suspicious, sent the transaction to Karina. Karina is the suspicious beneficiary. We can also see additional information, such as the transaction amount that Karina received, and the chain length of 2 in this case.

In our testing, we were able to scale up to 500M transactions with 50M parties and process this in less than two hours! And we performed this at a significant lower cost when compared to running constant, fixed similar hardware.

Cleaning up

Follow Terraform cleanup instructions.

Conclusion

Graph databases are a powerful tool to apply reasoning on complex financial relationships. The combination of Amazon Web Services and the RDFox engine results in an automated, scalable, and cost-effective, thanks to the dynamic Kubernetes Cluster Autoscaler. Customers can use this solution and provide their investigators with a tool they can experiment and reason on financial transactions. This solution simplifies the process, and makes it easier to try different rules and queries on complex large data collections.

This blog post is written with Oxford Semantic.

Oxford Semantic Logo

References

Solution:

Further reading

RDFox:

Other:

Scaling DLT to 1M TPS on AWS: Optimizing a Regulated Liabilities Network

Post Syndicated from Erica Salinas original https://aws.amazon.com/blogs/architecture/scaling-dlt-to-1m-tps-on-aws-optimizing-a-regulated-liabilities-network/

SETL is an open source, distributed ledger technology (DLT) company that enables tokenisation, digital custody, and DLT for securities markets and payments. In mid-2021, they developed a blueprint for a Regulated Liabilities Network (RLN) that enables holding and managing a variety of tokenized value irrespective of its form. In a December 2021 collaboration with Amazon Web Services (AWS), SETL demonstrated that their RLN platform could support one million transactions per second. These findings were published in their whitepaper, The Regulated Liability Network: Whitepaper on scalability and performance.

This AWS-hosted architecture scaled each processing component and the complete transaction flow. It was built using Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Cloud Compute (EC2), and Amazon Managed Streaming for Apache Kafka (MSK).

This post discusses the technical implementation of the simulated network. We show how scaling characteristics were achieved, while maintaining the business requirements of atomicity and finality. We also discuss how each RLN component was optimized for high performance.

Background: What is an RLN?

In The Regulated Internet of Value, Tony McLaughlin of Citi proposed a single network to record ownership of multiple token types, each which represents a liability of a regulated entity. The RLN would give commercial banks, e-money providers, and central banks, the ability to issue liabilities and immutably record all balances and owners. Because it is designed to maintain a globally sequenced set of state changes, an unambiguous ledger of token balances can be computed and updated. Transactions, which result in two or more ledger updates, are proposed to the network. They are authorized by all impacted network participants, ordered for distribution to participant ledgers, and then persisted by multiple participant ledgers. This process is completed with five processing components.

Figure 1. Core components and queues of RLN

Figure 1. Core components and queues of RLN

Given the existing performance limitations faced by many DLT platforms, a main concern with the network design was its ability to meet throughput requirements. SETL collaborated with AWS to define an architecture that integrated decoupled processing components, as shown in Figure 2. The target throughput was 1,000,000 TPS for each component through the simulated network.

Figure 2. Simulated RLN architecture in the AWS Cloud

Figure 2. Simulated RLN architecture in the AWS Cloud

A queue – process – queue architectural model

The basic architectural model applied was “queue – process – queue.” Each component consumed transactions from one or more Kafka topics, performed its requisite activities, and produced output to a second Kafka topic. Components of the message flow used Amazon EKS, Amazon EC2, and Amazon MSK.

Specific techniques were applied to achieve the scaling characteristics of components in the main message flow.

  • A distinct EKS-managed node group was used for each RLN component. Each node within the group had its own EC2 instance that housed at most two Kubernetes Pods. This technique decreased potential network throttling and enabled the monitoring of pod resource utilization using Amazon CloudWatch at the EC2 level. This aided in identifying bottlenecks, which can be challenging if large EC2 nodes house many Kubernetes Pods.
  • Each RLN queue component had its own dedicated Kafka cluster. By isolating the Kafka topics to their own cluster, we were able to scale and monitor the cluster individually. This decreased potential chokepoints.
  • A combination of eksctl, kubectl, and EC2 Auto Scaling groups was used to deploy and horizontally scale each RLN component. The tooling enabled rapid results by controlling pod configuration, automating deployments, and supporting multiple performance testing iterations.

Fine tuning RLN system components

The key metric tested was message throughput in transactions per second (TPS) consumed, processed, and produced for each component. An end-to-end test was performed, which measured throughput of the complete simulated network. Each RLN component was tuned to meet this metric as follows.

The Transaction Generator acted as the customer entering transactions into the system (see Figure 3.) The Kafka producer property buffer.memory was changed to 536870912 to simulate an increase in producer TPS for each component.

Figure 3. Simulated RLN Transaction Generator

Figure 3. Simulated RLN Transaction Generator

The Scheduler’s scaling technique was a dedicated Compute (one Pod/EC2 node) that listened to one Kafka partition in the Transaction Queue as seen in Figure 4. Additionally, turning on gzip compression (compression.type) on the producer resolved a network bottleneck for the downstream Kafka cluster.

Figure 4. Simulated RLN Scheduler

Figure 4. Simulated RLN Scheduler

The Approver’s scaling technique was like the Scheduler, however, the Proposal Queue had 10 topics as opposed to one. The approver is stateless, so it randomly selects a Kafka topic and partition to listen to. When scaling to 100 Kubernetes Pods, there would be roughly one Kafka partition per pod. As shown in Figure 5, each pod has its own EC2 instance. The Assembler’s scaling techniques were the same as for the Scheduler and Approver components.

Figure 5. Simulated RLN Approver

Figure 5. Simulated RLN Approver

The Sequencer, as designed, cannot scale horizontally due to the nature of the sequencing algorithm. However, without any special tuning of the Kafka consumer configuration or any significant increase in EC2 sizing, it was able to achieve one million TPS. Its architecture is shown in Figure 6.

Figure 6. Simulated RLN Sequencer

Figure 6. Simulated RLN Sequencer

The State Updater scaled with each Kubernetes Pod listening to an assigned Approved Proposal topic (one-to-one mapping) and receiving all sequenced hashes from the Hash Queue. Achieving 1 million TPS throughput required 10 nodes configured with c5.4xlarge, as seen in Figure 7.

Figure 7. Simulated RLN State Updater

Figure 7. Simulated RLN State Updater

Successful throughput testing

Test results demonstrate that each component can sustain over 1 million TPS. While horizontal scaling was applied, even a single instance achieved over 10,000 TPS. Individual component performance test results can be seen in Table 1.

Component Single Instance (TPS) # of Instances Multiple Instances (TPS)
Generator ~ 629,219 2 1,258,438
Scheduler ~11,019 100 1,101,928
Approver ~10,348 100 1,034,868
Assembler ~10,691 100 1,069,100
Sequencer ~1,052,845 1 N/A
State Updater ~100,256 10 1,002,569

Table 1. Test results per individual component

Network tests were performed with all components integrated to create a simulated RLN network. The tests were executed with transaction generation of ~1,000,000 TPS. Throughput was measured at the output of the final component’s (State Updater) instances (see Figure 8). It was also measured in aggregate at over 1.2 million TPS (see Figure 9).

Figure 8. Individual TPS for State Update component nodes

Figure 8. Individual TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

While this study was designed to demonstrate capacity and throughput, the tests did replicate Kafka partitions across three Availability Zones in the selected AWS Region. Thus, testing demonstrated that the proposed architecture supports geographic resilience while meeting the throughput benchmark. Resiliency and security are areas of focus for testing in the next phases of architecting a production-ready version of RLN.

Looking forward

During the month of joint work between the SETL and AWS technical teams, we stood up and tuned basic RLN functionality. We successfully demonstrated the scalability of the environment to at least 1M TPS. By using Kafka queues and the horizontal and vertical scalability of AWS services, we addressed the primary concern of reaching production-level throughput.

The industry group collaborating on the RLN concept now has a solid technical foundation on which to build a production-ready RLN architecture. Future experimentation will include integration with financial institutions and technology partners that will provide the capabilities needed to build a successful RLN ecosystem.

Further explorations include enabling digital signing, verification at scale, and expanding the network to include financial institution environments integrated with their own RLN partitions.

Amazon Elastic Kubernetes Service Adds IPv6 Networking

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-elastic-kubernetes-service-adds-ipv6-networking/

Starting today, you can deploy applications that use IPv6 address space on Amazon Elastic Kubernetes Service (EKS).

Many of our customers are standardizing Kubernetes as their compute infrastructure platform for cloud and on-premises applications. Amazon EKS makes it easy to deploy containerized workloads. It provides highly available clusters and automates tasks such as patching, node provisioning, and updates.

Kubernetes uses a flat networking model that requires each pod to receive an IP address. This simplified approach enables low-friction porting of applications from virtual machines to containers but requires a significant number of IP addresses that many private VPC IPv4 networks are not equipped to handle. Some cluster administrators work around this IPv4 space limitation by installing container network plugins (CNI) that virtualize IP addresses a layer above the VPC, but this architecture limits an administrator’s ability to effectively observe and troubleshoot applications and has a negative impact on network performance at scale. Further, to communicate with internet services outside the VPC, traffic from IPv4 pods is routed through multiple network hops before reaching its destination, which adds latency and puts a strain on network engineering teams who need to maintain complex routing setups.

To avoid IP address exhaustion, minimize latency at scale, and simplify routing configuration, the solution is to use IPv6 address space.

IPv6 is not new. In 1996, I bought my first book on “IPng, Internet Protocol Next Generation”, as it was called 25 years ago. It provides a 64-bit address space, allowing 3.4 x 10^38 possible IP addresses for our devices, servers, or containers. We could assign an IPv6 address to every atom on the surface of the planet and still have enough addresses left to do another 100-plus Earths.

IPng Internet protocol Next Generation bookThere are a few advantages to using Amazon EKS clusters with an IPv6 network. First, you can run more pods on one single host or subnet without the risk of exhausting all available IPv4 addresses available in your VPC. Second, it allows for lower-latency communications with other IPv6 services, running on-premises, on AWS, or on the internet, by avoiding an extra NAT hop. Third, it relieves network engineers of the burden of maintaining complex routing configurations.

Kubernetes cluster administrators can focus on migrating and scaling applications without spending efforts working around IPv4 limits. Finally, pod networking is configured so that the pods can communicate with IPv4-based applications outside the cluster, allowing you to adopt the benefits of IPv6 on Amazon EKS without requiring that all dependent services deployed across your organization are first migrated to IPv6.

As usual, I built a short demo to show you how it works.

How It Works
Before I get started, I create an IPv6 VPC. I use this CDK script to create an IPv6-enabled VPC in a few minutes (thank you Angus Lees for the code). Just install CDK v2 (npm install -g aws-cdk@next) and deploy the stack (cdk bootstrap && cdk deploy).

When the VPC with IPv6 is created, I use the console to configure auto-assignment of IPv6 addresses to resources deployed in the public subnets (I do this for each public subnet).

auto assign IPv6 addresses in subnet

I take note of the subnet IDs created by the CDK script above (they are listed in the output of the script) and define a couple of variables I’ll use throughout the demo. I also create a cluster IAM role and a node IAM role, as described in the Amazon EKS documentation. When you already have clusters deployed, these two roles exist already.

I open a Terminal and type:


CLUSTER_ROLE_ARN="arn:aws:iam::0123456789:role/EKSClusterRole"
NODE_ROLE_ARN="arn:aws:iam::0123456789:role/EKSNodeRole"
SUBNET1="subnet-06000a8"
SUBNET2="subnet-03000cc"
CLUSTER_NAME="AWSNewsBlog"
KEYPAIR_NAME="my-key-pair-name"

Next, I create an Amazon EKS IPv6 cluster. In a terminal, I type:


aws eks create-cluster --cli-input-json "{
\"name\": \"${CLUSTER_NAME}\",
\"version\": \"1.21\",
\"roleArn\": \"${CLUSTER_ROLE_ARN}\",
\"resourcesVpcConfig\": {
\"subnetIds\": [
    \"${SUBNET1}\", \"${SUBNET2}\"
],
\"endpointPublicAccess\": true,
\"endpointPrivateAccess\": true
},
\"kubernetesNetworkConfig\": {
    \"ipFamily\": \"ipv6\"
}
}"

{
    "cluster": {
        "name": "AWSNewsBlog",
        "arn": "arn:aws:eks:us-west-2:486652066693:cluster/AWSNewsBlog",
        "createdAt": "2021-11-02T17:29:32.989000+01:00",
        "version": "1.21",

...redacted for brevity...

        "status": "CREATING",
        "certificateAuthority": {},
        "platformVersion": "eks.4",
        "tags": {}
    }
}

I use the describe-cluster while waiting for the cluster to be created. When the cluster is ready, it has "status" : "ACTIVE"

aws eks describe-cluster --name "${CLUSTER_NAME}"

Then I create a node group:

aws eks create-nodegroup                       \
        --cluster-name ${CLUSTER_NAME}         \
        --nodegroup-name AWSNewsBlog-nodegroup \
        --node-role ${NODE_ROLE_ARN}           \
        --subnets "${SUBNET1}" "${SUBNET2}"    \
        --remote-access ec2SshKey=${KEYPAIR_NAME}
		
{
    "nodegroup": {
        "nodegroupName": "AWSNewsBlog-nodegroup",
        "nodegroupArn": "arn:aws:eks:us-west-2:0123456789:nodegroup/AWSNewsBlog/AWSNewsBlog-nodegroup/3ebe70c7-6c45-d498-6d42-4001f70e7833",
        "clusterName": "AWSNewsBlog",
        "version": "1.21",
        "releaseVersion": "1.21.4-20211101",

        "status": "CREATING",
        "capacityType": "ON_DEMAND",

... redacted for brevity ...

}		

Once the node group is created, I see two EC2 instances in the console. I use the AWS Command Line Interface (CLI) to verify that the instances received an IPv6 address:

aws ec2 describe-instances --query "Reservations[].Instances[? State.Name == 'running' ][].NetworkInterfaces[].Ipv6Addresses" --output text 

2600:1f13:812:0000:0000:0000:0000:71eb
2600:1f13:812:0000:0000:0000:0000:3c07

I use the kubectl command to verify the cluster from a Kubernetes point of view.

kubectl get nodes -o wide

NAME                                       STATUS   ROLES    AGE     VERSION               INTERNAL-IP                              EXTERNAL-IP    OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-0-108.us-west-2.compute.internal   Ready    <none>   2d13h   v1.21.4-eks-033ce7e   2600:1f13:812:0000:0000:0000:0000:2263   18.0.0.205   Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7
ip-10-0-1-217.us-west-2.compute.internal   Ready    <none>   2d13h   v1.21.4-eks-033ce7e   2600:1f13:812:0000:0000:0000:0000:7f3e   52.0.0.122   Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7

Then I deploy a Pod. I follow these steps in the EKS documentation. It deploys a sample nginx web server.

kubectl create namespace aws-news-blog
namespace/aws-news-blog created

# sample-service.yml is available at https://docs.aws.amazon.com/eks/latest/userguide/sample-deployment.html
kubectl apply -f  sample-service.yml 
service/my-service created
deployment.apps/my-deployment created

kubectl get pods -n aws-news-blog -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP                           NODE                                       NOMINATED NODE   READINESS GATES
my-deployment-5dd5dfd6b9-7rllg   1/1     Running   0          17m   2600:0000:0000:0000:405b::2   ip-10-0-1-217.us-west-2.compute.internal   <none>           <none>
my-deployment-5dd5dfd6b9-h6mrt   1/1     Running   0          17m   2600:0000:0000:0000:46f9::    ip-10-0-0-108.us-west-2.compute.internal   <none>           <none>
my-deployment-5dd5dfd6b9-mrkfv   1/1     Running   0          17m   2600:0000:0000:0000:46f9::1   ip-10-0-0-108.us-west-2.compute.internal   <none>           <none>

I take note of the IPv6 address of my pods, and try to connect it from my laptop. As my awesome service provider doesn’t provide me with an IPv6 at home yet, the connection fails. This is expected as the pods do not have an IPv4 address at all. Notice the -g option telling curl to not consider : in the IP address as the separator for the port number and -6 to tell curl to connect through IPv6 only (required when you provide curl with a DNS hostname).

curl -g -6 http://\[2600:0000:0000:35000000:46f9::1\]
curl: (7) Couldn't connect to server

To test IPv6 connectivity, I start a dual stack (IPv4 and IPv6) EC2 instance in the same VPC as the cluster. I SSH connect to the instance and try the curl command again. I see I receive the default HTML page served by nginx. IPv6 connectivity to the pod works!

curl -g -6 http://\[2600:0000:0000:35000000:46f9::1\]
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

... redacted for brevity ...

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

If it does not work for you, verify the security group for the cluster EC2 nodes and be sure it has a rule allowing incoming connections on port TCP 80 from ::/0.

A Few Things to Remember
Before I wrap up, I’d like to answer some frequent questions received from customers who have already experimented with this new capability:

Pricing and Availability
IPv6 support for your Amazon Elastic Kubernetes Service (EKS) cluster is available today in all AWS Regions where Amazon EKS is available, at no additional cost.

Go try it out and build your first IPv6 cluster today.

— seb