Tag Archives: EKS

Integrate Kubernetes policy-as-code solutions into Security Hub

Post Syndicated from Joaquin Manuel Rinaudo original https://aws.amazon.com/blogs/security/integrate-kubernetes-policy-as-code-solutions-into-security-hub/

Using Kubernetes policy-as-code (PaC) solutions, administrators and security professionals can enforce organization policies to Kubernetes resources. There are several publicly available PAC solutions that are available for Kubernetes, such as Gatekeeper, Polaris, and Kyverno.

PaC solutions usually implement two features:

  • Use Kubernetes admission controllers to validate or modify objects before they’re created to help enforce configuration best practices for your clusters.
  • Provide a way for you to scan your resources created before policies were deployed or against new policies being evaluated.

This post presents a solution to send policy violations from PaC solutions using Kubernetes policy report format (for example, using Kyverno) or from Gatekeeper’s constraints status directly to AWS Security Hub. With this solution, you can visualize Kubernetes security misconfigurations across your Amazon Elastic Kubernetes Service (Amazon EKS) clusters and your organizations in AWS Organizations. This can also help you implement standard security use cases—such as unified security reporting, escalation through a ticketing system, or automated remediation—on top of Security Hub to help improve your overall Kubernetes security posture and reduce manual efforts.

Solution overview

The solution uses the approach described in A Container-Free Way to Configure Kubernetes Using AWS Lambda to deploy an AWS Lambda function that periodically synchronizes the security status of a Kubernetes cluster from a Kubernetes or Gatekeeper policy report with Security Hub. Figure 1 shows the architecture diagram for the solution.

Figure 1: Diagram of solution

Figure 1: Diagram of solution

This solution works using the following resources and configurations:

  1. A scheduled event which invokes a Lambda function on a 10-minute interval.
  2. The Lambda function iterates through each running EKS cluster that you want to integrate and authenticate by using a Kubernetes Python client and an AWS Identity and Access Management (IAM) role of the Lambda function.
  3. For each running cluster, the Lambda function retrieves the selected Kubernetes policy reports (or the Gatekeeper constraint status, depending on the policy selected) and sends active violations, if present, to Security Hub. With Gatekeeper, if more violations exist than those reported in the constraint, an additional INFORMATIONAL finding is generated in Security Hub to let security teams know of the missing findings.

    Optional: EKS cluster administrators can raise the limit of reported policy violations by using the –constraint-violations-limit flag in their Gatekeeper audit operation.

  4. For each running cluster, the Lambda function archives archive previously raised and resolved findings in Security Hub.

You can download the solution from this GitHub repository.

Walkthrough

In the walkthrough, I show you how to deploy a Kubernetes policy-as-code solution and forward the findings to Security Hub. We’ll configure Kyverno and a Kubernetes demo environment with findings in an existing EKS cluster to Security Hub.

The code provided includes an example constraint and noncompliant resource to test against.

Prerequisites

An EKS cluster is required to set up this solution within your AWS environments. The cluster should be configured with either aws-auth ConfigMap or access entries. Optional: You can use eksctl to create a cluster.

The following resources need to be installed on your computer:

Step 1: Set up the environment

The first step is to install Kyverno on an existing Kubernetes cluster. Then deploy examples of a Kyverno policy and noncompliant resources.

Deploy Kyverno example and policy

  1. Deploy Kyverno in your Kubernetes cluster according to its installation manual using the Kubernetes CLI.
    kubectl create -f https://github.com/kyverno/kyverno/releases/download/v1.10.0/install.yaml

  2. Set up a policy that requires namespaces to use the label thisshouldntexist.
    kubectl create -f - << EOF
    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: require-ns-labels
    spec:
      validationFailureAction: Audit
      background: true
      rules:
      - name: check-for-labels-on-namespace
        match:
          any:
          - resources:
              kinds:
              - Namespace
        validate:
          message: "The label thisshouldntexist is required."
          pattern:
            metadata:
              labels:
                thisshouldntexist: "?*"
    EOF

Deploy a noncompliant resource to test this solution

  1. Create a noncompliant namespace.
    kubectl create namespace non-compliant

  2. Check the Kubernetes policy report status using the following command:
    kubectl get clusterpolicyreport -o yaml

You should see output similar to the following:

apiVersion: v1
items:
- apiVersion: wgpolicyk8s.io/v1alpha2
  kind: ClusterPolicyReport
  metadata:
    creationTimestamp: "2024-02-20T14:00:37Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: kyverno
      cpol.kyverno.io/require-ns-labels: "3734083"
    name: cpol-require-ns-labels
    resourceVersion: "3734261"
    uid: 3cfcf1da-bd28-453f-b2f5-512c26065986
  results:
   ...
  - message: 'validation error: The label thisshouldntexist is required. rule check-for-labels-on-namespace
      failed at path /metadata/labels/thisshouldntexist/'
    policy: require-ns-labels
    resources:
    - apiVersion: v1
      kind: Namespace
      name: non-compliant
      uid: d62eb1ad-8a0b-476b-848d-ff6542c57840
    result: fail
    rule: check-for-labels-on-namespace
    scored: true
    source: kyverno
    timestamp:
      nanos: 0
      seconds: 1708437615

Step 2: Solution code deployment and configuration

The next step is to clone and deploy the solution that integrates with Security Hub.

To deploy the solution

  1. Clone the GitHub repository by using your preferred command line terminal:
    git clone https://github.com/aws-samples/securityhub-k8s-policy-integration.git

  2. Open the parameters.json file and configure the following values:
    1. Policy – Name of the product that you want to enable, in this case policyreport, which is supported by tools such as Kyverno.
    2. ClusterNames – List of EKS clusters. When AccessEntryEnabled is enabled, this solution deploys an access entry for the integration to access your EKS clusters.
    3. SubnetIds – (Optional) A comma-separated list of your subnets. If you’ve configured the API endpoints of your EKS clusters as private only, then you need to configure this parameter. If your EKS clusters have public endpoints enabled, you can remove this parameter.
    4. SecurityGroupId – (Optional) A security group ID that allows connectivity to the EKS clusters. This parameter is only required if you’re running private API endpoints; otherwise, you can remove it. This security group should be allowed ingress from the security group of the EKS control plane.
    5. AccessEntryEnabled – (Optional) If you’re using EKS access entries, the solution automatically deploys the access entries with read-only-group permissions deployed in the next step. This parameter is True by default.
  3. Save the changes and close the parameters file.
  4. Set up your AWS_REGION (for example, export AWS_REGION=eu-west-1) and make sure that your credentials are configured for the delegated administrator account.
  5. Enter the following command to deploy:
    ./deploy.sh

You should see the following output:

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - aws-securityhub-k8s-policy-integration

Step 3: Set up EKS cluster access

You need to create the Kubernetes Group read-only-group to allow read-only permissions to the IAM role of the Lambda function. If you aren’t using access entries, you will also need to modify the aws-auth ConfigMap of the Kubernetes clusters.

To configure access to EKS clusters

  1. For each cluster that’s running in your account, run the kube-setup.sh script to create the Kubernetes read-only cluster role and cluster role binding.
  2. (Optional) Configure aws-auth ConfigMap using eksctl if you aren’t using access entries.

Step 4: Verify AWS service integration

The next step is to verify that the Lambda integration to Security Hub is running.

To verify the integration is running

  1. Open the Lambda console, and navigate to the aws-securityhub-k8s-policy-integration-<region> function.
  2. Start a test to import your cluster’s noncompliant findings to Security Hub.
  3. In the Security Hub console, review the recently created findings from Kyverno.
     
    Figure 2: Sample Kyverno findings in Security Hub

    Figure 2: Sample Kyverno findings in Security Hub

Step 5: Clean up

The final step is to clean up the resources that you created for this walkthrough.

To destroy the stack

  • Use the command line terminal in your laptop to run the following command:
    ./cleanup.sh

Conclusion

In this post, you learned how to integrate Kubernetes policy report findings with Security Hub and tested this setup by using the Kyverno policy engine. If you want to test the integration of this solution with Gatekeeper, you can find alternative commands for step 1 of this post in the GitHub repository’s README file.

Using this integration, you can gain visibility into your Kubernetes security posture across EKS clusters and join it with a centralized view, together with other security findings such as those from AWS Config, Amazon Inspector, and more across your organization. You can also try this solution with other tools, such as kube-bench or Gatekeeper. You can extend this setup to notify security teams of critical misconfigurations or implement automated remediation actions by using AWS Security Hub.

For more information on how to use PaC solutions to secure Kubernetes workloads in the AWS cloud, see Amazon Elastic Kubernetes Service (Amazon EKS) workshop, Amazon EKS best practices, Using Gatekeeper as a drop-in Pod Security Policy replacement in Amazon EKS and Policy-based countermeasures for Kubernetes.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Author

Joaquin Manuel Rinaudo

Joaquin is a Principal Security Architect with AWS Professional Services. He is passionate about building solutions that help developers improve their software quality. Prior to AWS, he worked across multiple domains in the security industry, from mobile security to cloud and compliance related topics. In his free time, Joaquin enjoys spending time with family and reading science fiction novels.

Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/accelerate-your-data-exploration-and-experimentation-with-the-aws-analytics-reference-architecture-library/

Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost.

Developers need an integrated development environment (IDE) for data exploration and debugging of workflows, and different compute profiles for running these workflows. If you choose Amazon EMR for such use cases, you can use an IDE called Amazon EMR Studio for data exploration, transformation, version control, and debugging, and run Spark jobs to process large volume of data. Deploying Amazon EMR on Amazon EKS simplifies management, reduces costs, and improves performance. However, a data engineer or IT administrator needs to spend time creating the underlying infrastructure, configuring security, and creating a managed endpoint for users to connect to. This means such projects have to wait until these experts create the infrastructure.

In this post, we show how a data engineer or IT administrator can use the AWS Analytics Reference Architecture (ARA) to accelerate infrastructure deployment, saving your organization both time and money spent on these data analytics experiments. We use the library to deploy an Amazon Elastic Kubernetes (Amazon EKS) cluster, configure it to use Amazon EMR on EKS, and deploy a virtual cluster and managed endpoints and EMR Studio. You can then either run jobs on the virtual cluster or run exploratory data analysis with Jupyter notebooks on Amazon EMR Studio and Amazon EMR on EKS. The architecture below represent the infrastructure you will deploy with the AWS Analytics Reference Architecture.

cdk-emr-eks-studio-architecture

Prerequisites

To follow along, you need to have an AWS account that is bootstrapped with the AWS Cloud Development Kit (AWS CDK). For instructions, refer to Bootstrapping. The following tutorial uses TypeScript, and requires version 2 or later of the AWS CDK. If you don’t have the AWS CDK installed, refer to Install the AWS CDK.

Set up an AWS CDK project

To deploy resources using the ARA, you first need to set up an AWS CDK project and install the ARA library. Complete the following steps:

  1. Create a folder named emr-eks-app:
    mkdir emr-eks-app && cd emr-eks-app

  2. Initialize an AWS CDK project in an empty directory and run the following command:
    cdk init app --language typescript

  3. Install the ARA library:
    npm install aws-analytics-reference-architecture --save

  4. In lib/emr-eks-app.ts, import the ARA library as follows. The first line calls the ARA library, the second one defines AWS Identity and Access Management (IAM) policies:
    import * as ara from 'aws-analytics-reference-architecture'; 
    import * as iam from 'aws-cdk-lib/aws-iam';

Create and define an EKS cluster and compute capacity

To create an EMR on EKS virtual cluster, you first need to deploy an EKS cluster. The ARA library defines a construct called EmrEksCluster. The construct provisions an EKS cluster, enables IAM roles for service accounts, and deploys a set of supporting controllers like certificate manager controller (needed by the managed endpoint that is used by Amazon EMR Studio) as well as a cluster auto scaler to have an elastic cluster and save on cost when no job is submitted to the cluster.

In lib/emr-eks-app.ts, add the following line:

const emrEks = ara.EmrEksCluster.getOrCreate(this,{ 
   eksAdminRoleArn:ROLE_ARN;, 
   eksClusterName:CLUSTER_NAME;
   autoscaling: Autoscaler.KARPENTER, 
});

To learn more about the properties you can customize, refer to EmrEksClusterProps. There are two mandatory parameters in EmrEksCluster construct: The first is eksAdminRoleArn role is mandatory and is the role you use to interact with the Kubernetes control plane. This role must have administrative permissions to create or update the cluster. The second parameter is autoscaling, this parameter allows you to select the autoscaling mechanism, either Karpenter or native Kubernetes Cluster Autoscaler. In this blog we will use Karpenter and we recommend its use due to faster autoscaling, simplified node management and provisioning. Now you’re ready to define the compute capacity.

One way to define worker nodes in Amazon EKS is to use managed node groups. We use one node group called tooling, which hosts the coredns, ingress controller, certificate manager, Karpenter and any other pod that is necessary for the running EMR on EKS jobs or ManagedEndpoint. We also define default Karpenter Provisioners that define capacity to be used for jobs submitted by EMR on EKS. These Provisioners are optimized for different Spark use cases (critical jobs, non-critical job, experimentation and interactive sessions). The construct also allows you to submit your own provisioner defined by a Kubernetes manifest through a method called addKarpenterProvisioner. Let’s discuss the predefined Provisioners.

Default Provisioners configurations

The default provisioners are set for rapid experimentation and are always created by default. However, if you don’t want to use them, you can set the defaultNodeGroups parameter to false in the EmrEksCluster properties at creation time. The Provisioners are defined as follows and are created in each of the subnets that are used by Amazon EKS:

  • Critical provisioner – It is dedicated to supporting jobs with aggressive SLAs and are time sensitive. The provisioner uses On-Demand Instances, which aren’t stopped, unlike Spot Instances, and their lifecycle follows through one of the jobs. The nodes use instance stores, which are NVMe disks physically attached to the host, which offer a high I/O throughput that allow better Spark performance, because it’s used as temporary storage for disk spill and shuffle. The instance types used in the node are of the m6gd family. The instances use the AWS Graviton processor, which offers better price/performance than x86 processors. To use this provisioner in your jobs, you can use the following sample configuration, which is referenced in the configuration override of the EMR on EKS job submission.
  • Non-critical provisioner – This Provisioner leverage Spot Instances to save costs for jobs that aren’t time sensitive or jobs that are used for experiments. This node use Spot Instances because the jobs aren’t critical and can be interrupted. These instances can be stopped if the instance is reclaimed. The instance types used in the node are of the m6gd family, the driver is On-Demand and executors are on spot instances.
  • Notebook provisioner – The Provisioner is for running managed endpoints that are used by Amazon EMR Studio for data exploration using Amazon EMR on EKS. The instances are of t3 family and are On-Demand for driver and Spot Instances for executors to keep the cost low. If the executor instances are stopped, new ones are started by Karpenter. If the executor instances are stopped too often, you can define your own that use On-Demand instances.

The following link provides more details about how each of the provisioner are defined. One import property that is defined in the default Provisioners is there is one for each AZ. This is important because it allows you to reduce inter-AZ network transfer cost when Spark runs a shuffle.

For this post, we use the default Provisioners, so you don’t need to add any lines of code for this section. If you want yo add your own Provisioners you can leverage the method addKarpenterProvisioner to apply your own manifests. You can use helper methods in Utils class like readYamlDocument to read YAML document and loadYaml load YAML files and pass them as arguments to addKarpenterProvisioner method.

Deploy the virtual cluster and an execution role

A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with; when you submit a job, the driver and executor pods are running in the associated namespace. The EmrEksCluster construct offers a method called addEmrVirtualCluster, which creates the virtual cluster for you. The method takes EmrVirtualClusterOptions as a parameter, which has the following attributes:

  • name – The name of your virtual cluster.
  • createNamespace – An optional field that creates the EKS namespace. This is of type Boolean and by default it doesn’t create a separate EKS namespace, so your virtual cluster is created in the default namespace.
  • eksNamespace – The name of the EKS namespace to be linked with the virtual EMR cluster. If no namespace is supplied, the construct uses the default namespace.
  1. In lib/emr-eks-app.ts, add the following line to create your virtual cluster:
    const virtualCluster = emrEks.addEmrVirtualCluster(this,{ 
       name:'my-emr-eks-cluster', 
       eksNamespace: ‘batchjob’, 
       createNamespace: true 
    });

    Now we create the execution role, which is an IAM role that is used by the driver and executor to interact with AWS services. Before we can create the execution role for Amazon EMR, we need to first create the ManagedPolicy. Note that in the following code, we create a policy to allow access to the Amazon Simple Storage Service (Amazon S3) bucket and Amazon CloudWatch logs.

  2. In lib/emr-eks-app.ts, add the following line to create the policy:
    const emrEksPolicy = new iam.ManagedPolicy(this,'managed-policy',
    { statements: [ 
       new iam.PolicyStatement({ 
           effect: iam.Effect.ALLOW, 
           actions:['s3:PutObject','s3:GetObject','s3:ListBucket'], 
           resources:['YOUR-DATA-S3-BUCKET']
        }), 
       new iam.PolicyStatement({ 
           effect: iam.Effect.ALLOW, 
           actions:['logs:PutLogEvents','logs:CreateLogStream','logs:DescribeLogGroups','logs:DescribeLogStreams'], 
           resources:['arn:aws:logs:*:*:*'] 
        })
       ] 
    });

    If you want to use the AWS Glue Data Catalog, add its permission in the preceding policy.

    Now we create the execution role for Amazon EMR on EKS using the policy defined in the previous step using the createExecutionRole instance method. The driver and executor pods can then assume this role to access and process data. The role is scoped in such a way that only pods in the virtual cluster namespace can assume it. To learn more about the condition implemented by this method to restrict access to the role to only pods that are created by Amazon EMR on EKS in the namespace of the virtual cluster, refer to Using job execution roles with Amazon EMR on EKS.

  3. In lib/emr-eks-app.ts, add the following line to create the execution role:
    const role = emrEks.createExecutionRole(this,'emr-eks-execution-role', emrEksPolicy, ‘batchjob’,’ execRoleJob’);

    The preceding code produces an IAM role called execRoleJob with the IAM policy defined in emrekspolicy and scoped to the namespace dataanalysis.

  4. Lastly, we output parameters that are important for the job run:
// Virtual cluster Id to reference in jobs
new cdk.CfnOutput(this, 'VirtualClusterId', { value: virtualCluster.attrId });

// Job config for each nodegroup
new cdk.CfnOutput(this, 'CriticalConfig', { value: emrEks.criticalDefaultConfig });

// Execution role arn
new cdk.CfnOutput(this, 'ExecRoleArn', { value: role.roleArn });

Deploy Amazon EMR Studio and provision users

To deploy an EMR Studio for data exploration and job authoring, the ARA library has a construct called NotebookPlatform. This construct allows you to deploy as many EMR Studios as you need (within the account limit) and set them up with the authentication mode that is suitable for you and assign users to them. To learn more about the authentication modes available in Amazon EMR Studio, refer to Choose an authentication mode for Amazon EMR Studio.

The construct creates all the necessary IAM roles and policies needed by Amazon EMR Studio. It also creates an S3 bucket where all the notebooks are stored by Amazon EMR Studio. The bucket is encrypted with a customer managed key (CMK) generated by the AWS CDK stack. The following steps show you how to create your own EMR Studio with the construct.

The notebook platform construct takes NotebookPlatformProps as a property, which allows you to define your EMR Studio, a namespace, the name of the EMR Studio, and its authentication mode.

  1. In lib/emr-eks-app.ts, add the following line:
    const notebookPlatform = new ara.NotebookPlatform(this, 'platform-notebook', {
    emrEks: emrEks,
    eksNamespace: 'dataanalysis',
    studioName: 'platform',
    studioAuthMode: ara.StudioAuthMode.IAM,
    });

    For this post, we use IAM users so that you can easily reproduce it in your own account. However, if you have IAM federation or single sign-on (SSO) already in place, you can use them instead of IAM users.To learn more about the parameters of NotebookPlatformProps, refer to NotebookPlatformProps.

    Next, we need to create and assign users to the Amazon EMR Studio. For this, the construct has a method called addUser that takes a list of users and either assigns them to Amazon EMR Studio in case of SSO or updates the IAM policy to allows access to Amazon EMR Studio for the provided IAM users. The user can also have multiple managed endpoints, and each user can have their Amazon EMR version defined. They can use a different set of Amazon Elastic Compute Cloud (Amazon EC2) instances and different permissions using job execution roles.

  2. In lib/emr-eks-app.ts, add the following line:
    notebookPlatform.addUser([{
    identityName:<NAME-OF-EXISTING-IAM-USER>,
    notebookManagedEndpoints: [{
    emrOnEksVersion: 'emr-6.8.0-latest',
    executionPolicy: emrEksPolicy,
    managedEndpointName: ‘myendpoint’
    }],
    }]);

    In the preceding code, for the sake of brevity, we reuse the same IAM policy that we created in the execution role.

    Note that the construct optimizes the number of managed endpoints that are created. If two endpoints have the same name, then only one is created.

  3. Now that we have defined our deployment, we can deploy it:
   npm run build && cdk deploy

You can find a sample project that contains all the steps of the walk through in the following GitHub repository.

When the deployment is complete, the output contains the S3 bucket containing the assets for podTemplate, the link for the EMR Studio, and the EMR Studio virtual cluster ID. The following screenshot shows the output of the AWS CDK after the deployment is complete.

CDK output
Submit jobs

Because we’re using the default Provisioners, we will use the podTemplate that is defined by the construct available on the ARA GitHub repository. These are uploaded for you by the construct to an S3 bucket called <clustername>-emr-eks-assets; you only need to refer to them in your Spark job. In this job, you also use the job parameters in the output at the end of the AWS CDK deployment. These parameters allow you to use the AWS Glue Data Catalog and implement Spark on Kubernetes best practices like dynamicAllocation and pod collocation. At the end of cdk deploy ARA will output job sample configurations with the best practices listed before that you can use to submit a job. You can submit a job as follows.

A job run is a unit of work such as a Spark JAR file that is submitted to the EMR on EKS cluster. We start a job using the start-job-run command. Note you can use SparkSubmitParameters to specify the Amazon S3 path to the pod template, as shown in the following command:

aws emr-containers start-job-run \

--virtual-cluster-id <CLUSTER-ID>\

--name <SPARK-JOB-NAME>\

--execution-role-arn <ROLE-ARN> \

--release-label emr-6.8.0-latest \

--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": ""<S3URI-SPARK-JOB>"
}
}' --configuration-overrides '{
"applicationConfiguration": [
{
"classification": "spark-defaults",
"properties": {
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",

"spark.sql.catalogImplementation": "hive",

"spark.dynamicAllocation.enabled":"true",

"spark.dynamicAllocation.minExecutors": "8",

"spark.dynamicAllocation.maxExecutors": "40",

"spark.kubernetes.allocation.batch.size": "8",

"spark.executor.cores": "8",

"spark.kubernetes.executor.request.cores": "7",

"spark.executor.memory": "28G",

"spark.driver.cores": "2",

"spark.kubernetes.driver.request.cores": "2",

"spark.driver.memory": "6G",

"spark.dynamicAllocation.executorAllocationRatio": "1",

"spark.dynamicAllocation.shuffleTracking.enabled": "true",

"spark.dynamicAllocation.shuffleTracking.timeout": "300s",

"spark.kubernetes.driver.podTemplateFile": s3://<EKS-CLUSTER-NAME>-emr-eks-assets-<ACCOUNT-ID>-<REGION> /<EKS-CLUSTER-NAME>/pod-template/critical-driver.yaml ",

"spark.kubernetes.executor.podTemplateFile": s3://<EKS-CLUSTER-NAME>-emr-eks-assets-<ACCOUNT-ID>-<REGION> /<EKS-CLUSTER-NAME>/pod-template/critical-executor.yaml "
}
}
],
"monitoringConfiguration": {
"cloudWatchMonitoringConfiguration": {
"logGroupName": ""<Log_Group_Name>",
"logStreamNamePrefix": "<Log_Stream_Prefix>"
}
}'

The code takes the following values:

  • <CLUSTER-ID> – The EMR virtual cluster ID
  • <SPARK-JOB-NAME> – The name of your Spark job
  • <ROLE-ARN> – The execution role you created
  • <S3URI-SPARK-JOB> – The Amazon S3 URI of your Spark job
  • <S3URI-CRITICAL-DRIVER> – The Amazon S3 URI of the driver pod template, which you get from the AWS CDK output
  • <S3URI-CRITICAL-EXECUTOR> – The Amazon S3 URI of the executor pod template
  • <Log_Group_Name> – Your CloudWatch log group name
  • <Log_Stream_Prefix> – Your CloudWatch log stream prefix

You can go to the Amazon EMR console to check the status of your job and to view logs. You can also check the status by running the describe-job-run command:

aws emr-containers describe-job-run --<CLUSTER-ID> cluster-id --id <JOB-RUN-ID>

Explore data using Amazon EMR Studio

In this section, we show how you can create a workspace in Amazon EMR Studio and connect to the Amazon EKS managed endpoint from the workspace. From the output, use the link to Amazon EMR Studio to navigate to the EMR Studio deployment. You must sign in with the IAM username you provided in the addUser method.

Create a Workspace

To create a Workspace, complete the following steps:

  1. Log in to the EMR Studio created by the AWS CDK.
  2. Choose Create Workspace.
  3. Enter a workspace name and an optional description.
  4. Select Allow Workspace Collaboration if you want to work with other Studio users in this Workspace in real time.
  5. Choose Create Workspace.

create-emr-studio-workspace

After you create the Workspace, choose it from the list of Workspaces to open the JupyterLab environment.
emr studio workspace running

The following screenshot shows what the terminal looks like. For more information about the user interface, refer to Understand the Workspace user interface.

EMR Studio workspace view

Connect to an EMR on EKS managed endpoint

You can easily connect to the EMR on EKS managed endpoint from the Workspace.

  1. In the navigation pane, on the Clusters menu, select EMR Cluster on EKS for Cluster type.
    The virtual clusters appear on the EMR Cluster on EKS drop-down menu, and the endpoint appears on the Endpoint drop-down menu. If there are multiple endpoints, they appear here, and you can easily switch between endpoints from the Workspace.
  2. Select the appropriate endpoint and choose Attach.
    attach to managedendpoint

Work with a notebook

You can now open a notebook and connect to a preferred kernel to do your tasks. For instance, you can select a PySpark kernel, as shown in the following screenshot.
select-kernel

Explore your data

The first step of our data exploration exercise is to create a Spark session and then load the New York taxi dataset from the S3 bucket into a data frame. Use the following code block to load the data into a data frame. Copy the Amazon S3 URI for the location where the dataset resides in Amazon S3.

	from pyspark.sql import SparkSession
	from pyspark.sql.functions import *
	from datetime import datetime
	spark = SparkSession.builder.appName("SparkEDAA").getOrCreate()

After we load the data into a data frame, we replace the data of the current_date column with the actual current date, count the number of rows, and save the data into a Parquet file:

print("Total number of records: " + str(updatedNYTaxi.count()))
updatedNYTaxi.write.parquet("<YOUR-S3-PATH>")

The following screenshot shows the result of our notebook running on Amazon EMR Studio and with PySpark running on Amazon EMR on EKS.
notebook execution

Clean up

To clean up after this post, run cdk destroy.

Conclusion

In this post, we showed how you can use the ARA to quickly deploy a data analytics infrastructure and start experimenting with your data. You can find the full example referenced in this post in the GitHub repository. The AWS Analytics Reference Architecture implements common Analytics pattern and AWS best practices to offer you ready to use constructs to for your experiments. One of the patterns is the data mesh, which you can consult how to use in this blog post.

You can also explore other constructs offered in this library to experiment with AWS Analytics services before transitioning your workload for production.


About the Authors

co-author-1Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Sandipan Bhaumik is a Senior Analytics Specialist Solutions Architect based in London. He has worked with customers in different industries like Banking & Financial Services, Healthcare, Power & Utilities, Manufacturing and Retail helping them solve complex challenges with large-scale data platforms. At AWS he focuses on strategic accounts in the UK and Ireland and helps customers to accelerate their journey to the cloud and innovate using AWS analytics and machine learning services. He loves playing badminton, and reading books.

How to investigate and take action on security issues in Amazon EKS clusters with Amazon Detective – Part 2

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-investigate-and-take-action-on-security-issues-in-amazon-eks-clusters-with-amazon-detective-part-2/

In part 1 of this of this two-part series, How to detect security issues in Amazon EKS cluster using Amazon GuardDuty, we walked through a real-world observed security issue in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster and saw how Amazon GuardDuty detected each phase by following MITRE ATT&CK tactics.

In this blog post, we’ll walk you through investigative techniques to use with Amazon Detective, paired with the GuardDuty EKS and malware findings from the security issue. After we have identified impacted resources through our investigation, we’ll provide example remediation tactics and preventative controls to address and help prevent security issues in EKS clusters.

Amazon Detective can help you investigate security issues and related resources in your account. Detective provides EKS coverage that you can enable within your accounts. When this coverage is enabled, Detective can help investigate and remediate potentially unauthorized EKS activity that results from misconfiguration of the control plane nodes or application. Although GuardDuty is not a prerequisite to enable Detective, it is recommended that you enable GuardDuty to enhance the visualization capabilities in Detective with GuardDuty findings.

Prerequisites

You must have the following services enabled in your AWS account to generate and investigate findings associated with EKS security events in a similar manner as outlined in this blog. If you do not have GuardDuty enabled, you can still investigate with Detective, but in a limited capacity.

Investigate with Amazon Detective

In the five phases we walked through in part 1, we discussed GuardDuty findings and MITRE ATT&CK tactics that can help you detect and understand each phase of the unauthorized activity, from the initial misconfiguration to the impact on our application when the EKS cluster is used for crypto mining.

The next recommended step is to investigate the EKS cluster and any associated resources. Amazon Detective can help you to investigate whether there was any other related unauthorized activity in the environment. We will walk through Detective capabilities for visualizing and gathering important information to effectively respond to the security issue. If you’re interested in creating detailed incident response playbooks for your security team to follow in your own environment, refer to these sample AWS incident response playbooks.

Depending on your scenario, there are various resources you can use to start your investigation, such as Security Hub findings, GuardDuty findings, related Kubernetes subjects, or an AWS account’s AWS CloudTrail activity. For our walkthrough, we’ll start our investigation from the GuardDuty finding and use the EKS cluster resource to pivot to the Detective console, as shown in Figure 7. Although we initially focus on the EKS cluster, you could start from any entities that are supported in the Detective behavior graph structure in the Amazon Detective User Guide. For example, we could start directly with the Kubernetes subject system:anonymous and find activity associated with the anonymous user.

Figure 7: Example Detective popup from GuardDuty finding for EKS cluster

Figure 7: Example Detective popup from GuardDuty finding for EKS cluster

We’ll now go over the information that you would need to gather from Detective in order to investigate the example security issue.

To investigate EKS cluster findings with Detective

  1. In the GuardDuty console, navigate to an individual finding and hover over Investigate with Detective. Choose one of the specific resources to start. In the image below, we selected the EKS cluster resource to investigate with Detective. You will need to gather some preliminary information about the IAM roles associated with the EKS cluster.
    • Questions: When was the cluster created? What IAM role created the cluster? What IAM role is assigned to the cluster?
    • Why it matters: If you are an incident responder, these details can potentially help you identify the owner of the cluster and help you determine what IAM principals are involved.
    • What next: Start looking into each IAM principal’s activity, as seen in CloudTrail, to investigate whether the IAM entity itself is potentially compromised or what other resources may have been impacted.
    Figure 8: Detective summary page for EKS cluster metadata details

    Figure 8: Detective summary page for EKS cluster metadata details

  2. Next, on the EKS cluster overview page, you can see the container details associated with the cluster.
    • Question: What are some of the other container details for the cluster? Does anything look out of the ordinary? Is it using a public image? Is it missing a network policy?
    • Why it matters: Based on the architecture related to this cluster, you might be able to use this information to determine whether there are unauthorized containers. The contents of unauthorized containers will depend on your organization but typically consist of public images or unauthorized RBAC, pod security policies, or network policy configurations. It’s important to keep in mind that when you look at data in Detective, the scope time is very important. When you pivot from a GuardDuty finding, the scope time will be set to the first time the GuardDuty finding was seen to the last time the finding was seen. The container details reflect the containers that were running during the selected scope time. Changing the scope time might change the containers that are listed in the table shown in Figure 9.
    • What next: Information found on this page can help to highlight unauthorized resources or configurations that will need to be remediated. You will also need to look at how these resources were initially created and if there are missing guardrails that should have been created during the provisioning of the cluster.
    Figure 9: Detective summary page for EKS container metadata details

    Figure 9: Detective summary page for EKS container metadata details

  3. Finally, you will see associated security findings with this specific EKS cluster, similar to Figure 10, at the bottom of the EKS cluster overview page in Detective.
    • Question: Are there any other security findings associated with this cluster that I previously was not aware of?
    • Why it matters: In our example scenario, we walked through the findings that were initially detected and the events that unfolded from those findings. After further investigation, you might see other findings that were not part of the original investigation. This can occur if your security team is only investigating specific findings or severity values. The finding for PrivilegeEscalation:Kubernetes/PrivilegedContainer informs you that a privileged container was launched on your Kubernetes cluster by using an image that has never before been used to launch privileged containers in your cluster. A privileged container has root level access to the host. The other finding, Persistence:Kubernetes/ContainerWithSensitiveMount, informs you that a container was launched with a configuration that included a sensitive host path with write access in the volumeMounts section. This makes the sensitive host path accessible and writable from inside the container. Any finding associated to the suspicious or compromised cluster is valuable because it provides additional insight into what the unauthorized entity was trying to accomplish after the initial detection.
    • What next: With Detective, you might want to continue your investigation by selecting each of these findings and reviewing all details related to the finding. Depending on the findings, you could bring in additional team members to help investigate further. For this example, we will move on to the next step.
    Figure 10: Example Detective summary of security findings associated with the EKS cluster

    Figure 10: Example Detective summary of security findings associated with the EKS cluster

  4. Shift from the EKS cluster overview section to the Kubernetes API activity section, similar to Figure 11 below. This will give you the opportunity to dig into the API activity associated with this cluster.
    1. Question: What other Kubernetes API activity was attempted from the cluster? Which API calls were successful? Which API calls failed? What was the unauthorized user trying to do?
    2. Why it matters: It’s important to determine which actions were successfully invoked by the unauthorized user so that appropriate remediation actions can be taken. You can look at trends of successful and failed API calls, and can even search by Subject, IP address, or Kubernetes API call.
    3. What next: You might want to look at all cluster role binding from days before the first GuardDuty finding was seen to determine if there was any other suspicious activity you should be investigating regarding the cluster.
    Figure 11: Example Detective summary page for Kubernetes API activity on the EKS cluster

    Figure 11: Example Detective summary page for Kubernetes API activity on the EKS cluster

  5. Next, you will want to look at the Newly observed Kubernetes API calls section, similar to Figure 12 below.
    • Question: What are some of the more recent Kubernetes API calls? What are they trying to access right now and are they successful? Do I need to start taking action for other resources outside of EKS?
    • Why it matters: This data shows Kubernetes subjects who were observed issuing API calls to this cluster for the first time during our scope time. Detective provides you this information by keeping a baseline of the activity associated with supported AWS resources. This can help you more quickly determine whether activity might be suspicious and worth looking into. In our example, we used the search functionality to look at API calls associated with the built-in Kubernetes secrets management. A common way to start your search is to see if an unauthorized user has successfully accessed any secrets, which can help you determine what information you might want to search in the overall API call volume section discussed in step 4.
    • What next: If the unauthorized user has successfully accessed any secret, those secrets should be marked as compromised, and they should be rotated immediately.
    Figure 12: Example Detective summary for newly observed Kubernetes API calls from the EKS cluster

    Figure 12: Example Detective summary for newly observed Kubernetes API calls from the EKS cluster

  6. You can also consider the following question when you look at the Newly observed Kubernetes API calls section.
    • Question: Has the IP address associated with the finding been communicating with any other resources in our environment, and if so, what are the details of that communication?
    • Why it matters: To answer this question, you can use Detective’s search functionality and the ability to use wild cards to search for IP addresses with the same first three octets. Also note that you can use CIDR notation to search, as well. Based on the results in the example in Figure 13, you can see that there are a number of related IP addresses associated with the environment. With this information, you now can look at the traffic associated with these different IPs and what resources they were communicating with.
    Figure 13: Example Detective results page from a query against IP addresses associated with the EKS cluster

    Figure 13: Example Detective results page from a query against IP addresses associated with the EKS cluster

  7. You can select one of the IP addresses in the search results to get more information related to it, similar to Figure 14 below.
    1. Question: What was the first time an IP address was observed in the environment? When was the last time it was observed?
    2. Why it matters: You can use this information to start isolating where unauthorized activity is coming from and what actions are being taken. You can also start creating a time series of unauthorized activity and scope.
    3. What next: You can repeat some of the previous investigation steps for each IP address, like looking at the different tabs to review New behavior, Resource interaction, and Kubernetes activity.
    Figure 14: Example Detective results page for specific IP address and associated metadata details

    Figure 14: Example Detective results page for specific IP address and associated metadata details

In summary, we began our investigation with a GuardDuty finding about an anonymous API request that was successful in using system:anonymous on one of our EKS clusters. We then used Detective to investigate and visualize activity associated with that EKS cluster, such as volume of successful or unsuccessful API requests, where and when those actions were attempted and other security findings associated with the resource. Once we have completed the investigation, we can confirm scope and impact of the security event and start moving towards taking action.

Remediation techniques for Amazon EKS

In this section, we will focus on how to remediate the security issue in our example. Your actions will vary based on your organization and the resources affected. It’s important to note that these actions will impact the EKS cluster and associated workloads, and should accordingly be performed by or coordinated with the cluster operator.

Before you take action on the EKS cluster, you will need to preserve forensic artifacts and evidence for the impacted EKS resources. The order of operations for these actions matters, because you want to get all the data from forensic artifacts in order to determine the overall impact to the resources affected. If you quarantine resources before you capture forensic artifacts, there is a risk that running processes will be interrupted or that the malware attempts to destroy resources that are valuable to a forensics investigation, to cover its tracks.

To preserve forensic evidence

  1. Enable termination protection on the impacted worker node and change the shutdown behavior to Stop.
  2. Label the offending pod or node with a label indicating that it is part of an active investigation.
  3. Cordon the worker node.
  4. Capture both volatile (temporary memory) and non-volatile (Amazon EBS snapshots) artifacts on the worker node.

Now that you have the forensic evidence, you can start to quarantine your EKS resources to restrict unauthorized network communication. The main objective is to prevent the affected EKS pods from communicating with internal resources or exfiltrating data externally.

To quarantine EKS resources

  1. Isolate the pod by creating a network policy that denies ingress and egress traffic to the pod.
  2. Attach a security group to the host and remove inbound and outbound rules. Take this action if you believe the underlying host has been compromised.

    Depending on existing inbound and outbound rules on the security group, the connections will either be tracked or untracked. Applying an isolation security group will drop untracked connections. For tracked connections, new connections with the host will not be allowed from the isolation security group, but existing tracked connections will not be interrupted.

    Important: This action will affect all containers running on the host.

  3. Attach a deny rule for the EKS resources in a network access control list (network ACL). Because network ACLs are stateless firewalls, all connections will be interrupted, whether they are tracked or untracked connections.

    Important: This action will affect all subnets using the network ACL and all resources within those subnets.

At this point, the affected EKS resources are quarantined, but the cluster is still configured to allow anonymous, unauthenticated access. You will need to remove all unauthorized permissions that were created or added.

To remove unauthorized permissions

  1. Update the RBAC configuration to remove system:anonymous access.
  2. Revoke temporary security credentials that are assigned to the pod or worker node, if necessary. You can also remove the IAM role associated with the EKS resources.

    Note: Removing IAM policies or attaching IAM policies to restrict permissions will affect the resources that are using the IAM role.

  3. Remove any unauthorized ClusterRoleBinding created by the system:anonymous user.
  4. Redeploy the compromised pod or workload resource.

The actions taken so far primarily target the EKS resource, but based on our Detective investigation, there are other actions you might need to take. Because secrets were involved that could be used outside of the EKS cluster, those secrets will need to be rotated wherever they are referenced. Detective will also suggest additional areas where you can investigate and remediate additional unauthorized activity in your AWS account.

It is important that your team go through game days or run-throughs for investigating and responding to different scenarios in order to make sure the team is prepared. You can run through the EKS security workshop to get your security team more familiar with remediation for EKS.

For more information about responding to EKS cluster related security issues, refer to GuardDuty EKS remediation in the GuardDuty User Guide and the EKS Best Practices Guide.

Preventative controls for EKS

This section covers several preventative controls that you can use to protect EKS clusters.

How can I prevent external access to the EKS cluster?

To help prevent external access to your EKS clusters, limit the exposure of your API server. You can achieve that in two ways:

  1. Set the API server endpoint access to Private. This will effectively forbid anyone outside of the VPC to send Kubernetes API requests to your EKS cluster.
  2. Set an IP address allow list for the EKS cluster public access endpoint.

How can I prevent giving admin access to the EKS cluster?

To help prevent an EKS cluster user from granting any type of access to anonymous or unauthenticated users, you can set up a ValidatingAdmissionWebhook. This is a special type of Kubernetes admission controller that can be configured in the Kubernetes API. (To learn how to build serverless admission webhooks, see the blog post Building serverless admission webhooks for Kubernetes with AWS SAM.)

The ValidatingAdmissionWebhook will deny a Kubernetes API request that matches all of the following checks:

  1. The request is creating or modifying a ClusterRoleBinding or RoleBinding.
  2. The subjects section contains either of the following:
    • The user system:anonymous
    • The group system:unauthenticated

How can I prevent malicious images from being deployed?

Now that you have set controls to prevent external access to the EKS cluster and prevent granting access to anonymous users, you can focus on preventing the deployment of potentially malicious images.

Malicious container images can have different origins, including:

  1. Images stored in public or unauthorized registries
  2. Images replacing the ones that are stored in authorized registries
  3. Authorized images that contain software with existing or newly discovered vulnerabilities

You can address these sources of malicious images by doing the following:

  1. Use admission controllers to verify that images meet your organization’s requirements, including for the image origin. You can also refer to this this blog post to implement a solution with a webhook and admission controllers.
  2. Enable tag immutability in your registry, a control that prevents an actor from maliciously replacing container images without changing the image’s tags. Additionally, you can enable an AWS Config rule to check tag immutability
  3. Configure another ValidatingAdmissionWebhook that will only accept images if they meet all of the following criteria.
    1. Images that come from approved registries.
    2. Images that pass the vulnerability scan during deployment time.
    3. Images that are signed by a trusted party. Amazon Elastic Container Registry (Amazon ECR) is working on a product enhancement to store image signatures. Currently, you can use an open-source cosign tool to verify and store image signatures.

      Note: These criteria can vary based on your use case and internal security and compliance standards.

The above controls will help prevent the deployment of a vulnerable, unauthorized, or potentially malicious container image.

How can I prevent lateral movement inside the cluster?

To prevent lateral movement inside the cluster, it is recommended to use network policies, as follows:

  • Enforce Kubernetes network policies to enforce ingress and egress controls within the cluster. You can implement these policies by following the steps in the Securing your cluster with network policies EKS workshop.

It’s important to note that you could use security groups for the same purpose, but pod security groups should only be used if the cluster is compromised and when you want to control the traffic between a pod and a resource that resides in the VPC, not inter-pod traffic.

In this section, we’ve reviewed different preventative controls that could have helped mitigate our example security incident. With the first preventative control, we could have prevented external actors from connecting to the API server. The second control could have prevented granting access to anonymous users. The third control could have prevented the deployment of an unauthorized or vulnerable container image. Finally, the fourth control could have helped limit the impact of the deployed vulnerable images to only the pods where the images were deployed, making it harder to laterally move to other pods in the cluster.

Conclusion

In this post, we walked you through how to investigate an EKS cluster related security issue with Amazon Detective. We also provided some recommended remediation and preventative controls to put in place for the EKS cluster specific security issues. When pairing GuardDuty’s ability for continuous threat detection and monitoring with Detective’s organization and visualization capabilities, you enable your security team to conduct faster and more effective investigation. By providing the security team the ability quickly view an organized set of data associated with security events within your AWS account, you reduce the overall Mean Time to Respond (MTTR).

Now that you understand the investigative capabilities with Detective, it’s time to try things out! It is important that you provide a mechanism for your security team to practice detection, investigation, and remediation techniques using security incident response simulations. By periodically running simulations, your security team will be prepared to quickly respond to possible security events. You can find more detailed incident response playbooks that can assist you in preparing for events in your environment, see these sample AWS incident response playbooks.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on Amazon GuardDuty re:Post.

Want more AWS Security news? Follow us on Twitter.

Author

Marshall Jones

Marshall is a worldwide senior security specialist solutions architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he helps enterprise customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Jonathan Nguyen

Jonathan Nguyen

Jonathan is a shared delivery team senior security consultant at AWS. His background is in AWS security, with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy, deploy security solutions at scale, and train customers on AWS security best practices.

Manuel Martinez Arizmendi

Manuel Martinez Arizmendi

Manuel works a Security Engineer at Amazon Detective providing new security investigation capabilities to AWS customers. Based on Boston,MA and originally from Madrid, Spain, when he’s not at work, he enjoys playing and watching soccer, playing videogames, and hanging out with his friends.

How to detect security issues in Amazon EKS clusters using Amazon GuardDuty – Part 1

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-detect-security-issues-in-amazon-eks-clusters-using-amazon-guardduty-part-1/

In this two-part blog post, we’ll discuss how to detect and investigate security issues in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon GuardDuty and Amazon Detective.

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run and scale container workloads by using Kubernetes in the AWS Cloud, which can help increase the speed of deployment and portability of modern applications. Amazon EKS provides secure, managed Kubernetes clusters on the AWS control plane by default. Kubernetes configurations such as pod security policies, runtime security, and network policies and configurations are specific for your organization’s use-case and securing them adequately would be a customer’s responsibility within AWS’ shared responsibility model.

Amazon GuardDuty can help you continuously monitor and detect suspicious activity related to AWS resources in your account. GuardDuty for EKS protection is a feature that you can enable within your accounts. When this feature is enabled, GuardDuty can help detect potentially unauthorized EKS activity resulting from misconfiguration of the control plane nodes or application.

In this post, we’ll walk through the events leading up to a real-world security issue that occurred due to EKS cluster misconfiguration, discuss how those misconfigurations could be used by a malicious actor, and how Amazon GuardDuty monitors and identifies suspicious activity throughout the EKS security event. In part 2 of the post, we’ll cover Amazon Detective investigation capabilities, possible remediation techniques, and preventative controls for EKS cluster related security issues.

Prerequisites

You must have AWS GuardDuty enabled in your AWS account in order to monitor and generate findings associated with an EKS cluster related security issue in your environment.

EKS security issue walkthrough

Before jumping into the security issue, it is important to understand how the AWS shared responsibility model applies to the Amazon EKS managed service. AWS is responsible for the EKS managed Kubernetes control plane and the infrastructure to deliver EKS in a secure and reliable manner. You have the ability to configure EKS and how it interacts with other applications and services, where you are responsible for making sure that secure configurations are being used.

The following scenario is based on a real-world observed event, where a malicious actor used Kubernetes compromise tactics and techniques to expose and access an EKS cluster. We use this example to show how you can use AWS security services to identify and investigate each step of this security event. For a security event in your own environment, the order of operations and the investigative and remediation techniques used might be different. The scenario is broken down into the following phases and associated MITRE ATT&CK tactics:

  • Phase 1 – EKS cluster misconfiguration
  • Phase 2 (Discovery) – Discovery of vulnerable EKS clusters
  • Phase 3 (Initial Access) – Credential access to obtain Kubernetes secrets
  • Phase 4 (Persistence) – Impact to persist unauthorized access to the cluster
  • Phase 5 (Impact) – Impact to manipulate resources for unauthorized activity

Phase 1 – EKS cluster misconfiguration

By default, when you provision an EKS cluster, the API cluster endpoint is set to public, meaning that it can be accessed from the internet. Despite being accessible from the internet, the endpoint is still considered secure because it requires all API requests to be authenticated by AWS Identity and Access Management (IAM) and then authorized by Kubernetes role-based access control (RBAC). Also, the entity (user or role) that creates the EKS cluster is automatically granted system:masters permissions, which allows the entity to modify the EKS cluster’s RBAC configuration.

This example scenario starts with a developer who has access to administer EKS clusters in an AWS account. The developer wants to work from their home network and doesn’t want to connect to their enterprise VPN for IAM role federation. They configure an EKS cluster API without setting up the proper authentication and authorization components. Instead, the developer grants explicit access to the system:anonymous user in the cluster’s RBAC configuration. (Alternatively, an unauthorized RBAC configuration could be introduced into your environment after a developer unknowingly installs a malicious helm chart from the internet without reviewing or inspecting it first.)

In Kubernetes anonymous requests, unauthenticated and unrejected HTTP requests are treated as anonymous access and are identified as a system:anonymous user belonging to a system:unauthenticated group. This means that any entity on the internet can access the cluster and make API requests that are permitted by the role. There aren’t many legitimate use cases for this type of activity, because it’s considered a best practice to use RBAC instead. Anonymous requests are primarily used for setting up health endpoints and custom authentication.

By monitoring EKS audit logs, GuardDuty identifies this activity and generates the finding Policy:Kubernetes/AnonymousAccessGranted, as shown in Figure 1. This finding informs you that a user on your Kubernetes cluster successfully created a ClusterRoleBinding or RoleBinding to bind the user system:anonymous to a role. This action enables unauthenticated access to the API operations permitted by the role.

Figure 1: Example GuardDuty finding for Kubernetes anonymous access granted

Figure 1: Example GuardDuty finding for Kubernetes anonymous access granted

Phase 2 (Discovery) – Discovery of vulnerable EKS clusters

Port scanning is a method that malicious actors use to determine if resources are publicly exposed, with open ports and known vulnerabilities. As an increasing number of open-source tools allows users to search for endpoints connected to the internet, finding these endpoints has become even easier. Security teams can use these open-source tools to their advantage by proactively scanning for and identifying externally exposed resources in their organization.

This brings us to the discovery phase of our misconfigured EKS cluster. The discovery phase is defined by MITRE as follows: “Discovery consists of techniques an adversary may use to gain knowledge about the system and internal network. These techniques help adversaries observe the environment and orient themselves before deciding how to act.”

By granting system:anonymous access to the EKS cluster in our example, the developer allowed requests from any public unauthenticated source. This can result in external web crawlers probing the cluster API, which can often happen within seconds of the system:anonymous access being granted. GuardDuty identifies this activity and generates the finding Discovery:Kubernetes/SuccessfulAnonymousAccess, as shown in Figure 2. This finding informs you that an API operation to discover resources in a cluster was successfully invoked by the system:anonymous user. Remember, all API calls made by system:anonymous are unauthenticated, in addition to /healthz and /version calls that are always unauthenticated regardless of the user identity, and any entity can make use of this user within the EKS cluster.

In the screenshot, under the Action section in the finding details, you can see that the anonymous user made a get request to “/”. This is a generic request that is not specific to a Kubernetes cluster, which may indicate that the crawler is not specifically targeting Kubernetes clusters. You can further see that the Status code is 200, indicating that the request was successful. If this activity is malicious, then the actor is now aware that there is an exposed resource.

Figure 2: Example GuardDuty finding for Kubernetes successful anonymous access

Figure 2: Example GuardDuty finding for Kubernetes successful anonymous access

Phase 3 (Initial Access) – Credential access to obtain Kubernetes secrets

Next, in this phase, you might start observing more targeted API calls for establishing initial access from unauthorized users. MITRE defines initial access as “techniques that use various entry vectors to gain their initial foothold within a network. Techniques used to gain a foothold include targeted spearphishing and exploiting weaknesses on public-facing web servers. Footholds gained through initial access may allow for continued access, like valid accounts and use of external remote services, or may be limited-use due to changing passwords.”

In our example, the malicious actor has established initial access for the EKS cluster which is evident in the next GuardDuty finding, CredentialAccess:Kubernetes/SuccessfulAnonymousAccess, as shown in Figure 3. This finding informs you that an API call to access credentials or secrets was successfully invoked by the system:anonymous user. The observed API call is commonly associated with the credential access tactic where an adversary is attempting to collect passwords, usernames, and access keys for a Kubernetes cluster.

You can see that in this GuardDuty finding, in the Action section, the Request uri is targeted at a Kubernetes cluster, specifically /api/v1/namespaces/kube-system/secrets. This request seems to be targeting the secrets management capabilities that are built into Kubernetes. You can find more information about this secrets management capability in the Kubernetes documentation.

Figure 3: Example GuardDuty finding for Kubernetes successful credential access from anonymous user

Figure 3: Example GuardDuty finding for Kubernetes successful credential access from anonymous user

Phase 4 (Persistence) – Impact to persist unauthorized access to the cluster

The next phase of this scenario is likely to be an impact in the EKS cluster to enable persistence by the malicious actor. MITRE defines impact as “techniques that adversaries use to disrupt availability or compromise integrity by manipulating business and operational processes.” Following the MITRE definitions, “Persistence consists of techniques that adversaries use to keep access to systems across restarts, changed credentials, and other interruptions that could cut off their access. Techniques used for persistence include any access, action, or configuration changes that let them maintain their foothold on systems, such as replacing or hijacking legitimate code or adding startup code.”

In the GuardDuty finding Impact:Kubernetes/SuccessfulAnonymousAccess, shown in Figure 4, you can see the Kubernetes user details and Action sections that indicate that a successful Kubernetes API call was made to create a ClusterRoleBinding by the system:anonymous username. This finding informs you that a write API operation to tamper with resources was successfully invoked by the system:anonymous user. The observed API call is commonly associated with the impact stage of an attack, when an adversary is tampering with resources in your cluster. This activity shows that the system:anonymous user has now created their own role to enable persistent access the EKS cluster. If the user is malicious, they can now access the cluster even if access is removed in the RBAC configuration for the system:anonymous user.

Figure 4 Example GuardDuty finding for Kubernetes successful credential change by anonymous user

Figure 4 Example GuardDuty finding for Kubernetes successful credential change by anonymous user

Phase 5 (Impact) – Impact to manipulate resources for unauthorized activity

The fifth phase of this scenario is where the unauthorized user is likely to focus on impact techniques in order to use the access for malicious purpose. MITRE says of the impact phase: “Techniques used for impact can include destroying or tampering with data. In some cases, business processes can look fine, but may have been altered to benefit the adversaries’ goals. These techniques might be used by adversaries to follow through on their end goal or to provide cover for a confidentiality breach.” Typically, once a malicious actor has access into a system, they will introduce malware to the system to manipulate the compromised resource and possibly also other resources.

With the introduction of GuardDuty Malware Protection, when an Amazon Elastic Compute Cloud (Amazon EC2) or container-related GuardDuty finding that indicates potentially suspicious activity is generated, an agentless scan on the volumes will initiate and detect the presence of malware. Existing GuardDuty customers need to enable Malware Protection, and for new customers this feature is on by default when they enable GuardDuty for the first time. Malware Protection comes with a 30-day free trial for both existing and new GuardDuty customers. You can see a list of findings that initiates a malware scan in the GuardDuty User Guide.

In this example, the malicious actor now uses access to the cluster to perform unauthorized cryptocurrency mining. GuardDuty monitors the DNS requests from the EC2 instances used to host the EKS cluster. This allows GuardDuty to identify a DNS request made to a domain name associated with a cryptocurrency mining pool, and generate the finding CryptoCurrency:EC2/BitcoinTool.B!DNS, as shown in Figure 5.

Figure 5: Example GuardDuty finding for EC2 instance querying bitcoin domain name

Figure 5: Example GuardDuty finding for EC2 instance querying bitcoin domain name

Because this is an EC2 related GuardDuty finding and GuardDuty Malware Protection is enabled in the account, GuardDuty then conducts an agentless scan on the volumes of the EC2 instance to detect malware. If the scan results in a successful detection of one or more malicious files, another GuardDuty finding for Execution:EC2/MaliciousFile is generated, as shown in Figure 6.

Figure 6: Example GuardDuty finding for detection of a malicious file on EC2

Figure 6: Example GuardDuty finding for detection of a malicious file on EC2

The first GuardDuty finding detects crypto mining activity, while the proceeding malware protection finding provides context on the malware associated with this activity. This context is very valuable for the incident response process.

Conclusion

In this post, we walked you through each of the five phases where we outlined how an initial misconfiguration could result in a malicious actor gaining control of EKS resources within an AWS account and how GuardDuty is able to continually monitor and detect the progression of the security event. As previously stated, this is just one example where a misconfiguration in an EKS cluster could result in a security event.

Now that you have a good understanding of GuardDuty capabilities to continuously monitor and detect EKS security events, you will need to establish processes and procedures to enable your security team to investigate these events. You can enable Amazon Detective to help accelerate your security team’s mean time to respond (MTTR) by providing an efficient mechanism to analyze, investigate, and identify the root cause of security events. Follow along in part 2 of this series, How to investigate and take action on an Amazon EKS cluster related security issue with Amazon Detective, where we’ll cover techniques you can use with Amazon Detective to identify impacted EKS resources in your AWS account, possible remediation actions to take on the cluster, and preventative controls you can implement.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on Amazon GuardDuty re:Post.

Want more AWS Security news? Follow us on Twitter.

Author

Marshall Jones

Marshall is a worldwide senior security specialist solutions architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he helps enterprise customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Jonathan Nguyen

Jonathan Nguyen

Jonathan is a shared delivery team senior security consultant at AWS. His background is in AWS security, with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy, deploy security solutions at scale, and train customers on AWS security best practices.

Manuel Martinez Arizmendi

Manuel Martinez Arizmendi

Manuel works a Security Engineer at Amazon Detective providing new security investigation capabilities to AWS customers. Based on Boston,MA and originally from Madrid, Spain, when he’s not at work, he enjoys playing and watching soccer, playing videogames, and hanging out with his friends.

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/design-considerations-for-amazon-emr-on-eks-in-a-multi-tenant-amazon-eks-environment/

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several reasons. For example, it allows you to have multiple Spark environments running concurrently with different configurations and dependencies that are segregated from each other through Kubernetes multi-tenancy features. In addition, the same cluster can be used for various workloads like machine learning (ML), host applications, data streaming and thereby reducing operational overhead of managing multiple clusters.

AWS offers Amazon EMR on EKS, a managed service that enables you to run your Apache Spark workloads on Amazon EKS. This service uses the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less. When you run Spark jobs on EMR on EKS and not on self-managed Apache Spark on Kubernetes, you can take advantage of automated provisioning, scaling, faster runtimes, and the development and debugging tools that Amazon EMR provides

In this post, we show how to configure and run EMR on EKS in a multi-tenant EKS cluster that can used by your various teams. We tackle multi-tenancy through four topics: network, resource management, cost management, and security.

Concepts

Throughout this post, we use terminology that is either specific to EMR on EKS, Spark, or Kubernetes:

  • Multi-tenancy – Multi-tenancy in Kubernetes can come in three forms: hard multi-tenancy, soft multi-tenancy and sole multi-tenancy. Hard multi-tenancy means each business unit or group of applications gets a dedicated Kubernetes; there is no sharing of the control plane. This model is out of scope for this post. Soft multi-tenancy is where pods might share the same underlying compute resource (node) and are logically separated using Kubernetes constructs through namespaces, resource quotas, or network policies. A second way to achieve multi-tenancy in Kubernetes is to assign pods to specific nodes that are pre-provisioned and allocated to a specific team. In this case, we talk about sole multi-tenancy. Unless your security posture requires you to use hard or sole multi-tenancy, you would want to consider using soft multi-tenancy for the following reasons:
    • Soft multi-tenancy avoids underutilization of resources and waste of compute resources.
    • There is a limited number of managed node groups that can be used by Amazon EKS, so for large deployments, this limit can quickly become a limiting factor.
    • In sole multi-tenancy there is high chance of ghost nodes with no pods scheduled on them due to misconfiguration as we force pods into dedicated nodes with label, taints and tolerance and anti-affinity rules.
  • Namespace – Namespaces are core in Kubernetes and a pillar to implement soft multi-tenancy. With namespaces, you can divide the cluster into logical partitions. These partitions are then referenced in quotas, network policies, service accounts, and other constructs that help isolate environments in Kubernetes.
  • Virtual cluster – An EMR virtual cluster is mapped to a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters don’t create any active resources that contribute to your bill or require lifecycle management outside the service.
  • Pod template – In EMR on EKS, you can provide a pod template to control pod placement, or define a sidecar container. This pod template can be defined for executor pods and driver pods, and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 locations are then submitted as part of the applicationConfiguration object that is part of configurationOverrides, as defined in the EMR on EKS job submission API.

Security considerations

In this section, we address security from different angles. We first discuss how to protect IAM role that is used for running the job. Then address how to protect secrets use in jobs and finally we discuss how you can protect data while it is processed by Spark.

IAM role protection

A job submitted to EMR on EKS needs an AWS Identity and Access Management (IAM) execution role to interact with AWS resources, for example with Amazon S3 to get data, with Amazon CloudWatch Logs to publish logs, or use an encryption key in AWS Key Management Service (AWS KMS). It’s a best practice in AWS to apply least privilege for IAM roles. In Amazon EKS, this is achieved through IRSA (IAM Role for Service Accounts). This mechanism allows a pod to assume an IAM role at the pod level and not at the node level, while using short-term credentials that are provided through the EKS OIDC.

IRSA creates a trust relationship between the EKS OIDC provider and the IAM role. This method allows only pods with a service account (annotated with an IAM role ARN) to assume a role that has a trust policy with the EKS OIDC provider. However, this isn’t enough, because it would allow any pod with a service account within the EKS cluster that is annotated with a role ARN to assume the execution role. This must be further scoped down using conditions on the role trust policy. This condition allows the assume role to happen only if the calling service account is the one used for running a job associated with the virtual cluster. The following code shows the structure of the condition to add to the trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": <OIDC provider ARN >
            },
            "Action": "sts:AssumeRoleWithWebIdentity"
            "Condition": { "StringLike": { “<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>”} }
        }
    ]
}

To scope down the trust policy using the service account condition, you need to run the following the command with AWS CLI:

aws emr-containers update-role-trust-policy \
–cluster-name cluster \
–namespace namespace \
–role-name iam_role_name_for_job_execution

The command will the add the service account that will be used by the spark client, Jupyter Enterprise Gateway, Spark kernel, driver or executor. The service accounts name have the following structure emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>.

In addition to the role segregation offered by IRSA, we recommend blocking access to instance metadata because a pod can still inherit the rights of the instance profile assigned to the worker node. For more information about how you can block access to metadata, refer to Restrict access to the instance profile assigned to the worker node.

Secret protection

Sometime a Spark job needs to consume data stored in a database or from APIs. Most of the time, these are protected with a password or access key. The most common way to pass these secrets is through environment variables. However, in a multi-tenant environment, this means any user with access to the Kubernetes API can potentially access the secrets in the environment variables if this access isn’t scoped well to the namespaces the user has access to.

To overcome this challenge, we recommend using a Secrets store like AWS Secrets Manager that can be mounted through the Secret Store CSI Driver. The benefit of using Secrets Manager is the ability to use IRSA and allow only the role assumed by the pod access to the given secret, thereby improving your security posture. You can refer to the best practices guide for sample code showing the use of Secrets Manager with EMR on EKS.

Spark data encryption

When a Spark application is running, the driver and executors produce intermediate data. This data is written to the node local storage. Anyone who is able to exec into the pods would be able to read this data. Spark supports encryption of this data, and it can be enabled by passing --conf spark.io.encryption.enabled=true. Because this configuration adds performance penalty, we recommend enabling data encryption only for workloads that store and access highly sensitive data and in untrusted environments.

Network considerations

In this section we discuss how to manage networking within the cluster as well as outside the cluster. We first address how Spark handle cross executors and driver communication and how to secure it. Then we discuss how to restrict network traffic between pods in the EKS cluster and allow only traffic destined to EMR on EKS. Last, we discuss how to restrict traffic of executors and driver pods to external AWS service traffic using security groups.

Network encryption

The communication between the driver and executor uses RPC protocol and is not encrypted. Starting with Spark 3 in the Kubernetes backed cluster, Spark offers a mechanism to encrypt communication using AES encryption.

The driver generates a key and shares it with executors through the environment variable. Because the key is shared through the environment variable, potentially any user with access to the Kubernetes API (kubectl) can read the key. We recommend securing access so that only authorized users can have access to the EMR virtual cluster. In addition, you should set up Kubernetes role-based access control in such a way that the pod spec in the namespace where the EMR virtual cluster runs is granted to only a few selected service accounts. This method of passing secrets through the environment variable would change in the future with a proposal to use Kubernetes secrets.

To enable encryption, RPC authentication must also be enabled in your Spark configuration. To enable encryption in-transit in Spark, you should use the following parameters in your Spark config:

--conf spark.authenticate=true

--conf spark.network.crypto.enabled=true

Note that these are the minimal parameters to set; refer to Encryption from the complete list of parameters.

Additionally, applying encryption in Spark has a negative impact on processing speed. You should only apply it when there is a compliance or regulation need.

Securing Network traffic within the cluster

In Kubernetes, by default pods can communicate over the network across different namespaces in the same cluster. This behavior is not always desirable in a multi-tenant environment. In some instances, for example in regulated industries, to be compliant you want to enforce strict control over the network and send and receive traffic only from the namespace that you’re interacting with. For EMR on EKS, it would be the namespace associated to the EMR virtual cluster. Kubernetes offers constructs that allow you to implement network policies and define fine-grained control over the pod-to-pod communication. These policies are implemented by the CNI plugin; in Amazon EKS, the default plugin would be the VPC CNI. A policy is defined as follows and is applied with kubectl:

Kind: NetworkPolicy
metadata:
  name: default-np-ns1
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>

Network traffic outside the cluster

In Amazon EKS, when you deploy pods on Amazon Elastic Compute Cloud (Amazon EC2) instances, all the pods use the security group associated with the node. This can be an issue if your pods (executor pods) are accessing a data source (namely a database) that allows traffic based on the source security group. Database servers often restrict network access only from where they are expecting it. In the case of a multi-tenant EKS cluster, this means pods from other teams that shouldn’t have access to the database servers, would be able to send traffic to it.

To overcome this challenge, you can use security groups for pods. This feature allows you to assign a specific security group to your pods, thereby controlling the network traffic to your database server or data source. You can also refer to the best practices guide for a reference implementation.

Cost management and chargeback

In a multi-tenant environment, cost management is a critical subject. You have multiple users from various business units, and you need to be able to precisely chargeback the cost of the compute resource they have used. At the beginning of the post, we introduced three models of multi-tenancy in Amazon EKS: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy is out of scope because the cost tracking is trivial; all the resources are dedicated to the team using the cluster, which is not the case for sole multi-tenancy and soft multi-tenancy. In the next sections, we discuss these two methods to track the cost for each of model.

Soft multi-tenancy

In a soft multi-tenant environment, you can perform chargeback to your data engineering teams based on the resources they consumed and not the nodes allocated. In this method, you use the namespaces associated with the EMR virtual cluster to track how much resources were used for processing jobs. The following diagram illustrates an example.

Diagram to illustrate soft multi-tenancy

Diagram -1 Soft multi-tenancy

Tracking resources based on the namespace isn’t an easy task because jobs are transient in nature and fluctuate in their duration. However, there are partner tools available that allow you to keep track of the resources used, such as Kubecost, CloudZero, Vantage, and many others. For instructions on using Kubecost on Amazon EKS, refer to this blog post on cost monitoring for EKS customers.

Sole multi-tenancy

For sole multi-tenancy, the chargeback is done at the instance (node) level. Each member on your team uses a specific set of nodes that are dedicated to it. These nodes aren’t always running, and are spun up using the Kubernetes auto scaling mechanism. The following diagram illustrates an example.

Diagram to illustrate Sole tenancy

Diagram -2 Sole tenancy

With sole multi-tenancy, you use a cost allocation tag, which is an AWS mechanism that allows you to track how much each resource has consumed. Although the method of sole multi-tenancy isn’t efficient in terms of resource utilization, it provides a simplified strategy for chargebacks. With the cost allocation tag, you can chargeback a team based on all the resources they used, like Amazon S3, Amazon DynamoDB, and other AWS resources. The chargeback mechanism based on the cost allocation tag can be augmented using the recently launched AWS Billing Conductor, which allows you to issue bills internally for your team.

Resource management

In this section, we discuss considerations regarding resource management in multi-tenant clusters. We briefly discuss topics like sharing resources graciously, setting guard rails on resource consumption, techniques for ensuring resources for time sensitive and/or critical jobs, meeting quick resource scaling requirements and finally cost optimization practices with node selectors.

Sharing resources

In a multi-tenant environment, the goal is to share resources like compute and memory for better resource utilization. However, this requires careful capacity management and resource allocation to make sure each tenant gets their fair share. In Kubernetes, resource allocation is controlled and enforced by using ResourceQuota and LimitRange. ResourceQuota limits resources on the namespace level, and LimitRange allows you to make sure that all the containers are submitted with a resource requirement and a limit. In this section, we demonstrate how a data engineer or Kubernetes administrator can set up ResourceQuota as a LimitRange configuration.

The administrator creates one ResourceQuota per namespace that provides constraints for aggregate resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: teamA
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: 4000Gi
    limits.cpu: "2000"
    limits.memory: 6000Gi

For LimitRange, the administrator can review the following sample configuration. We recommend using default and defaultRequest to enforce the limit and request field on containers. Lastly, from a data engineer perspective while submitting the EMR on EKS jobs, you need to make sure the Spark parameters of resource requirements are within the range of the defined LimitRange. For example, in the following configuration, the request for spark.executor.cores=7 will fail because the max limit for CPU is 6 per container:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max
  namespace: teamA
spec:
  limits:
  - max:
      cpu: "6"
    min:
      cpu: "100m"
    default:
      cpu: "500m"
    defaultRequest:
      cpu: "100m"
    type: Container

Priority-based resource allocation

Diagram Illustrates an example of resource allocation with priority

Diagram – 3 Illustrates an example of resource allocation with priority.

As all the EMR virtual clusters share the same EKS computing platform with limited resources, there will be scenarios in which you need to prioritize jobs in a sensitive timeline. In this case, high-priority jobs can utilize the resources and finish the job, whereas low-priority jobs that are running gets stopped and any new pods must wait in the queue. EMR on EKS can achieve this with the help of pod templates, where you specify a priority class for the given job.

When a pod priority is enabled, the Kubernetes scheduler orders pending pods by their priority and places them in the scheduling queue. As a result, the higher-priority pod may be scheduled sooner than pods with lower priority if its scheduling requirements are met. If this pod can’t be scheduled, the scheduler continues and tries to schedule other lower-priority pods.

The preemptionPolicy field on the PriorityClass defaults to PreemptLowerPriority, and the pods of that PriorityClass can preempt lower-priority pods. If preemptionPolicy is set to Never, pods of that PriorityClass are non-preempting. In other words, they can’t preempt any other pods. When lower-priority pods are preempted, the victim pods get a grace period to finish their work and exit. If the pod doesn’t exit within that grace period, that pod is stopped by the Kubernetes scheduler. Therefore, there is usually a time gap between the point when the scheduler preempts victim pods and the time that a higher-priority pod is scheduled. If you want to minimize this gap, you can set a deletion grace period of lower-priority pods to zero or a small number. You can do this by setting the terminationGracePeriodSeconds option in the victim Pod YAML.

See the following code samples for priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: " High-priority Pods and for Driver Pods."

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: " Low-priority Pods."

One of the key considerations while templatizing the driver pods, especially for low-priority jobs, is to avoid the same low-priority class for both driver and executor. This will save the driver pods from getting evicted and lose the progress of all its executors in a resource congestion scenario. In this low-priority job example, we have used a high-priority class for driver pod templates and low-priority classes only for executor templates. This way, we can ensure the driver pods are safe during the eviction process of low-priority jobs. In this case, only executors will be evicted, and the driver can bring back the evicted executor pods as the resource becomes freed. See the following code:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "high-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as Spark executor container

Overprovisioning with priority

Diagram Illustrates an example of overprovisioning with priority

Diagram – 4 Illustrates an example of overprovisioning with priority.

As pods wait in a pending state due to resource availability, additional capacity can be added to the cluster with Amazon EKS auto scaling. The time it takes to scale the cluster by adding new nodes for deployment has to be considered for time-sensitive jobs. Overprovisioning is an option to mitigate the auto scaling delay using temporary pods with negative priority. These pods occupy space in the cluster. When pods with high priority are unschedulable, the temporary pods are preempted to make the room. This causes the auto scaler to scale out new nodes due to overprovisioning. Be aware that this is a trade-off because it adds higher cost while minimizing scheduling latency. For more information about overprovisioning best practices, refer to Overprovisioning.

Node selectors

EKS clusters can span multiple Availability Zones in a VPC. A Spark application whose driver and executor pods are distributed across multiple Availability Zones can incur inter- Availability Zone data transfer costs. To minimize or eliminate the data transfer cost, you should configure the job to run on a specific Availability Zone or even specific node type with the help of node labels. Amazon EKS places a set of default labels to identify capacity type (On-Demand or Spot Instance), Availability Zone, instance type, and more. In addition, we can use custom labels to meet workload-specific node affinity.

EMR on EKS allows you to choose specific nodes in two ways:

  • At the job level. Refer to EKS Node Placement for more details.
  • In the driver and executor level using pod templates.

When using pod templates, we recommend using on demand instances for driver pods. You can also consider including spot instances for executor pods for workloads that are tolerant of occasional periods when the target capacity is not completely available. Leveraging spot instances allow you to save cost for jobs that are not critical and can be terminated. Please refer Define a NodeSelector in PodTemplates.

Conclusion

In this post, we provided guidance on how to design and deploy EMR on EKS in a multi-tenant EKS environment through different lenses: network, security, cost management, and resource management. For any deployment, we recommend the following:

  • Use IRSA with a condition scoped on the EMR on EKS service account
  • Use a secret manager to store credentials and the Secret Store CSI Driver to access them in your Spark application
  • Use ResourceQuota and LimitRange to specify the resources that each of your data engineering teams can use and avoid compute resource abuse and starvation
  • Implement a network policy to segregate network traffic between pods

Lastly, if you are considering migrating your spark workload to EMR on EKS you can further learn about design patterns to manage Apache Spark workload in EMR on EKS in this blog and about migrating your EMR transient cluster to EMR on EKS in this blog.


About the Authors

author - lotfiLotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

author - peter ajeebAjeeb Peter is a Senior Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience on Software Development, Architecture and Analytics from industries like finance and telecom.

Use AWS Network Firewall to filter outbound HTTPS traffic from applications hosted on Amazon EKS and collect hostnames provided by SNI

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/security/use-aws-network-firewall-to-filter-outbound-https-traffic-from-applications-hosted-on-amazon-eks/

This blog post shows how to set up an Amazon Elastic Kubernetes Service (Amazon EKS) cluster such that the applications hosted on the cluster can have their outbound internet access restricted to a set of hostnames provided by the Server Name Indication (SNI) in the allow list in the AWS Network Firewall rules. For encrypted web traffic, SNI can be used for blocking access to specific sites in the network firewall. SNI is an extension to TLS that remains unencrypted in the traffic flow and indicates the destination hostname a client is attempting to access over HTTPS.

This post also shows you how to use Network Firewall to collect hostnames of the specific sites that are being accessed by your application. Securing outbound traffic to specific hostnames is called egress filtering. In computer networking, egress filtering is the practice of monitoring and potentially restricting the flow of information outbound from one network to another. Securing outbound traffic is usually done by means of a firewall that blocks packets that fail to meet certain security requirements. One such firewall is AWS Network Firewall, a managed service that you can use to deploy essential network protections for all of your VPCs that you create with Amazon Virtual Private Cloud (Amazon VPC).

Example scenario

You have the option to scan your application traffic by the identifier of the requested SSL certificate, which makes you independent from the relationship of the IP address to the certificate. The certificate could be served from any IP address. Traditional stateful packet filters are not able to follow the changing IP address of the endpoints. Therefore, the host name information that you get from the SNI becomes important in making security decisions. Amazon EKS has gained popularity for running containerized workloads in the AWS Cloud, and you can restrict outbound traffic to only the known hostnames provided by SNI. This post will walk you through the process of setting up the EKS cluster in two different subnets so that your software can use the additional traffic routing in the VPC and traffic filtering through Network Firewall.

Solution architecture

The architecture illustrated in Figure 1 shows a VPC with three subnets in Availability Zone A, and three subnets in Availability Zone B. There are two public subnets where Network Firewall endpoints are deployed, two private subnets where the worker nodes for the EKS cluster are deployed, and two protected subnets where NAT gateways are deployed.

Figure 1: Outbound internet access through Network Firewall from Amazon EKS worker nodes

Figure 1: Outbound internet access through Network Firewall from Amazon EKS worker nodes

The workflow in the architecture for outbound access to a third-party service is as follows:

  1. The outbound request originates from the application running in the private subnet (for example, to https://aws.amazon.com) and is passed to the NAT gateway in the protected subnet.
  2. The HTTPS traffic received in the protected subnet is routed to the AWS Network Firewall endpoint in the public subnet.
  3. The network firewall computes the rules, and either accepts or declines the request to pass to the internet gateway.
  4. If the request is passed, the application-requested URL (provided by SNI in the non-encrypted HTTPS header) is allowed in the network firewall, and successfully reaches the third-party server for access.

The VPC settings for this blog post follow the recommendation for using public and private subnets described in Creating a VPC for your Amazon EKS cluster in the Amazon EKS User Guide, but with additional subnets called protected subnets. Instead of placing the NAT gateway in a public subnet, it will be placed in the protected subnet, and the Network Firewall endpoints in the public subnet will filter the egress traffic that flows through the NAT gateway. This design pattern adds further checks and could be a recommendation for your VPC setup.

As suggested in Creating a VPC for your Amazon EKS cluster, using the Public and private subnets option allows you to deploy your worker nodes to private subnets, and allows Kubernetes to deploy load balancers to the public subnets. This arrangement can load-balance traffic to pods that are running on nodes in the private subnets. As shown in Figure 1, the solution uses an additional subnet named the protected subnet, apart from the public and private subnets. The protected subnet is a VPC subnet deployed between the public subnet and private subnet. The outbound internet traffic that is routed through the protected subnet is rerouted to the Network Firewall endpoint hosted within the public subnet. You can use the same strategy mentioned in Creating a VPC for your Amazon EKS cluster to place different AWS resources within private subnets and public subnets. The main difference in this solution is that you place the NAT gateway in a separate protected subnet, between private subnets, and place Network Firewall endpoints in the public subnets to filter traffic in the network firewall. The NAT gateway’s IP address is still preserved, and could still be used for adding to the allow list of third-party entities that need connectivity for the applications running on the EKS worker nodes.

To see a practical example of how the outbound traffic is filtered based on the hosted names provided by SNI, follow the steps in the following Deploy a sample section. You will deploy an AWS CloudFormation template that deploys the solution architecture, consisting of the VPC components, EKS cluster components, and the Network Firewall components. When that’s complete, you can deploy a sample app running on Amazon EKS to test egress traffic filtering through AWS Network Firewall.

Deploy a sample to test the network firewall

Follow the steps in this section to perform a sample app deployment to test the use case of securing outbound traffic through AWS Network Firewall.

Prerequisites

The prerequisite actions required for the sample deployment are as follows:

  1. Make sure you have the AWS CLI installed, and configure access to your AWS account.
  2. Install and set up the eksctl tool to create an Amazon EKS cluster.
  3. Copy the necessary CloudFormation templates and the sample eksctl config files from the blog’s Amazon S3 bucket to your local file system. You can do this by using the following AWS CLI S3 cp command.
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/config.yaml .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/lambda_function.py .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks-collect-all.yaml .
    aws s3 cp s3://awsiammedia/public/sample/803-network-firewall-to-filter-outbound-traffic/network-firewall-eks.yaml .

    Important: This command will download the S3 bucket contents to the current directory on your terminal, so the “.” (dot) in the command is very important.

  4. Once this is complete, you should be able to see the list of files shown in Figure 2. (The list includes config.yaml, lambda_function.py, network-firewall-eks-collect-all.yaml, and network-firewall-eks.yaml.)
    Figure 2: Files downloaded from the S3 bucket

    Figure 2: Files downloaded from the S3 bucket

Deploy the VPC architecture with AWS Network Firewall

In this procedure, you’ll deploy the VPC architecture by using a CloudFormation template.

To deploy the VPC architecture (AWS CLI)

  1. Deploy the CloudFormation template network-firewall-eks.yaml, which you previously downloaded to your local file system from the Amazon S3 bucket.

    You can do this through the AWS CLI by using the create-stack command, as follows.

    aws cloudformation create-stack --stack-name AWS-Network-Firewall-Multi-AZ \
    --template-body file://network-firewall-eks.yaml \
    --parameters ParameterKey=NetworkFirewallAllowedWebsites,ParameterValue=".amazonaws.com\,.docker.io\,.docker.com" \
    --capabilities CAPABILITY_NAMED_IAM

    Note: The initially allowed hostnames for egress filtering are passed to the network firewall by using the parameter key NetworkFirewallAllowedWebsites in the CloudFormation stack. In this example, the allowed hostnames are .amazonaws.com, .docker.io, and docker.com.

  2. Make a note of the subnet IDs from the stack outputs of the CloudFormation stack after the status goes to Create_Complete.
     
    aws cloudformation describe-stacks \
    --stack-name AWS-Network-Firewall-Multi-AZ

    Note: For simplicity, the CloudFormation stack name is AWS-Network-Firewall-Multi-AZ, but you can change this name to according to your needs and follow the same naming throughout this post.

To deploy the VPC architecture (console)

In your account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take approximately 10 minutes for the CloudFormation stack to complete.

Select this image to open a link that starts building the CloudFormation stack

Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Region.

Deploy and set up access to the EKS cluster

In this step, you’ll use the eksctl CLI tool to create an EKS cluster.

To deploy an EKS cluster by using the eksctl tool

There are two methods for creating an EKS cluster. Method A uses the eksctl create cluster command without a configuration (config) file. Method B uses a config file.

Note: Before you start, make sure you have the VPC subnet details available from the previous procedure.

Method A: No config file

You can create an EKS cluster without a config file by using the eksctl create cluster command.

  1. From the CLI, enter the following commands.
    eksctl create cluster \
    --vpc-private-subnets=<private-subnet-A>,<private-subnet-B> \
    --vpc-public-subnets=<public-subnet-A>,<public-subnet-B>
  2. Make sure that the subnets passed to the --vpc-public-subnets parameter are protected subnets taken from the VPC architecture CloudFormation stack output. You can verify the subnet IDs by looking at step 2 in the To deploy the VPC architecture section.

Method B: With config file

Another way to create an EKS cluster is by using the following config file, with more options with the name (cluster.yaml in this example).

  1. Create a file named cluster.yaml by adding the following contents to it.
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: filter-egress-traffic-test
      region: us-east-1
      version: "1.19"
    availabilityZones: ["us-east-1a", "us-east-1b"]
    vpc:
      id: 
      subnets:
        public:
          us-east-1a: { id: <public-subnet-A> }
          us-east-1b: { id: <public-subnet-B> }
        private:
          us-east-1a: { id: <private-subnet-A> }
          us-east-1b: { id: <private-subnet-B> }
    
    managedNodeGroups:
    - name: nodegroup
      desiredCapacity: 3
      ssh:
        allow: true
        publicKeyName: main
      iam:
        attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKSServicePolicy
        - arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
      preBootstrapCommands:
        - yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        - sudo systemctl enable amazon-ssm-agent
        - sudo systemctl start amazon-ssm-agent

  2. Run the following command to create an EKS cluster using the eksctl tool and the cluster.yaml config file.
    eksctl create cluster -f cluster.yaml

To set up access to the EKS cluster

  1. Before you deploy a sample Kubernetes Pod, make sure you have the kubeconfig file set up for the EKS cluster that you created in step 2 of To deploy an EKS cluster by using the eksctl tool. For more information, see Create a kubeconfig for Amazon EKS. You can use eksctl to do this, as follows.

    eksctl utils write-kubeconfig —cluster filter-egress-traffic-test

  2. Set the kubectl context to the EKS cluster you just created, by using the following command.

    kubectl config get-contexts

    Figure 3 shows an example of the output from this command.

    Figure 3: kubectl config get-contexts command output

    Figure 3: kubectl config get-contexts command output

  3. Copy the context name from the command output and set the context by using the following command.

    kubectl config use-context <NAME-OF-CONTEXT>

To deploy a sample Pod on the EKS cluster

  1. Next, deploy a sample Kubernetes Pod in the  EKS cluster.

    kubectl run -i --tty amazon-linux —image=public.ecr.aws/amazonlinux/amazonlinux:latest sh

    If you already have a Pod, you can use the following command to get a shell to a running container.

    kubectl attach amazon-linux -c alpine -i -t

  2. Now you can test access to a non-allowed website in the AWS Network Firewall stateful rules, using these steps.
    1. First, install the cURL tool on the sample Pod you created previously. cURL is a command-line tool for getting or sending data, including files, using URL syntax. Because cURL uses the libcurl library, it supports every protocol libcurl supports. On the Pod where you have obtained a shell to a running container, run the following command to install cURL.
      apk install curl
    2. Access a website using cURL.
      curl -I https://aws.amazon.com

      This gives a timeout error similar to the following.

      curl -I https://aws.amazon.com
      curl: (28) Operation timed out after 300476 milliseconds with 0 out of 0 bytes received

    3. Navigate to the AWS CloudWatch console and check the alert logs for Network Firewall. You will see a log entry like the following sample, indicating that the access to https://aws.amazon.com was blocked.
      {
          "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
          "availability_zone": "us-east-1a",
          "event_timestamp": "1623651293",
          "event": {
              "timestamp": "2021-06-14T06:14:53.483069+0000",
              "flow_id": 649458981081302,
              "event_type": "alert",
              "src_ip": "xxx.xxx.xxx.xxx",
              "src_port": xxxxx,
              "dest_ip": "xxx.xxx.xxx.xxx",
              "dest_port": 443,
              "proto": "TCP",
              "alert": {
                  "action": "blocked",
                  "signature_id": 4,
                  "rev": 1,
                  "signature": "not matching any TLS allowlisted FQDNs",
                  "category": "",
                  "severity": 1
              },
              "tls": {
                  "sni": "aws.amazon.com",
                  "version": "UNDETERMINED",
                  "ja3": {},
                  "ja3s": {}
              },
              "app_proto": "tls"
          }
      }

      The error shown here occurred because the hostname www.amazon.com was not added to the Network Firewall stateful rules allow list.

      When you deployed the network firewall in step 1 of the To deploy the VPC architecture procedure, the values provided for the CloudFormation parameter NetworkFirewallAllowedWebsites were just .amazonaws.com, .docker.io, .docker.com and not aws.amazon.com.

Update the Network Firewall stateful rules

In this procedure, you’ll update the Network Firewall stateful rules to allow the aws.amazon.com domain name.

To update the Network Firewall stateful rules (console)

  1. In the AWS CloudFormation console, locate the stack you used to create the network firewall earlier in the To deploy the VPC architecture procedure.
  2. Select the stack you want to update, and choose Update. In the Parameters section, update the stack by adding the hostname aws.amazon.com to the parameter NetworkFirewallAllowedWebsites as a comma-separated value. See Updating stacks directly in the AWS CloudFormation User Guide for more information on stack updates.

Re-test from the sample pod

In this step, you’ll test the outbound access once again from the sample Pod you created earlier in the To deploy a sample Pod on the EKS cluster procedure.

To test the outbound access to the aws.amazon.com hostname

  1. Get a shell to a running container in the sample Pod that you deployed earlier, by using the following command.
    kubectl attach amazon-linux -c alpine -i -t
  2. On the terminal where you got a shell to a running container in the sample Pod, run the following cURL command.
    curl -I https://aws.amazon.com
  3. The response should be a success HTTP 200 OK message similar to this one.
    curl -Ik https://aws.amazon.com
    HTTP/2 200
    content-type: text/html;charset=UTF-8
    server: Server

If the VPC subnets are organized according to the architecture suggested in this solution, outbound traffic from the EKS cluster can be sent to the network firewall and then filtered based on hostnames provided by SNI.

Collecting hostnames provided by the SNI

In this step, you’ll see how to configure the network firewall to collect all the hostnames provided by SNI that are accessed by an already running application—without blocking any access—by making use of CloudWatch and alert logs.

To configure the network firewall (console)

  1. In the AWS CloudFormation console, locate the stack that created the network firewall earlier in the To deploy the VPC architecture procedure.
  2. Select the stack to update, and then choose Update.
  3. Choose Replace current template and upload the template network-firewall-eks-collect-all.yaml. (This template should be available from the files that you downloaded earlier from the S3 bucket in the Prerequisites section.) Choose Next. See Updating stacks directly for more information.

To configure the network firewall (AWS CLI)

  1. Update the CloudFormation stack by using the network-firewall-eks-collect-all.yaml template file that you previously downloaded from the S3 bucket in the Prerequisites section, using the update-stack command as follows.
    aws cloudformation update-stack --stack-name AWS-Network-Firewall-Multi-AZ \
    --template-body file://network-firewall-eks-collect-all.yaml \
    --capabilities CAPABILITY_NAMED_IAM

To check the rules in the AWS Management Console

  1. In the AWS Management Console, navigate to the Amazon VPC console and locate the AWS Network Firewall tab.
  2. Select the network firewall that you created earlier, and then select the stateful rule with the name log-all-tls.
  3. The rule group should appear as shown in Figure 4, indicating that the logs are captured and sent to the Alert logs.
    Figure 4: Network Firewall rule groups

    Figure 4: Network Firewall rule groups

To test based on stateful rule

  1. On the terminal, get the shell for the running container in the Pod you created earlier. If this Pod is not available, follow the instructions in the To deploy a sample Pod on the EKS cluster procedure to create a new sample Pod.
  2. Run the cURL command to aws.amazon.com. It should return HTTP 200 OK, as follows.
    curl -Ik https://aws.amazon.com/
    HTTP/2 200
    content-type: text/html;charset=UTF-8
    server: Server
    date:
    ------
    ----------
    --------------
  3. Navigate to the AWS CloudWatch Logs console and look up the Alert logs log group with the name /AWS-Network-Firewall-Multi-AZ/anfw/alert.

    You can see the hostnames provided by SNI within the TLS protocol passing through the network firewall. The CloudWatch Alert logs for allowed hostnames in the SNI looks like the following example.

    {
        "firewall_name": "AWS-Network-Firewall-Multi-AZ-firewall",
        "availability_zone": "us-east-1b",
        "event_timestamp": "1627283521",
        "event": {
            "timestamp": "2021-07-26T07:12:01.304222+0000",
            "flow_id": 1977082435410607,
            "event_type": "alert",
            "src_ip": "xxx.xxx.xxx.xxx",
            "src_port": xxxxx,
            "dest_ip": "xxx.xxx.xxx.xxx",
            "dest_port": 443,
            "proto": "TCP",
            "alert": {
                "action": "allowed",
                "signature_id": 2,
                "rev": 0,
                "signature": "",
                "category": "",
                "severity": 3
            },
            "tls": {
                "subject": "CN=aws.amazon.com",
                "issuerdn": "C=US, O=Amazon, OU=Server CA 1B, CN=Amazon",
                "serial": "08:13:34:34:48:07:64:27:4D:BC:CB:14:4D:AF:F2:11",
                "fingerprint": "f7:53:97:5e:76:1e:fb:f6:70:72:02:95:d5:9f:2f:05:52:79:5d:ae",
                "sni": "aws.amazon.com",
                "version": "TLS 1.2",
                "notbefore": "2020-09-30T00:00:00",
                "notafter": "2021-09-23T12:00:00",
                "ja3": {},
                "ja3s": {}
            },
            "app_proto": "tls"
        }
    }

Optionally, you can also create an AWS Lambda function to collect the hostnames that are passed through the network firewall.

To create a Lambda function to collect hostnames provided by SNI (optional)

Sample Lambda code

The sample Lambda code from Figure 5 is shown following, and is written in Python 3. The sample collects the hostnames that are provided by SNI and captured in Network Firewall. Network Firewall logs the hostnames provided by SNI in the CloudWatch Alert logs. Then, by creating a CloudWatch logs subscription filter, you can send logs to the Lambda function for further processing, for example to invoke SNS notifications.

import json
import gzip
import base64
import boto3
import sys
import traceback
sns_client = boto3.client('sns')
def lambda_handler(event, context):
    try:
        decoded_event = json.loads(gzip.decompress(base64.b64decode(event['awslogs']['data'])))
        body = '''
        {filtermatch}
        '''.format(
            loggroup=decoded_event['logGroup'],
            logstream=decoded_event['logStream'],
            filtermatch=decoded_event['logEvents'][0]['message'],
        )
        # print(body)# uncomment this for debugging
        filterMatch = json.loads(body)
        data = []
        if 'http' in filterMatch['event']:
            data.append(filterMatch['event']['http']['hostname'])
        elif 'tls' in filterMatch['event']:
            data.append(filterMatch['event']['tls']['sni'])
        result = 'Trying to reach ' + 1*' ' + (data[0]) + 1*' ' 'via Network Firewall' + 1*' '  + (filterMatch['firewall_name'])
        # print(result)# uncomment this for debugging
        message = {'HostName': result}
        send_to_sns = sns_client.publish(
            TargetArn='<SNS-topic-ARN>', #Replace with the SNS topic ARN
            Message=json.dumps({'default': json.dumps(message),
                            'sms': json.dumps(message),
                            'email': json.dumps(message)}),
            Subject='Trying to reach the hostname through the Network Firewall',
            MessageStructure='json')
    except Exception as e:
        print('Function failed due to exception.')
        e = sys.exc_info()[0]
        print(e)
        traceback.print_exc()
        Status="Failure"
        Message=("Error occured while executing this. The error is %s" %e)

Clean up

In this step, you’ll clean up the infrastructure that was created as part of this solution.

To delete the Kubernetes workloads

  1. On the terminal, using the kubectl CLI tool, run the following command to delete the sample Pod that you created earlier.
    kubectl delete pods amazon-linux

    Note: Clean up all the Kubernetes workloads running on the EKS cluster. For example, if the Kubernetes service of type LoadBalancer is deployed, and if the EKS cluster where it exists is deleted, the LoadBalancer will not be deleted. The best practice is to clean up all the deployed workloads.

  2. On the terminal, using the eksctl CLI tool, delete the created EKS cluster by using the following command.
    eksctl delete cluster --name filter-egress-traffic-test

To delete the CloudFormation stack and AWS Network Firewall

  1. Navigate to the AWS CloudFormation console and choose the stack with the name AWS-Network-Firewall-Multi-AZ.
  2. Choose Delete, and then at the prompt choose Delete Stack. For more information, see Deleting a stack on the AWS CloudFormation console.

Conclusion

By following the VPC architecture explained in this blog post, you can protect the applications running on an Amazon EKS cluster by filtering the outbound traffic based on the approved hostnames that are provided by SNI in the Network Firewall Allow list.

Additionally, with a simple Lambda function, CloudWatch Logs, and an SNS topic, you can get readable hostnames provided by the SNI. Using these hostnames, you can learn about the traffic pattern for the applications that are running within the EKS cluster, and later create a strict list to allow only the required outbound traffic. To learn more about Network Firewall stateful rules, see Working with stateful rule groups in AWS Network Firewall in the AWS Network Firewall Developer Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Kirankumar Chandrashekar

Kirankumar is a Sr. Solutions Architect for Strategic Accounts at AWS. He focuses on leading customers in architecting DevOps, containers and container technologies to name a few. Kirankumar is passionate about DevOps, Infrastructure as Code, and solving complex customer issues. He enjoys music, as well as cooking and traveling.