Tag Archives: Amazon Elastic Kubernetes Service

Unify log aggregation and analytics across compute platforms

Post Syndicated from Hari Ohm Prasath original https://aws.amazon.com/blogs/big-data/unify-log-aggregation-and-analytics-across-compute-platforms/

Our customers want to make sure their users have the best experience running their applications on AWS. To make this happen, you need to monitor and fix software problems as quickly as possible. Doing this gets challenging as the volume of data that must be quickly collected, analyzed, and stored keeps growing. In this post, we walk you through an automated process to aggregate and monitor application logging data in near-real time, so you can remediate application issues faster.

This post shows how to unify and centralize logs across different computing platforms. With this solution, you can unify logs from Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Kinesis Data Firehose, and AWS Lambda using agents, log routers, and extensions. We use Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) with OpenSearch Dashboards to visualize and analyze the logs collected across the different computing platforms and get application insights. You can deploy the solution using the AWS Cloud Development Kit (AWS CDK) scripts provided as part of the solution.

Customer benefits

A unified aggregated log system provides the following benefits:

  • A single point of access to all the logs across different computing platforms
  • A standard way to define and transform logs before they get delivered to downstream systems like Amazon Simple Storage Service (Amazon S3), Amazon OpenSearch Service, Amazon Redshift, and other services
  • The ability to use Amazon OpenSearch Service to quickly index logs, and OpenSearch Dashboards to search and visualize them across routers, applications, and other devices

Solution overview

In this post, we use the following services to demonstrate log aggregation across different compute platforms:

  • Amazon EC2 – A web service that provides secure, resizable compute capacity in the cloud. It’s designed to make web-scale cloud computing easier for developers.
  • Amazon ECS – A web service that makes it easy to run, scale, and manage Docker containers on AWS, designed to make the Docker experience easier for developers.
  • Amazon EKS – A managed service that makes it easy to run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane.
  • Kinesis Data Firehose – A fully managed service that makes it easy to stream data to Amazon S3, Amazon Redshift, or Amazon OpenSearch Service.
  • Lambda – A serverless compute service that lets you run code without provisioning or managing servers, so you can focus on your application instead of infrastructure.
  • Amazon OpenSearch Service – A fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more.

The following diagram shows the architecture of our solution.

The architecture uses various log aggregation tools, such as log agents, log routers, and Lambda extensions, to collect logs from multiple compute platforms and deliver them to Kinesis Data Firehose. Kinesis Data Firehose streams the logs to Amazon OpenSearch Service. Log records that fail to be persisted in Amazon OpenSearch Service are written to Amazon S3. To scale this architecture, each compute platform streams its logs to a different Firehose delivery stream, which is added as a separate index and rotated every 24 hours.

The following sections demonstrate how the solution is implemented on each of these computing platforms.

Amazon EC2

The Kinesis agent collects and streams logs from the applications running on EC2 instances to Kinesis Data Firehose. The agent is a standalone Java software application that offers an easy way to collect and send data to Kinesis Data Firehose. The agent continuously monitors files and sends logs to the Firehose delivery stream.


The AWS CDK script provided as part of this solution deploys a simple PHP application that generates logs under the /etc/httpd/logs directory on the EC2 instance. The Kinesis agent is configured via /etc/aws-kinesis/agent.json to collect data from access_logs and error_logs, and stream them periodically to Kinesis Data Firehose (ec2-logs-delivery-stream).
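
For reference, the agent configuration typically looks something like the following. This is a minimal sketch and not the exact file deployed by the AWS CDK script; the log file patterns and the firehose.endpoint Region are assumptions you should adjust to your environment.

# Write the Kinesis agent configuration (adjust file patterns, Region, and stream name as needed)
cat <<'EOF' | sudo tee /etc/aws-kinesis/agent.json
{
  "cloudwatch.emitMetrics": true,
  "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/etc/httpd/logs/access_log*",
      "deliveryStream": "ec2-logs-delivery-stream"
    },
    {
      "filePattern": "/etc/httpd/logs/error_log*",
      "deliveryStream": "ec2-logs-delivery-stream"
    }
  ]
}
EOF
# Restart the agent so it picks up the new configuration
sudo service aws-kinesis-agent restart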

Because Amazon OpenSearch Service expects data in JSON format, you can add a call to a Lambda function to transform the log data to JSON format within Kinesis Data Firehose before streaming to Amazon OpenSearch Service. The following is a sample input for the data transformer:

46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] "GET / HTTP/1.1" 200 173 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

The following is our output:

{
    "logs" : "46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] \"GET / HTTP/1.1\" 200 173 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\"",
}

We can enhance the Lambda function to extract the timestamp, HTTP, and browser information from the log data, and store them as separate attributes in the JSON document.

Amazon ECS

In the case of Amazon ECS, we use FireLens to send logs directly to Kinesis Data Firehose. FireLens is a container log router for Amazon ECS and AWS Fargate that gives you the extensibility to use the breadth of services at AWS or partner solutions for log analytics and storage.


The architecture hosts FireLens as a sidecar, which collects logs from the main container running an httpd application and sends them to Kinesis Data Firehose, which in turn streams them to Amazon OpenSearch Service. The AWS CDK script provided as part of this solution deploys an httpd container hosted behind an Application Load Balancer. The httpd logs are pushed to Kinesis Data Firehose (ecs-logs-delivery-stream) through the FireLens log router.
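
To give a sense of how the sidecar is wired up, the following is a minimal sketch of the containerDefinitions portion of such a task definition. The container images, Region, and file name are illustrative assumptions; the AWS CDK script generates the actual task definition for you.

# Write an illustrative task definition fragment: the httpd container logs via the
# awsfirelens log driver, and the log_router sidecar runs Fluent Bit under FireLens
cat <<'EOF' > firelens-task-def-fragment.json
{
  "containerDefinitions": [
    {
      "name": "httpd",
      "image": "httpd:2.4",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "firehose",
          "region": "us-east-1",
          "delivery_stream": "ecs-logs-delivery-stream"
        }
      }
    },
    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": {
        "type": "fluentbit"
      }
    }
  ]
}
EOF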

Amazon EKS

With the recent announcement of Fluent Bit support for Amazon EKS, you no longer need to run a sidecar to route container logs from Amazon EKS pods running on Fargate. With the new built-in logging support, you can select a destination of your choice to send the records to. Amazon EKS on Fargate uses a version of Fluent Bit for AWS, an upstream conformant distribution of Fluent Bit managed by AWS.


The AWS CDK script provided as part of this solution deploys an NGINX container hosted behind an internal Application Load Balancer. The NGINX container logs are pushed to Kinesis Data Firehose (eks-logs-delivery-stream) through the Fluent Bit plugin.
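
Under the hood, the built-in Fargate logging is configured with a ConfigMap named aws-logging in the aws-observability namespace. The following is a minimal sketch of that configuration; the Region is an assumption, and the AWS CDK script applies the equivalent configuration for you.

# Configure the built-in Fluent Bit log router on EKS Fargate to stream to Kinesis Data Firehose
# (assumes the aws-observability namespace already exists in the cluster)
cat <<'EOF' | kubectl apply -f -
kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name kinesis_firehose
        Match *
        region us-east-1
        delivery_stream eks-logs-delivery-stream
EOF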

Lambda

For Lambda functions, you can send logs directly to Kinesis Data Firehose using a Lambda extension, and you can prevent the records from also being written to Amazon CloudWatch.


After deployment, the workflow is as follows:

  1. On startup, the extension subscribes to receive logs for the platform and function events. A local HTTP server is started inside the external extension, which receives the logs.
  2. The extension buffers the log events in a synchronized queue and writes them to Kinesis Data Firehose via PUT records.
  3. Kinesis Data Firehose delivers the logs to the configured downstream systems.
  4. The logs are indexed in Amazon OpenSearch Service.

The Firehose delivery stream name gets specified as an environment variable (AWS_KINESIS_STREAM_NAME).
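
For example, you could point a function at a delivery stream with a command like the following. The function and stream names are placeholders; the AWS CDK script sets this variable on the sample function for you.

# Tell the Lambda extension which Firehose delivery stream to write to
aws lambda update-function-configuration \
    --function-name <samplelambdafunction> \
    --environment "Variables={AWS_KINESIS_STREAM_NAME=<lambda-logs-delivery-stream>}"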

For this solution, because we’re only focusing on collecting the run logs of the Lambda function, the data transformer of the Kinesis Data Firehose delivery stream filters for the records of type function ("type":"function") before sending them to Amazon OpenSearch Service.

The following is a sample input for the data transformer:

[
   {
      "time":"2021-07-29T19:54:08.949Z",
      "type":"platform.start",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "version":"$LATEST"
      }
   },
   {
      "time":"2021-07-29T19:54:09.094Z",
      "type":"platform.logsSubscription",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Subscribed",
         "types":[
            "platform",
            "function"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.094Z\tundefined\tINFO\tLoading function\n"
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"platform.extension",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Ready",
         "events":[
            "INVOKE",
            "SHUTDOWN"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.097Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.097Z\t024ae572-72c7-44e0-90f5-3f002a1df3f2\tINFO\tvalue1 = value1\n"
   },   
   {
      "time":"2021-07-29T19:54:09.098Z",
      "type":"platform.runtimeDone",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "status":"success"
      }
   }
]

Prerequisites

To implement this solution, you need the following prerequisites:

Build the code

Check out the AWS CDK code by running the following command:

mkdir unified-logs && cd unified-logs
git clone https://github.com/aws-samples/unified-log-aggregation-and-analytics .

Build the lambda extension by running the following command:

cd lib/computes/lambda/extensions
chmod +x extension.sh
./extension.sh
cd ../../../../

Make sure to replace the default AWS Region specified as the value of the firehose.endpoint attribute inside lib/computes/ec2/ec2-startup.sh with the Region you’re deploying to.

Build the code by running the following command:

yarn install && npm run build

Deploy the code

If you’re running AWS CDK for the first time, run the following command to bootstrap the AWS CDK environment (provide your AWS account ID and AWS Region):

cdk bootstrap \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    aws://<AWS Account Id>/<AWS_REGION>

You only need to bootstrap the AWS CDK one time (skip this step if you have already done this).

Run the following command to deploy the code:

cdk deploy --require-approval never

You get the following output:

 ✅  CdkUnifiedLogStack

Outputs:
CdkUnifiedLogStack.ec2ipaddress = xx.xx.xx.xx
CdkUnifiedLogStack.ecsloadbalancerurl = CdkUn-ecsse-PY4D8DVQLK5H-xxxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceLoadBalancerDNS570CB744 = CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceServiceURL88A7B1EE = http://CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.eksclusterClusterNameCE21A0DB = ekscluster92983EFB-d29892f99efc4419bc08534a3d253160
CdkUnifiedLogStack.eksclusterConfigCommand515C0544 = aws eks update-kubeconfig --name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.eksclusterGetTokenCommand3C33A2A5 = aws eks get-token --cluster-name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.elasticdomainarn = arn:aws:es:us-east-1:xxx:domain/cdkunif-elasti-rkiuv6bc52rp
CdkUnifiedLogStack.s3bucketname = cdkunifiedlogstack-logsfailederrcapturebucket0bcc-xxxxx
CdkUnifiedLogStack.samplelambdafunction = CdkUnifiedLogStack-LambdatransformerfunctionFA3659-c8u392491FrW

Stack ARN:
arn:aws:cloudformation:us-east-1:xxxx:stack/CdkUnifiedLogStack/6d53ef40-efd2-11eb-9a9d-1230a5204572

AWS CDK takes care of building the required infrastructure, deploying the sample application, and collecting logs from different sources to Amazon OpenSearch Service.

The following is some of the key information about the stack:

  • ec2ipaddress – The public IP address of the EC2 instance, deployed with the sample PHP application
  • ecsloadbalancerurl – The URL of the Amazon ECS Load Balancer, deployed with the httpd application
  • eksclusterClusterNameCE21A0DB – The Amazon EKS cluster name, deployed with the NGINX application
  • samplelambdafunction – The sample Lambda function using the Lambda extension to send logs to Kinesis Data Firehose
  • elasticdomainarn – The ARN of the Amazon OpenSearch Service domain

Generate logs

To visualize the logs, you first need to generate some sample logs.

  1. To generate Lambda logs, invoke the function using the following AWS CLI command (run it a few times):
aws lambda invoke \
--function-name "<<samplelambdafunction>>" \
--payload '{"payload": "hello"}' /tmp/invoke-result \
--cli-binary-format raw-in-base64-out \
--log-type Tail

Make sure to replace samplelambdafunction with the actual Lambda function name. The file path needs to be updated based on the underlying operating system.

The function should return "StatusCode": 200, with the following output:

{
    "StatusCode": 200,
    "LogResult": "<<Encoded>>",
    "ExecutedVersion": "$LATEST"
}
  2. Run the following command a couple of times to generate Amazon EC2 logs:
curl http://ec2ipaddress:80

Make sure to replace ec2ipaddress with the public IP address of the EC2 instance.

  3. Run the following command a couple of times to generate Amazon ECS logs:
curl http://ecsloadbalancerurl:80

Make sure to replace ecsloadbalancerurl with the DNS name of the Application Load Balancer (the ecsloadbalancerurl output of the stack).

We deployed the NGINX application with an internal load balancer, so the load balancer hits the health checkpoint of the application, which is sufficient to generate the Amazon EKS access logs.

Visualize the logs

To visualize the logs, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose the hyperlink provided for the OpenSearch Dashboards URL.
  2. Configure access to the OpenSearch Dashboards.
  3. In OpenSearch Dashboards, on the Discover menu, start creating a new index pattern for each compute log.

We can see separate indexes for each compute log partitioned by date, as in the following screenshot.


The following screenshot shows the process to create index patterns for Amazon EC2 logs.


After you create the index patterns, you can start analyzing the logs from the Discover menu of OpenSearch Dashboards in the navigation pane. This tool provides a single, searchable, and unified interface for all the records across the various compute platforms. You can switch between different logs using the Change index pattern submenu.


Clean up

Run the following command from the root directory to delete the stack:

cdk destroy

Conclusion

In this post, we showed how to unify and centralize logs across different compute platforms using Kinesis Data Firehose and Amazon OpenSearch Service. This approach allows you to analyze logs quickly and identify the root cause of failures, using a single platform rather than different platforms for different services.

If you have feedback about this post, submit your comments in the comments section.

Resources

For more information, see the following resources:


About the author

Hari Ohm Prasath is a Senior Modernization Architect at AWS, helping customers with their modernization journey to become cloud native. Hari loves to code and actively contributes to open source initiatives. You can find him on Medium, GitHub, and Twitter @hariohmprasath.

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Use Amazon EKS and Argo Rollouts for Progressive Delivery

Post Syndicated from Srinivasa Shaik original https://aws.amazon.com/blogs/architecture/use-amazon-eks-and-argo-rollouts-for-progressive-delivery/

A common hurdle to DevOps strategies is the manual testing, sign-off, and deployment steps required to deliver new or enhanced feature sets. If an application is updated frequently, these actions can be time-consuming and error prone. You can address these challenges by incorporating progressive delivery concepts along with the Amazon Elastic Kubernetes Service (Amazon EKS) container platform and Argo Rollouts.

Progressive delivery deployments

Progressive delivery is a deployment paradigm in which new features are gradually rolled out to an expanding set of users. Real-time measurements of key performance indicators (KPIs) enable deployment teams to measure customer experience. These measurements can detect any negative impact during deployment and perform an automated rollback before it impacts a larger group of users. Since predefined KPIs are being measured, the rollout can continue autonomously and alleviate the bottleneck of manual approval or rollbacks.

These progressive delivery concepts can be applied to common deployment strategies such as blue/green and canary deployments. A blue/green deployment is a strategy where separate environments are created that are identical to one another. One environment (blue) runs the current application version, while the other environment (green) runs the new version. This enables teams to test on the new environment and move application traffic to the green environment when validated. Canary deployments slowly release your new application to the production environment so that you can build confidence while it is being deployed. Gradually, the new version will replace the current version in its entirety.

Using Kubernetes, you already can perform a rolling update, which can incrementally replace your resource’s Pods with new ones. However, you have limited control of the speed of the rollout, and can’t automatically revert a deployment. KPIs are also difficult to measure in this scenario, resulting in more manual work validating the integrity of the deployment.

To exert more granular control over the deployments, a progressive delivery controller such as Argo Rollouts can be implemented. By using a progressive delivery controller in conjunction with AWS services, you can tune the speed of your deployments and measure your success with KPIs. During the deployment, Argo Rollouts will query metric providers such as Prometheus to perform analysis. (You can find the complete list of the supported metric providers at Argo Rollouts.) If there is an issue with the deployment, automatic rollback actions can be taken to minimize any type of disruption.

Using blue/green deployments for progressive delivery

Blue/green deployments provide zero downtime during deployments and an ability to test your application in production without impacting the stable version. In a typical blue/green deployment on EKS using Kubernetes native resources, a new deployment will be spun up. This includes the new feature version in parallel with the stable deployment (see Figure 1). The new deployment will be tested by a QA team.

Figure 1. Blue/green deployment in progress

Once all the tests have been successfully conducted, the traffic must be directed from the live version to the new version (Figure 2). At this point, all live traffic is funneled to the new version. If there are any issues, a rollback can be conducted by swapping the pointer back to the previous stable version.

Figure 2. Blue/green deployment post-promotion

Keeping this process in mind, there are several manual interactions and decisions involved in a blue/green deployment. Using Argo Rollouts, you can replace these manual steps with automation. Argo Rollouts automatically creates a preview service for testing the green version. With Argo Rollouts, test metrics can be captured by using a monitoring service, such as Amazon Managed Service for Prometheus (Figure 3).

Prometheus is a monitoring software that can be used to collect metrics from your application. With PromQL (Prometheus Query Language), you can write queries to obtain KPIs. These KPIs can then be used to define the success or failure of the deployment. Argo Rollout deployment includes stages to analyze the KPIs before and after promoting the new version. During the prePromotionAnalysis stage, you can validate the new version using preview endpoint. This stage can verify smoke tests or integration tests. Upon meeting the desired success (KPIs), the live traffic will be routed to the new version. In the postPromotionAnalysis stage, you can verify KPIs from the production environment. After promoting the new version, failure of the KPIs during any analysis stage will automatically shut down the deployment and revert to the previous stable version.

Figure 3. Blue/green deployment using KPIs
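
The following is a minimal sketch of what such a Rollout manifest might look like. The service names, image, and the smoke-tests and success-rate AnalysisTemplate references are assumptions for illustration, not resources from this post.

# Illustrative blue/green Rollout with automated pre- and post-promotion analysis
cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.21             # placeholder application image
  strategy:
    blueGreen:
      activeService: demo-active      # blue: receives live traffic
      previewService: demo-preview    # green: receives preview/test traffic
      prePromotionAnalysis:
        templates:
        - templateName: smoke-tests   # AnalysisTemplate that queries Prometheus KPIs
        args:
        - name: service-name
          value: demo-preview
      postPromotionAnalysis:
        templates:
        - templateName: success-rate  # verifies KPIs against production traffic
EOF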

Using canary deployment for progressive delivery

Unlike in a blue/green deployment strategy, in a canary deployment a subset of the traffic is gradually shifted to the new version in your production environment. Since the new version is being deployed in a live environment, feedback can be obtained in real-time, and adjustments can be made accordingly (Figure 4).

Figure 4. An example of a canary deployment

Argo Rollouts supports integration with an Application Load Balancer to manipulate the flow of traffic to different versions of an application. Argo Rollouts can gradually and automatically increase the amount of traffic to the canary service at specific intervals. You can also fully automate the promotion process by using KPIs and metrics from Prometheus as discussed in the blue/green strategy. The analysis will run while the canary deployment is progressing. The success of the KPIs will gradually increase the traffic on the canary service. Any failure will stop the deployment and stop routing live traffic.
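
A minimal sketch of the canary portion of a Rollout might look like the following; the service names, ingress, and success-rate AnalysisTemplate are assumptions for illustration.

# Illustrative canary strategy: traffic shifts through the ALB in steps, gated by Prometheus-backed analysis
cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-canary
spec:
  replicas: 4
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.21            # placeholder application image
  strategy:
    canary:
      canaryService: demo-canary-svc
      stableService: demo-stable-svc
      trafficRouting:
        alb:
          ingress: demo-ingress      # ALB ingress whose target group weights Argo Rollouts adjusts
          servicePort: 80
      analysis:
        templates:
        - templateName: success-rate # AnalysisTemplate that queries Prometheus KPIs
        startingStep: 1              # run the analysis in the background from step 1 onward
      steps:
      - setWeight: 20
      - pause: {duration: 2m}
      - setWeight: 50
      - pause: {duration: 2m}
      - setWeight: 100
EOF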

Conclusion

Implementing progressive delivery in your application can help you deploy new versions of applications with confidence. This approach mitigates the risk of rolling out new application versions by providing visibility into live error rate and performance. You can measure KPIs and automate rollout and rollback procedures. By leveraging Argo Rollouts, you can have more granular control over how your application is deployed in an EKS cluster.

For additional information on progressive delivery or Argo Rollouts:

Field Notes: Building a Data Service for Autonomous Driving Systems Development using Amazon EKS

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/architecture/field-notes-building-a-data-service-for-autonomous-driving-systems-development-using-amazon-eks/

Many aspects of autonomous driving (AD) system development are based on data that capture real-life driving scenarios. Therefore, research and development professionals working on AD systems need to handle an ever-changing array of interesting datasets composed from the real-life driving data. In this blog post, we address a key problem in AD system development: how to dynamically compose interesting datasets from real-life driving data and serve them at scale in near real time.

The first challenge in composing large interesting datasets is high latency. If you have to wait for the entire dataset to be composed before you can start consuming the dataset, you may have to wait for several minutes, or even hours. This latency slows down AD system research and development. The second challenge is creating a data service that can cost-efficiently serve the dynamically composed datasets at scale. In this blog post, we propose solutions to both these challenges.

For the challenge of high latency, we propose dynamically composing the datasets as chunked data streams and serving them using an Amazon FSx for Lustre high-performance file system. Chunked data streams immediately solve the latency issue, because you do not need to compose the entire stream before it can be consumed. For the challenge of cost-efficiently serving the datasets at scale, we propose using Amazon EKS with auto scaling features.

Overview of the Data Service Architecture

The data service described in this post dynamically composes and serves data streams of selected sensor modalities for a specified drive scene selected from the A2D2 driving dataset. The data stream is dynamically composed from the extracted A2D2 drive scene data stored in Amazon S3 object data store, and the accompanying meta-data stored in an Amazon Redshift data warehouse. While the data service described in this post uses the Robot Operating System (ROS), the data service can be easily adapted for use with other robotic systems.

The data service runs in Kubernetes Pods in an Amazon EKS cluster configured to use a Horizontal Pod Autoscaler and the EKS Cluster Autoscaler. An Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster provides the communication channel between the data service and the data clients. The data service implements a request-response paradigm over Apache Kafka topics. However, the response data is not sent back over the Kafka topics. Instead, the data service stages the response data in Amazon S3, Amazon FSx for Lustre, or Amazon EFS, as specified in the data client request, and only the location of the staged response data is sent back to the data client over the Kafka topics. The data client directly reads the response data from its staged location.

The data client running in a ROS enabled Amazon EC2 instance plays back the received data stream into ROS topics, whereby it can be nominally consumed by any ROS node subscribing to the ROS topics. The solution architecture diagram for the data service is shown in Figure 1.

Figure 1 – Data service solution architecture with default configuration

Data Client Request

Imagine the data client wants to request drive scene data in ROS bag file format from the A2D2 autonomous driving dataset for vehicle id a2d2, drive scene id 20190401145936, starting at timestamp 1554121593909500 (microseconds), and stopping at timestamp 1554122334971448 (microseconds). The data client wants the response to include data only from the camera/front_left sensor encoded in the sensor_msgs/Image ROS data type, and the lidar/front_left sensor encoded in the sensor_msgs/PointCloud2 ROS data type. The data client wants the response data to be streamed back chunked in a series of rosbag files, each file spanning 1000000 microseconds of the drive scene. The data client wants the chunked response rosbag files to be staged on a shared Amazon FSx for Lustre file system.

Finally, the data client wants the camera/front_left sensor data to be played back on /a2d2/camera/front_left ROS topic, and the lidar/front_left  sensor data to be played back on /a2d2/lidar/front_left ROS topic.

The data client can encode such a data request using the following JSON object, and send it to the Kafka bootstrap servers  b-1.msk-cluster-1:9092,b-2.msk-cluster-1:9092 on the Apache Kafka topic named a2d2.

{
 "servers": "b-1.msk-cluster-1:9092,b-2.msk-cluster-1:9092",
 "requests": [{
    "kafka_topic": "a2d2", 
    "vehicle_id": "a2d2",
    "scene_id": "20190401145936",
    "sensor_id": ["lidar/front_left", "camera/front_left"],
    "start_ts": 1554121593909500, 
    "stop_ts": 1554122334971448,
    "ros_topic": {"lidar/front_left": "/a2d2/lidar/front_left", 
    "camera/front_left": "/a2d2/camera/front_left"},
    "data_type": {"lidar/front_left": "sensor_msgs/PointCloud2",
    "camera/front_left": "sensor_msgs/Image"},
    "step": 1000000,
    "accept": "fsx/multipart/rosbag",
    "preview": false
 }]
}

At any given time, one or more EKS pods in the data service are listening for messages on the Kafka topic a2d2. The EKS pod that picks the request message responds to the request by composing the requested data as a series of rosbag files, and staging them on FSx for Lustre, as requested in the  "accept": "fsx/multipart/rosbag" field.

Each rosbag in the response is dynamically composed from the drive scene data stored in Amazon S3, using the meta-data stored in Amazon Redshift. Each rosbag contains drive scene data for a single time step. In the preceding example, the time step is specified as "step": 1000000 (microseconds).

Visualizing the Data Service Response

To visualize the data response, you can use any ROS visualization tool, such as rviz, which can be run on the ROS desktop. The following screenshot shows the visualization of the response in the rviz tool for the example data request shown previously.

Figure 2 – Visualization of response using rviz tool

Dynamically Transforming the Coordinate Frames

The data service supports dynamically transforming the composed data from one coordinate frame to another. A typical use case is to transform the data from a sensor-specific coordinate frame to the AV (ego) coordinate frame. Such a transformation request can be included in the data client request.

For example, imagine the data client wants to compose a data stream from all the LiDAR sensors and transform the point cloud data into the vehicle’s coordinate frame. The example configuration c-config-lidar.json allows you to do that. The following is a top-down rviz visualization of the LiDAR point cloud data transformed to the vehicle coordinate frame.

Figure 3 –  Top-down rviz visualization of point-cloud data transformed to ego vehicle view

Walkthrough

In this walkthrough, we use the A2D2 autonomous driving dataset. The complete code for this walkthrough and its reference documentation are available in the associated GitHub repository. Before you get started, clone the GitHub repository on your laptop using the git clone command, and ensure the prerequisites are satisfied.

The approximate cost of this walkthrough with the default configuration is US $2,000. The actual cost may vary considerably based on your configuration and the duration of the walkthrough.

Configure the data service

To configure the data service, we need to create a new AWS CloudFormation stack in the AWS console using the cfn/mozart.yml template from the cloned repository on your laptop.

This template creates AWS Identity and Access Management (IAM) resources, so when you create the CloudFormation Stack using the console, in the review step, you must check I acknowledge that AWS CloudFormation might create IAM resources. The stack input parameters you must specify are the following:

Table: stack input parameters (parameter names and descriptions)

For all other stack input parameters, default values are recommended during the first walkthrough. Review the complete list of all the template input parameters in the Github repository reference.

  • Once the stack status in the CloudFormation console is CREATE_COMPLETE, find the ROS desktop instance launched in your stack in the Amazon EC2 console, and connect to the instance using SSH as user ubuntu, using your SSH key pair. The ROS desktop instance is named <name-of-stack>-desktop.
  • When you connect to the ROS desktop using SSH and you see the message "Cloud init in progress. Machine will REBOOT after cloud init is complete!!", disconnect and try again after about 20 minutes. The desktop installs the NICE DCV server on first-time startup and reboots after the install is complete.
  • If the message NICE DCV server is enabled! appears, run the command sudo passwd ubuntu to set a new strong password for user ubuntu. Now you are ready to connect to the desktop using the NICE DCV client.
  • Download and install the NICE DCV client on your laptop.
  • Use the NICE DCV client to log in to the desktop as user ubuntu.
  • When you first log in to the desktop using the NICE DCV client, you may be asked if you would like to upgrade the OS version. Do not upgrade the OS version.

Now you are ready to proceed with the following steps. For all the commands in this blog, we assume the working directory to be ~/amazon-eks-autonomous-driving-data-service on the ROS desktop.

If you used an IAM role to create the stack above, you must manually configure the credentials associated with the IAM role in the ~/.aws/credentials file with the following fields:

aws_access_key_id=

aws_secret_access_key=

aws_session_token=

If you used an IAM user to create the stack, you do not have to manually configure the credentials. In the working directory, run the command:

    ./scripts/configure-eks-auth.sh

When the command runs successfully, the confirmation AWS Credentials Removed appears.

Configure the EKS cluster environment

In this step, we configure the EKS cluster environment by running the command:

    ./scripts/setup-dev.sh

This step also builds and pushes the data service container image into Amazon ECR.

Prepare the A2D2 data

Before we can run the A2D2 data service, we need to extract the raw A2D2 data into your S3 bucket, extract the metadata from the raw data, and upload the metadata into the Redshift cluster. We execute these three steps using an AWS Step Functions state machine. To create and run the AWS Step Functions state machine, run the following command in the working directory:

    ./scripts/a2d2-etl-steps.sh

Note the executionArn of the state machine execution in the output of the previous command. To check the status of the execution, use the following command, replacing executionArn with your value:
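
A minimal sketch using the AWS CLI (the repository may wrap this in a helper script):

    # Check the Step Functions execution status (replace <executionArn> with your value)
    aws stepfunctions describe-execution \
        --execution-arn <executionArn> \
        --query 'status'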

The state machine execution time depends on many variable factors and may take anywhere from 4 to 24 hours, or possibly longer. All the AWS Batch jobs started as part of the state machine are automatically retried in case of failure.

Run the data service

The data service is deployed using a Helm chart and runs as a Kubernetes deployment in Amazon EKS. After the data service has been started (the Helm chart and start script are described in the GitHub repository), confirm that its pods are running by executing the following command:

    kubectl get pods -n a2d2

Run the data service client

To visualize the response data requested by the A2D2 data client, we will use the rviz tool on the ROS desktop. Open a terminal on the desktop, and run rviz.

In the rviz tool, use File>Open Config to select /home/ubuntu/amazon-eks-autonomous-driving-data-service/a2d2/config/a2d2.rviz as the rviz config. You should notice that the rviz tool is now configured with two areas, one for visualizing image data, and the other for visualizing point cloud data.

To run the data client, open a new terminal on the desktop, and execute the following command in the root directory of the cloned Github repository on the ROS desktop:

python ./a2d2/src/data_client.py --config ./a2d2/config/c-config-ex1.json 1>/tmp/a.out 2>&1 & 

After a brief delay, you should be able to preview the response data in the rviz tool. You can set "preview": false in the data client config file, ./a2d2/config/c-config-ex1.json, and rerun the preceding command to view the complete response. For maximum performance, pre-load S3 data to FSx for Lustre.

Hard reset of the data service

This step is for reference purposes. If at any time you need to do a hard reset of the data service, you can do so by executing:

    helm delete a2d2-data-service

This will delete all data service EKS pods immediately. All in-flight service responses will be aborted. Because the connection between the data client and the data service is asynchronous, the data clients may wait indefinitely, and you may need to clean up the data client processes manually on the ROS desktop using operating system tools. Note that each data client instance spawns multiple Python processes. You may also want to clean up the /fsx/rosbag directory.

Clean Up

When you no longer need the data service, delete the AWS CloudFormation stack from the AWS CloudFormation console. Deleting the stack will shut down the desktop instance and delete the EFS and FSx for Lustre file systems created in the stack. The Amazon S3 bucket is not deleted.

Conclusion

In this post, we demonstrated how to build a data service that can dynamically compose near real-time chunked data streams at scale using EKS, Redshift, MSK, and FSx for Lustre. By using a data service, you increase agility, flexibility and cost-efficiency in AD system research and development.

Related reading: Field Notes: Deploy and Visualize ROS Bag Data on AWS using rviz and Webviz for Autonomous Driving

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Introducing Karpenter – An Open-Source High-Performance Kubernetes Cluster Autoscaler

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/

Today we are announcing that Karpenter is ready for production. Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. It helps improve your application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load. Karpenter also provides just-in-time compute resources to meet your application’s needs and will soon automatically optimize a cluster’s compute resource footprint to reduce costs and improve performance.

Before Karpenter, Kubernetes users needed to dynamically adjust the compute capacity of their clusters to support applications using Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler. Nearly half of Kubernetes customers on AWS report that configuring cluster auto scaling using the Kubernetes Cluster Autoscaler is challenging and restrictive.

When Karpenter is installed in your cluster, Karpenter observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate them to reduce scheduling latencies and infrastructure costs. Karpenter does this by observing events within the Kubernetes cluster and then sending commands to the underlying cloud provider’s compute service, such as Amazon EC2.

Karpenter is an open-source project licensed under the Apache License 2.0. It is designed to work with any Kubernetes cluster running in any environment, including all major cloud providers and on-premises environments. We welcome contributions to build additional cloud providers or to improve core project functionality. If you find a bug, have a suggestion, or have something to contribute, please engage with us on GitHub.

Getting Started with Karpenter on AWS
To get started with Karpenter in any Kubernetes cluster, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources on the provider of your choice.

Once installed in your cluster, the default Karpenter provisioner observes incoming Kubernetes pods that cannot be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.

I want to show a quick start using Karpenter in an Amazon EKS cluster based on Getting Started with Karpenter on AWS. It requires the installation of AWS Command Line Interface (AWS CLI), kubectl, eksctl, and Helm (the package manager for Kubernetes). After setting up these tools, create a cluster with eksctl. This example configuration file specifies a basic cluster with one initial node.

cat <<EOF > cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-karpenter-demo
  region: us-east-1
  version: "1.20"
managedNodeGroups:
  - instanceType: m5.large
    amiFamily: AmazonLinux2
    name: eks-karpenter-demo-ng
    desiredCapacity: 1
    minSize: 1
    maxSize: 5
EOF
$ eksctl create cluster -f cluster.yaml

Karpenter itself can run anywhere, including on self-managed node groups, managed node groups, or AWS Fargate. Karpenter will provision EC2 instances in your account.

Next, you need to create necessary AWS Identity and Access Management (IAM) resources using the AWS CloudFormation template and IAM Roles for Service Accounts (IRSA) for the Karpenter controller to get permissions like launching instances following the documentation. You also need to install the Helm chart to deploy Karpenter to your cluster.

$ helm repo add karpenter https://charts.karpenter.sh
$ helm repo update
$ helm upgrade --install --skip-crds karpenter karpenter/karpenter --namespace karpenter \
  --create-namespace --set serviceAccount.create=false --version 0.5.0 \
  --set controller.clusterName=eks-karpenter-demo \
  --set controller.clusterEndpoint=$(aws eks describe-cluster --name eks-karpenter-demo --query "cluster.endpoint" --output text) \
  --wait # for the defaulting webhook to install before creating a Provisioner

Karpenter provisioners are a Kubernetes resource that enables you to configure the behavior of Karpenter in your cluster. When you create a default provisioner, without further customization besides what is needed for Karpenter to provision compute resources in your cluster, Karpenter automatically discovers node properties such as instance types, zones, architectures, operating systems, and purchase types of instances. You don’t need to define these spec:requirements if there is no explicit business requirement.

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Requirements that constrain the parameters of provisioned nodes.
  # Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: node.k8s.aws/instance-type # If not included, all instance types are considered
      operator: In
      values: ["m5.large", "m5.2xlarge"]
    - key: "topology.kubernetes.io/zone" # If not included, all zones are considered
      operator: In
      values: ["us-east-1a", "us-east-1b"]
    - key: "kubernetes.io/arch" # If not included, all architectures are considered
      operator: In
      values: ["arm64", "amd64"]
    - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-eks-karpenter-demo
  ttlSecondsAfterEmpty: 30
EOF

The ttlSecondsAfterEmpty value configures Karpenter to terminate empty nodes. If this value is disabled, nodes will never scale down due to low utilization. To learn more, see Provisioner custom resource definitions (CRDs) on the Karpenter site.

Karpenter is now active and ready to begin provisioning nodes in your cluster. Create some pods using a deployment, and watch Karpenter provision nodes in response.

$ kubectl create deployment inflate \
          --image=public.ecr.aws/eks-distro/kubernetes/pause:3.2

Let’s scale the deployment and check out the logs of the Karpenter controller.

$ kubectl scale deployment inflate --replicas 10
$ kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l karpenter=controller -o name)
2021-11-23T04:46:11.280Z        INFO    controller.allocation.provisioner/default       Starting provisioning loop      {"commit": "abc12345"}
2021-11-23T04:46:11.280Z        INFO    controller.allocation.provisioner/default       Waiting to batch additional pods        {"commit": "abc123456"}
2021-11-23T04:46:12.452Z        INFO    controller.allocation.provisioner/default       Found 9 provisionable pods      {"commit": "abc12345"}
2021-11-23T04:46:13.689Z        INFO    controller.allocation.provisioner/default       Computed packing for 10 pod(s) with instance type option(s) [m5.large]  {"commit": " abc123456"}
2021-11-23T04:46:16.228Z        INFO    controller.allocation.provisioner/default       Launched instance: i-01234abcdef, type: m5.large, zone: us-east-1a, hostname: ip-192-168-0-0.ec2.internal    {"commit": "abc12345"}
2021-11-23T04:46:16.265Z        INFO    controller.allocation.provisioner/default       Bound 9 pod(s) to node ip-192-168-0-0.ec2.internal  {"commit": "abc12345"}
2021-11-23T04:46:16.265Z        INFO    controller.allocation.provisioner/default       Watching for pod events {"commit": "abc12345"}

The provisioner’s controller listens for pod events; in this case it launched a new instance and bound the provisionable pods to the new node.

Now, delete the deployment. After 30 seconds (ttlSecondsAfterEmpty = 30), Karpenter should terminate the empty nodes.

$ kubectl delete deployment inflate
$ kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l karpenter=controller -o name)
2021-11-23T04:46:18.953Z        INFO    controller.allocation.provisioner/default       Watching for pod events {"commit": "abc12345"}
2021-11-23T04:49:05.805Z        INFO    controller.Node Added TTL to empty node ip-192-168-0-0.ec2.internal {"commit": "abc12345"}
2021-11-23T04:49:35.823Z        INFO    controller.Node Triggering termination after 30s for empty node ip-192-168-0-0.ec2.internal {"commit": "abc12345"}
2021-11-23T04:49:35.849Z        INFO    controller.Termination  Cordoned node ip-192-168-116-109.ec2.internal   {"commit": "abc12345"}
2021-11-23T04:49:36.521Z        INFO    controller.Termination  Deleted node ip-192-168-0-0.ec2.internal    {"commit": "abc12345"}

If you delete a node with kubectl, Karpenter will gracefully cordon, drain, and shut down the corresponding instance. Under the hood, Karpenter adds a finalizer to the node object, which blocks deletion until all pods are drained, and the instance is terminated.

Things to Know
Here are a couple of things to keep in mind about Karpenter features:

Accelerated Computing: Karpenter works with all kinds of Kubernetes applications, but it performs particularly well for use cases that require rapidly provisioning and deprovisioning large numbers of diverse compute resources. For example, this includes batch jobs to train machine learning models, run simulations, or perform complex financial calculations. You can leverage custom resources of nvidia.com/gpu, amd.com/gpu, and aws.amazon.com/neuron for use cases that require accelerated EC2 instances.

Provisioners Compatibility: Karpenter provisioners are designed to work alongside static capacity management solutions like Amazon EKS managed node groups and EC2 Auto Scaling groups. You may choose to manage the entirety of your capacity using provisioners, a mixed model with both dynamic and statically managed capacity, or a fully static approach. We recommend not using Kubernetes Cluster Autoscaler at the same time as Karpenter because both systems scale up nodes in response to unschedulable pods. If configured together, both systems will race to launch or terminate instances for these pods.

Join our Karpenter Community
Karpenter’s community is open to everyone. Give it a try, and join our working group meeting, or follow our roadmap for future releases that interest you. As I said, we welcome any contributions such as bug reports, new features, corrections, or additional documentation.

To learn more about Karpenter, see the documentation and demo video from AWS Container Day.

Channy

Disaster Recovery with AWS Managed Services, Part I: Single Region

Post Syndicated from Dhruv Bakshi original https://aws.amazon.com/blogs/architecture/disaster-recovery-with-aws-managed-services-part-i-single-region/

This 3-part blog series discusses disaster recovery (DR) strategies that you can implement to ensure your data is safe and that your workload stays available during a disaster. In Part I, we’ll discuss the single AWS Region/multi-Availability Zone (AZ) DR strategy.

The strategy outlined in this blog post addresses how to integrate AWS managed services into a single-Region DR strategy. This will minimize maintenance and operational overhead, create fault-tolerant systems, ensure high availability, and protect your data with robust backup/recovery processes. This strategy replicates workloads across multiple AZs and continuously backs up your data to another Region with point-in-time recovery, so your application is safe even if all AZs within your source Region fail.

Implementing the single Region/multi-AZ strategy

The following sections list the components of the example application presented in Figure 1, which illustrates a multi-AZ environment with a secondary Region that is used strictly for backups. This example architecture refers to an application that processes payment transactions and has been modernized with AWS Managed Services (AMS). We’ll show you which AWS services it uses and how they work to maintain the single Region/multi-AZ strategy.

Figure 1. Single Region/multi-AZ with secondary Region for backups

Amazon EKS control plane

Amazon Elastic Kubernetes Service (Amazon EKS) runs the Kubernetes management infrastructure across multiple AZs to eliminate a single point of failure.

This means that if an AZ or the underlying infrastructure fails, Amazon EKS automatically scales control plane nodes based on load, detects and replaces unhealthy control plane instances, and restarts them across the AZs within the Region as needed.

Amazon EKS data plane

Instead of creating individual Amazon Elastic Compute Cloud (Amazon EC2) instances, create worker nodes using an Amazon EC2 Auto Scaling group. Join the group to a cluster, and the group will automatically replace any terminated or failed nodes if an AZ fails. This ensures that the cluster can always run your workload.

Amazon ElastiCache

Amazon ElastiCache continually monitors the state of the primary node. If the primary node fails, it will promote the read replica with the least replication lag to primary. A replacement read replica is then created and provisioned in the same AZ as the failed primary. This is to ensure high availability of the service and application.

An ElastiCache for Redis (cluster mode disabled) cluster with multiple nodes has three types of endpoints: the primary endpoint, the reader endpoint and the node endpoints. The primary endpoint is a DNS name that always resolves to the primary node in the cluster.

Amazon Redshift

Currently, Amazon Redshift only supports single-AZ deployments. Although there are ways to work around this, we are focusing on cluster relocation. Parts II and III of this series will show you how to implement this service in a multi-Region DR deployment.

Cluster relocation enables Amazon Redshift to move a cluster to another AZ with no loss of data or changes to your applications. When Amazon Redshift relocates a cluster to a new AZ, the new cluster has the same endpoint as the original cluster. Your applications can reconnect to the endpoint and continue operations without modifications or loss of data.
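
As a sketch, relocation is enabled per cluster with the AWS CLI; the cluster identifier below is a placeholder, and the cluster must use RA3 node types and meet the other relocation prerequisites.

# Enable Availability Zone relocation on an existing Amazon Redshift cluster
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --availability-zone-relocation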

Note: Amazon Redshift may also relocate clusters in non-AZ failure situations, such as when issues in the current AZ prevent optimal cluster operation or to improve service availability.

Amazon OpenSearch Service

Deploying your data nodes into three AZs with Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) can improve the availability of your domain and increase your workload’s tolerance for AZ failures.

Amazon OpenSearch Service automatically deploys into three AZs when you select a multi-AZ deployment. This distribution helps prevent cluster downtime if an AZ experiences a service disruption. When you deploy across three AZs, Amazon OpenSearch Service distributes master nodes equally across all three AZs. That way, in the rare event of an AZ disruption, two master nodes will still be available.

Amazon OpenSearch Service also distributes primary shards and their corresponding replica shards to different zones. In addition to distributing shards by AZ, Amazon OpenSearch Service distributes them by node. When you deploy the data nodes across three AZs with one replica enabled, shards are distributed across the three AZs.

Note: For more information on multi-AZ configurations, please refer to the AZ disruptions table.

Amazon RDS PostgreSQL

Amazon Relational Database Service (Amazon RDS) handles failovers automatically so you can resume database operations as quickly as possible.

In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different AZ. The primary DB instance is synchronously replicated across AZs to a standby replica. If an AZ or infrastructure fails, Amazon RDS performs an automatic failover to the standby. This minimizes the disruption to your applications without administrative intervention.

Backing up data across Regions

Here is how the managed services back up data to a secondary Region:

  • Manage snapshots of persistent volumes for Amazon EKS with Velero. Amazon Simple Storage Service (Amazon S3) stores these snapshots in an S3 bucket in the primary Region. Amazon S3 replicates these snapshots to an S3 bucket in another Region via S3 cross-Region replication.
  • Create a manual snapshot of your Amazon OpenSearch Service clusters; the snapshot is stored in a registered repository like Amazon S3. You can do this manually or automate it via an AWS Lambda function, which automatically and asynchronously copies objects across Regions.
  • Use manual backups and copy API calls for Amazon ElastiCache to establish a snapshot and restore strategy in a secondary Region. You can manually back your data up to an S3 bucket or automate the backup via Lambda. Once your data is backed up, a snapshot of the ElastiCache cluster will be stored in an S3 bucket. Then S3 cross-Region replication will asynchronously copy the backup to an S3 bucket in a secondary Region.
  • Take automatic, incremental snapshots of your data periodically with Amazon Redshift and save them to Amazon S3. You can precisely control when snapshots are taken, and you can create a snapshot schedule and attach it to one or more clusters. You can also configure a cross-Region snapshot copy, which automatically copies your automated and manual snapshots to another Region (see the CLI sketch after this list).
  • Use AWS Backup to support AWS resources and third-party applications. AWS Backup copies RDS backups to multiple Regions on demand or automatically as part of a scheduled backup plan.
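
As an example of the Amazon Redshift configuration mentioned above, cross-Region snapshot copy can be enabled with a single CLI call; the cluster identifier, destination Region, and retention period are placeholders.

# Automatically copy automated and manual Redshift snapshots to a secondary Region
aws redshift enable-snapshot-copy \
    --cluster-identifier my-redshift-cluster \
    --destination-region us-west-2 \
    --retention-period 7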

Note: You can add a layer of protection to your backups through AWS Backup Vault Lock and S3 Object Lock.

Conclusion

The single Region/multi-AZ strategy safeguards your workloads against a disaster that disrupts an Amazon data center by replicating workloads across multiple AZs in the same Region. This blog showed you how AWS managed services automatically fail over between AZs without interruption during a localized disaster, and how backups to a separate Region ensure data protection.

In the next post, we will discuss a multi-Region warm standby strategy for the same application stack illustrated in this post.

Related information

Migrate your Applications to Containers at Scale

Post Syndicated from John O'Donnell original https://aws.amazon.com/blogs/architecture/migrate-your-applications-to-containers-at-scale/

AWS App2Container is a command line tool that you can install on a server to automate the containerization of applications. This simplifies the process of migrating a single server to containers. But if you have a fleet of servers, migrating all of them could be quite time-consuming. In this situation, you can automate the App2Container process with configuration management tools such as Chef, Ansible, or AWS Systems Manager. In this blog, we will illustrate an architecture to scale out App2Container using AWS Systems Manager.

Why migrate to containers?

Organizations can move to secure, low-touch services with Containers on AWS. A container is a lightweight, standalone collection of software that includes everything needed to run an application. This can include code, runtime, system tools, system libraries, and settings. Containers provide logical isolation and will always run the same, regardless of the host environment.

If you are running a .NET application hosted on Windows Internet Information Services (IIS) that is reaching end of life (EOL), you have two options: migrate the entire server platform, or re-host the websites on another hosting platform. Both options require manual effort and are often too complex to implement for legacy workloads. Once workloads have been migrated, you must still perform costly ongoing patching and maintenance.

Modernize with AWS App2Container

Containers can be used for these legacy workloads via AWS App2Container. AWS App2Container is a command line interface (CLI) tool for modernizing .NET and Java applications into containerized applications. App2Container analyzes and builds an inventory of all applications running in virtual machines, on-premises, or in the cloud. App2Container reduces the need to migrate the entire server OS, and moves only the specific workloads needed.

After you select the application you want to containerize, App2Container does the following:

  • Packages the application artifact and identified dependencies into container images
  • Configures the network ports
  • Generates the infrastructure, Amazon Elastic Container Service (ECS) tasks, and Kubernetes pod definitions

App2Container has a specific set of steps and requirements you must follow to create container images:

  1. Create an Amazon Simple Storage Service (S3) bucket to store your artifacts generated from each server.
  2. Create an AWS Identity and Access Management (IAM) user that has access to the Amazon S3 buckets and a designated Amazon Elastic Container Registry (ECR).
  3. Deploy a worker node as an Amazon Elastic Compute Cloud (Amazon EC2) instance. This will include a compatible operating system, which will take the artifacts and convert them into containers.
  4. Install the App2Container agent on each server that you want to migrate.
  5. Run a set of commands on each server for each application that you want to convert into a container.
  6. Run the commands on your worker node to perform the containerization and deployment.

Following, we will introduce a way to automate App2Container to reduce the time needed to deploy and scale this functionality throughout your environment.

Scaling App2Container

AWS App2Container streamlines the process of containerizing applications on a single server. For each server you must install the App2Container agent, initialize it, run an inventory, and run an analysis. But you can save time when containerizing a fleet of machines by automation, using AWS Systems Manager. AWS Systems Manager enables you to create documents with a set of command line steps that can be applied to one or more servers.

App2Container also supports setting up a worker node that can consume the output of the App2Container analysis step. This can be deployed to the new containerized version of the applications. This allows you to follow the security best practice of least privilege. Only the worker node will have permissions to deploy containerized applications. The migrating servers will need permissions to write the analysis output into an S3 bucket.

To use the worker node, separate the App2Container process into two parts (a command-level sketch follows the list below):

  • Analysis. This runs on the target server we are migrating. The results are output into S3.
  • Deployment. This runs on the worker node. It pushes the container image to Amazon ECR. It can deploy a running container to either Amazon ECS or Amazon Elastic Kubernetes Service (EKS).
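The following is a minimal, hedged sketch of the commands each part could run, assuming App2Container has already been initialized on the machines and pointed at the shared S3 bucket. The application ID shown is a placeholder returned by the inventory step, and the --input-archive path on the worker node is an assumption about where you download the extraction artifact from Amazon S3.

# On each target server (analysis side): discover, analyze, and extract the application
sudo app2container inventory
sudo app2container analyze --application-id iis-demo-site
sudo app2container extract --application-id iis-demo-site

# On the worker node (deployment side): containerize the extracted artifact and
# generate the ECS/EKS deployment artifacts
sudo app2container containerize --input-archive /tmp/iis-demo-site.tar
sudo app2container generate app-deployment --application-id iis-demo-site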
Figure 1. App2Container scaling architecture overview

Architectural walkthrough

As you can see in Figure 1, we need to set up an Amazon EC2 instance as the worker node, an S3 bucket for the analysis output, and two AWS Systems Manager documents. The first document is run on the target server. It will install App2Container and run the analysis steps. The second document is run on the worker node and handles the deployment of the container image.
The AWS Systems Manager document can target one or many hosts, enabling you to run the analysis step in parallel for multiple servers. Results and artifacts, such as files or .NET assembly code, are sent to the preconfigured Amazon S3 bucket for processing, as shown in Figure 2.
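As a hedged sketch, the analysis document can be fanned out to every tagged server with a single AWS CLI call; the document name, tag key/value, and bucket name below are illustrative assumptions:

# Run the analysis document against every instance tagged for migration
aws ssm send-command \
    --document-name "App2Container-Analysis" \
    --targets "Key=tag:Migrate,Values=app2container" \
    --output-s3-bucket-name "my-a2c-artifacts-bucket" \
    --comment "App2Container analysis fan-out"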

Figure 2. Container migration target servers

After the artifacts have been generated, a second document can be run against the worker node. This scans all files in the Amazon S3 bucket, and workloads are automatically containerized. The resulting images are pushed to Amazon ECR, as shown in Figure 3.

Figure 3. Container migration conversion

When this process is completed, you can then choose how to deploy these images, using Amazon ECS and/or Amazon EKS. Once the images and deployments are tested and the migration is completed, target servers and migration factory resources can be safely decommissioned.

This architecture demonstrates an automated approach to containerizing .NET web applications. AWS Systems Manager is used for discovery, package creation, and posting to an Amazon S3 bucket. An EC2 instance converts the package into a container so it is ready to use. The final step is to push the converted container to a scalable container repository (Amazon ECR). This way it can easily be integrated into our container platforms (ECS and EKS).

Summary

This solution offers many benefits for migrating legacy .NET-based websites directly to containers. The proposed architecture is powered by AWS App2Container and automates the tooling across many targets in a secure manner. Keep in mind that every customer's portfolio and application requirements are unique, so it's essential to validate and review any migration plans with business and application owners. With the right planning, engagement, and implementation, you should have a smooth and rapid journey to containers on AWS.

If you have any questions, post your thoughts in the comments section.

For further reading:

Configure Amazon EMR Studio and Amazon EKS to run notebooks with Amazon EMR on EKS

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/big-data/configure-amazon-emr-studio-and-amazon-eks-to-run-notebooks-with-amazon-emr-on-eks/

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that allows you to run analytics workloads on Amazon Elastic Kubernetes Service (Amazon EKS). This is an attractive option because it allows you to run applications on a common pool of resources without having to provision infrastructure. In addition, you can use Amazon EMR Studio to build analytics code running on Amazon EKS clusters. EMR Studio is a web-based, integrated development environment (IDE) using fully managed Jupyter notebooks that can be attached to any EMR cluster, including EMR on EKS. It uses AWS Single Sign-On (SSO) or a compatible identity provider (IdP) to log directly in to EMR Studio through a secure URL using corporate credentials.

Deploying EMR Studio to attach to EMR on EKS requires integrating several AWS services:

In addition, you need to install the following EMR on EKS components:

This post helps you build all the necessary components and stitch them together by running a single script. We also describe the architecture of this setup and how the components work together.

Architecture overview

With EMR on EKS, you can run Spark applications alongside other types of applications on the same Amazon EKS cluster, which improves resource allocation and simplifies infrastructure management. For more information about how Amazon EMR operates inside an Amazon EKS cluster, see New – Amazon EMR on Amazon Elastic Kubernetes Service (EKS). EMR Studio provides a web-based IDE that makes it easy to develop, visualize, and debug applications that run in EMR. For more information, see Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR.

Spark kernels are scheduled pods in a namespace in an Amazon EKS cluster. EMR Studio uses Jupyter Enterprise Gateway (JEG) to launch Spark kernels on Amazon EKS. A managed endpoint of type JEG is provisioned as a Kubernetes deployment in the EMR virtual cluster’s associated namespace and exposed as a Kubernetes service. Each EMR virtual cluster maps to a Kubernetes namespace registered with the Amazon EKS cluster; virtual clusters don’t manage physical compute or storage, but point to the Kubernetes namespace where the workload is scheduled. Each virtual cluster can have several managed endpoints, each with their own configured kernels for different use cases and needs. JEG managed endpoints provide HTTPS endpoints, serviced by an Application Load Balancer (ALB), that are reachable only from EMR Studio and self-hosted notebooks that are created within a private subnet of the Amazon EKS VPC.

The following diagram illustrates the solution architecture.

The managed endpoint is created in the virtual cluster’s Amazon EKS namespace (in this case, sparkns) and the HTTPS endpoints are serviced from private subnets. The kernel pods run with the job-execution IAM role defined in the managed endpoint. During managed endpoint creation, EMR on EKS uses the AWS Load Balancer Controller in the kube-system namespace to create an ALB with a target group that connects with the JEG managed endpoint in the virtual cluster’s Kubernetes namespace.

You can configure each managed endpoint's kernel differently. For example, to permit a Spark kernel to use AWS Glue as its catalog, you can apply the following configuration JSON file in the --configuration-overrides flag when creating a managed endpoint:

aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${virtclusterid} \
--name ${virtendpointname} \
--execution-role-arn ${role_arn} \
--release-label ${emr_release_label} \
--certificate-arn ${certarn} \
--region ${region} \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive"
        }
      }
    ]
  }'

The managed endpoint is a Kubernetes deployment fronted by a service inside the configured namespace (in this case, sparkns). When we trace the endpoint information, we can see how the Jupyter Enterprise Gateway deployment connects with the ALB and the target group:

# Get the endpoint ID
aws emr-containers list-managed-endpoints --region us-east-1 --virtual-cluster-id idzdhw2qltdr0dxkgx2oh4bp1
{
    "endpoints": [
        {
            "id": "5vbuwntrbzil1",
            "name": "virtual-emr-endpoint-demo",
            ...
            "serverUrl": "https://internal-k8s-default-ingress5-4f482e2d41-2097665209.us-east-1.elb.amazonaws.com:18888",

# List the deployment
kubectl get deployments -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                READY   UP-TO-DATE   AVAILABLE   AGE
jeg-5vbuwntrbzil1   1/1     1            1           4h54m


# List the service
kubectl get svc -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
service-5vbuwntrbzil1   NodePort   10.100.172.157   <none>        18888:30091/TCP   4h58m

# List the TargetGroups to get the TargetGroup ARN

kubectl get targetgroupbinding -n sparkns -o json | jq .items | jq .[].spec.targetGroupARN

"arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8"

# Get the TargetGroup Port number

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].Port

30091


# Get Load Balancer ARN

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].LoadBalancerArns | jq .[]

"arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273"

# Get Listener Port number

aws elbv2 describe-listeners --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273 | jq .Listeners | jq .[].Port

18888

To look at how this connects, consider two EMR Studio sessions. The ALB exposes port 18888 to the EMR Studio sessions. The JEG service maps the external port 18888 on the ALB to the dynamic NodePort on the JEG service (in this case, 30091). The JEG service forwards the traffic to the TargetPort 9547, which routes the traffic to the appropriate Spark driver pod. Each notebook session has its own kernel, which has its own respective Spark driver and executor pods, as the following diagram illustrates.
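To confirm the port chain for your own endpoint, you can query the JEG service directly. The service name and namespace below match the example in this post; substitute your own values:

# Print the service port, NodePort, and TargetPort of the managed endpoint's JEG service
kubectl get svc service-5vbuwntrbzil1 -n sparkns \
  -o jsonpath='{.spec.ports[0].port}{" "}{.spec.ports[0].nodePort}{" "}{.spec.ports[0].targetPort}{"\n"}'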

Attach EMR Studio to a virtual cluster and managed endpoint

Each time a user attaches a virtual cluster and a managed endpoint to their Studio Workspace and launches a Spark session, Spark drivers and Spark executors are scheduled. You can see that when you run kubectl to check what pods were launched:

$ kubectl get all -l app=enterprise-gateway
NAME                                  READY   STATUS      RESTARTS   AGE
pod/kb1a317e8-b77b-448c-9b7d-exec-1   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-exec-2   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-driver   2/2     Running     0          2m38s

$ kubectl get pods -n sparkns
NAME                                  READY   STATUS      RESTARTS   AGE
jeg-5vbuwntrbzil1-5fc8469d5f-pfdv9    1/1     Running     0          3d7h
kb1a317e8-b77b-448c-9b7d-exec-1       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-exec-2       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-driver       2/2     Running     0          2m46s

Each notebook Spark kernel session deploys a driver pod and executor pods that continue running until the kernel session is shut down.

The code in the notebook cells runs in the executor pods that were deployed in the Amazon EKS cluster.

Set up EMR on EKS and EMR Studio

Several steps and pieces are required to set up both EMR on EKS and EMR Studio. Enabling AWS SSO is a prerequisite. You can use the two launch scripts provided in this section, or deploy the components manually using the steps provided later in this post.

We provide two launch scripts in this post. One is a bash script that uses AWS CloudFormation, eksctl, and AWS Command Line Interface (AWS CLI) commands to provide an end-to-end deployment of a complete solution. The other uses the AWS Cloud Development Kit (AWS CDK) to do so.

The following diagram shows the architecture and components that we deploy.

Prerequisites

Make sure to complete the following prerequisites:

For information about the supported IdPs, see Enable AWS Single Sign-On for Amazon EMR Studio.

Bash script

The script is available on GitHub.

Prerequisites

The script requires you to use AWS Cloud9. Follow the instructions in the Amazon EKS Workshop. Make sure to follow these instructions carefully:

After you deploy the AWS Cloud9 desktop, proceed to the next steps.

Preparation

Use the following code to clone the GitHub repo and prepare the AWS Cloud9 prerequisites:

# Download script from the repository
$ git clone https://github.com/aws-samples/amazon-emr-on-eks-emr-studio.git

# Prepare the Cloud9 Desktop pre-requisites
$ cd amazon-emr-on-eks-emr-studio
$ bash ./prepare_cloud9.sh

Deploy the stack

Before running the script, provide the following information:

  • The AWS account ID and Region, if your AWS Cloud9 desktop isn’t in the same account ID or Region where you want to deploy EMR on EKS
  • The name of the Amazon Simple Storage Service (Amazon S3) bucket to create
  • The AWS SSO user to be associated with the EMR Studio session

After the script deploys the stack, the URL to the deployed EMR Studio is displayed:

# Launch the script and follow the instructions to provide user parameters
$ bash ./deploy_eks_cluster_bash.sh

...
Go to https://***.emrstudio-prod.us-east-1.amazonaws.com and login using < SSO user > ...

AWS CDK script

The AWS CDK scripts are available on GitHub. You need to check out the main branch. The stacks deploy an Amazon EKS cluster and an EMR on EKS virtual cluster in a new VPC with private subnets, and optionally an Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment and EMR Studio.

Prerequisites

You need the AWS CDK version 1.90.1 or higher. For more information, see Getting started with the AWS CDK.

We use a prefix list to restrict access to some resources to network IP ranges that you approve. Create a prefix list if you don’t already have one.
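If you need to create one, the following is a minimal sketch using the AWS CLI; the prefix list name, CIDR range, and maximum entries are placeholder values:

# Create a customer-managed prefix list with a single approved CIDR range
aws ec2 create-managed-prefix-list \
    --prefix-list-name emr-studio-allowed-ips \
    --address-family IPv4 \
    --max-entries 10 \
    --entries Cidr=203.0.113.0/24,Description="Corporate network range"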

If you plan to use EMR Studio, you need AWS SSO configured in your account.

Preparation

After you clone the repository and checkout the main branch, create and activate a new Python virtual environment:

# Clone the repository
$ git clone https://github.com/aws-samples/aws-cdk-for-emr-on-eks.git
$ cd aws-cdk-for-emr-on-eks/
$ git checkout main

# Create and activate a Python virtual environment
$ python3 -m venv .venv
$ source .venv/bin/activate

Now install the Python dependencies:

$ pip install -r requirements.txt

Lastly, bootstrap the AWS CDK:

$ cdk bootstrap aws://<account>/<region> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Deploy the stacks

Synthesize the AWS CDK stacks with the following code:

$ cdk synth \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

This command generates four stacks:

  • emr-eks-cdk – The main stack
  • mwaa-cdk – Adds Amazon MWAA
  • studio-cdk – Adds EMR Studio prerequisites
  • studio-cdk-live – Adds EMR Studio

The following diagram illustrates the resources deployed by the AWS CDK stacks.

Start by deploying the first stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge  \
  --context username=<SSO user name> \
  emr-eks-cdk

If you want to use Apache Airflow as your orchestrator, deploy that stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  mwaa-cdk

Deploy the first EMR Studio stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  studio-cdk

Wait for the managed endpoint to become active. You can check the status by running the following code:

$ aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq '.endpoints[].state'

The virtual cluster ID is available in the AWS CDK output from the emr-eks-cdk stack.
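If you no longer have the deploy-time output handy, you can usually read the value back from the stack outputs. This assumes the stack is named emr-eks-cdk; adjust the name if yours differs:

# List the CloudFormation outputs of the emr-eks-cdk stack to find the virtual cluster ID
aws cloudformation describe-stacks \
    --stack-name emr-eks-cdk \
    --query "Stacks[0].Outputs"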

When the endpoint is active, deploy the second EMR Studio stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  studio-cdk-live

Manual deployment

If you prefer to manually deploy EMR on EKS and EMR Studio, use the steps in this section.

Set up a VPC

If you're using Amazon EKS v1.18, set up a VPC that has private subnets and is appropriately tagged for external load balancers. For tagging requirements, see Application load balancing on Amazon EKS and Create an EMR Studio service role.

Create an Amazon EKS cluster

Launch an Amazon EKS cluster with at least one managed node group. For instructions, see Setting up and Getting Started with Amazon EKS.

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

To create your IAM policies, roles, IdP, and SSL/TLS certificate, complete the following steps:

  1. Enable cluster access for EMR on EKS.
  2. Create an IdP in IAM based on the EKS OIDC provider URL.
  3. Create an SSL/TLS certificate and place it in AWS Certificate Manager.
  4. Create the relevant IAM policies and roles:
    1. Job execution role
    2. Update the trust policy for the job execution role
    3. Deploy and create the IAM policy for the AWS Load Balancer Controller
    4. EMR Studio service role
    5. EMR Studio user role
    6. EMR Studio user policies associated with AWS SSO users and groups
  5. Register the Amazon EKS cluster with Amazon EMR to create the virtual EMR cluster
  6. Create the appropriate security groups to be attached to each EMR Studio created:
    1. Workspace security group
    2. Engine security group
  7. Tag the security groups with the appropriate tags. For instructions, see Create an EMR Studio service role.

Required installs in Amazon EKS

Deploy the AWS Load Balancer Controller in the Amazon EKS cluster if you haven’t already done so.

Create EMR on EKS relevant pieces and map the user to EMR Studio

Complete the following steps:

  1. Create at least one EMR virtual cluster associated with the Amazon EKS cluster. For instructions, see Step 1 of Set up Amazon EMR on EKS for EMR Studio.
  2. Create at least one managed endpoint. For instructions, see Step 2 of Set up Amazon EMR on EKS for EMR Studio.
  3. Create at least one EMR Studio; associate the EMR Studio with the private subnets configured with the Amazon EKS cluster. For instructions, see Create an EMR Studio.
  4. When the EMR Studio is available, map an AWS SSO user or group to the EMR Studio and apply an appropriate IAM policy to that user.

Use EMR Studio

To start using EMR Studio, complete the following steps:

  1. Find the URL for EMR Studio by listing the studios in a Region:
$ aws emr list-studios --region us-east-1
{
    "Studios": [
        {
            "StudioId": "es-XXXXXXXXXXXXXXXXXXXXXX",
            "Name": "emr_studio_1",
            "VpcId": "vpc-XXXXXXXXXXXXXXXXXXXX",
            "Url": "https://es-XXXXXXXXXXXXXXXXXXXXXX.emrstudio-prod.us-east-1.amazonaws.com",
            "CreationTime": "2021-02-10T14:04:13.672000+00:00"
        }
    ]
}
  2. With the listed URL, log in using the AWS SSO username you used earlier.

After authentication, the user is routed to the EMR Studio dashboard.

  3. Choose Create Workspace.
  4. For Workspace name, enter a name.
  5. For Subnet, choose the subnet that corresponds to one of the subnets associated with the managed node group.
  6. For S3 location, enter an S3 bucket where you can store the notebook content.

  7. After you create the Workspace, choose one that is in the Ready status.

  8. In the sidebar, choose the EMR cluster icon.
  9. Under Cluster type, choose EMR Cluster on EKS.
  10. Choose the available virtual cluster and available managed endpoint.
  11. Choose Attach.

After it’s attached, EMR Studio displays the kernels available in the Notebook and Console section.

  12. Choose PySpark (Kubernetes) to launch a notebook kernel and start a Spark session.

Because the endpoint configuration here uses AWS Glue for its metastore, you can list the databases and tables connected to the AWS Glue Data Catalog. You can use the following example script to test the setup. Modify the script as necessary for the appropriate database and table that you have in your Data Catalog:

words='Welcome to Amazon EMR Studio'.split(' ')
wordRDD = sc.parallelize(words)
wc = wordRDD.map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)
print(wc.collect())

# Connect to Glue Catalog
spark.sql("""show databases like '< Database Name >'""").show(truncate=False)
spark.sql("""show tables in < Database Name >""").show(truncate=False)
# Run a simple select
spark.sql("""select * from < Database Name >.< Table Name > limit 10""").show(truncate=False)


Clean up

To avoid incurring future charges, delete the resources launched here by running remove_setup.sh:

# Launch the script
$ bash ./remove_setup.sh

Conclusion

EMR on EKS allows you to run applications on a common pool of resources inside an Amazon EKS cluster without having to provision infrastructure. EMR Studio is a fully managed Jupyter notebook and tool that provisions kernels that run on EMR clusters, including virtual clusters on Amazon EKS. In this post, we described the architecture of how EMR Studio connects with EMR on EKS and provided scripts to automatically deploy all the components to connect the two services.

If you have questions or suggestions, please leave a comment.


About the Authors

Randy DeFauw is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing analytics workloads with AWS analytics services.

Optimizing Cloud Infrastructure Cost and Performance with Starburst on AWS

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/architecture/optimizing-cloud-infrastructure-cost-and-performance-with-starburst-on-aws/

Amazon Web Services (AWS) Cloud is elastic, convenient to use, easy to consume, and makes it simple to onboard workloads. Because of this simplicity, the cost associated with onboarding workloads is sometimes overlooked.

There is a notion that when an organization moves its workload to the cloud, agility, scalability, performance, and cost issues will disappear. While this may be true for agility and scalability, you must still optimize your workload. You can do this with capabilities such as Amazon EC2 Auto Scaling and Amazon EC2 Spot Instances on Amazon Elastic Compute Cloud (Amazon EC2) to realize the performance and cost benefits the cloud offers.

In this blog post, we show you how Starburst Enterprise (Starburst) addressed a sudden increase in cost for their data analytics platform as they grew their internal teams and scaled out their infrastructure. After reviewing their architecture and deployments with AWS specialist architects, Starburst and AWS concluded that they could take the following steps to greatly reduce costs:

  1. Use Spot Instances to run workloads.
  2. Add Amazon EC2 Auto Scaling into their training and demonstration environments, because the Starburst platform is designed to elastically scale up and down.

For analytics workloads, when you rein in costs, you typically rein in performance. Starburst and AWS worked together to balance the cost and performance of Starburst’s data analytics platform while also harnessing the flexibility, scalability, security, and performance of the cloud.

What is Starburst Enterprise?

Starburst provides a Massively Parallel Processing SQL (MPPSQL) engine based on open source Trino. It is an analytics platform that provides the cornerstone of customers' intelligent data mesh and offers the following benefits and services:

  • The platform gives you a single point to access, monitor, and secure your data mesh.
  • The platform gives you options for your data compute. You no longer have to wait on data migrations or extract, transform, and load (ETL), there is no vendor lock-in, and there is no need to swap out your existing analytics tools.
  • Starburst Stargate (Stargate) ensures that large jobs are completed within each data domain of your data mesh. Only the result set is retrieved from the domain.
    • Stargate reduces data output, which reduces costs and increases performance.
    • Data governance policies can also be applied uniquely in each data domain, ensuring security compliance and federation.

As shown in Figure 1, there are many connectors for input and output that ensure you experience improved performance and security.

Figure 1. Starburst platform

Integrating Starburst Enterprise with AWS

As shown in Figure 2, Starburst Enterprise uses AWS services to deliver elastic scaling and optimize cost. The platform is architected with decoupled storage and compute. This allows the platform to scale as needed to analyze petabytes of data.

The platform can be deployed via AWS CloudFormation or Amazon Elastic Kubernetes Service (Amazon EKS). Starburst on AWS allows you to run analytic queries across AWS data sources and on-premises systems such as Teradata and Oracle.

Figure 2. Deployment architecture of Starburst platform on AWS

Amazon EC2 Auto Scaling

Enterprises have diverse analytic workloads; their compute and memory requirements vary with time. Starburst uses Amazon EKS and Amazon EC2 Auto Scaling to elastically scale compute resources to meet the demands of their analytics workloads.

  • Amazon EC2 Auto Scaling ensures that you have the compute capacity for your workloads to handle the load elastically. It is used to architect sophisticated, elastic, and resilient applications on the AWS Cloud.
    • Starburst uses the scheduled scaling feature of Amazon EC2 Auto Scaling to scale the cluster up/down based on time. Thus, they incur no costs when the cluster is not in use.
  • Amazon EKS is a fully managed Kubernetes service that allows you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane.

Scaling cloud resource consumption on demand has a major impact on controlling cloud costs. Starburst supports scaling down elastically, which means removing compute resources doesn’t impact the underlying processes.
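As a hedged sketch of the scheduled scaling approach described above, you can attach scheduled actions to a worker Auto Scaling group with the AWS CLI; the group name, schedule, and sizes are placeholder values:

# Scale the worker Auto Scaling group up on weekday mornings (times are UTC)
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name starburst-workers \
    --scheduled-action-name scale-up-weekday-mornings \
    --recurrence "0 8 * * MON-FRI" \
    --min-size 3 --max-size 20 --desired-capacity 10

# Scale it back down in the evening so no cost is incurred overnight
aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name starburst-workers \
    --scheduled-action-name scale-down-weekday-evenings \
    --recurrence "0 20 * * MON-FRI" \
    --min-size 0 --max-size 0 --desired-capacity 0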

Amazon EC2 Spot Instances

Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. They are available at up to a 90% discount compared to On-Demand Instance prices. If EC2 needs capacity for On-Demand Instance usage, Spot Instances can be interrupted by Amazon EC2 with a two-minute notification. There are many ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance.

Starburst has integrated Spot Instances as part of its Amazon EKS managed node groups to cost-optimize the analytics workloads. The best practice of instance diversification is implemented by using eksctl's integration with the Amazon EC2 instance selector (with the dry-run flag). This produces a list of instance types of the same size (vCPU/memory ratio) and uses them in the underlying node groups.

Same size instances are required to make best use of Kubernetes Cluster Autoscaler, which is used to manage the size of the cluster.
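The following is a minimal sketch of that workflow; the --instance-selector-* and --dry-run flags reflect eksctl's instance selector integration as we understand it, so verify them against your eksctl version, and the cluster and node group names are placeholders:

# Ask eksctl to propose a diversified set of same-sized Spot instance types and
# print the generated node group configuration without creating anything
eksctl create nodegroup \
    --cluster starburst-cluster \
    --name workers-spot \
    --spot \
    --instance-selector-vcpus 8 \
    --instance-selector-memory 32 \
    --dry-run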

Scaling down, handling interruptions, and provisioning compute

“Scaling in” an active application is tricky, but Starburst was built with resiliency in mind, and it can effectively manage shutdowns.

Spot Instances are an ideal compute option because Starburst can handle potential interruptions natively. Starburst also uses Amazon EKS managed node groups to provision nodes in the cluster. This requires significantly less operational effort compared to using self-managed node groups. It allows Starburst to enforce best practices like capacity optimized allocation strategy, capacity rebalancing, and instance diversification.

When you need to “scale out” the deployment, Amazon EKS and Amazon EC2 Auto Scaling help to provision capacity, as depicted in Figure 3.

Figure 3. Depicting “scale out” in a Starburst deployment

Benefits realized from using AWS services

In a short period, Starburst was able to increase the number of people working on AWS, adding five times the number of Solutions Architects they had previously. Additionally, in initial tests of the new deployment architecture, their Solutions Architects were able to complete up to three times the amount of work they could previously. Even after the workload increased more than 15 times, these two simple changes resulted in only a slight increase in total cost.

This cost and performance optimization allows Starburst to be more productive internally and realize more value for each dollar spent. It also further justified investing in building out their infrastructure footprint.

Conclusion

In building their architecture with AWS, Starburst realized the importance of having a robust and comprehensive cloud administration plan that they can implement and manage going forward. They are now able to balance cloud costs with performance and stability while still meeting their SLA requirements. Starburst plans to teach their customers about Spot Instance and Amazon EC2 Auto Scaling best practices to ensure they maintain a cost- and performance-optimized cloud architecture.

If you want to see the Starburst data analytics platform in action, you can get a free trial in the AWS Marketplace: Starburst Data Free Trial.

Reduce costs and increase resource utilization of Apache Spark jobs on Kubernetes with Amazon EMR on Amazon EKS

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/reduce-costs-and-increase-resource-utilization-of-apache-spark-jobs-on-kubernetes-with-amazon-emr-on-amazon-eks/

Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows you to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). If you run open-source Apache Spark on Amazon EKS, you can now use Amazon EMR to automate provisioning and management, and run Apache Spark up to three times faster. If you already use Amazon EMR, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.

Earlier this year, we launched support for pod templates in Amazon EMR on Amazon EKS to make it simpler to run Spark jobs on shared Amazon EKS clusters. A pod is a group of one or more containers, with shared storage and network resources, and a specification for how to run the containers. Pod templates are specifications that determine how each pod runs.

When you submit analytics jobs to a virtual cluster on Amazon EMR on EKS, Amazon EKS schedules the pods to execute the jobs. Your Amazon EKS cluster may have multiple node groups and instance types attached to it, and these pods could get scheduled on any of those Amazon Elastic Compute Cloud (Amazon EC2) instances. Organizations today need better resource utilization, the ability to run jobs on specific instances based on instance type, disk size, disk IOPS, and more, and ways to control costs when multiple teams submit jobs to a virtual cluster on Amazon EMR on EKS.

In this post, we look at support in Amazon EMR on EKS for Spark’s pod template feature and how to use that for resource isolation and controlling costs.

Pod templates have many use cases:

  • Cost reduction – To reduce costs, you can schedule Spark driver pods to run on EC2 On-Demand Instances while scheduling Spark executor pods to run on EC2 Spot Instances.
  • Resource utilization – To increase resource utilization, you can support multiple teams running their workloads on the same Amazon EKS cluster. Each team gets a designated Amazon EC2 node group to run their workloads on. You can use pod templates to enforce scheduling on the relevant node groups.
  • Logging and monitoring – To improve monitoring, you can run a separate sidecar container to forward logs to your existing monitoring application.
  • Initialization – To run initialization steps, you can run a separate init container that is run before the Spark main container starts. You can have your init container run initialization steps, such as downloading dependencies or generating input data. Then the Spark main container consumes the data.

Prerequisites

To follow along with the walkthrough, ensure that you have the following resources created:

Solution overview

We look at a common use in organizations where multiple teams want to submit jobs and need resource isolation and cost reduction. In this post, we simulate two teams trying to submit jobs to the Amazon EMR on EKS cluster and see how to isolate the resources between them when running jobs. We also look at cost reduction by having the Spark driver run on EC2 On-Demand Instances while using Spark executors to run on EC2 Spot Instances. The following diagram illustrates this architecture.

To implement the solution, we complete the following high-level steps:

  1. Create an Amazon EKS cluster.
  2. Create an Amazon EMR virtual cluster.
  3. Set up IAM roles.
  4. Create pod templates.
  5. Submit Spark jobs.

Create an Amazon EKS cluster

To create your Amazon EKS cluster, complete the following steps:

  1. Create a new file (create-cluster.yaml) with the following contents:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: blog-eks-cluster
  region: us-east-1

managedNodeGroups:
  # On-Demand nodegroup for job submitter/controller
 - name: controller
   instanceTypes: ["m5.xlarge", "m5a.xlarge","m5d.xlarge"]
   desiredCapacity: 1

  # On-Demand nodegroup for team-1 Spark driver
 - name: team-1-driver
   labels: { team: team-1-spark-driver }
   taints: [{ key: team-1, value: general-purpose, effect: NoSchedule }]
   instanceTypes: ["m5.xlarge", "m5a.xlarge","m5d.xlarge"]
   desiredCapacity: 3

  # Spot nodegroup for team-1 Spark executors
 - name: team-1-executor
   labels: { team: team-1-spark-executor }
   taints: [{ key: team-1, value: general-purpose, effect: NoSchedule }]
   instanceTypes: ["m5.2xlarge", "m5a.2xlarge", "m5d.2xlarge"]
   desiredCapacity: 5
   spot: true

  # On-Demand nodegroup for team-2 Spark driver
 - name: team-2-driver
   labels: { team: team-2-spark-driver }
   taints: [{ key: team-2, value: compute-intensive , effect: NoSchedule }]
   instanceTypes: ["c5.xlarge", "c5a.xlarge", "c5d.xlarge"]
   desiredCapacity: 3

  # Spot nodegroup for team-2 Spark executors
 - name: team-2-executor
   labels: { team: team-2-spark-executor }
   taints: [{ key: team-2, value: compute-intensive , effect: NoSchedule }]
   instanceTypes: ["c5.2xlarge","c5a.2xlarge","c5d.2xlarge"]
   spot: true
   desiredCapacity: 5
  2. Install the AWS CLI.

You can use version 1.18.157 or later, or version 2.0.56 or later. The following command is for Linux OS:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

For other operating systems, see Installing, updating, and uninstalling the AWS CLI version.

  3. Install eksctl (you must have eksctl version 0.34.0 or later):
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
eksctl version
  4. Install kubectl:
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin
  5. Create the Amazon EKS cluster using the create-cluster.yaml config file created in earlier steps:
eksctl create cluster -f create-cluster.yaml

This step launches an Amazon EKS cluster with five managed node groups: one On-Demand node group for the job submitter/controller, plus, for each of team-1 and team-2, an On-Demand node group for Spark drivers and a Spot node group for Spark executors.

After the Amazon EKS cluster is created, run the following command to check the node groups:

eksctl get nodegroups --cluster blog-eks-cluster

You should see a response similar to the following screenshot.

Create an Amazon EMR virtual cluster

We launch the EMR virtual cluster in the default namespace:

eksctl create iamidentitymapping \
    --cluster blog-eks-cluster \
    --namespace default \
    --service-name "emr-containers"
aws emr-containers create-virtual-cluster \
--name blog-emr-on-eks-cluster \
--container-provider '{"id": "blog-eks-cluster","type": "EKS","info": {"eksInfo": {"namespace": "default"}} }'

The command creates an EMR virtual cluster in the Amazon EKS default namespace and outputs the virtual cluster ID:

{
    "id": "me9zfn2lbt241wxhxg81gjlxb",
    "name": "blog-emr-on-eks-cluster",
    "arn": "arn:aws:emr-containers:us-east-1:xxxx:/virtualclusters/me9zfn2lbt241wxhxg81gjlxb"
}

Note the ID of the EMR virtual cluster to use to run the jobs.

Set up an IAM role

In this step, we create an Amazon EMR Spark job execution role with the following IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents",
                "logs:CreateLogStream",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
               "logs:CreateLogGroup"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
} 

Navigate to the IAM console to create the role. Let’s call the role EMR_EKS_Job_Execution_Role. For more information, see Creating IAM roles and Creating IAM Policies.
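If you prefer the AWS CLI to the console, the following is a minimal sketch. The trust policy below is only a placeholder so the role can be created; the update-role-trust-policy command in the next step replaces it with the EKS OIDC trust relationship. The policy file name is an assumption and should contain the IAM policy shown above:

# Placeholder trust policy; it is overwritten by "aws emr-containers update-role-trust-policy"
cat > trust-placeholder.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Principal": { "Service": "eks.amazonaws.com" }, "Action": "sts:AssumeRole" }
  ]
}
EOF

# Create the job execution role and attach the permissions policy (saved as job-policy.json)
aws iam create-role \
    --role-name EMR_EKS_Job_Execution_Role \
    --assume-role-policy-document file://trust-placeholder.json

aws iam put-role-policy \
    --role-name EMR_EKS_Job_Execution_Role \
    --policy-name EMR_EKS_Job_Execution_Policy \
    --policy-document file://job-policy.json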

Set up the trust policy for the role with the following command:

aws emr-containers update-role-trust-policy \
       --cluster-name blog-eks-cluster \
       --namespace default \
       --role-name EMR_EKS_Job_Execution_Role

Enable IAM roles for service accounts (IRSA) on the Amazon EKS cluster:

eksctl utils associate-iam-oidc-provider --cluster blog-eks-cluster --approve

Create pod templates with node selectors and taints

In this step, we create pod templates for the Team-1 Spark driver pods and Spark executor pods, and templates for the Team-2 Spark driver pods and Spark executor pods.

Run the following commands to view the nodes corresponding to team-1 for the label team=team-1-spark-driver:

$ kubectl get nodes --selector team=team-1-spark-driver
NAME                              STATUS   ROLES    AGE    VERSION
ip-192-168-123-197.ec2.internal   Ready    <none>   107m   v1.20.4-eks-6b7464
ip-192-168-25-114.ec2.internal    Ready    <none>   107m   v1.20.4-eks-6b7464
ip-192-168-91-201.ec2.internal    Ready    <none>   107m   v1.20.4-eks-6b7464

Similarly, you can view the nodes corresponding to team-1 for the label team=team-1-spark-executor. You can repeat the same commands to view the nodes corresponding to team-2 by changing the labels:

$ kubectl get nodes --selector team=team-1-spark-executor
NAME                             STATUS   ROLES    AGE    VERSION
ip-192-168-100-76.ec2.internal   Ready    <none>   107m   v1.20.4-eks-6b7464
ip-192-168-5-196.ec2.internal    Ready    <none>   67m    v1.20.4-eks-6b7464
ip-192-168-52-131.ec2.internal   Ready    <none>   78m    v1.20.4-eks-6b7464
ip-192-168-58-137.ec2.internal   Ready    <none>   107m   v1.20.4-eks-6b7464
ip-192-168-70-68.ec2.internal    Ready    <none>   107m   v1.20.4-eks-6b7464

You can constrain a pod so that it can only run on a particular set of nodes. There are several ways to do this, and the recommended approaches all use label selectors to facilitate the selection. In some circumstances, you may want to control which node the pod deploys to; for example, to ensure that a pod ends up on a machine with an SSD attached to it, or to co-locate pods from two different services that communicate frequently in the same Availability Zone.

nodeSelector is the simplest recommended form of node selection constraint. nodeSelector is a field of PodSpec. It specifies a map of key-value pairs. For the pod to be eligible to run on a node, the node must have each of the indicated key-value pairs as labels.

Taints are used to repel pods from specific nodes. Taints and tolerations work together to ensure that pods aren't scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node shouldn't accept any pods that don't tolerate the taints. Amazon EKS supports configuring Kubernetes taints through managed node groups. Taints and tolerations are a flexible way to steer pods away from nodes or evict pods that shouldn't be running. A common use case is dedicated nodes: if you want to dedicate a set of nodes, such as GPU instances, for exclusive use by a particular group of users, you can add a taint to those nodes and then add a corresponding toleration to their pods.

nodeSelector provides a very simple way to attract pods to nodes with particular labels, whereas taints are used to repel pods from specific nodes. You can apply taints to a team's node group and use pod templates to apply a corresponding toleration to their workload. This ensures that only the designated team can schedule jobs to their node group. The nodeSelector label directs the workload to the team's designated node group, and the toleration enables it to be scheduled despite the taint. During the Amazon EKS cluster creation, we provided taints for each of the managed node groups. We create pod templates that specify both nodeSelector and tolerations to schedule work to a team's node group.
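Before creating the pod templates, you can confirm the taints that eksctl applied to a team's node group by describing one of its nodes:

# Pick one node from team-1's driver node group and print its taints
NODE=$(kubectl get nodes -l team=team-1-spark-driver -o jsonpath='{.items[0].metadata.name}')
kubectl describe node "$NODE" | grep -A2 Taints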

  1. Create a new file team-1-driver-pod-template.yaml with the following contents:
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: team-1-spark-driver
  tolerations:
  - key: "team-1"
    operator: "Equal"
    value: "general-purpose"
    effect: "NoSchedule"
  containers:
  - name: spark-kubernetes-driver

Here, we specify nodeSelector as team: team-1-spark-driver. This makes sure that Spark driver pods are running on nodes created as part of node group team-1-spark-driver, which we created for Team-1. At the same time, we have a toleration for nodes tainted as team-1.

  2. Create a new file team-1-executor-pod-template.yaml with the following contents:
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: team-1-spark-executor
  tolerations:
  - key: "team-1"
    operator: "Equal"
    value: "general-purpose"
    effect: "NoSchedule"
  containers:
  - name: spark-kubernetes-executor

Here, we specify nodeSelector as team: team-1-spark-executor. This makes sure that Spark executor pods are running on nodes created as part of node group team-1-spark-executor, which we created for Team-1. At the same time, we have a toleration for nodes tainted as team-1.

  3. Create a new file team-2-driver-pod-template.yaml with the following contents:
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: team-2-spark-driver
  tolerations:
  - key: "team-2"
    operator: "Equal"
    value: "compute-intensive"
    effect: "NoSchedule"
  containers:
  - name: spark-kubernetes-driver

Here, we specify nodeSelector as team: team-2-spark-driver. This makes sure that Spark driver pods are running on nodes created as part of node group team-2-spark-driver, which we created for Team-2. At the same time, we have a toleration for nodes tainted as team-2.

  4. Create a new file team-2-executor-pod-template.yaml with the following contents:
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: team-2-spark-executor
  tolerations:
  - key: "team-2"
    operator: "Equal"
    value: "compute-intensive"
    effect: "NoSchedule"
  containers:
  - name: spark-kubernetes-executor

Here, we specify nodeSelector as team: team-2-spark-executor. This makes sure that Spark executor pods are running on nodes created as part of node group team-2-spark-executor, which we created for Team-2. At the same time, we have a toleration for nodes tainted as team-2.

Save the preceding pod template files to your S3 bucket or refer to them using the following links:

  • s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-driver-template.yaml
  • s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-executor-template.yaml
  • s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-2-driver-template.yaml
  • s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-2-executor-template.yaml

Submit Spark jobs

In this step, we submit the Spark jobs and observe the output.

Substitute the values of the EMR virtual cluster ID, EMR_EKS_Job_Execution_Role ARN, and S3_Bucket:

export EMR_EKS_CLUSTER_ID=<<EMR virtual cluster id>>
export EMR_EKS_EXECUTION_ARN=<<EMR_EKS_Job_Execution_Role ARN>>
export S3_BUCKET=<<S3_Bucket>>

Submit the Spark job:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.driver.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-driver-template.yaml",
          "spark.kubernetes.executor.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-executor-template.yaml"
         }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "blog-emr-eks"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }   
}'

After submitting the job, run the following command to check if the Spark driver and executor pods are created and running:

kubectl get pods

You should see output similar to the following:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
00000002uf8fut3g6o6-m4nfz 3/3 Running 0 25s
spark-00000002uf8fut3g6o6-driver 2/2 Running 0 14s

Let’s check the pods deployed on team-1’s On-Demand Instances:

$ for n in $(kubectl get nodes -l team=team-1-spark-driver --no-headers | cut -d " " -f1); do echo "Pods on instance ${n}:";kubectl get pods -n default --no-headers --field-selector spec.nodeName=${n} ; echo ; done

Pods on instance ip-192-168-27-110.ec2.internal:
No resources found in default namespace.

Pods on instance ip-192-168-59-167.ec2.internal:
spark-00000002uf8fut3g6o6-driver 2/2 Running 0 29s

Pods on instance ip-192-168-73-223.ec2.internal:
No resources found in default namespace.

Let’s check the pods deployed on team-1’s Spot Instances:

$ for n in $(kubectl get nodes -l team=team-1-spark-executor --no-headers | cut -d " " -f1); do echo "Pods on instance ${n}:";kubectl get pods -n default --no-headers --field-selector spec.nodeName=${n} ; echo ; done

Pods on instance ip-192-168-108-90.ec2.internal:
No resources found in default namespace.

Pods on instance ip-192-168-39-31.ec2.internal:
No resources found in default namespace.

Pods on instance ip-192-168-87-75.ec2.internal:
pythonpi-1623711779860-exec-3 0/2 Running 0 27s
pythonpi-1623711779937-exec-4 0/2 Running 0 26s

Pods on instance ip-192-168-88-145.ec2.internal:
pythonpi-1623711779748-exec-2 0/2 Running 0 27s

Pods on instance ip-192-168-92-149.ec2.internal:
pythonpi-1623711779097-exec-1 0/2 Running 0 28s
pythonpi-1623711780071-exec-5 0/2 Running 0 27s

When the executor pods are running, you should see output similar to the following:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
00000002uf8fut3g6o6-m4nfz 3/3 Running 0 56s
pythonpi-1623712009087-exec-1 0/2 Running 0 2s
pythonpi-1623712009603-exec-2 0/2 Running 0 2s
pythonpi-1623712009735-exec-3 0/2 Running 0 2s
pythonpi-1623712009833-exec-4 0/2 Running 0 2s
pythonpi-1623712009945-exec-5 0/2 Running 0 1s
spark-00000002uf8fut3g6o6-driver 2/2 Running 0 45s

To check the status of the jobs, choose the cluster on the Virtual Clusters page of the Amazon EMR console. You can also check the Spark History Server by choosing View logs.

When the job is complete, go to Amazon CloudWatch Logs and check the output by choosing the log (/emr-containers/jobs/<<xxx-driver>>/stdout) on the Log groups page. You should see output similar to the following screenshot.

Now submit the Spark job as team-2 and specify the pod template files pointing to team-2’s driver and executor pod specifications and observe where the pods are created:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.driver.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-2-driver-template.yaml",
          "spark.kubernetes.executor.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-2-executor-template.yaml"
         }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "blog-emr-eks"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }   
}'

We can check the status of the job on the Amazon EMR console and also by checking the CloudWatch logs.

Now, let’s run a use case where Team-1 doesn’t specify the correct toleration in the Spark driver’s pod template. We use the following pod template. As per the toleration specification, Team-1 is trying to schedule a Spark driver pod on nodes with label team-1-spark-driver and also wants it to get scheduled over nodes tainted as team-2. Because team-1 doesn’t have any nodes with that specification, we should see an error.

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    team: team-1-spark-driver
  tolerations:
  - key: "team-2"
    operator: "Equal"
    value: "compute-intensive"
    effect: "NoSchedule"
  containers:
  - name: spark-kubernetes-driver

Submit the Spark job using this new pod template:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G",
          "spark.kubernetes.driver.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-driver-template-negative.yaml",
          "spark.kubernetes.executor.podTemplateFile":"s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/team-1-executor-template.yaml"
         }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "blog-emr-eks"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }   
}'

Run the following command to check the status of the Spark driver pod:

$ kubectl get pods
NAME READY STATUS RESTARTS AGE
00000002uott6onmu8p-7t64m 3/3 Running 0 36s
spark-00000002uott6onmu8p-driver 0/2 Pending 0 25s

Let’s describe the driver pod to check the details. You should notice a failed event similar to Warning FailedScheduling 28s (x3 over 31s) default-scheduler 0/17 nodes are available: 8 node(s) had taint {team-1: general-purpose}, that the pod didn't tolerate, 9 node(s) didn't match Pod's node affinity

$ kubectl describe pod <<driver-pod-name>>
Name:           spark-00000002uott6onmu8p-driver
Namespace:      default
......
QoS Class:       Burstable
Node-Selectors:  team=team-1-spark-driver
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 team-2=compute-intensive:NoSchedule
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  28s (x3 over 31s)  default-scheduler  0/17 nodes are available: 8 node(s) had taint {team-1: general-purpose}, that the pod didn't tolerate, 9 node(s) didn't match Pod's node affinity..

This shows that if the pod template doesn’t have the right tolerations, the tainted nodes don’t tolerate the pod and don’t schedule over those nodes.

Clean up

Don’t forget to clean up the resources you created to avoid any unnecessary charges.

  1. Delete all the virtual clusters that you created:
#List all the virtual cluster ids
aws emr-containers list-virtual-clusters
#Delete virtual cluster by passing virtual cluster id
aws emr-containers delete-virtual-cluster --id <virtual-cluster-id>
  2. Delete the Amazon EKS cluster:
eksctl delete cluster blog-eks-cluster
  3. Delete the EMR_EKS_Job_Execution_Role role and policies, as sketched below.
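
There is no single command that removes an IAM role together with its attachments, so the role cleanup is a two-step process. The following is a minimal sketch, assuming the role is named EMR_EKS_Job_Execution_Role and only has managed policies attached (inline policies would need aws iam delete-role-policy instead):

#List the managed policies attached to the job execution role
aws iam list-attached-role-policies --role-name EMR_EKS_Job_Execution_Role
#Detach each policy ARN returned by the previous command
aws iam detach-role-policy --role-name EMR_EKS_Job_Execution_Role --policy-arn <policy-arn>
#Delete the role once all policies are detached
aws iam delete-role --role-name EMR_EKS_Job_Execution_Role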

Summary

In this post, we saw how to create an Amazon EKS cluster, configure Amazon EKS managed node groups, create an EMR virtual cluster on Amazon EKS, and submit Spark jobs. With pod templates, we saw how to manage resource isolation between various teams when submitting jobs and also learned how to reduce cost by running Spark driver pods on EC2 On-Demand Instances and Spark executor pods on EC2 Spot Instances.

To get started with pod templates, try out the Amazon EMR on EKS workshop or see the following resources:


About the Author

Saurabh Bhutyani is a Senior Big Data specialist solutions architect at Amazon Web Services. He is an early adopter of open source Big Data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Orchestrate Jenkins Workloads using Dynamic Pod Autoscaling with Amazon EKS

Post Syndicated from Vladimir Toussaint original https://aws.amazon.com/blogs/devops/orchestrate-jenkins-workloads-using-dynamic-pod-autoscaling-with-amazon-eks/

This blog post will demonstrate how to leverage Jenkins with Amazon Elastic Kubernetes Service (EKS) by running a Jenkins Manager within an EKS pod. In doing so, we can run Jenkins workloads by allowing Amazon EKS to spawn dynamic Jenkins Agent(s) in order to perform application and infrastructure deployment. Traditionally, customers set up a Jenkins Manager-Agent architecture that contains a set of manually added nodes with no autoscaling capabilities. The strategy described here instead runs build tasks on right-sized compute capacity that is provisioned only when work needs to be performed.

In setting up our Amazon EKS cluster with Jenkins, we’ll utilize the eksctl simple CLI tool for creating clusters on EKS. Then, we’ll build both the Jenkins Manager and Jenkins Agent image. Afterward, we’ll run a container deployment on our cluster to access the Jenkins application and utilize the dynamic Jenkins Agent pods to run pipelines and jobs.

Solution Overview

The architecture below illustrates the execution steps.

Solution Architecture diagram
Figure 1. Solution overview diagram

Disclaimer: This Jenkins application is not configured with persistent volume storage. If you need persistence, you must extend and configure this template to fit that requirement.

To accomplish this deployment workflow, we will do the following:

Centralized Shared Services account

  1. Deploy the Amazon EKS Cluster into a Centralized Shared Services Account.
  2. Create the Amazon ECR Repository for the Jenkins Manager and Jenkins Agent to store docker images.
  3. Deploy the kubernetes manifest file for the Jenkins Manager.

Target Account(s)

  1. Establish a set of AWS Identity and Access Management (IAM) roles with permissions for cross-account access from the Shared Services account into the Target account(s).

Jenkins Application UI

  1. Jenkins Plugins – Install and configure the Kubernetes Plugin and CloudBees AWS Credentials Plugin from Manage Plugins (you will not have to manually install this since it will be packaged and installed as part of the Jenkins image build).
  2. Jenkins Pipeline Example—Fetch the Jenkinsfile to deploy an S3 Bucket with CloudFormation in the Target account using a Jenkins parameterized pipeline.

Prerequisites

The following are the minimum requirements for this solution to work.

Account Prerequisites

  • Shared Services Account: The location of the Amazon EKS Cluster.
  • Target Account: The destination of the CI/CD pipeline deployments.

Build Requirements

Clone the Git Repository

git clone https://github.com/aws-samples/jenkins-cloudformation-deployment-example.git

Security Considerations

This blog provides a high-level overview of the best practices for cross-account deployment and isolation maintenance between the applications. We evaluated the cross-account application deployment permissions and will describe the current state as well as what to avoid. As part of the security best practices, we will maintain isolation among multiple apps deployed in these environments, e.g., Pipeline 1 does not deploy to the Pipeline 2 infrastructure.

Requirement

A Jenkins manager is running as a container in an EC2 compute instance that resides within a Shared AWS account. This Jenkins application represents individual pipelines deploying unique microservices that build and deploy to multiple environments in separate AWS accounts. The cross-account deployment utilizes the target AWS account admin credentials in order to do the deployment.

This methodology is problematic: it is not good practice to share account credentials externally. Additionally, the risk of deployment errors should be eliminated, and isolation should be maintained between applications deployed into the same account.

Note that the deployment steps are being run using AWS CLIs, thus our solution will be focused on AWS CLI usage.

The risk is much lower when utilizing CloudFormation or the CDK to conduct deployments, because the AWS CLI commands executed from the build jobs specify stack names as parameterized inputs and the probability of a stack-name error is very low. However, it remains inadvisable to utilize admin credentials of the target account.

Best Practice — Current Approach

We utilized cross-account roles that can restrict unauthorized access across build jobs. With this approach, we use the assume-role concept, which enables the requesting role to obtain temporary credentials (from the STS service) for the target role and execute actions permitted by the target role. This is safer than utilizing hard-coded credentials. The requesting role could be either the inherited EC2 instance role or specific user credentials. However, in our case, we are utilizing the inherited EC2 instance role.

For ease of understanding, we refer to the target role as the execution-role below.

Cross account roles for Jenkins build jobs
Figure 2. Current approach

  • As per the security best practice of assigning minimum privileges, we must first create an execution role in IAM in the target account that has deployment permissions (either via CloudFormation or via the CLI), e.g., app-dev-role in the Dev account and app-prod-role in the Prod account.
  • For each of those roles, we configure a trust relationship with the parent account ID (Shared Services account). This enables any roles in the Shared Services account (with assume-role permission) to assume the execution-role and deploy it on respective hosting infrastructure, e.g., the app-dev-role in Dev account will be a common execution role that will deploy various apps across infrastructure.
  • Then, we create a local role in the Shared Services account and configure credentials within Jenkins to be utilized by the Build Jobs. Provide the job with the assume-role permissions and specify the list of ARNs across every account. Alternatively, the inherited EC2 instance role can also be utilized to assume the execution-role.

Create Cross-Account IAM Roles

Cross-account IAM roles allow users to securely access AWS resources in a target account while maintaining the observability of that AWS account. The cross-account IAM role includes a trust policy allowing AWS identities in another AWS account to assume the given role. This allows us to create a role in one AWS account that delegates specific permissions to another AWS account.

  • Create an IAM role with a common name in each target account. The role name we’ve created is AWSCloudFormationStackExecutionRole. The role must have permissions to perform CloudFormation actions and any actions regarding the resources that will be created. In our case, we will be creating an S3 Bucket utilizing CloudFormation.
  • This IAM role must also have an established trust relationship to the Shared Services account. In this case, the Jenkins Agent will be granted the ability to assume the role of the particular target account from the Shared Services account.
  • In our case, the IAM entity that will assume the AWSCloudFormationStackExecutionRole is the EKS Node Instance Role that is associated with the EKS cluster nodes.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateUploadBucket",
                "cloudformation:ListStacks",
                "cloudformation:CancelUpdateStack",
                "cloudformation:ExecuteChangeSet",
                "cloudformation:ListChangeSets",
                "cloudformation:ListStackResources",
                "cloudformation:DescribeStackResources",
                "cloudformation:DescribeStackResource",
                "cloudformation:CreateChangeSet",
                "cloudformation:DeleteChangeSet",
                "cloudformation:DescribeStacks",
                "cloudformation:ContinueUpdateRollback",
                "cloudformation:DescribeStackEvents",
                "cloudformation:CreateStack",
                "cloudformation:DeleteStack",
                "cloudformation:UpdateStack",
                "cloudformation:DescribeChangeSet",
                "s3:PutBucketPublicAccessBlock",
                "s3:CreateBucket",
                "s3:DeleteBucketPolicy",
                "s3:PutEncryptionConfiguration",
                "s3:PutBucketPolicy",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}
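
The policy above grants the deployment permissions; the trust relationship is configured separately on the same role. The following is a minimal sketch of creating the role with such a trust policy, assuming a hypothetical Shared Services account ID of 111111111111 (in practice, you can scope the principal down to the specific EKS Node Instance Role ARN instead of the whole account):

# Create the execution role in the target account with a trust policy that allows
# principals in the Shared Services account (placeholder ID) to assume it
aws iam create-role \
  --role-name AWSCloudFormationStackExecutionRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
        "Action": "sts:AssumeRole"
      }
    ]
  }'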

Build Docker Images

Build the custom docker images for the Jenkins Manager and the Jenkins Agent, and then push the images to AWS ECR Repository. Navigate to the docker/ directory, then execute the command according to the required parameters with the AWS account ID, repository name, region, and the build folder name jenkins-manager/ or jenkins-agent/ that resides in the current docker directory. The custom docker images will contain a set of starter package installations.
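
The repository provides a script that wraps these steps. If you prefer to run them by hand, the underlying flow looks roughly like the following sketch (the account ID 111111111111, Region us-east-1, and repository name jenkins-manager are placeholders; repeat the same steps for jenkins-agent/):

# Authenticate the Docker CLI against your private ECR registry
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111111111111.dkr.ecr.us-east-1.amazonaws.com

# Build the Jenkins Manager image from its build folder and tag it for ECR
docker build -t jenkins-manager:latest ./jenkins-manager/
docker tag jenkins-manager:latest 111111111111.dkr.ecr.us-east-1.amazonaws.com/jenkins-manager:latest

# Push the image to the ECR repository
docker push 111111111111.dkr.ecr.us-east-1.amazonaws.com/jenkins-manager:latest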

Deploy Jenkins Application

After building both images, navigate to the k8s/ directory, modify the manifest file for the Jenkins image, and then execute the Jenkins manifest.yaml template to set up the Jenkins application. (Note: This Jenkins application is not configured with persistent volume storage. Therefore, you will need to establish and configure this template to fit that requirement.)

# Fetch the Application URL or navigate to the AWS Console for the Load Balancer
kubectl get svc -n jenkins

# Verify that jenkins deployment/pods are up running
kubectl get pods -n jenkins

# Replace with jenkins manager pod name and fetch Jenkins login password
kubectl exec -it pod/<JENKINS-MANAGER-POD-NAME> -n jenkins -- cat /var/jenkins_home/secrets/initialAdminPassword

  • The Kubernetes Plugin and CloudBees AWS Credentials Plugin should be installed as part of the Jenkins image build from the Managed Plugins.
  • Navigate: Manage Jenkins → Configure Global Security
  • Set the Crumb Issuer to remove the error pages in order to prevent Cross Site Request Forgery exploits.

Screenshot of Crumb issuer
Figure 3. Configure Global Security

Configure Jenkins Kubernetes Cloud

  • Navigate: Manage Jenkins → Manage Nodes and Clouds → Configure Clouds
  • Click: Add a new cloud → select Kubernetes from the drop menus

Screenshot to configure Cloud on Jenkins
Figure 4a. Jenkins Configure Nodes and Clouds

Note: Before proceeding, please ensure that you can access your Amazon EKS cluster information, whether it is through Console or CLI.

  • Enter a Name in the Kubernetes Cloud configuration field.
  • Enter the Kubernetes URL which can be found via AWS Console by navigating to the Amazon EKS service and locating the API server endpoint of the cluster, or run the command kubectl cluster-info.
  • Enter the namespace that will be utilized in the Kubernetes Namespace field. This will determine where the dynamic kubernetes pods will spawn. In our case, the name of the namespace is jenkins.
  • During the initial setup of Jenkins Manager on kubernetes, there is an environment variable JENKINS_URL that automatically utilizes the Load Balancer URL to resolve requests. However, we will resolve our requests locally to the cluster IP address.
    • The format is as follows: https://<service-name>.<namespace>.svc.cluster.local
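
For example, if the Jenkins manager were exposed through a service named jenkins-manager-svc (a hypothetical name; use the service name defined in the k8s/ manifests) in the jenkins namespace, the in-cluster URL would be:

https://jenkins-manager-svc.jenkins.svc.cluster.local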

Configuring Kubernetes cloud for Jenkins
Figure 4b. Configure Kubernetes Cloud

Set AWS Credentials

Security concerns are a key reason why we’re utilizing an IAM role instead of access keys. For any given approach involving IAM, it is the best practice to utilize temporary credentials.

  • You must have the AWS Credentials Binding Plugin installed before this step. Enter the unique ID name as shown in the example below.
  • Enter the IAM Role ARN you created earlier for both the ID and IAM Role to use in the field as shown below.

Setting up credentials on Jenkins
Figure 5. AWS Credentials Binding

Configuring Global credentials
Figure 6. Managed Credentials

Create a pipeline

  • Navigate to the Jenkins main menu and select new item
  • Create a Pipeline

Screenshot for Pipeline configuration
Figure 7. Create a pipeline

Configure Jenkins Agent

Set up a Kubernetes YAML template after you’ve built the agent image. In this example, we will be using the k8sPodTemplate.yaml file stored in the k8s/ folder.

CloudFormation Execution Scripts

This deploy-stack.sh file can accept four different parameters and conduct several types of CloudFormation stack executions such as deploy, create-changeset, and execute-changeset. This is also reflected in the stages of this Jenkinsfile pipeline. As for the delete-stack.sh file, two parameters are accepted, and, when executed, it will delete a CloudFormation stack based on the given stack name and region.
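
The exact parameters these scripts accept are defined in the repository, but under the hood they wrap standard CloudFormation CLI calls. The following is a rough sketch of the kinds of commands involved (the stack name, template path, and Region are placeholder values):

# Deploy (create or update) a stack directly from a template file
aws cloudformation deploy --stack-name example-s3-stack --template-file example-template.yaml --region us-east-1

# Alternatively, stage a change set against an existing stack and review it before executing
aws cloudformation create-change-set --stack-name example-s3-stack --change-set-name example-changes --template-body file://example-template.yaml --region us-east-1
aws cloudformation execute-change-set --stack-name example-s3-stack --change-set-name example-changes --region us-east-1

# Delete the stack by name and Region
aws cloudformation delete-stack --stack-name example-s3-stack --region us-east-1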

Jenkinsfile

In this Jenkinsfile, the individual pipeline build jobs will deploy individual microservices. The k8sPodTemplate.yaml is utilized to specify the kubernetes pod details and the inbound-agent that will be utilized to run the pipeline.

Jenkins Pipeline: Execute a pipeline

  • Click Build with Parameters and then select a build action.

Configuring stackname in Jenkins configuration
Figure 8a. Build with Parameters

  • Examine the pipeline stages in more detail for the build action you selected. Also, view the details of each stage below and verify in your AWS account that the CloudFormation stack was executed.

Jenkins pipeline dashboard
Figure 8b. Pipeline Stage View

  • The Final Step is to execute your pipeline and watch the pods spin up dynamically in your terminal. As is shown below, the Jenkins agent pod spawned and then terminated after the work completed. Watch this task on your own by executing the following command:
# Watch the pods spawn in the "jenkins" namespace
kubectl get pods -n jenkins -w

CLI output showing Jenkins POD status
Figure 9. Watch Jenkins Agent Pods Spawn

Code Repository

References

Cleanup

In order to avoid incurring future charges, delete the resources utilized in the walkthrough.

  • Delete the EKS cluster. You can use eksctl to delete the cluster, as shown in the sketch after this list.
  • Delete any remaining AWS resources created by EKS such as AWS LoadBalancer, Target Groups, etc.
  • Delete any related IAM entities.
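
A minimal sketch of the cluster deletion, assuming the cluster was created with eksctl (the cluster name and Region are placeholders):

# Delete the EKS cluster and the supporting resources eksctl created for it
eksctl delete cluster --name <your-cluster-name> --region <your-region>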

Conclusion

This post walked you through the process of building out Amazon EKS based infrastructure and integrating Jenkins to orchestrate workloads. We demonstrated how you can utilize this to deploy securely across multiple accounts with dynamic Jenkins agents and create alignment to your business with similar use cases. To learn more about Amazon EKS, see our documentation pages or explore our console.

About the Authors

Vladimir Toussaint Headshot1.png

Vladimir P. Toussaint

Vladimir is a DevOps Cloud Architect at Amazon Web Services. He works with GovCloud customers to build solutions and capabilities as they move to the cloud. Previous to Amazon Web Services, Vladimir has leveraged container orchestration tools such as Kubernetes to securely manage microservice applications for large enterprises.

Matt Noyce Headshot1.png

Matt Noyce

Matt is a Sr. Cloud Application Architect at Amazon Web Services. He works primarily with health care and life sciences customers to help them architect and build applications, data lakes, and DevOps pipelines that solve their business needs. In his spare time Matt likes to run and hike along with enjoying time with friends and family.

Nikunj Vaidya Headshot1.png

Nikunj Vaidya

Nikunj is a DevOps Tech Leader at Amazon Web Services. He offers technical guidance to the customers on AWS DevOps solutions and services that would streamline the application development process, accelerate application delivery, and enable maintaining a high bar of software quality. Prior to AWS, Nikunj has worked in software engineering roles, leading transformation projects, driving releases and improvements in the software quality and customer experience.

Amazon EKS Anywhere – Now Generally Available to Create and Manage Kubernetes Clusters on Premises

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-eks-anywhere-now-generally-available-to-create-and-manage-kubernetes-clusters-on-premises/

At AWS re:Invent 2020, we preannounced new deployment options of Amazon Elastic Container Service (Amazon ECS) Anywhere and Amazon Elastic Kubernetes Service (Amazon EKS) Anywhere in your own data center.

Today, I am happy to announce the general availability of Amazon EKS Anywhere, a deployment option for Amazon EKS that enables you to easily create and operate Kubernetes clusters on premises using VMware vSphere. EKS Anywhere provides an installable software package for creating and operating Kubernetes clusters on premises and automation tooling for cluster lifecycle support.

EKS Anywhere brings a consistent AWS management experience to your data center, building on the strengths of Amazon EKS Distro, an open-source distribution for Kubernetes used by Amazon EKS.

EKS Anywhere is also Open Source. You can reduce the complexity of buying or building your own management tooling to create EKS Distro clusters, configure the operating environment, and update software. EKS Anywhere enables you to automate cluster management, reduce support costs, and eliminate the redundant effort of using multiple open-source or third-party tools for operating Kubernetes clusters. EKS Anywhere is fully supported by AWS. In addition, you can leverage the EKS console to view all your Kubernetes clusters, running anywhere.

We provide several deployment options for your Kubernetes cluster:

  • Hardware – Amazon EKS and EKS on Outposts: managed by AWS; EKS Anywhere and EKS Distro: managed by customer
  • Deployment types – Amazon EKS: Amazon EC2, AWS Fargate (Serverless); EKS on Outposts: EC2 on Outposts; EKS Anywhere and EKS Distro: customer infrastructure
  • Control plane management – Amazon EKS and EKS on Outposts: managed by AWS; EKS Anywhere and EKS Distro: managed by customer
  • Control plane location – Amazon EKS and EKS on Outposts: AWS Cloud; EKS Anywhere and EKS Distro: customer’s on-premises or data center
  • Cluster updates – Amazon EKS and EKS on Outposts: managed in-place update process for control plane and data plane; EKS Anywhere and EKS Distro: CLI (Flux supported rolling update for data plane, manual update for control plane)
  • Networking and security – Amazon EKS and EKS on Outposts: Amazon VPC Container Network Interface (CNI), other compatible 3rd party CNI plugins; EKS Anywhere: Cilium CNI; EKS Distro: 3rd party CNI plugins
  • Console support – Amazon EKS and EKS on Outposts: Amazon EKS console; EKS Anywhere: EKS console using EKS Connector; EKS Distro: self-service
  • Support – Amazon EKS and EKS on Outposts: AWS Support; EKS Anywhere: EKS Anywhere support subscription; EKS Distro: self-service

EKS Anywhere integrates with a variety of products from our partners to help customers take advantage of EKS Anywhere and provide additional functionality. This includes Flux for cluster updates, Flux Controller for GitOps, eksctl – a simple CLI tool for creating and managing clusters on EKS, and Cilium for networking and security.

We also provide flexibility for you to integrate with your choice of tools in other areas. To add integrations to your EKS Anywhere cluster, see this list of suggested third-party tools for your consideration.

Get Started with Amazon EKS Anywhere
To get started with EKS Anywhere, you can create a bootstrap cluster on your machine for local development and test purposes. Currently, it allows you to create clusters in a VMware vSphere environment for production workloads.

Let’s create a cluster on your desktop machine using eksctl! You can install eksctl and eksctl-anywhere with homebrew on Mac. Optionally, you can install some additional tools you may want for your EKS Anywhere clusters, such as kubectl. To learn more on Linux, see the installation guide in EKS Anywhere documentation.

$ brew install aws/tap/eks-anywhere
$ eksctl anywhere version
0.63.0

Generate a cluster config and create a cluster.

$ CLUSTER_NAME=dev-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME \
    --provider docker > $CLUSTER_NAME.yaml
$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml
[i] Performing setup and validations
[v] validation succeeded {"validation": "docker Provider setup is valid"}
[i] Creating new bootstrap cluster
[i] Installing cluster-api providers on bootstrap cluster
[i] Provider specific setup
[i] Creating new workload cluster
[i] Installing networking on workload cluster
[i] Installing cluster-api providers on workload cluster
[i] Moving cluster management from bootstrap to workload cluster
[i] Installing EKS-A custom components (CRD and controller) on workload cluster
[i] Creating EKS-A CRDs instances on workload cluster
[i] Installing AddonManager and GitOps Toolkit on workload cluster
[i] GitOps field not specified, bootstrap flux skipped
[i] Deleting bootstrap cluster
[v] Cluster created!

Once your workload cluster is created, a KUBECONFIG file is stored on your admin machine with admin permissions for the workload cluster. You’ll be able to use that file with kubectl to set up and deploy workloads.

$ export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
$ kubectl get ns
NAME                                STATUS   AGE
capd-system                         Active   21m
capi-kubeadm-bootstrap-system       Active   21m
capi-kubeadm-control-plane-system   Active   21m
capi-system                         Active   21m
capi-webhook-system                 Active   21m
cert-manager                        Active   22m
default                             Active   23m
eksa-system                         Active   20m
kube-node-lease                     Active   23m
kube-public                         Active   23m
kube-system                         Active   23m

You can create a simple test application for you to verify your cluster is working properly. Deploy and see a new pod running in your cluster, and forward the deployment port to your local machine with the following commands:

$ kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
$ kubectl get pods -l app=hello-eks-a
NAME                                     READY   STATUS    RESTARTS   AGE
hello-eks-a-745bfcd586-6zx6b   1/1     Running   0          22m
$ kubectl port-forward deploy/hello-eks-a 8000:80
$ curl localhost:8000
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢

Thank you for using

███████╗██╗  ██╗███████╗
██╔════╝██║ ██╔╝██╔════╝
█████╗  █████╔╝ ███████╗
██╔══╝  ██╔═██╗ ╚════██║
███████╗██║  ██╗███████║
╚══════╝╚═╝  ╚═╝╚══════╝

 █████╗ ███╗   ██╗██╗   ██╗██╗    ██╗██╗  ██╗███████╗██████╗ ███████╗
██╔══██╗████╗  ██║╚██╗ ██╔╝██║    ██║██║  ██║██╔════╝██╔══██╗██╔════╝
███████║██╔██╗ ██║ ╚████╔╝ ██║ █╗ ██║███████║█████╗  ██████╔╝█████╗  
██╔══██║██║╚██╗██║  ╚██╔╝  ██║███╗██║██╔══██║██╔══╝  ██╔══██╗██╔══╝  
██║  ██║██║ ╚████║   ██║   ╚███╔███╔╝██║  ██║███████╗██║  ██║███████╗
╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝    ╚══╝╚══╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚══════╝

You have successfully deployed the hello-eks-a pod hello-eks-a-c5b9bc9d8-qp6bg

For more information check out
https://anywhere.eks.amazonaws.com

⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢

EKS Anywhere also supports VMware vSphere version 7.0 or higher for production clusters. To create a production cluster, see the requirements for VMware vSphere deployment and follow Create production cluster in EKS Anywhere documentation. It’s almost the same process as creating a test cluster on your machine.

A production-grade EKS Anywhere cluster should include at least three control plane nodes and three worker nodes on the vSphere for high availability and rolling upgrades. See the Cluster management in EKS Anywhere documentation for more information on common operational tasks like scaling, updating, and deleting the cluster.
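
As a rough sketch, the production flow mirrors the Docker-based example above, with vSphere as the provider (the cluster name is a placeholder, and the generated configuration file must be filled in with your vSphere datacenter, network, credential, and node count details before you create the cluster):

$ CLUSTER_NAME=prod-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME \
    --provider vsphere > $CLUSTER_NAME.yaml
# Edit $CLUSTER_NAME.yaml with your vSphere details, then create the cluster
$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml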

EKS Connector – Public Preview
EKS Connector is a new capability that allows you to connect any Kubernetes clusters to the EKS console. You can connect any Kubernetes cluster, including self-managed clusters on EC2, EKS Anywhere clusters running on premises, and other Kubernetes clusters running outside of AWS to the EKS console. It makes it easy for you to view all connected clusters centrally.

To connect your EKS Anywhere cluster, visit the Clusters section in EKS console and select Register in the Add cluster drop-down menu.

Define a name for your cluster and select the Provider (if you don’t find an appropriate provider, select Other).

After registering the cluster, you will be redirected to the Cluster Overview page. Select Download YAML file to get the Kubernetes configuration file to deploy all the necessary infrastructure to connect your cluster to EKS.

Apply the downloaded eks-connector.yaml file and the eks-connector-binding.yaml role binding file from the EKS Connector section of our documentation. EKS Connector acts as a proxy and forwards the EKS console requests to the Kubernetes API server on your cluster, so you need to associate the connector’s service account with an EKS Connector Role, which gives permission to impersonate AWS IAM entities.

$ kubectl apply -f eks-connector.yaml
$ kubectl apply -f eks-connector-binding.yaml

After completing the registration, the cluster should be in the ACTIVE state.

$ aws eks describe-cluster --name "my-first-registered-cluster" --region ${AWS_REGION}

Here is the expected output:

{
    "cluster": {
    "name": "my-first-registered-cluster",
    "arn": "arn:aws:eks:{EKS-REGION}:{ACCOUNT-ID}:cluster/my-first-registered-cluster", 
    "createdAt": 1627672425.765,
    "connectorConfig": {
    "activationId": "xxxxxxxxACTIVATION_IDxxxxxxxx", 
    "activationExpiry": 1627676019.0,
    "provider": "OTHER",
     "roleArn": "arn:aws:iam::{ACCOUNT-ID}:role/eks-connector-agent"
    },
  "status": "ACTIVE", "tags": {}
  } 
}

EKS Connector is now in public preview in all AWS Regions where Amazon EKS is available. Please choose a region that’s closest to your cluster location to minimize latency. To learn more, visit EKS Connector in the Amazon EKS User Guide.

Things to Know
Here are a couple of things to keep in mind about EKS Anywhere:

Connectivity: There are three connectivity options: fully connected, partially disconnected, and fully disconnected. For fully connected and partially disconnected connectivity, you can connect your EKS Anywhere clusters to the EKS console via the EKS Connector and see the cluster configuration and workload status. You can leverage AWS services through AWS Controllers for Kubernetes (ACK). You can connect EKS Anywhere infrastructure resources using AWS System Manager Agents and view them using the SSM console.

Security Model: AWS follows the Shared Responsibility Model, where AWS is responsible for the security of the cloud, while the customer is responsible for security in the cloud. However, EKS Anywhere is an open-source tool, and the distribution of responsibility differs from that of a managed cloud service like Amazon EKS. AWS is responsible for building and delivering a secure tool. This tool will provision an initially secure Kubernetes cluster. To learn more, see Security Best Practices in EKS Anywhere documentation.

AWS Support: AWS Enterprise Support is a prerequisite for purchasing an Amazon EKS Anywhere Support subscription. If you would like business support for your EKS Anywhere clusters, please contact your Technical Account Manager (TAM) for details. Also, EKS Anywhere is supported by the open-source community. If you have a problem, open an issue and someone will get back to you as soon as possible.

Available Now
Amazon EKS Anywhere is now available, letting you leverage EKS features on your on-premises infrastructure and accelerate adoption with partner integrations, managed add-ons, and curated open-source tools.

To learn more with a live demo and Q&A, join us for Containers from the Couch on September 13. You can see full demos to create a cluster and show admin workflows for scaling, upgrading the cluster version, and GitOps management.

Please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon EKS or on the container roadmap on Github.

Channy

Run and debug Apache Spark applications on AWS with Amazon EMR on Amazon EKS

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/run-and-debug-apache-spark-applications-on-aws-with-amazon-emr-on-amazon-eks/

Customers today want to focus more on their core business model and less on the underlying infrastructure and operational burden. As customers migrate to the AWS Cloud, they’re realizing the benefits of being able to innovate faster on their own applications by relying on AWS to handle big data platforms, operations, and automation.

Many of AWS’s customers have migrated their big data workloads from on premises to Amazon Elastic Compute Cloud (Amazon EC2) and Amazon EMR, and process large amounts of data to get insights from it in a secure and cost-effective manner.

If you’re using open-source Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS) clusters to run your big data workloads, you may want to use Amazon EMR to eliminate the heavy lifting of installing and managing your frameworks and integrations with other AWS services.

In this post, we discuss how to run and debug Apache Spark applications with Amazon EMR on Amazon EKS.

Benefits of using Amazon EMR on EKS

Amazon EMR on EKS is primarily beneficial for two key audiences:

  • Users that are self-managing open-source applications on Amazon EKS – You can get the benefits of Amazon EMR by having the ability to use the latest fully managed versions of open-source big data analytics frameworks and optimized EMR runtime for Spark with two times faster performance than open-source Apache Spark. You can take advantage of the integrated developer experience for data scientists and developers with Amazon EMR Studio, and a fully managed persistent application user interface (Spark History Server) for simplified logging, monitoring, and debugging. Amazon EMR also provides native integrations with AWS services including Amazon CloudWatch, Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).
  • Existing Amazon EMR users – You can use Amazon EMR on EKS to improve resource utilization by simplifying infrastructure management and consolidating your Amazon EMR applications to run alongside other container-based applications on shared Amazon EKS clusters. You can also centrally manage infrastructure using familiar Kubernetes tools. Additionally, this provides the advantage of running different versions and configurations of the same runtime on a single Amazon EKS cluster with separation of compute, which is no longer tied to a specific analytics framework, version, or configuration.

With Amazon EMR on EKS, you can now let your teams focus on developing big data applications on Spark as rapidly as possible in a highly reliable, available, secure, and cost-efficient manner.

The following diagram shows a high-level representation of Amazon EMR on EKS. The architecture loosely couples applications to the infrastructure that they run on. When you submit a job to Amazon EMR, your job definition contains all of its application-specific parameters. Amazon EMR uses these parameters to instruct Amazon EKS about which pods and containers to deploy. Amazon EKS then brings online the computing resources from Amazon EC2 and AWS Fargate required to run the job. With this loose coupling of services, you can run multiple, securely isolated jobs simultaneously.

Solution overview

In this post, we guide you through a step-by-step process of deploying an Amazon EMR on EKS cluster and then walk you through various options and techniques for troubleshooting your Apache Spark jobs.

We then show you how to run a Spark application on that cluster using NOAA Global Historical Climatology Network Daily (GHCN-D). This job reads weather data, joins it with weather station data, and produces an output dataset in Apache Parquet format that contains the details of precipitation readings for the US for 2011.

We also look at various options to monitor the Spark jobs and view the logs.

The following diagram illustrates our high-level architecture.

The solution contains the following deployment steps:

  1. Install and configure the prerequisites, including the AWS Command Line Interface (AWS CLI), kubectl, and eksctl.
  2. Provision the Amazon EKS cluster using an AWS CloudFormation stack.
  3. Configure the AWS CLI tools and create credentials and permissions.
  4. Provision compute and set up an EMR virtual cluster on Amazon EKS.
  5. Create the Amazon EMR Spark application.
  6. Run the Spark application.

Prerequisites

Before you get started, complete the following prerequisites:

  1. Install the AWS CLI v2.
  2. Install kubectl.
  3. Install eksctl.

Provision the Amazon EKS cluster using AWS CloudFormation

This post uses two CloudFormation stacks. You can download the CloudFormation templates we reference in this post from a public S3 bucket, or you can launch them directly from this post. AWS Identity and Access Management (IAM) roles are also provisioned as part of this step. For more information about the IAM permissions required to provision and manage an Amazon EKS cluster, see Using service-linked roles for Amazon EKS.

The CloudFormation template eks_cluster.yaml creates the following resources in your preferred AWS account and Region:

  • Network resources (one VPC, three public and three private subnets, and two security groups)
  • One S3 bucket required to store data and artifacts to run the Spark job
  • An Amazon EKS cluster with managed node groups with m5.2xlarge EC2 instances (configurable in the provided CloudFormation template)

For instructions on creating a stack, see Creating a stack on the AWS CloudFormation console.

  1. Choose Launch Stack to launch the stack via the AWS CloudFormation console.

The default parameter values are already populated for your convenience. Proceed with CloudFormation stack creation after verifying these values.

  1. For Stack name, enter emr-on-eks-stack.
  2. For ClusterName, enter eks-cluster.
  3. For EKSVersion, enter 1.19.
  4. For JobexecutionRoleName, enter eksjobexecutionrole.

CloudFormation stack creation should take 10–15 minutes. Make sure the stack is complete by verifying the status as CREATE_COMPLETE.

Additionally, you can verify that the Amazon EKS cluster was created using the following command, which displays the details of the cluster and shows the status as ACTIVE:

aws eks describe-cluster --name eks-cluster

Note the S3 bucket name (oS3BucketName) and the job execution role (rJobExecutionServiceRole) from the stack.

  1. We upload our artifacts (PySpark script) and data into the S3 bucket.

Configure the AWS CLI tools and create credentials and permissions

To configure the AWS CLI tools, credentials, and permissions, complete the following steps:

  1. Configure kubectl to use the Amazon EKS cluster (the kubectl and eksctl commands need to run with the same AWS profile used when deploying the CloudFormation templates):
    aws eks --region <<Your AWS Region>> update-kubeconfig --name eks-cluster

  2. Create a dedicated namespace for running Apache Spark jobs using Amazon EMR on EKS:
    kubectl create namespace emroneks

  3. To enable Amazon EMR on EKS to access the namespace we created, we have to create a Kubernetes role and Kubernetes user, and map the Kubernetes user to the Amazon EMR on EKS linked role:
    eksctl create iamidentitymapping --cluster eks-cluster --namespace emroneks --service-name "emr-containers"

To use IAM roles for service accounts, an IAM OIDC provider must exist for your cluster.

  4. Create an IAM OIDC identity provider for the Amazon EKS cluster:
eksctl utils associate-iam-oidc-provider --cluster eks-cluster --approve

When you use IAM roles for service accounts to run jobs on a Kubernetes namespace, an administrator must create a trust relationship between the job execution role and the identity of the Amazon EMR managed service account.

  5. The following command updates the trust relationship of the job execution role (refer to the preceding screenshot of the CloudFormation stack):
aws emr-containers update-role-trust-policy \
  --cluster-name eks-cluster \
  --namespace emroneks \
  --role-name eksjobexecutionrole

Provision compute and set up an EMR virtual cluster on Amazon EKS

For the minimum IAM permissions required to manage and submit jobs on the Amazon EMR on EKS cluster, see Grant users access to Amazon EMR on EKS. The roles are provisioned as part of this step.

Use the second CloudFormation template (emr_virtual_cluster.yaml) to create the following resources in the same preferred AWS account and Region:

  • Amazon EMR virtual cluster
  • Amazon EKS managed node groups
  1. Choose Launch Stack to launch the stack via the AWS CloudFormation console.

The default values are already populated for your convenience. Proceed with stack creation after verifying these values.

  1. For Stack name, enter EMRvirtualcluster.
  2. For ClusterStackName, enter emr-on-eks-stack.
  3. For Namespace, enter emroneks.
  4. For NodeAutoscalinggroupDesiredCapacity, enter 1.
  5. For NodeAutoScalingGroupMaxSize, enter 1.
  6. For NodeInstanceType, enter m5.2xlarge.

Stack creation should take 10–15 minutes. Make sure the stack is complete by verifying the status as CREATE_COMPLETE.

Note the oEMRvirtualclusterID value as the output of the stack. We use this virtualclusterID to submit our Spark application.
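
If you prefer the AWS CLI to the console, a query like the following sketch should return that output value (assuming you kept the stack name EMRvirtualcluster from the previous step):

aws cloudformation describe-stacks --stack-name EMRvirtualcluster \
  --query "Stacks[0].Outputs[?OutputKey=='oEMRvirtualclusterID'].OutputValue" --output text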

Additionally, you can verify that the node groups are set up correctly using the following commands:

aws eks list-nodegroups --cluster-name eks-cluster

You receive the following result:

{
    "nodegroups": [
        "emr-virtual-cluster-NodeGroup"
    ]
}

You can verify the details of the nodes with the following command (use the node group name from the preceding command):

aws eks describe-nodegroup --cluster-name eks-cluster --nodegroup-name emr-virtual-cluster-NodeGroup

This lists the details of all the nodes provisioned, the instance type, and subnet associations, among other details.

You’re now ready to create and run a Spark application on the cluster.

Create an Amazon EMR Spark application

To create the PySpark job, perform the following steps:

  1. Copy the NOAA Open data registry 2011 Weather Station data and the Weather Station Lookup data and save the files under the s3://<<Your S3 Bucket>>/noaa/csv.gz/ prefix.
    1. To copy the 2011 Weather Station data, use the following AWS CLI command:
      aws s3 cp s3://noaa-ghcn-pds/csv.gz/2011.csv.gz s3://<<Your S3 Bucket>>/noaa/csv.gz/2011.csv.gz

    2. To copy the Weather Station Lookup data, use the following AWS CLI command:
      aws s3 cp s3://noaa-ghcn-pds/ghcnd-stations.txt s3://<<Your S3 Bucket>>/noaa/ghcnd-stations.txt

You can find the value for <<Your S3 Bucket>> in the oS3Bucketname key on the Outputs tab for the emr-on-eks-stack CloudFormation stack.

  2. Download the PySpark script and upload it under s3://<<Your S3 Bucket>>/scripts/.

Run the Spark application

We run the Spark job using the AWS CLI. The parameters for the job (virtual cluster ID, script location, parameters) are mentioned in the JSON file.

  1. Save the following JSON template as jobparameters.json in a local folder (for example, /path/to/jobparameters.json):
{
  "name": "emr-on-eks-spark-job",
  "virtualClusterId": "<virtualclusterid>",
  "executionRoleArn": "arn:aws:iam::<<Your AWS Account Number>>:role/eksjobexecutionrole",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<<Your S3 Bucket>>/scripts/etl.py",
          "entryPointArguments": ["s3://<<Your S3 Bucket>>/"],
       "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
"spark.scheduler.minRegisteredResourcesRatio": "0.8",
          "spark.scheduler.maxRegisteredResourcesWaitingTime": "300s" }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs",
        "logStreamNamePrefix": "emreksblog"
      },
      "s3MonitoringConfiguration": {
        "logUri": "s3://<<Your S3 Bucket>>/joblogs"
      }
    }
  }
}

The configurationOverrides section is optional and can be used to backport any Spark configurations that are set for jobs running in Amazon EMR on EC2. The Spark job runs successfully without any additional configuration changes.

  2. Modify the following keys in your JSON file (/path/to/jobparameters.json):
    1. virtualClusterId – The ID of the EMR cluster on Amazon EKS. You can get this by looking at the oEMRvirtualclusterID output from the CloudFormation template or by running the following code:
      aws emr-containers list-virtual-clusters \
      --state RUNNING \
      --query 'virtualClusters[?containerProvider.info.eksInfo.namespace==`emroneks`]'

    2. executionRoleArn – The ARN of the role created in the CloudFormation template. Replace <<Your AWS Account Number>> with the AWS account number you deploy this stack in.
    3. entryPoint – The value of the path to the ETL script S3 bucket provisioned in the CloudFormation stack (for example, s3://<<Your S3 Bucket>>/scripts/etl.py).
    4. entryPointArguments – The Spark job accepts one argument—the S3 bucket name where the data files are stored (s3://<<Your S3 Bucket>>/).
    5. logUri – The path where the controller logs, Spark driver logs, and executor logs are written. Enter it as s3://<<Your S3 Bucket>>/joblogs.
    6. cloudWatchMonitoringConfiguration – The CloudWatch log group details where logs are published. Enter the value for logGroupName as /emr-containers/jobs and logStreamNamePrefix as emreksblog.

You can change the sparkSubmitParameters parameter in the preceding JSON as per your needs, but your node groups must have the right capacity to accommodate the combination of Spark executors, memory, and cores that you define in sparkSubmitParameters. The preceding configuration works for the cluster we provisioned through the CloudFormation template without any changes.

  3. Submit the job with the following AWS CLI command:
    aws emr-containers start-job-run --cli-input-json file://path/to/jobparameters.json

This returns a response with the job ID, which we can use to track the status of the job:

{
    "id": "00000002ucgkgj546u1",
    "name": "emr-on-eks-spark-job",
    "arn": "arn:aws:emr-containers:region:accountID:/virtualclusters/mdeljhfj0ejq5iprtzchljuh1/jobruns/00000002ucgkgj546u1",
    "virtualClusterId": "mdeljhfj0ejq5iprtzchljuh1"
}

You can get the status of a job by running the following command:

aws emr-containers describe-job-run --id <your job run id>   --virtual-cluster-id <<your virtualcluster id>> 

You can observe the status change from SUBMITTED to RUNNING to COMPLETED or FAILED.

{
    "jobRun": {
        "id": "00000002ucgkstiadcs",
        "name": "emr-on-eks-spark-job",
        "virtualClusterId": "mdeljhfj0ejq5iprtzchljuh1",
        "arn": "arn:aws:emr-containers:<<region>>:<<Your AWS Account Number>>:/virtualclusters/mdeljhfj0ejq5iprtzchljuh1/jobruns/00000002ucgkstiadcs",
        "state": "SUBMITTED",
        "clientToken": "52203410-3e55-4294-a548-dc9212d10b37",


{
    "jobRun": {
        "id": "00000002ucgkstiadcs",
        "name": "emr-on-eks-spark-job",
        "virtualClusterId": "mdeljhfj0ejq5iprtzchljuh1",
        "arn": "arn:aws:emr-containers:<<region>>: :<<Your AWS Account Number>>:/virtualclusters/mdeljhfj0ejq5iprtzchljuh1/jobruns/00000002ucgkstiadcs",
        "state": "RUNNING",
        "clientToken": "52203410-3e55-4294-a548-dc9212d10b37",

{
    "jobRun": {
        "id": "00000002ucgkstiadcs",
        "name": "emr-on-eks-spark-job",
        "virtualClusterId": "mdeljhfj0ejq5iprtzchljuh1",
        "arn": "arn:aws:emr-containers:<<region>>: :<<Your AWS Account Number>>:/virtualclusters/mdeljhfj0ejq5iprtzchljuh1/jobruns/00000002ucgkstiadcs",
        "state": "COMPLETED",
        "clientToken": "52203410-3e55-4294-a548-dc9212d10b37",

When the job state changes to COMPLETED, you can see a prefix in your S3 bucket called noaaparquet with a dataset created within the prefix.

If the job status reaches the FAILED state, you can troubleshoot by going through the details found in the CloudWatch logs or the logs written into Amazon S3. For details on how to access and use those logs, refer to the following debugging section.

Occasionally, you may notice that the job is stuck in SUBMITTED status for a long time. This could be due to the fact that the Amazon EKS cluster is running other jobs and doesn’t have available capacity. When the existing job is complete, your job changes to the RUNNING state.

Another scenario could be that you set the driver and executor memory requirements in your Spark configuration (jobparameters.json) to more than what is available. Consider adjusting the spark.executor.memory and spark.driver.memory values based on the instance type in your node group. See the following code:

{
  "name": "emr-on-eks-spark-job",
  "virtualClusterId": "<virtualclusterid>",
  "executionRoleArn": "arn:aws:iam::<<Your AWS Account Number>>:role/eksjobexecutionrole",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<<Your S3 Bucket>>/scripts/etl.py",
          "entryPointArguments": ["s3://<<Your S3 Bucket>>/"],
       "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.driver.memory=2G --conf spark.executor.cores=4"

If your job is stuck or failing due to insufficient capacity, consider increasing the number of nodes in your node group or setting up the Amazon EKS Cluster Autoscaler. Refer to the debugging section for additional details from the Kubernetes dashboard.
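
For a quick manual scale-up of the managed node group created in this walkthrough, a sketch like the following can be used (the node group name matches the one listed earlier; note that changing the scaling configuration outside of CloudFormation causes the stack to drift from its template):

aws eks update-nodegroup-config --cluster-name eks-cluster \
  --nodegroup-name emr-virtual-cluster-NodeGroup \
  --scaling-config minSize=1,maxSize=3,desiredSize=2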

For additional information on Amazon EMR on EKS fundamentals, refer to the appendix at the end of this post.

Debug your Spark application

Amazon EMR on EKS provides multiple options to debug and view the logs of the Spark application.

For issues specific to Spark applications, use Spark History Server, CloudWatch logs, or logs on Amazon S3, which we discuss in this section.

For troubleshooting issues such as jobs that aren’t starting (job status stuck in the SUBMITTED state) or issues with Spark drivers, start with the Kubernetes dashboard or kubectl CLI commands, discussed in detail in this section.

Spark History Server

Spark History Server provides an elaborate web UI that allows us to inspect various components of our applications. It offers details on memory usage, jobs, stages, and tasks, as well as event timelines, logs, and various metrics and statistics both at the Spark driver level and for individual executors. It shows collected metrics and the state of the program, revealing clues about possible performance bottlenecks that you can utilize for tuning and optimizing the application. You can look at the Spark History Server (in the Spark UI) from the Amazon EMR console to see the driver and executor logs, as long as you have Amazon S3 logging enabled (which we enabled as part of the job submission JSON payload). The Spark UI is available even after the job is complete and the cluster is stopped. For more information on troubleshooting, see How do I troubleshoot a failed Spark step in Amazon EMR?

The following screenshots show the Spark UI of the job submitted on the cluster.

Choose a specific app ID to see the details of the Spark SQL and stages that ran. This helps you see the explain plan of the query and rows processed by each stage to narrow down any bottlenecks in your process.

If you don’t see the Spark UI link enabled or you see an error message “Unable to launch application UI,” verify the parameter s3MonitoringConfiguration in the jobparameters.json to ensure that a valid S3 path is provided. Additionally, ensure that the job execution role has appropriate permissions to access the S3 bucket. This was defined in the CloudFormation template that you deployed earlier. See the following code:

"monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs",
        "logStreamNamePrefix": "emreksblog"
      },
      "s3MonitoringConfiguration": {
        "logUri": "s3://<<Your S3 Bucket>>/joblogs"
      }
    }

To increase the logging level of the Spark application to DEBUG, update the spark-log4j configuration. For instructions, see Change Log level for Spark application on EMR on EKS.

CloudWatch logs

In the preceding jobparameters.json, the log group name was /emr-containers/jobs and the prefix was emreksblog. You can access the logs for this prefix via the CloudWatch console.

The path for various types of logs available in CloudWatch are as follows:

  • Controller logs – logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr/stdout)
  • Driver logs – logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr/stdout)
  • Executor logs – logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr/stdout)

In the jobparameters.json configuration, logGroup is set as /emr-containers/jobs and logStreamPrefix is set as emreksblog.

Here you can retrieve the Spark driver and executor logs to view additional details and stack trace of any error messages when your Spark job has failed.

You can filter the CloudWatch log stream by driver/stdout to see the output and driver/stderr to see details of errors from your Spark job.
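
You can do the same filtering from the AWS CLI. The following is a sketch using the log group and stream prefix configured earlier (the virtual cluster ID and job ID segments are placeholders you take from your own job run, and the filter pattern is just an example):

aws logs filter-log-events --log-group-name /emr-containers/jobs \
  --log-stream-name-prefix "emreksblog/<virtual-cluster-id>/jobs/<job-id>" \
  --filter-pattern "ERROR"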

The following are some common scenarios to verify in case the logs aren’t available in CloudWatch:

  • Ensure that the log group parameter is defined in jobparameters.json under monitoringConfiguration (refer to the JSON file for the details of parameters):
        "monitoringConfiguration": {
          "cloudWatchMonitoringConfiguration": {
            "logGroupName": "/emr-containers/jobs",
            "logStreamNamePrefix": "emreksblog"
          },

  • Ensure that the service role associated with the Amazon EMR on EKS cluster has access to write to the CloudWatch log group. The CloudFormation template you deployed has the policy associated with the IAM role to grant appropriate permissions to allow access to write to the log groups. For additional IAM policy examples, see Using identity-based policies (IAM policies) for CloudWatch Logs.

Amazon S3 logs

In the configuration, the log path is listed as s3://<Your S3 bucket>/joblogs under the corresponding job ID.

You can go to the S3 bucket you specified to check the logs. Your log data is sent to the following Amazon S3 locations depending on the types of logs:

  • Controller logs – s3://<Your S3 bucket>/joblogs/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr.gz/stdout.gz)
  • Driver logs – s3://<Your S3 bucket>/joblogs/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr.gz/stdout.gz)
  • Executor logs – s3://<Your S3 bucket>/joblogs/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr.gz/stdout.gz)

Here you can retrieve the Spark driver and executor logs to view additional details and stack trace of any error messages when your Spark job has failed.

Kubernetes dashboard

You can view and monitor the online logs of a running Spark job in a Kubernetes dashboard. The dashboard provides information on the state of Kubernetes resources in your cluster and any errors that may have occurred while the job is running. The logs are only accessible through the Kubernetes dashboard while the cluster is running the job. The dashboard is a useful way to quickly identify any issue with the job while it’s running and the logs are getting written.

For details on how to deploy, set up, and view the dashboard, see Tutorial: Deploy the Kubernetes Dashboard (web UI). After you deploy the Kubernetes dashboard and launch it, complete the following steps to see the details of your job run:

  1. Choose the right namespace that was registered with the EMR virtual cluster for the Amazon EKS cluster.
  2. Choose Pods in the navigation pane to see all the running pods.
  3. Choose the additional options icon (three vertical dots) to open logs for each pod.

The following screenshot shows the Spark driver that was spawned when the Spark job was submitted to the EMR virtual cluster.

  4. Choose the spark-kubernetes-executor container log to see the running online logs of your Spark job.

The following screenshots show the running log of the Spark application while it’s running on the EMR virtual cluster.

  5. Choose Pods to see the CPU and memory consumption of the individual POD running the application.

The following screenshots show the CPU and memory usage of the Spark application for the duration of the job. This helps determine if you have provisioned adequate capacity for your jobs.

In case of insufficient capacity with memory or CPU, you see the following error. You can choose the pod to see additional details.

Kubernetes CLI

You can view Spark driver and Spark executer logs using the Kubernetes CLI (kubectl). Logs are accessible through the Kubernetes CLI while the cluster is running the job.

  1. Get the name of the Spark driver and Spark executor pods in the emroneks namespace:
kubectl get pods -n emroneks

You see multiple pods for the Spark driver and executors that are currently running.

  2. Use the pod name for the driver to see the driver logs:
    kubectl logs <Spark driver pod name> -n emroneks -c spark-kubernetes-driver

  3. Use the pod name for the executors to see the executor logs:
    kubectl logs <Spark executor pod name> -n emroneks -c spark-kubernetes-executor

For more issues and resolutions when running jobs on Amazon EMR on EKS, see Common errors when running jobs.

Clean up

When you’re done using this solution, you should delete the following CloudFormation stacks, via the CloudFormation console, to avoid incurring any further charges:

  • EMRvirtualcluster
  • emr-on-eks-stack

Conclusion

This post describes how you can run your existing Apache Spark workloads on Amazon EMR on EKS. The use case demonstrates setting up the infrastructure, and running and monitoring your Spark job. We also showed you various options and techniques to debug and troubleshoot your jobs.

Amazon EMR also provides the capability to perform data analysis and data engineering tasks in a web-based integrated development environment (IDE), using fully managed Jupyter notebooks. Refer to this post to set up EMR Studio with EMR on EKS.


Appendix: Explaining the solution

In this solution, we first built an Amazon EKS cluster using a CloudFormation template and registered it with Amazon EMR. Then we submitted a Spark job using the AWS CLI on the EMR virtual cluster on Amazon EKS. Let’s look at some of the important concepts related to running a Spark job on Amazon EMR on EKS.

Kubernetes namespaces

Amazon EKS uses Kubernetes namespaces to divide cluster resources between multiple users and applications. These namespaces are the foundation for multi-tenant environments. A Kubernetes namespace can have both Amazon EC2 and Fargate as the compute provider. Fargate selection for pods can be done using user-defined Fargate profiles. This flexibility provides different performance and cost options for the Spark jobs to run on. In this post, we provisioned an Amazon EKS cluster with node groups containing m5.2xlarge EC2 instances.

Virtual cluster

A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster, and each virtual cluster maps to one namespace on an Amazon EKS cluster.

Job run

A job run is a unit of work, such as a Spark JAR (Scala or Java application), PySpark script, or SparkSQL query, that you submit to Amazon EMR on EKS. One job can have multiple job runs. When you submit a job run, it should include the following information (a CLI sketch follows this list):

  • A virtual cluster where the job should run
  • A job name to identify the job
  • The execution role, a scoped IAM role (mapped to a Kubernetes service account) that runs the job’s pods and controls which resources the job can access
  • The Amazon EMR release label that specifies the version of Amazon EMR Spark to use
  • The artifacts to use when submitting your job, such as spark-submit parameters
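
These elements map directly to the parameters of the start-job-run call: the virtual cluster ID, the job name, the execution role ARN, the release label, and the job driver with its spark-submit parameters. The following is a minimal sketch; all values are placeholders:

aws emr-containers start-job-run \
  --virtual-cluster-id <virtual-cluster-id> \
  --name <job-name> \
  --execution-role-arn <execution-role-arn> \
  --release-label emr-6.2.0-latest \
  --job-driver '{
      "sparkSubmitJobDriver": {
          "entryPoint": "s3://<bucket>/<spark-application>.py",
          "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G"
      }
  }'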

Amazon EMR containers

An Amazon EMR container is the API name for Amazon EMR on EKS. The emr-containers prefix is used in the following scenarios:

  • In the AWS CLI commands for Amazon EMR on EKS. For example, aws emr-containers start-job-run.
  • Before IAM policy actions for Amazon EMR on EKS. For example, "Action": [ "emr-containers:StartJobRun"]. For more information, see Policy actions for Amazon EMR on EKS.
  • In Amazon EMR on EKS service endpoints. For example, emr-containers.us-east-1.amazonaws.com.
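
As an illustration of the policy actions in the second bullet, a minimal (hypothetical) IAM policy statement that allows submitting and describing job runs might look like the following sketch. In practice, scope the Resource element to your specific virtual cluster ARN rather than using a wildcard:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowEmrOnEksJobSubmission",
            "Effect": "Allow",
            "Action": [
                "emr-containers:StartJobRun",
                "emr-containers:DescribeJobRun"
            ],
            "Resource": "*"
        }
    ]
}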

In the solution overview, we went step by step through how we used the above resources to create the Amazon EMR on EKS cluster and run a Spark job. For further details on these concepts, see Concepts.


About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS, helping customers build analytics platform and solutions. He has a keen interest in distributed computing. Dipankar enjoys spending time playing chess and watching old Hollywood movies.

Ashok Padmanabhan is a big data consultant with AWS Professional Services, helping customers build big data and analytics platform and solutions. When not building and designing data lakes, Ashok enjoys spending time at beaches near his home in Florida.

Gaurav Gundal is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.

Naveen Madhire is a Big Data Architect with AWS Professional Services, helping customers create data lake solutions on AWS. Outside of work, he loves playing video games and watching crime series on TV.

Run a Spark SQL-based ETL pipeline with Amazon EMR on Amazon EKS

Post Syndicated from Melody Yang original https://aws.amazon.com/blogs/big-data/run-a-spark-sql-based-etl-pipeline-with-amazon-emr-on-amazon-eks/

Increasingly, a business’s success depends on its agility in transforming data into actionable insights, which requires efficient and automated data processes. In the previous post – Build a SQL-based ETL pipeline with Apache Spark on Amazon EKS, we described a common productivity issue in a modern data architecture. To address the challenge, we demonstrated how to utilize a declarative approach as the key enabler to improve efficiency, which resulted in a faster time to value for businesses.

Generally speaking, managing applications declaratively in Kubernetes is a widely adopted best practice. You can use the same approach to build and deploy Spark applications with open-source or in-house build frameworks to achieve the same productivity goal.

For this post, we use the open-source data processing framework Arc, which abstracts away the underlying Apache Spark technology, to transform a regular data pipeline into an “extract, transform, and load (ETL) as definition” job. The steps in the data pipeline are simply expressed in a declarative definition (JSON) file, with SQL scripts embedded as the declarative language.

The job definition in an Arc Jupyter notebook looks like the following screenshot.

This representation makes ETL much easier for a wider range of personas: analysts, data scientists, and any SQL authors who can fully express their data workflows without the need to write code in a programming language like Python.

In this post, we explore some key advantages of the latest Amazon EMR deployment option Amazon EMR on Amazon EKS to run Spark applications. We also explain its major difference from the commonly used Spark resource manager YARN, and demonstrate how to schedule a declarative ETL job with EMR on EKS. Building and testing the job on a custom Arc Jupyter kernel is out of scope for this post. You can find more tutorials on the Arc website.

Why choose Amazon EMR on Amazon EKS?

The following are some of the benefits of EMR on EKS:

  • Simplified architecture by unifying workloads – EMR on EKS enables us to run Apache Spark workloads on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning dedicated EMR clusters. If you have an existing Amazon EKS landscape in your organization, it makes sense to unify analytical workloads with other Kubernetes-based applications on the same Amazon EKS cluster. It improves resource utilization and significantly simplifies your infrastructure management.
  • More resources to share with a smaller JVM footprint – A major difference in this deployment option is the resource manager shift from YARN to Kubernetes and from a Hadoop cluster manager to a generic containerized application orchestrator. As shown in the following diagram, each Spark executor runs as a YARN container (compute and memory unit) in Hadoop. Broadly, YARN creates a JVM in each container requested by Hadoop applications, such as Apache Hive. When you run Spark on Kubernetes, it keeps your JVM footprint minimal, so that the Amazon EKS cluster can accommodate more applications, resulting in more spare resources for your analytical workloads.

  • Efficient resource sharing and cost savings – With the YARN cluster manager, if you want to reuse the same EMR cluster for concurrent Spark jobs to reduce cost, you have to compromise on resource isolation. Additionally, you have to pay for compute resources that aren’t fully utilized, such as a master node, because only Amazon EMR can use these unused compute resources. With EMR on EKS, you can enjoy the optimized resource allocation feature by sharing them across all your applications, which reduces cost.
  • Faster EMR runtime for Apache Spark – One of the key benefits of running Spark with EMR on EKS is the faster EMR runtime for Apache Spark. The runtime is a performance-optimized environment, which is available and turned on by default on Amazon EMR release 5.28.0 and later. In our performance tests using TPC-DS benchmark queries at 3 TB scale, we found EMR runtime for Apache Spark 3.0 provides a 1.7 times performance improvement on average, and up to 8 times improved performance for individual queries over open-source Apache Spark 3.0.0. It means you can run your Apache Spark applications faster and cheaper without requiring any changes to your applications.
  • Minimum setup to support multi-tenancy – While taking advantage of Spark’s Dynamic Resource Allocation, auto scaling in Amazon EKS, and high availability across multiple Availability Zones, you can isolate your workloads for multi-tenancy use cases with minimum configuration required. Additionally, without any infrastructure setup, you can use an Amazon EKS cluster to run a single application that requires different Apache Spark versions and configurations, for example, for development vs. test environments.

Cost effectiveness

EMR on EKS pricing is calculated based on the vCPU and memory resources used from the time you start to download your Amazon EMR application image until the Spark pod on Amazon EKS stops, rounded up to the nearest second. The following screenshot is an example of the cost in the us-east-1 Region.

The Amazon EMR uplift price is in addition to the Amazon EKS pricing and any other services used by Amazon EKS, such as EC2 instances and EBS volumes. You pay $0.10 per hour for each Amazon EKS cluster that you use. However, you can use a single Amazon EKS cluster to run multiple applications by taking advantage of Kubernetes namespaces and AWS Identity and Access Management (IAM) security policies.

While the application runs, your resources are allocated and removed automatically by the Amazon EKS auto scaling feature, in order to eliminate over-provisioning or under-utilization of these resources. It enables you to lower costs because you only pay for the resources you use.

To further reduce the running cost for jobs that aren’t time-critical, you can schedule Spark executors on Spot Instances to save up to 90% over On-Demand prices. To maintain the resiliency of your Spark cluster, run the driver on a reliable On-Demand Instance, because the driver is responsible for requesting new executors to replace failed ones when an unexpected event happens.

Kubernetes comes with a YAML specification called a pod template that can help you to assign Spark driver and executor pods to Spot or On-Demand EC2 instances. You define nodeSelector rules in pod templates, then upload to Amazon Simple Storage Service (Amazon S3). Finally, at the job submission, specify the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile to point to the pod templates in Amazon S3.

For example, the following is the code for executor_pod_template.yaml:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT

The following is the code for driver_pod_template.yaml:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND

The following is the code for the Spark job submission:

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-pod-template \
--execution-role-arn ${EMR_EKS_ROLE_ARN} \
--release-label emr-5.33.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://'${s3DemoBucket}'/someAppCode.py",
        "sparkSubmitParameters": "--conf spark.kubernetes.driver.podTemplateFile=\"s3://'${s3DemoBucket}'/pod_templates/driver_pod_template.yaml\" --conf spark.kubernetes.executor.podTemplateFile=\"s3://'${s3DemoBucket}'/pod_templates/executor_pod_template.yaml\" --conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2"
    }
}'

Beginning with Amazon EMR versions 5.33.0 or 6.3.0, Amazon EMR on EKS supports the Amazon S3-based pod template feature. If you’re using an unsupported Amazon EMR version, such as EMR 6.1.0, you can use the pod template feature without Amazon S3 support. Make sure your Spark version is 3.0 or later, and copy the template files into your custom Docker image. The job submit script is changed to the following:

"--conf spark.kubernetes.driver.podTemplateFile=/local/path/to/driver_pod_template.yaml" 
"--conf spark.kubernetes.executor.podTemplateFile=/local/path/to/executor_pod_template.yaml"

Serverless compute option: AWS Fargate

The sample solution runs on an Amazon EKS cluster with AWS Fargate. Fargate is a serverless compute engine for Amazon EKS and Amazon ECS. It makes it easy for you to focus on building applications because it removes the need to provision and manage EC2 instances or managed node groups in EKS. Fargate runs each task or pod in its own kernel, providing its own isolated compute environment. This enables your application to have resource isolation and enhanced security by design.

With Fargate, you don’t need to be an expert in Kubernetes operations. It automatically allocates the right amount of compute, eliminating the need to choose instances and scale cluster capacity, so the Kubernetes Cluster Autoscaler is no longer required to increase your cluster’s compute capacity.

In our instance, each Spark executor or driver is provisioned by a separate Fargate pod, to form a Spark cluster dedicated to an ETL pipeline. You only need to specify and pay for resources required to run the application—no more concerns about complex cluster management, queues, and isolation trade-offs.

Other Amazon EC2 options

In addition to the serverless choice, EMR on EKS can operate on all types of EKS clusters, for example, Amazon EKS managed node groups with Amazon Elastic Compute Cloud (Amazon EC2) On-Demand and Spot Instances.

Previously, we mentioned placing a Spark driver pod on an On-Demand Instance to reduce interruption risk. To further improve your cluster stability, it’s important to understand the Kubernetes high availability and restart policy features. These allow for exciting new possibilities, not only in computational multi-tenancy, but also in the ability of application self-recovery, for example relaunching a Spark driver on Spot or On-Demand instances. For more information and an example, see the GitHub repo.

As of this writing, our test result shows that a Spark driver can’t restart on failure with the EMR on EKS deployment type. So be mindful of this when designing a Spark application that needs minimum downtime, especially for a time-critical job. We recommend the following:

  • Diversify your Spot requests – Similar to Amazon EMR’s instance fleet, EMR on EKS allows you to deploy a single application across multiple instance types to further enhance availability. With Amazon EC2 Spot best practices, such as capacity rebalancing, you can diversify the Spot request across multiple instance types within each Availability Zone. It limits the impact of Spot interruptions on your workload, if a Spot Instance is reclaimed by Amazon EC2. For more details, see Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances.
  • Increase resilience – Repeatedly restarting a Spark application compromises your application performance or the length of your jobs, especially for time-sensitive data processes. We recommend the following best practice to increase your application resilience:
    • Ensure your job is stateless so that it can self-recover without waiting for a dependency.
    • If a checkpoint is required, for example in a stateful Spark streaming ETL, make sure your checkpoint storage is decoupled from the Amazon EKS compute resources, which can be detached and attached to your Kubernetes cluster via persistent volume claims (PVCs), or simply use Amazon S3 (see the sketch after this list).
    • Run the Spark driver on the On-Demand Instance defined by a pod template. It adds an extra layer of resiliency to your Spark application with EMR on EKS.
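
For the checkpoint recommendation above, decoupling a Structured Streaming checkpoint from cluster storage can be as simple as pointing it at an S3 path in your job’s sparkSubmitParameters. The following is a minimal sketch with a placeholder bucket name:

--conf spark.sql.streaming.checkpointLocation=s3://<checkpoint-bucket>/checkpoints/<job-name>/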

Security

EMR on EKS inherits the fine-grained security feature IAM roles for service accounts (IRSA) offered by Amazon EKS. This means your data access control is no longer at the compute instance level; instead, it’s granular at the container or pod level and controlled by an IAM role. The token-based authentication approach allows us to use one of the AWS default credential providers, WebIdentityTokenCredentialsProvider, to exchange the Kubernetes-issued token for IAM role credentials. This makes sure our applications deployed with EMR on EKS can communicate with other AWS services over a secure, private channel, without the need to store a long-lived AWS credential pair as a Kubernetes secret.

To learn more about the implementation details, see the GitHub repo.

Solution overview

In this example, we introduce a quality-aware design with the open-source declarative data processing Arc framework to abstract Spark technology away from you. It makes it easy for you to focus on business outcomes, not technologies.

We walk through the steps to run a predefined Arc ETL job with the EMR on EKS approach. For more information, see the GitHub repo.

The sample solution launches a serverless Amazon EKS cluster, loads TLC green taxi trip records from a public S3 bucket, applies dataset schema, aggregates the data, and outputs to an S3 bucket as Parquet file format. The sample Spark application is defined as a Jupyter notebook green_taxi_load.ipynb powered by Arc, and the metadata information is defined in green_taxi_schema.json.

The following diagram illustrates our solution architecture.

Launch Amazon EKS

Provisioning takes approximately 20 minutes.

To get started, open AWS CloudShell in the us-east-1 Region. Run the following command to provision the new cluster eks-cluster, backed by Fargate. Then build the EMR virtual cluster emr-on-eks-cluster:

curl https://raw.githubusercontent.com/aws-samples/sql-based-etl-on-amazon-eks/main/emr-on-eks/provision.sh | bash

At the end of the provisioning, the shell script automatically creates an EMR virtual cluster, which registers Amazon EMR with the newly created Amazon EKS cluster. The dedicated namespace on the EKS cluster is called emr.
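
The registration step performed by the script is equivalent to the following sketch of the create-virtual-cluster call. The names reflect the EKS cluster and namespace used in this walkthrough; the exact command is defined in the provisioning script in the GitHub repository:

aws emr-containers create-virtual-cluster \
  --name emr-on-eks-cluster \
  --container-provider '{
      "id": "eks-cluster",
      "type": "EKS",
      "info": { "eksInfo": { "namespace": "emr" } }
  }'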

Deploy the sample ETL job

  1. When provisioning is complete, submit the sample ETL job to EMR on EKS with a serverless virtual cluster called emr-on-eks-cluster:
curl https://raw.githubusercontent.com/aws-samples/sql-based-etl-on-amazon-eks/main/emr-on-eks/submit_arc_job.sh | bash

The script runs the job submit command for you. The declarative ETL job definition and the full job specification can be found in the blog post’s GitHub repository.

  2. Check your job progress and auto scaling status:
kubectl get pod -n emr
kubectl get node --label-columns=topology.kubernetes.io/zone
  3. Because the job requested 10 executors, it automatically scales the Spark application from 0 to 10 pods (executors) on the EKS cluster. The Spark application automatically removes itself from the EKS cluster when the job is done.

  4. Navigate to your Amazon EMR console to view application logs on the Spark History Server.

  5. You can also check the logs in CloudShell, as long as your Spark driver is running:
driver_name=$(kubectl get pod -n emr | grep "driver" | awk '{print $1}')
kubectl logs ${driver_name} -n emr -c spark-kubernetes-driver | grep 'event'

Clean up

To clean up the AWS resources you created, run the following code:

curl https://raw.githubusercontent.com/aws-samples/sql-based-etl-on-amazon-eks/main/emr-on-eks/deprovision.sh | bash

Region support

At the time of this writing, Amazon EMR on EKS is available in US East (N. Virginia), US West (Oregon), US West (N. California), US East (Ohio), Canada (Central), Europe (Ireland, Frankfurt, and London), and Asia Pacific (Mumbai, Seoul, Singapore, Sydney, and Tokyo) Regions. If you want to use EMR on EKS in a Region that isn’t available yet, check out the open-source Apache Spark on Amazon EKS alternative on aws-samples GitHub. You can deploy the sample solution to your Region as long as Amazon EKS is available. Migrating a Spark workload on Amazon EKS to the fully managed EMR on EKS is easy and straightforward, with minimum changes required. Because the self-contained Spark application remains the same, only the deployment implementation differs.

Conclusion

This post introduces Amazon EMR on Amazon EKS and provides a walkthrough of a sample solution to demonstrate the “ETL as definition” concept. A declarative data processing framework enables you to build and deploy your Spark workloads with enhanced efficiency. With EMR on EKS, running applications built upon a declarative framework maximizes data process productivity, performance, reliability, and availability at scale. This pattern abstracts Spark technology away from you, and helps you to focus on deliverables that optimize business outcomes.

The built-in optimizations provided by the managed EMR on EKS can help not only data engineers with analytical skills, but also analysts, data scientists, and any SQL authors who can fully express their data workflows declaratively in Spark SQL. You can use this architectural pattern to drive your data ownership shift in your organization, from IT to non-IT stakeholders who have a better understanding of business operations and needs.


About the Authors

Melody Yang is a Senior Analytics Specialist Solution Architect at AWS with expertise in Big Data technologies. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

 

 

Shiva Achari is a Senior Data Lab Architect at AWS. He helps AWS customers to design and build data and analytics prototypes via the AWS Data Lab engagement. He has over 14 years of experience working with enterprise customers and startups primarily in the Data and Big Data Analytics space.

 

 

Daniel Maldonado is an AWS Solutions Architect, specialist in Microsoft workloads and Big Data technologies, focused on helping customers to migrate their applications and data to AWS. Daniel has over 12 years of experience working with Information Technologies and enjoys helping clients reap the benefits of running their workloads in the cloud.

 

 

Igor Izotov is an AWS enterprise solutions architect, and he works closely with Australia’s largest financial services organizations. Prior to AWS, Igor held solution architecture and engineering roles with tier-1 consultancies and software vendors. Igor is passionate about all things data and modern software engineering. Outside of work, he enjoys writing and performing music, a good audiobook, or a jog, often combining the latter two.

Access token security for microservice APIs on Amazon EKS

Post Syndicated from Timothy James Power original https://aws.amazon.com/blogs/security/access-token-security-for-microservice-apis-on-amazon-eks/

In this blog post, I demonstrate how to implement service-to-service authorization using OAuth 2.0 access tokens for microservice APIs hosted on Amazon Elastic Kubernetes Service (Amazon EKS). A common use case for OAuth 2.0 access tokens is to facilitate user authorization to a public facing application. Access tokens can also be used to identify and authorize programmatic access to services with a system identity instead of a user identity. In service-to-service authorization, OAuth 2.0 access tokens can be used to help protect your microservice API for the entire development lifecycle and for every application layer. AWS Well Architected recommends that you validate security at all layers, and by incorporating access tokens validated by the microservice, you can minimize the potential impact if your application gateway allows unintended access. The solution sample application in this post includes access token security at the outset. Access tokens are validated in unit tests, local deployment, and remote cluster deployment on Amazon EKS. Amazon Cognito is used as the OAuth 2.0 token issuer.

Benefits of using access token security with microservice APIs

Some of the reasons you should consider using access token security with microservices include the following:

  • Access tokens provide production grade security for microservices in non-production environments, and are designed to ensure consistent authentication and authorization and protect the application developer from changes to security controls at a cluster level.
  • They enable service-to-service applications to identify the caller and their permissions.
  • Access tokens are short-lived credentials that expire, which makes them preferable to traditional API gateway long-lived API keys.
  • You get better system integration with a web or mobile interface, or application gateway, when you include token validation in the microservice at the outset.

Overview of solution

In the solution described in this post, the sample microservice API is deployed to Amazon EKS, with an Application Load Balancer (ALB) for incoming traffic. Figure 1 shows the application architecture on Amazon Web Services (AWS).

Figure 1: Application architecture

The application client shown in Figure 1 represents a service-to-service workflow on Amazon EKS, and shows the following three steps:

  1. The application client requests an access token from the Amazon Cognito user pool token endpoint.
  2. The access token is forwarded to the ALB endpoint over HTTPS when requesting the microservice API, in the bearer token authorization header. The ALB is configured to use IP Classless Inter-Domain Routing (CIDR) range filtering.
  3. The microservice deployed to Amazon EKS validates the access token using JSON Web Key Sets (JWKS), and enforces the authorization claims.

Walkthrough

The walkthrough in this post has the following steps:

  1. Amazon EKS cluster setup
  2. Amazon Cognito configuration
  3. Microservice OAuth 2.0 integration
  4. Unit test the access token claims
  5. Deployment of microservice on Amazon EKS
  6. Integration tests for local and remote deployments

Prerequisites

For this walkthrough, you should have the following prerequisites in place:

Set up

Amazon EKS is the target for your microservices deployment in the sample application. Use the following steps to create an EKS cluster. If you already have an EKS cluster, you can skip to the next section: To set up the AWS Load Balancer Controller. The following example creates an EKS cluster in the Asia Pacific (Singapore) ap-southeast-1 AWS Region. Be sure to update the Region to use your value.

To create an EKS cluster with eksctl

  1. In your Unix editor, create a file named eks-cluster-config.yaml, with the following cluster configuration:
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: token-demo
      region: <ap-southeast-1>
      version: '1.20'
    
    iam:
      withOIDC: true
    managedNodeGroups:
      - name: ng0
        minSize: 1
        maxSize: 3
        desiredCapacity: 2
        labels: {role: mngworker}
    
        iam:
          withAddonPolicies:
            albIngress: true
            cloudWatch: true
    
    cloudWatch:
      clusterLogging:
        enableTypes: ["*"]
    

  2. Create the cluster by using the following eksctl command:
    eksctl create cluster -f eks-cluster-config.yaml
    

    Allow 10–15 minutes for the EKS control plane and managed node creation. eksctl automatically adds the cluster details to your kubeconfig for use with kubectl.

    Validate your cluster node status as “Ready” with the following command:

    kubectl get nodes
    

  3. Create the demo namespace to host the sample application by using the following command:
    kubectl create namespace demo
    

With the EKS cluster now up and running, there is one final setup step. The ALB for inbound HTTPS traffic is created by the AWS Load Balancer Controller directly from the EKS cluster using a Kubernetes Ingress resource.

To set up the AWS Load Balancer Controller

  1. Follow the installation steps to deploy the AWS Load Balancer Controller to Amazon EKS.
  2. For your domain host (in this case, gateway.example.com), create a public certificate using AWS Certificate Manager (ACM) that will be used for HTTPS.
  3. An Ingress resource defines the ALB configuration. You customize the ALB by using annotations. Create a file named alb.yml, and add the resource definition as follows, replacing the inbound IP CIDR with your values:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: alb-ingress
      namespace: demo
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
        alb.ingress.kubernetes.io/inbound-cidrs: <xxx.xxx.xxx.xxx>/n
      labels:
        app: alb-ingress
    spec:
      rules:
        - host: <gateway.example.com>
          http:
            paths:
              - path: /api/demo/*
                pathType: Prefix
                backend:
                  service:
                    name: demo-api
                    port:
                      number: 8080
    

  4. Deploy the Ingress resource with kubectl to create the ALB by using the following command:
    kubectl apply -f alb.yml
    

    After a few moments, you should see the ALB move from status provisioning to active, with an auto-generated public DNS name.

  5. Validate the ALB DNS name and the ALB is in active status by using the following command:
    kubectl -n demo describe ingress alb-ingress
    

  6. To alias your host (in this case, gateway.example.com) with the ALB, create a Route 53 alias record. The remote API is now accessible using your Route 53 alias, for example: https://gateway.example.com/api/demo/*

The ALB that you created will only allow incoming HTTPS traffic on port 443, and restricts incoming traffic to known source IP addresses. If you want to share the ALB across multiple microservices, you can add the alb.ingress.kubernetes.io/group.name annotation. To help protect the application from common exploits, you should add an annotation to bind AWS Web Application Firewall (WAFv2) ACLs, including rate-limiting options for the microservice.
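
For example, the shared ALB and AWS WAF integrations mentioned above are expressed as additional Ingress annotations. The following is a minimal sketch; the group name and web ACL ARN are placeholders:

metadata:
  annotations:
    # Share one ALB across multiple Ingress resources (microservices)
    alb.ingress.kubernetes.io/group.name: demo-apis
    # Associate an AWS WAFv2 web ACL, including rate-based rules, with the ALB
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:<region>:<account-id>:regional/webacl/<web-acl-name>/<web-acl-id>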

Configure the Amazon Cognito user pool

To manage the OAuth 2.0 client credential flow, you create an Amazon Cognito user pool. Use the following procedure to create the Amazon Cognito user pool in the console.

To create an Amazon Cognito user pool

  1. Log in to the Amazon Cognito console.
  2. Choose Manage User Pools.
  3. In the top-right corner of the page, choose Create a user pool.
  4. Provide a name for your user pool, and choose Review defaults to save the name.
  5. Review the user pool information and make any necessary changes. Scroll down and choose Create pool.
  6. Note down your created Pool Id, because you will need this for the microservice configuration.

Next, to simulate the client in subsequent tests, you will create three app clients: one for read permission, one for write permission, and one for the microservice.

To create Amazon Cognito app clients

  1. In the left navigation pane, under General settings, choose App clients.
  2. On the right pane, choose Add an app client.
  3. Enter the App client name as readClient.
  4. Leave all other options unchanged.
  5. Choose Create app client to save.
  6. Choose Add another app client, and add an app client with the name writeClient, then repeat step 5 to save.
  7. Choose Add another app client, and add an app client with the name microService. Clear Generate Client Secret, as this isn’t required for the microservice. Leave all other options unchanged. Repeat step 5 to save.
  8. Note down the App client id created for the microService app client, because you will need it to configure the microservice.

You now have three app clients: readClient, writeClient, and microService.

With the read and write clients created, the next step is to create the permission scope (role), which will be subsequently assigned.

To create read and write permission scopes (roles) for use with the app clients

  1. In the left navigation pane, under App integration, choose Resource servers.
  2. On the right pane, choose Add a resource server.
  3. Enter the name Gateway for the resource server.
  4. For the Identifier, enter your host name, in this case https://gateway.example.com. Figure 2 shows the resource identifier and custom scopes for read and write role permissions.

    Figure 2: Resource identifier and custom scopes

  5. In the first row under Scopes, for Name enter demo.read, and for Description enter Demo Read role.
  6. In the second row under Scopes, for Name enter demo.write, and for Description enter Demo Write role.
  7. Choose Save changes.

You have now completed configuring the custom role scopes that will be bound to the app clients. To complete the app client configuration, you will now bind the role scopes and configure the OAuth2.0 flow.

To configure app clients for client credential flow

  1. In the left navigation pane, under App Integration, select App client settings.
  2. On the right pane, the first of three app clients will be visible.
  3. Scroll to the readClient app client and make the following selections:
    • For Enabled Identity Providers, select Cognito User Pool.
    • Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
    • Under OAuth 2.0, under Allowed Custom Scopes, select the demo.read scope.
    • Leave all other options blank.
  4. Scroll to the writeClient app client and make the following selections:
    • For Enabled Identity Providers, select Cognito User Pool.
    • Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
    • Under OAuth 2.0, under Allowed Custom Scopes, select the demo.write scope.
    • Leave all other options blank.
  5. Scroll to the microService app client and make the following selections:
    • For Enabled Identity Providers, select Cognito User Pool.
    • Under OAuth 2.0, for Allowed OAuth Flows, select Client credentials.
    • Under OAuth 2.0, under Allowed Custom Scopes, select the demo.read scope.
    • Leave all other options blank.

Figure 3 shows the app client configured with the client credentials flow and custom scope; all other options remain blank.

Figure 3: App client configuration

Your Amazon Cognito configuration is now complete. Next you will integrate the microservice with OAuth 2.0.

Microservice OAuth 2.0 integration

For the server-side microservice, you will use Quarkus with Kotlin. Quarkus is a cloud-native microservice framework with strong Kubernetes and AWS integration, for the Java Virtual Machine (JVM) and GraalVM. GraalVM native-image can be used to create native executables, for fast startup and low memory usage, which is important for microservice applications.

To create the microservice quick start project

  1. Open the Quarkus quick-start website code.quarkus.io.
  2. On the top left, you can modify the Group, Artifact and Build Tool to your preference, or accept the defaults.
  3. In the Pick your extensions search box, select each of the following extensions:
    • RESTEasy JAX-RS
    • RESTEasy Jackson
    • Kubernetes
    • Container Image Jib
    • OpenID Connect
  4. Choose Generate your application to download your application as a .zip file.

Quarkus permits low-code integration with an identity provider such as Amazon Cognito, and is configured by the project application.properties file.

To configure application properties to use the Amazon Cognito IDP

  1. Edit the application.properties file in your quick start project:
    src/main/resources/application.properties
    

  2. Add the following properties, replacing the variables with your values. Use the cognito-pool-id and microservice App client id that you noted down when creating these Amazon Cognito resources in the previous sections, along with your Region.
    quarkus.oidc.auth-server-url=https://cognito-idp.<region>.amazonaws.com/<cognito-pool-id>
    quarkus.oidc.client-id=<microService App client id>
    quarkus.oidc.roles.role-claim-path=scope
    

  3. Save and close your application.properties file.

The Kotlin code sample that follows verifies the authenticated principal by using the @Authenticated annotation filter, which performs JSON Web Key Set (JWKS) token validation. The JWKS details are cached, adding only nominal latency to the application performance.

The access token claims are auto-filtered by the @RolesAllowed annotation for the custom scopes, read and write. The protected methods are illustrations of a microservice API and how to integrate this with one to two lines of code.

import io.quarkus.security.Authenticated
import javax.annotation.security.RolesAllowed
import javax.enterprise.context.RequestScoped
import javax.ws.rs.*

@Authenticated
@RequestScoped
@Path("/api/demo")
class DemoResource {

    @GET
    @Path("protectedRole/{name}")
    @RolesAllowed("https://gateway.example.com/demo.read")
    fun protectedRole(@PathParam(value = "name") name: String) = mapOf("protectedAPI" to "true", "paramName" to name)

    @POST
    @Path("protectedUpload")
    @RolesAllowed("https://gateway.example.com/demo.write")
    fun protectedDataUpload(values: Map<String, String>) = "Received: $values"

}

Unit test the access token claims

For the unit tests you will test three scenarios: unauthorized, forbidden, and ok. The @TestSecurity annotation injects an access token with the specified role claim using the Quarkus test security library. To include access token security in your unit test only requires one line of code, the @TestSecurity annotation, which is a strong reason to include access token security validation upfront in your development. The unit test code in the following example maps to the protectedRole method for the microservice via the uri /api/demo/protectedRole, with an additional path parameter sample-username to be returned by the method for confirmation.

import io.quarkus.test.junit.QuarkusTest
import io.quarkus.test.security.TestSecurity
import io.restassured.RestAssured
import io.restassured.http.ContentType
import org.junit.jupiter.api.Test

@QuarkusTest
class DemoResourceTest {

    @Test
    fun testNoAccessToken() {
        RestAssured.given()
            .`when`().get("/api/demo/protectedRole/sample-username")
            .then()
            .statusCode(401)
    }

    @Test
    @TestSecurity(user = "writeClient", roles = [ "https://gateway.example.com/demo.write" ])
    fun testIncorrectRole() {
        RestAssured.given()
            .`when`().get("/api/demo/protectedRole/sample-username")
            .then()
            .statusCode(403)
    }

    @Test
    @TestSecurity(user = "readClient", roles = [ "https://gateway.example.com/demo.read" ])
    fun testProtectedRole() {
        RestAssured.given()
            .`when`().get("/api/demo/protectedRole/sample-username")
            .then()
            .statusCode(200)
            .contentType(ContentType.JSON)
    }

}

Deploy the microservice on Amazon EKS

Deploying the microservice to Amazon EKS is the same as deploying to any upstream Kubernetes-compliant installation. You declare your application resources in a manifest file, and you deploy a container image of your application to your container registry. You can do this in a similar low-code manner with the Quarkus Kubernetes extension, which automatically generates the Kubernetes deployment and service resources at build time. The Quarkus Container Image Jib extension automatically builds the container image and pushes it to Amazon Elastic Container Registry (Amazon ECR), without the need for a Dockerfile.

Amazon ECR setup

Your microservice container image created during the build process will be published to Amazon Elastic Container Registry (Amazon ECR) in the same Region as the target Amazon EKS cluster deployment. Container images are stored in a repository in Amazon ECR; the following example uses a repository naming convention of project name and microservice name. The first command that follows creates the Amazon ECR repository to host the microservice container image, and the second command obtains login credentials to publish the container image to Amazon ECR.

To set up the application for Amazon ECR integration

  1. In the AWS CLI, create an Amazon ECR repository by using the following command. Replace the project name and microservice name variables with your values.
    aws ecr create-repository --repository-name <project-name>/<microservice-name>  --region <region>
    

  2. Obtain an ECR authorization token, by using your IAM principal with the following command. Replace the variables with your values for the AWS account ID and Region.
    aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.<region>.amazonaws.com
    

Configure the application properties to use Amazon ECR

To update the application properties with the ECR repository details

  1. Edit the application.properties file in your Quarkus project:
    src/main/resources/application.properties
    

  2. Add the following properties, replacing the variables with your values, for the AWS account ID and Region.
    quarkus.container-image.group=<microservice-name>
    quarkus.container-image.registry=<aws-account-id>.dkr.ecr.<region>.amazonaws.com
    quarkus.container-image.build=true
    quarkus.container-image.push=true
    

  3. Save and close your application.properties.
  4. Re-build your application (a sample build command follows this list).
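
The exact rebuild command depends on the build tool you chose when generating the Quarkus project. The following is a minimal sketch, assuming the default Maven wrapper; with the container-image properties above set to true, packaging the application also builds the image and pushes it to Amazon ECR:

./mvnw clean package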

After the application rebuild, you should have a container image deployed to Amazon ECR in your Region with the name [project-group]/[project-name]. The Quarkus build will give an error if the push to Amazon ECR fails.

Now, you can deploy your application to Amazon EKS, with kubectl from the following build path:

kubectl apply -f build/kubernetes/kubernetes.yml

Integration tests for local and remote deployments

The following environment assumes a Unix shell: either macOS, Linux, or Windows Subsystem for Linux (WSL 2).

How to obtain the access token from the token endpoint

Obtain the access token for the application client by using the Amazon Cognito OAuth 2.0 token endpoint, and export an environment variable for re-use. Replace the variables with your Amazon Cognito pool name and AWS Region, respectively.

export TOKEN_ENDPOINT=https://<pool-name>.auth.<region>.amazoncognito.com/oauth2/token

To generate the client credentials in the required format, you need the Base64 representation of the app client client-id:client-secret. There are many tools online to help you generate a Base64 encoded string. Export the following environment variables, to avoid hard-coding in configuration or scripts.

export CLIENT_CREDENTIALS_READ=Base64(client-id:client-secret)
export CLIENT_CREDENTIALS_WRITE=Base64(client-id:client-secret)
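
If you prefer not to use an online tool, you can generate the Base64 value locally in your Unix shell. The following is a minimal sketch with placeholder values; the -n flag prevents a trailing newline from being encoded, and tr strips any line wrapping added by base64:

export CLIENT_CREDENTIALS_READ=$(echo -n "<read-client-id>:<read-client-secret>" | base64 | tr -d '\n')
export CLIENT_CREDENTIALS_WRITE=$(echo -n "<write-client-id>:<write-client-secret>" | base64 | tr -d '\n')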

You can use curl to post to the token endpoint, and obtain an access token for the read and write app client respectively. You can pass grant_type=client_credentials and the custom scopes as appropriate. If you pass an incorrect scope, you will receive an invalid_grant error. The Unix jq tool extracts the access token from the JSON string. If you do not have the jq tool installed, you can use your relevant package manager (such as apt-get, yum, or brew), to install using sudo [package manager] install jq.

The following shell commands obtain the access token associated with the read or write scope. The client credentials are used to authorize the generation of the access token. An environment variable stores the read or write access token for future use. Update the scope URL to your host, in this case gateway.example.com.

export access_token_read=$(curl -s -X POST --location "$TOKEN_ENDPOINT" \
     -H "Authorization: Basic $CLIENT_CREDENTIALS_READ" \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d "grant_type=client_credentials&scope=https://<gateway.example.com>/demo.read" \
| jq --raw-output '.access_token')

export access_token_write=$(curl -s -X POST --location "$TOKEN_ENDPOINT" \
     -H "Authorization: Basic $CLIENT_CREDENTIALS_WRITE" \
     -H "Content-Type: application/x-www-form-urlencoded" \
     -d "grant_type=client_credentials&scope=https://<gateway.example.com>/demo.write" \
| jq --raw-output '.access_token')

If the curl commands are successful, you should see the access tokens in the environment variables by using the following echo commands:

echo $access_token_read
echo $access_token_write

For more information or troubleshooting, see TOKEN Endpoint in the Amazon Cognito Developer Guide.

Test scope with automation script

Now that you have saved the read and write access tokens, you can test the API. The endpoint can be local or on a remote cluster. The process is the same, all that changes is the target URL. The simplicity of toggling the target URL between local and remote is one of the reasons why access token security can be integrated into the full development lifecycle.

To perform integration tests in bulk, use a shell script that validates the response code. The example script that follows validates the API call under three test scenarios, the same as the unit tests:

  1. If no valid access token is passed: 401 (unauthorized) response is expected.
  2. A valid access token is passed, but with an incorrect role claim: 403 (forbidden) response is expected.
  3. A valid access token and valid role-claim is passed: 200 (ok) response with content-type of application/json expected.

Name the following script, demo-api.sh. For each API method in the microservice, you duplicate these three tests, but for the sake of brevity in this post, I’m only showing you one API method here, protectedRole.

#!/bin/bash

HOST="http://localhost:8080"
if [ "_$1" != "_" ]; then
  HOST="$1"
fi

validate_response() {
  typeset http_response="$1"
  typeset expected_rc="$2"

  http_status=$(echo "$http_response" | awk 'BEGIN { FS = "!" }; { print $2 }')
  if [ $http_status -ne $expected_rc ]; then
    echo "Failed: Status code $http_status"
    exit 1
  elif [ $http_status -eq 200 ]; then
      echo "  Output: $http_response"
  fi
}

echo "Test 401-unauthorized: Protected /api/demo/protectedRole/{name}"
http_response=$(
  curl --silent -w "!%{http_code}!%{content_type}" \
    -X GET --location "$HOST/api/demo/protectedRole/sample-username" \
    -H "Cache-Control: no-cache" \
    -H "Accept: text/plain"
)
validate_response "$http_response" 401

echo "Test 403-forbidden: Protected /api/demo/protectedRole/{name}"
http_response=$(
  curl --silent -w "!%{http_code}!%{content_type}" \
    -X GET --location "$HOST/api/demo/protectedRole/sample-username" \
    -H "Accept: application/json" \
    -H "Cache-Control: no-cache" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $access_token_write"
)
validate_response "$http_response" 403

echo "Test 200-ok: Protected /api/demo/protectedRole/{name}"
http_response=$(
  curl --silent -w "!%{http_code}!%{content_type}" \
    -X GET --location "$HOST/api/demo/protectedRole/sample-username" \
    -H "Accept: application/json" \
    -H "Cache-Control: no-cache" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $access_token_read"
)
validate_response "$http_response" 200

Test the microservice API against the access token claims

Run the script for a local host deployment on http://localhost:8080, and on the remote EKS cluster, in this case https://gateway.example.com.

If everything works as expected, you will have demonstrated the same test process for local and remote deployments of your microservice. Another advantage of creating a security test automation process like the one demonstrated, is that you can also include it as part of your continuous integration/continuous delivery (CI/CD) test automation.

The test automation script accepts the microservice host URL as a parameter (the default is local), referencing the stored access tokens from the environment variables. Upon error, the script will exit with the error code. To test the remote EKS cluster, use the following command, with your host URL, in this case gateway.example.com.

./demo-api.sh https://<gateway.example.com>

Expected output:

Test 401-unauthorized: No access token for /api/demo/protectedRole/{name}
Test 403-forbidden: Incorrect role/custom-scope for /api/demo/protectedRole/{name}
Test 200-ok: Correct role for /api/demo/protectedRole/{name}
  Output: {"protectedAPI":"true","paramName":"sample-username"}!200!application/json

Best practices for a well architected production service-to-service client

For elevated security in alignment with AWS Well-Architected, it is recommended to use AWS Secrets Manager to hold the client credentials. Separating your credentials from the application permits credential rotation without the requirement to release a new version of the application or modify environment variables used by the service. Access to secrets must be tightly controlled because the secrets contain extremely sensitive information. Secrets Manager uses AWS Identity and Access Management (IAM) to secure access to the secrets. By using IAM permissions policies, you can control which users or services have access to your secrets. Secrets Manager uses envelope encryption with AWS KMS customer master keys (CMKs) and data keys to protect each secret value. When you create a secret, you can choose any symmetric customer managed CMK in the AWS account and Region, or you can use the AWS managed CMK for Secrets Manager, aws/secretsmanager.
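
As an illustration of this pattern, the application client could fetch the credentials at startup from Secrets Manager with the AWS CLI (or the equivalent SDK call) instead of exporting them as plain environment variables. The following is a minimal sketch; the secret name is a placeholder:

# Retrieve the client credentials stored as a secret in AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id demo/microservice/client-credentials \
  --query SecretString \
  --output text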

Access tokens can be configured on Amazon Cognito to expire in as little as 5 minutes or as long as 24 hours. To avoid unnecessary calls to the token endpoint, the application client should cache the access token and refresh close to expiry. In the Quarkus framework used for the microservice, this can be automatically performed for a client service by adding the quarkus-oidc-client extension to the application.

Cleaning up

To avoid incurring future charges, delete all the resources created.

Conclusion

This post has focused on the last line of defense, the microservice, and the importance of a layered security approach throughout the development lifecycle. Access token security should be validated both at the application gateway and microservice for end-to-end API protection.

As an additional layer of security at the application gateway, you should consider using Amazon API Gateway, and the inbuilt JWT authorizer to perform the same API access token validation for public facing APIs. For more advanced business-to-business solutions, Amazon API Gateway provides integrated mutual TLS authentication.

To learn more about protecting information, systems, and assets that use Amazon EKS, see the Amazon EKS Best Practices Guide for Security.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Cognito forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Timothy James Power

Timothy is a Senior Solutions Architect Manager, leading the Accenture AWS Business Group in APAC and Japan. He has a keen interest in software development, spanning 20+ years, primarily in financial services. Tim is a passionate sportsperson, and loves spending time on the water, in between playing with his young children.

Insights for CTOs: Part 1 – Building and Operating Cloud Applications

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-1-building-and-operating-cloud-applications/

This 6-part series shares insights gained from various CTOs during their cloud adoption journeys at their respective organizations. This post takes those learnings and summarizes architecture best practices to help you build and operate applications successfully in the cloud. This series will also cover topics on cloud financial management, security, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

Optimize cost and performance with AWS services

Your technology costs vs. return on investment (ROI) is likely being constantly evaluated. In the cloud, you “pay as you go.” This means your technology spend is an operational cost rather than capital expenditure, as discussed in Part 3: Cloud economics – OPEX vs. CAPEX.

So, how do you maximize the ROI of your cloud spend? The following sections provide hosting options to help you choose a hosting model that best suits your needs.

Evaluating your hosting options

EC2 instances and the lift and shift strategy

Using cloud native dynamic provisioning/de-provisioning of Amazon Elastic Compute Cloud (Amazon EC2) instances will help you meet business needs more accurately and optimize compute costs. EC2 instances allow you to use the “lift and shift” migration strategy for your applications. This helps you avoid overhead costs you may incur from upfront capacity planning.

Figure 1. Comparing on-premises vs. cloud infrastructure provisioning

Containerized hosting (with EC2 hosts)

Engineering teams already skilled in containerized hosting have saved additional costs by using Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS). This is because your unit of deployment is a container instead of an entire instance, and Amazon EKS or Amazon ECS can pack multiple containers into a single instance. Application change management is also less risky because you can leverage Amazon EKS or Amazon ECS built-in orchestration to manage non-disruptive deployments.

Serverless architecture

Use AWS Lambda and AWS Fargate to scale to match unpredictable usage. We have seen AWS software as a service (SaaS) customers build better measures of “cost per user” of an application into their metering systems using serverless. This is because instead of paying for server uptime, you only pay for runtime usage (down to millisecond increments for Lambda) when you run your application.

Further considerations for choosing the right hosting platform

The following table provides considerations for implementing the most cost-effective model for some use cases you may encounter when building your architecture:

Table 1

Building a cloud operating model and managing risk

Building an effective people, governance, and platform capability is summarized in the following sections and discussed in detail in Part 5: Organizing teams to enable effective build/run/manage.

People

If your team only builds applications on virtual machines, asking them to move to the cloud serverless model without sufficiently training them could go poorly. We suggest starting small. Select a handful of applications that have lower risk yet meaningful business value and allow your team to build their cloud “muscles.”

Governance

If your teams don’t have the “muscle memory” to make cloud architecture decisions, build a Cloud Center of Excellence (CCOE) to enforce a consistent approach to building in the cloud. Without this team, managing cost, security, and reliability will be harder. Ask the CCOE team to regularly review the application architecture suitability (cost, performance, resiliency) against changing business conditions. This will help you incrementally evolve architecture as appropriate.

Platform

In a typical on-premises environment, changes are deployed “in-place.” This requires a slow and “involve everyone” approach. Deploying in the cloud replaces the in-place approach with blue/green deployments, as shown in Figure 2.

With this strategy, new application versions can be deployed on new machines (green) running side by side with the old machines (blue). Once the new version is validated, switch traffic to the new (green) machines and they become production. This model reduces risk and increases velocity of change.

Figure 2. AWS blue/green deployment model

Securing your application and infrastructure

Security controls in the cloud are defined and enforced in software, which brings risks and opportunities. If not managed with a robust change management process, software-defined firewall misconfiguration can create unexpected threat vectors.

To avoid this, use cloud native patterns like “infrastructure as code” that express all infrastructure provisioning and configuration as declarative templates (JSON or YAML files). Then apply the same “Git pull request” process to infrastructure change management as you do for your applications to enforce strong governance. Use tools like AWS CloudFormation or AWS Cloud Development Kit (AWS CDK) to implement infrastructure templates into your cloud environment.

Apply a layered security model (“defense in depth”) to your application stack, as shown in Figure 3, to prevent against distributed denial of service (DDoS) and application layer attacks. Part 2: Protecting AWS account, data, and applications provides a detailed discussion on security.

Figure 3. Defense in depth

Data stores

How many is too many?

In on-premises environments, it is typically difficult to provision a separate database per microservice. As a result, the application or microservice isolation stops at the compute layer, and the database becomes the key shared dependency that slows down change.

The cloud provides API instantiable, fully managed databases like Amazon Relational Database Service (Amazon RDS) (SQL), Amazon DynamoDB (NoSQL), and others. This allows you to isolate your application end to end and create a more resilient architecture. For example, in a cell-based architecture where users are placed into self-contained, isolated application stack “cells,” the “blast radius” of an impact, such as application downtime or user experience degradation, is limited to each cell.

Database engines

Relational databases are typically the default starting point for many organizations. While relational databases offer speed and flexibility to bootstrap a new application, they bring complexity when you need to horizontally scale.

Your application needs will determine whether you use a relational or non-relational database. In the cloud, API instantiated, fully managed databases give you options to closely match your application’s use case. For example, in-memory databases like Amazon ElastiCache reduce latency for website content and key-value databases like DynamoDB provide a horizontally scalable backend for building an ecommerce shopping cart.

Summary

We acknowledge that CTO responsibilities can differ among organizations; however, this blog discusses common key considerations when building and operating an application in the cloud.

Choosing the right application hosting platform depends on your application’s use case and can impact the operational cost of your application in the cloud. Consider the people, governance, and platform aspects carefully because they will influence the success or failure of your cloud adoption. Use lower risk application deployment patterns in the cloud. Managed data stores in the cloud open your choice for data stores beyond relational. In the next post of this series, Part 2: Protecting AWS account, data, and applications, we will explore best practices and principles to apply when thinking about security in the cloud.

Related information

  • Part 2: Protecting AWS account, data and applications
  • Part 3: Cloud economics – OPEX vs CAPEX
  • Part 4: Building a modern data platform for AI
  • Part 5: Organizing teams to enable effective build/run/manage
  • Part 6: Strategies and lessons on migrating workloads to the cloud

Choosing a Well-Architected CI/CD approach: Open Source on AWS

Post Syndicated from Mikhail Vasilyev original https://aws.amazon.com/blogs/devops/choosing-a-well-architected-ci-cd-approach-open-source-on-aws/

Introduction

When building a CI/CD platform, it is important to make an informed decision regarding every underlying tool. This post explores the criteria for selecting each tool, focusing on balancing functional and non-functional requirements while maximizing value.

Your first decision: source code management.

Source code is potentially your most valuable asset, and so we start by choosing a source code management tool. These tools normally have high non-functional requirements in order to protect your assets and to ensure they are available to the organization when needed. The requirements usually include demand for high durability, high availability (HA), consistently high throughput, and strong security with role-based access controls.

At the same time, source code management tools normally have many specific functional requirements as well. For example, the ability to provide collaborative code review in the UI, flexible and tunable merge policies including both automated and manual gates (code checks), and out-of-box UI-level integrations with numerous other tools. These kinds of integrations can include enabling monitoring, CI, chats, and agile project management.

Many teams also treat their source code management tool as a portal into other CI/CD tools. They share it between teams, and might prefer to stay within a single context and user interface throughout the entire DevOps cycle. Many source code management tools are actually a stack of services that support multiple steps of your CI/CD workflows from within a single UI. This makes them an excellent starting point for building your CI/CD platforms.

The first decision you need to make is whether to go with an open source solution for managing code or with AWS-managed solutions, such as AWS CodeCommit. Open source solutions include (but are not limited to) the following: Gerrit, Gitlab, Gogs, and Phabricator.

Your decision will be influenced by the amount of benefit your team can gain from the flexibility provided through open source, and how well your team can support deploying and managing these solutions. You will also need to consider the infrastructure and management overhead cost.

Engineering teams that have the capacity to develop their own plugins for their CI/CD platforms, or who even contribute directly to open source projects, will often prefer open source solutions for the flexibility they provide. This will be especially true if they are fluent in designing and supporting their own cloud infrastructure. If the team gets more value by trading the flexibility of open source for not having to worry about managing infrastructure (especially if high availability, scalability, durability, and security are more critical), an AWS-managed solution would be a better choice.

Source Code Management Solution

When the choice is made in favor of an open-source code management solution (such as Gitlab), the next decision will be how to architect the deployment. Will the team deploy to a single instance, or design for high availability, durability, and scalability? Teams that want to design Gitlab for HA can use the following guide to proceed: Installing GitLab on Amazon Web Services (AWS).

By adopting AWS services (such as Amazon RDS, Amazon ElastiCache for Redis, and Auto Scaling groups), you can lower the management burden of supporting the underlying infrastructure in this self-managed HA scenario.

High level overview of self-managed HA Gitlab deployment

Your second decision: Continuous Integration engine

When selecting your CI engine, you might be able to benefit from additional features of previously selected solutions. Gitlab provides both source control services and built-in CI tools, called Gitlab CI. Gitlab Runners are responsible for running CI jobs, and the actual jobs are described as YAML files stored in Gitlab’s Git repository along with product code. For security and performance reasons, GitLab Runners should be on resources separate from your GitLab instance.

You could manage those resources or you could use one of the AWS services that can support deploying and managing Runners. The use of an on-demand service removes the expense of implementing and managing a capability that is undifferentiated heavy lifting for you. This provides cost optimization and enables operational excellence. You pay for what you use and the service team manages the underlying service.

Continuous Integration engine Solution

In an architecture example (below), Gitlab Runners are deployed in containers running on Amazon EKS. The team has less infrastructure to manage, can start focusing on development faster by not having to implement the capability, and can provision resources in an optimal way for their on-demand needs.

To further optimize costs, you can use EC2 Spot Instances for your EKS nodes. CI jobs are normally compute intensive and limited in run time, and runner jobs can easily be restarted on a different resource with little impact. This makes them failure tolerant and makes EC2 Spot Instances very appealing. Amazon EKS and Spot Instances are supported out of the box in Gitlab, so there is no integration to develop; only configuration is required.
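
For example, a Spot-backed managed node group dedicated to runner workloads could be created with eksctl. This is a sketch only; the cluster name, node group name, instance types, and sizes are assumptions:

# Hedged example: add a Spot-backed managed node group for CI runners to an existing cluster
eksctl create nodegroup \
  --cluster ci-platform \
  --name gitlab-runners-spot \
  --spot \
  --instance-types m5.large,m5a.large,m4.large \
  --nodes-min 1 \
  --nodes-max 10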

To support infrastructure as code best practices, Runners are deployed with Helm and are stored and versioned as Helm charts. All of the infrastructure as code information used to implement the CI/CD platform itself is stored in templates such as Terraform.
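
A minimal sketch of such a Helm-based Runner deployment is shown below, assuming the public gitlab-runner chart and a values file (runner-values.yaml is a hypothetical name) that holds the Gitlab URL, registration token, and runner tags:

helm repo add gitlab https://charts.gitlab.io
helm repo update
# runner-values.yaml (assumed file) supplies gitlabUrl, the registration token, and runner tags
helm install gitlab-runner gitlab/gitlab-runner \
  --namespace gitlab-runner \
  --create-namespace \
  -f runner-values.yaml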

High level overview of Infrastructure as Code on Gitlab and Gitlab CI


Your third decision: Container Registry

You will be unable to deploy Runners if the container images are not available. As a result, the primary non-functional requirements for your production container registry are likely to include high availability, durability, transparent scalability, and security. At the same time, your functional requirements for a container registry might be lower. It might be sufficient to have a simple UI, and simple APIs supporting basic flows. Customers looking for a managed solution can use Amazon ECR, which is OCI compliant and supports Helm Charts.

Container Registry Solution

For this set of requirements, the flexibility and feature velocity of open source tools does not provide an advantage. Self-supporting high availability and strengthened security could be costly in implementation time and long-term management. Based on [Blog post 1 Diagram 1], an AWS-managed solution provides cost advantages and has no management overhead. In this case, an AWS-managed solution is a better choice for your container registry than an open-source solution hosted on AWS. In this example, Amazon ECR is selected. Customers who prefer to go with open-source container registries might consider solutions like Harbor.
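
As an illustration of how little setup the managed registry needs, the following sketch creates a repository and pushes an image to it; the account ID, Region, and repository name are placeholders:

aws ecr create-repository --repository-name ci/runner-build-image
# authenticate the Docker CLI against the registry (account ID and Region are placeholders)
aws ecr get-login-password --region eu-west-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-1.amazonaws.com
docker tag runner-build-image:latest 123456789012.dkr.ecr.eu-west-1.amazonaws.com/ci/runner-build-image:latest
docker push 123456789012.dkr.ecr.eu-west-1.amazonaws.com/ci/runner-build-image:latest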

High level overview of Gitlab CI with Amazon ECR


Additional Considerations

Now that the main services for the CI/CD platform are selected, we will take a high level look at additional important considerations. You need to make sure you have observability into both infrastructure and applications, that backup tools and policies are in place, and that security needs are addressed.

There are many mechanisms to strengthen security, including the use of security groups. Use IAM for granular permission control. Robust policies can limit the exposure of your resources and control the flow of traffic. Implement policies to prevent your assets leaving your CI environment inappropriately. To protect sensitive data, such as worker secrets, encrypt these assets in transit and at rest. Select a key management solution, such as AWS Key Management Service (AWS KMS), to reduce your operational burden and support these activities. To deliver secure and compliant application changes rapidly while running operations consistently with automation, implement DevSecOps.

Amazon S3 is durable, secure, and highly available by design, making it many customers’ preferred choice for storing EBS-level backups. Amazon S3 satisfies the non-functional requirements for a backup store. It also supports versioning and tiered storage classes, making it cost-effective as well.

Your observability requirements may emphasize versatility and flexibility for application-level monitoring. Using Amazon CloudWatch to monitor your infrastructure and then extending your capabilities through open-source solutions such as Prometheus may be advantageous. You can get many of the benefits of both open-source Prometheus and AWS services with Amazon Managed Service for Prometheus (AMP). For interactive visualization of metrics, many customers choose solutions such as open-source Grafana, available as an AWS service, Amazon Managed Service for Grafana (AMG).
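
As a small, hedged example, an AMP workspace can be created with a single CLI call and then used as the remote-write target for a Prometheus server running in the cluster; the alias below is an assumption:

aws amp create-workspace --alias ci-platform-metrics
# confirm the workspace and note its ID for the Prometheus remote_write configuration
aws amp list-workspaces --alias ci-platform-metrics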

CI/CD Platform with Gitlab and AWS


Conclusion

We have covered how making informed decisions can maximize value and synergy between open-source solutions on AWS, such as Gitlab, and AWS-managed services, such as Amazon EKS and Amazon ECR. You can find the right balance of open-source tools and AWS services that will meet your functional and non-functional requirements, and help maximize the value you get from those resources.

Pete Goldberg, Director of Partnerships at GitLab: “When aligning your development process to AWS Well Architected Framework, GitLab allows customers to build and automate processes to achieve Operational Excellence. As a single tool designed to facilitate collaboration across the organization, GitLab simplifies the process to follow the Fully Separated Operating Model where Engineering and Operations come together via automated processes that remove the historical barriers between the groups. This gives organizations the ability to efficiently and rapidly deploy new features and applications that drive the business while providing the risk mitigation and compliance they require. By allowing operations teams to define infrastructure as code in the same tool that the engineering teams are storing application code, and allowing your automation bring those together for your CI/CD workflows companies can move faster while having compliance and controls built-in, providing the entire organization greater transparency. With GitLab’s integrations with different AWS compute options (EC2, Lambda, Fargate, ECS or EKS), customers can choose the best type of compute for the job without sacrificing the controls required to maintain Operational Excellence.”

 

Author bio

Mikhail is a Solutions Architect for RUS-CIS. Mikhail supports customers on their cloud journeys with Well-architected best practices and adoption of DevOps techniques on AWS. Mikhail is a fan of ChatOps, Open Source on AWS and Operational Excellence design principles.

Build a SQL-based ETL pipeline with Apache Spark on Amazon EKS

Post Syndicated from Melody Yang original https://aws.amazon.com/blogs/big-data/build-a-sql-based-etl-pipeline-with-apache-spark-on-amazon-eks/

Today, the most successful and fastest growing companies are generally data-driven organizations. Taking advantage of data is pivotal to answering many pressing business problems; however, this can prove to be overwhelming and difficult to manage due to data’s increasing diversity, scale, and complexity. One of the most popular technologies that businesses use to overcome these challenges and harness the power of their growing data is Apache Spark.

Apache Spark is an open-source, distributed data processing framework capable of performing analytics on large-scale datasets, enabling businesses to derive insights from all of their data whether it is structured, semi-structured, or unstructured in nature.

You can flexibly deploy Spark applications in multiple ways within your AWS environment, including on the Amazon managed Kubernetes offering Amazon Elastic Kubernetes Service (Amazon EKS). With the release of Spark 2.3, Kubernetes became a new resource scheduler (in addition to YARN, Mesos, and Standalone) to provision and manage Spark workloads. Increasingly, it has become the new standard resource manager for new Spark projects, as we can tell from the popularity of the open-source project. With Spark 3.1, the Spark on Kubernetes project is officially production-ready and Generally Available. More data architectures and patterns are available for businesses to accelerate data-driven transitions. However, for organizations accustomed to SQL-based data management systems and tools, adapting to modern data practices with Apache Spark may slow down the pace of innovation.

In this post, we address this challenge by using the open-source data processing framework Arc, which subscribes to the SQL-first design principle. Arc abstracts from Apache Spark and container technologies, in order to foster simplicity whilst maximizing efficiency. Arc is used as a publicly available example to prove the ETL architecture. It can be replaced by your own in-house-built framework or another data framework that supports the declarative ETL build and deployment pattern.

Why do we need to build a codeless and declarative data pipeline?

Data platforms often repeatedly perform extract, transform, and load (ETL) jobs to achieve similar outputs and objectives. These can range from simple data operations, such as standardizing a date column, to complex change data capture (CDC) processes that track the historical changes of a record. Although the outcomes are highly similar, the productivity and cost can vary heavily if the jobs are not implemented suitably and efficiently.

A codeless data processing design pattern enables data personas to build reusable and performant ETL pipelines, without having to delve into the complexities of writing verbose Spark code. Writing your ETL pipeline in native Spark may not scale very well for organizations not familiar with maintaining code, especially when business requirements change frequently. The SQL-first approach provides a declarative harness towards building idempotent data pipelines that can be easily scaled and embedded within your continuous integration and continuous delivery (CI/CD) process.

The Arc declarative data framework simplifies ETL implementation in Spark and enables a wider audience of users ranging from business analysts to developers, who already have existing skills in SQL. It further accelerates users’ ability to develop efficient ETL pipelines to deliver higher business value.

For this post, we demonstrate how simple it is to use Arc to facilitate CDC to track incremental data changes from a source system.

Why adopt SQL to build Spark workloads?

When writing Spark applications, developers often opt for an imperative or procedural approach that involves explicitly defining the computational steps and the order of implementation.

A declarative approach involves the author defining the desired target state without describing the control flow. This is determined by the underlying engine or framework, which in this post will be determined by the Spark SQL engine.

Let’s explore a common data use case of masking columns in a table and how we can write our transformation code in these two paradigms.

The following code shows the imperative method (PySpark):

# Step 1 – Define a dataframe with a column to be masked
df1 = spark.sql("select phone_number from customer")

# Step 2 – Define a new dataframe with a new column that holds the masked value
df2 = df1.withColumn("phone_number_masked", regexp_replace("phone_number", "[0-9]", "*"))

# Step 3 – Drop the old column that is unmasked
df3 = df2.drop("phone_number")

The following code shows the declarative method (Spark SQL):

SELECT regexp_replace(phone_number, '[0-9]', '*') AS phone_number_masked 
FROM customer

The imperative approach dictates how to construct a representation of the customer table with a masked phone number column; whereas the declarative approach defines just the “what” or the desired target state, leaving the “how” for the underlying engine to manage.

As a result, the declarative approach is much simpler and yields code that is easier to read. Furthermore, in this context, it takes advantage of SQL—a declarative language and more widely adopted and known tool, which enables you to easily build data pipelines and achieve your analytical objectives quicker.

If the underlying ETL technology changes, the SQL script remains the same as long as business rules remain unchanged. However, with an imperative approach processing data, the code will most likely require a rewrite and regression testing, such as when organizations upgrade Python from version 2 to 3.

Why deploy Spark on Amazon EKS?

Amazon EKS is a fully managed offering that enables you to run containerized applications without needing to install or manage your own Kubernetes control plane or worker nodes. When you deploy Apache Spark on Amazon EKS, applications can inherit the underlying advantages of Kubernetes, improving the overall flexibility, availability, scalability, and security:

  • Optimized resource isolation – Amazon EKS supports Kubernetes namespaces, network policies, and pods priority to provide isolation between workloads. In multi-tenant environments, its optimized resource allocation feature enables different personas such as IT engineers, data scientists, and business analysts to focus their attention towards innovation and delivery. They don’t need to worry about resource segregation and security.
  • Simpler Spark cluster management – Spark applications can interact with the Amazon EKS API to automatically configure and provision Spark clusters based on your Spark submit request. Amazon EKS spins up a number of pods or containers accordingly for your data processing needs. If you turn on the Dynamic Resource Allocation feature in your application, the Spark cluster on Amazon EKS dynamically evolves based on workload. This significantly simplifies the Spark cluster management.
  • Scale Spark applications seamlessly and efficiently – Spark on Amazon EKS follows a pod-centric architecture pattern. This means an isolated cluster of pods on Amazon EKS is dedicated to a single Spark ETL job. You can expand or shrink a Spark cluster per job in a matter of seconds in some cases. To better manage spikes, for example when training a machine learning model over a long period of time, Amazon EKS offers the elastic control through the Cluster Autoscaler at node level and the Horizontal pod Autoscaler at the pod level. Additionally, scaling Spark on Amazon EKS with the AWS Fargate launch type offers you a serverless ETL option with the least operational effort.
  • Improved resiliency and cloud support – Kubernetes was introduced in 2018 as a native Spark resource scheduler. As adoption grew, this project became Generally Available with Spark 3.1 (2021), alongside better cloud support. A major update in this exciting release is the Graceful Executor Decommissioning, which makes Apache Spark more robust to Amazon Elastic Compute Cloud (Amazon EC2) Spot Instance interruption. As of this writing, the feature is only available in Kubernetes and Standalone mode.

Spark on Amazon EKS can use all of these features provided by the fully managed Kubernetes service for more optimal resource allocation, simpler deployment, and improved operational excellence.
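
For reference, submitting a Spark application directly to the Kubernetes scheduler looks like the following sketch. The API server endpoint, container image URI, and S3 path are placeholders, and the example word count script is assumed to be baked into the image:

spark-submit \
  --master k8s://https://<EKS_API_SERVER_ENDPOINT> \
  --deploy-mode cluster \
  --name word-count \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.container.image=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/spark:latest \
  --conf spark.executor.instances=4 \
  local:///opt/spark/examples/src/main/python/wordcount.py \
  s3a://<BUCKET>/input/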

Solution overview

This post comes with a ready-to-use blueprint, which automatically provisions the necessary infrastructure and spins up two web interfaces in Amazon EKS to support interactive ETL build and orchestration. Additionally, it enforces the best practice in data DevOps and CI/CD deployment.

 The following diagram illustrates the solution architecture.

The architecture has four main components:

  • Orchestration on Amazon EKS – The solution offers a highly pluggable workflow management layer. In this post, we use Argo Workflows to orchestrate ETL jobs in a declarative way. It’s consistent with the standard deployment method in Amazon EKS. Apache Airflow and other tools are also available to use.
  • Data workload on Amazon EKS – This represents a workspace on the same Amazon EKS cluster, for you to build, test, and run ETL jobs interactively. It’s powered by Jupyter Notebooks with a custom kernel called Arc Jupyter. Its Git integration feature reinforces the best practice in CI/CD deployment operation. This means every notebook created on a Jupyter instance must check in to a Git repository for the standard source and version control. The Git repository should be your single source of truth for the ETL workload. When your Jupyter notebook files (job definition) and SQL scripts land to Git, followed by an Amazon Simple Storage Service (Amazon S3) upload, it runs your ETL automatically or based on a time schedule. The entire deployment process is seamless to prevent any unintentional human mistakes.
  • Security – This layer secures Arc, Jupyter Docker images, and other sensitive information. The IAM roles for service accounts (IRSA) feature on Amazon EKS provides token authorization with fine-grained access control to other AWS services. In this solution, Amazon EKS integrates with Amazon Athena, AWS Glue, and S3 buckets securely, so you don’t need to maintain a long-lived AWS credential for your applications. We also use Amazon CloudWatch for collecting ETL application logs and monitoring Amazon EKS with Container Insights.
  • Data lake – As an output of the solution, the data destination is an S3 bucket. You should be able to query the data directly in Athena, backed up by a Data Catalog in AWS Glue.

Prerequisites

To run the sample solution on a local machine, you should have the following prerequisites:

  • Python 3.6 or later.
  • The AWS Command Line Interface (AWS CLI) version 1. For Windows, use the MSI installer. For Linux, macOS, or Unix, use the bundled installer.
  • The AWS CLI is configured to communicate with services in your deployment account. Otherwise, either set your profile by running export AWS_PROFILE=<your_aws_profile> or run aws configure to set up your AWS account access.

If you don’t want to install anything on your computer, use AWS CloudShell, a browser-based shell that makes it easy to run scripts with the AWS CLI.

Download the project

Clone the sample code either to your computer or your AWS CloudShell console:

git clone https://github.com/aws-samples/sql-based-etl-on-amazon-eks.git 
cd sql-based-etl-on-amazon-eks

Deploy the infrastructure

The deployment process takes approximately 30 minutes to complete.

Launch the AWS CloudFormation template to deploy the solution. Follow the Customization instructions if you want to make a change or deploy to a different Region.

Region

Launch Template

US East (N. Virginia)

Deploy with the default settings (recommended). If you want to use your own username for the Jupyter login, update the parameter jhubuser. If performing ETL on your own data, update the parameter datalakebucket with your S3 bucket. The bucket must be in the same Region as the deployment Region.

Post-deployment

Run the script to install command tools:

cd spark-on-eks
./deployment/post-deployment.sh

Test the job in Jupyter

To test the job in Jupyter, complete the following steps:

  1. Log in with the details from the preceding script output. Or look it up from the Secrets Manager Console.
  2. For Server Options, select the default server size.

By following the best security practice, the notebook session times out if idle for 30 minutes. You may need to refresh your web browser and log in again.

  1. Open a sample job spark-on-eks/source/example/notebook/scd2-job.ipynb from your notebook instance.
  2. Choose the refresh icon to see the file if needed.
  3. Run each block and observe the result. The job outputs a table to support the Slowly Changing Dimension Type 2 (SCD2) business need.

  1. To demonstrate the best practice in Data DevOps, the JupyterHub is configured to synchronize the latest code from this project GitHub repo. In practice, you must save all changes to a source repository in order to schedule your ETL job to run.
  2. Run the query on the Athena console to see if it’s a SCD type 2 table:
SELECT * FROM default.deltalake_contact_jhub WHERE id=12
  1. (Optional) If it’s your first time running an Athena query, configure your result location to:  s3://sparkoneks-appcode<random_string>/

Submit the job via Argo

To submit the job via Argo, complete the following steps:

  1. Check your connection in CloudShell or your local computer.
kubectl get svc && argo version --short
  1. If you don’t have access to Amazon EKS or the Argo CLI isn’t installed, run the post-deployment script (./deployment/post-deployment.sh) again.
  1. Log in to the Argo website. The login token refreshes every 10 minutes (which is configurable).
ARGO_URL=$(aws cloudformation describe-stacks --stack-name SparkOnEKS --query "Stacks[0].Outputs[?OutputKey=='ARGOURL'].OutputValue" --output text)
LOGIN=$(argo auth token)
echo -e "\nArgo website:\n$ARGO_URL\n" && echo -e "Login token:\n$LOGIN\n"
  1. Run the script again to get a new login token if you experience a timeout.
  1. Choose the Workflows option icon on the sidebar to view job status.

  1. To demonstrate the job dependency feature in Argo Workflows, we break the previous Jupyter notebook into three files, in our case, three ETL jobs.
  2. Submit the same SCD2 data pipeline with three jobs:
# change the CFN stack name if yours is different
app_code_bucket=$(aws cloudformation describe-stacks --stack-name SparkOnEKS --query "Stacks[0].Outputs[?OutputKey=='CODEBUCKET'].OutputValue" --output text)
argo submit source/example/scd2-job-scheduler.yaml -n spark --watch -p codeBucket=$app_code_bucket
  1. Under the spark namespace, check the job progress and application logs on the Argo website.

  1. Query the table in Athena to see if it has the same outcome as the test in Jupyter earlier:
SELECT * FROM default.contact_snapshot WHERE id=12

The following screenshot shows the query results.

Submit a native Spark job

Previously, we ran the Arc-based declarative ETL job defined in a Jupyter notebook. Now, let’s reuse the Arc Docker image, which contains the latest Spark distribution, to submit a native PySpark job that processes around 50GB of data. The application code (a word count job) is included in the sample project.

The job submission is defined declaratively by the spark-on-k8s-operator, following the same pattern as other application deployments. As shown in the following code, we use the same kubectl apply command syntax.

  1. Submit the Spark job as a usual application on Amazon EKS:
# get the s3 bucket from CFN output
app_code_bucket=$(aws cloudformation describe-stacks --stack-name SparkOnEKS --query "Stacks[0].Outputs[?OutputKey=='CODEBUCKET'].OutputValue" --output text)

kubectl create -n spark configmap special-config --from-literal=codeBucket=$app_code_bucket
kubectl apply -f source/example/native-spark-job-scheduler.yaml

  1. Check the job status:
kubectl get pod -n spark

# the Spark cluster is running across two AZs.
kubectl get node \
--label-columns=eks.amazonaws.com/capacityType,topology.kubernetes.io/zone

# watch progress on SparkUI, if the job was submitted from local computer
kubectl port-forward word-count-driver 4040:4040 -n spark
# go to `localhost:4040` from your web browser
  1. Test fault tolerance and resiliency.

You can enable self-recovery with a simple retry mechanism. In Spark, we know the driver is a single point of failure: if a Spark driver dies, the entire application won’t survive. It often requires extra effort to set up a job rerun in order to provide fault tolerance. However, it’s much simpler on Amazon EKS, where only a few lines of retry declaration are needed. This works for both batch and streaming Spark applications.

  1. Simulate a Spot interruption scenario by manually deleting the EC2 instance running the driver:
# find the ec2 host name
kubectl describe pod word-count-driver -n spark

# replace the placeholder 
kubectl delete node <ec2_host_name>

# has the driver come back?
kubectl get pod -n spark

  1. Delete the executor exec-1 when it’s running:
kubectl get pod -n spark
exec_name=$(kubectl get pod -n spark | grep "exec-1" | awk '{print $1}')
kubectl delete -n spark pod $exec_name --force

# has it come back with a different number suffix? 
kubectl get pod -n spark
  1. Stop the job or rerun with different job configuration:
kubectl delete -f source/example/native-spark-job-scheduler.yaml

# modify the scheduler file and rerun
kubectl apply -f source/example/native-spark-job-scheduler.yaml

Clean up

To avoid incurring future charges, delete the resources generated if you don’t need the solution anymore.

Run the cleanup script with your CloudFormation stack name. The default name is SparkOnEKS:

cd sql-based-etl-on-amazon-eks/spark-on-eks
./deployment/delete_all.sh <OPTIONAL:stack_name>

On the AWS CloudFormation console, manually delete the remaining resources if needed.

Conclusion

To accelerate data innovation, improve time-to-insight and support business agility by advancing engineering productivity, this post introduces a declarative ETL option driven by an SQL-centric architecture. Managing applications declaratively in Kubernetes is a widely adopted best practice. You can use the same approach to build and deploy Spark applications with an open-source data processing framework or in-house built software to achieve the same productivity goal. Abstracting from Apache Spark and container technologies, you can build a modern data solution on AWS managed services simply and efficiently.

You can flexibly deploy Spark applications in multiple ways within your AWS environment. In this post, we demonstrated how to deploy an ETL pipeline on Amazon EKS. Another option is to leverage the optimized Spark runtime available in Amazon EMR. You can deploy the same solution via Amazon EMR on Amazon EKS. Switching to this deployment option is effortless and straightforward, and doesn’t need an application change or regression test.

Additional reading

For more information about running Apache Spark on Amazon EKS, see the following:


About the Authors

Melody Yang is a Senior Analytics Specialist Solution Architect at AWS with expertise in Big Data technologies. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

 

 

Avnish Jain is a Specialist Solution Architect in Analytics at AWS with experience designing and implementing scalable, modern data platforms on the cloud for large scale enterprises. He is passionate about helping customers build performant and robust data-driven solutions and realise their data & analytics potential.

 

 

Shiva Achari is a Senior Data Lab Architect at AWS. He helps AWS customers to design and build data and analytics prototypes via the AWS Data Lab engagement. He has over 14 years of experience working with enterprise customers and startups primarily in the Data and Big Data Analytics space.

TLS-enabled Kubernetes clusters with ACM Private CA and Amazon EKS

Post Syndicated from Param Sharma original https://aws.amazon.com/blogs/security/tls-enabled-kubernetes-clusters-with-acm-private-ca-and-amazon-eks-2/

In this blog post, we show you how to set up end-to-end encryption on Amazon Elastic Kubernetes Service (Amazon EKS) with AWS Certificate Manager Private Certificate Authority (ACM Private CA). For this example of end-to-end encryption, traffic originates from your client and terminates at the ingress controller that serves a sample app. By following the instructions in this post, you can set up an NGINX ingress controller on Amazon EKS. As part of the example, we show you how to configure an AWS Network Load Balancer (NLB) with HTTPS, using certificates issued via ACM Private CA, in front of the ingress controller.

AWS Private CA supports an open source plugin for cert-manager that offers a more secure certificate authority solution for Kubernetes containers. cert-manager is a widely-adopted solution for TLS certificate management in Kubernetes. Customers who use cert-manager for application certificate lifecycle management can now use this solution to improve security over the default cert-manager CA, which stores keys in plaintext in server memory. Customers with regulatory requirements for controlling access to and auditing their CA operations can use this solution to improve auditability and support compliance.

Solution components

  • Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
  • Amazon EKS is a managed service that you can use to run Kubernetes on Amazon Web Services (AWS) without needing to install, operate, and maintain your own Kubernetes control plane or nodes.
  • cert-manager is an add-on to Kubernetes that provides TLS certificate management. cert-manager requests certificates, distributes them to Kubernetes containers, and automates certificate renewal. cert-manager ensures certificates are valid and up-to-date, and attempts to renew certificates at an appropriate time before expiry.
  • ACM Private CA enables the creation of private CA hierarchies, including root and subordinate CAs, without the investment and maintenance costs of operating an on-premises CA. With ACM Private CA, you can issue certificates for authenticating internal users, computers, applications, services, servers, and other devices, and for signing computer code. The private keys for private CAs are stored in AWS managed hardware security modules (HSMs), which are FIPS 140-2 certified, providing a better security profile compared to the default CAs in Kubernetes. Private certificates help identify and secure communication between connected resources on private networks such as servers, mobile and IoT devices, and applications.
  • AWS Private CA Issuer plugin. Kubernetes containers and applications use digital certificates to provide secure authentication and encryption over TLS. With this plugin, cert-manager requests TLS certificates from Private CA. The integration supports certificate automation for TLS in a range of configurations, including at the ingress, on the pod, and mutual TLS between pods. You can use the AWS Private CA Issuer plugin with Amazon Elastic Kubernetes Service, self managed Kubernetes on AWS, and Kubernetes on-premises.
  • The AWS Load Balancer controller manages AWS Elastic Load Balancers for a Kubernetes cluster. The controller provisions the following resources.
    • An AWS Application Load Balancer (ALB) when you create a Kubernetes Ingress.
    • An AWS Network Load Balancer (NLB) when you create a Kubernetes Service of type LoadBalancer.

Different points for terminating TLS in Kubernetes

How and where you terminate your TLS connection depends on your use case, security policies, and need to comply with regulatory requirements. This section talks about four different use cases that are regularly used for terminating TLS. The use cases are illustrated in Figure 1 and described in the text that follows.

Figure 1: Terminating TLS at different points


  1. At the load balancer: The most common use case for terminating TLS at the load balancer level is to use publicly trusted certificates. This use case is simple to deploy and the certificate is bound to the load balancer itself. For example, you can use ACM to issue a public certificate and bind it with AWS NLB. You can learn more from How do I terminate HTTPS traffic on Amazon EKS workloads with ACM?
  2. At the ingress: If there is no strict requirement for end-to-end encryption, you can offload this processing to the ingress controller or the NLB. This helps you to optimize the performance of your workloads and make them easier to configure and manage. We examine this use case in this blog post.
  3. On the pod: In Kubernetes, a pod is the smallest deployable unit of computing and it encapsulates one or more applications. End-to-end encryption of the traffic from the client all the way to a Kubernetes pod provides a secure communication model where the TLS is terminated at the pod inside the Kubernetes cluster. This could be useful for meeting certain security requirements. You can learn more from the blog post Setting up end-to-end TLS encryption on Amazon EKS with the new AWS Load Balancer Controller.
  4. Mutual TLS between pods: This use case focuses on encryption in transit for data flowing inside Kubernetes cluster. For more details on how this can be achieved with Cert-manager using an Istio service mesh, please see the Securing Istio workloads with mTLS using cert-manager blog post. You can use the AWS Private CA Issuer plugin in conjunction with cert-manager to use ACM Private CA to issue certificates for securing communication between the pods.

In this blog post, we use a scenario where there is a requirement to terminate TLS at the ingress controller level, demonstrating the second example above.

Figure 2 provides an overall picture of the solution described in this blog post. The components and steps illustrated in Figure 2 are described fully in the sections that follow.

Figure 2: Overall solution diagram


Prerequisites

Before you start, you need the following:

Verify that you have the latest versions of these tools installed before you begin.

Provision an Amazon EKS cluster

If you already have a running Amazon EKS cluster, you can skip this step and move on to install NGINX Ingress.

You can use the AWS Management Console or AWS CLI, but this example uses eksctl to provision the cluster. eksctl is a tool that makes it easier to deploy and manage an Amazon EKS cluster.

This example uses the us-east-2 Region and the t2.medium node type. Select the node type and Region that are appropriate for your environment. Cluster provisioning takes approximately 15 minutes.

To provision an Amazon EKS cluster

  1. Run the following eksctl command to create an Amazon EKS cluster in the us-east-2 Region with Kubernetes version 1.19 and two nodes. You can change the Region to the one that best fits your use case.
    eksctl create cluster \
    --name acm-pca-lab \
    --version 1.19 \
    --nodegroup-name acm-pca-nlb-lab-workers \
    --node-type t2.medium \
    --nodes 2 \
    --region us-east-2
    

  2. Once your cluster has been created, verify that your cluster is running correctly by running the following command:
    $ kubectl get pods --all-namespaces
    NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE
    kube-system   aws-node-t94rp             1/1     Running   0          3m4s
    kube-system   aws-node-w7dm6             1/1     Running   0          3m19s
    kube-system   coredns-56b458df85-6tgjl   1/1     Running   0          10m
    kube-system   coredns-56b458df85-8gp94   1/1     Running   0          10m
    kube-system   kube-proxy-2pjx7           1/1     Running   0          3m19s
    kube-system   kube-proxy-hz8wq           1/1     Running   0          3m4s 
    

You should see output similar to the above, with all pods in a running state.

Install NGINX Ingress

NGINX Ingress is built around the Kubernetes Ingress resource, using a ConfigMap to store the NGINX configuration.
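
If you want to inspect or tune that configuration after the installation below, you can look at the controller's ConfigMap; the resource name assumes the standard manifest used in this post:

kubectl get configmap -n ingress-nginx
kubectl describe configmap ingress-nginx-controller -n ingress-nginx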

To install NGINX Ingress

  1. Use the following command to install NGINX Ingress:
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.46.0/deploy/static/provider/aws/deploy.yaml
    

  2. Run the following command to determine the address that AWS has assigned to your NLB:
    $ kubectl get service -n ingress-nginx
    NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)                      AGE
    ingress-nginx-controller             LoadBalancer   10.100.214.10   a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com   80:32598/TCP,443:30624/TCP   14s
    ingress-nginx-controller-admission   ClusterIP      10.100.118.1    <none>                                                                          443/TCP                      14s
    
    

  3. It can take up to 5 minutes for the load balancer to be ready. Once the external IP is created, run the following command to verify that traffic is being correctly routed to ingress-nginx:
    curl http://a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com
    <html>
    <head><title>404 Not Found</title></head>
    <body>
    <center><h1>404 Not Found</h1></center>
    <hr><center>nginx</center>
    </body>
    </html>
    

Note: Even though it returns an HTTP 404 error code, curl is still reaching the ingress controller and getting the expected HTTP response back.

Configure your DNS records

Once your load balancer is provisioned, the next step is to point the application’s DNS record to the URL of the NLB.

You can use your DNS provider’s console, for example Amazon Route 53, and set a CNAME record pointing to your NLB. See CNAME record type for more details on how to set up a CNAME record using Route 53.

This scenario uses the sample domain rsa-2048.example.com.

rsa-2048.example.com CNAME a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com

As you go through the scenario, replace rsa-2048.example.com with your registered domain.
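
If your zone is hosted in Route 53, the following is a hedged CLI alternative to the console; the hosted zone ID is a placeholder and the record values match the example above:

cat > cname.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "rsa-2048.example.com",
      "Type": "CNAME",
      "TTL": 300,
      "ResourceRecords": [{"Value": "a3ebe22e7ca0522d1123456fbc92605c-8ac7f1d49be2fc42.elb.us-east-2.amazonaws.com"}]
    }
  }]
}
EOF
aws route53 change-resource-record-sets \
  --hosted-zone-id <HOSTED_ZONE_ID> \
  --change-batch file://cname.json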

Install cert-manager

cert-manager is a Kubernetes add-on that you can use to automate the management and issuance of TLS certificates from various issuing sources. It runs within your Kubernetes cluster and will ensure that certificates are valid and attempt to renew certificates at an appropriate time before they expire.

You can use the regular installation on Kubernetes guide to install cert-manager on Amazon EKS.
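
A typical installation applies the static manifest from the cert-manager releases and then confirms that the pods are running. The version below is only an example; pin the version recommended by the installation guide:

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.3.1/cert-manager.yaml
kubectl get pods --namespace cert-manager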

After you’ve deployed cert-manager, you can verify the installation by following these instructions. If all the above steps have completed without error, you’re good to go!

Note: If you’re planning to use Amazon EKS with Kubernetes pods running on AWS Fargate, please follow the cert-manager Fargate instructions to make sure cert-manager installation works as expected. AWS Fargate is a technology that provides on-demand, right-sized compute capacity for containers.

Install aws-privateca-issuer

The AWS Private CA Issuer plugin acts as an add-on (see external cert configuration) to cert-manager that signs certificate requests using ACM Private CA.

To install aws-privateca-issuer

  1. For installation, use the following helm commands:
    kubectl create namespace aws-pca-issuer
    
    helm repo add awspca https://cert-manager.github.io/aws-privateca-issuer
    helm repo update
    helm install awspca/aws-pca-issuer --generate-name --namespace aws-pca-issuer
    

  2. Verify that the AWS Private CA Issuer is configured correctly by running the following command and ensure that it is in READY state with status as Running:
    $ kubectl get pods --namespace aws-pca-issuer
    NAME                                         READY   STATUS    RESTARTS   AGE
    aws-pca-issuer-1622570742-56474c464b-j6k8s   1/1     Running   0          21s
    

  3. You can check the chart configuration in the default values file.

Create an ACM Private CA

In this scenario, you create a private certificate authority in ACM Private CA with RSA 2048 selected as the key algorithm. You can create a CA using the AWS console, AWS CLI, or AWS CloudFormation.

To create an ACM Private CA
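
As a hedged CLI sketch, the following creates a root CA with an RSA 2048 key; the subject common name and Region are examples, and the new CA must still be activated by installing its root CA certificate before it can issue certificates:

aws acm-pca create-certificate-authority \
  --certificate-authority-type ROOT \
  --certificate-authority-configuration '{
      "KeyAlgorithm": "RSA_2048",
      "SigningAlgorithm": "SHA256WITHRSA",
      "Subject": {"CommonName": "acm-pca-lab.example.com"}
    }' \
  --region us-east-2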

Download the CA certificate using the following command. Replace the <CA_ARN> and <Region> values with the values from the CA you created earlier and save it to a file named cacert.pem:

aws acm-pca get-certificate-authority-certificate --certificate-authority-arn <CA_ARN> --region <Region> --output text > cacert.pem

Once your private CA is active, you can proceed to the next step. Your private CA will look similar to the CA in Figure 3.

Figure 3: Sample ACM Private CA


Set EKS node permission for ACM Private CA

In order to issue a certificate from ACM Private CA, add the IAM policy from the prerequisites to your EKS NodeInstanceRole. Replace the <CA_ARN> value with the value from the CA you created earlier:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "awspcaissuerpolicy",
            "Effect": "Allow",
            "Action": [
                "acm-pca:GetCertificate",
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:IssueCertificate"
            ],
            "Resource": "<CA_ARN>"
        }
        
    ]
}
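
For example, if you save the policy above as pca-policy.json, you can attach it as an inline policy with the following command; the role name is a placeholder for the NodeInstanceRole created by eksctl for your node group:

aws iam put-role-policy \
  --role-name <EKS_NodeInstanceRole_name> \
  --policy-name awspcaissuerpolicy \
  --policy-document file://pca-policy.json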

Create an Issuer in Amazon EKS

Now that the ACM Private CA is active, you can begin requesting private certificates that can be used by Kubernetes applications. Use the aws-privateca-issuer plugin to create a ClusterIssuer, which will be used with ACM Private CA to issue certificates.

Issuers (and ClusterIssuers) represent a certificate authority from which signed x509 certificates can be obtained, such as ACM Private CA. You need at least one Issuer or ClusterIssuer before you can start requesting certificates in your cluster. There are two custom resources that can be used to create an Issuer inside Kubernetes using the aws-privateca-issuer add-on:

  • AWSPCAIssuer is a regular namespaced issuer that can be used as a reference in your Certificate custom resources.
  • AWSPCAClusterIssuer is specified in exactly the same way, but it doesn’t belong to a single namespace and can be referenced by certificate resources from multiple different namespaces.

To create an Issuer in Amazon EKS

  1. For this scenario, you create an AWSPCAClusterIssuer. Start by creating a file named cluster-issuer.yaml and save the following text in it, replacing <CA_ARN> and <Region> information with your own.
    apiVersion: awspca.cert-manager.io/v1beta1
    kind: AWSPCAClusterIssuer
    metadata:
      name: demo-test-root-ca
    spec:
      arn: <CA_ARN>
      region: <Region>
    

  2. Deploy the AWSPCAClusterIssuer:
    kubectl apply -f cluster-issuer.yaml
    

  3. Verify the installation and make sure that the following command returns a Kubernetes service of kind AWSPCAClusterIssuer:
    $ kubectl get AWSPCAClusterIssuer
    NAME                AGE
    demo-test-root-ca   51s
    

Request the certificate

Now, you can begin requesting certificates from the provisioned issuer for use by Kubernetes applications. For more details on how to specify and request Certificate resources, please check the Certificate Resources guide.

To request the certificate

  1. As a first step, create a new namespace that contains your application and secret:
    $ kubectl create namespace acm-pca-lab-demo
    namespace/acm-pca-lab-demo created
    

  2. Next, create a basic X509 private certificate for your domain.
    Create a file named rsa-2048.yaml and save the following text in it. Replace rsa-2048.example.com with your domain.
kind: Certificate
apiVersion: cert-manager.io/v1
metadata:
  name: rsa-cert-2048
  namespace: acm-pca-lab-demo
spec:
  commonName: www.rsa-2048.example.com
  dnsNames:
    - www.rsa-2048.example.com
    - rsa-2048.example.com
  duration: 2160h0m0s
  issuerRef:
    group: awspca.cert-manager.io
    kind: AWSPCAClusterIssuer
    name: demo-test-root-ca
  renewBefore: 360h0m0s
  secretName: rsa-example-cert-2048
  usages:
    - server auth
    - client auth
  privateKey:
    algorithm: "RSA"
    size: 2048

 

  • For a certificate with a key algorithm of RSA 2048, create the resource:
    kubectl apply -f rsa-2048.yaml -n acm-pca-lab-demo
    

  • Verify that the certificate is issued and in READY state by running the following command:
    $ kubectl get certificate -n acm-pca-lab-demo
    NAME            READY   SECRET                  AGE
    rsa-cert-2048   True    rsa-example-cert-2048   12s
    

  • Run the command kubectl describe certificate -n acm-pca-lab-demo to check the progress of your certificate.
  • Once the certificate status shows as issued, you can use the following command to check the issued certificate details:
    kubectl get secret rsa-example-cert-2048 -n acm-pca-lab-demo -o 'go-template={{index .data "tls.crt"}}' | base64 --decode | openssl x509 -noout -text
    

 

Deploy a demo application

For the purpose of this scenario, you can create a new service: a simple “hello world” website that uses echoheaders to respond with the HTTP request headers along with some cluster details.

To deploy a demo application

  1. Create a new file named hello-world.yaml with the following content:
    apiVersion: v1
    kind: Service
    metadata:
      name: hello-world
      namespace: acm-pca-lab-demo
    spec:
      type: ClusterIP
      ports:
      - port: 80
        targetPort: 8080
      selector:
        app: hello-world
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hello-world
      namespace: acm-pca-lab-demo
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: hello-world
      template:
        metadata:
          labels:
            app: hello-world
        spec:
          containers:
          - name: echoheaders
            image: k8s.gcr.io/echoserver:1.10
            args:
            - "-text=Hello World"
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                cpu: 100m
                memory: 100Mi
            ports:
            - containerPort: 8080
    

  2. Create the service using the following command:
    $ kubectl apply -f hello-world.yaml
    

Expose and secure your application

Now that you’ve issued a certificate, you can expose your application using a Kubernetes Ingress resource.

To expose and secure your application

  1. Create a new file called example-ingress.yaml and add the following content:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: acm-pca-demo-ingress
      namespace: acm-pca-lab-demo
      annotations:
        kubernetes.io/ingress.class: "nginx"
    spec:
      tls:
      - hosts:
        - www.rsa-2048.example.com
        secretName: rsa-example-cert-2048
      rules:
      - host: www.rsa-2048.example.com
        http:
          paths:
          - path: /
            pathType: Exact
            backend:
              service:
                name: hello-world
                port:
                  number: 80
    

  2. Create a new Ingress resource by running the following command:
    kubectl apply -f example-ingress.yaml 
    

Access your application using TLS

After completing the previous step, you can access this service from any computer connected to the internet.

To access your application using TLS

  1. Log in to a terminal window on a machine that has access to the internet, and run the following:
    $ curl https://rsa-2048.example.com --cacert cacert.pem 
    

  2. You should see an output similar to the following:
    Hostname: hello-world-d8fbd49c6-9bczb
    
    Pod Information:
    	-no pod information available-
    
    Server values:
    	server_version=nginx: 1.13.3 - lua: 10008
    
    Request Information:
    	client_address=192.162.32.64
    	method=GET
    	real path=/
    	query=
    	request_version=1.1
    	request_scheme=http
    	request_uri=http://rsa-2048.example.com:8080/
    
    Request Headers:
    	accept=*/*
    	host=rsa-2048.example.com
    	user-agent=curl/7.62.0
    	x-forwarded-for=52.94.2.2
    	x-forwarded-host=rsa-2048.example.com
    	x-forwarded-port=443
    	x-forwarded-proto=https
    	x-real-ip=52.94.2.2
    	x-request-id=371b6fc15a45d189430281693cccb76e
    	x-scheme=https
    
    Request Body:
    	-no body in request-…
    

    This response is returned from the service running behind the Kubernetes Ingress controller and demonstrates that a successful TLS handshake happened at port 443 with https protocol.

  3. You can use the following command to verify that the certificate issued earlier is being used for the SSL handshake:
    echo | openssl s_client -showcerts -servername www.rsa-2048.example.com -connect www.rsa-2048.example.com:443 2>/dev/null | openssl x509 -inform pem -noout -text
    

Cleanup

To avoid incurring future charges on your AWS account, perform the following steps to remove the scenario.

Delete the ACM Private CA

You can delete the ACM Private CA by following the instructions in Deleting your private CA.

As an alternative, you can use the following commands to delete the ACM Private CA, replacing the <CA_ARN> and <Region> with your own:

  1. Disable the CA.
    aws acm-pca update-certificate-authority \
    --certificate-authority-arn <CA_ARN> \
    --region <Region> \
    --status DISABLED
    

  2. Call the Delete Certificate Authority API:
    aws acm-pca delete-certificate-authority \
    --certificate-authority-arn <CA_ARN> \
    --region <Region> \
    --permanent-deletion-time-in-days 7
    

Continue the cleanup

Once the ACM Private CA has been deleted, continue the cleanup by running the following commands.

  1. Delete the services:
    kubectl delete -f hello-world.yaml
    

  2. Delete the Ingress controller:
    kubectl delete -f example-ingress.yaml
    

  3. Delete the IAM NodeInstanceRole; replace the role name with the EKS node instance role created for the demo:
    aws iam delete-role --role-name eksctl-acm-pca-lab-nodegroup-acm-pca-nlb-lab-workers-NodeInstanceProfile-XXXXXXX
    

  4. Delete the Amazon EKS cluster using the eksctl command:
    eksctl delete cluster acm-pca-lab --region us-east-2
    

You can also clean up from the AWS CloudFormation console by deleting the stacks named eksctl-acm-pca-lab-nodegroup-acm-pca-nlb-lab-workers and eksctl-acm-pca-lab-cluster.

Conclusion

In this blog post, we showed you how to set up a Kubernetes Ingress controller with a service running in Amazon EKS cluster using AWS Load Balancer Controller with Network Load Balancer and set up HTTPS using private certificates issued by ACM Private CA. If you have questions or want to contribute, join the aws-privateca-issuer add-on project on GitHub.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Param Sharma

Param is a Senior Software Engineer with AWS. She is passionate about PKI, security, and privacy. She works with AWS customers to design, deploy, and manage their PKI infrastructures, helping customers improve their security, risk, and compliance in the cloud. In her spare time, she enjoys traveling, reading, and watching movies.

Author

Arindam Chatterji

Arindam is a Senior Solutions Architect with AWS SMB.

Customize and Package Dependencies With Your Apache Spark Applications on Amazon EMR on Amazon EKS

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/

Last AWS re:Invent, we announced the general availability of Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), a new deployment option for Amazon EMR that allows customers to automate the provisioning and management of Apache Spark on Amazon EKS.

With Amazon EMR on EKS, customers can deploy EMR applications on the same Amazon EKS cluster as other types of applications, which allows them to share resources and standardize on a single solution for operating and managing all their applications. Customers running Apache Spark on Kubernetes can migrate to EMR on EKS and take advantage of the performance-optimized runtime, integration with Amazon EMR Studio for interactive jobs, integration with Apache Airflow and AWS Step Functions for running pipelines, and Spark UI for debugging.

When customers submit jobs, EMR automatically packages the application into a container with the big data framework and provides prebuilt connectors for integrating with other AWS services. EMR then deploys the application on the EKS cluster and manages running the jobs, logging, and monitoring. If you currently run Apache Spark workloads and use Amazon EKS for other Kubernetes-based applications, you can use EMR on EKS to consolidate these on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.

Developers who run containerized, big data analytical workloads told us they just want to point to an image and run it. Currently, EMR on EKS dynamically adds externally stored application dependencies during job submission.

Today, I am happy to announce customizable image support for Amazon EMR on EKS that allows customers to modify the Docker runtime image that runs their analytics application using Apache Spark on their EKS cluster.

With customizable images, you can create a container that contains both your application and its dependencies, based on the performance-optimized EMR Spark runtime, using your own continuous integration (CI) pipeline. This reduces the time to build the image and makes container launches more predictable for local development and testing.

Now, data engineers and platform teams can create a base image, add their corporate standard libraries, and then store it in Amazon Elastic Container Registry (Amazon ECR). Data scientists can customize the image to include their application-specific dependencies. The resulting immutable image can be scanned for vulnerabilities and deployed to test and production environments. Developers can then simply point to the customized image and run it on EMR on EKS.

Customizable Runtime Images – Getting Started
To get started with customizable images, use the AWS Command Line Interface (AWS CLI) to perform these steps:

  1. Register your EKS cluster with Amazon EMR (see the registration sketch after this list).
  2. Download the EMR-provided base images from Amazon ECR and modify the image with your application and libraries.
  3. Publish your customized image to a Docker registry such as Amazon ECR and then submit your job while referencing your image.
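
For step 1, registering the cluster means creating an EMR virtual cluster that points at a namespace in your EKS cluster. The following is a minimal sketch, assuming a hypothetical EKS cluster named my-eks-cluster and a Kubernetes namespace named spark that has already been prepared for EMR:

$ aws emr-containers create-virtual-cluster \
    --name my-virtual-cluster \
    --container-provider '{
        "id": "my-eks-cluster",
        "type": "EKS",
        "info": { "eksInfo": { "namespace": "spark" } }
    }'

The virtual cluster ID returned by this call is the value you pass later as --virtual-cluster-id when submitting jobs.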

You can download one of the following base images. These images contain the Spark runtime that can be used to run batch workloads using the EMR Jobs API. Here is the latest full list of available images:

Release Label     | Spark and Hadoop Versions                 | Base Image Tag
emr-5.32.0-latest | Spark 2.4.7 + Hadoop 2.10.1               | emr-5.32.0-20210129
emr-5.33-latest   | Spark 2.4.7-amzn-1 + Hadoop 2.10.1-amzn-1 | emr-5.33.0-20210323
emr-6.2.0-latest  | Spark 3.0.1 + Hadoop 3.2.1                | emr-6.2.0-20210129
emr-6.3-latest    | Spark 3.1.1-amzn-0 + Hadoop 3.2.1-amzn-3  | emr-6.3.0:latest

These base images are located in an Amazon ECR repository in each AWS Region. The image URI combines the ECR registry account, the AWS Region code, and the base image tag; for example, in the US East (N. Virginia) Region:

755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-5.32.0-20210129

Now, sign in to the Amazon ECR repository and pull the image into your local workspace. If you want to reduce network latency, pull the image from the ECR repository in the AWS Region closest to you. The following example pulls the image from the US West (Oregon) Region.

$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
$ docker pull 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129

Create a Dockerfile in your local workspace that starts from the EMR-provided base image, and add commands to customize the image. If the application requires a custom Java SDK or custom Python or R libraries, you can add them to the image directly, just as with other containerized applications.

The following example Dockerfile is for a use case in which you want to install Python libraries for natural language processing (NLP), such as Spark NLP, along with PySpark and pandas.

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129
USER root
### Add customizations here ####
# Install Python NLP libraries
RUN pip3 install pyspark pandas spark-nlp
USER hadoop:hadoop

In another use case, as I mentioned, you can install a different version of Java (for example, Java 11):

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129
USER root
### Add customizations here ####
# Install Java 11 and set JAVA_HOME
RUN yum install -y java-11-amazon-corretto
ENV JAVA_HOME /usr/lib/jvm/java-11-amazon-corretto.x86_64
USER hadoop:hadoop

If you change the Java version to 11, you also need to change the Java Virtual Machine (JVM) options for Spark. Provide the following options in applicationConfiguration when you submit jobs. You need these options because Java 11 does not support some Java 8 JVM parameters.

"applicationConfiguration": [ 
  {
    "classification": "spark-defaults",
    "properties": {
        "spark.driver.defaultJavaOptions" : "
		    -XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70",
        "spark.executor.defaultJavaOptions" : "
		    -verbose:gc -Xlog:gc*::time -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
			-XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70 
			-XX:+IgnoreUnrecognizedVMOptions"
    }
  }
]

To use custom images with EMR on EKS, publish your customized image and submit a Spark workload in Amazon EMR on EKS using the available Spark parameters.
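
Building and publishing the image follows the standard Docker and Amazon ECR workflow. Here is a minimal sketch, assuming a hypothetical repository named emr5.32_custom in account 123456789012 and the us-west-2 Region (the same image URI referenced in the job submission below):

$ docker build -t 123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom .   # build from the Dockerfile above
$ aws ecr create-repository --repository-name emr5.32_custom --region us-west-2   # run once; skip if the repository already exists
$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com
$ docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom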

You can submit batch workloads using your customized Spark image. To submit batch workloads using the StartJobRun API or CLI, use the spark.kubernetes.container.image parameter.

# <base-release-label> is the base EMR release label used to build the custom image
$ aws emr-containers start-job-run \
    --virtual-cluster-id <enter-virtual-cluster-id> \
    --name sample-job-name \
    --execution-role-arn <enter-execution-role-arn> \
    --release-label <base-release-label> \
    --job-driver '{
        "sparkSubmitJobDriver": {
          "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
          "entryPointArguments": ["1000"],
          "sparkSubmitParameters": "--class org.apache.spark.examples.SparkPi --conf spark.kubernetes.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom"
        }
    }'

Use the kubectl command to confirm the job is running your custom image.

$ kubectl get pod -n <namespace> | grep "driver" | awk '{print $1}'
Example output: k8dfb78cb-a2cc-4101-8837-f28befbadc92-1618856977200-driver

Get the image of the main container in the driver pod (this uses jq):

$ kubectl get pod/<driver-pod-name> -n <namespace> -o json | jq '.spec.containers | .[] | select(.name=="spark-kubernetes-driver") | .image'
Example output: 123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom

To view jobs in the Amazon EMR console, under EMR on EKS, choose Virtual clusters. From the list of virtual clusters, select the virtual cluster for which you want to view logs. On the Job runs table, select View logs to view the details of a job run.
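
You can also check job runs from the CLI (a sketch, assuming the same virtual cluster ID used when submitting the job):

$ aws emr-containers list-job-runs --virtual-cluster-id <enter-virtual-cluster-id>
$ aws emr-containers describe-job-run --virtual-cluster-id <enter-virtual-cluster-id> --id <job-run-id>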

Automating Your CI Process and Workflows
You can now customize an EMR-provided base image to include an application to simplify application development and management. With custom images, you can add the dependencies using your existing CI process, which allows you to create a single immutable image that contains the Spark application and all of its dependencies.

You can apply your existing development processes, such as vulnerability scans against your Amazon EMR image. You can also validate for correct file structure and runtime versions using the EMR validation tool, which can be run locally or integrated into your CI workflow.

The APIs for Amazon EMR on EKS are integrated with orchestration services like AWS Step Functions and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), allowing you to include EMR custom images in your automated workflows.

Now Available
You can now set up customizable images in all AWS Regions where Amazon EMR on EKS is available. There is no additional charge for custom images. To learn more, see the Amazon EMR on EKS Development Guide and a demo video on how to build your own images for running Spark jobs on Amazon EMR on EKS.

You can send feedback to the AWS forum for Amazon EMR or through your usual AWS support contacts.

Channy