All posts by Hendrik Schoeneberg

How to Run Massively Scalable ADAS Simulation Workloads on CAEdge

Post Syndicated from Hendrik Schoeneberg original https://aws.amazon.com/blogs/architecture/how-to-run-massively-scalable-adas-simulation-workloads-on-caedge/

This post was co-written by Hendrik Schoeneberg, Sr. Global Big Data Architect, The An Binh Nguyen, Product Owner for Cloud Simulation at Continental, Autonomous Mobility – Engineering Platform, Rumeshkrishnan Mohan, Global Big Data Architect, and Junjie Tang, Principal Consultant at AWS Professional Services.

AV/ADAS simulations processing large-scale field sensor data such as radar, lidar, and high-resolution video come with many challenges. Typically, the simulation workloads are spiky, with occasional but high compute demands, so the platform must scale up or down elastically to match the compute requirements. The platform must be flexible enough to integrate specialized ADAS simulation software, use distributed computing or HPC frameworks, and leverage GPU-accelerated compute resources where needed.

Continental created the Continental Automotive Edge (CAEdge) Framework to address these challenges. It is a modular multi-tenant hardware and software framework that connects the vehicle to the cloud. You can learn more about this in Developing a Platform for Software-defined Vehicles with Continental Automotive Edge (CAEdge) and Developing a platform for software-defined vehicles (re:Invent session-AUT304).

In this blog post, we’ll illustrate how the CAEdge Framework orchestrates ADAS simulation workloads with Amazon Managed Workflows for Apache Airflow (MWAA). We’ll show how it delegates the high-performance workloads to AWS Batch for elastic, highly scalable, and customizable compute needs. We’ll showcase the “bring your own software-in-the-loop” (BYO-SIL) pattern, detailing how to leverage specialized and proprietary simulation software in your workflow. We’ll also demonstrate how to integrate the simulation platform with tenant data in the CAEdge framework. This orchestrate and delegate pattern has previously been introduced in Field Notes: Deploying Autonomous Driving and ADAS Workloads at Scale with Amazon Managed Workflows for Apache Airflow and the Reference Architecture for Autonomous Driving Data Lake.

Solution overview for autonomous driving simulation

The following diagram shows a high-level overview of this solution.

Figure 1. Architecture diagram for autonomous driving simulation

It can be broken down into five major parts, illustrated in Figure 1.

  1. Simulation API: Amazon API Gateway (label 1) provides a REST API through which authenticated users schedule or monitor simulation runs on Amazon Managed Workflows for Apache Airflow (MWAA), with AWS Lambda handling the requests (label 2). A sketch of this trigger path, with placeholder names, follows the list.
  2. Simulation control plane: The simulation control plane (label 3) built on MWAA enables users to design and initiate workflows and integrate with AWS services.
  3. Scalable compute backend: We leverage the parallelization and elastic scalability capabilities of AWS Batch to distribute the workload (label 4). Additionally, we want to be able to run proprietary software components as part of the simulation workflows and use highly customizable Amazon EC2 compute environments. For example, we can use GPU acceleration or Graviton-based instances for workloads that must run on ARM, instead of x86 architectures.
  4. Autonomous Driving Data Lake (ADDL) integration: The simulations’ input and output data are stored in a data lake (label 5) on Amazon S3. To provide efficient read and write access, data is copied onto RAID0-bundled ephemeral instance storage drives before each simulation run. After the simulation, results are written back to the data lake and are ready for reporting and analytics. We use AWS Lake Formation for metadata storage, data cataloging, and permission handling.
  5. Automated deployment: The solution architecture’s deployment is fully automated using a CI/CD pipeline in AWS CodePipeline and AWS CodeBuild (label 6). We can then perform automated testing and deployment to multiple target environments.
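To illustrate the trigger path from the simulation API into the control plane, here is a minimal sketch of a Lambda handler that forwards an API Gateway request to MWAA. The environment name, DAG ID, and request format are placeholder assumptions rather than the CAEdge implementation; the MWAA CLI token endpoint used below is the documented way to trigger Airflow DAGs programmatically, but verify it against your Airflow version.

import json

import boto3
import urllib3

http = urllib3.PoolManager()
mwaa = boto3.client("mwaa")

# Placeholder names for illustration only.
MWAA_ENV_NAME = "caedge-simulation-env"
DAG_ID = "adas_simulation"


def handler(event, context):
    """Triggered by API Gateway; forwards the request body as the DAG run conf."""
    conf = json.loads(event.get("body") or "{}")

    # Exchange AWS credentials for a short-lived Airflow CLI token.
    token = mwaa.create_cli_token(Name=MWAA_ENV_NAME)

    # Invoke the Airflow CLI endpoint exposed by the MWAA web server.
    response = http.request(
        "POST",
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        body=f"dags trigger {DAG_ID} --conf '{json.dumps(conf)}'",
    )
    return {"statusCode": response.status}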

In this blog we will focus on the simulation control plane, scalable compute backend, and ADDL integration.

Simulation control plane

In a typical simulation, there are tasks like data movement (copying input data and persisting the results). These steps can require high levels of parallelization or GPU support. In addition, they can involve third-party or proprietary software, specialized runtime environments, or architectures like Arm. To meet all task requirements, we delegate task initiation to the AWS services that best match those requirements, while an orchestration layer ensures the tasks run in the correct sequence. Decoupling simulation orchestration from task initiation follows the key design principle of separation of concerns and lets the architecture be adapted to specific requirements.

For the high-level task orchestration, we’ll introduce a simulation control plane built on MWAA. By modeling all simulation tasks in a directed acyclic graph (DAG), MWAA performs scheduling and task initiation in the correct sequence and handles the integration with AWS services. You can intuitively monitor and debug the progress of a simulation run across different services and networks. For the workflow within CAEdge, we’ll use AWS Batch to run simulations that are packaged as Docker containers.

Parameterizable workflows: Airflow supports parameterizing DAGs: you can provide a JSON object when triggering a DAG run and access it at runtime. This enables us to create a simulation execution framework in which we model the simulation steps and specify the simulation’s parameters, such as which Docker container to execute, its runtime configuration, and the input and output data locations (see Figures 2 and 3).

Figure 2. Specify simulations’ parameters in JSON

Figure 3. Read simulations’ parameters from JSON
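As a minimal sketch of what Figures 2 and 3 show, the following DAG (Airflow 2.x imports) reads its parameters from the JSON object supplied at trigger time. The parameter names and default values are illustrative assumptions, not the CAEdge schema.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def start_simulation(**context):
    """Read the simulation parameters from the JSON conf passed at trigger time."""
    conf = context["dag_run"].conf or {}
    image = conf.get("sim_image", "<account>.dkr.ecr.<region>.amazonaws.com/sim:latest")
    input_prefix = conf.get("input_prefix", "s3://addl-bucket/recordings/run-001/")
    output_prefix = conf.get("output_prefix", "s3://addl-bucket/results/run-001/")
    # The real workflow would submit an AWS Batch job with these values.
    print(f"Simulating {input_prefix} with {image}, writing results to {output_prefix}")


with DAG(
    dag_id="adas_simulation",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered on demand with a JSON conf object
    catchup=False,
) as dag:
    PythonOperator(task_id="start_simulation", python_callable=start_simulation)

A run could then be triggered with, for example, airflow dags trigger adas_simulation --conf '{"sim_image": "...", "input_prefix": "...", "output_prefix": "..."}'.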

Scalable compute backend

Publishing the Docker image: To run a simulation, you must provide a Docker container image in Amazon Elastic Container Registry (ECR). You can push images to ECR manually using the AWS Command Line Interface (AWS CLI) or automatically through a CI/CD pipeline.

Choosing the compute option: Containers at AWS describes the service options for developers. Various factors can contribute to your decision-making. For the CAEdge platform, we want to run thousands of containers with fine-grained control over the underlying compute instances’ configuration, and we want to run containers in privileged mode, so AWS Batch on EC2 is a great match. Figure 4 outlines the compute backend’s architecture.

Figure 4. Compute backend architecture diagram

To run a Docker container on AWS Batch, we need three components: a job definition, a compute environment, and a job queue. The AWS Batch job definition specifies which job to run and how to run it. For example, we can define the Docker image to use and specify the vCPU and memory configuration. Jobs are submitted to a job queue, which in turn is linked to one or more compute environments. The compute environment specifies the compute configuration used to run a containerized workload, including instance types and storage configurations. Creating a compute environment describes this in more detail. In the section ‘ADDL integration,’ we’ll describe how to select and configure the EC2 container instances to maximize network and disk I/O.

The AWS Batch array job feature can spin up thousands of independent, but similarly configured jobs. Instead of submitting a single job for each input file from the simulation control plane, we can submit a single array job that receives the collection of input files as its input. AWS Batch then spins up child jobs to process each entry of the collection in parallel, drastically reducing operational overhead.

With these components in place, the simulation control plane can submit a job against the job definition to a job queue, and the job will run on the corresponding compute environment. A sketch of such a submission, with placeholder names, follows.
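The sketch below uses the boto3 AWS Batch API to show how such a submission could look; the job name, queue, job definition, and manifest location are placeholders, and the array size would be derived from the number of recordings in the request.

import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="adas-simulation-run-0042",         # placeholder run identifier
    jobQueue="simulation-job-queue",            # placeholder queue name
    jobDefinition="simulation-job-definition",  # placeholder job definition
    arrayProperties={"size": 250},              # one child job per recording
    containerOverrides={
        "environment": [
            # Each child job combines this manifest with the AWS_BATCH_JOB_ARRAY_INDEX
            # variable (set automatically by AWS Batch) to pick its own recording.
            {"name": "INPUT_MANIFEST", "value": "s3://addl-bucket/manifests/run-0042.json"},
        ]
    },
)
print("Submitted array job:", response["jobId"])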

ADDL integration

In the CAEdge platform, all simulation recordings and their metadata are stored and cataloged in the data lake. On average, a single recording is around 100 GB in size, and simulation requests contain 100–300 recordings. A simulation request can therefore require 10–30 TB of data movement from S3 to the containers before the simulation starts. The containers’ performance directly depends on I/O performance, as the input data is read and processed during execution. To maximize the performance of both data transfer and the simulation workload, we need a storage option that maximizes network and disk I/O throughput.

Choosing the storage option: For CAEdge’s simulation workloads, the storage solution acts as a temporary scratch space for the simulation containers’ input data. It should maximize I/O throughput. These requirements are met in the most cost-efficient way by choosing EC2 instances of the M5d family and their attached instance storage.

M5d instances are general-purpose instances with NVMe-based SSDs that are physically connected to the host server and provide block-level storage coupled to the lifetime of the instance. As described in the previous section, we configured AWS Batch to create compute environments using EC2 M5d instances. While other storage options like Amazon Elastic File System (EFS) and Amazon Elastic Block Store (EBS) can scale to match even demanding throughput scenarios, the simulation benefits from a high-performance, temporary storage location that is directly attached to the host.

Bundling the volumes: M5d instances can have multiple NVMe drives attached, depending on their size and storage configuration. To provide a single, stable storage location for the simulation containers, we bundle all attached NVMe SSDs into a single RAID0 volume at instance launch using a modified user data script. With this, we can provide a fixed storage location that the simulation containers can access. Additionally, I/O operations are distributed evenly across all disks in the RAID0 array, which improves read and write performance.
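The following is a minimal sketch of such a user data script, wrapped in an EC2 launch template that a Batch compute environment could reference. The template name, mount point, and the simple device filter are assumptions for illustration; AWS Batch expects launch template user data in MIME multi-part format.

import base64

import boto3

# Bash script that bundles all instance-store NVMe volumes into one RAID0 array.
# Skipping nvme0n1 (the root volume) is a simplification; a production script
# would identify instance-store devices more robustly.
USER_DATA = """MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==BOUNDARY=="

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -euo pipefail
DEVICES=$(ls /dev/nvme*n1 | grep -v nvme0n1)
COUNT=$(echo "$DEVICES" | wc -w)
mdadm --create /dev/md0 --level=0 --raid-devices="$COUNT" $DEVICES
mkfs.ext4 /dev/md0
mkdir -p /scratch
mount /dev/md0 /scratch   # fixed path the simulation containers use as scratch space
--==BOUNDARY==--
"""

ec2 = boto3.client("ec2")
ec2.create_launch_template(
    LaunchTemplateName="m5d-raid0-scratch",  # referenced by the Batch compute environment
    LaunchTemplateData={"UserData": base64.b64encode(USER_DATA.encode()).decode()},
)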

KPIs used for measurement

The SIL (software-in-the-loop) factor tells you how long a simulation takes compared to the duration of the recorded data.

SIL factor example:

  • For a recording with 60 minutes of data and a simulation duration of 120 minutes, the resulting SIL factor is 120 / 60 = 2. The simulation runs twice as long as real time.
  • For a recording with 60 minutes of data and a simulation duration of 60 minutes, the resulting SIL factor is 60 / 60 = 1. The simulation runs in real time.

Similarly, the aggregated SIL factor tells you how long the simulation for multiple recordings takes compared to the duration of the recorded data. It also factors in the horizontal scaling capabilities.

Aggregated SIL factor example:

  • For 3 recordings with 60 minutes of data each and an overall simulation duration of 120 minutes, the resulting aggregated SIL factor is 120 / (3 × 60) ≈ 0.67. The overall simulation is a third faster than real time.
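The two definitions translate directly into code; this small helper reproduces the examples above.

def sil_factor(simulation_minutes, recording_minutes):
    """SIL factor for one recording: simulation duration / recording duration."""
    return simulation_minutes / recording_minutes


def aggregated_sil_factor(total_simulation_minutes, recording_minutes):
    """Aggregated SIL factor: overall simulation duration / total recorded duration."""
    return total_simulation_minutes / sum(recording_minutes)


print(sil_factor(120, 60))                                  # 2.0  -> twice as long as real time
print(sil_factor(60, 60))                                   # 1.0  -> real time
print(round(aggregated_sil_factor(120, [60, 60, 60]), 2))   # 0.67 -> faster than real time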

Performance results

The storage optimizations described in the preceding ADDL integration section led to a 50% improvement in overall simulation duration. To benchmark the overall simulation platform’s performance, we created a simulation run with 15 recordings. The simulation run completed successfully with an aggregated SIL factor of 0.363; alternatively put, the simulation ran roughly three times faster than real time. During production use, the platform will handle simulation runs with 100–300 recordings. For these runs, the aggregated SIL factor is expected to be even smaller, as more simulations can be processed in parallel.

Conclusion

In this post, we showed how to design a platform for ADAS simulation workloads using the bring your own software-in-the-loop (BYO-SIL) concept. We covered the key components, including the simulation control plane, scalable compute backend, and ADDL integration, based on Continental Automotive Edge (CAEdge). We discussed the key performance benefits of horizontal scalability and how to choose an optimized storage integration pattern, which led to a 50% improvement in overall simulation duration.

Field Notes: Deploying Autonomous Driving and ADAS Workloads at Scale with Amazon Managed Workflows for Apache Airflow

Post Syndicated from Hendrik Schoeneberg original https://aws.amazon.com/blogs/architecture/field-notes-deploying-autonomous-driving-and-adas-workloads-at-scale-with-amazon-managed-workflows-for-apache-airflow/

Cloud architects developing autonomous driving and ADAS workflows are challenged by loosely distributed process steps along the tool chain in hybrid environments. This is compounded by the need to create a holistic view of all running pipelines and jobs. Common challenges include:

  • finding and getting access to the data sources specific to your use case, understanding the data, and moving and storing it,
  • cleaning and transforming it, and
  • preparing it for downstream consumption.

Verifying that your data pipeline works correctly and ensuring a certain data quality while your application is being developed adds to the complexity. Furthermore, the fact that the data itself constantly changes poses more challenges. In this blog post, we show you the steps to run a workflow in an Amazon Managed Workflows for Apache Airflow (MWAA) environment. All required infrastructure is created leveraging the AWS Cloud Development Kit (CDK).

We illustrate how image data collected during field operational tests (FOTs) of vehicles can be processed by:

  • detecting objects within the video stream
  • adding the bounding boxes of all detected objects to the image data
  • providing visual and metadata information for data cataloguing, as well as for potential applications that use the bounding boxes.

We define a workflow that processes ROS bag files on Amazon S3 and extracts individual PNGs from a video stream using AWS Fargate on Amazon Elastic Container Service (Amazon ECS). Then we show how to use Amazon Rekognition to detect objects within the exported images and draw bounding boxes for the detected objects within the images. Finally, we demonstrate how users interface with MWAA to develop and publish their workflows, and we show methods to keep track of the performance and costs of your workflows.

Overview of the solution

Figure 1 – Architecture for deploying Autonomous Driving & ADAS workloads at scale

The preceding diagram shows our target solution architecture. It includes five components:

  • Amazon MWAA environment: we use Amazon MWAA to orchestrate our image processing workflow’s tasks. A workflow, or DAG (directed acyclic graph), is a set of task definitions and dependencies written in Python. We create a MWAA environment using CDK and define our image processing pipeline as a DAG which MWAA can orchestrate.
  • ROS bag file processing workflow: this workflow runs on Amazon MWAA and detects and highlights objects within the camera stream of a ROS bag file. The workflow is as follows:
    • It monitors a location in Amazon S3 for new ROS bag files.
    • If a new file gets uploaded, we extract the video data from the ROS bag file’s topic that holds the video stream, extract individual PNGs from that stream, and store them on S3.
    • We use AWS Fargate on Amazon ECS to run the containerized image extraction workload.
    • Once the extracted images are stored on S3, they are passed to Amazon Rekognition for object detection. Given an image or video, the Amazon Rekognition API can identify objects, people, text, scenes, and activities, and Amazon Rekognition Image can return a bounding box for many object labels.
    • The bounding boxes of the objects Amazon Rekognition detects within the extracted PNGs are rendered directly into the image using Pillow, a fork of the Python Imaging Library (PIL), to highlight the detected objects (a sketch of this step, with placeholder names, follows this list).
    • The resulting images are then stored in Amazon S3 and our workflow is complete.
  • Predefined DAGs: Our solution comes with a pre-defined ROS bag image processing DAG that will be copied into the Amazon MWAA environment with CDK.
  • AWS CodePipeline for automated CI/CD pipeline: CodePipeline automates the build, test, and deploy phases of your release process every time there is a code change, based on the release model you define.
  • Amazon CloudWatch for monitoring and operational dashboards: With Amazon CloudWatch we can monitor important metrics of the ROS bag image processing workflow, for example, to verify the compute load, the number of running workflows or tasks, or to monitor and alert on processing errors.
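To make the object detection and bounding box step concrete, here is a minimal sketch using the Amazon Rekognition detect_labels API and Pillow. The bucket, key, confidence threshold, and output prefix are illustrative assumptions, not the exact implementation in the example sources.

import boto3
from PIL import Image, ImageDraw

rekognition = boto3.client("rekognition")
s3 = boto3.client("s3")

# Placeholder bucket and key for one extracted PNG frame.
bucket = "rosbag-processing-stack-destbucket-example"
key = "20201005/frame_0001.png"

# Detect labels directly on the object in S3; bounding boxes are returned as ratios.
labels = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": bucket, "Name": key}},
    MinConfidence=70,
)

s3.download_file(bucket, key, "/tmp/frame.png")
image = Image.open("/tmp/frame.png")
draw = ImageDraw.Draw(image)
width, height = image.size

for label in labels["Labels"]:
    for instance in label.get("Instances", []):  # only some labels carry instance boxes
        box = instance["BoundingBox"]
        left, top = box["Left"] * width, box["Top"] * height
        right, bottom = left + box["Width"] * width, top + box["Height"] * height
        draw.rectangle([left, top, right, bottom], outline="red", width=3)
        draw.text((left, top), label["Name"], fill="red")

image.save("/tmp/frame_boxed.png")
s3.upload_file("/tmp/frame_boxed.png", bucket, f"bounding_boxes/{key.split('/')[-1]}")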

Prerequisites

Deployment

To deploy and run the solution, we follow these steps:

  1. Configure and create an MWAA environment in AWS using CDK
  2. Copy the ROS bag file to S3
  3. Manage the ROS bag processing DAG from the Airflow UI
  4. Inspect the results
  5. Monitor the operations dashboard

Each component illustrated in the solution architecture diagram, as well as all required IAM roles and configuration parameters, is created from an AWS CDK application, which is included in the example sources. A sketch of the MWAA part of this CDK application follows.
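As a sketch of the part of that CDK application which creates the MWAA environment (CDK v1, Python), the snippet below uses the low-level CfnEnvironment construct. All names, ARNs, subnets, and security groups are placeholders; the example sources wire these values in from the other stacks.

from aws_cdk import core as cdk
from aws_cdk import aws_mwaa as mwaa


class MwaaStack(cdk.Stack):
    """Minimal MWAA environment definition; all identifiers below are placeholders."""

    def __init__(self, scope: cdk.Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        mwaa.CfnEnvironment(
            self,
            "RosbagProcessingEnvironment",
            name="rosbag-processing",
            dag_s3_path="dags",
            source_bucket_arn="arn:aws:s3:::rosbag-processing-stack-dagbucket-example",
            execution_role_arn="arn:aws:iam::123456789012:role/mwaa-execution-role",
            environment_class="mw1.small",
            max_workers=5,
            network_configuration=mwaa.CfnEnvironment.NetworkConfigurationProperty(
                security_group_ids=["sg-0123456789abcdef0"],
                subnet_ids=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
            ),
            webserver_access_mode="PUBLIC_ONLY",
        )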

1. Configure and create an MWAA environment in AWS using CDK

  • Download the deployment templates
  • Extract the zip file to a directory
  • Open a shell and go to the extracted directory
  • Start the deployment by typing the following code:
./deploy.sh <named-profile> deploy true <region>
  • Replace <named-profile> with the named profile that you configured for your AWS CLI to point to your target AWS account.
  • Replace <region> with the Region that is associated with your named profile, for example, us-east-1.
  • As an example: if your named profile for the AWS CLI is called rosbag-processing and your associated Region is us-east-1, the resulting command would be:
./deploy.sh rosbag-processing deploy true us-east-1

After confirming the deployment, the CDK will take several minutes to synthesize the CloudFormation templates, build the Docker image and deploy the solution to the target AWS account.

Figure 2 – Screenshot of Rosbag processing stack

Inspect the CloudFormation stacks being created in the AWS console. Sign in to your AWS console and select the CloudFormation service in the AWS Region that you chose for your named profile (for example, us-east-1). After the installation script has run, the CloudFormation stack overview in the console should show the following stacks:

Figure 3 – Screenshot of CloudFormation stack overview

2. Copy the ROS bag file to S3

  • You can download the ROS bag file here. After downloading it, copy the file to the input bucket that was created during the CDK deployment in the previous step.
  • Open the AWS management console and navigate to the service S3. You should find a list of your buckets including the buckets for input data, output data and DAG definitions that were created during the CDK deployment.
Figure 4 – Screenshot showing S3 buckets created

  • Select the bucket with prefix rosbag-processing-stack-srcbucket and copy the ROS bag file into the bucket.
  • Now that the ROS bag file is in S3, we need to enable the image processing workflow in MWAA.

3. Manage the ROS bag processing DAG from the Airflow UI

  • In order to access the Airflow web server, open the AWS management console and navigate to the MWAA service.
  • From the overview, select the MWAA environment that was created and select Open Airflow UI as shown in the following screenshot:
Figure 5 – Screenshot showing Airflow Environments

The Airflow UI and the ROS bag image processing DAG should appear as in the following screenshot:

  • We can now enable the DAG by flipping the switch (1) from the “Off” to the “On” position. Amazon MWAA will then schedule the launch of the workflow.
  • By selecting the DAG’s name (rosbag_processing) and selecting the Graph View, you can review the workflow’s individual tasks:
Figure 6 – The Airflow UI and the ROS bag image processing DAG

Figure 7 – Dag Workflow

After enabling the image processing DAG by flipping the switch to “on” in the previous step, Airflow will detect the newly uploaded ROS bag file in S3 and start processing it. You can monitor the progress from the DAG’s tree view (1), as shown in the following screenshot:

Figure 8 – Monitor progress in the DAG’s tree view

Re-running the image processing pipeline

Our solution uses S3 metadata to keep track of the processing status of each ROS bag file in the input bucket in S3. If you want to re-process a ROS bag file:

  • open the AWS management console,
  • navigate to S3, click on the ROS bag file you want to re-process,
  • remove the file’s metadata tag processing.status from the Properties tab (a sketch of doing this programmatically follows the figure):
Figure 9 – Metadata tag
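For completeness, here is a sketch of clearing the same tag programmatically with boto3, assuming the status is stored as user-defined object metadata. The bucket and key are placeholders; because S3 object metadata is immutable, the object is copied onto itself with the metadata replaced.

import boto3

s3 = boto3.client("s3")
bucket = "rosbag-processing-stack-srcbucket-example"  # placeholder bucket name
key = "example.bag"                                   # placeholder ROS bag key

# Fetch the current metadata and drop the processing.status entry.
head = s3.head_object(Bucket=bucket, Key=key)
metadata = {k: v for k, v in head["Metadata"].items() if k != "processing.status"}

# Copy the object onto itself with the reduced metadata set.
# Note: copy_object supports objects up to 5 GB; larger files need a multipart copy.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    Metadata=metadata,
    MetadataDirective="REPLACE",
)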

With the metadata tag processing.status removed, Amazon MWAA picks up the file the next time the image processing workflow is launched. You can manually start the workflow by choosing Trigger DAG (1) from the Airflow web server UI.

Figure 10 – Manually start the workflow by choosing Trigger DAG

4. Inspect the results

In the AWS management console navigate to S3, and select the output bucket with prefix rosbag-processing-stack-destbucket. The bucket holds both the extracted images from the ROS bag’s topic containing the video data (1) and the images with highlighted bounding boxes for the objects detected by Amazon Rekognition (2).

Figure 11 – Output S3 buckets

  • Navigate to the folder 20201005 and select an image,
  • Navigate to the folder bounding_boxes
  • Select the corresponding image. You should receive an output similar to the following two examples:
Figure 12 – Left: Image extracted from ROS bag camera feed topic. Right: Same image with highlighted objects that Amazon Rekognition detected.

Amazon CloudWatch for monitoring and creating operational dashboard

Amazon MWAA comes with CloudWatch metrics to monitor your MWAA environment. You can create dashboards for quick and intuitive operational insights. To do so:

  • Open the AWS management console and navigate to the service CloudWatch.
  • Select the menu “Dashboards” from the navigation pane and click on “Create dashboard”.
  • After providing a name for your dashboard and selecting a widget type, you can choose from a variety of metrics that MWAA provides, as shown in the following picture.

MWAA comes with a range of pre-defined metrics that allow you to monitor failed tasks and the health of your managed Airflow environment, and that help you right-size your environment. A sketch of reading these metrics programmatically follows Figure 13.

Figure 13 – CloudWatch metrics in a dashboard help to monitor your MWAA environment.
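As a small sketch of reading these metrics programmatically, the snippet below lists the metrics MWAA publishes for an environment. The "AmazonMWAA" namespace is taken from the MWAA documentation; treat it as an assumption to verify for your Region and Airflow version.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Enumerate the MWAA metrics available for dashboards and alarms.
paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="AmazonMWAA"):
    for metric in page["Metrics"]:
        print(metric["MetricName"], metric["Dimensions"])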

Clean Up

In order to delete the deployed solution from your AWS account, you must:

1. Delete all data in your S3 buckets

  • In the management console, navigate to the S3 service. You should see three buckets whose names contain rosbag-processing-stack that were created during the deployment.
  • Select each bucket and delete its contents. Note: Do not delete the bucket itself.

2. Delete the solution’s CloudFormation stack

You can delete the deployed solution from your AWS account either from the management console or the command line interface.

Using the management console:

  • Open the management console and navigate to the CloudFormation service. Make sure to select the same Region you deployed your solution to (for example, us-east-1).
  • Select stacks from the side menu. The solution’s CloudFormation stack is shown in the following screenshot:
Figure 14 – Cleaning up the CloudFormation Stack

Select the rosbag-processing-stack and select Delete. CloudFormation will now destroy all resources related to the ROS bag image processing pipeline.

Using the Command Line Interface (CLI):

Open a shell and navigate to the directory from which you started the deployment, then type the following:

cdk destroy --profile <named-profile>

Replace <named-profile> with the named profile that you configured for your AWS CLI to point to your target AWS account.

As an example: if your named profile for the AWS CLI is called rosbag-processing, the resulting command would be:

cdk destroy --profile rosbag-processing

After confirming that you want to destroy the CloudFormation stack, the solution’s resources will be deleted from your AWS account.

Conclusion

This blog post illustrated how to deploy and leverage workflows for running Autonomous Driving & ADAS workloads at scale. The solution was designed using fully managed AWS services and deployed using AWS CDK templates. Using MWAA, we were able to set up centralized, transparent, and intuitive workflow orchestration across multiple systems. Even though the pipeline involves several steps across multiple AWS services, MWAA helped us understand and visualize the workflow, its individual tasks, and their dependencies.

Usually, distributed systems are hard to debug in case of errors or unexpected outcomes. However, building the ROS bag processing pipeline on MWAA allowed us to inspect the log files centrally from the Airflow UI, for debugging or auditing. Furthermore, with MWAA the ROS bag processing workflow is part of the code base and can be tested, refactored, and versioned just like any other software artifact. By providing the CDK templates, we were able to architect a fully automated solution that you can adapt to your use case.

We hope you found this interesting and helpful and invite your comments on the solution!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.