All posts by Randy DeFauw

Build a RAG data ingestion pipeline for large-scale ML workloads

2024-03-13 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/big-data/build-a-rag-data-ingestion-pipeline-for-large-scale-ml-workloads/

For building any generative AI application, enriching the large language models (LLMs) with new data is imperative. This is where the Retrieval Augmented Generation (RAG) technique comes in. RAG is a machine learning (ML) architecture that uses external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. For ingesting these external data sources, Vector databases have evolved, which can store vector embeddings of the data source and allow for similarity searches.

In this post, we show how to build a RAG extract, transform, and load (ETL) ingestion pipeline to ingest large amounts of data into an Amazon OpenSearch Service cluster and use Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension as a vector data store. Each service implements k-nearest neighbor (k-NN) or approximate nearest neighbor (ANN) algorithms and distance metrics to calculate similarity. We introduce the integration of Ray into the RAG contextual document retrieval mechanism. Ray is an open source, Python, general purpose, distributed computing library. It allows distributed data processing to generate and store embeddings for a large amount of data, parallelizing across multiple GPUs. We use a Ray cluster with these GPUs to run parallel ingest and query for each service.

In this experiment, we attempt to analyze the following aspects for OpenSearch Service and the pgvector extension on Amazon RDS:

As a vector store, the ability to scale and handle a large dataset with tens of millions of records for RAG
Possible bottlenecks in the ingest pipeline for RAG
How to achieve optimal performance in ingestion and query retrieval times for OpenSearch Service and Amazon RDS

To understand more about vector data stores and their role in building generative AI applications, refer to The role of vector datastores in generative AI applications.

Overview of OpenSearch Service

OpenSearch Service is a managed service for secure analysis, search, and indexing of business and operational data. OpenSearch Service supports petabyte-scale data with the ability to create multiple indexes on text and vector data. With optimized configuration, it aims for high recall for the queries. OpenSearch Service supports ANN as well as exact k-NN search. OpenSearch Service supports a selection of algorithms from the NMSLIB, FAISS, and Lucene libraries to power the k-NN search. We created the ANN index for OpenSearch with the Hierarchical Navigable Small World (HNSW) algorithm because it’s regarded as a better search method for large datasets. For more information on the choice of index algorithm, refer to Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Overview of Amazon RDS for PostgreSQL with pgvector

The pgvector extension adds an open source vector similarity search to PostgreSQL. By utilizing the pgvector extension, PostgreSQL can perform similarity searches on vector embeddings, providing businesses with a speedy and proficient solution. pgvector provides two types of vector similarity searches: exact nearest neighbor, which results with 100% recall, and approximate nearest neighbor (ANN), which provides better performance than exact search with a trade-off on recall. For searches over an index, you can choose how many centers to use in the search, with more centers providing better recall with a trade-off of performance.

Solution overview

The following diagram illustrates the solution architecture.

Let’s look at the key components in more detail.

Dataset

We use OSCAR data as our corpus and the SQUAD dataset to provide sample questions. These datasets are first converted to Parquet files. Then we use a Ray cluster to convert the Parquet data to embeddings. The created embeddings are ingested to OpenSearch Service and Amazon RDS with pgvector.

OSCAR (Open Super-large Crawled Aggregated corpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the ungoliant architecture. Data is distributed by language in both original and deduplicated form. The Oscar Corpus dataset is approximately 609 million records and takes up about 4.5 TB as raw JSONL files. The JSONL files are then converted to Parquet format, which minimizes the total size to 1.8 TB. We further scaled the dataset down to 25 million records to save time during ingestion.

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. We use SQUAD, licensed as CC-BY-SA 4.0, to provide sample questions. It has approximately 100,000 questions with over 50,000 unanswerable questions written by crowd workers to look similar to answerable ones.

Ray cluster for ingestion and creating vector embeddings

In our testing, we found that the GPUs make the biggest impact to performance when creating the embeddings. Therefore, we decided to use a Ray cluster to convert our raw text and create the embeddings. Ray is an open source unified compute framework that enables ML engineers and Python developers to scale Python applications and accelerate ML workloads. Our cluster consisted of 5 g4dn.12xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance was configured with 4 NVIDIA T4 Tensor Core GPUs, 48 vCPU, and 192 GiB of memory. For our text records, we ended up chunking each into 1,000 pieces with a 100-chunk overlap. This breaks out to approximately 200 per record. For the model used to create embeddings, we settled on all-mpnet-base-v2 to create a 768-dimensional vector space.

Infrastructure setup

We used the following RDS instance types and OpenSearch service cluster configurations to set up our infrastructure.

The following are our RDS instance type properties:

Instance type: db.r7g.12xlarge
Allocated storage: 20 TB
Multi-AZ: True
Storage encrypted: True
Enable Performance Insights: True
Performance Insight retention: 7 days
Storage type: gp3
Provisioned IOPS: 64,000
Index type: IVF
Number of lists: 5,000
Distance function: L2

The following are our OpenSearch Service cluster properties:

Version: 2.5
Data nodes: 10
Data node instance type: r6g.4xlarge
Primary nodes: 3
Primary node instance type: r6g.xlarge
Index: HNSW engine: nmslib
Refresh interval: 30 seconds
ef_construction: 256
m: 16
Distance function: L2

We used large configurations for both the OpenSearch Service cluster and RDS instances to avoid any performance bottlenecks.

We deploy the solution using an AWS Cloud Development Kit (AWS CDK) stack, as outlined in the following section.

Deploy the AWS CDK stack

The AWS CDK stack allows us to choose OpenSearch Service or Amazon RDS for ingesting data.

Pre-requsities

Before proceeding with the installation, under cdk, bin, src.tc, change the Boolean values for Amazon RDS and OpenSearch Service to either true or false depending on your preference.

You also need a service-linked AWS Identity and Access Management (IAM) role for the OpenSearch Service domain. For more details, refer to Amazon OpenSearch Service Construct Library. You can also run the following command to create the role:

aws iam create-service-linked-role --aws-service-name es.amazonaws.com

npm install
cdk deploy

This AWS CDK stack will deploy the following infrastructure:

A VPC
A jump host (inside the VPC)
An OpenSearch Service cluster (if using OpenSearch service for ingestion)
An RDS instance (if using Amazon RDS for ingestion)
An AWS Systems Manager document for deploying the Ray cluster
An Amazon Simple Storage Service (Amazon S3) bucket
An AWS Glue job for converting the OSCAR dataset JSONL files to Parquet files
Amazon CloudWatch dashboards

Download the data

Run the following commands from the jump host:

stack_name="RAGStack"
output_key="S3bucket"

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

bucket_name=$(aws cloudformation describe-stacks --stack-name "$stack_name" --query "Stacks[0].Outputs[?OutputKey=='bucketName'].OutputValue" --output text )

Before cloning the git repo, make sure you have a Hugging Face profile and access to the OSCAR data corpus. You need to use the user name and password for cloning the OSCAR data:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2301
cd OSCAR-2301
git lfs pull --include en_meta
cd en_meta
for F in `ls *.zst`; do zstd -d $F; done
rm *.zst
cd ..
aws s3 sync en_meta s3://$bucket_name/oscar/jsonl/

Convert JSONL files to Parquet

The AWS CDK stack created the AWS Glue ETL job oscar-jsonl-parquet to convert the OSCAR data from JSONL to Parquet format.

After you run the oscar-jsonl-parquet job, the files in Parquet format should be available under the parquet folder in the S3 bucket.

Download the questions

From your jump host, download the questions data and upload it to your S3 bucket:

stack_name="RAGStack"
output_key="S3bucket"

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

bucket_name=$(aws cloudformation describe-stacks --stack-name "$stack_name" --query "Stacks[0].Outputs[?OutputKey=='bucketName'].OutputValue" --output text )

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
cat train-v2.0.json| jq '.data[].paragraphs[].qas[].question' > questions.csv
aws s3 cp questions.csv s3://$bucket_name/oscar/questions/questions.csv

Set up the Ray cluster

As part of the AWS CDK stack deployment, we created a Systems Manager document called CreateRayCluster.

To run the document, complete the following steps:

On the Systems Manager console, under Documents in the navigation pane, choose Owned by Me.
Open the CreateRayCluster document.
Choose Run.

The run command page will have the default values populated for the cluster.

The default configuration requests 5 g4dn.12xlarge. Make sure your account has limits to support this. The relevant service limit is Running On-Demand G and VT instances. The default for this is 64, but this configuration requires 240 CPUS.

After you review the cluster configuration, select the jump host as the target for the run command.

This command will perform the following steps:

Copy the Ray cluster files
Set up the Ray cluster
Set up the OpenSearch Service indexes
Set up the RDS tables

You can monitor the output of the commands on the Systems Manager console. This process will take 10–15 minutes for the initial launch.

Run ingestion

From the jump host, connect to the Ray cluster:

sudo -i
cd /rag
ray attach llm-batch-inference.yaml

The first time connecting to the host, install the requirements. These files should already be present on the head node.

pip install -r requirements.txt

For either of the ingestion methods, if you get an error like the following, it’s related to expired credentials. The current workaround (as of this writing) is to place credential files in the Ray head node. To avoid security risks, don’t use IAM users for authentication when developing purpose-built software or working with real data. Instead, use federation with an identity provider such as AWS IAM Identity Center (successor to AWS Single Sign-On).

OSError: When reading information for key 'oscar/parquet_data/part-00497-f09c5d2b-0e97-4743-ba2f-1b2ad4f36bb1-c000.snappy.parquet' in bucket 'ragstack-s3bucket07682993-1e3dic0fvr3rf': AWS Error [code 15]: No response body.

Usually, the credentials are stored in the file ~/.aws/credentials on Linux and macOS systems, and %USERPROFILE%\.aws\credentials on Windows, but these are short-term credentials with a session token. You also can’t override the default credential file, and so you need to create long-term credentials without the session token using a new IAM user.

To create long-term credentials, you need to generate an AWS access key and AWS secret access key. You can do that from the IAM console. For instructions, refer to Authenticate with IAM user credentials.

After you create the keys, connect to the jump host using Session Manager, a capability of Systems Manager, and run the following command:

$ aws configure
AWS Access Key ID [None]: <Your AWS Access Key>
AWS Secret Access Key [None]: <Your AWS Secret access key>
Default region name [None]: us-east-1
Default output format [None]: json

Now you can rerun the ingestion steps.

Ingest data into OpenSearch Service

If you’re using OpenSearch service, run the following script to ingest the files:

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

python embedding_ray_os.py

When it’s complete, run the script that runs simulated queries:

python query_os.py

Ingest data into Amazon RDS

If you’re using Amazon RDS, run the following script to ingest the files:

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

python embedding_ray_rds.py

When it’s complete, make sure to run a full vacuum on the RDS instance.

Then run the following script to run simulated queries:

python query_rds.py

Set up the Ray dashboard

Before you set up the Ray dashboard, you should install the AWS Command Line Interface (AWS CLI) on your local machine. For instructions, refer to Install or update the latest version of the AWS CLI.

Complete the following steps to set up the dashboard:

Install the Session Manager plugin for the AWS CLI.
In the Isengard account, copy the temporary credentials for bash/zsh and run in your local terminal.
Create a session.sh file in your machine and copy the following content to the file:

#!/bin/bash
echo Starting session to $1 to forward to port $2 using local port $3
aws ssm start-session --target $1 --document-name AWS-StartPortForwardingSession --parameters ‘{“portNumber”:[“‘$2’“], “localPortNumber”:[“‘$3’“]}'

Change the directory to where this session.sh file is stored.
Run the command Chmod +x to give executable permission to the file.
Run the following command:

./session.sh <Ray cluster head node instance ID> 8265 8265

For example:

./session.sh i-021821beb88661ba3 8265 8265

You will see a message like the following:

Starting session to i-021821beb88661ba3 to forward to port 8265 using local port 8265

Starting session with SessionId: abcdefgh-Isengard-0d73d992dfb16b146
Port 8265 opened for sessionId abcdefgh-Isengard-0d73d992dfb16b146.
Waiting for connections...

Open a new tab in your browser and enter localhost:8265.

You will see the Ray dashboard and statistics of the jobs and cluster running. You can track metrics from here.

For example, you can use the Ray dashboard to observe load on the cluster. As shown in the following screenshot, during ingest, the GPUs are running close to 100% utilization.

You can also use the RAG_Benchmarks CloudWatch dashboard to see the ingestion rate and query response times.

Extensibility of the solution

You can extend this solution to plug in other AWS or third-party vector stores. For every new vector store, you will need to create scripts for configuring the data store as well as ingesting data. The rest of the pipeline can be reused as needed.

Conclusion

In this post, we shared an ETL pipeline that you can use to put vectorized RAG data in both OpenSearch Service as well as Amazon RDS with the pgvector extension as vector datastores. The solution used a Ray cluster to provide the necessary parallelism to ingest a large data corpus. You can use this methodology to integrate any vector database of your choice to build RAG pipelines.

About the Authors

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

David Christian is a Principal Solutions Architect based out of Southern California. He has his bachelor’s in Information Security and a passion for automation. His focus areas are DevOps culture and transformation, infrastructure as code, and resiliency. Prior to joining AWS, he held roles in security, DevOps, and system engineering, managing large-scale private and public cloud environments.

Prachi Kulkarni is a Senior Solutions Architect at AWS. Her specialization is machine learning, and she is actively working on designing solutions using various AWS ML, big data, and analytics offerings. Prachi has experience in multiple domains, including healthcare, benefits, retail, and education, and has worked in a range of positions in product engineering and architecture, management, and customer success.

Richa Gupta is a Solutions Architect at AWS. She is passionate about architecting end-to-end solutions for customers. Her specialization is machine learning and how it can be used to build new solutions that lead to operational excellence and drive business revenue. Prior to joining AWS, she worked in the capacity of a Software Engineer and Solutions Architect, building solutions for large telecom operators. Outside of work, she likes to explore new places and loves adventurous activities.

Minimizing Dependencies in a Disaster Recovery Plan

2022-01-26 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/minimizing-dependencies-in-a-disaster-recovery-plan/

The Availability and Beyond whitepaper discusses the concept of static stability for improving resilience. What does static stability mean with regard to a multi-Region disaster recovery (DR) plan? What if the very tools that we rely on for failover are themselves impacted by a DR event?

In this post, you’ll learn how to reduce dependencies in your DR plan and manually control failover even if critical AWS services are disrupted. As a bonus, you’ll see how to use service control policies (SCPs) to help simulate a Regional outage, so that you can test failover scenarios more realistically.

Failover plan dependencies and considerations

Let’s dig into the DR scenario in more detail. Using Amazon Route 53 for Regional failover routing is a common pattern for DR events. In the simplest case, we’ve deployed an application in a primary Region and a backup Region. We have a Route 53 DNS record set with records for both Regions, and all traffic goes to the primary Region. In an event that triggers our DR plan, we manually or automatically switch the DNS records to direct all traffic to the backup Region.

Relying on an automated health check to control Regional failover can be tricky. A health check might not be perfectly reliable if a Region is experiencing some type of degradation. Often, we prefer to initiate our DR plan manually, which then initiates with automation.

What are the dependencies that we’ve baked into this failover plan? First, Route 53, our DNS service, has to be available. It must continue to serve DNS queries, and we have to be able to change DNS records manually. Second, if we do not have a full set of resources already deployed in the backup Region, we must be able to deploy resources into it.

Both dependencies might violate static stability, because we are relying on resources in our DR plan that might be affected by the outage we’re seeing. Ideally, we don’t want to depend on other services running so we can failover and continue to serve our own traffic. How do we reduce additional dependencies?

Static stability

Let’s look at our first dependency on Route 53 – control planes and data planes. Briefly, a control plane is used to configure resources, and the data plane delivers services (see Understanding Availability Needs for a more complete definition.)

The Route 53 data plane, which responds to DNS queries, is highly resilient across Regions. We can safely rely on it during the failure of any single Region. But let’s assume that for some reason we are not able to call on the Route 53 control plane.

Amazon Route 53 Application Recovery Controller (Route 53 ARC) was built to handle this scenario. It provisions a Route 53 health check that we can manually control with a Route 53 ARC routing control, and is a data plane operation. The Route 53 ARC data plane is highly resilient, using a cluster of five Regional endpoints. You can revise the health check if three of the five Regions are available.

Figure 1. Simple Regional failover scenario using Route 53 Application Recovery Controller

The second dependency, being able to deploy resources into the second Region, is not a concern if we run a fully scaled-out set of resources. We must make sure that our deployment mechanism doesn’t rely only on the primary Region. Most AWS services have Regional control planes, so this isn’t an issue.

The AWS Identity and Access Management (IAM) data plane is highly available in each Region, so you can authorize the creation of new resources as long as you’ve already defined the roles. Note: If you use federated authentication through an identity provider, you should test that the IdP does not itself have a dependency on another Region.

Testing your disaster recovery plan

Once we’ve identified our dependencies, we need to decide how to simulate a disaster scenario. Two mechanisms you can use for this are network access control lists (NACLs) and SCPs. The first one enables us to restrict network traffic to our service endpoints. However, the second allows defining policies that specify the maximum permissions for the target accounts. It also allows us to simulate a Route 53 or IAM control plane outage by restricting access to the service.

For the end-to-end DR simulation, we’ve published an AWS samples repository on GitHub that you can use to deploy. This evaluates Route 53 ARC capabilities if both Route 53 and IAM control planes aren’t accessible.

By deploying test applications across us-east-1 and us-west-1 AWS Regions, we can simulate a real-world scenario that determines the business continuity impact, failover timing, and procedures required for successful failover with unavailable control planes.

Figure 2. Simulating Regional failover using service control policies

Before you conduct the test outlined in our scenario, we strongly recommend that you create a dedicated AWS testing environment with an AWS Organizations setup. Make sure that you don’t attach SCPs to your organization’s root but instead create a dedicated organization unit (OU). You can use this pattern to test SCPs and ensure that you don’t inadvertently lock out users from key services.

Chaos engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent production conditions. Chaos engineering and its principles are important tools when you plan for disaster recovery. Even a simple distributed system may be too complex to operate reliably. It can be hard or impossible to plan for every failure scenario in non-trivial distributed systems, because of the number of failure permutations. Chaos experiments test these unknowns by injecting failures (for example, shutting down EC2 instances) or transient anomalies (for example, unusually high network latency.)

In the context of multi-Region DR, these techniques can help challenge assumptions and expose vulnerabilities. For example, what happens if a health check passes but the system itself is unhealthy, or vice versa? What will you do if your entire monitoring system is offline in your primary Region, or too slow to be useful? Are there control plane operations that you rely on that themselves depend on a single AWS Region’s health, such as Amazon Route 53? How does your workload respond when 25% of network packets are lost? Does your application set reasonable timeouts or does it hang indefinitely when it experiences large network latencies?

Questions like these can feel overwhelming, so start with a few, then test and iterate. You might learn that your system can run acceptably in a degraded mode. Alternatively, you might find out that you need to be able to failover quickly. Regardless of the results, the exercise of performing chaos experiments and challenging assumptions is critical when developing a robust multi-Region DR plan.

Conclusion

In this blog, you learned about reducing dependencies in your DR plan. We showed how you can use Amazon Route 53 Application Recovery Controller to reduce a dependency on the Route 53 control plane, and how to simulate a Regional failure using SCPs. As you evaluate your own DR plan, be sure to take advantage of chaos engineering practices. Formulate questions and test your static stability assumptions. And of course, you can incorporate these questions into a custom lens when you run a Well-Architected review using the AWS Well-Architected Tool.

Applying Federated Learning for ML at the Edge

2021-12-09 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/applying-federated-learning-for-ml-at-the-edge/

Federated Learning (FL) is an emerging approach to machine learning (ML) where model training data is not stored in a central location. During ML training, we typically need to access the entire training dataset on a single machine. For purposes of performance scaling, we divide the training data between multiple CPUs, multiple GPUs, or a cluster of machines. But in actuality, the training data is available in a single place. This poses challenges when we gather data from many distributed devices, like mobile phones or industrial sensors. For privacy reasons, we may not be able to collect training data from mobile phones. In an industrial setting, we may not have the network connectivity to push a large volume of sensor data to a central location. In addition, raw sensor data may contain sensitive information about a plant’s operation.

In this blog, we will demonstrate a working example of federated learning on AWS. We’ll also discuss some design challenges you may face when implementing FL for your own use case.

Federated learning challenges

In FL, all data stays on the devices. A training coordinator process invokes a training round or epoch, and the devices update a local copy of the model using local data. When complete, the devices each send their updated copy of the model weights to the coordinator. The coordinator combines these weights using an approach like federated averaging, and sends the new weights to each device. Figures 1 and 2 compare traditional ML with federated learning.

Traditional ML training:

Figure 1. Traditional ML training

Federated Learning:

Figure 2. Federated learning. Note that in federated learning, all data stays on the devices, as compared to traditional ML.

Compared to traditional ML, federated learning poses several challenges.

Unpredictable training participants. Mobile and industrial devices may have intermittent connectivity. We may have thousands of devices, and the communication with these devices may be asynchronous. The training coordinator needs a way to discover and communicate with these devices.

Asynchronous communication. As previously noted, devices like IoT sensors may use communication protocols like MQTT, and rely on IoT frameworks with asynchronous pub/sub messaging. For example, the training coordinator cannot simply make a synchronous call to a device to send updated weights.

Passing large messages. The model weights can be several MB in size, making them too large for typical MQTT payloads. This can be problematic when network connections are not high bandwidth.

Emerging frameworks. Deep learning frameworks like TensorFlow and PyTorch do not yet fully support FL. Each has emerging options for edge-oriented ML, but they are not yet production-ready, and are intended mostly for simulation work.

Solving for federated learning challenges

Let’s begin by discussing the framework used for FL. This framework should work with any of the major deep learning systems like PyTorch and TensorFlow. It should clearly separate what happens on the training coordinator versus what happens on the devices. The Flower framework satisfies these requirements. In the simplest cases, Flower only requires that a device or client implement four methods: get model weights, set model weights, run a training round, and return metrics (like accuracy). Flower provides a default weight combination strategy, federated averaging, although we could design our own.

Next, let’s consider the challenges of devices and coordinator-to-device communication. Here it helps to have a specific use case in mind. Let’s focus on an industrial use case using the AWS IoT services. The IoT services will provide device discovery and asynchronous messaging tools. Running Flower code in tasks running on Amazon ECS on AWS Fargate containers orchestrated by an AWS Step Functions workflow, bridges the gap between the synchronous training process and the asynchronous device communication.

Devices: Our devices are registered with the AWS IoT Core, which gives access to MQTT communication. We use an AWS IoT Greengrass device, which lets us push AWS Lambda functions on the core device to run the actual ML training code. Greengrass also provides access to other AWS services and provides a streaming capability for pushing large payloads to the AWS Lambdacloud. Devices publish heartbeat messages to an MQTT topic to facilitate device discovery by the training coordinator.

Flower processes: The Flower framework requires a server and one or more clients. The clients communicate with the server synchronously over gRPC, so we cannot directly run the Flower client code on the devices. Rather, we use Amazon ECS Fargate containers to run the Flower server and client code. The client container acts as a proxy and communicates with the devices using MQTT messaging, Greengrass streaming, and direct Amazon S3 file transfer. Since the devices cannot send responses directly to the proxy containers, we use IoT rules to push responses to an Amazon DynamoDB bookkeeping table.

Orchestration: We use a Step Functions workflow to launch a training run. It performs device discovery, launches the server container, and then launches one proxy client container for each Greengrass core.

Metrics: The Flower server and proxy containers emit useful data to logs. We use Amazon CloudWatch log metric filters to collect this information on a CloudWatch dashboard.

Figure 3 shows the high-level design of the prototype.

Figure 3. FL prototype deployed on Amazon ECS Fargate containers and AWS IoT Greengrass cores

Production considerations

As you move into production with FL, you must design for a new set of challenges compared to traditional ML.

What type of devices do you have? Mobile and IoT devices are the most common candidates for FL.

How do the devices communicate with the coordinator? Devices come and go, and don’t always support synchronous communication. How will you discover devices that are available for FL and manage asynchronous communication? IoT frameworks are designed to work with large fleets of devices with intermittent connectivity. But mobile devices will find it easier to push results back to a training coordinator using regular API endpoints.

How will you integrate FL into your ML practice? As you saw in this example, we built a working FL prototype using AWS IoT, container, database, and application integration services. A data science team is likely to work in Amazon SageMaker or an equivalent ML environment that provides model registries, experiment tracking, and other features. How will you bridge this gap? (We may need to invent a new term – “MLEdgeOps.”)

Conclusion

In this blog, we gave a working example of federated learning in an IoT scenario using the Flower framework. We discussed the challenges involved in FL compared to traditional ML when building your own FL solution. As FL is an important and emerging topic in edge ML scenarios, we invite you to try our GitHub sample code. Please give us feedback as GitHub issues, particularly if you see additional federated learning challenges that we haven’t covered yet.

Configure Amazon EMR Studio and Amazon EKS to run notebooks with Amazon EMR on EKS

2021-09-25 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/big-data/configure-amazon-emr-studio-and-amazon-eks-to-run-notebooks-with-amazon-emr-on-eks/

Amazon EMR on Amazon EKS provides a deployment option for Amazon EMR that allows you to run analytics workloads on Amazon Elastic Kubernetes Service (Amazon EKS). This is an attractive option because it allows you to run applications on a common pool of resources without having to provision infrastructure. In addition, you can use Amazon EMR Studio to build analytics code running on Amazon EKS clusters. EMR Studio is a web-based, integrated development environment (IDE) using fully managed Jupyter notebooks that can be attached to any EMR cluster, including EMR on EKS. It uses AWS Single Sign-On (SSO) or a compatible identity provider (IdP) to log directly in to EMR Studio through a secure URL using corporate credentials.

Deploying EMR Studio to attach to EMR on EKS requires integrating several AWS services:

Amazon EKS
Amazon EMR
AWS SSO
Amazon Virtual Private Cloud (Amazon VPC)
Amazon Elastic Compute Cloud (Amazon EC2) Application Load Balancer
AWS Identity and Access Management (IAM) roles and policies

In addition, you need to install the following EMR on EKS components:

AWS Load Balancer Controller
Jupyter Enterprise Gateway

This post helps you build all the necessary components and stitch them together by running a single script. We also describe the architecture of this setup and how the components work together.

Architecture overview

With EMR on EKS, you can run Spark applications alongside other types of applications on the same Amazon EKS cluster, which improves resource allocation and simplifies infrastructure management. For more information about how Amazon EMR operates inside an Amazon EKS cluster, see New – Amazon EMR on Amazon Elastic Kubernetes Service (EKS). EMR Studio provides a web-based IDE that makes it easy to develop, visualize, and debug applications that run in EMR. For more information, see Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR.

Spark kernels are scheduled pods in a namespace in an Amazon EKS cluster. EMR Studio uses Jupyter Enterprise Gateway (JEG) to launch Spark kernels on Amazon EKS. A managed endpoint of type JEG is provisioned as a Kubernetes deployment in the EMR virtual cluster’s associated namespace and exposed as a Kubernetes service. Each EMR virtual cluster maps to a Kubernetes namespace registered with the Amazon EKS cluster; virtual clusters don’t manage physical compute or storage, but point to the Kubernetes namespace where the workload is scheduled. Each virtual cluster can have several managed endpoints, each with their own configured kernels for different use cases and needs. JEG managed endpoints provide HTTPS endpoints, serviced by an Application Load Balancer (ALB), that are reachable only from EMR Studio and self-hosted notebooks that are created within a private subnet of the Amazon EKS VPC.

The following diagram illustrates the solution architecture.

The managed endpoint is created in the virtual cluster’s Amazon EKS namespace (in this case, sparkns) and the HTTPS endpoints are serviced from private subnets. The kernel pods run with the job-execution IAM role defined in the managed endpoint. During managed endpoint creation, EMR on EKS uses the AWS Load Balancer Controller in the kube-system namespace to create an ALB with a target group that connects with the JEG managed endpoint in the virtual cluster’s Kubernetes namespace.

You can configure each managed endpoint’s kernel differently. For example, to permit a Spark kernel to use AWS Glue as their catalog, you can apply the following configuration JSON file in the —configuration-overrides flag when creating a managed endpoint:

aws emr-containers create-managed-endpoint \
--type JUPYTER_ENTERPRISE_GATEWAY \
--virtual-cluster-id ${virtclusterid} \
--name ${virtendpointname} \
--execution-role-arn ${role_arn} \
--release-label ${emr_release_label} \
--certificate-arn ${certarn} \
--region ${region} \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive"
        }
      }
    ]
  }'

The managed endpoint is a Kubernetes deployment fronted by a service inside the configured namespace (in this case, sparkns). When we trace the endpoint information, we can see how the Jupyter Enterprise Gateway deployment connects with the ALB and the target group:

# Get the endpoint ID
aws emr-containers list-managed-endpoints --region us-east-1 --virtual-cluster-id idzdhw2qltdr0dxkgx2oh4bp1
{
    "endpoints": [
        {
            "id": "5vbuwntrbzil1",
            "name": "virtual-emr-endpoint-demo",
            ...
            "serverUrl": "https://internal-k8s-default-ingress5-4f482e2d41-2097665209.us-east-1.elb.amazonaws.com:18888",

# List the deployment
kubectl get deployments -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                READY   UP-TO-DATE   AVAILABLE   AGE
jeg-5vbuwntrbzil1   1/1     1            1           4h54m


# List the service
kubectl get svc -n sparkns -l "emr-containers.amazonaws.com/managed-endpoint-id=5vbuwntrbzil1"

NAME                    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
service-5vbuwntrbzil1   NodePort   10.100.172.157   <none>        18888:30091/TCP   4h58m

# List the TargetGroups to get the TargetGroup ARN

kubectl get targetgroupbinding -n sparkns -o json | jq .items | jq .[].spec.targetGroupARN

"arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8"

# Get the TargetGroup Port number

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].Port

30091


# Get Load Balancer ARN

aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:< account id >:targetgroup/k8s-sparkns-servicey-a37caa5e1e/02d10652a64cebd8 | jq .TargetGroups | jq .[].LoadBalancerArns | jq .[]

"arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273"

# Get Listener Port number

aws elbv2 describe-listeners --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:< account id >:loadbalancer/app/k8s-sparkns-ingressy-830efa48aa/12199b1a7baee273 | jq .Listeners | jq .[].Port

18888

To look at how this connects, consider two EMR Studio sessions. The ALB exposes port 18888 to the EMR Studio sessions. The JEG service maps the external port 18888 on the ALB to the dynamic NodePort on the JEG service (in this case, 30091). The JEG service forwards the traffic to the TargetPort 9547, which routes the traffic to the appropriate Spark driver pod. Each notebook session has its own kernel, which has its own respective Spark driver and executor pods, as the following diagram illustrates.

Attach EMR Studio to a virtual cluster and managed endpoint

Each time a user attaches a virtual cluster and a managed endpoint to their Studio Workspace and launches a Spark session, Spark drivers and Spark executors are scheduled. You can see that when you run kubectl to check what pods were launched:

$ kubectl get all -l app=enterprise-gateway
NAME                                  READY   STATUS      RESTARTS   AGE
pod/kb1a317e8-b77b-448c-9b7d-exec-1   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-exec-2   1/1     Running     0          2m30s
pod/kb1a317e8-b77b-448c-9b7d-driver   2/2     Running     0          2m38s

$ kubectl get pods -n sparkns
NAME                                  READY   STATUS      RESTARTS   AGE
jeg-5vbuwntrbzil1-5fc8469d5f-pfdv9    1/1     Running     0          3d7h
kb1a317e8-b77b-448c-9b7d-exec-1       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-exec-2       1/1     Running     0          2m38s
kb1a317e8-b77b-448c-9b7d-driver       2/2     Running     0          2m46s

Each notebook Spark kernel session deploys a driver pod and executor pods that continue running until the kernel session is shut down.

The code in the notebook cells runs in the executor pods that were deployed in the Amazon EKS cluster.

Set up EMR on EKS and EMR Studio

Several steps and pieces are required to set up both EMR on EKS and EMR Studio. Enabling AWS SSO is a prerequisite. You can use the two provided launch scripts in this section or manually deploy it using the steps provided later in this post.

We provide two launch scripts in this post. One is a bash script that uses AWS CloudFormation, eksctl, and AWS Command Line Interface (AWS CLI) commands to provide an end-to-end deployment of a complete solution. The other uses the AWS Cloud Development Kit (AWS CDK) to do so.

The following diagram shows the architecture and components that we deploy.

Prerequisites

Make sure to complete the following prerequisites:

Enable AWS SSO in the same Region where the EMR Studio resides. If the account is part of an organizational account, AWS SSO needs to be enabled in the primary account.
Enable AWS SSO and set up users in AWS SSO. For instructions, see Getting Started and How to create and manage users within AWS Single Sign-On.

For information about the supported IdPs, see Enable AWS Single Sign-On for Amazon EMR Studio.

Bash script

The script is available on GitHub.

Prerequisites

The script requires you to use AWS Cloud9. Follow the instructions in the Amazon EKS Workshop. Make sure to follow these instructions carefully:

After you deploy the AWS Cloud9 desktop, proceed to the next steps.

Preparation

Use the following code to clone the GitHub repo and prepare the AWS Cloud9 prerequisites:

# Download script from the repository
$ git clone https://github.com/aws-samples/amazon-emr-on-eks-emr-studio.git

# Prepare the Cloud9 Desktop pre-requisites
$ cd amazon-emr-on-eks-emr-studio
$ bash ./prepare_cloud9.sh

Deploy the stack

Before running the script, provide the following information:

The AWS account ID and Region, if your AWS Cloud9 desktop isn’t in the same account ID or Region where you want to deploy EMR on EKS
The name of the Amazon Simple Storage Service (Amazon S3) bucket to create
The AWS SSO user to be associated with the EMR Studio session

After the script deploys the stack, the URL to the deployed EMR Studio is displayed:

# Launch the script and follow the instructions to provide user parameters
$ bash ./deploy_eks_cluster_bash.sh

...
Go to https://***. emrstudio-prod.us-east-1.amazonaws.com and login using < SSO user > ...

AWS CDK script

The AWS CDK scripts are available on GitHub. You need to checkout the main branch. The stacks deploy an Amazon EKS cluster and EMR on EKS virtual cluster in a new VPC with private subnets, and optionally an Amazon Managed Apache Airflow (Amazon MWAA) environment and EMR Studio.

Prerequisites

You need the AWS CDK version 1.90.1 or higher. For more information, see Getting started with the AWS CDK.

We use a prefix list to restrict access to some resources to network IP ranges that you approve. Create a prefix list if you don’t already have one.

If you plan to use EMR Studio, you need AWS SSO configured in your account.

Preparation

After you clone the repository and checkout the main branch, create and activate a new Python virtual environment:

# Clone the repository
$ git clone https://github.com/aws-samples/aws-cdk-for-emr-on-eks.git
$ cd aws-cdk-for-emr-on-eks/
$ git checkout main

# 
$ python3 -m venv .venv
$ source .venv/bin/activate

Now install the Python dependencies:

$ pip install -r requirements.txt

Lastly, bootstrap the AWS CDK:

$ cdk bootstrap aws://<account>/<region> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

Deploy the stacks

Synthesize the AWS CDK stacks with the following code:

$ cdk synth \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name>

This command generates four stacks:

emr-eks-cdk – The main stack
mwaa-cdk – Adds Amazon MWAA
studio-cdk – Adds EMR Studio prerequisites
studio-cdk-live – Adds EMR Studio

The following diagram illustrates the resources deployed by the AWS CDK stacks.

Start by deploying the first stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge  \
  --context username=<SSO user name> \
  emr-eks-cdk

If you want to use Apache Airflow as your orchestrator, deploy that stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  mwaa-cdk

Deploy the first EMR Studio stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  studio-cdk

Wait for the managed endpoint to become active. You can check the status by running the following code:

$ aws emr-containers list-managed-endpoints --virtual-cluster-id <cluster ID> | jq '.endpoints[].state'

The virtual cluster ID is available in the AWS CDK output from the emr-eks-cdk stack.

When the endpoint is active, deploy the second EMR Studio stack:

$ cdk deploy <stack name> \
  --context prefix=<prefix list> \
  --context instance=m5.xlarge \
  --context username=<SSO user name> \
  studio-live-cdk

Manual deployment

If you prefer to manually deploy EMR on EKS and EMR Studio, use the steps in this section.

Set up a VPC

If you’re using Amazon EKS v. 1.18, set up a VPC that also has private subnets and appropriately tagged for external load balancers. For tagging, see: Application load balancing on Amazon EKS and Create an EMR Studio service role.

Create an Amazon EKS cluster

Launch an Amazon EKS cluster with at least one managed node group. For instructions, see Setting up and Getting Started with Amazon EKS.

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

To create your IAM policies, roles, IdP, and SSL/TLS certificate, complete the following steps:

Enable cluster access for EMR on EKS.
Create an IdP in IAM based on the EKS OIDC provider URL.
Create an SSL/TLS certificate and place it in AWS Certificate Manager.
Create the relevant IAM policies and roles:
1. Job execution role
2. Update the trust policy for the job execution role
3. Deploy and create the IAM policy for the AWS Load Balancer Controller
4. EMR Studio service role
5. EMR Studio user role
6. EMR Studio user policies associated with AWS SSO users and groups
Register the Amazon EKS cluster with Amazon EMR to create the virtual EMR cluster
Create the appropriate security groups to be attached to each EMR Studio created:
1. Workspace security group
2. Engine security group
Tag the security groups with the appropriate tags. For instructions, see Create an EMR Studio service role.

Required installs in Amazon EKS

Deploy the AWS Load Balancer Controller in the Amazon EKS cluster if you haven’t already done so.

Create EMR on EKS relevant pieces and map the user to EMR Studio

Complete the following steps:

Create at least one EMR virtual cluster associated with the Amazon EKS cluster. For instructions, see Step 1 of Set up Amazon EMR on EKS for EMR Studio.
Create at least one managed endpoint. For instructions, see Step 2 of Set up Amazon EMR on EKS for EMR Studio.
Create at least one EMR Studio; associate the EMR Studio with the private subnets configured with the Amazon EKS cluster. For instructions, see Create an EMR Studio.
When the EMR Studio is available, map an AWS SSO user or group to the EMR Studio and apply an appropriate IAM policy to that user.

Use EMR Studio

To start using EMR Studio, complete the following steps:

Find the URL for EMR Studio by the studios in a Region:

$ aws emr list-studios --region us-east-1
{
    "Studios": [
        {
            "StudioId": "es-XXXXXXXXXXXXXXXXXXXXXX",
            "Name": "emr_studio_1",
            "VpcId": "vpc-XXXXXXXXXXXXXXXXXXXX",
            "Url": "https://es-XXXXXXXXXXXXXXXXXXXXXX.emrstudio-prod.us-east-1.amazonaws.com",
            "CreationTime": "2021-02-10T14:04:13.672000+00:00"
        }
    ]
}

With the listed URL, log in using the AWS SSO username you used earlier.

After authentication, the user is routed to the EMR Studio dashboard.

Choose Create Workspace.
For Workspace name, enter a name.
For Subnet, choose the subnet that corresponds to one of the subnets associated with the managed node group.
For S3 location, enter an S3 bucket where you can store the notebook content.

After you create the Workspace, choose one that is in the Ready status.

In the sidebar, choose the EMR cluster icon.
Under Cluster type¸ choose EMR Cluster on EKS.
Choose the available virtual cluster and available managed endpoint.
Choose Attach.

After it’s attached, EMR Studio displays the kernels available in the Notebook and Console section.

Choose PySpark (Kubernetes) to launch a notebook kernel and start a Spark session.

Because the endpoint configuration here uses AWS Glue for its metastore, you can list the databases and tables connected to the AWS Glue Data Catalog. You can use the following example script to test the setup. Modify the script as necessary for the appropriate database and table that you have in your Data Catalog:

words='Welcome to Amazon EMR Studio'.split(' ')
wordRDD = sc.parallelize(words)
wc = wordRDD.map(lambda word: (word, 1)).reduceByKey(lambda a,b: a+b)
print(wc.collect())

# Connect to Glue Catalog
spark.sql("""show databases like '< Database Name >'""").show(truncate=False)
spark.sql("""show tables in < Database Name >""").show(truncate=False)
# Run a simple select
spark.sql("""select * from < Database Name >.< Table Name > limit 10""").show(truncate=False)

Clean up

To avoid incurring future charges, delete the resources launched here by running remove_setup.sh:

# Launch the script
$ bash ./remove_setup.sh</p>

Conclusion

EMR on EKS allows you to run applications on a common pool of resources inside an Amazon EKS cluster without having to provision infrastructure. EMR Studio is a fully managed Jupyter notebook and tool that provisions kernels that run on EMR clusters, including virtual clusters on Amazon EKS. In this post, we described the architecture of how EMR Studio connects with EMR on EKS and provided scripts to automatically deploy all the components to connect the two services.

If you have questions or suggestions, please leave a comment.

About the Authors

Randy DeFauw is a Principal Solutions Architect at Amazon Web Services. He works with the AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing solutions with AWS Analytics services on their analytics workloads.

Emerging Solutions for Operations Research on AWS

2021-09-03 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/emerging-solutions-for-operations-research-on-aws/

Operations research (OR) uses mathematical and analytical tools to arrive at optimal solutions for complex business problems like workforce scheduling. The mathematical techniques used to solve these problems, such as linear programming and mixed-integer programming, require the use of optimization software (solvers). There are several popular and powerful solvers available, ranging from commercial options like IBM CPLEX to open-source packages like ORTools. While these solvers incorporate decades of algorithmic expertise and can solve large and complex problems effectively, they have some scalability limitations.

In this post, we’ll describe three alternatives that you can consider for solving OR problems (see Figure 1). None of these are as general purpose as traditional solvers, but they should be on your “emerging technologies” radar.

Figure 1. OR optimization options

These include:

A traditional solver running on a compute platform
Reinforcement and machine learning (ML) algorithms running on Amazon SageMaker
A quantum computing algorithm running on Amazon Braket. Experiments are collected in Amazon DynamoDB and the results are visualized in Amazon Elasticsearch Service.

A reference problem and solution

Let’s start with a reference problem and solve it with a traditional solver. We’ll tackle an inventory management issue (see Figure 2). We have a sales depot that supplies products for local sales outlets. For the depot’s Region, there are seven weeks of historical sales data for each product. We also know how much each product costs and for how much it can be sold. Finally, we know the overall weekly capacity of the depot. This depends on logistical constraints like the size of the warehouse and transportation availability. This scenario is loosely based on the Grupo Bimbo retailer’s Kaggle competition and dataset.

Figure 2. Sales depot inventory management scenario

Our job is to place an inventory order to restock our sales depot each week. We quantify our work through a reward function. We want to maximize our revenue:

revenue = (sale price * number of units sold)

(Note that the sample dataset does not include cost of goods sold, only sale price.)

We use these constraints:

total units sold <= depot capacity
0 <= quantity sold of any given item <= forecasted demand for that item

There are many possible solutions to this problem. Using ORTools, we get an average reward (profit) of about $5,700, in about 1,000 simulations.

We can make the scenario slightly more realistic by acknowledging that our sales forecasts are not perfect. After we get the solution from the solver, we can penalize the reward (profit) by subtracting the cost of unsold goods. With this approach, we get a reward of about $2,450.

Solving OR problems with reinforcement learning

An alternative approach to the traditional solver is reinforcement learning (RL). RL is a field of ML that handles problems where the right answer is not immediately known, like playing a game of chess. RL fits our sales depot scenario, because we don’t know how well we will do until after we place the order and are able to view a week of sales activity.

Our sales depot problem resembles a knapsack problem. This is a common OR pattern where we want to fill a container (in this case, our sales depot) with as many items as possible until capacity is reached. Each item has a value (sales price) and a weight (cost). In RL we have to translate this into an observation space, an action space, a state, and a reward (see Figure 3).

The observation space is what our purchasing agent sees. This includes our depot capacity, the sales price, and the forecasted demand. The action space is what our agent can do. In the simplest case, it’s the number of each item to order for the depot, each week. The state is what the agent sees right now, and we model that as the sales results from last week. Finally, the reward function is our profit equation.

One important distinction between OR solvers and RL is that we can’t easily enforce hard constraints in RL. We can limit the amount of an individual product we purchase each week, but we can’t enforce an overall limit on the number of items purchased. We may exceed the capacity of our depot. The simplest way to handle that is to enforce a penalty. There are more sophisticated techniques available, such as interpreting our action as the percentage of budget to spend on each item. But let’s illustrate the simple case here.

Using an RL algorithm from the Ray RLLib package, our reward was $7,000 on average, including penalties for ordering too much of any given item.

Figure 3. Translating OR problem to RL

Solving OR problems with machine learning

It’s possible to model a knapsack problem using ML rather than RL in some cases, and there are simple reference implementations available. The design assumes that we know, or can accurately estimate the reward for a given week. With our simple scenario, we can compute the reward using estimates of future sales. We can use this in a custom loss function to train a neural network.

Solving OR problems with quantum computing

Quantum computers are fundamentally different than the computers most of us use. The appeal of quantum computers is that they can tackle some types of problems much more efficiently than standard computers. Quantum computers can, in theory, solve prime number factoring for decryption in orders of magnitude faster than a standard computer. But they are still in their infancy and limited to the size of problem they can handle, due to hardware limitations.

D-Wave Systems, which make some of the types of quantum computers available through Amazon Braket, has a solver called QBSolv. QBSolv works on a specific type of optimization problem called quadratic unconstrained binary optimization (QUBO). It breaks large problems into smaller pieces that a quantum computer can handle. There is a reference pattern for translating a knapsack problem to a QUBO problem.

Running the sales depot problem through QBSolv on Amazon Braket and using a subset of the data, I was able to obtain a reward of $900. When I tried to run on the full dataset, I was not able to complete the decomposition step, likely due to a hardware limitation.

Conclusion

In this blog post, I review OR problems and traditional OR solvers. I then discussed three alternative approaches, RL, ML, and quantum computing. Each of these alternatives has drawbacks and none is a general-purpose replacement for traditional OR solvers.

However, RL and ML are potentially more scalable because you can train those solutions on a cluster of machines, rather than running an OR solver on a single machine. RL agents can also learn from experience, giving them flexibility to handle scenarios that may be difficult to incorporate into an OR solver. Quantum computing solutions are promising but the current state of the art for quantum computers limits their application to small-scale problems at the moment. All of these alternatives can potentially derive a solution more quickly than an OR solver.

Further Reading:

Overview of OpenSearch Service

Overview of Amazon RDS for PostgreSQL with pgvector

Solution overview

Dataset

Ray cluster for ingestion and creating vector embeddings

Infrastructure setup

Deploy the AWS CDK stack

Pre-requsities

Download the data

Convert JSONL files to Parquet

Download the questions

Set up the Ray cluster

Run ingestion

Ingest data into OpenSearch Service

Ingest data into Amazon RDS

Set up the Ray dashboard

Extensibility of the solution

Conclusion

About the Authors

Failover plan dependencies and considerations

Static stability

Testing your disaster recovery plan

Chaos engineering

Conclusion

Federated learning challenges

Solving for federated learning challenges

Production considerations

Conclusion

Architecture overview

Attach EMR Studio to a virtual cluster and managed endpoint

Set up EMR on EKS and EMR Studio

Prerequisites

Bash script

Prerequisites

Preparation

Deploy the stack

AWS CDK script

Prerequisites

Preparation

Deploy the stacks

Manual deployment

Set up a VPC

Create an Amazon EKS cluster

Create relevant IAM policies, roles, IdP, and SSL/TLS certificate

Required installs in Amazon EKS

Create EMR on EKS relevant pieces and map the user to EMR Studio

Use EMR Studio

Clean up

Conclusion

About the Authors

A reference problem and solution

Solving OR problems with reinforcement learning

Solving OR problems with machine learning

Solving OR problems with quantum computing

Conclusion

The collective thoughts of the interwebz