TVS Supply Chain Solutions built a file transfer platform using AWS Transfer Family for AS2 for B2B collaboration

2025-01-13 Suresh Kanniappan

Post Syndicated from Suresh Kanniappan original https://aws.amazon.com/blogs/architecture/tvs-supply-chain-solutions-built-a-file-transfer-platform-using-aws-transfer-family-for-as2-for-b2b-collaboration/

TVS Supply Chain Solutions (TVS SCS), promoted by the erstwhile TVS Group and now part of the $3 billion TVS Mobility Group, is an India-based multinational company who pioneered the development of the supply chain solutions market in India.

For the last 2 decades, it has provided supply chain management services to customers in the automotive, consumer goods, defense, and utility sectors in India, the United Kingdom, Europe, and the US. It has a presence in 26 countries with over 17,000 employees and provides services to 78 global Fortune 500 companies. The company went public in 2023.

To meet its customers’ compliance requirements, TVS SCS sought a reliable file transfer solution supporting Applicability Statement 2 (AS2), a business-to-business (B2B) messaging protocol. This post describes how TVS SCS built a secure file transfer platform using AWS Transfer Family for AS2 to exchange Electronic Data Interchange (EDI) documents with their B2B customers in the logistics industry.

Business use case

Several end customers in the manufacturing sector mandated the exchange of EDI documents through the AS2 protocol over the internet. To address this requirement while maintaining manageability, security, and scalability, TVS SCS implemented a file transfer platform on AWS.

TVS SCS serves end customers in the manufacturing sector who require supply chain solutions between various locations:

Source – Plants, warehouses, technology
Destination – OEM vendors, plants, dealers

The process involves the following steps:

The end customer sends a booking request document (booking fact) to TVS SCS.
TVS SCS and the end customer exchange a series of EDI documents.
TVS SCS must acknowledge, process, and update the end customer upon receipt of each EDI document.

TVS SCS built a file transfer platform using Transfer Family with AS2 configuration to achieve the following:

Securely exchange EDI documents with end customers
Provide continuous notification using Message Disposition Notifications (MDNs)

The following diagram illustrates the end-to-end business process (requisition, sourcing, purchase orders, receiving, and invoicing) between TVS SCS and an end customer using the AS2 protocol.

End to end business process

Why the cloud?

TVS SCS chose AWS to build their AS2-compliant file transfer platform for three key reasons:

Data location – All relevant data (such as order creation and customer details) already resides in AWS
Infrastructure management – AWS addresses challenges in the following areas:
- Maintaining highly available and scalable infrastructure
- Maintaining correct AS2 system interoperability with trading partners
- Meeting compliance requirements
Versatility for non-AS2 customers – TVS SCS uses multiple scalable and fully managed AWS services to build customized APIs and webhooks for customers not using AS2

This cloud-based approach allows TVS SCS to focus on their core business while AWS handles the complexities of secure, compliant, and scalable file transfer infrastructure.

Why Transfer Family and AS2?

AS2 is a B2B messaging protocol commonly used for exchanging EDI documents securely with integrity control according to the EDIFACT standard, reliably, and cost-effectively over the internet using the HTTP and HTTPS protocols. B2B integration over the AS2 protocol can be challenging, such as with trading partner onboarding, AS2 EDI integration, firewall configuration, certificate maintenance, and high licensing costs for commercial AS2 solutions.

By choosing Transfer Family with AS2 configuration, TVS SCS addresses these challenges and gains several advantages:

Simplified partner onboarding
Managed infrastructure, reducing maintenance overhead
Built-in security features
Flexible scaling to meet changing business needs
Pay-as-you-go pricing model

Solution overview

The following diagram shows the relationship between the AS2 objects involved in the inbound and outbound processes.

Relationship between the AS2 objects

The following diagram illustrates the solution architecture with AWS services.

Solution Architecture

For step-by-step instructions about creating an AS2 server using Transfer Family, refer to Create an AS2 server using the Transfer Family console.

The allowlisted IP address of the end-customer AS2 server is allowed to communicate with Transfer Family for AS2 on AWS. The customer sends the EDI document through Transfer Family, and the EDIs are stored in Amazon Simple Storage Service (Amazon S3). The business logic is implemented in AWS Lambda functions to read the EDI documents, process them, and update customers. AWS B2B Data Interchange, a fully managed service for automating EDI document transformation, can be considered as a complementary or alternative solution for EDI processing. There are two Lambda functions created: one handles truck booking using NodeJS, and the other handles outbound file transfer (from Amazon S3 to the AS2 server) using Python 3.2.

This architecture enables TVS SCS to securely and efficiently manage the EDI document flow, from receipt through processing and outbound transfer, using scalable and serverless AWS services. The solution provides a compliant and cost-effective approach to B2B data exchange with customers and partners.

Prerequisites

For the prerequisites to configure Transfer Family with AS2, see Configuring AS2. To learn more about the security features in Transfer Family, see Security in AWS Transfer Family.

End customer to TVS SCS communication workflow

The following diagram illustrates the step-by-step process of a truck booking request from an end customer to TVS SCS using AWS services.

End customer to TVS SCS communication workflow

This streamlined workflow demonstrates how TVS SCS uses AWS services to efficiently handle truck booking requests from customers:

The customer initiates a truck booking by sending a booking fact EDI to TVS SCS. The EDI contains details like customer name, date, source location, destination location, and more.
The signed and encrypted booking fact EDI is sent as an inbound HTTP AS2 payload to Transfer Family through the internet.
Transfer Family writes the booking fact EDI to the S3 bucket.
TVS SCS confirms receipt of the booking fact EDI either through the inline HTTP response or an asynchronous HTTP POST request to the originating server.
The EDI exchange audit trail is logged in Amazon CloudWatch Logs.
The EDI document is available for TVS SCS consumption, and a Lambda function processes the document using business logic.

TVS SCS to end customer communication workflow

The following diagram depicts the workflow from TVS SCS to the end customer.

TVS SCS to end customer communication workflow

This workflow demonstrates how TVS SCS uses AWS services to provide timely and accurate updates to customers throughout the delivery process:

The customer confirms the price quote. TVS SCS uploads EDI documents to S3 bucket.
TVS SCS sends a series of updates using the AS2 outbound connector, such as truck allocation, truck departure, truck in-transit status, truck delay notifications, delivery confirmation, and billing invoice. A Lambda function reads the EDI documents from Amazon S3 and runs business logic to generate responses for the end customer.
The EDI documents are sent as an outbound HTTP payload.
The customer AS2 server sends an acknowledgment using MDN.
The EDI exchange audit trail is logged in CloudWatch Logs.
The EDI document is available for the customer’s consumption and further processing.

Results

The following customer challenges were addressed with this solution:

It meets end customer requirements for EDI file exchange through AS2 protocol
It eliminates the need for in-house AS2 infrastructure management
It provides flexibility to add new customers to the file transfer platform

By addressing these challenges and using AWS services, TVS SCS has created a future-proof file transfer platform.

Summary

This post demonstrated how cloud-based services can transform traditional B2B communication processes, offering supply chain companies a path to improved efficiency, compliance, and customer satisfaction. For supply chain providers facing similar challenges, this solution offers a blueprint for modernizing file transfer systems while maintaining compliance with industry standards.

To learn more about this AWS solution for supply chain companies, contact AWS for further assistance. AWS can provide detailed information about implementation, pricing, and how to tailor the solution to your specific business needs. They have teams of experts who can guide companies through the process of modernizing their B2B communication systems using cloud-based services.

About the Authors

Batch data ingestion into Amazon OpenSearch Service using AWS Glue

2025-01-13 Ravikiran Rao

Post Syndicated from Ravikiran Rao original https://aws.amazon.com/blogs/big-data/batch-data-ingestion-into-amazon-opensearch-service-using-aws-glue/

Organizations constantly work to process and analyze vast volumes of data to derive actionable insights. Effective data ingestion and search capabilities have become essential for use cases like log analytics, application search, and enterprise search. These use cases demand a robust pipeline that can handle high data volumes and enable efficient data exploration.

Apache Spark, an open source powerhouse for large-scale data processing, is widely recognized for its speed, scalability, and ease of use. Its ability to process and transform massive datasets has made it an indispensable tool in modern data engineering. Amazon OpenSearch Service—a community-driven search and analytics solution—empowers organizations to search, aggregate, visualize, and analyze data seamlessly. Together, Spark and OpenSearch Service offer a compelling solution for building powerful data pipelines. However, ingesting data from Spark into OpenSearch Service can present challenges, especially with diverse data sources.

This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.

Overview of solution

AWS Glue is a serverless data integration service that simplifies data preparation and integration tasks for analytics, machine learning, and application development. In this post, we focus on batch data ingestion into OpenSearch Service using Spark on AWS Glue.

AWS Glue offers multiple integration options with OpenSearch Service using various open source and AWS managed libraries, including:

In the following sections, we explore each integration method in detail, guiding you through the setup and implementation. As we progress, we incrementally build the architecture diagram shown in the following figure, providing a clear path for creating robust data pipelines on AWS. Each implementation is independent of the others. We chose to showcase them separately, because in a real-world scenario, only one of the three integration methods is likely to be used.

Image showing the high level architecture diagram

You can find the code base in the accompanying GitHub repo. In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Access to a valid AWS account
The latest AWS Command Line Interface (AWS CLI) installed on your local machine
git, awk, curl, and bash installed on your local machine
Permission to create AWS resources
Familiarity with Apache Spark, AWS Glue, and Amazon OpenSearch Service

Clone the repository to your local machine

Clone the repository to your local machine and set the BLOG_DIR environment variable. All the relative paths assume BLOG_DIR is set to the repository location in your machine. If BLOG_DIR is not being used, adjust the path accordingly.

git clone [email protected]:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the necessary infrastructure

The main focus of this post is to demonstrate how to use the mentioned libraries in Spark on AWS Glue to ingest data into OpenSearch Service. Though we center on this core topic, several key AWS components will need to be pre-provisioned for the integration examples, such as a Amazon Virtual Private Cloud (Amazon VPC), multiple Subnets, an AWS Key Management Service (AWS KMS) key, an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue role, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we’ve automated the provisioning of this core infrastructure using the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

Run the following commands

The CloudFormation template will deploy the necessary networking components (such as VPC and subnets), Amazon CloudWatch logging, AWS Glue role, and OpenSearch Service and Elasticsearch domains required to implement the proposed architecture. Use a strong password (8–128 characters, three of which are lowercase, uppercase, numbers, or special characters, and no /, “, or spaces) and adhere to your organization’s security standards for ESMasterUserPassword and OSMasterUserPassword in the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy \
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml \
--stack-name GlueOpenSearchStack \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION> \
--parameter-overrides \
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> \
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You should see a success message such as "Successfully created/updated stack – GlueOpenSearchStack" after the resources have been provisioned successfully. Provisioning this CloudFormation stack typically takes approximately 30 minutes to complete.

On the AWS CloudFormation console, locate the GlueOpenSearchStack stack, and confirm that its status is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You can review the deployed resources on the Resources tab, as shown in the following screenshot.The screenshot does not display all the created resources.

Image showing the "Resources" tab of cloudformation template

Additional setup steps

In this section, we collect essential information, including the S3 bucket name and the OpenSearch Service and Elasticsearch domain endpoints. These details are required for executing the code in subsequent sections.

Capture the details of the provisioned resources

Use the following AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We refer to the values in this file in upcoming steps.

aws cloudformation describe-stacks \
--stack-name GlueOpenSearchStack \
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Value:OutputValue}' \
--output table \
--no-cli-pager \
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Download NY Green Taxi December 2022 dataset and copy to S3 bucket

The purpose of this post is to demonstrate the technical implementation of ingesting data into OpenSearch Service using AWS Glue. Understanding the dataset itself is not essential, aside from its data format, which we discuss in AWS Glue notebooks in later sections. To learn more about the dataset, you can find additional information on the NYC Taxi and Limousine Commission website.

We specifically request that you download the December 2022 dataset, because we have tested the solution using this particular dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Download the required JARs from the Maven repository and copy to S3 bucket

We’ve specified a particular JAR file version to ensure stable deployment experience. However, we recommend adhering to your organization’s security best practices and reviewing any known vulnerabilities in the version of the JAR files before deployment. AWS does not guarantee the security of any open-source code used here. Additionally, please verify the downloaded JAR file’s checksum against the published value to confirm its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/client/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar

In the following sections, we implement the individual data ingestion methods as outlined in the architecture diagram.

Ingest data into OpenSearch Service using the OpenSearch Spark library

In this section, we load an OpenSearch Service index using Spark and the OpenSearch Spark library. We demonstrate this implementation by using AWS Glue notebooks, employing basic authentication using user name and password.

To demonstrate the ingestion mechanisms, we have provided the Spark-and-OpenSearch-Code-Steps.ipynb notebook with detailed instructions. Follow the steps in this section in conjunction with the instructions in the notebook.

Set up the AWS Glue Studio notebook

Complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-OpenSearch-Code-Steps) and choose Save.

Image showing AWS Glue OpenSearch Notebook

Replace the placeholder values in the notebook

Complete the following steps to update the placeholders in the notebook:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell of the notebook to load data into the OpenSearch Service domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Spark write modes (append vs. overwrite)

It is recommended to write data incrementally into OpenSearch Service indexes using the append mode, as demonstrated in Step 8 in the notebook. However, in certain cases, you may need to refresh the entire dataset in the OpenSearch Service index. In these scenarios, you can use the overwrite mode, though it is not advised for large indexes. When using overwrite mode, the Spark library deletes rows from the OpenSearch Service index one by one and then rewrites the data, which can be inefficient for large datasets. To avoid this, you can implement a preprocessing step in Spark to identify insertions and updates, and then write the data into OpenSearch Service using append mode.

Ingest data into Elasticsearch using the Elasticsearch Hadoop library

In this section, we load an Elasticsearch index using Spark and the Elasticsearch Hadoop Library. We demonstrate this implementation by using AWS Glue as the engine for Spark.

Set up the AWS Glue Studio notebook

Complete the following steps to set up the notebook:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-ElasticSearch-Code-Steps) and choose Save.

Image showing AWS Glue Elasticsearch Notebook

Replace the placeholder values in the notebook

Complete the following steps:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell in the notebook to load data to the Elasticsearch domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Ingest data into OpenSearch Service using the AWS Glue OpenSearch Service connection

In this section, we load an OpenSearch Service index using Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Complete the following steps to create an AWS Glue Visual ETL job:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Visual ETL

This will open the AWS Glue job visual editor. Image showing AWS console page for AWS Glue to open Visual ETL