Post Syndicated from Basheer Sheriff original https://aws.amazon.com/blogs/big-data/accelerate-analytics-on-amazon-opensearch-service-with-aws-glue-through-its-native-connector/
As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyze data. Data moves from online systems such as databases, CRMs, and marketing systems into data stores such as data lakes on Amazon Simple Storage Service (Amazon S3), data warehouses in Amazon Redshift, and purpose-built stores such as Amazon OpenSearch Service, Amazon Neptune, and Amazon Timestream.
OpenSearch Service is used for multiple purposes, such as observability, search analytics, consolidation, cost savings, compliance, and integration. OpenSearch Service also has vector database capabilities that let you implement semantic search and Retrieval Augmented Generation (RAG) with large language models (LLMs) to build recommendation and media search engines. Previously, to integrate with OpenSearch Service, you could use open source clients for specific programming languages such as Java, Python, or JavaScript or use REST APIs provided by OpenSearch Service.
Movement of data across data lakes, data warehouses, and purpose-built stores is achieved by extract, transform, and load (ETL) processes using data integration services such as AWS Glue. AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides both visual and code-based interfaces to make data integration effortless. Using a native AWS Glue connector increases agility, simplifies data movement, and improves data quality.
In this post, we explore the AWS Glue native connector to OpenSearch Service and discover how it eliminates the need to build and maintain custom code or third-party tools to integrate with OpenSearch Service. This accelerates analytics pipelines and search use cases, providing instant access to your data in OpenSearch Service. You can now use data stored in OpenSearch Service indexes as a source or target within the AWS Glue Studio no-code, drag-and-drop visual interface or directly in an AWS Glue ETL job script. When combined with AWS Glue ETL capabilities, this new connector simplifies the creation of ETL pipelines, enabling ETL developers to save time building and maintaining data pipelines.
Solution overview
The new native OpenSearch Service connector is a powerful tool that can help organizations unlock the full potential of their data. It enables you to efficiently read and write data from OpenSearch Service without needing to install or manage OpenSearch Service connector libraries.
In this post, we demonstrate exporting the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset into OpenSearch Service using the AWS Glue native connector. The following diagram illustrates the solution architecture.

By the end of this post, your visual ETL job will resemble the following screenshot.

Prerequisites
To follow along with this post, you need a running OpenSearch Service domain. For setup instructions, refer to Getting started with Amazon OpenSearch Service. For simplicity, configure the domain for public access, and note the primary user name and password for later use.
Note that as of this writing, the AWS Glue OpenSearch Service connector doesn’t support Amazon OpenSearch Serverless, so you need to set up a provisioned domain.
Create an S3 bucket
We use an AWS CloudFormation template to create an S3 bucket to store the sample data. Complete the following steps:
- Choose Launch Stack.

- On the Specify stack details page, enter a name for the stack.
- Choose Next.
- On the Configure stack options page, choose Next.
- On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Submit.
The stack takes about 2 minutes to deploy.
Create an index in the OpenSearch Service domain
To create an index in the OpenSearch Service domain, complete the following steps:
- On the OpenSearch Service console, choose Domains in the navigation pane.
- Open the domain you created as a prerequisite.
- Choose the link under OpenSearch Dashboards URL.

- On the navigation menu, choose Dev Tools.

- Enter the following code to create the index:
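The exact mapping depends on the columns in your data. As a minimal sketch, the following Python builds a create-index request for yellow-taxi-index in the form the Dev Tools console accepts. The field list and field types are assumptions based on the public TLC yellow taxi schema; only tpep_pickup_datetime is named in this post, so adjust the rest to match your Parquet files.

```python
import json

INDEX_NAME = "yellow-taxi-index"  # index name used throughout this post

# Assumed subset of TLC yellow taxi columns and types (not taken from the post).
ASSUMED_FIELDS = {
    "VendorID": "integer",
    "tpep_pickup_datetime": "date",
    "tpep_dropoff_datetime": "date",
    "trip_distance": "float",
    "PULocationID": "integer",
    "DOLocationID": "integer",
    "fare_amount": "float",
    "total_amount": "float",
}


def build_index_body(fields):
    """Return a create-index body with one explicit mapping per field."""
    properties = {name: {"type": field_type} for name, field_type in fields.items()}
    return {"mappings": {"properties": properties}}


def dev_tools_request(index_name, body):
    """Format the request as a PUT line followed by the JSON body."""
    return f"PUT /{index_name}\n{json.dumps(body, indent=2)}"


if __name__ == "__main__":
    # Print the request so it can be pasted into the Dev Tools console.
    print(dev_tools_request(INDEX_NAME, build_index_body(ASSUMED_FIELDS)))
```

Mapping tpep_pickup_datetime as a date matters later in the walkthrough, because the index pattern uses it as the time field.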
Create a secret for OpenSearch Service credentials
In this post, we use basic authentication and store our authentication credentials securely using AWS Secrets Manager. Complete the following steps to create a Secrets Manager secret:
- On the Secrets Manager console, choose Secrets in the navigation pane.
- Choose Store a new secret.
- For Secret type, select Other type of secret.
- For Key/value pairs, enter the user name under the key opensearch.net.http.auth.user and the password under the key opensearch.net.http.auth.pass.
- Choose Next.

- Complete the remaining steps to create your secret.
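The same secret can also be created programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the function names and the secret name you pass in are hypothetical. The two key names are the ones listed in the preceding steps.

```python
import json

USER_KEY = "opensearch.net.http.auth.user"
PASS_KEY = "opensearch.net.http.auth.pass"


def build_secret_string(username, password):
    """Serialize the credentials under the two keys used in this post."""
    return json.dumps({USER_KEY: username, PASS_KEY: password})


def create_opensearch_secret(secret_name, username, password):
    """Store the credentials in Secrets Manager (hypothetical helper)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("secretsmanager")
    return client.create_secret(
        Name=secret_name,
        SecretString=build_secret_string(username, password),
    )
```

Keeping the JSON construction separate from the API call makes it easy to check the key names before anything is written to your account.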
Create an IAM role for the AWS Glue job
Complete the following steps to configure an AWS Identity and Access Management (IAM) role for the AWS Glue job:
- On the IAM console, create a new role.
- Attach the AWS managed policy AWSGlueServiceRole.
- Attach the following policy to the role. Replace each ARN with the corresponding ARN of the OpenSearch Service domain, Secrets Manager secret, and S3 bucket.
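As a minimal sketch of what such a policy could look like, the following Python assembles a policy document with one statement per resource. The statement IDs and the specific action lists are assumptions chosen for illustration (HTTP access to the domain, read access to the secret, read access to the bucket), and the function name is hypothetical. Scope the actions to what your job actually needs.

```python
import json


def build_glue_job_policy(domain_arn, secret_arn, bucket_arn):
    """Return an IAM policy document covering the domain, secret, and bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "OpenSearchAccess",  # assumed action list
                "Effect": "Allow",
                "Action": ["es:ESHttpGet", "es:ESHttpPost", "es:ESHttpPut"],
                "Resource": f"{domain_arn}/*",
            },
            {
                "Sid": "ReadSecret",  # assumed action list
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": secret_arn,
            },
            {
                "Sid": "ReadSampleData",  # assumed action list
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket, GetObject to its objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace them with the ARNs from your account.
    policy = build_glue_job_policy(
        "arn:aws:es:us-east-1:111122223333:domain/example-domain",
        "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-secret",
        "arn:aws:s3:::example-bucket",
    )
    print(json.dumps(policy, indent=2))
```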
Create an AWS Glue connection
Before you can use the OpenSearch Service connector, you need to create an AWS Glue connection for connecting to OpenSearch Service. Complete the following steps:
- On the AWS Glue console, choose Connections in the navigation pane.
- Choose Create connection.
- For Name, enter opensearch-connection.
- For Connection type, choose Amazon OpenSearch.

- For Domain endpoint, enter the domain endpoint of OpenSearch Service.
- For Port, enter HTTPS port 443.
- For Resource, enter yellow-taxi-index.
In this context, resource means the index of OpenSearch Service where the data is read from or written to.
- Select Wan only enabled.
- For AWS Secret, choose the secret you created earlier.

- Optionally, if you’re connecting to an OpenSearch Service domain in a VPC, specify a VPC, subnet, and security group to run AWS Glue jobs inside the VPC. For security groups, a self-referencing inbound rule is required. For more information, see Setting up networking for development for AWS Glue.

- Choose Create connection.
Create an ETL job using AWS Glue Studio
Complete the following steps to create your AWS Glue ETL job:
- On the AWS Glue console, choose Visual ETL in the navigation pane.
- Choose Create job and Visual ETL.
- On the AWS Glue Studio console, change the job name to opensearch-etl.
- Choose Amazon S3 for the data source and Amazon OpenSearch for the data target.
Between the source and target, you can optionally insert transform nodes. In this solution, we create a job that has only source and target nodes for simplicity.

- In the Data source properties section, specify the S3 bucket where the sample data is located, and choose Parquet as the data format.

- In the Data sink properties section, specify the connection you created in the previous section (opensearch-connection).

- Choose the Job details tab, and in the Basic properties section, specify the IAM role you created earlier.

- Choose Save to save your job, and choose Run to run the job.
- Navigate to the Runs tab to check the status of the job. When the job completes without errors, the run status shows Succeeded.

- After the job runs successfully, navigate to OpenSearch Dashboards, and log in to the dashboard.
- Choose Dashboards Management on the navigation menu.
- Choose Index patterns, and choose Create index pattern.
- Enter yellow-taxi-index for Index pattern name.
- Choose tpep_pickup_datetime for Time.
- Choose Create index pattern. This index pattern will be used to visualize the index.
- Choose Discover on the navigation menu, and choose yellow-taxi-index.

You have now created an index in OpenSearch Service and loaded data into it from Amazon S3 in just a few steps using the AWS Glue OpenSearch Service native connector.
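The same source-to-target flow can be written directly in an AWS Glue ETL job script. The following is a minimal sketch, not the script that AWS Glue Studio generates: the helper names and the S3 path you pass in are hypothetical, and the awsglue and pyspark modules are only available inside the AWS Glue runtime, so run_job is meant to be called from a Glue job.

```python
CONNECTION_NAME = "opensearch-connection"  # connection created earlier
INDEX_NAME = "yellow-taxi-index"           # index created earlier


def s3_source_options(bucket_path):
    """Connection options for reading every file under an S3 prefix."""
    return {"paths": [bucket_path], "recurse": True}


def opensearch_sink_options(connection_name, index_name):
    """Connection options that point the writer at a connection and an index."""
    return {"connectionName": connection_name, "opensearch.resource": index_name}


def run_job(bucket_path):
    """Read Parquet from Amazon S3 and write it to OpenSearch Service."""
    # Imported lazily: these modules exist only in the AWS Glue runtime.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        format="parquet",
        connection_options=s3_source_options(bucket_path),
    )
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="opensearch",
        connection_options=opensearch_sink_options(CONNECTION_NAME, INDEX_NAME),
    )
```

Transform steps, if you need them, would go between the read and the write, mirroring the optional transform nodes in the visual editor.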
Clean up
To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- From the list of jobs, select the job opensearch-etl, and on the Actions menu, choose Delete.
- On the AWS Glue console, choose Data connections in the navigation pane.
- Select opensearch-connection from the list of connectors, and on the Actions menu, choose Delete.
- On the IAM console, choose Roles in the navigation pane.
- Select the role you created for the AWS Glue job and delete it.
- On the CloudFormation console, choose Stacks in the navigation pane.
- Select the stack you created for the S3 bucket and sample data and delete it.
- On the Secrets Manager console, choose Secrets in the navigation pane.
- Select the secret you created, and on the Actions menu, choose Delete.
- Reduce the waiting period to 7 days and schedule the deletion.
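Part of this cleanup can be scripted. The following is a minimal sketch, assuming boto3 clients for AWS Glue and Secrets Manager are passed in; the function name is hypothetical, and it covers only the job, the connection, and the secret. The IAM role and the CloudFormation stack still need to be deleted separately.

```python
def clean_up(glue, secrets, job_name, connection_name, secret_id, recovery_days=7):
    """Delete the Glue job and connection, then schedule the secret for deletion."""
    glue.delete_job(JobName=job_name)
    glue.delete_connection(ConnectionName=connection_name)
    # 7 days is the shortest recovery window Secrets Manager accepts
    # when a window is specified.
    secrets.delete_secret(SecretId=secret_id, RecoveryWindowInDays=recovery_days)


# Example call (requires boto3 and AWS credentials; the secret name is a placeholder):
#   import boto3
#   clean_up(
#       boto3.client("glue"),
#       boto3.client("secretsmanager"),
#       "opensearch-etl",
#       "opensearch-connection",
#       "example-secret-name",
#   )
```

Passing the clients in as arguments keeps the function easy to exercise with stand-in objects before you point it at a real account.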
Conclusion
The integration of AWS Glue with OpenSearch Service adds the powerful ability to perform data transformation when integrating with OpenSearch Service for analytics use cases. This enables organizations to streamline data integration and analytics with OpenSearch Service. The serverless nature of AWS Glue means no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this native Spark connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.
About the authors
Basheer Sheriff is a Senior Solutions Architect at AWS. He loves to help customers solve interesting problems leveraging new technology. He is based in Melbourne, Australia, and likes to play sports such as football and cricket.
Shunsuke Goto is a Prototyping Engineer working at AWS. He works closely with customers to build their prototypes and also helps customers build analytics systems.
Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.
Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-native technologies and is based out of Seattle, Washington.
Jay is the Customer Success Engineering leader for OpenSearch Service. He focuses on the overall customer experience with OpenSearch. Jay is interested in large-scale OpenSearch adoption and distributed data stores, and is based out of Northern Virginia.
Rich Giuli is a Principal Solutions Architect at Amazon Web Services (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work, Rich enjoys running and playing guitar.




Fraser Sequeira is a Startups Solutions Architect with AWS based in Mumbai, India. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS. He enjoys staying on top of the latest technology innovations from AWS and sharing his learnings with customers. He spends his free time tinkering with new open source technologies.
Kenneth Walsh is a New York-based Sr. Solutions Architect whose focus is AWS Marketplace. Kenneth is passionate about cloud computing and loves being a trusted advisor for his customers. When he’s not working with customers on their journey to the cloud, he enjoys cooking, audiobooks, movies, and spending time with his family and dog.
Max Winter is a Principal Solutions Architect for AWS Financial Services clients. He works with ISV customers to design solutions that allow them to leverage the power of AWS services to automate and optimize their business. In his free time, he loves hiking and biking with his family, music and theater, digital photography, 3D modeling, and imparting a love of science and reading to his two nearly-teenagers.
Manjula Nagineni is a Senior Solutions Architect with AWS based in New York. She works with major financial service institutions, architecting and modernizing their large-scale applications while adopting AWS Cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in software development, analytics, and architecture across multiple domains such as finance, retail, and telecom.



Prabhat Chaturvedi



Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.



Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.
Aish Gunasekar is a Specialist Solutions Architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.
Satish Nandi is a Senior Technical Product Manager for Amazon OpenSearch Service.
Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.
Hendy Wijaya is a Senior OpenSearch Specialist Solutions Architect at Amazon Web Services. Hendy enables customers to leverage AWS services to achieve their business objectives and gain competitive advantages. He is passionate in collaborating with customers in getting the best out of OpenSearch and Amazon OpenSearch
Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he is also attempting to master the art of cooking in his free time – The taste buds are excited, but the kitchen might disagree.
Dhwanil Patel is a Software Developer Engineer working on Amazon OpenSearch Service. He likes to contribute to open-source software development, and is passionate about distributed systems.
Shweta Thareja is a Principal Engineer working on Amazon OpenSearch Service. She is interested in building distributed and autonomous systems. She is a maintainer and an active contributor to OpenSearch.