Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-generative-ai-troubleshooting-for-apache-spark-in-aws-glue-preview/
Organizations run millions of Apache Spark applications each month to prepare, move, and process their data for analytics and machine learning (ML). Building and maintaining these Spark applications is an iterative process, where developers spend significant time testing and troubleshooting their code. During development, data engineers often spend hours sifting through log files, analyzing execution plans, and making configuration changes to resolve issues. This process becomes even more challenging in production environments due to the distributed nature of Spark, its in-memory processing model, and the multitude of configuration options available. Troubleshooting these production issues requires extensive analysis of logs and metrics, often leading to extended downtimes and delayed insights from critical data pipelines.
Today, we are excited to announce the preview of generative AI troubleshooting for Spark in AWS Glue. This is a new capability that enables data engineers and scientists to quickly identify and resolve issues in their Spark applications. This feature uses ML and generative AI technologies to provide automated root cause analysis for failed Spark applications, along with actionable recommendations and remediation steps. This post demonstrates how you can debug your Spark applications with generative AI troubleshooting.
How generative AI troubleshooting for Spark works
For Spark jobs, the troubleshooting feature analyzes job metadata, metrics and logs associated with the error signature of your job to generates a comprehensive root cause analysis. You can initiate the troubleshooting and optimization process with a single click on the AWS Glue console. With this feature, you can reduce your mean time to resolution from days to minutes, optimize your Spark applications for cost and performance, and focus more on deriving value from your data.
Manually debugging Spark applications can get challenging for data engineers and ETL developers due to a few different reasons:
- Extensive connectivity and configuration options to a variety of resources with Spark while makes it a popular data processing platform, often makes it challenging to root cause issues when configurations are not correct, especially related to resource setup (S3 bucket, databases, partitions, resolved columns) and access permissions (roles and keys).
- Spark’s in-memory processing model and distributed partitioning of datasets across its workers while good for parallelism, often make it difficult for users to identify root cause of failures resulting from resource exhaustion issues like out of memory and disk exceptions.
- Lazy evaluation of Spark transformations while good for performance, makes it challenging to accurately and quickly identify the application code and logic which caused the failure from the distributed logs and metrics emitted from different executors.
Let’s look at a few common and complex Spark troubleshooting scenarios where Generative AI Troubleshooting for Spark can save hours of manual debugging time required to deep dive and come up with the exact root cause.
Resource setup or access errors
Spark applications allows to integrate data from a variety of resources like datasets with several partitions and columns on S3 buckets and Data Catalog tables, use the associated job IAM roles and KMS keys for correct permissions to access these resources, and require these resources to exist and be available in the right regions and locations referenced by their identifiers. Users can mis-configure their applications that result in errors requiring deep dive into the logs to understand the root cause being a resource setup or permission issue.
Manual RCA: Failure reason and Spark application Logs
Following example shows the failure reason for such a common setup issue for S3 buckets in a production job run. The failure reason coming from Spark does not help understand the root cause or the line of code that needs to be inspected for fixing it.
After deep diving into the logs of one of the many distributed Spark executors, it becomes clear that the error was caused due to a S3 bucket not existing, however the error stack trace is usually quite long and truncated to understand the precise root cause and location within Spark application where the fix is needed.
With Generative AI Spark Troubleshooting: RCA and Recommendations
With Spark Troubleshooting, you simply click the Troubleshooting analysis button on your failed job run, and the service analyzes the debug artifacts of your failed job to identify the root cause analysis along with the line number in your Spark application that you can inspect to further resolve the issue.

Spark Out of Memory Errors
Let’s take a common but relatively complex error that requires significant manual analysis to conclude its because of a Spark job running out of memory on Spark driver (master node) or one of the distributed Spark executors. Usually, troubleshooting requires an experienced data engineer to manually go over the following steps to identify the root cause.
- Search through Spark driver logs to find the exact error message
- Navigate to the Spark UI to analyze memory usage patterns
- Review executor metrics to understand memory pressure
- Analyze the code to identify memory-intensive operations
This process often takes hours because the failure reason from Spark is usually not challenging to understand that it was a out of memory issue on the Spark driver and what is the remedy to fix it.
Manual RCA: Failure reason and Spark application Logs
Following example shows the failure reason for the error.
Spark driver logs require extensive search to find the exact error message. In this case, the error stack trace consisted of more than hundred function calls and is challenging to understand the precise root cause as the Spark application terminated abruptly.
With Generative AI Spark Troubleshooting: RCA and Recommendations
With Spark Troubleshooting, you can click the Troubleshooting analysis button on your failed job run and get a detailed root cause analysis with the line of code which you can inspect, and also recommendations on best practices to optimize your Spark application for fixing the problem.

Spark Out of Disk Errors
Another complex error pattern with Spark is when it runs out of disk storage on one of the many Spark executors in the Spark application. Similar to Spark OOM exceptions, manual troubleshooting requires extensive deep dive into distributed executor logs and metrics to understand the root cause and identify the application logic or code causing the error due to Spark’s lazy execution of its transformations.
Manual RCA: Failure Reason and Spark application Logs
The associated failure reason and error stack trace in the application logs is again quiet long requiring the user to gather more insights from Spark UI and Spark metrics to identify the root cause and identify the resolution.
With Generative AI Spark Troubleshooting: RCA and Recommendations
With Spark Troubleshooting, it provides the RCA and the line number of code in the script where the data shuffle operation was lazily evaluated by Spark. It also points to best practices guide for optimizing the shuffle or wide transforms or using S3 shuffle plugin on AWS Glue.

Debug AWS Glue for Spark jobs
To use this troubleshooting feature for your failed job runs, complete following:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose your job.
- On the Runs tab, choose your failed job run.
- Choose Troubleshoot with AI to start the analysis.
- You will be redirected to the Troubleshooting analysis tab with generated analysis.
You will see Root Cause Analysis and Recommendations sections.
The service analyzes your job’s debug artifacts and provide the results. Let’s look at a real example of how this works in practice.
We show below an end-to-end example where Spark Troubleshooting helps a user with identification of the root cause for a resource setup issue and help fix the job to resolve the error.

Considerations
During preview, the service focuses on common Spark errors like resource setup and access issues, out of memory exceptions on Spark driver and executors, out of disk exceptions on Spark executors, and will clearly indicate when an error type is not yet supported. Your jobs must run on AWS Glue version 4.0.
The preview is available at no additional charge in all AWS commercial Regions where AWS Glue is available. When you use this capability, any validation runs triggered by you to test proposed solutions will be charged according to the standard AWS Glue pricing.
Conclusion
This post demonstrated how generative AI troubleshooting for Spark in AWS Glue helps your day-to-day Spark application debugging. It simplifies the debugging process for your Spark applications by using generative AI to automatically identify the root cause of failures and provides actionable recommendations to resolve the issues.
To learn more about this new troubleshooting feature for Spark, please visit Troubleshooting Spark jobs with AI.
A special thanks to everyone who contributed to the launch of generative AI troubleshooting for Apache Spark in AWS Glue: Japson Jeyasekaran, Rahul Sharma, Mukul Prasad, Weijing Cai, Jeremy Samuel, Hirva Patel, Martin Ma, Layth Yassin, Kartik Panjabi, Maya Patwardhan, Anshi Shrivastava, Henry Caballero Corzo, Rohit Das, Peter Tsai, Daniel Greenberg, McCall Peltier, Takashi Onikura, Tomohiro Tanaka, Sotaro Hikita, Chiho Sugimoto, Yukiko Iwazumi, Gyan Radhakrishnan, Victor Pleikis, Sriram Ramarathnam, Matt Sampson, Brian Ross, Alexandra Tello, Andrew King, Joseph Barlan, Daiyan Alamgir, Ranu Shah, Adam Rohrscheib, Nitin Bahadur, Santosh Chandrachood, Matt Su, Kinshuk Pahare, and William Vambenepe.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
Vishal Kajjam is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and using ML/AI for designing and building end-to-end solutions to address customers’ data integration needs. In his spare time, he enjoys spending time with family and friends.
Shubham Mehta is a Senior Product Manager at AWS Analytics. He leads generative AI feature development across services such as AWS Glue, Amazon EMR, and Amazon MWAA, using AI/ML to simplify and enhance the experience of data practitioners building data applications on AWS.
Wei Tang is a Software Development Engineer on the AWS Glue team. She is strong developer with deep interests in solving recurring customer problems with distributed systems and AI/ML.
XiaoRun Yu is a Software Development Engineer on the AWS Glue team. He is working on building new features for AWS Glue to help customers. Outside of work, Xiaorun enjoys exploring new places in the Bay Area.
Jake Zych is a Software Development Engineer on the AWS Glue team. He has deep interest in distributed systems and machine learning. In his spare time, Jake likes to create video content and play board games.
Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on distributed systems & new interfaces for data integration and efficiently managing data lakes on AWS.
Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple-to-use interfaces to efficiently manage and transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.






Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
Keerthi Chadalavada is a Senior Software Development Engineer at AWS Glue, focusing on combining generative AI and data integration technologies to design and build comprehensive solutions for customers’ data and analytics needs.
Pradeep Patel is a Software Development Manager on the AWS Glue team. He is passionate about helping customers solve their problems by using the power of the AWS Cloud to deliver highly scalable and robust solutions. In his spare time, he loves to hike and play with web applications.
Chuhan Liu is a Software Engineer at AWS Glue. He is passionate about building scalable distributed systems for big data processing, analytics, and management. He is also keen on using generative AI technologies to provide brand-new experience to customers. In his spare time, he likes sports and enjoys playing tennis.
Vaibhav Naik is a software engineer at AWS Glue, passionate about building robust, scalable solutions to tackle complex customer problems. With a keen interest in generative AI, he likes to explore innovative ways to develop enterprise-level solutions that harness the power of cutting-edge AI technologies.
Gartner recognized AWS strengths as:




Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience on Software Development, Architecture and Analytics from industries like finance, telecom, retail and healthcare.
Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.
Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS.

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.
Chris Helma is a Senior Engineer at Amazon Web Services based in Austin, Texas. He is currently developing tools and techniques to enable users to shift petabyte-scale data workloads into OpenSearch. He has extensive experience building highly-scalable technologies in diverse areas such as search, security analytics, cryptography, and developer productivity. He has functional domain expertise in distributed systems, AI/ML, cloud-native design, and optimizing DevOps workflows. In his free time, he loves to explore specialty coffee and run through the West Austin hills.
Andre Kurait is a Software Development Engineer II at Amazon Web Services, based in Austin, Texas. He is currently working on Migration Assistant for Amazon OpenSearch Service. Prior to joining Amazon OpenSearch, Andre worked within Amazon Health Services. In his free time, Andre enjoys traveling, cooking, and playing in his church sport leagues. Andre holds Bachelor of the Science degrees from the University of Kansas in Computer Science and Mathematics.
Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Darshit Thakkar is a Technical Product Manager with AWS and works with the Amazon Athena team.
Selman Ay is a Data Architect in the AWS Professional Services team.
BP Yau is a Sr Partner Solutions Architect at AWS helping customers architect big data solutions to process data at scale














