Post Syndicated from Gagan Brahmi original https://aws.amazon.com/blogs/big-data/simplify-and-speed-up-apache-spark-applications-on-amazon-redshift-data-with-amazon-redshift-integration-for-apache-spark/
Customers use Amazon Redshift to run their business-critical analytics on petabytes of structured and semi-structured data. Apache Spark is a popular framework that you can use to build applications for use cases such as ETL (extract, transform, and load), interactive analytics, and machine learning (ML). Apache Spark enables you to build applications in a variety of languages, such as Java, Scala, and Python, by accessing the data in your Amazon Redshift data warehouse.
Amazon Redshift integration for Apache Spark helps developers seamlessly build and run Apache Spark applications on Amazon Redshift data. Developers can use AWS analytics and ML services such as Amazon EMR, AWS Glue, and Amazon SageMaker to effortlessly build Apache Spark applications that read from and write to their Amazon Redshift data warehouse. You can do so without compromising on the performance of your applications or transactional consistency of your data.
In this post, we discuss why Amazon Redshift integration for Apache Spark is critical and efficient for analytics and ML. In addition, we discuss use cases that use Amazon Redshift integration with Apache Spark to drive business impact. Finally, we walk you through step-by-step examples of how to use this official AWS connector in an Apache Spark application.
Amazon Redshift integration for Apache Spark
The Amazon Redshift integration for Apache Spark minimizes the cumbersome and often manual process of setting up a spark-redshift connector (community version) and shortens the time needed to prepare for analytics and ML tasks. You only need to specify the connection to your data warehouse, and you can start working with Amazon Redshift data from your Apache Spark-based applications within minutes.
You can use several pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from your Amazon Redshift data warehouse to the consuming Apache Spark application. This allows you to improve the performance of your applications. Amazon Redshift admins can easily identify the SQL generated from Spark-based applications. In this post, we show how you can find out the SQL generated by the Apache Spark job.
Moreover, Amazon Redshift integration for Apache Spark uses Parquet file format when staging the data in a temporary directory. Amazon Redshift uses the UNLOAD SQL statement to store this temporary data on Amazon Simple Storage Service (Amazon S3). The Apache Spark application retrieves the results from the temporary directory (stored in Parquet file format), which improves performance.
You can also help make your applications more secure by utilizing AWS Identity and Access Management (IAM) credentials to connect to Amazon Redshift.
Amazon Redshift integration for Apache Spark is built on top of the spark-redshift connector (community version) and enhances it for performance and security, helping you gain up to 10 times faster application performance.
Use cases for Amazon Redshift integration with Apache Spark
For our use case, the leadership of a product-based company wants to know the sales for each product across multiple markets. Because sales fluctuate dynamically, it has become a challenge for the leadership to track them across markets. Overall sales are declining, and the company leadership wants to find out which markets aren’t performing so that they can target these markets for promotion campaigns.
For sales across multiple markets, the product sales data such as orders, transactions, and shipment data is available on Amazon S3 in the data lake. The data engineering team can use Apache Spark with Amazon EMR or AWS Glue to analyze this data in Amazon S3.
The inventory data is available in Amazon Redshift. Similarly, the data engineering team can analyze this data with Apache Spark using Amazon EMR or an AWS Glue job by using the Amazon Redshift integration for Apache Spark to perform aggregations and transformations. The aggregated and transformed dataset can be stored back into Amazon Redshift using the Amazon Redshift integration for Apache Spark.
Using a distributed framework like Apache Spark with the Amazon Redshift integration for Apache Spark can provide the visibility across the data lake and data warehouse to generate sales insights. These insights can be made available to the business stakeholders and line of business users in Amazon Redshift to make informed decisions to run targeted promotions for the low revenue market segments.
Additionally, we can use the Amazon Redshift integration with Apache Spark in the following use cases:
- An Amazon EMR or AWS Glue customer running Apache Spark jobs wants to transform data and write that into Amazon Redshift as a part of their ETL pipeline
- An ML customer uses Apache Spark with SageMaker for feature engineering for accessing and transforming data in Amazon Redshift
- An Amazon EMR, AWS Glue, or SageMaker customer uses Apache Spark for interactive data analysis with data on Amazon Redshift from notebooks
Examples for Amazon Redshift integration for Apache Spark in an Apache Spark application
In this post, we show the steps to connect to Amazon Redshift from Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), Amazon EMR Serverless, and AWS Glue using a common script. In the following sample code, we generate a report showing the quarterly sales for the year 2008. To do that, we join two Amazon Redshift tables using an Apache Spark DataFrame, run a predicate pushdown, aggregate and sort the data, and write the transformed data back to Amazon Redshift. The script uses PySpark.
The script uses IAM-based authentication for Amazon Redshift. IAM roles used by Amazon EMR and AWS Glue should have the appropriate permissions to authenticate with Amazon Redshift, and access to an S3 bucket for temporary data storage.
The following example policy allows the IAM role to call the GetClusterCredentials operation:
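As a minimal sketch of such a policy, the following Python builds the policy document as a dictionary and prints it as JSON. The helper name redshift_credentials_policy, the Region, and the account ID are placeholders introduced for illustration, not values from this post; the cluster, database, and user names match the sample environment used later in this post. Scope the resources to your own cluster before using anything like it.

```python
import json


def redshift_credentials_policy(region, account_id, cluster_id, db_name, db_user):
    """Build an IAM policy document that allows redshift:GetClusterCredentials.

    Every argument is a caller-supplied placeholder; nothing here is read
    from a real AWS account.
    """
    prefix = f"arn:aws:redshift:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "redshift:GetClusterCredentials",
                "Resource": [
                    # Database the temporary credentials may log in to
                    f"{prefix}:dbname:{cluster_id}/{db_name}",
                    # Database user the temporary credentials are issued for
                    f"{prefix}:dbuser:{cluster_id}/{db_user}",
                ],
            }
        ],
    }


# Placeholder Region and account ID; replace with your own values.
example_policy = redshift_credentials_policy(
    region="us-east-1",
    account_id="123456789012",
    cluster_id="redshift-spark-redshift-cluster",
    db_name="dev",
    db_user="redshiftmasteruser",
)
print(json.dumps(example_policy, indent=2))
```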
The following example policy allows access to an S3 bucket for temporary data storage:
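A comparable sketch for the temporary-storage bucket follows. The exact set of S3 actions is an assumption (list, read, write, and delete on the staging prefix) rather than the policy from this post, and temp_bucket_policy, the bucket name, and the prefix are hypothetical placeholders.

```python
import json


def temp_bucket_policy(bucket_name, prefix=""):
    """Build an IAM policy document for the connector's S3 staging area.

    The action list is an assumed minimal set; tighten or extend it to
    match your own security requirements.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permissions
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arn,
            },
            {
                # Object-level permissions, limited to the staging prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/{prefix}*",
            },
        ],
    }


# Placeholder bucket name and prefix; replace with your own values.
print(json.dumps(temp_bucket_policy("redshift-spark-databucket-xxxxxxx", "redshift-temp-dir/"), indent=2))
```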
The complete script is as follows:
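What follows is a hedged sketch of a script with the shape described above, not the authors’ original listing. It assumes the standard TICKIT sample tables tickit.sales and tickit.date with the columns dateid, qtr, year, and qtysold; the helper names are hypothetical; and the target table tickit.test_emr is the one validated later in this post. The functions take an existing SparkSession, so the connector and JDBC driver must be available on the cluster that runs them.

```python
# Data source name used by the Amazon Redshift integration for Apache Spark
# (shared with the community spark-redshift connector).
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def connection_options(jdbc_iam_url, temp_dir, aws_role):
    """Collect the three environment-specific values into connector options."""
    return {"url": jdbc_iam_url, "tempdir": temp_dir, "aws_iam_role": aws_role}


def read_table(spark, options, table):
    """Return a DataFrame backed by an Amazon Redshift table."""
    return (
        spark.read.format(REDSHIFT_FORMAT)
        .options(**options)
        .option("dbtable", table)
        .load()
    )


def quarterly_sales(spark, options, year=2008):
    """Join sales with dates, filter to one year, and total quantity per quarter."""
    sales = read_table(spark, options, "tickit.sales")  # assumed sample table
    dates = read_table(spark, options, "tickit.date")   # assumed sample table
    return (
        sales.join(dates, "dateid")
        .where(f"year = {int(year)}")   # predicate eligible for pushdown
        .groupBy("qtr")
        .sum("qtysold")                 # aggregate eligible for pushdown
        .withColumnRenamed("sum(qtysold)", "total_quantity_sold")
        .orderBy("qtr")
    )


def write_table(df, options, table):
    """Write a DataFrame back to Amazon Redshift, replacing the target table."""
    (
        df.write.format(REDSHIFT_FORMAT)
        .options(**options)
        .option("dbtable", table)
        .mode("overwrite")
        .save()
    )


def run_report(spark, options, target_table="tickit.test_emr"):
    """Build the quarterly report and store it in the target table."""
    report = quarterly_sales(spark, options)
    write_table(report, options, target_table)
    return report


# On a cluster, obtain the session with
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("SparkRedshiftConnector").getOrCreate()
# and call run_report(spark, connection_options(jdbc_iam_url, temp_dir, aws_role)).
```

Keeping the filter and aggregation as plain DataFrame operations leaves the connector free to translate them into SQL that runs inside Amazon Redshift.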
If you plan to use the preceding script in your environment, make sure you replace the values for the following variables with the appropriate values for your environment: jdbc_iam_url, temp_dir, and aws_role.
In the next section, we walk through the steps to run this script to aggregate a sample dataset that is made available in Amazon Redshift.
Prerequisites
Before we begin, make sure the following prerequisites are met:
- You have an AWS account
- You have access to create an AWS CloudFormation stack
Deploy resources using AWS CloudFormation
Complete the following steps to deploy the CloudFormation stack:
- Sign in to the AWS Management Console, then launch the CloudFormation stack:

You can also download the CloudFormation template to create the resources mentioned in this post through infrastructure as code (IaC). Use this template when launching a new CloudFormation stack.
- Scroll down to the bottom of the page to select I acknowledge that AWS CloudFormation might create IAM resources under Capabilities, then choose Create stack.

The stack creation process takes 15–20 minutes to complete. The CloudFormation template creates the following resources:
- An Amazon VPC with the needed subnets, route tables, and NAT gateway
- An S3 bucket with the name redshift-spark-databucket-xxxxxxx (note that xxxxxxx is a random string to make the bucket name unique)
- An Amazon Redshift cluster with sample data loaded inside the database dev and the primary user redshiftmasteruser. For the purpose of this blog post, redshiftmasteruser with administrative permissions is used. However, it is recommended to use a user with fine-grained access control in a production environment.
- An IAM role to be used for Amazon Redshift with the ability to request temporary credentials from the Amazon Redshift cluster’s dev database
- Amazon EMR Studio with the needed IAM roles
- Amazon EMR release version 6.9.0 on an EC2 cluster with the needed IAM roles
- An Amazon EMR Serverless application release version 6.9.0
- An AWS Glue connection and AWS Glue job version 4.0
- A Jupyter notebook to run using Amazon EMR Studio using Amazon EMR on an EC2 cluster
- A PySpark script to run using Amazon EMR Studio and Amazon EMR Serverless
- After the stack creation is complete, choose the stack name redshift-spark and navigate to the Outputs tab.
We utilize these output values later in this post.
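If you prefer to read the stack outputs programmatically, the following sketch flattens a CloudFormation describe_stacks response into a dictionary. The helper name stack_outputs and the sample values are hypothetical; the output keys are the ones referenced in this post. With boto3, the response would come from boto3.client("cloudformation").describe_stacks(StackName="redshift-spark").

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey to OutputValue for the first stack in the response."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        raise ValueError("response contains no stacks")
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in stacks[0].get("Outputs", [])
    }


# Sample response shaped like the describe_stacks result; values are placeholders.
sample_response = {
    "Stacks": [
        {
            "StackName": "redshift-spark",
            "Outputs": [
                {"OutputKey": "EMRStudioURL", "OutputValue": "https://example.invalid/studio"},
                {"OutputKey": "GlueJob", "OutputValue": "https://example.invalid/glue-job"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)
print(outputs["EMRStudioURL"])
```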

In the next sections, we show the steps for Amazon Redshift integration for Apache Spark from Amazon EMR on Amazon EC2, Amazon EMR Serverless, and AWS Glue.
Use Amazon Redshift integration with Apache Spark on Amazon EMR on EC2
Starting with Amazon EMR release version 6.9.0, the connector for Amazon Redshift integration for Apache Spark and the Amazon Redshift JDBC driver are available locally on Amazon EMR. These files are located under the /usr/share/aws/redshift/ directory. Previous versions of Amazon EMR provide only the community version of the spark-redshift connector.
The following example shows how to connect to Amazon Redshift using a PySpark kernel via an Amazon EMR Studio notebook. The CloudFormation stack created Amazon EMR Studio, Amazon EMR on an EC2 cluster, and a Jupyter notebook available to run. To go through this example, complete the following steps:
- Download the Jupyter notebook made available in the S3 bucket for you:
  - In the CloudFormation stack outputs, look for the value for EMRStudioNotebook, which should point to the redshift-spark-emr.ipynb notebook available in the S3 bucket.
  - Choose the link or open the link in a new tab by copying the URL for the notebook.
  - After you open the link, download the notebook by choosing Download, which will save the file locally on your computer.
- Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
- In the navigation pane, choose Workspaces.

- Choose Create Workspace.

- Provide a name for the Workspace, for instance redshift-spark.
- Expand the Advanced configuration section and select Attach Workspace to an EMR cluster.
- Under Attach to an EMR cluster, choose the EMR cluster with the name emrCluster-Redshift-Spark.
- Choose Create Workspace.

- After the Amazon EMR Studio Workspace is created and in Attached status, you can access the Workspace by choosing the name of the Workspace.
This should open the Workspace in a new tab. Note that if you have a pop-up blocker, you may have to allow the Workspace to open or disable the pop-up blocker.

In the Amazon EMR Studio Workspace, we now upload the Jupyter notebook we downloaded earlier.
- Choose Upload to browse your local file system and upload the Jupyter notebook (redshift-spark-emr.ipynb).

- Choose (double-click) the redshift-spark-emr.ipynb notebook within the Workspace to open the notebook.

The notebook provides the details of different tasks that it performs. Note that in the section Define the variables to connect to Amazon Redshift cluster, you don’t need to update the values for jdbc_iam_url, temp_dir, and aws_role because these are updated for you by AWS CloudFormation. AWS CloudFormation has also performed the steps mentioned in the Prerequisites section of the notebook.
You can now start running the notebook.
- Run the individual cells by selecting them and then choosing Play.
You can also use the key combination of Shift+Enter or Shift+Return. Alternatively, you can run all the cells by choosing Run All Cells on the Run menu.
- Find the predicate pushdown operation performed on the Amazon Redshift cluster by the Amazon Redshift integration for Apache Spark.
We can also see the temporary data stored on Amazon S3 in the optimized Parquet format. The output can be seen from running the cell in the section Get the last query executed on Amazon Redshift.

- To validate the table created by the job from Amazon EMR on Amazon EC2, navigate to the Amazon Redshift console and choose the cluster redshift-spark-redshift-cluster on the Provisioned clusters dashboard page.

- In the cluster details, on the Query data menu, choose Query in query editor v2.

- Choose the cluster in the navigation pane and connect to the Amazon Redshift cluster when it requests authentication.
- Select Temporary credentials.
- For Database, enter dev.
- For User name, enter redshiftmasteruser.
- Choose Save.

- In the navigation pane, expand the cluster redshift-spark-redshift-cluster, expand the dev database, expand tickit, and expand Tables to list all the tables inside the schema tickit.
You should find the table test_emr.

- Choose (right-click) the table test_emr, then choose Select table to query the table.

- Choose Run to run the SQL statement.

Use Amazon Redshift integration with Apache Spark on Amazon EMR Serverless
Amazon EMR release version 6.9.0 and above provide the Amazon Redshift integration for Apache Spark JARs (managed by Amazon Redshift) and Amazon Redshift JDBC JARs locally on Amazon EMR Serverless as well. These files are located under the /usr/share/aws/redshift/ directory. In the following example, we use the Python script made available in the S3 bucket by the CloudFormation stack we created earlier.
- In the CloudFormation stack outputs, make a note of the value for EMRServerlessExecutionScript, which is the location of the Python script in the S3 bucket.
- Also note the value for EMRServerlessJobExecutionRole, which is the IAM role to be used with running the Amazon EMR Serverless job.
- Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
- Choose Applications under Serverless in the navigation pane.

You will find an EMR application created by the CloudFormation stack with the name emr-spark-redshift.
- Choose the application name to submit a job.

- Choose Submit job.

- Under Job details, for Name, enter an identifiable name for the job.
- For Runtime role, choose the IAM role that you noted from the CloudFormation stack output earlier.
- For Script location, provide the path to the Python script you noted earlier from the CloudFormation stack output.
- Expand the section Spark properties and choose Edit in text.
- Enter the following value in the text box, which provides the path to the redshift-connector, Amazon Redshift JDBC driver, spark-avro JAR, and minimal-json JAR files:
- Choose Submit job.
- Wait for the job to complete and the run status to show as Success.
- Navigate to the Amazon Redshift query editor to view if the table was created successfully.

- Check the pushdown queries run for Amazon Redshift query group emr-serverless-redshift. You can run the following SQL statement against the database dev:
You can see that the pushdown query and return results are stored in Parquet file format on Amazon S3.
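A check of this kind can be sketched as a small Python helper that builds the SQL text for a given query group. It assumes the query group is recorded as the query label in the SYS_QUERY_HISTORY system view; the helper name pushdown_history_sql and the selected columns are illustrative choices, not the statement from this post.

```python
import re


def pushdown_history_sql(query_group, limit=5):
    """Return SQL that lists the most recent queries run under a query group.

    The group name is restricted to letters, digits, hyphens, and underscores
    because it is embedded directly into the SQL text.
    """
    if not re.fullmatch(r"[A-Za-z0-9_-]+", query_group):
        raise ValueError(f"unexpected characters in query group: {query_group!r}")
    return (
        "SELECT query_id, start_time, query_text "
        "FROM sys_query_history "
        f"WHERE query_label = '{query_group}' "
        "ORDER BY start_time DESC "
        f"LIMIT {int(limit)};"
    )


# Paste the printed statement into query editor v2 while connected to dev.
print(pushdown_history_sql("emr-serverless-redshift"))
```

In the returned query text, an UNLOAD statement wrapping the pushed-down SELECT indicates that the connector staged the result on Amazon S3 as described above.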

Use Amazon Redshift integration with Apache Spark on AWS Glue
Starting with AWS Glue version 4.0, the Apache Spark jobs connecting to Amazon Redshift can use the Amazon Redshift integration for Apache Spark and Amazon Redshift JDBC driver. Existing AWS Glue jobs that already use Amazon Redshift as source or target can be upgraded to AWS Glue 4.0 to take advantage of this new connector. The CloudFormation template provided with this post creates the following AWS Glue resources:
- AWS Glue connection for Amazon Redshift – The connection from AWS Glue to Amazon Redshift that uses the Amazon Redshift integration for Apache Spark
- IAM role attached to the AWS Glue job – The IAM role to manage permissions to run the AWS Glue job
- AWS Glue job – The script for the AWS Glue job performing transformations and aggregations using the Amazon Redshift integration for Apache Spark
The following example uses the AWS Glue connection attached to the AWS Glue job with PySpark and includes the following steps:
- On the AWS Glue console, choose Connections in the navigation pane.

- Under Connections, choose the AWS Glue connection for Amazon Redshift created by the CloudFormation template.

- Verify the connection details.

You can now reuse this connection within a job or across multiple jobs.
- On the Connectors page, choose the AWS Glue job created by the CloudFormation stack under Your jobs, or access the AWS Glue job by using the URL provided for the key GlueJob in the CloudFormation stack output.

- Access and verify the script for the AWS Glue job.

- On the Job details tab, make sure that Glue version is set to Glue 4.0.
This ensures that the job uses the latest redshift-spark connector.

- Expand Advanced properties and in the Connections section, verify that the connection created by the CloudFormation stack is attached.

- Verify the job parameters added for the AWS Glue job. These values are also available in the output for the CloudFormation stack.

- Choose Save and then Run.
You can view the status for the job run on the Run tab.

- After the job run completes successfully, you can verify the output of the table test-glue created by the AWS Glue job.

- Check the pushdown queries run for Amazon Redshift query group glue-redshift. You can run the following SQL statement against the database dev:
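As an alternative to the query editor, a check like this can also be run programmatically. The sketch below uses the Amazon Redshift Data API through a boto3 redshift-data client, which is our choice for illustration and not a method described in this post. It assumes the query group is recorded as the query label in SYS_QUERY_HISTORY; the helper name run_statement is hypothetical, and result pagination is ignored for brevity.

```python
import time

# Named parameter :label is filled in by the Data API at execution time.
PUSHDOWN_SQL = (
    "SELECT query_text FROM sys_query_history "
    "WHERE query_label = :label ORDER BY start_time DESC LIMIT 5"
)


def run_statement(client, cluster_id, database, db_user, sql,
                  parameters=None, poll_seconds=1.0, timeout_seconds=60.0):
    """Run SQL through a redshift-data client and return rows as plain lists."""
    request = {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        "Sql": sql,
    }
    if parameters:
        request["Parameters"] = [
            {"name": name, "value": str(value)} for name, value in parameters.items()
        ]
    statement_id = client.execute_statement(**request)["Id"]

    # The Data API is asynchronous, so poll until the statement finishes.
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = client.describe_statement(Id=statement_id)
        status = description["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(description.get("Error", status))
        if time.monotonic() > deadline:
            raise TimeoutError(f"statement {statement_id} still {status}")
        time.sleep(poll_seconds)

    result = client.get_statement_result(Id=statement_id)
    # Each field is a one-key dict such as {"stringValue": "..."} or {"isNull": True}.
    return [
        [None if field.get("isNull") else next(iter(field.values())) for field in row]
        for row in result["Records"]
    ]


# Usage on a machine with boto3 and suitable permissions:
#   import boto3
#   rows = run_statement(boto3.client("redshift-data"),
#                        "redshift-spark-redshift-cluster", "dev", "redshiftmasteruser",
#                        PUSHDOWN_SQL, {"label": "glue-redshift"})
```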
Best practices
Keep in mind the following best practices:
- Consider using the Amazon Redshift integration for Apache Spark from Amazon EMR instead of using the redshift-spark connector (community version) for your new Apache Spark jobs.
- If you have existing Apache Spark jobs using the redshift-spark connector (community version), consider upgrading them to use the Amazon Redshift integration for Apache Spark.
- The Amazon Redshift integration for Apache Spark automatically applies predicate and query pushdown to optimize for performance. We recommend using supported functions (autopushdown) in your query. The Amazon Redshift integration for Apache Spark will turn the function into a SQL query and run the query in Amazon Redshift. This optimization results in only the required data being retrieved, so Apache Spark can process less data and have better performance.
  - Consider using aggregate pushdown functions like avg, count, max, min, and sum to retrieve filtered data for data processing.
  - Consider using Boolean pushdown operators like in, isnull, isnotnull, contains, endswith, and startswith to retrieve filtered data for data processing.
  - Consider using logical pushdown operators like and, or, and not (or !) to retrieve filtered data for data processing.
- It’s recommended to pass an IAM role using the parameter aws_iam_role for the Amazon Redshift authentication from your Apache Spark application on Amazon EMR or AWS Glue. The IAM role should have necessary permissions to retrieve temporary IAM credentials to authenticate to Amazon Redshift as shown in this blog’s “Examples for Amazon Redshift integration for Apache Spark in an Apache Spark application” section.
- With this feature, you don’t have to maintain your Amazon Redshift user name and password in the secrets manager and Amazon Redshift database.
- Amazon Redshift uses the UNLOAD SQL statement to store this temporary data on Amazon S3. The Apache Spark application retrieves the results from the temporary directory (stored in Parquet file format). This temporary directory on Amazon S3 is not cleaned up automatically, and therefore could add additional cost. We recommend using Amazon S3 lifecycle policies to define the retention rules for the S3 bucket.
- It’s recommended to turn on Amazon Redshift audit logging to log the information about connections and user activities in your database.
- It’s recommended to turn on Amazon Redshift at-rest encryption to encrypt your data as Amazon Redshift writes it in its data centers and decrypt it for you when you access it.
- It’s recommended to upgrade to AWS Glue v4.0 and above to use the Amazon Redshift integration for Apache Spark, which is available out of the box. Upgrading to this version of AWS Glue will automatically make use of this feature.
- It’s recommended to upgrade to Amazon EMR v6.9.0 and above to use the Amazon Redshift integration for Apache Spark. You don’t have to manage any drivers or JAR files explicitly.
- Consider using Amazon EMR Studio notebooks to interact with your Amazon Redshift data in your Apache Spark application.
- Consider using AWS Glue Studio to create Apache Spark jobs using a visual interface. You can also switch to writing Apache Spark code in either Scala or PySpark within AWS Glue Studio.
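The S3 lifecycle recommendation above can be sketched as a configuration dictionary in the shape accepted by the S3 put_bucket_lifecycle_configuration call. The helper name temp_dir_lifecycle, the rule ID, the prefix, and the one-day retention are assumptions for illustration; pick a retention that comfortably exceeds your longest-running job.

```python
def temp_dir_lifecycle(prefix, expire_after_days=1):
    """Build a lifecycle configuration that expires staged connector data."""
    if expire_after_days < 1:
        raise ValueError("expire_after_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-redshift-spark-temp-data",
                "Filter": {"Prefix": prefix},  # only the staging prefix is affected
                "Status": "Enabled",
                "Expiration": {"Days": expire_after_days},
                # Also clean up parts left behind by interrupted uploads.
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": expire_after_days
                },
            }
        ]
    }


lifecycle = temp_dir_lifecycle("redshift-temp-dir/")
print(lifecycle)

# To apply it with boto3 (bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="redshift-spark-databucket-xxxxxxx",
#       LifecycleConfiguration=lifecycle,
#   )
```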
Clean up
Complete the following steps to clean up the resources that are created as a part of the CloudFormation template to ensure that you’re not billed for the resources if you’ll no longer be using them:
- Stop the Amazon EMR Serverless application:
  - Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
  - Choose Applications under Serverless in the navigation pane.
  You will find an EMR application created by the CloudFormation stack with the name emr-spark-redshift.
  - If the application status shows as Stopped, you can move to the next steps. However, if the application status is Started, choose the application name, then choose Stop application and Stop application again to confirm.
- Delete the Amazon EMR Studio Workspace:
  - Access Amazon EMR Studio by choosing or copying the link provided in the CloudFormation stack outputs for the key EMRStudioURL.
  - Choose Workspaces in the navigation pane.
  - Select the Workspace that you created and choose Delete, then choose Delete again to confirm.
- Delete the CloudFormation stack:
  - On the AWS CloudFormation console, navigate to the stack you created earlier.
  - Choose the stack name and then choose Delete to remove the stack and delete the resources created as a part of this post.
  - On the confirmation screen, choose Delete stack.

Conclusion
In this post, we explained how you can use the Amazon Redshift integration for Apache Spark to build and deploy applications with Amazon EMR on Amazon EC2, Amazon EMR Serverless, and AWS Glue to automatically apply predicate and query pushdown to optimize the query performance for data in Amazon Redshift. It’s highly recommended to use Amazon Redshift integration for Apache Spark for seamless and secure connection to Amazon Redshift from your Amazon EMR or AWS Glue.
Here is what some of our customers have to say about the Amazon Redshift integration for Apache Spark:
“We empower our engineers to build their data pipelines and applications with Apache Spark using Python and Scala. We wanted a tailored solution that simplified operations and delivered faster and more efficiently for our clients, and that’s what we get with the new Amazon Redshift integration for Apache Spark.”
—Huron Consulting
“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end-users. We spend more time adding value through data and less time on integrations.”
—GE Aerospace
“Our focus is on providing self-service access to data for all of our users at Goldman Sachs. Through Legend, our open-source data management and governance platform, we enable users to develop data-centric applications and derive data-driven insights as we collaborate across the financial services industry. With the Amazon Redshift integration for Apache Spark, our data platform team will be able to access Amazon Redshift data with minimal manual steps, allowing for zero-code ETL that will increase our ability to make it easier for engineers to focus on perfecting their workflow as they collect complete and timely information. We expect to see a performance improvement of applications and improved security as our users can now easily access the latest data in Amazon Redshift.”
—Goldman Sachs
About the Authors
Gagan Brahmi is a Senior Specialist Solutions Architect focused on big data analytics and AI/ML platform at Amazon Web Services. Gagan has over 18 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. In his spare time, he spends time with his family and explores new places.
Vivek Gautam is a Data Architect with specialization in data lakes at AWS Professional Services. He works with enterprise customers building data products, analytics platforms, and solutions on AWS. When not building and designing data lakes, Vivek is a food enthusiast who also likes to explore new travel destinations and go on hikes.
Naresh Gautam is a Data Analytics and AI/ML leader at AWS with 20 years of experience, who enjoys helping customers architect highly available, high-performance, and cost-effective data analytics and AI/ML solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.
Beaux Sharifi is a Software Development Engineer within the Amazon Redshift drivers’ team where he leads the development of the Amazon Redshift Integration with Apache Spark connector. He has over 20 years of experience building data-driven platforms across multiple industries. In his spare time, he enjoys spending time with his family and surfing.





















































Aniket Jiddigoudar is a Big Data Architect on the AWS Glue team. He works with customers to help improve their big data workloads. In his spare time, he enjoys trying out new food, playing video games, and kickboxing.
Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.
A quintessential example of a company managing analytics on billions of data points across the data lake and the warehouse in a mission-critical business environment is
Playrix had a two-fold business goal, including marketing to its end-users (game players) with near-real-time data while also analyzing their historical data for the past 4–5 years. In seeking a solution, Playrix wanted to avoid disrupting other technical processes while also increasing cost savings. The company migrated to Redshift Serverless and scaled up to handle more complicated analytics on 600 TB from the past 5 years, all without storing two copies of the data or disrupting other analytics jobs. With Redshift Serverless, Playrix achieved a more flexible architecture and saved an overall 20% in costs of its marketing stack, decreasing its cost of customer acquisition.
Stripe, a payment processing platform for businesses, is an Amazon Redshift customer and a partner with thousands of end customers who require access to Stripe data for their applications. Stripe built the
Sana Ahmed is a Sr. Product Marketing Manager for Amazon Redshift. She is passionate about people, products and problem-solving with product marketing. As a Product Marketer, she has taken 50+ products to market and worked at various different companies including Sprinklr, PayPal and Facebook. Her hobbies include tennis, museum-hopping and fun conversations with friends and family.










Moira Lennox is a Senior Data Strategy Technical Specialist for AWS with 27 years’ experience helping companies innovate and modernize their data strategies to achieve new heights and allow for strategic decision-making. She has experience working in large enterprises and technology providers, in both business and technical roles across multiple industries, including health care live sciences, financial services, communications, digital entertainment, energy, and manufacturing.
Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data strategy, and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and data governance.
Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. His 





















Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available and secure solutions in AWS Cloud. In his free time, he enjoys watching movies, TV series and of course cricket! Lately, he his also attempting to master the art of cooking in his free time – The taste buds are excited, but the kitchen might disagree.
Ravi Bhatane is a software engineer with Amazon OpenSearch Serverless Service. He is passionate about security, distributed systems, and building scalable services. When he’s not coding, Ravi enjoys photography and exploring new hiking trails with his friends.
Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.
Constantin Scoarță is a Software Engineer at CyberSolutions Tech. He is mainly focused on building data cleaning and forecasting pipelines. In his spare time, he enjoys hiking, cycling, and skiing.
Horațiu Măiereanu is the Head of Python Development at CyberSolutions Tech. His team builds smart microservices for ecommerce retailers to help them improve and automate their workloads. In his free time, he likes hiking and traveling with his family and friends.
Ahmed Ewis is a Solutions Architect at the AWS Data Lab. He helps AWS customers design and build scalable data platforms using AWS database and analytics services. Outside of work, Ahmed enjoys playing with his child and cooking.
Jason D’Alba is an AWS Solutions Architect leader focused on databases and enterprise applications, helping customers architect highly available and scalable solutions.
Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.
Vetri Natarajan is a Specialist Solutions Architect for Amazon QuickSight. Vetri has 15 years of experience implementing enterprise business intelligence (BI) solutions and greenfield data products. Vetri specializes in integration of BI solutions with business applications and enable data-driven decisions.
Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in Data Analytics domain.





Ennio Pastore is a Senior Data Architect on the AWS Data Lab team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. Ennio has over 10 years of experience in data analytics. He helps companies define and implement data platforms across industries, such as telecommunications, banking, gaming, retail, and insurance.
Aaron Chong is an Enterprise Solutions Architect at Amazon Web Services Hong Kong. He specializes in the data analytics domain, and works with a wide range of customers to build big data analytics platforms, modernize data engineering practices, and advocate AI/ML democratization.
Lillie Atkins is a Product Manager for Amazon QuickSight, Amazon Web Services’ cloud-native, fully managed BI service.
Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.
Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in software development, architecture, and analytics from industries such as finance, telecom, retail, and healthcare.
Nith Govindasivan is a Data Lake Architect with AWS Professional Services, where he helps onboard customers on their modern data architecture journey by implementing big data and analytics solutions. Outside of work, Nith is an avid cricket fan who watches almost any cricket in his spare time, and he enjoys long drives and traveling internationally.
Vijay Velpula is a Data Architect with AWS Professional Services. He helps customers implement Big Data and Analytics Solutions. Outside of work, he enjoys spending time with family, traveling, hiking and biking.
Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing tabla.

Alternatively, you can use the AWS CLI to grant data location permission on the bucket registered in the central account to the crawler role using the following command:

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.
Piyali Kamra is a seasoned enterprise architect and a hands-on technologist who believes that building large-scale enterprise systems is not an exact science but more like an art, in which tools and technologies must be carefully selected based on the team’s culture, strengths, weaknesses, and risks, in tandem with a futuristic vision of how you want to shape your product a few years down the road.
Ahmed Zamzam is a Senior Partner Solutions Architect at Confluent, with a focus on the AWS partnership. In his role, he works with customers in the EMEA region across various industries to assist them in building applications that leverage their data using Confluent and AWS. Prior to Confluent, Ahmed was a Specialist Solutions Architect for Analytics at AWS, specializing in data streaming and search. In his free time, Ahmed enjoys traveling, playing tennis, and cycling.
Geetha Anne is a Partner Solutions Engineer at Confluent with previous experience in implementing solutions for data-driven business problems on the cloud, involving data warehousing and real-time streaming analytics. She fell in love with distributed computing during her undergraduate days and has followed her interest ever since. Geetha provides technical guidance, design advice, and thought leadership to key Confluent customers and partners. She also enjoys teaching complex technical concepts to both tech-savvy and general audiences.
Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specialized in the design and implementation of analytics and data lake solutions.
Takeshi Nakatani is a Principal Big Data Consultant on the Professional Services team in Tokyo. He has 25 years of experience in the IT industry, with expertise in architecting data infrastructure. On his days off, he can be a rock drummer or a motorcyclist.
Mahesh Pasupuleti is a VP of Data & Machine Learning Engineering at Poshmark. He has helped several startups succeed in different domains, including media streaming, healthcare, the financial sector, and marketplaces. He loves software engineering, building high performance teams, and strategy, and enjoys gardening and playing badminton in his free time.
Gaurav Shah is Director of Data Engineering and ML at Poshmark. He and his team help build data-driven solutions to drive growth at Poshmark.
Raghu Mannam is a Sr. Solutions Architect at AWS in San Francisco. He works closely with late-stage startups, many of which have had recent IPOs. His focus is end-to-end solutioning including security, DevOps automation, resilience, analytics, machine learning, and workload optimization in the cloud.
Deepesh Malviya is a Solutions Architect Manager on the AWS Data Lab team. He and his team help customers architect and build data, analytics, and machine learning solutions to accelerate their key initiatives as part of the AWS Data Lab.
John Telford is a Senior Consultant at Amazon Web Services. He is a specialist in big data and data warehouses. John has a Computer Science degree from Brunel University.
Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.
Pauline Ting is a Data Scientist on the AWS Professional Services team. She supports customers in achieving and accelerating their business outcomes by developing sustainable AI/ML solutions. In her spare time, Pauline enjoys traveling, surfing, and trying new dessert places.