Post Syndicated from Praveen Kumar original https://aws.amazon.com/blogs/big-data/author-visual-etl-flows-on-amazon-sagemaker-unified-studio/
Amazon SageMaker Unified Studio (preview) provides an integrated data and AI development environment within Amazon SageMaker. From the Unified Studio, you can collaborate and build faster using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. This experience includes visual ETL, a new visual interface that makes it simple for data engineers to author, run, and monitor extract, transform, load (ETL) data integration flow. You can use a simple visual interface to compose flows that move and transform data and run them on serverless compute. Additionally, you can choose to author your visual flows with English using generative AI prompts powered by Amazon Q. Visual ETL also automatically converts your visual flow directed acyclic graph (DAG) into Spark native scripts so you can continue authoring by notebook, enabling a quick-start experience for developers who prefer to author using code.
This post shows how you can build a low-code and no-code (LCNC) visual ETL flow that enables seamless data ingestion and transformation across multiple data sources. We demonstrate how to:
- Connect to diverse data sources
- Perform table joins
- Apply custom filters
- Export aggregated data to Amazon Simple Storage Service (Amazon S3)
Additionally, we explore how generative AI can enhance your LCNC visual ETL development process, creating an intuitive and powerful workflow that streamlines the entire development experience.
Use case walkthrough
In this example, we use Amazon SageMaker Unified Studio to develop a visual ETL flow. This pipeline reads data from an Amazon S3 based file location, performs transformations on the data, and subsequently writes the transformed data back into an Amazon S3 based AWS Glue Data Catalog table. We use allevents_pipe and venue_pipe files from the TICKIT dataset to demonstrate this capability.
The TICKIT dataset records sales activities on the fictional TICKIT website, where users can purchase and sell tickets online for different types of events such as sports games, shows, and concerts. Analysts can use this dataset to track how ticket sales change over time, evaluate the performance of sellers, and determine the most successful events, venues, and seasons in terms of ticket sales.
The process involves merging the allevents_pipe and venue_pipe files from the TICKIT dataset. Next, the merged data is filtered to include only a specific geographic region. The data is then aggregated to calculate the number of events by venue name. In the end, the transformed output data is saved to Amazon S3, and a new AWS Glue Data Catalog table is created.
The following diagram illustrates the architecture:

Prerequisites
To run the instruction, you must complete the following prerequisites:
- An AWS account
- A SageMaker Unified Studio domain
- A SageMaker Unified Studio project with Data analytics and machine learning project profile
Build a visual ETL flow
Complete following steps to build a new visual ETL flow with sample dataset:
- On the SageMaker Unified Studio console, on the top menu, choose Build.
- Under DATA ANALYSIS & INTEGRATION, choose Visual ETL flows, as shown in the following screenshot.

- Select your project and choose Continue.

- Choose Create visual ETL flow.

This time, manually define the ETL flow.
- On the top left, choose the + icon in the circle. Under Data sources, choose Amazon S3, as shown in the following screenshot. Locate the icon at the canvas.

- Choose the Amazon S3 source node and enter the following values:
-
- S3 URI:
s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv - Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled
- S3 URI:
Leave the rest as default.

- Wait for the data preview to be available at the bottom of the screen.

- Choose the + icon in the circle to the right of the Amazon S3 node. Under Transforms, choose Rename Columns.

- Choose the Rename Columns node and choose Add new rename pair. For Current name and New name, enter the following pairs:
_c0:venueid_c1:venuename_c2:venuecity_c3:venuestate_c4:venueseats

- Choose the + icon to the right of Rename Columns node. Under Transforms, choose Filter.
- Choose Add new filter condition.
- For Key, choose
venuestate. For Operation, choose==. For Value, enterDC, as shown in the following screenshot.

- Repeat steps 5 and 6 to add the Amazon S3 source node for table events.
-
- S3 URI:
s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv - Format: CSV
- Sep: ,
- Multiline: Enabled
- Header: Disabled
- S3 URI:
Leave the rest as default
- Repeat steps 7 and 8 for the Amazon S3 source node. On the Rename Columns node, choose Add new rename pair. For Current name and New name, enter the following pairs:
_c0:eventid_c1:e_venueid_c2:catid_c3:dateid_c4:eventname_c5:starttime

- Choose the + icon to the right of Rename Column node. Under Transforms, choose Join.
- Drag the + icon at the right of the Filter node and drop it at the left of the Join node.
- For Join type, choose Inner. For Left data source, choose
e_venueid. For Right data source, choosevenue_id.

- Choose the + icon to the right of the Join node. Under Transforms, choose SQL Query.
- Enter the following query statement:

- Choose the + icon to the right of the SQL Query node. Under Data target, choose Amazon S3.
- Choose the Amazon S3 target node and enter the following values:
- S3 URI: <choose s3 location from project overview page and add suffix “
/output/venue_event/”> (for example,s3://<bucket-name>/dzd_bd693kieeb65yf/52d3z1nutb42w7/dev/output/venue_event/) - Format: Parquet
- Compression: Snappy
- Mode: Overwrite
- Update catalog: True
- Database: Choose your database
- Table:
venue_event_agg
- S3 URI: <choose s3 location from project overview page and add suffix “

At this point, you should encounter this end-to-end visual flow. Now you can publish it.

- On the top right, choose Save to project to save the draft flow. You can optionally change the name and add a description. Choose Save to project, as shown in the following screenshot.

The visual ETL flow has been successfully saved.
Run flow
This section shows you how to run the visual ETL flow you authored.
- On the top right, choose Run.
At the bottom of the screen, the run status is shown. The run status transitions from Starting to Running and Running to Finished.

- Wait for the run to be Finished.

Query using Amazon Athena
The output data has been written to the target S3 bucket. This section shows you how to query the output table.
- On the top left menu, under DATA ANALYSIS & INTEGRATION, choose Query Editor.

- On the data explorer, under Lakehouse, choose
AwsDataCatalog. Navigate to the tablevenue_event_agg. - From the three dots icon, choose Query with Athena.

Four records will be returned, as shown in the following screenshot. This indicates you succeeded in querying the output table written by the visual ETL flow.

Generative AI section to generate a visual ETL flow
The preceding instruction is done in step-by-step operations on the visual console. On the other hand, SageMaker Unified Studio can automate job authoring steps by using generative AI powered by Amazon Q.
- On the top left menu, choose Visual ETL flows.
- Choose Create visual ETL flow.
- Enter the following text and choose Submit.
Create a flow to connect 2 Glue catalog tables venue and event in database glue_db, join on event id , filter on venue state with condition as venuestate=='DC' and write output to a S3 location

This creates the following boilerplate flow that you can edit to quickly author the visual ETL flow.

The generated flow keeps the context of the prompt at the node level.

Clean Up
To avoid incurring future charges, clean up the resources you created during this walkthrough:
- From the SQL querybook, enter the following SQL to drop table:
- To delete the flow, under Actions, choose Delete flow

Conclusion
This post demonstrated how you can use Amazon SageMaker Unified Studio to build a low-code no-code (LCNC) visual ETL flow. This allows for a seamless data ingestion and transformation across multiple data sources.
To learn more, refer to our documentation and the AWS News Blog.
About the Authors
Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.
Noritaka Sekiyama is a Principal Big Data Architect with AWS Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Alexandra Tello is a Senior Front End Engineer with the AWS Analytics services in New York City. She is a passionate advocate for usability and accessibility. In her free time, she’s an espresso enthusiast and enjoys building mechanical keyboards.
Ranu Shah is a Software Development Manager with AWS Analytics services. She loves building data analytics features for customers. Outside work, she enjoys reading books or listening to music.
Gal Heyne is a Technical Product Manager for AWS Analytics services with a strong focus on AI/ML and data engineering. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design simple-to-use data products.




















Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.
Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.
Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.
Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.
Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!
















Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.
Scott Rigney is a Senior Technical Product Manager with AWS and has expertise in analytics, data science, and machine learning. He is passionate about building software products that enable enterprises to make data-driven decisions and drive innovation.
G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.
Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.



Leo Ramsamy is a Platform Architect specializing in data and analytics for ANZ’s Institutional division. He focuses on modern data practices, including Data Mesh architecture, data governance, quality management, and observability. His work aligns data strategies with business goals, improving accessibility and enabling better decision-making across ANZ.
Srinivasan Kuppusamy is a Senior Cloud Architect – Data at AWS ProServe, where he helps customers solve their business problems using the power of AWS Cloud technology. His areas of interests are data and analytics, data governance, and AI/ML.
Rada Stanic is a Chief Technologist at Amazon Web Services, where she helps ANZ customers across different segments solve their business problems using AWS Cloud technologies. Her special areas of interest are data analytics, machine learning/AI, and application modernization.













































































