All posts by Shovan Kanjilal

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/simplify-data-integration-with-aws-glue-and-zero-etl-to-amazon-sagemaker-lakehouse/

With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. While traditional extract, transform, and load (ETL) processes have long been a staple of data integration due to its flexibility, for common use cases such as replication and ingestion, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

In addition, organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.

To utilize these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in an Amazon SageMaker Lakehouse and Amazon Redshift.

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in-place with all Apache Iceberg compatible tools and engines. By directly integrating with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.

What is zero-ETL?

Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. You don’t need to maintain complex ETL pipelines. We take care of the ETL for you by automating the creation and management of data replication.

What’s the difference between zero-ETL and Glue ETL?

AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.

  • Zero-ETL provides service-managed replication. It’s designed for scenarios where customers need a fully managed, efficient way to replicate data from one source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near-real-time, with AWS managing the underlying infrastructure and replication process.
  • Glue ETL offers customer-managed data ingestion. It’s the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.

It’s worth mentioning that the source connections are reusable between Glue ETL and Glue zero-ETL so that can easily support both patterns. After you create a connection once, you can choose to use the same connection across various AWS Glue components including Glue ETL, Glue Visual ETL and zero-ETL.  For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom GlueETL pipeline.

This blog post will explore how zero-ETL capabilities combined with its new application connectors are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP and others.

Use case

Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of their business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly, identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.

The company is looking for an efficient, scalable, and cost-effective solution to collecting and ingesting data from ServiceNow, ensuring continuous near real-time replication, automated availability of new data attributes, robust monitoring capabilities to track data load statistics, and reliable data lake foundation supporting data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.

Solution overview

The following architecture diagram illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but this can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.

The following are the key components and steps in the integration process:

  1. Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, it keeps the information synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
  2. Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
  3. Amazon EventBridge, a serverless event bus service, triggers a downstream process that allows you to build event-driven architecture as soon as your new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases such as:

Prerequisites

Complete the following prerequisites before setting up the solution:

  1. Create a bucket in Amazon S3 called zero-etl-demo-<your AWS Account Number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg which is an open table format (OTF) supporting ACID transactions (atomicity, consistency, isolation, and durability), seamless schema evolution, and data versioning using time travel.
  2. Create an AWS Glue database <your database name>, such as zero_etl_demo_db and associate the S3 bucket zero-etl-demo-<your AWS Account Number>-<AWS Region> as a location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
  3. Update AWS Glue Data Catalog settings using the following IAM policy for fine-grained access control of the data catalog for zero-ETL.
  4. Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue Connector to read from the Service Now and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
  5. Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password passwordServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate within ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow supports it.
  6. Create the secret zero_etl_demo_secret in AWS Secrets Manager to store ServiceNow credentials.

Build and verify the zero-ETL integration

Complete the following steps to create and validate zero-ETL integration:

Step 1: Set up a connector

Zero-ETL integration, when used with AWS Glue natively supported applications connectors, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data catalog, choose Connections.
  3. Choose Create Connection.
  4. In the Create Connection pane, enter ServiceNow in Data Sources.
  5. Choose ServiceNow.
  6. Choose Next.
  7. For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
  8. For IAM service role, choose the zero_etl_demo_role (created as part of the prerequisites).
  9. For Authentication Type, choose the authentication type that you’re using for ServiceNow. In this example. we have chosen OAUTH2, which requires the set up of Application Registries in ServiceNow.
  10. For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
  11. Choose Next.
  12. In the Connection Properties section, for Name, enter zero_etl_demo_conn.
  13. Choose Next.
  14. Choose Create connection.

  15. There will be a popup from ServiceNow after you choose Create connection. Choose Allow.

Step 2: Set up Zero-ETL integration

After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data catalog, choose Zero-ETL integrations.
  3. Choose Create zero-ETL integration.
  4. In the Create integration pane, enter ServiceNow in Data Sources.
  5. Choose ServiceNow.
  6. Choose Next.
  7. For ServiceNow connection, choose the data connection created on Step 1—zero_etl_demo_conn.
  8. For Source IAM role, choose the zero_etl_demo_role (from the prerequisites).
  9. For ServiceNow objects, choose the objects you want to perform the ingestion managed by zero-ETL integration. For this post, choose problem and incident objects.
  10. For Namespace or Database, choose <your database name>. In this example, we use the zero_etl_demo_db (from the prerequisites).
  11. For Target IAM role, choose the zero_etl_demo_role (from the prerequisites).
  12. Choose Next.
  13. For Security and data encryption, you can choose either AWS Managed KMS Key or choose a customer KMS key managed by AWS Key Management Service. For this post, choose Use AWS managed KMS key.
  14. In the Integration details section, for Name, enter zero-etl-demo-integration.
  15. Choose Next.
  16. Review the details and choose Create and launch integration.
  17. The newly created integration will show as Active in about a minute.

Step 3: Verify the initial SEED load

The SEED load refers to the initial loading of the tables that you want to ingest into an Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published into CloudWatch and the data ingested by zero-ETL integration can be accessed in AWS using a set of services such Amazon Sagemaker Unified StudioAmazon QuickSight, and others. Use the following steps to access zero-ETL integration logs and query the data:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Zero-ETL integrations.
  3. In the Zero-ETL integrations section, choose zero-etl-demo-integration.
  4. In the Activity summary (all time) section, choose CloudWatch logs.
  5. Check CloudWatch log events for the SEED Load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. Highlighted in the following screenshot in IngestionTableStatistics are the statistics. The insertCount represents how many rows were extracted and loaded by zero-ETL integration. For the SEED load, you will always see only insertCount because it’s the initial load. In addition, in IngestionCompleted you will find information about the Zero-ETL integration such as status, load type, and message.

To validate the SEED load, query the data using Amazon Sagemaker Unified Studio.

  1. Access Amazon Sagemaker Unified Studio for your specific domain through your AWS Console.
  2. Open the Amazon SageMaker Unified Studio URL.
  3. Sign in with SSO or AWS IAM user.
  4. Select your project.
  5. Go to Data from the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Click the icon and select Query with Athena.
  6. For Query, enter the following statement:
    SELECT count(*) AS incidents_count
    FROM "zero_etl_demo_db"."incident"

  7. Choose Run.
  8. Let’s check an existing incident in ServiceNow. This is the incident that you will update the description of in ServiceNow to validate change data capture (CDC). In the query editor, pane, for Query, enter the following statement:
    SELECT number
    , short_description
    , description
    FROM "zero_etl_demo_db"."incident"
    WHERE number = 'INC0000003' -- update to your Incident number

  9. Choose Run.

Step 4: Validate CDC

The CDC load is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published into CloudWatch. For this post, you will use Amazon SageMaker unified studio to query the data ingested. Use the following steps to access zero-ETL integration logs and query the data ingested. For the next step in this example, you will select an incident and perform an update in ServiceNow, changing the short_description and description of the incident.

  1. To demonstrate CDC event, in this blog we are going to edit 1 incident and delete 1 incident in ServiceNow.
  2. Open the AWS Glue console.
  3. In the navigation pane, under Data catalog, choose Zero-ETL integrations.
  4. In the Zero-ETL integrations section, choose zero-etl-demo-integration.
  5. In the Activity summary (all time) section, choose CloudWatch logs.
  6. Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check CloudWatch log events for the CDC load. Shown in the following figure in IngestionTableStatistics, review updateCount and deleteCount for each specific object managed by zero-ETL integration. It’s applying the updates and deletes that happened in ServiceNow to the transactional data lake.

To validate the CDC load, query the data using Amazon SageMaker Unified Studio.

  1. You can go back to Amazon SageMaker Unified Studio.
  2. For Query, enter the following statement:
    SELECT count(*) AS incidents_count
    FROM "zero_etl_demo_db"."incident"

  3. For Query, enter the following statement to record initial snapshot results before CDC:
    SELECT number
        , short_description
        , description
    FROM "zero_etl_demo_db"."incident"
    WHERE number = 'INC0000003' -- update to your Incident number

  4. Choose Run and confirm that one record was updated in short_description and description attributes.

By following these steps, you can effectively set up, build, and verify a zero-ETL job using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating applications data into your AWS environment.

Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL

One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform Time Travel. This feature allows you to access and query historical versions of your data effortlessly. With Iceberg Time Travel, you can easily roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing applications data.

To demonstrate this feature, let’s consider a scenario where you’re analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here’s an example query that showcases Iceberg time travel:

-- Query incident data as of particular timestamp before CDC
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident" 
FOR TIMESTAMP AS OF TIMESTAMP '2024-11-06 05:10:00 UTC' 
-- update this timestamp value to before your CDC update
WHERE number = 'INC0000003' -- update to your Incident number
-- Compare with current data
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003' -- update to your Incident number

In this example:

  1. The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves incident data as it existed before CDC update for the specific incident number INC0000003.
  2. The second query fetches the current state of the data for the same incident number.

This capability allows you to track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.

Clean up

To avoid incurring future charges, remove up the resources used in this post from your AWS account by completing the following steps:

  1. Delete zero-ETL integration zero-etl-demo-integration.
  2. Delete content from the S3 bucket zeroetl-etl-demo-<your AWS Account Number>-<AWS Region>.
  3. Delete the Data Catalog database zero_etl_demo_db.
  4. Delete the Data Catalog connection zero_etl_demo_conn.
  5. Delete the AWS Secrets manager Secret.

Conclusion

As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting a zero-ETL integration powered by AWS Glue and its new set of managed connectors, you organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.

To learn more about how AWS Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.

Get started with zero-ETL on AWS by creating a free account today!


About the authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure and high-performance data solutions in the cloud.

Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in Big Data technologies. He focuses on helping customers build robust and performant Data Analytics solutions and Data Lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.

Kartikay KhatorKartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Extending beyond his professional pursuits, he finds joy and fulfillment in the world of running and hiking. Having already completed multiple marathons, he is currently preparing for his next marathon challenge.

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.

Kamen SharlandjievKamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!

Control your AWS Glue Studio development interface with AWS Glue job mode API property

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/control-your-aws-glue-studio-development-interface-with-aws-glue-job-mode-api-property/

In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in determining a company’s competitiveness. AWS Glue, a serverless data integration service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool.

The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run ETL jobs. By using this API, it becomes possible to automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks.

To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore how the updated AWS Glue Jobs API works in depth and demonstrate the new experience with the updated API.

JobMode property

A new property JobMode describes the mode of AWS Glue jobs (script, visual, or notebook) to improve your UI experience. AWS Glue users can use the mode that best fits your preference. Some extract, transform, and load (ETL) developers prefer to use visual mode and create visual jobs using AWS Glue Studio visual editor. Some data scientists prefer to use notebooks jobs and use AWS Glue Studio notebooks. Some data engineers and developers prefer to implement script through the AWS Glue Studio script editor or preferred integrated development environment (IDE). After the job is created with the preferred mode, you can search for it by filtering on the job mode within your saved AWS Glue jobs page and find it easily. Additionally, if you are migrating existing iPython notebook files to AWS Glue Studio notebook jobs, you can now choose and set the job mode and do so for multiple jobs using this new API property, as demonstrated in this post.

How CreateJob API works with the new JobMode property

You can use CreateJob API to create AWS Glue script or a visual or notebook job. The following is an example of how it works for a visual job using AWS SDK for Python (Boto3): (replace <your-bucket-name> with your S3 bucket)

CODE_GEN_JSON_STR = '''
{
  "node-1": {
    "S3ParquetSource": {
      "Name": "Amazon S3",
      "Paths": [
        "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
      ],
      "Exclusions": [],
      "Recurse": true,
      "AdditionalOptions": {
        "EnableSamplePath": false,
        "SamplePath": "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/73612da260b94159b705cf4df12364cb_0.snappy.parquet"
      },
      "OutputSchemas": [
        {
          "Columns": [
            {
              "Name": "marketplace",
              "Type": "string"
            },
            {
              "Name": "customer_id",
              "Type": "string"
            },
            {
              "Name": "review_id",
              "Type": "string"
            },
            {
              "Name": "product_id",
              "Type": "string"
            },
            {
              "Name": "product_title",
              "Type": "string"
            },
            {
              "Name": "star_rating",
              "Type": "bigint"
            },
            {
              "Name": "helpful_votes",
              "Type": "bigint"
            },
            {
              "Name": "total_votes",
              "Type": "bigint"
            },
            {
              "Name": "insight",
              "Type": "string"
            },
            {
              "Name": "review_headline",
              "Type": "string"
            },
            {
              "Name": "review_body",
              "Type": "string"
            },
            {
              "Name": "review_date",
              "Type": "timestamp"
            },
            {
              "Name": "review_year",
              "Type": "bigint"
            }
          ]
        }
      ]
    }
  },
  "node-2": {
    "DropFields": {
      "Name": "Drop Fields",
      "Inputs": [
        "node-1"
      ],
      "Paths": [
        [
          "review_headline"
        ],
        [
          "review_body"
        ],
        [
          "review_date"
        ]
      ]
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Name": "Amazon S3",
      "Inputs": [
        "node-2"
      ],
      "PartitionKeys": [],
      "Path": "s3://<your-bucket-name>/data/jobmode-blog/output/parquet/",
      "Compression": "snappy",
      "Format": "parquet",
      "SchemaChangePolicy": {
        "EnableUpdateCatalog": false
      }
    }
  }
}
'''

glue_client = boto3.client('glue')
codeGenJson = json.loads(constants.CODE_GEN_JSON_STR, strict=False)

# Call the create_job method
try:
    glue_client.create_job(
        Name="glue-visual-job",
        Description="Glue Visual ETL job",
        Command={'Name': 'glueetl', 'ScriptLocation': "s3://aws-glue-assets-<account-id>-<region>/scripts/glue-visual-job", 'PythonVersion': "3"},
        WorkerType=constants.WORKERTYPE,
        NumberOfWorkers="G.1X",
        Role=<role-arn>,  
        GlueVersion="4.0",        
        CodeGenConfigurationNodes=codeGenJson,
        JobMode="VISUAL"
    )
    print("Successfully created Glue job")
except Exception as e:
    print(f"Error creating Glue job: {str(e)}")

CODE_GEN_JSON_STR represents the visual nodes for the AWS Glue Job. There are three nodes: node-1 uses S3 source, node-2 does transformation, and node-3 uses S3 target. The script instantiates the AWS Glue Boto3 client, loads the JSON, and calls the create_job. JobMode is set to VISUAL.

After you run the Python script, a new job is created. The following screenshot shows how the created job looks in AWS Glue visual editor.

There are three nodes in the visual directed acyclic graph (DAG): node 1 sources product review data for the product_category book from the public S3 bucket, node-2 drops some of the fields that aren’t needed for downstream systems, and node-3 persists the transformed data in a local S3 bucket.

How CloudFormation works with the new JobMode property

You can use AWS CloudFormation to create different types of AWS Glue jobs by specifying the JobMode parameter with the AWS::Glue::Job resource. The supported job modes include:

  • SCRIPT
  • VISUAL
  • NOTEBOOK

In this example, you create a AWS Glue notebook job using AWS CloudFormation, which requires setting the JobMode parameter to NOTEBOOK.

  1. Create a Jupyter Notebook file containing your logic and code, and save the notebook file with a descriptive name, such as my-glue-notebook.ipynb. Alternatively you can download the notebook file, and rename it to my-glue-notebook.ipynb.
  2. Upload the Notebook file to the notebooks/ folder within the aws-glue-assets-<account-id>-<region> S3 bucket.
  3. Create a new CloudFormation template to create a new AWS Glue job, specifying the NotebookJobName parameter as the same name as the Notebook file. Here’s the sample snippet of CloudFormation template:
    AWSTemplateFormatVersion: '2010-09-09'
    Description: CloudFormation template for creating an AWS Glue ETL job using a Jupyter Notebook
    
    Parameters:
      NotebookJobName:
        Type: String
        Description: Name of the AWS Glue ETL Notebook job
    
    Resources:
      GlueJobRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: !Sub ${AWS::StackName}-GlueJobRole
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service:
                    - glue.amazonaws.com
                Action:
                  - sts:AssumeRole
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          Policies:
            - PolicyName: GlueJobS3Access
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - iam:PassRole
                    Resource:
                      - !Sub arn:aws:iam::${AWS::AccountId}:role/${AWS::StackName}-GlueJobRole
    
      ETLNotebookJob:
        Type: AWS::Glue::Job
        Properties:
          Name: !Ref NotebookJobName
          Description: ETL job using a Jupyter Notebook
          Role: !GetAtt GlueJobRole.Arn
          Command:
            Name: glueetl
            PythonVersion: '3'
            ScriptLocation: !Sub s3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/scripts/${NotebookJobName}.py
          DefaultArguments:
            '--job-bookmark-option': job-bookmark-enable
          JobMode: NOTEBOOK
    
    Outputs:
      ETLNotebookJobName:
        Value: !Ref ETLNotebookJob
        Description: Name of the ETL Notebook job

  4. Deploy the CloudFormation template. For NotebookJobName, enter same name as the notebook file.
  5. Verify that the AWS Glue job you created is listed and that it has the name you specified in the CloudFormation template.

AWS Glue notebook shows the Notebook job that contains the existing cells that you had in the ipynb file. You can review the job details to confirm it’s configured correctly.

Console experience

On the AWS Glue console, in the navigation pane, choose ETL Jobs to observe all your ETL jobs listed. Here you have different columns Job name, Type, Created by, Last modified, and AWS Glue version. You can sort and filter by these columns. The following screenshot shows how it looks.

We also enhanced the console experience with the JobMode introduction. The Created by column on the console gives you information about JobMode of the job. You can filter access jobs created by VISUAL, NOTEBOOK, or SCRIPT, as shown in the following screenshot.

This new console experience helps you search and discover your jobs based on JobMode.

Conclusion

This post demonstrated how AWS Glue Job API works with the newly introduced job mode property. With the new property, you can explicitly choose the mode of each job. The steps instructed detailed usage in API, AWS SDK, and CloudFormation. Additionally, the property makes it straightforward to search and discover your jobs quickly on the AWS Glue console.


About the Authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.

Manoj Shunmugam is a DevOps Consultant in Professional Services at Amazon Web Services. He works with customers to establish infrastructures using cloud-centered and/or container-based platforms in the AWS Cloud.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Gal HeyneGal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.