All posts by David Zhang

Introducing AWS Glue Data Catalog usage metrics for API usage

2025-06-26 David Zhang

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-data-catalog-usage-metrics-for-api-usage/

We’re excited to announce AWS Glue Data Catalog usage metrics. The usage metrics is a new feature that provides native integration with Amazon CloudWatch. This feature provides you with immediate visibility into your AWS Glue Data Catalog API usage patterns and trends.

AWS Glue Data Catalog is a centralized repository that stores metadata about your organization’s datasets. With its unified interface that acts as an index, you can store and query information about your data sources, including their location, formats, schemas, and runtime metrics.

As you scale your lakehouse architecture on Amazon Web Services (AWS) and maintain reliable data operations, observability and monitoring becomes critical to understanding and optimizing Data Catalog API usages.

With Data Catalog usage metrics in CloudWatch, you can achieve the following:

Monitor API call patterns at 1-minute intervals
Proactively request service quota increase for API rate limits
Enable the CloudWatch pre-built anomaly detection feature to identify abnormalities in your API usage
Understand lakehouse usage across more than 50 APIs

In this post, we demonstrate how to access these metrics, provide a step-by-step walkthrough, and set up meaningful alarms.

Access Data Catalog usage metrics in Amazon CloudWatch console

To access Data Catalog usage metrics, complete the following steps:

Open Amazon CloudWatch console
Under Metrics, choose All metrics
In the search bar, enter Glue and choose Enter
Choose Usage > By AWS Resource, as shown in the following screenshot

The Metrics section opens and displays different catalog usage metrics that you can select from to create dashboards and alarms, as shown in the following screenshot

Monitor CallCount metrics

Each Amazon CloudWatch metric for Data Catalog is of a type API and set as CallCount. This means that for each API call on that specific resource (for example, GetConnection API) will be logged as one count. These metrics can seamlessly integrate into your existing CloudWatch dashboards, or you can use them to create new ones. For proactive monitoring, you can configure custom alarms that trigger automatically when this API usage exceeds your defined thresholds, helping you comply with service limits.

Under the Graphed metrics tab, you can provide additional customizations to match your monitoring needs. In the Details column, you can create alarms and enable anomaly detection to identify unusual patterns.

To help with effective API monitoring, CallCount metrics specifically focus on successful API calls. This way, you have more precise monitoring and can troubleshoot different types of API behaviors. The following screenshot shows the AWS Glue usage metrics view for GetTables API.

In the Statistics column, you can view your API usage beyond the default Sum, Min, and Max metrics. You can now select a wide variety of statistical methods to analyze your usage patterns, as shown in the following screenshot.

Metrics and dimensions for Data Catalog usage metrics

Data Catalog usage metrics use the AWS/Usage namespace and provide CallCount metrics. These metrics are published with the dimensions Service, Resource, Type and Class.

The CallCount metric doesn’t have a specified unit. The most useful statistic for the metric is SUM, which represents the total operation count for the 1-minute period. An important note is that the metric value is emitted at 1-minute intervals. Reducing the period further (for example, to 1 second) won’t change the emittance interval.

Metrics

Metric	Description
`CallCount`	The number of specified operations performed in your account.

Dimensions

Dimension key	Dimension value	Description
Service	AWS Glue	The name of the AWS service containing the resource. For Data Catalog usage metrics, the value for this dimension is AWS Glue.
Type	API	The type of resource being tracked. Currently, when the Service dimension is AWS Glue, the only valid value for Type is API.
Resource	<API name>	The name of the API operation. Valid values include the following: GetCatalogs, GetCatalog, GetDatabases, GetDatabase, GetTables, GetTable, GetTableVersion, GetTableVersions, SearchTables, GetPartitionIndexes, GetColumnStatisticsForTable, GetPartition, GetPartitions, BatchGetPartition, GetColumnStatisticsForPartition, GetConnection, GetConnections, GetUserDefinedFunction, GetUserDefinedFunctions, GetCatalogImportStatus, GetTableOptimizer, BatchGetTableOptimizer, ListTableOptimizerRuns, CreateCatalog, CreateDatabase, CreateTable, CreatePartitionIndex, CreatePartition, BatchCreatePartition, CreateConnection, CreateUserDefinedFunction, CreateTableOptimizer, UpdateCatalog, UpdateDatabase, UpdateTable, UpdateColumnStatisticsForTable, UpdatePartition, BatchUpdatePartition, UpdateColumnStatisticsForPartition, UpdateConnection, UpdateUserDefinedFunction, UpdateTableOptimizer, DeleteCatalog, DeleteDatabase, DeleteTable, BatchDeleteTable, DeleteTableVersion, DeletePartitionIndex, DeleteColumnStatisticsForTable, DeletePartition, BatchDeletePartition, DeleteColumnStatisticsForPartition, DeleteConnection, BatchDeleteConnection, DeleteUserDefinedFunction, DeleteTableOptimizer, TestConnection, ImportCatalogToGlue
Class	None	The class of resource being tracked. Data Catalog usage metrics use this dimension with a value of `None`.

Set up CloudWatch alarms for Data Catalog usage metrics

Data Catalog has defined rules to manage atypical usage patterns that limit the customer call rate at the granularity of requests per second. You can generate CloudWatch alarms using the CallCount metric so that limit increases can be done proactively. To configure a CloudWatch alarm with this threshold, complete the following steps:

On the CloudWatch metrics console, select one of the available metrics, as shown in the following screenshot. In this example, we select the resource GetTables. You can select multiple metrics to fit your use case.

Choose Graphed metrics.
Choose Sum as the primary statistic.
Set period to 1 minute.

Choose Details and Create Alarm.

For Threshold type, choose Anomaly Detection. You can also select Static based on your requirements and after you’ve determined a specific threshold value.
Set the Anomaly detection threshold to 2 (default). The threshold value is used to determine the normal range of values for the metric. A higher value produces a thicker band of normal values. For more information on how CloudWatch anomaly detection works, refer to How CloudWatch anomaly detection works.
Choose Next.
For Send a notification to the following SNS topic, choose Create new topic.
For Create a new topic, enter your Amazon Simple Notification Service (Amazon SNS) topic name.
For Email endpoints that will receive the notification, enter your email address. In this example, we’re going to create a new SNS topic. However, you can use your existing SNS topics or use other options such as AWS Lambda or auto scaling action.
Choose Create topic.

Scroll down and choose Next.
Enter an alarm name and a description and choose Next.
Review all the details you’ve entered and choose Create alarm, as shown in the following screenshot.

By following these steps, you’ve successfully configured a CloudWatch alarm using anomaly detection that monitors your Data Catalog usage with the threshold that you set. The alarm will trigger when the CallCount metric exceeds the calculated threshold, sending notifications to your specified SNS topic and email endpoints.

This proactive monitoring approach prevents API rate limit issues and provides a smooth operation of your Data Catalog usage. For more information on using CloudWatch alarms, refer to Using Amazon CloudWatch alarms.

Conclusion

AWS Glue Data Catalog usage metrics is an effective enhancement to your data infrastructure monitoring capabilities. It addresses the growing need for detailed observability through Amazon CloudWatch in modern data architectures built on top of Data Catalog. You now have access to more granular statistics, moving beyond simple maximum and average request metrics to comprehensive performance indicators including p99 percentiles. These metrics are emitted in 1-minute intervals, providing visibility into your data catalog operations. Organizations can now proactively identify bottlenecks before they affect operations and efficiently conduct capacity planning through detailed usage patterns.

From building monitoring dashboards to setting up alerts, the native support with CloudWatch anomaly detection and flexible alarm configurations makes it straightforward to proactively monitor your lakehouse deployment and prevent abnormalities in your lakehouse usage. For more information, refer to Monitoring Data Catalog usage metrics in Amazon CloudWatch in the AWS Glue documentation. We recommend testing and using these metrics as part of your modern monitoring and observability strategy. We encourage you to share your feedback with us.

About the authors

David Zhang is an Analytics Solutions Architect specializing in designing and implementing large-scale data infrastructure, ETL processes, and extensive data management systems. He helps customers modernize data platforms on Amazon Web Services (AWS). David is also an active speaker at AWS events and contributor to technical content and open source initiatives. He enjoys playing volleyball, tennis, and basketball during his free time.

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Abhay Joshi is a Software Development Engineer at AWS Glue and AWS Lake Formation. He is passionate about building fault tolerant and reliable distributed systems at scale.

Build a serverless analytics application with Amazon Redshift and Amazon API Gateway

2023-01-24 David Zhang

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/build-a-serverless-analytics-application-with-amazon-redshift-and-amazon-api-gateway/

Serverless applications are a modernized way to perform analytics among business departments and engineering teams. Business teams can gain meaningful insights by simplifying their reporting through web applications and distributing it to a broader audience.

Use cases can include the following:

Dashboarding – A webpage consisting of tables and charts where each component can offer insights to a specific business department.
Reporting and analysis – An application where you can trigger large analytical queries with dynamic inputs and then view or download the results.
Management systems – An application that provides a holistic view of the internal company resources and systems.
ETL workflows – A webpage where internal company individuals can trigger specific extract, transform, and load (ETL) workloads in a user-friendly environment with dynamic inputs.
Data abstraction – Decouple and refactor underlying data structure and infrastructure.
Ease of use – An application where you want to give a large set of user-controlled access to analytics without having to onboard each user to a technical platform. Query updates can be completed in an organized manner and maintenance has minimal overhead.

In this post, you will learn how to build a serverless analytics application using Amazon Redshift Data API and Amazon API Gateway WebSocket and REST APIs.

Amazon Redshift is fully managed by AWS, so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later.

API Gateway is a fully managed service that makes it easy for developers to publish, maintain, monitor, and secure APIs at any scale. With API Gateway, you can create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications. API Gateway supports containerized and serverless workloads, as well as web applications. API Gateway acts as a reverse proxy to many of the compute resources that AWS offers.

Event-driven model

Event-driven applications are increasingly popular among customers. Analytical reporting web applications can be implemented through an event-driven model. The applications run in response to events such as user actions and unpredictable query events. Decoupling the producer and consumer processes allows greater flexibility in application design and building decoupled processes. This design can be achieved with the Data API and API Gateway WebSocket and REST APIs.

Both REST API calls and WebSocket establish communication between the client and the backend. Due to the popularity of REST, you may wonder why WebSockets are present and how they contribute to an event-driven design.

What are WebSockets and why do we need them?

Unidirectional communication is customary when building analytical web solutions. In traditional environments, the client initiates a REST API call to run a query on the backend and either synchronously or asynchronously waits for the query to complete. The “wait” aspect is engineered to apply the concept of polling. Polling in this context is when the client doesn’t know when a backend process will complete. Therefore, the client will consistently make a request to the backend and check.

What is the problem with polling? Main challenges include the following:

Increased traffic in your network bandwidth – A large number of users performing empty checks will impact your backend resources and doesn’t scale well.
Cost usage – Empty requests don’t deliver any value to the business. You pay for the unnecessary cost of resources.
Delayed response – Polling is scheduled in time intervals. If the query is complete in-between these intervals, the user can only see the results after the next check. This delay impacts the user experience and, in some cases, may result in UI deadlocks.

For more information on polling, check out From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

WebSockets is another approach compared to REST when establishing communication between the front end and backend. WebSockets enable you to create a full duplex communication channel between the client and the server. In this bidirectional scenario, the client can make a request to the server and is notified when the process is complete. The connection remains open, with minimal network overhead, until the response is received.

You may wonder why REST is present, since you can transfer response data with WebSockets. A WebSocket is a light weight protocol designed for real-time messaging between systems. The protocol is not designed for handling large analytical query data and in API Gateway, each frame’s payload can only hold up to 32 KB. Therefore, the REST API performs large data retrieval.

By using the Data API and API Gateway, you can build decoupled event-driven web applications for your data analytical needs. You can create WebSocket APIs with API Gateway and establish a connection between the client and your backend services. You can then initiate requests to perform analytical queries with the Data API. Due to the Data API’s asynchronous nature, the query completion generates an event to notify the client through the WebSocket channel. The client can decide to either retrieve the query results through a REST API call or perform other follow-up actions. The event-driven architecture enables bidirectional interoperable messages and data while keeping your system components agnostic.

Solution overview

In this post, we show how to create a serverless event-driven web application by querying with the Data API in the backend, establishing a bidirectional communication channel between the user and the backend with the WebSocket feature in API Gateway, and retrieving the results using its REST API feature. Instead of designing an application with long-running API calls, you can use the Data API. The Data API allows you to run SQL queries asynchronously, removing the need to hold long, persistent database connections.

The web application is protected using Amazon Cognito, which is used to authenticate the users before they can utilize the web app and also authorize the REST API calls when made from the application.

Other relevant AWS services in this solution include AWS Lambda and Amazon EventBridge. Lambda is a serverless, event-driven compute resource that enables you to run code without provisioning or managing servers. EventBridge is a serverless event bus allowing you to build event-driven applications.

The solution creates a lightweight WebSocket connection between the browser and the backend. When a user submits a request using WebSockets to the backend, a query is submitted to the Data API. When the query is complete, the Data API sends an event notification to EventBridge. EventBridge signals the system that the data is available and notifies the client. Afterwards, a REST API call is performed to retrieve the query results for the client to view.

We have published this solution on the AWS Samples GitHub repository and will be referencing it during the rest of this post.

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with AWS CloudFormation templates run as part of the shell script with some parameter variables.

The application performs the following steps (note the corresponding numbered steps in the process flow):

A web application is provisioned on AWS Amplify; the user needs to sign up first by providing their email and a password to access the site.
The user verifies their credentials using a pin sent to their email. This step is mandatory for the user to then log in to the application and continue access to the other features of the application.
After the user is signed up and verified, they can sign in to the application and requests data through their web or mobile clients with input parameters. This initiates a WebSocket connection in API Gateway. (Flow 1, 2)
The connection request is handled by a Lambda function, OnConnect, which initiates an asynchronous database query in Amazon Redshift using the Data API. The SQL query is taken from a SQL script in Amazon Simple Storage Service (Amazon S3) with dynamic input from the client. (Flow 3, 4, 6, 7)
In addition, the OnConnect Lambda function stores the connection, statement identifier, and topic name in an Amazon DynamoDB database. The topic name is an extra parameter that can be used if users want to implement multiple reports on the same webpage. This allows the front end to map responses to the correct report. (Flow 3, 4, 5)
The Data API runs the query, mentioned in step 2. When the operation is complete, an event notification is sent to EventBridge. (Flow 8)
EventBridge activates an event rule to redirect that event to another Lambda function, SendMessage. (Flow 9)
The SendMessage function notifies the client that the SQL query is complete via API Gateway. (Flow 10, 11, 12)
After the notification is received, the client performs a REST API call (GET) to fetch the results. (Flow 13, 14, 15, 16)
The GetResult function is triggered, which retrieves the SQL query result and returns it to the client.
The user is now able to view the results on the webpage.
When clients disconnect from their browser, API Gateway automatically deletes the connection information from the DynamoDB table using the onDisconnect function. (Flow 17, 18,19)

Prerequisites

Prior to deploying your event-driven web application, ensure you have the following:

An Amazon Redshift cluster in your AWS environment – This is your backend data warehousing solution to run your analytical queries. For instructions to create your Amazon Redshift cluster, refer to Getting started with Amazon Redshift.
- After you create your Amazon Redshift cluster, add the AmazonS3ReadOnlyAccess permission to the associated cluster AWS Identity and Access Management (IAM) role.
An S3 bucket that you have access to – The S3 bucket will be your object storage solution where you can store your SQL scripts. To create your S3 bucket, refer to Create your first S3 bucket.

Deploy CloudFormation templates

The code associated to the design is available in the following GitHub repository. You can clone the repository inside an AWS Cloud9 environment in our AWS account. The AWS Cloud9 environment comes with AWS Command Line Interface (AWS CLI) installed, which is used to run the CloudFormation templates to set up the AWS infrastructure. Make sure that the jQuery library is installed; we use it to parse the JSON output during the run of the script.

The complete architecture is set up using three CloudFormation templates:

cognito-setup.yaml – Creates the Amazon Cognito user pool to web app client, which is used for authentication and protecting the REST API
backend-setup.yaml – Creates all the required Lambda functions and the WebSocket and Rest APIs, and configures them on API Gateway
webapp-setup.yaml – Creates the web application hosting using Amplify to connect and communicate with the WebSocket and Rest APIs.

These CloudFormation templates are run using the script.sh shell script, which takes care of all the dependencies as required.

A generic template is provided for you to customize your own DDL SQL scripts as well as your own query SQL scripts. We have created sample scripts for you to follow along.

Download the sample DDL script and upload it to an existing S3 bucket.
Change the IAM role value to your Amazon Redshift cluster’s IAM role with permissions to AmazonS3ReadOnlyAccess.

For this post, we copy the New York Taxi Data 2015 dataset from a public S3 bucket.

Download the sample query script and upload it to an existing S3 bucket.
Upload the modified sample DDL script and the sample query script into a preexisting S3 bucket that you own, and note down the S3 URI path.

If you want to run your own customized version, modify the DDL and query script to fit your scenario.

Edit the script.sh file before you run it and set the values for the following parameters:

- RedshiftClusterEndpoint (aws_redshift_cluster_ep) – Your Amazon Redshift cluster endpoint available on the AWS Management Console
- DBUsername (aws_dbuser_name) – Your Amazon Redshift database user name
- DDBTableName (aws_ddbtable_name) – The name of your DynamoDB table name that will be created
- WebsocketEndpointSSMParameterName (aws_wsep_param_name) – The parameter name that stores the WebSocket endpoint in AWS Systems Manager Parameter Store.
- RestApiEndpointSSMParameterName (aws_rapiep_param_name) – The parameter name that stores the REST API endpoint in Parameter Store.
- DDLScriptS3Path (aws_ddl_script_path) – The S3 URI to the DDL script that you uploaded.
- QueryScriptS3Path (aws_query_script_path) – The S3 URI to the query script that you uploaded.
- AWSRegion (aws_region) – The Region where the AWS infrastructure is being set up.
- CognitoPoolName (aws_user_pool_name) – The name you want to give to your Amazon Cognito user pool
- ClientAppName (aws_client_app_name) – The name of the client app to be configured for the web app to handle the user authentication for the users

The default acceptable values are already provided as part of the downloaded code.

Run the script using the following command:

./script.sh

During deployment, AWS CloudFormation creates and triggers the Lambda function SetupRedshiftLambdaFunction, which sets up an Amazon Redshift database table and populates data into the table. The following diagram illustrates this process.

Use the demo app

When the shell script is complete, you can start interacting with the demo web app:

On the Amplify console, under All apps in the navigation pane, choose DemoApp.
Choose Run build.

The DemoApp web application goes through a phase of Provision, Build, Deploy.

When it’s complete, use the URL provided to access the web application.

The following screenshot shows the web application page. It has minimal functionality: you can sign in, sign up, or verify a user.

Choose Sign Up.

For Email ID, enter an email.
For Password, enter a password that is at least eight characters long, has at least one uppercase and lowercase letter, at least one number, and at least one special character.
Choose Let’s Enroll.

The Verify your Login to Demo App page opens.

Enter your email and the verification code sent to the email you specified.
Choose Verify.

You’re redirected to a login page.

You’re redirected to the demoPage.html website.

Choose Open Connection.

You now have an active WebSocket connection between your browser and your backend AWS environment.

For Trip Month, specify a month (for this example, December) and choose Submit.

You have now defined the month and year you want to query your data upon. After a few seconds, you can to see the output delivered from the WebSocket.

You may continue using the active WebSocket connection for additional queries—just choose a different month and choose Submit again.

When you’re done, choose Close Connection to close the WebSocket connection.

For exploratory purposes, while your WebSocket connection is active, you can navigate to your DynamoDB table on the DynamoDB console to view the items that are currently stored. After the WebSocket connection is closed, the items stored in DynamoDB are deleted.

Clean up

To clean up your resources, complete the following steps:

On the Amazon S3 console, navigate to the S3 bucket containing the sample DDL script and query script and delete them from the bucket.
On the Amazon Redshift console, navigate to your Amazon Redshift cluster and delete the data you copied over from the sample DDL script.
1. Run truncate nyc_yellow_taxi;
2. Run drop table nyc_yellow_taxi;
On the AWS CloudFormation console, navigate to the CloudFormation stacks and choose Delete. Delete the stacks in the following order:
1. WebappSetup
2. BackendSetup
3. CognitoSetup

All resources created in this solution will be deleted.

Monitoring

You can monitor your event-driven web application events, user activity, and API usage with Amazon CloudWatch and AWS CloudTrail. Most areas of this solution already have logging enabled. To view your API Gateway logs, you can turn on CloudWatch Logs. Lambda comes with default logging and monitoring and can be accessed with CloudWatch.

Security

You can secure access to the application using Amazon Cognito, which is a developer-centric and cost-effective customer authentication, authorization, and user management solution. It provides both identity store and federation options that can scale easily. Amazon Cognito supports logins with social identity providers and SAML or OIDC-based identity providers, and supports various compliance standards. It operates on open identity standards (OAuth2.0, SAML 2.0, and OpenID Connect). You can also integrate it with API Gateway to authenticate and authorize the REST API calls either using the Amazon Cognito client app or a Lambda function.

Considerations

The nature of this application includes a front-end client initializing SQL queries to Amazon Redshift. An important component to consider are potential malicious activities that the client can perform, such as SQL injections. With the current implementation, that is not possible. In this solution, the SQL queries preexist in your AWS environment and are DQL statements (they don’t alter the data or structure). However, as you develop this application to fit your business, you should evaluate these areas of risk.

AWS offers a variety of security services to help you secure your workloads and applications in the cloud, including AWS Shield, AWS Network Firewall, AWS Web Application Firewall, and more. For more information and a full list, refer to Security, Identity, and Compliance on AWS.

Cost optimization

The AWS services that the CloudFormation templates provision in this solution are all serverless. In terms of cost optimization, you only pay for what you use. This model also allows you to scale without manual intervention. Review the following pages to determine the associated pricing for each service:

Conclusion

In this post, we showed you how to create an event-driven application using the Amazon Redshift Data API and API Gateway WebSocket and REST APIs. The solution helps you build data analytical web applications in an event-driven architecture, decouple your application, optimize long-running database queries processes, and avoid unnecessary polling requests between the client and the backend.

You also used severless technologies, API Gateway, Lambda, DynamoDB, and EventBridge. You didn’t have to manage or provision any servers throughout this process.

This event-driven, serverless architecture offers greater extensibility and simplicity, making it easier to maintain and release new features. Adding new components or third-party products is also simplified.

With the instructions in this post and the generic CloudFormation templates we provided, you can customize your own event-driven application tailored to your business. For feedback or contributions, we welcome you to contact us through the AWS Samples GitHub Repository by creating an issue.

About the Authors

David Zhang is an AWS Data Architect in Global Financial Services. He specializes in designing and implementing serverless analytics infrastructure, data management, ETL, and big data systems. He helps customers modernize their data platforms on AWS. David is also an active speaker and contributor to AWS conferences, technical content, and open-source initiatives. During his free time, he enjoys playing volleyball, tennis, and weightlifting. Feel free to connect with him on LinkedIn.

Manash Deb is a Software Development Manager in the AWS Directory Service team. With over 18 years of software dev experience, his passion is designing and delivering highly scalable, secure, zero-maintenance applications in the AWS identity and data analytics space. He loves mentoring and coaching others and to act as a catalyst and force multiplier, leading highly motivated engineering teams, and building large-scale distributed systems.

Pavan Kumar Vadupu Lakshman Manikya is an AWS Solutions Architect who helps customers design robust, scalable solutions across multiple industries. With a background in enterprise architecture and software development, Pavan has contributed in creating solutions to handle API security, API management, microservices, and geospatial information system use cases for his customers. He is passionate about learning new technologies and solving, automating, and simplifying customer problems using these solutions.

Get started with the Amazon Redshift Data API

2021-08-31 David Zhang

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/get-started-with-the-amazon-redshift-data-api/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. Tens of thousands of customers use Amazon Redshift to process exabytes of data to power their analytical workloads.

The Amazon Redshift Data API is an Amazon Redshift feature that simplifies access to your Amazon Redshift data warehouse by removing the need to manage database drivers, connections, network configurations, data buffering, credentials, and more. You can run SQL statements using the AWS Software Development Kit (AWS SDK), which supports different languages such as C++, Go, Java, JavaScript, .Net, Node.js, PHP, Python, and Ruby.

Since you’re reading this post, you may also be interested in the following AWS Online Tech Talks video for more info:

Integrating Web Services and Serverless Applications using Amazon Redshift Data API

With the Data API, you can programmatically access data in your Amazon Redshift cluster from different AWS services such as AWS Lambda, Amazon SageMaker notebooks, AWS Cloud9, and also your on-premises applications using the AWS SDK. This allows you to build cloud-native, containerized, serverless, web-based, and event-driven applications on the AWS Cloud.

The Data API also enables you to run analytical queries on Amazon Redshift’s native tables, external tables in your data lake via Amazon Redshift Spectrum, and also across Amazon Redshift clusters, which is known as data sharing. You can also perform federated queries with external data sources such as Amazon Aurora.

In an earlier, post, we shared in great detail on how you can use the Data API to interact with your Amazon Redshift data warehouse. In this post, we learn how to get started with the Data API in different languages and also discuss various use cases in which customers are using this to build modern applications combining modular, serverless, and event-driven architectures. The Data API was launched in September 2020, and thousands of our customers are already using it for a variety of use cases:

Extract, transform, and load (ETL) orchestration with AWS Step Functions
Access Amazon Redshift from applications
Access Amazon Redshift from SageMaker Jupyter notebooks
Access Amazon Redshift with REST endpoints
Event-driven extract, load, transformation
Event-driven web application design
Serverless data processing workflows

Key features of the Data API

In this section, we discuss the key features of the Data API.

Use different programming language of your choice

The Data API integrates with the AWS SDK to run queries. Therefore, you can use any language supported by the AWS SDK to build your application with it, such as C++, Go, Java, JavaScript, .NET, Node.js, PHP, Python, and Ruby.

Run individual or batch SQL statements

With the Data API, you can run individual queries from your application or submit a batch of SQL statements within a transaction, which is useful to simplify your workload.

Run SQL statements with parameters

With the Data API, you can run parameterized SQL queries, which brings the ability to write reusable code when developing ETL code by passing parameters into a SQL template instead of concatenating parameters into each query on their own. This also makes it easier to migrate code from existing applications that needs parameterization. In addition, parameterization also makes code secure by eliminating malicious SQL injection.

No drivers needed

With the Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC drivers. The Data API eliminates the need for configuring drivers and managing database connections. You can run SQL commands to your Amazon Redshift cluster by calling a Data API secured API endpoint.

No network or security group setup

The Data API doesn’t need a persistent connection with Amazon Redshift. Instead, it provides a secure HTTP endpoint, which you can use to run SQL statements. Therefore, you don’t need to set up and manage a VPC, security groups, and related infrastructure to access Amazon Redshift with the Data API.

No password management

The Data API provides two options to provide credentials:

IAM temporary credentials – With this approach, you only need to provide the username and AWS Identity and Access Management (IAM) GetClusterCredentials permission to access Amazon Redshift with no password. The Data API automatically handles retrieving your credentials using IAM temporary credentials.
AWS Secrets Manager secret – With this approach, you store your username and password credentials in an AWS Secrets Manager secret and allow the Data API to access those credentials from that secret.

You can also use the Data API when working with federated logins through IAM credentials. You don’t have to pass database credentials via API calls when using identity providers such as Okta, Azure Active Directory, or database credentials stored in Secrets Manager. If you’re using Lambda, the Data API provides a secure way to access your database without the additional overhead of launching Lambda functions in Amazon Virtual Private Cloud (Amazon VPC).

Asynchronous

The Data API is asynchronous. You can run long-running queries without having to wait for it to complete, which is key in developing a serverless, microservices-based architecture. Each query results in a query ID, and you can use this ID to check the status and response of the query. In addition, query results are stored for 24 hours.

Event notifications

You can monitor Data API events in Amazon EventBridge, which delivers a stream of real-time data from your source application to targets such as Lambda. This option is available when you’re running your SQL statements in the Data API using the WithEvent parameter set to true. When a query is complete, the Data API can automatically send event notifications to EventBridge, which you may use to take further actions. This helps you design event-driven applications with Amazon Redshift. For more information, see Monitoring events for the Amazon Redshift Data API in Amazon EventBridge.

Get started with the Data API

It’s easy to get started with the Data API using the AWS SDK. You can use the Data API to run your queries on Amazon Redshift using different languages such as C++, Go, Java, JavaScript, .Net, Node.js, PHP, Python and Ruby.

All API calls from different programming languages follow similar parameter signatures. For instance, you can run the ExecuteStatement API to run individual SQL statements in the AWS Command Line Interface (AWS CLI) or different languages such as Python and JavaScript (NodeJS).

The following code is an example using the AWS CLI:

aws redshift-data execute-statement
--cluster-identifier my-redshift-cluster
--database dev
--db-user awsuser
--sql 'select data from redshift_table’;

The following example code uses Python:

boto3.client("redshift-data").execute_statement(
	ClusterIdentifier = ‘my-redshift-cluster’,
	Database = ‘dev’,
	DbUser = ‘awsuser’,
	Sql = 'select data from redshift_table’)

The following code uses JavaScript (NodeJS):

new AWS.RedshiftData().executeStatement({
	ClusterIdentifier: 'my-redshift-cluster',
	Database: 'dev',
	DbUser: 'awsuser',
	Sql: 'select data from redshift_table'
}).promise();

We have also published a GitHub repository showcasing how to get started with the Data API in different languages such as Go, Java, JavaScript, Python, and TypeScript. You may go through the step-by-step process explained in the repository to build your custom application in all these languages using the Data API.

Use cases with the Data API

You can use the Data API to modernize and simplify your application architectures by creating modular, serverless, event-driven applications with Amazon Redshift. In this section, we discuss some common use cases.

ETL orchestration with Step Functions

When performing ETL workflows, you have to complete a number of steps. As your business scales, the steps and dependencies often become complex and difficult to manage. With the Data API and Step Functions, you can easily orchestrate complex ETL workflows. You can explore the following example use case and AWS CloudFormation template demonstrating ETL orchestration using the Data API and Step Functions.

Access Amazon Redshift from custom applications

If you’re designing your custom application in any programming language that is supported by the AWS SDK, the Data API simplifies data access from your applications, which may be an application hosted on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS) and other compute services or a serverless application built with Lambda. You can explore an example use case and CloudFormation template showcasing how to easily work with the Data API from Amazon EC2 based applications.

Access Amazon Redshift from SageMaker Jupyter notebooks

SageMaker notebooks are very popular among the data science community to analyze and solve machine learning problems. The Data API makes it easy to access and visualize data from your Amazon Redshift data warehouse without troubleshooting issues on password management or VPC or network issues. You can learn more about this use case along with a CloudFormation template showcasing how to use the Data API to interact from a SageMaker Jupyter notebook.

Access Amazon Redshift with REST endpoints

With the AWS SDK, you can use the Data APIs to directly invoke them as REST API calls such as GET or POST methods. For more information, see REST for Redshift Data API.

Event-driven ELT

Event–driven applications are popular with many customers, where applications run in response to events. A primary benefit of this architecture is the decoupling of producer and consumer processes, which allows greater flexibility in application design and building decoupled processes. For more information, see Building an event-driven application with AWS Lambda and the Amazon Redshift Data API. The following CloudFormation template demonstrates the same.

Event-driven web application design

Similar to event-driven ELT applications, event-driven web applications are also becoming popular, especially if you want to avoid long-running database queries, which create bottlenecks for the application servers. For example, you may be running a web application that has a long-running database query taking a minute to complete. Instead of designing that web application with long-running API calls, you can use the Data API and Amazon API Gateway WebSockets, which creates a lightweight websocket connection with the browser and submits the query to Amazon Redshift using the Data API. When the query is finished, the Data API sends a notification to EventBridge about its completion. When the data is available in the Data API, it’s pushed back to this browser session and the end-user can view the dataset. You can explore an example use case along with a CloudFormation template showcasing how to build an event-driven web application using the Data API and API Gateway WebSockets.

Serverless data processing workflows

With the Data API, you can design a serverless data processing workflow, where you can design an end-to-end data processing pipeline orchestrated using serverless AWS components such as Lambda, EventBridge, and the Data API client. Typically, a data pipeline involves multiple steps, for example:

Load raw sales and customer data to a data warehouse.
Integrate and transform the raw data.
Build summary tables or unload this data to a data lake so subsequent steps can consume this data.

The example use case Serverless Data Processing Workflow using Amazon Redshift Data Api demonstrates how to chain multiple Lambda functions in a decoupled fashion and build an end-to-end data pipeline. In that code sample, a Lambda function is run through a scheduled event that loads raw data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift. On its completion, the Data API generates an event that triggers an event rule in EventBridge to invoke another Lambda function that prepares and transforms raw data. When that process is complete, it generates another event triggering a third EventBridge rule to invoke another Lambda function and unloads the data to Amazon S3. The Data API enables you to chain this multi-step data pipeline in a decoupled fashion.

Conclusion

The Data API offers many additional benefits when integrating Amazon Redshift into your analytical workload. The Data API simplifies and modernizes current analytical workflows and custom applications. You can perform long-running queries without having to pause your application for the queries to complete. This enables you to build event-driven applications as well as fully serverless ETL pipelines.

The Data API functionalities are available in many different programming languages to suit your environment. As mentioned earlier, there are a wide variety of use cases and possibilities where you can use the Data API to improve your analytical workflow. To learn more, see Using the Amazon Redshift Data API.

About the Authors

David Zhang is an AWS Solutions Architect who helps customers design robust, scalable, and data-driven solutions across multiple industries. With a background in software engineering, David is an active leader and contributor to AWS open-source initiatives. He is passionate about solving real-world business problems and continuously strives to work from the customer’s perspective.

Bipin Pandey is a Data Architect at AWS. He loves to build data lake and analytics platform for his customers. He is passionate about automating and simplifying customer problems with the use of cloud solutions.

Manash Deb is a Senior Analytics Specialist Solutions Architect at AWS. He has worked on building end-to-end data-driven solutions in different database and data warehousing technologies for over 15 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS.

Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years. His Linkedin profile is here.

Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective database to empower customers with data-driven decision making.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Access Data Catalog usage metrics in Amazon CloudWatch console

Monitor CallCount metrics

Metrics and dimensions for Data Catalog usage metrics

Set up CloudWatch alarms for Data Catalog usage metrics

Conclusion

About the authors

Event-driven model

What are WebSockets and why do we need them?

Solution overview

Prerequisites

Deploy CloudFormation templates

Use the demo app

Clean up

Monitoring

Security

Considerations

Cost optimization

Conclusion

About the Authors

Key features of the Data API

Use different programming language of your choice

Run individual or batch SQL statements

Run SQL statements with parameters

No drivers needed

No network or security group setup

No password management

Asynchronous

Event notifications

Get started with the Data API

Use cases with the Data API

ETL orchestration with Step Functions

Access Amazon Redshift from custom applications

Access Amazon Redshift from SageMaker Jupyter notebooks

Access Amazon Redshift with REST endpoints

Event-driven ELT

Event-driven web application design

Serverless data processing workflows

Conclusion

About the Authors

The collective thoughts of the interwebz