Use the Amazon Redshift SQLAlchemy dialect to interact with Amazon Redshift

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables you to analyze your data at scale. You can interact with an Amazon Redshift database in several different ways. One method is using an object-relational mapping (ORM) framework. ORM is widely used by developers as an abstraction layer on top of the database, which allows you to write code in your preferred programming language instead of writing SQL. SQLAlchemy is a popular Python ORM framework that enables interaction between Python code and databases.

A SQLAlchemy dialect is the system used to communicate with various types of DBAPI implementations and databases. Previously, the SQLAlchemy dialect for Amazon Redshift used psycopg2 for communication with the database. Because psycopg2 is a PostgreSQL connector, it doesn’t support Amazon Redshift-specific functionality such as AWS Identity and Access Management (IAM) authentication for secure connections, or Amazon Redshift-specific data types such as SUPER and GEOMETRY. The new Amazon Redshift SQLAlchemy dialect uses the Amazon Redshift Python driver (redshift_connector) and lets you securely connect to your Amazon Redshift database. It natively supports IAM authentication and single sign-on (SSO), and it supports Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this post, we discuss how you can interact with your Amazon Redshift database using the new Amazon Redshift SQLAlchemy dialect. We demonstrate how you can securely connect using Okta and perform various DDL and DML operations. Because the new Amazon Redshift SQLAlchemy dialect uses redshift_connector, users of this package can take full advantage of the connection options provided by redshift_connector, such as authenticating via IAM and identity provider (IdP) plugins. We also demonstrate the support for IPython SqlMagic, which simplifies running interactive SQL queries directly from a Jupyter notebook.

Prerequisites

The following are the prerequisites for this post:

  * An Amazon Redshift cluster.
  * Access to the Amazon SageMaker console to create a Jupyter notebook instance.
  * For the SSO demonstration, an Okta account configured as your identity provider.

Get started with the Amazon Redshift SQLAlchemy dialect

It’s easy to get started with the Amazon Redshift SQLAlchemy dialect for Python. You can install the sqlalchemy-redshift library using pip. To demonstrate this, we start with a Jupyter notebook. Complete the following steps:

  1. On the Amazon SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Create a notebook instance (for this post, we call it redshift-sqlalchemy).
  3. Find the instance you created and choose Open Jupyter.
  4. In your notebook instance, create a new conda_python3 Jupyter notebook.
  5. Run the following commands to install sqlalchemy-redshift and redshift_connector:
pip install sqlalchemy-redshift
pip install redshift_connector


redshift_connector provides many different connection options that help customize how you access your Amazon Redshift cluster. For more information, see Connection Parameters.
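For example, the following is a minimal sketch of opening a connection with redshift_connector directly; all connection values are placeholders you would replace with your own cluster details, and the optional timeout parameter is one of the options described in Connection Parameters:

import redshift_connector

# open a connection directly with redshift_connector;
# replace the placeholder values with your own cluster details
conn = redshift_connector.connect(
    host='<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com',
    database='dev',
    user='awsuser',
    password='<pwd>',
    timeout=10  # optional: seconds to wait before the connection attempt times out
)

cursor = conn.cursor()
cursor.execute('select current_user')
print(cursor.fetchone())
conn.close()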

Connect to your Amazon Redshift cluster

In this step, we show you how to connect to your Amazon Redshift cluster using two different methods: Okta SSO federation and a direct connection using your database user and password.

Connect with Okta SSO federation

As a prerequisite, set up your Amazon Redshift application in your Okta configuration. For more information, see Federate Amazon Redshift access with Okta as an identity provider.

To establish a connection to the Amazon Redshift cluster, we use the create_engine function. The SQLAlchemy create_engine() function produces an engine object based on a URL. The sqlalchemy-redshift package provides a custom interface for creating an RFC 1738-compliant URL that you can use to establish a connection to an Amazon Redshift cluster.

We build the SQLAlchemy URL as shown in the following code. URL.create() is available in SQLAlchemy version 1.4 and above. When authenticating using IAM, the host and port don’t need to be specified by the user. To connect securely to Amazon Redshift using SSO federation, we use the Okta user name and password in the URL.

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy import orm as sa_orm

from sqlalchemy_redshift.dialect import SUPER, TIMESTAMPTZ, TIMETZ


# build the sqlalchemy URL. When authenticating using IAM, the host
# and port do not need to be specified by the user.
url = URL.create(
    drivername='redshift+redshift_connector',  # indicate the redshift_connector driver and dialect will be used
    database='dev',  # Amazon Redshift database
    username='<okta_username>',  # Okta user name
    password='<PWD>'  # Okta password
)

# a dictionary is used to store additional connection parameters
# that are specific to redshift_connector or cannot be URL encoded.
conn_params = {
    "iam": True,  # must be enabled when authenticating via IAM
    "credentials_provider": "OktaCredentialsProvider",
    "idp_host": "<prefix>.okta.com",
    "app_id": "<appid>",
    "app_name": "amazon_aws_redshift",
    "region": "<region>",
    "cluster_identifier": "<clusterid>",
    "ssl_insecure": False,  # ensures certificate verification occurs for idp_host
}

engine = sa.create_engine(url, connect_args=conn_params)

Connect with an Amazon Redshift database user and password

You can connect to your Amazon Redshift cluster using your database user and password. We construct a URL and use the URL.create() constructor, as shown in the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL

# build the sqlalchemy URL
url = URL.create(
    drivername='redshift+redshift_connector',  # indicate the redshift_connector driver and dialect will be used
    host='<clusterid>.xxxxxx.<aws-region>.redshift.amazonaws.com',  # Amazon Redshift host
    port=5439,  # Amazon Redshift port
    database='dev',  # Amazon Redshift database
    username='awsuser',  # Amazon Redshift username
    password='<pwd>'  # Amazon Redshift password
)

engine = sa.create_engine(url)

Next, we create a session using the engine we established above.

Session = sa_orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Define Session-based Metadata
metadata = sa.MetaData(bind=session.bind)

Create a database table using Amazon Redshift data types and insert data

With the new Amazon Redshift SQLAlchemy dialect, you can create tables with Amazon Redshift-specific data types such as SUPER, GEOMETRY, TIMESTAMPTZ, and TIMETZ.

In this step, you create a table with TIMESTAMPTZ, TIMETZ, and SUPER data types.

Optionally, you can define your table’s distribution style, sort key, and compression encoding. See the following code:

import datetime
import uuid
import random

table_name = 'product_clickstream_tz'

RedshiftDBTable = sa.Table(
    table_name,
    metadata,
    sa.Column('session_id', sa.VARCHAR(80)),
    sa.Column('click_region', sa.VARCHAR(100), redshift_encode='lzo'),
    sa.Column('product_id', sa.BIGINT),
    sa.Column('click_datetime', TIMESTAMPTZ),
    sa.Column('stream_time', TIMETZ),
    sa.Column('order_detail', SUPER),
    redshift_diststyle='KEY',
    redshift_distkey='session_id',
    redshift_sortkey='click_datetime'
)

# Drop the table if it already exists
if sa.inspect(engine).has_table(table_name):
    RedshiftDBTable.drop(bind=engine)

# Create the table (execute the "CREATE TABLE" SQL statement for "product_clickstream_tz")
RedshiftDBTable.create(bind=engine)

In this step, you populate the table by preparing and running the insert statement.

# create sample data set
# generate a UUID for this row
session_id = str(uuid.uuid1())

# create Region information
click_region = "US / New York"

# create Product information
product_id = random.randint(1,100000)

# create a datetime object with timezone
click_datetime = datetime.datetime(year=2021, month=10, day=20, hour=10, minute=12, second=40, tzinfo=datetime.timezone.utc)

# create a time object with timezone
stream_time = datetime.time(hour=10, minute=14, second=56, tzinfo=datetime.timezone.utc)

# create SUPER information
order_detail = '[{"o_orderstatus":"F","o_clerk":"Clerk#0000001991","o_lineitems":[{"l_returnflag":"R","l_tax":0.03,"l_quantity":4,"l_linestatus":"F"}]}]'

# create the insert SQL statement
insert_data_row = RedshiftDBTable.insert().values(
    session_id=session_id,
    click_region=click_region,
    product_id=product_id,
    click_datetime=click_datetime,
    stream_time=stream_time,
    order_detail=order_detail
)

# execute the insert SQL statement
session.execute(insert_data_row)
session.commit()

Query and fetch results from the table

The SELECT statements generated by SQLAlchemy ORM are constructed by a query object. You can use several different methods, such as all(), first(), count(), order_by(), and join().
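As a minimal sketch, assuming the session and RedshiftDBTable objects from the previous steps are still in scope, the following retrieves all rows from the product_clickstream_tz table and then counts them:

# fetch all rows from the table created above
get_all_rows = session.query(RedshiftDBTable).all()
for row in get_all_rows:
    print(row)

# count the number of rows in the table
print(session.query(RedshiftDBTable).count())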

Use IPython SqlMagic with the Amazon Redshift SQLAlchemy dialect

The Amazon Redshift SQLAlchemy dialect now supports SqlMagic. To establish a connection, you can build the SQLAlchemy URL with the redshift_connector driver. More information about SqlMagic is available on GitHub.

In the next section, we demonstrate how you can use SqlMagic. Make sure that you have the ipython-sql package installed; if not, install it by running the following command:

pip install ipython-sql

Connect to Amazon Redshift and query the data

In this step, you build the SQLAlchemy URL to connect to Amazon Redshift and run a sample SQL query. For this demo, we prepopulated the cluster with TPC-H data from GitHub. See the following code:

import sqlalchemy as sa
from sqlalchemy.engine.url import URL
from sqlalchemy.orm import Session
%reload_ext sql
%config SqlMagic.displaylimit = 25

connect_to_db = URL.create(
    drivername='redshift+redshift_connector',
    host='cluster.xxxxxxxx.region.redshift.amazonaws.com',
    port=5439,
    database='dev',
    username='awsuser',
    password='xxxxxx'
)
%sql $connect_to_db
%sql select current_user, version();

You can view the data in tabular format by converting the result set to a pandas DataFrame using its DataFrame() method.
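For example, assuming the TPC-H supplier table is loaded in your cluster, a minimal sketch looks like the following (the column names come from the standard TPC-H schema):

# run a query with SqlMagic and capture the result set
result = %sql select s_name, s_acctbal from supplier limit 5

# convert the result set to a pandas DataFrame for tabular display
df = result.DataFrame()
df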

If you installed matplotlib, you can use the result set’s .plot(), .pie(), and .bar() methods for quick plotting.
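The following is a minimal sketch, again assuming TPC-H data is loaded, that groups customers by market segment and renders a bar chart:

# aggregate rows by market segment and plot the counts as a bar chart
result = %sql select c_mktsegment, count(*) from customer group by c_mktsegment
result.bar()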

Clean up

Make sure that SQLAlchemy resources are closed and cleaned up when you’re done with them. SQLAlchemy uses a connection pool to provide access to an Amazon Redshift cluster. By default, these connections remain open after use; if they aren’t properly cleaned up, this can lead to connectivity issues with your cluster. Use the following code to clean up your resources:

# Close the session
session.close()

# If the connection was accessed directly, ensure it is invalidated
conn = engine.connect()
conn.invalidate()

# Clean up the engine
engine.dispose()

Summary

In this post, we discussed the new Amazon Redshift SQLAlchemy dialect. We demonstrated how it lets you securely connect to your Amazon Redshift database using SSO as well as a direct connection using the SQLAlchemy URL. We also demonstrated how the dialect supports the TIMESTAMPTZ, TIMETZ, and SUPER data types without explicit casting, and how redshift_connector and the dialect support SqlMagic with Jupyter notebooks, which enables you to run interactive queries against Amazon Redshift.


About the Authors

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 16 years of experience in the data warehousing and analytics space.

Brooke White is a Software Development Engineer at AWS. She enables customers to get the most out of their data through her work on Amazon Redshift drivers. Prior to AWS, she built ETL pipelines and analytics APIs at a San Francisco Bay Area startup.