Improve reusability and security using Amazon Athena parameterized queries

Post Syndicated from Blayze Stefaniak original https://aws.amazon.com/blogs/big-data/improve-reusability-and-security-using-amazon-athena-parameterized-queries/

Amazon Athena is a serverless interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL, and you only pay for the amount of data scanned by your queries. If you use SQL to analyze your business on a daily basis, you may find yourself repeatedly running the same queries, or similar queries with minor adjustments. Athena parameterized queries enable you to prepare statements you can reuse with different argument values you provide at run time. Athena parameterized queries also provide a layer of security against SQL injection attacks, and mask the query string in AWS CloudTrail for workloads with sensitive data.

This post shows you how to create and run parameterized queries in Athena. This post provides an example of how Athena parameterized queries protect against SQL injection, and shows the CloudTrail events with the masked query string. Lastly, the post reviews functions related to managing Athena prepared statements. If you want to follow along, this post provides steps to set up the components with a sample dataset; alternatively, you can use your own dataset.

Reusability

Athena prepared statements allow you to run and reuse queries within your Athena workgroup. By decoupling the queries from the code, you can update your prepared statements and your applications independent from one another. If a data lake has schema updates, it could require query updates. If multiple applications share the same Athena workgroup and are using similar queries, you can create a new query or update the existing query to serve multiple use cases, without each application being required to adjust similar queries in their own source code. Parameterized queries are currently supported for SELECT, INSERT INTO, CTAS, and UNLOAD statements. For the most current list, refer to Considerations and Limitations in Querying with Prepared Statements.

Security

Athena prepared statements provide a layer of protection against SQL injection. If you are using Athena behind an application interface, free text inputs inherently present a SQL injection threat vector which, if left unmitigated, could result in data exfiltration. When the parameterized query is run, Athena interprets the arguments as literal values, not as executable commands nor SQL fragments like SQL operators.

When using Athena, CloudTrail captures all Athena API calls as audit events to provide a record of actions taken by an AWS user, role, or AWS service. Customers with sensitive data in their data lakes, such as personally identifiable information (PII), have told us they don’t want query strings in their CloudTrail event history for compliance reasons. When running parameterized queries, the query string is masked with HIDDEN_DUE_TO_SECURITY_REASONS in the CloudTrail event, so you don’t show protected data within your log streams.

Solution overview

This post documents the steps using the public Amazon.com customer reviews dataset; however, you can follow similar steps to use your own dataset.

The example query is to find a product’s 4-star (out of 5 stars) reviews voted as the most helpful by other customers. The intent behind the query is to find query results that indicate constructive product feedback. The intent is to validate the feedback and get helpful feedback incorporated into the product roadmap. The product used in this use case is the Amazon Smile eGift Card.

Prerequisites

As a prerequisite, you need a foundational understanding of SQL syntax, as well as a foundational understanding the following AWS services:

Athena
AWS CloudFormation
AWS Glue and the AWS Glue Data Catalog
AWS Identity and Access Management (IAM)
Amazon S3

This post assumes you have:

A CloudTrail trail created – for instructions, refer to Creating a trail.
An S3 bucket created – for instructions, refer to Creating a bucket.

Deploy resources for the example dataset

If you’re using the example dataset, follow the steps in this section. The data is in an S3 bucket in an AWS-managed AWS account. You need to create Athena and AWS Glue resources to get started.

This post provides a CloudFormation template that deploys the following resources in your AWS account:

AthenaWorkGroup – An Athena workgroup for your dataset and prepared statements. On the console, this workgroup is named PreparedStatementsWG.
GlueDatabase – A database in the AWS Glue Data Catalog for table metadata. The database is named athena_prepared_statements.
GlueTableAmazonReviews – An external table with Amazon.com customer reviews in the Data Catalog.

The following diagram shows how the resources interact when the query runs.

To deploy the CloudFormation template, follow these steps:

Navigate to this post’s GitHub repository.
Clone the repository or copy the CloudFormation template athena-prepared-statements.yaml.
On the AWS CloudFormation console, choose Create stack.
Select Upload a template file and choose Choose file.
Upload athena-prepared-statements.yaml, then choose Next.
On the Specify stack details page, enter the stack name athena-prepared-statements-blog.
For S3QueryResultsBucketName, enter your S3 bucket name.
If you leave AthenaWorkGroupName as default, the Athena workgroup is named PreparedStatementsWG. If you change the value, the Athena workgroup name must be unique in your AWS Region.
Choose Next.
On the Configure stack options page, choose Next.
On the Review page, choose Create stack.

The script takes less than a minute to run and change to a CREATE_COMPLETE state. If you deploy the stack twice in the same AWS account and Region, the AWS Glue database, table, or Athena workgroup may already exist, and the process fails with a message indicating that the resource already exists in another template.

For least-privilege authorization for deployment of the CloudFormation template, you can create an AWS CloudFormation service role with the following IAM policy actions. To do this, you must create an IAM policy and IAM role, and choose this role when configuring stack options.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:DeleteDatabase"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/*",
        "arn:${Partition}:glue:${Region}:${Account}:userDefinedFunction/athena_prepared_statements/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/amazon_reviews_parquet"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:DeleteTable"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/amazon_reviews_parquet"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:CreateWorkGroup",
        "athena:DeleteWorkGroup",
        "athena:GetWorkGroup"
      ],
      "Resource": "arn:${Partition}:athena:${Region}:${Account}:workgroup/PreparedStatementsWG"
    }
  ]
}

For authorization for the IAM principal running the CloudFormation template and following along, this post was tested with the following AWS managed policies and the customer managed policy below.

AWS managed policies:

AmazonAthenaFullAccess
AWSCloudTrailReadOnlyAccess
AWSCloudFormationFullAccess

Customer managed policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ViewS3BucketsWithoutErrors",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "InteractWithMyBucketAndDataSetBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::${my-bucket-name}*",
                "arn:aws:s3:::amazon-reviews-pds*"
            ]
        },
        {
            "Sid": "UploadCloudFormationTemplate",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::cf-template*"
        },
        {
            "Sid": "CleanUpResults",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::${my-bucket-name}/results*"
            ]
        },
        {
            "Sid": "ListRolesForCloudFormationDeployment",
            "Effect": "Allow",
            "Action": [
                "iam:ListRoles"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "IAMRoleForCloudFormationDeployment",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:${Partition}:iam::${Account}:role/${role-name}"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "cloudformation.amazonaws.com"
                }
            }
        }
    ]
}

Partition the example dataset

The CloudFormation template created an external table pointing at a dataset of over 130 million customer reviews from Amazon.com. Partitioning data improves query performance and reduces cost by restricting the amount of data scanned by each query. The external table for this dataset has Hive-compatible partitions. The MSCK REPAIR TABLE SQL statement scans the prefix paths in the S3 bucket and updates the metadata in the Data Catalog with the partition metadata. To access the dataset, the external table’s partitions must be updated.

After you deploy the CloudFormation template, complete the following steps:

On the Athena console, choose Query editor in the navigation pane.
For Data Source, enter AwsDataCatalog.
For Database, enter athena_prepared_statements.
On the Workgroup drop-down menu, choose PreparedStatementsWG.
Choose Acknowledge to confirm.
In the query editor pane, run the following SQL statement for your external table:

MSCK REPAIR TABLE athena_prepared_statements.amazon_reviews_parquet;

This query takes approximately 15 seconds to run when tested in us-east-1.

Run the following query to list the available partitions. The example dataset has partitions based on product_category.

SHOW PARTITIONS athena_prepared_statements.amazon_reviews_parquet;

Run a SELECT statement to output a sample of data available in the table:

SELECT * FROM athena_prepared_statements.amazon_reviews_parquet limit 10;

Create prepared statements

To use Athena parameterized queries, first you run the PREPARE SQL statement and specify your positional parameters, denoted by question marks. The Athena prepared statement is stored with a name you specify.

Run the following PREPARE statement in the Athena query editor. This example query, named product_helpful_reviews, provides customer reviews with three parameters for a specified product ID, star rating provided by the reviewer, and minimum number of helpful votes provided to the review by other Amazon.com customers.

PREPARE product_helpful_reviews FROM
SELECT product_id, product_title, star_rating, helpful_votes, review_headline, review_body
FROM amazon_reviews_parquet WHERE product_id = ? AND star_rating = ? AND helpful_votes > ?
ORDER BY helpful_votes DESC
LIMIT 10;

You could also use the CreatePreparedStatement API or SDK. For example, to create your prepared statement from AWS CLI, run the following command:

aws athena create-prepared-statement \
--statement-name "product_helpful_reviews" \
--query-statement "SELECT product_id, product_title, star_rating, helpful_votes, review_headline, review_body FROM amazon_reviews_parquet WHERE product_id = ? AND star_rating = ? AND helpful_votes > ? ORDER BY helpful_votes DESC LIMIT 10;" \
--work-group PreparedStatementsWG \
--region region

For more information on creating prepared statements, refer to SQL Statements in Querying with Prepared Statements.

Run parameterized queries

You can run a parameterized query against the prepared statement with the EXECUTE SQL statement and a USING clause. The USING clause specifies the argument values for the prepared statement’s parameters.

Run the following EXECUTE statement in the Athena query editor. The prepared statement created in the previous section is run with parameters to output 4-star reviews for the Amazon Smile eGift Card product ID with at least 10 helpful votes.

EXECUTE product_helpful_reviews USING 'BT00DDVMVQ', 4, 10;

If you receive the message PreparedStatement product_helpful_reviews was not found in workGroup primary, make sure you selected the PreparedStatementsWG workgroup.

For more information on running parameterized queries, refer to SQL Statements in Querying with Prepared Statements.

Mask query string data in CloudTrail events using parameterized queries

You may want to use parameterized queries to redact sensitive data from the query string visible in CloudTrail events. For example, you may have columns containing PII as parameters, which you don’t want visible in logs. Athena automatically masks query strings from CloudTrail events for EXECUTE statements, replacing the query string with the value HIDDEN_DUE_TO_SECURITY_REASONS. This helps you avoid displaying protected data in your log streams.

To access the CloudTrail event for the query, complete the following steps:

Navigate to the Event history page on the CloudTrail console.
On the drop-down menu, choose Event name.
Search for StartQueryExecution events.

CloudTrail event records for parameterized queries include a queryString value redacted with HIDDEN_DUE_TO_SECURITY_REASONS. The query string is visible in the Athena workgroup’s query history. You can control access by using least-privilege IAM policies to Athena, the AWS Glue Data Catalog, and the Amazon S3 query output location in your workgroup settings. For more information on viewing recent queries, refer to Viewing Recent Queries. For more information on IAM policies, refer to Actions, resources, and condition keys for AWS services.

Layer of protection for SQL injection

In this section, you’re shown an example of a SQL injection attack, and how prepared statements can protect against the same attack. We use the Athena console to invoke the StartQueryExecution API against a table named users with three rows.

SQL injection is an attempt to insert malicious SQL code into requests to change the statement and extract data from your dataset’s tables. Without Athena parameterized queries, if you’re querying a dataset directly or appending user input to a SQL query, and users can append SQL fragments, the dataset may be vulnerable to SQL injection attacks which return unauthorized data in the result set.

This post shows an example of inserting a SQL fragment in a malicious way. In the example, an OR condition which will always return true (such as OR 1=1) is appended to the WHERE clause. The same example query is shown with Athena parameterized queries, and the query fails because it contains an invalid parameter value, since the parameter value is expected to be an integer but contains the characters “OR”. If the parameter was based on a String column, then the same SQL injection attempt would result in the query returning no results because the positional argument is interpreted as a literal parameter value.

Athena provides an additional layer of defense against multi-statement SQL injection attacks. Attempting to perform SQL injection with an executable command (such as DROP) results in a failed query with Athena providing an error Only one sql statement is allowed, because Athena only accepts one executable command per SQL statement submission.

Although Athena prepared statements provide a layer of protection against SQL injection attacks, other precautions provide additional layers of defense. Athena prepared statements can be a part of your defense-in-depth strategy. For more information on layers of security, refer to Amazon Athena Security.

SQL injection example

The intended use of the SELECT query in the example is to receive a small set of values. However, an attacker can manipulate the input to append malicious SQL code. For example, an attacker can input a value of 1 OR 1=1, which appends a true condition to the WHERE clause and returns all records in the table:

SELECT * FROM users WHERE id = 1 OR 1=1;

By appending malicious SQL code, the attacker can retrieve all rows of the users table, as shown in the following screenshot.
An image of the Athena graphical user interface. A query SELECT * FROM users WHERE id = 1 OR 1=1; has been run. All 3 users in the table, with ids 1, 2, and 3, returned with all columns of the table.

SQL injection attempt with a prepared statement

If we create prepared statements with the same query from the previous example, the executable command is passed as a literal argument for the parameter’s value. If a user tries to pass additional SQL, they receive a syntax error because the WHERE clause is based on ID, which expects an integer value.

Create a prepared statement using the same query against the users table:

PREPARE get_user FROM SELECT * FROM users WHERE id = ?

Set the parameter to a legitimate value:

EXECUTE get_user USING 1

The expected result returns, as shown in the following screenshot.

Graphical user interface of Athena running query EXECUTE get_user USING 1. Only the user with id 1 returned.

Now, attempt to pass a malicious value:

EXECUTE get_user USING 1 OR 1=1

Running this prepared statement produces a syntax error, because an integer value is expected, but it receives an invalid integer value of 1 OR 1=1. The query and syntax error are shown in the following screenshot.

Graphical user interface of Athena querying EXECUTE get_user USING 1 OR 1=1. There is an error. The error says "SYNTAX_ERROR: line 1:24: Left side of logical expression must evaluate to a boolean (actual: integer). This query ran against the "default" database, unless qualified by the query. Please post the error message in our forum."

Working with prepared statements

This section describes administrative functions to make it easier to work with prepared statements.

List all prepared statements in my AWS account

To list all prepared statements in an Athena workgroup from the AWS Command Line Interface (AWS CLI), you can run the following command:

aws athena list-prepared-statements --work-group workgroup_name --region region_name

If following the example above, the command will return the following response.

{
  "PreparedStatements": [
    {
      "StatementName": "product_helpful_reviews",
      "LastModifiedTime": "2022-01-14T15:33:07.935000+00:00"
    }
  ]
}

To list all available prepared statements in your AWS account, you can use the AWS APIs. This post provides a sample script using the AWS SDK for Python (Boto3) to loop through all Regions in your account, and provide the prepared statements per Athena workgroup.

Make sure you have AWS credentials where you plan to run the Python script. For more information, refer to Credentials.

Clone the GitHub repo or copy the Python script list-prepared-statements.py from the repo and run the script:

python3 list-prepared-statements.py

Replace <my-profile-name> with your AWS profile name when it prompts you, or leave empty to use default local credentials.

Enter the AWS CLI profile name or leave blank if using instance profile: <my-profile-name>

The following text is the output of the script. If following along, the response returns only the product_helpful_reviews prepared statement.

eu-north-1:
ap-south-1:
eu-west-3:
eu-west-2:
eu-west-1:
ap-northeast-3:
ap-northeast-2:
ap-northeast-1:
sa-east-1:
ca-central-1:
ap-southeast-1:
ap-southeast-2:
eu-central-1:
us-east-1:
        athena-v2-wg: my_select
        PreparedStatementsWG: get_user
        PreparedStatementsWG: get_contacts_by_company
        PreparedStatementsWG: product_helpful_reviews
        PreparedStatementsWG: count_stars
        PreparedStatementsWG: helpful_reviews
        PreparedStatementsWG: get_product_info
        PreparedStatementsWG: check_avg_stars_of_category
        PreparedStatementsWG: my_select_v1
        PreparedStatementsWG: my_select_v2
us-east-2:
us-west-1:
us-west-2:

Update prepared statements

You have a few options for updating prepared statements. You may want to do this to optimize your query performance, change the values you select, or for several other reasons.

Rerun the PREPARE statement with the changes in the Athena query editor or against the StartQueryExecution API.
Use the UpdatePreparedStatement API via the AWS CLI or SDK.

You can use this API to add a description to your prepared statements or update your queries. To update your query statement via this method, you must provide the statement name, workgroup name, updated query statement, and optionally a new description. For more information about the UpdatePreparedStatement API, refer to update-prepared-statement.

You may want to roll out versions of your query. To maintain backward-compatibility for users, you could create a new prepared statement with a different name. For example, the prepared statement could have a version number in its name (such as my_select_v1 and my_select_v2). When necessary, you could communicate changes to teams who rely on the prepared statement, and later deallocate the old prepared statement versions.

Delete prepared statements

To delete a prepared statement, you can use the following query syntax when against the StartQueryExecution API, or from within the Athena query editor:

DEALLOCATE PREPARE product_helpful_reviews

You could also use the DeletePreparedStatement API or SDK. For example, to delete your prepared statement from AWS CLI, run the following command:

aws athena delete-prepared-statement --statement-name product_helpful_reviews --work-group PreparedStatementsWG --region region

Clean up

If you followed along with this post, you created several components that may incur costs. To avoid future charges, remove the resources with the following steps:

Delete the S3 bucket’s results prefix created after you run a query on your workgroup.

With the default template, it’s named <S3QueryResultsBucketName>/athena-results. Use caution in this step. Unless you are using versioning on your S3 bucket, deleting S3 objects cannot be undone.

Delete the Athena prepared statements in the PreparedStatementsWG

You can follow the steps in the Delete prepared statements section of this post using either the DEALLOCATE PREPARE statement or delete-prepared-statement API for each prepared statement you created.

To remove the CloudFormation stack, select the stack on the AWS CloudFormation console, choose Delete, and confirm.

Conclusion

Athena parameterized queries make it easy to decouple your code base from your queries by providing a way to store common queries within your Athena workgroup. This post provided information about how Athena parameterized queries can improve your code reusability and data lake security. We showed how you can set up a sample data lake and start using parameterized queries today. We also provided an example of the protections parameterized queries offers, and detailed additional administrative functions.

You can get started with Athena prepared statements via the Athena console, the AWS CLI, or the AWS SDK. To learn more about Athena, refer to the Amazon Athena User Guide.

Thanks for reading this post! If you have questions about Athena parameterized queries, don’t hesitate to leave a comment in the comments section.

About the Authors

Blayze Stefaniak is a Senior Solutions Architect at AWS who works with public sector, federal financial, and healthcare organizations. Blayze is based out of Pittsburgh. He is passionate about breaking down complex situations into something practical and actionable. His interests include artificial intelligence, distributed systems, and Excel formula gymnastics. Blayze holds a B.S.B.A. in Accounting and B.S. in Information Systems from Clarion University of Pennsylvania. In his spare time, you can find Blayze listening to Star Wars audiobooks, trying to make his dogs laugh, and probably talking on mute.

Daniel Tatarkin is a Solutions Architect at Amazon Web Services (AWS) supporting Federal Financial organizations. He is passionate about big data analytics and serverless technologies. Outside of work, he enjoys learning about personal finance, coffee, and trying out new programming languages for fun.

Noise