All posts by Blayze Stefaniak

Upgrade to Athena engine version 3 to increase query performance and access more analytics features

Post Syndicated from Blayze Stefaniak original https://aws.amazon.com/blogs/big-data/upgrade-to-athena-engine-version-3-to-increase-query-performance-and-access-more-analytics-features/

Customers tell us they want to have stronger performance and lower costs for their data analytics applications and workloads. Customers also want to use AWS as a platform that hosts managed versions of their favorite open-source projects, which will frequently adopt the latest features from the open-source communities. With Amazon Athena engine version 3, we continue to increase performance, provide new features and now deliver better currency with the Trino and Presto projects.

Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Customers such as Orca Security, the Agentless Cloud Security Platform, are already realizing the benefits of using Athena engine version 3 with the Apache Iceberg table format.

“At Orca Security, we are excited about the launch of Athena engine version 3,” says Arie Teter, VP R&D at Orca Security. “With Athena engine version 3, we will be able to query our massive petabyte-scale data lake more efficiently and at a lower cost. We are especially excited about being able to leverage all the latest Trino features with Athena’s new engine in order to deliver our customers the best-of-breed, ML-driven anomaly detection solution.”

In this post, we discuss the benefits of Athena engine version 3, performance benchmark results for different table formats, and information about upgrading to engine version 3.

New features, more often

One of the most exciting aspects of engine version 3 is its new continuous integration approach to open source software management that will improve currency with the Trino and PrestoDB projects. This approach enables Athena to deliver increased performance and new features at an even faster pace.

At AWS, we are committed to bringing the value of open source to our customers and providing contributions to open source communities. The Athena development team is actively contributing bug fixes and security, scalability, performance, and feature enhancements back to these open-source code bases, so anyone using Trino, PrestoDB and Apache Iceberg can benefit from the team’s contributions. For more information on AWS’s commitment to the open-source community, refer to Open source at AWS.

Athena engine version 3 incorporates over 50 new SQL functions and 30 new features from the open-source Trino project. For example, Athena engine version 3 supports T-Digest functions that can be used to approximate rank-based statistics with high accuracy, new geospatial functions to run optimized geospatial queries, and new query syntax such as MATCH_RECOGNIZE for identifying data patterns in applications such as fraud detection and sensor data analysis.
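
To see what one of these looks like in practice, the following is a minimal sketch that submits a MATCH_RECOGNIZE query with the AWS SDK for Python (Boto3). The sensor_readings table, its columns, and the workgroup name are illustrative assumptions rather than objects defined in this post; the pattern flags runs of strictly increasing temperature readings per device.

import boto3

athena = boto3.client('athena')

# Hypothetical sensor-analysis query: for each device, match runs of
# strictly increasing temperature readings, reporting when each run
# started and the temperature at its peak.
query = """
SELECT device_id, start_ts, peak_temp
FROM sensor_readings
  MATCH_RECOGNIZE (
    PARTITION BY device_id
    ORDER BY reading_ts
    MEASURES
      FIRST(A.reading_ts) AS start_ts,
      LAST(UP.temperature) AS peak_temp
    PATTERN (A UP+)
    DEFINE UP AS temperature > PREV(temperature)
  )
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'default'},
    WorkGroup='myWorkGroup'  # a workgroup running engine version 3
)
print(response['QueryExecutionId'])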

Athena engine version 3 also gives you more AWS-specific features. For example, we have worked closely with the AWS Glue Data Catalog team to improve Athena’s metadata retrieval time, which we explain in the section “Faster query planning with AWS Glue Data Catalog” below.

For more information about what’s new in Athena engine version 3, refer to the Athena engine version 3 Improvements and new features.

Faster runtime, lower cost

Last year, we shared benchmark testing on Athena engine version 2 using TPC-DS benchmark queries at 3 TB scale and observed that query performance improved by three times and cost decreased by 70% as a result of reduced scanned data. These improvements have been a combination of enhancements developed by Athena and AWS engineering teams as well as contributions from the PrestoDB and Trino open-source communities.

The new engine version 3 allows Athena to continue delivering performance improvements at a rapid pace. We performed benchmark testing on engine version 3 using TPC-DS benchmark queries at 3 TB scale, and observed a 20% query performance improvement compared to the latest release of engine version 2. Athena engine version 3 includes performance improvements across operators, clauses, and decoders, such as faster joins involving comparisons with the <, <=, >, and >= operators, faster queries that contain JOIN, UNION, UNNEST, and GROUP BY clauses, and faster queries that use an IN predicate with a short list of constants. Athena engine version 3 also provides query execution improvements that reduce the amount of data scanned, which gives you additional performance gains. With Athena, you are charged based on the amount of data scanned by each query, so this also translates to lower costs. For more information, refer to Amazon Athena pricing.

Faster query planning with AWS Glue Data Catalog

Athena engine version 3 provides better integration with the AWS Glue Data Catalog to improve query planning performance by up to ten times. Query planning is the process of listing the instructions the query engine will follow to run a query. During query planning, Athena uses the AWS Glue API to retrieve information such as table and partition metadata and column statistics. As the number of tables increases, the number of calls to the AWS Glue API for metadata also increases, which results in additional query latency. In engine version 3, we reduced this AWS Glue API overhead and thereby brought down the overall query planning time. For smaller datasets and datasets with a large number of tables, the total runtime is reduced significantly, because query planning time is a higher percentage of the total runtime.

Figure 1 below charts the top 10 queries from the TPC-DS benchmark with the most performance improvement from engine version 2 to engine version 3 based on the Amazon CloudWatch metric for total runtime. Each query involves joining multiple tables with complex predicates.

Faster query runtime with Apache Iceberg integration

Athena engine version 3 provides better integration with the Apache Iceberg table format. Features such as Iceberg’s hidden partitioning now augment Athena optimizations such as partition pruning and dynamic filtering to reduce the data scanned and improve query performance. You don’t need to maintain partition columns or even understand the physical table layout to load data into a table and achieve good query performance.
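
As a minimal sketch of hidden partitioning, the following creates a hypothetical Iceberg table partitioned by day(event_ts); the database, table, columns, bucket, and workgroup names are assumptions for illustration. Queries that filter on event_ts are pruned to the matching daily partitions even though no separate partition column exists.

import boto3

athena = boto3.client('athena')

# Hypothetical Iceberg table with a hidden daily partition derived from
# the event_ts column. Writers insert plain timestamps; Iceberg maintains
# the day(event_ts) partitioning behind the scenes.
ddl = """
CREATE TABLE my_database.events (
  event_id string,
  event_ts timestamp,
  payload  string
)
PARTITIONED BY (day(event_ts))
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=ddl,
    WorkGroup='myWorkGroup'
)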

We performed TPC-DS benchmark testing by loading data into the Apache Iceberg table format, with hidden partitions configured, and compared the performance between Athena engine version 2 and 3. Figure 2 below is a chart of the top 10 query improvements, which all include complex predicates. The top query, query 52, has five WHERE predicates and two GROUP BY operations. Compared to engine version 2, the query runs thirteen times faster with sixteen times less data scanned on engine version 3.

Upgrading to Athena engine version 3

To use Athena engine version 3, you can create a new workgroup, or configure an existing workgroup, and select the recommended Athena engine version 3. Any Athena workgroup can upgrade from engine version 2 to engine version 3 without interruption in your ability to submit queries. For more information and instructions for changing your Athena engine version, refer to Changing Athena engine versions.
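
If you manage workgroups programmatically, the following is a minimal sketch of selecting engine version 3 for an existing workgroup with the AWS SDK for Python (Boto3); the workgroup name is an assumption.

import boto3

athena = boto3.client('athena')

# Pin an existing workgroup to engine version 3. Setting
# SelectedEngineVersion to 'AUTO' instead lets Athena choose when to
# upgrade the workgroup.
athena.update_work_group(
    WorkGroup='myWorkGroup',  # hypothetical workgroup name
    ConfigurationUpdates={
        'EngineVersion': {
            'SelectedEngineVersion': 'Athena engine version 3'
        }
    }
)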

Athena engine version 3 has feature parity with all major features from Athena engine version 2. No changes are required on your part to use features like dynamic partition pruning, the Apache Iceberg and Apache Hudi table formats, the AWS Lake Formation governed tables integration, and Athena Federated Query in engine version 3. For more information on Athena features, refer to Amazon Athena features and the Amazon Athena User Guide.

Athena engine version 3 includes additional improvements to support ANSI SQL compliance. This results in some changes to syntax, data processing, and timestamps that may cause errors when running the same queries in the new engine version. For information about error messages, causes, and suggested solutions, refer to the Athena engine version 3 Limitations, Breaking changes, Data processing changes, and Timestamp changes.

To make sure that your Athena engine version upgrade goes smoothly, we recommend the following practices. After you have confirmed that your queries behave as you expect, you can safely upgrade your existing Athena workgroups.

  • Review the Athena engine version 3 Limitations and Breaking changes and update any affected queries.
  • Test in pre-production to validate and qualify your queries against Athena engine version 3 by creating a test workgroup or upgrading an existing pre-production environment. For example, you can create a new test workgroup running engine version 3 to run integration tests from your pre-production or staging environment, and monitor for failures or performance regressions. For information about CloudWatch metrics and dimensions published by Athena, refer to Monitoring Athena queries with CloudWatch metrics.
  • Upgrade each query based on metrics to test your queries against an Athena engine version 3 workgroup. For example, you can create a new workgroup with engine version 3 alongside your existing engine version 2 workgroup. You can send a small percentage of queries to the engine version 3 workgroup, monitor for failures or performance regressions, then increase the number of queries if they’re successful and performant. Repeat until all your queries have been migrated to Athena engine version 3.
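
The following is a minimal sketch of that traffic-shifting approach, assuming hypothetical workgroup names (analytics-v2 and analytics-v3): a small, configurable fraction of queries is routed to the engine version 3 workgroup, and the rest continue to run on engine version 2.

import random

import boto3

athena = boto3.client('athena')

V3_TRAFFIC_FRACTION = 0.05  # start small, then increase as queries succeed


def submit_query(query_string):
    # Route a fraction of queries to the engine version 3 workgroup and
    # monitor for failures or performance regressions before increasing it.
    if random.random() < V3_TRAFFIC_FRACTION:
        workgroup = 'analytics-v3'
    else:
        workgroup = 'analytics-v2'
    response = athena.start_query_execution(
        QueryString=query_string,
        QueryExecutionContext={'Database': 'default'},
        WorkGroup=workgroup
    )
    return response['QueryExecutionId']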

With our simplified automatic engine upgrade process, you can configure existing workgroups to be automatically upgraded to engine version 3 without requiring manual review or intervention. The upgrade behavior is as follows:

  • If Query engine version is set to Automatic, your workgroup will remain on engine version 2 pending the automatic upgrade, and Athena will choose when to upgrade the workgroup to engine version 3. Before upgrading a workgroup, we perform a set of validation tests to confirm that its queries perform correctly and efficiently on engine version 3. Because our validation is performed on a best effort basis, we recommend you perform your own validation testing to ensure all queries run as expected.
  • If Query engine version is set to Manual, you will have the ability to select your version. The default choice is set to engine version 3, with the ability to toggle to engine version 2.

Conclusion

This post discussed Athena engine version 3 benefits, performance benchmark results, and how you can start using engine version 3 today with minimal work required. You can get started with Athena engine version 3 by using the Athena console, the AWS CLI, or the AWS SDK. To learn more about Athena, refer to the Amazon Athena User Guide.

Thanks for reading this post! If you have questions on Athena engine version 3, don’t hesitate to leave a comment in the comments section.


About the authors

Blayze Stefaniak is a Senior Solutions Architect for the Technical Strategist Program supporting Executive Customer Programs in AWS Marketing. He has experience working across industries including healthcare, automotive, and public sector. He is passionate about breaking down complex situations into something practical and actionable. In his spare time, you can find Blayze listening to Star Wars audiobooks, trying to make his dogs laugh, and probably talking on mute.

Daniel Chen is a Senior Product Manager at Amazon Web Services (AWS) Athena. He has experience in Banking and Capital Market of financial service industry and works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his spare time, he loves playing tennis and ping pong.

Theo Tolv is a Senior Big Data Architect in the Athena team. He’s worked with small and big data for most of his career and often hangs out on Stack Overflow answering questions about Athena.

Jack Ye is a software engineer of the Athena Data Lake and Storage team. He is an Apache Iceberg Committer and PMC member.

Use Amazon Athena parameterized queries to provide data as a service

Post Syndicated from Blayze Stefaniak original https://aws.amazon.com/blogs/big-data/use-amazon-athena-parameterized-queries-to-provide-data-as-a-service/

Amazon Athena now provides you more flexibility to use parameterized queries for any query you send to Athena, and we recommend them as the best practice for your Athena queries moving forward so you benefit from the security, reusability, and simplicity they offer. In a previous post, Improve reusability and security using Amazon Athena parameterized queries, we explained how parameterized queries with prepared statements provide reusability of queries, protection against SQL injection, and masking of query strings from AWS CloudTrail events. In this post, we explain how you can run Athena parameterized queries using the ExecutionParameters property in your StartQueryExecution requests. We provide a sample application you can reference for using parameterized queries, with and without prepared statements. Athena parameterized queries can be integrated into many data-driven applications, and we walk you through a sample data as a service application to see how parameterized queries can plug in.

Customers tell us they are finding new ways to make effective use of their data assets by providing data as a service (DaaS). In this post, we share a sample architecture using parameterized queries applied in the form of a DaaS application. This is helpful for many types of organizations, whether you’re an enterprise making data available to other lines of business, a regulator making reports available to your industry, a company monetizing your data assets, an independent software vendor (ISV) enabling your applications’ tenants to query their data when they need it, or an organization sharing data at scale in other ways. In DaaS applications, you can provide predefined queries to run against your governed datasets with values your users input. You can expand your DaaS application to break away from monolithic data infrastructure by treating data as a product (DaaP) and providing a distribution of datasets, each with distinct domain-specific data pipelines. You can authorize these datasets to consumers in your DaaS application permissions. You can use Athena parameterized queries to predefine the queries that run across your datasets, and they serve as a layer of protection for your DaaS applications. This post first describes how parameterized queries work, then applies them in the form of a DaaS application.

Feature overview

In any query you send to Athena, you can use positional parameters declared by a question mark (?) in your query string, then declare values as execution parameters sequentially in your StartQueryExecution request. You can use execution parameters with your existing prepared statements and also with any SQL queries in Athena. You can still take advantage of the reusability and security benefits of parameterized queries, and using execution parameters also masks your query’s parameters when viewing recent queries in Athena. You can also change from building SQL query strings manually to using execution parameters; this allows you to run parameterized queries without needing to first create prepared statements. For more information on using execution parameters, refer to StartQueryExecution.

Previously, you could only run parameterized queries by first creating prepared statements in your Athena workgroup, then running parameterized queries while passing variables into an EXECUTE SQL statement with the USING clause. You are no longer required to create and maintain prepared statements across all of your Athena workgroups to take advantage of parameterization. This is helpful if you run the same queries across multiple workgroups or otherwise do not need the prepared statements feature.

You can continue to use Athena workgroups to isolate workloads, implement individual cost constraints, and track query-related metrics for tenants within your multi-tenant application. For example, your DaaS application’s customers can run the same queries against your dataset with separate workgroups. For more information on Athena workgroups, refer to Using workgroups for running queries.
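
As a minimal sketch of that isolation, the following creates a workgroup per tenant with its own results location and a per-query scan limit; the names, bucket, and 10 GB cutoff are illustrative assumptions.

import boto3

athena = boto3.client('athena')


def create_tenant_workgroup(tenant_id):
    # One workgroup per tenant: separate query history, CloudWatch
    # metrics, results location, and a scan cutoff that bounds the cost
    # of any single query.
    athena.create_work_group(
        Name=f'daas-tenant-{tenant_id}',
        Configuration={
            'ResultConfiguration': {
                'OutputLocation': f's3://my-results-bucket/{tenant_id}/'
            },
            'BytesScannedCutoffPerQuery': 10_000_000_000,  # ~10 GB
            'PublishCloudWatchMetricsEnabled': True
        }
    )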

Changing your code to use parameterized queries

Changing your existing code to use parameterized queries is a small change with an immediate positive impact. Previously, you were required to build your query string manually, using environment variables as parameter placeholders. Manipulating the query string can be burdensome and carries an inherent risk of injecting undesired values or SQL fragments (such as SQL operators), regardless of intent. You can now replace variables in your query string with a question mark (?) and declare your variable values sequentially with the ExecutionParameters option. By doing so, you take advantage of the security benefits of parameterized queries, and your queries are less complicated to author and maintain. The syntax change is shown in the following code, using the AWS Command Line Interface (AWS CLI) as an example.

Previously, running queries against Athena without execution parameters:

aws athena start-query-execution \
--query-string "SELECT * FROM table WHERE x = $ARG1 AND y = $ARG2 AND z = $ARG3" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Now, running parameterized queries against Athena with execution parameters:

aws athena start-query-execution \
--query-string "SELECT * FROM table WHERE x = ? AND y = ? AND z = ?" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup \
--execution-parameters $ARG1 $ARG2 $ARG3

The following is an example of a command that creates a prepared statement in your Athena workgroup. To learn more about creating prepared statements, refer to Querying with prepared statements.

aws athena start-query-execution \
--query-string "PREPARE my-prepared-statement FROM SELECT * FROM table WHERE x = ? AND y = ? AND z = ?" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Previously, running parameterized queries against prepared statements without execution parameters:

aws athena start-query-execution \
--query-string "EXECUTE my-prepared-statement USING $ARG1, $ARG2, $ARG3“ \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Now, running parameterized queries against prepared statements with execution parameters:

aws athena start-query-execution \
--query-string "EXECUTE my-prepared-statement" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup \
--execution-parameters $ARG1 $ARG2 $ARG3

Sample architecture

The purpose of this sample architecture is to apply the ExecutionParameters feature when running Athena queries, with and without prepared statements. This is not intended to be a DaaS solution for use with your production data.

This sample architecture exhibits a DaaS application with a user interface (UI) that presents three Athena parameterized queries written against the public Amazon.com customer reviews dataset. The following figure depicts this workflow when a user submits a query to Athena. This example uses AWS Amplify to host a front-end application. The application calls an Amazon API Gateway HTTP API, which invokes AWS Lambda functions to authenticate requests, fetch the Athena prepared statements and named queries, and run the parameterized queries against Athena. The Lambda function uses the name of the Athena workgroup, the statement name, the statement type (prepared statement or not), and a list of query parameters input by the user. Athena queries data in an Amazon Simple Storage Service (Amazon S3) bucket, which is cataloged in AWS Glue, and presents results to the user on the DaaS application UI.

Diagram showing the process of using a sample DaaS web application. Web Application Users use an Amplify application to run Athena parameterized queries. The application sends HTTP requests to API Gateway. API Gateway authenticates incoming requests with a Lambda function. API Gateway processes the request to start the query against Athena. Athena uses Glue Data Catalog and queries data from an S3 bucket. The query results are stored in an S3 bucket, and presented to the Web Application Users.

End-users of the DaaS application UI can run only parameterized queries against Athena. The DaaS application UI demonstrates two ways to run parameterized queries with execution parameters: with and without prepared statements. In both cases, the Lambda function submits the query, waits for the query to complete, and provides the results that match the query parameters. The following figure depicts the DaaS application UI.

Screenshot of the application divided into two sections, one for querying prepared statements another without prepared statements. Both sections include a Workgroup name selector, statement selector, statement description, statement SQL query string, input fields to enter parameter arguments, and a button to launch the query. Selected on the screenshot is the sample workgroup created by the CloudFormation template, and a count of reviews in a given product category sample query statement. User entered ‘Video_Games’ as the product category.

You may want your users to be able to list all Athena prepared statements within your Athena workgroup, select a statement, input arguments, and run the query; on the left side of the DaaS application UI, you use an EXECUTE statement to query the data lake with an Athena prepared statement. Alternatively, you may have several reporting queries maintained in your code base, in which case your users select a statement, input arguments, and run the query; on the right side of the DaaS application UI, you use a SELECT statement to run Athena parameterized queries without prepared statements.

Prerequisites

This post uses the following AWS services to demonstrate a DaaS architecture pattern that uses Athena to query the Amazon.com customer reviews dataset:

This post assumes you have the following:

Deploy the CloudFormation stack

In this section, you deploy a CloudFormation template that creates the following resources:

  • AWS Glue Data Catalog database
  • AWS Glue Data Catalog table
  • An Athena workgroup
  • Three Athena prepared statements
  • Three Athena named queries
  • The API Gateway HTTP API
  • The Lambda execution role for Athena queries
  • The Lambda execution role for API Gateway HTTP API authorization
  • Five Lambda functions:
    • Update the AWS Glue Data Catalog
    • Authorize API Gateway requests
    • Submit Athena queries
    • List Athena prepared statements
    • List Athena named queries

Note that this CloudFormation template was tested in the AWS Regions ap-southeast-2, ca-central-1, eu-west-2, us-east-1, us-east-2, and us-west-2, and that deploying it into your AWS account will incur cost. Steps for cleaning up the resources are included later in this post.

To deploy the CloudFormation stack, follow these steps:

  1. Navigate to this post’s GitHub repository.
  2. Clone the repository or copy the CloudFormation template athena-parameterized-queries.yaml.
  3. On the AWS CloudFormation console, choose Create stack.
  4. Select Upload a template file and choose Choose file.
  5. Upload athena-parameterized-queries.yaml, then choose Next.
  6. On the Specify stack details page, enter the stack name athena-parameterized-queries.
  7. On the same page, there are two parameters:
    1. For S3QueryResultsBucketName, enter the name of an S3 bucket in your AWS account, in the same AWS Region where you’re running your CloudFormation stack. (For this post, we use a bucket name like my-bucket.)
    2. For APIPassphrase, enter a passphrase to authenticate API requests. You use this later.
  8. Choose Next.
  9. On the Configure stack options page, choose Next.
  10. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and choose Create stack.

The script takes less than two minutes to run and change to a CREATE_COMPLETE state. If you deploy the stack twice in the same AWS account and Region, some resources may already exist, and the process fails with a message indicating the resource already exists in another template.

  11. On the Outputs tab, copy the APIEndpoint value to use later.

For least-privilege authorization for deployment of the CloudFormation template, you can create an AWS CloudFormation service role with the following IAM policy actions. To do this, you must create an IAM policy and IAM role, and choose this role when configuring stack options. You need to replace the values for ${Partition}, ${AccountId}, and ${Region} with your own values; for more information on these values, refer to Pseudo parameters reference.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IAM",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:UntagRole",
                "iam:TagRole",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:PassRole",
                "iam:GetRolePolicy",
                "iam:PutRolePolicy",
                "iam:AttachRolePolicy",
                "iam:TagPolicy",
                "iam:DeleteRolePolicy",
                "iam:DetachRolePolicy",
                "iam:UntagPolicy"
            ],
            "Resource": [
                "arn:${Partition}:iam::${AccountId}:role/LambdaAthenaExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/service-role/LambdaAthenaExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/service-role/LambdaAuthorizerExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/LambdaAuthorizerExecutionRole-athena-parameterized-queries"
            ]
        },
        {
            "Sid": "LAMBDA",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:GetFunction",
                "lambda:InvokeFunction",
                "lambda:AddPermission",
                "lambda:DeleteFunction",
                "lambda:RemovePermission",
                "lambda:UpdateFunctionConfiguration"
            ],
            "Resource": [
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaRepairFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAthenaFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAuthorizerFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:GetPrepStatements-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:GetNamedQueries-athena-parameterized-queries"
            ]
        },
        {
            "Sid": "ATHENA",
            "Effect": "Allow",
            "Action": [
                "athena:GetWorkGroup",
                "athena:CreateWorkGroup",
                "athena:DeleteWorkGroup",
                "athena:DeleteNamedQuery",
                "athena:CreateNamedQuery",
                "athena:CreatePreparedStatement",
                "athena:DeletePreparedStatement",
                "athena:GetPreparedStatement"
            ],
            "Resource": [
                "arn:${Partition}:athena:${Region}:${AccountId}:workgroup/ParameterizedStatementsWG"
            ]
        },
        {
            "Sid": "GLUE",
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:CreateTable",
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:${Partition}:glue:${Region}:${AccountId}:catalog",
                "arn:${Partition}:glue:${Region}:${AccountId}:database/athena_prepared_statements",
                "arn:${Partition}:glue:${Region}:${AccountId}:table/athena_prepared_statements/*",
                "arn:${Partition}:glue:${Region}:${AccountId}:userDefinedFunction/athena_prepared_statements/*"
            ]
        },
        {
            "Sid": "APIGATEWAY",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:TagResource",
                "apigateway:UntagResource"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*/integrations*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/stages*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/authorizers*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/routes*",
                "arn:${Partition}:apigateway:${Region}::/tags/arn%3Aaws%3Aapigateway%3A${Region}%3A%3A%2Fv2%2Fapis%2F*"
            ]
        },
        {
            "Sid": "APIGATEWAYMANAGEAPI",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis"
            ],
            "Condition": {
                "StringEquals": {
                    "apigateway:Request/ApiName": "AthenaAPI-athena-parameterized-queries"
                }
            }
        },
        {
            "Sid": "APIGATEWAYMANAGEAPI2",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*"
            ],
            "Condition": {
                "StringEquals": {
                    "apigateway:Resource/ApiName": "AthenaAPI-athena-parameterized-queries"
                }
            }
        },
        {
            "Sid": "APIGATEWAYGET",
            "Effect": "Allow",
            "Action": [
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*"
            ]
        },
        {
            "Sid": "LAMBDALAYER",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:${Partition}:lambda:*:280475519630:layer:boto3-1_24*"
            ]
        }
    ]
}

After you create the CloudFormation stack, you use the AWS Management Console to deploy an Amplify application and view the Lambda functions. The following is the scoped-down IAM policy that you can attach to an IAM user or role to perform these operations:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmplifyCreateApp",
            "Effect": "Allow",
            "Action": [
                "amplify:CreateBranch",
                "amplify:StartDeployment",
                "amplify:CreateDeployment",
                "amplify:CreateApp",
                "amplify:StartJob"
            ],
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "AmplifyList",
            "Effect": "Allow",
            "Action": "amplify:List*",
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "AmplifyGet",
            "Effect": "Allow",
            "Action": "amplify:GetJob",
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "LambdaList",
            "Effect": "Allow",
            "Action": [
                "lambda:GetAccountSettings",
                "lambda:ListFunctions"
            ],
            "Resource": "*"
        },
        {
            "Sid": "LambdaFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:GetFunction"
            ],
            "Resource": "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAthenaFunction-athena-parameterized-queries"
        }
    ]
}

Note that you need the following IAM policy when deploying your Amplify application to set a global password, and when cleaning up your resources to delete the Amplify application. Remember to replace ${AppARN} with the ARN of the Amplify application. You can find the ARN after creating the Amplify app on the General tab in the App Settings section of the Amplify console.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UpdateAndDeleteAmplifyApp",
            "Effect": "Allow",
            "Action": [
                "amplify:DeleteApp",
                "amplify:UpdateApp"
            ],
            "Resource": "${AppARN}"
        }
    ]
}

Deploy the Amplify application

In this section, you deploy your Amplify application.

  1. In the cloned repository, open web-application/.env in a text editor.
  2. Set AWS_API_ENDPOINT to the APIEndpoint value from the CloudFormation stack’s Outputs tab. For example: AWS_API_ENDPOINT="https://123456abcd.execute-api.your-region.amazonaws.com".
  3. Set API_AUTH_CODE to the value you entered for the CloudFormation stack’s APIPassphrase parameter. For example: API_AUTH_CODE="YOUR_PASSPHRASE".
  4. Navigate to the web-application/ directory and run npm install.
  5. Run npm run build to compile distribution assets.
  6. On the Amplify console, choose All apps.
  7. Choose New app.
  8. Select Host web app, select Deploy without Git provider, then choose Continue.
  9. For App name, enter Athena Parameterized Queries App.
  10. For Environment name, you don’t need to enter a value.
  11. Select Drag and Drop.
  12. Locate the dist/ directory inside web-application/, drag it into the window, and drop it. Ensure you drag the entire directory, not the files within it.

Screenshot depicting dragging and dropping the dist/ directory into Amplify's drag-and-drop area for the Athena Parameterized Queries App.
  13. Choose Save and deploy to deploy the web application on Amplify.

This step takes less than a minute to complete.

  14. Under App settings, choose Access control, then choose Manage access.
  15. Select Apply a global password, then enter values for Username and Password.

You use these credentials to access your Amplify application.

Access your Amplify application and run queries

In this section, you use the Amplify application to run Athena parameterized queries against the Amazon.com customer reviews dataset. The left side of the application shows how you can run parameterized queries using Athena prepared statements. The right side of the application shows how you can run parameterized queries without prepared statements, such as if the queries are written in your code. The sample in this post uses named queries within the Athena workgroup. For more information about named queries, refer to NamedQuery.

  1. Open the Amplify web application link located under Domain. For example: https://dev123.abcd12345xyz.amplifyapp.com/.
  2. In the Sign in prompt, enter the user name and password you provided as the Amplify application global password.
  3. For Workgroup Name, choose the ParameterizedStatementsWG workgroup.
  4. Choose a statement example on the Prepared Statement or SQL Statement drop-down menu.

Selecting a statement displays a description of the query, including examples of parameters you can try with this statement, and the original SQL query string. SQL parameters of type string must be surrounded by single quotes, for example: 'your_string_value'.

  5. Enter your query parameters.

The following figure shows an example of the parameters to input for the product_helpful_reviews prepared statement.

Screenshot of the Athena prepared statements window in the DaaS application. The sample workgroup created by the CloudFormation template is selected. A sample query is selected, which retrieves customer reviews for a given product id based on the review's star rating and count of helpful votes. The user entered ‘BT00DDVMVQ’ as the product id value, 4 as the star rating value, and 10 as the value for minimum count of helpful votes.

  6. Choose Run Query to send the query request to the API endpoint.

After the query runs, the sample application presents the results in a table format, as depicted in the following screenshot. This is one of many ways to present results, and your application can display results in the format that makes the most sense for your users. The complete query workflow is depicted in the previous architecture diagram.

Screenshot of the sample application's query results rendered in a table format. The table has columns for product_id, product_title, star_rating, helpful_votes, review_headline, and review_body. The query returned two results, which are 4 star reviews for the Amazon Smile eGift Card.

Using execution parameters with the AWS SDK for Python (Boto3)

In this section, you inspect the Lambda function code for using the StartQueryExecution API with and without prepared statements.

  1. On the Lambda console, choose Functions.
  2. Navigate to the LambdaAthenaFunction-athena-parameterized-queries function.
  3. Choose the Code Source window.

Examples of passing parameters to the Athena StartQueryExecution API using the AWS SDK for Python (Boto3) begin on lines 39 and 49. Note the ExecutionParameters option on lines 45 and 55.

The following code uses execution parameters with Athena prepared statements:

response = athena.start_query_execution(
    QueryString=f'EXECUTE {statement}', # Example: "EXECUTE prepared_statement_name"
    WorkGroup=workgroup,
    QueryExecutionContext={
        'Database': 'athena_prepared_statements'
    },
    ExecutionParameters=input_parameters
)

The following code uses execution parameters without Athena prepared statements:

response = athena.start_query_execution(
    QueryString=statement, # Example: "SELECT * FROM TABLE WHERE parameter_name = ?"
    WorkGroup=workgroup,
    QueryExecutionContext={
        'Database': 'athena_prepared_statements'
    },
    ExecutionParameters=input_parameters
)
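
In both cases, the function then waits for the query to finish and returns the matching results. A minimal sketch of that polling step, assuming the athena client and response object from the calls above, might look like the following:

import time

query_id = response['QueryExecutionId']

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Page through the result set; each cell arrives as a VarCharValue string.
if state == 'SUCCEEDED':
    paginator = athena.get_paginator('get_query_results')
    for page in paginator.paginate(QueryExecutionId=query_id):
        for row in page['ResultSet']['Rows']:
            print([col.get('VarCharValue') for col in row['Data']])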

Clean up

In this post, you created several components, which generate cost. To avoid incurring future charges, remove the resources with the following steps:

  1. Delete the S3 bucket’s results prefix created after you ran a query on your workgroup.

With the default template, the prefix is named <S3QueryResultsBucketName>/athena-results. Use caution in this step. Unless you are using versioning on your S3 bucket, deleting S3 objects can’t be undone.

  2. On the Amplify console, select the app to delete and on the Actions menu, choose Delete app, then confirm.
  3. On the AWS CloudFormation console, select the stack to delete, choose Delete, and confirm.

Conclusion

In this post, we showed how you can build a DaaS application using Athena parameterized queries. The StartQueryExecution API in Athena now supports execution parameters, which allows you to run any Athena query as a parameterized query. You can decouple your execution parameters from your query strings, and use parameterized queries without being limited to the Athena workgroups where you have created prepared statements. You can take advantage of the security benefits Athena offers with parameterized queries, and developers no longer need to build query strings manually. In this post, you learned how to use execution parameters, and you deployed a DaaS reference architecture to see how parameterized queries can be applied.

You can get started with Athena parameterized queries by using the Athena console, the AWS CLI, or the AWS SDK. To learn more about Athena, refer to the Amazon Athena User Guide.

Thanks for reading this post! If you have questions about Athena prepared statements and parameterized queries, don’t hesitate to leave a comment.


About the Authors

Blayze Stefaniak is a Senior Solutions Architect for the Technical Strategist Program supporting Executive Customer Programs in AWS Marketing. He has experience working across industries including healthcare, automotive, and public sector. He is passionate about breaking down complex situations into something practical and actionable. In his spare time, you can find Blayze listening to Star Wars audiobooks, trying to make his dogs laugh, and probably talking on mute.

Daniel Tatarkin is a Solutions Architect at Amazon Web Services (AWS) supporting Federal Financial organizations. He is passionate about big data analytics and serverless technologies. Outside of work, he enjoys learning about personal finance, coffee, and trying out new programming languages for fun.

Matt Boyd is a Senior Solutions Architect at AWS working with federal financial organizations. He is passionate about effective cloud management and governance, as well as data governance strategies. When he’s not working, he enjoys running, weight lifting, and teaching his elementary-age son ethical hacking skills.

Improve reusability and security using Amazon Athena parameterized queries

Post Syndicated from Blayze Stefaniak original https://aws.amazon.com/blogs/big-data/improve-reusability-and-security-using-amazon-athena-parameterized-queries/

Amazon Athena is a serverless interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL, and you only pay for the amount of data scanned by your queries. If you use SQL to analyze your business on a daily basis, you may find yourself repeatedly running the same queries, or similar queries with minor adjustments. Athena parameterized queries enable you to prepare statements you can reuse with different argument values you provide at run time. Athena parameterized queries also provide a layer of security against SQL injection attacks, and mask the query string in AWS CloudTrail for workloads with sensitive data.

This post shows you how to create and run parameterized queries in Athena. This post provides an example of how Athena parameterized queries protect against SQL injection, and shows the CloudTrail events with the masked query string. Lastly, the post reviews functions related to managing Athena prepared statements. If you want to follow along, this post provides steps to set up the components with a sample dataset; alternatively, you can use your own dataset.

Reusability

Athena prepared statements allow you to run and reuse queries within your Athena workgroup. By decoupling the queries from the code, you can update your prepared statements and your applications independently of one another. If a data lake has schema updates, it could require query updates. If multiple applications share the same Athena workgroup and use similar queries, you can create a new query or update the existing query to serve multiple use cases, without each application having to adjust similar queries in its own source code. Parameterized queries are currently supported for SELECT, INSERT INTO, CTAS, and UNLOAD statements. For the most current list, refer to Considerations and Limitations in Querying with Prepared Statements.
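
As a minimal sketch of this decoupling, with hypothetical statement, table, and column names, the following updates a prepared statement in place using the UpdatePreparedStatement API; every application that runs EXECUTE daily_sales_report then picks up the new query without a code deployment.

import boto3

athena = boto3.client('athena')

# Hypothetical example: a schema update renamed a column, so the shared
# prepared statement is updated once instead of changing every caller.
athena.update_prepared_statement(
    StatementName='daily_sales_report',
    WorkGroup='PreparedStatementsWG',
    QueryStatement=(
        'SELECT order_date, SUM(order_total) AS revenue '
        'FROM sales '
        'WHERE region = ? AND order_date BETWEEN ? AND ? '
        'GROUP BY order_date'
    )
)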

Security

Athena prepared statements provide a layer of protection against SQL injection. If you are using Athena behind an application interface, free-text inputs inherently present a SQL injection threat vector which, if left unmitigated, could result in data exfiltration. When the parameterized query is run, Athena interprets the arguments as literal values, not as executable commands or SQL fragments such as SQL operators.

When using Athena, CloudTrail captures all Athena API calls as audit events to provide a record of actions taken by an AWS user, role, or AWS service. Customers with sensitive data in their data lakes, such as personally identifiable information (PII), have told us they don’t want query strings in their CloudTrail event history for compliance reasons. When you run parameterized queries, the query string is masked with HIDDEN_DUE_TO_SECURITY_REASONS in the CloudTrail event, so protected data isn’t displayed in your log streams.

Solution overview

This post documents the steps using the public Amazon.com customer reviews dataset; however, you can follow similar steps to use your own dataset.

The example query finds a product’s 4-star (out of 5 stars) reviews voted as the most helpful by other customers. The intent behind the query is to surface constructive product feedback that can be validated and incorporated into the product roadmap. The product used in this use case is the Amazon Smile eGift Card.

Prerequisites

As a prerequisite, you need a foundational understanding of SQL syntax, as well as a foundational understanding of the following AWS services:

This post assumes you have:

Deploy resources for the example dataset

If you’re using the example dataset, follow the steps in this section. The data is in an S3 bucket in an AWS-managed AWS account. You need to create Athena and AWS Glue resources to get started.

This post provides a CloudFormation template that deploys the following resources in your AWS account:

  • AthenaWorkGroup – An Athena workgroup for your dataset and prepared statements. On the console, this workgroup is named PreparedStatementsWG.
  • GlueDatabase – A database in the AWS Glue Data Catalog for table metadata. The database is named athena_prepared_statements.
  • GlueTableAmazonReviews – An external table with Amazon.com customer reviews in the Data Catalog.

The following diagram shows how the resources interact when the query runs.
Diagram depicting a customer's AWS account and an AWS managed AWS account. In the customer account, there is a region box. In the region, there is an Amazon Athena workgroup taking 3 steps. In the first step, the workgroup accesses metadata from the AWS Glue Data Catalog named default. The catalog has a dotted line to an AWS Glue table called amazon_reviews_parquet, which has the attributes and S3 bucket location. The second step from the workgroup queries data from the S3 bucket. The S3 bucket is in the AWS managed AWS account. The bucket is for the Amazon Customer Reviews dataset. In the third step, the workgroup stores the query results in the Amazon S3 bucket in the customer AWS account. The query results can then be read by users with read access to the Athena workgroup.

To deploy the CloudFormation template, follow these steps:

  1. Navigate to this post’s GitHub repository.
  2. Clone the repository or copy the CloudFormation template athena-prepared-statements.yaml.
  3. On the AWS CloudFormation console, choose Create stack.
  4. Select Upload a template file and choose Choose file.
  5. Upload athena-prepared-statements.yaml, then choose Next.
  6. On the Specify stack details page, enter the stack name athena-prepared-statements-blog.
  7. For S3QueryResultsBucketName, enter your S3 bucket name.
  8. If you leave AthenaWorkGroupName as default, the Athena workgroup is named PreparedStatementsWG. If you change the value, the Athena workgroup name must be unique in your AWS Region.
  9. Choose Next.
  10. On the Configure stack options page, choose Next.
  11. On the Review page, choose Create stack.

The script takes less than a minute to run and change to a CREATE_COMPLETE state. If you deploy the stack twice in the same AWS account and Region, the AWS Glue database, table, or Athena workgroup may already exist, and the process fails with a message indicating that the resource already exists in another template.

For least-privilege authorization for deployment of the CloudFormation template, you can create an AWS CloudFormation service role with the following IAM policy actions. To do this, you must create an IAM policy and IAM role, and choose this role when configuring stack options.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:DeleteDatabase"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/*",
        "arn:${Partition}:glue:${Region}:${Account}:userDefinedFunction/athena_prepared_statements/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/amazon_reviews_parquet"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:DeleteTable"
      ],
      "Resource": [
        "arn:${Partition}:glue:${Region}:${Account}:catalog",
        "arn:${Partition}:glue:${Region}:${Account}:database/athena_prepared_statements",
        "arn:${Partition}:glue:${Region}:${Account}:table/athena_prepared_statements/amazon_reviews_parquet"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:CreateWorkGroup",
        "athena:DeleteWorkGroup",
        "athena:GetWorkGroup"
      ],
      "Resource": "arn:${Partition}:athena:${Region}:${Account}:workgroup/PreparedStatementsWG"
    }
  ]
}

For the IAM principal running the CloudFormation template and following along, this post was tested with the following AWS managed policies and the customer managed policy below.

AWS managed policies:

  • AmazonAthenaFullAccess
  • AWSCloudTrailReadOnlyAccess
  • AWSCloudFormationFullAccess

Customer managed policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ViewS3BucketsWithoutErrors",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "InteractWithMyBucketAndDataSetBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::${my-bucket-name}*",
                "arn:aws:s3:::amazon-reviews-pds*"
            ]
        },
        {
            "Sid": "UploadCloudFormationTemplate",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::cf-template*"
        },
        {
            "Sid": "CleanUpResults",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::${my-bucket-name}/results*"
            ]
        },
        {
            "Sid": "ListRolesForCloudFormationDeployment",
            "Effect": "Allow",
            "Action": [
                "iam:ListRoles"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "IAMRoleForCloudFormationDeployment",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:${Partition}:iam::${Account}:role/${role-name}"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "cloudformation.amazonaws.com"
                }
            }
        }
    ]
}

Partition the example dataset

The CloudFormation template created an external table pointing at a dataset of over 130 million customer reviews from Amazon.com. Partitioning data improves query performance and reduces cost by restricting the amount of data scanned by each query. The external table for this dataset has Hive-compatible partitions. The MSCK REPAIR TABLE SQL statement scans the prefix paths in the S3 bucket and updates the metadata in the Data Catalog with the partition metadata. To access the dataset, the external table’s partitions must be updated.

After you deploy the CloudFormation template, complete the following steps:

  1. On the Athena console, choose Query editor in the navigation pane.
  2. For Data Source, enter AwsDataCatalog.
  3. For Database, enter athena_prepared_statements.
  4. On the Workgroup drop-down menu, choose PreparedStatementsWG.
  5. Choose Acknowledge to confirm.
  6. In the query editor pane, run the following SQL statement for your external table:
MSCK REPAIR TABLE athena_prepared_statements.amazon_reviews_parquet;

This query takes approximately 15 seconds to run when tested in us-east-1.

  7. Run the following query to list the available partitions. The example dataset has partitions based on product_category.
SHOW PARTITIONS athena_prepared_statements.amazon_reviews_parquet;
  8. Run a SELECT statement to output a sample of data available in the table:
SELECT * FROM athena_prepared_statements.amazon_reviews_parquet limit 10;

Create prepared statements

To use Athena parameterized queries, first you run the PREPARE SQL statement and specify your positional parameters, denoted by question marks. The Athena prepared statement is stored with a name you specify.

Run the following PREPARE statement in the Athena query editor. This example query, named product_helpful_reviews, provides customer reviews with three parameters for a specified product ID, star rating provided by the reviewer, and minimum number of helpful votes provided to the review by other Amazon.com customers.

PREPARE product_helpful_reviews FROM
SELECT product_id, product_title, star_rating, helpful_votes, review_headline, review_body
FROM amazon_reviews_parquet WHERE product_id = ? AND star_rating = ? AND helpful_votes > ?
ORDER BY helpful_votes DESC
LIMIT 10;

You can also use the CreatePreparedStatement API or an AWS SDK. For example, to create your prepared statement from the AWS CLI, run the following command:

aws athena create-prepared-statement \
--statement-name "product_helpful_reviews" \
--query-statement "SELECT product_id, product_title, star_rating, helpful_votes, review_headline, review_body FROM amazon_reviews_parquet WHERE product_id = ? AND star_rating = ? AND helpful_votes > ? ORDER BY helpful_votes DESC LIMIT 10;" \
--work-group PreparedStatementsWG \
--region region

For more information on creating prepared statements, refer to SQL Statements in Querying with Prepared Statements.

Run parameterized queries

You can run a parameterized query against the prepared statement with the EXECUTE SQL statement and a USING clause. The USING clause specifies the argument values for the prepared statement’s parameters.

Run the following EXECUTE statement in the Athena query editor. The prepared statement created in the previous section is run with parameters to output 4-star reviews for the Amazon Smile eGift Card product ID with at least 10 helpful votes.

EXECUTE product_helpful_reviews USING 'BT00DDVMVQ', 4, 10;

If you receive the message PreparedStatement product_helpful_reviews was not found in workGroup primary, make sure you selected the PreparedStatementsWG workgroup.

For more information on running parameterized queries, refer to SQL Statements in Querying with Prepared Statements.

Mask query string data in CloudTrail events using parameterized queries

You may want to use parameterized queries to redact sensitive data from the query string visible in CloudTrail events. For example, you may have columns containing PII as parameters, which you don’t want visible in logs. Athena automatically masks query strings from CloudTrail events for EXECUTE statements, replacing the query string with the value HIDDEN_DUE_TO_SECURITY_REASONS. This helps you avoid displaying protected data in your log streams.

To access the CloudTrail event for the query, complete the following steps:

  1. Navigate to the Event history page on the CloudTrail console.
  2. On the drop-down menu, choose Event name.
  3. Search for StartQueryExecution events.

CloudTrail event records for parameterized queries include a queryString value redacted with HIDDEN_DUE_TO_SECURITY_REASONS. The query string remains visible in the Athena workgroup's query history. You can control access to it by applying least-privilege IAM policies for Athena, the AWS Glue Data Catalog, and the Amazon S3 query output location configured in your workgroup settings. For more information on viewing recent queries, refer to Viewing Recent Queries. For more information on IAM policies, refer to Actions, resources, and condition keys for AWS services.
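
You can perform the same check programmatically. The following is a minimal sketch using the CloudTrail LookupEvents API to fetch recent StartQueryExecution events and print each recorded query string, so you can confirm that EXECUTE statements appear as HIDDEN_DUE_TO_SECURITY_REASONS.

import json
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

# Fetch recent StartQueryExecution events from the event history
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventName", "AttributeValue": "StartQueryExecution"}
    ],
    MaxResults=10,
)

for event in events["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    query = detail.get("requestParameters", {}).get("queryString")
    print(query)  # EXECUTE statements show HIDDEN_DUE_TO_SECURITY_REASONS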

Layer of protection for SQL injection

This section shows an example of a SQL injection attack and how prepared statements protect against it. We use the Athena console to invoke the StartQueryExecution API against a table named users with three rows.

SQL injection is an attempt to insert malicious SQL code into requests in order to change the statement and extract data from your dataset’s tables. Without Athena parameterized queries, if you query a dataset directly or append user input to a SQL query, and users can append SQL fragments, the dataset may be vulnerable to SQL injection attacks that return unauthorized data in the result set.

This post shows an example of maliciously inserting a SQL fragment. In the example, an OR condition that always evaluates to true (such as OR 1=1) is appended to the WHERE clause. The same example query is then shown as an Athena parameterized query, and the query fails because it contains an invalid parameter value: the parameter is expected to be an integer but contains the characters “OR”. If the parameter were based on a string column, the same SQL injection attempt would result in the query returning no results, because the positional argument is interpreted as a literal parameter value.

Athena provides an additional layer of defense against multi-statement SQL injection attacks. Attempting to perform SQL injection with an executable command (such as DROP) results in a failed query and the error Only one sql statement is allowed, because Athena accepts only one executable statement per SQL submission.

Although Athena prepared statements provide a layer of protection against SQL injection attacks, other precautions provide additional layers of defense. Athena prepared statements can be a part of your defense-in-depth strategy. For more information on layers of security, refer to Amazon Athena Security.

SQL injection example

The intended use of the SELECT query in the example is to return a small set of values. However, an attacker can manipulate the input to append malicious SQL code. For example, an attacker can input the value 1 OR 1=1, which appends an always-true condition to the WHERE clause and returns all records in the table:

SELECT * FROM users WHERE id = 1 OR 1=1;

By appending malicious SQL code, the attacker can retrieve all rows of the users table, as shown in the following screenshot.
Graphical user interface of Athena running the query SELECT * FROM users WHERE id = 1 OR 1=1;. All three users in the table, with IDs 1, 2, and 3, are returned with all columns of the table.

SQL injection attempt with a prepared statement

If we create a prepared statement with the same query from the previous example, any injected SQL is passed as a literal argument for the parameter’s value. If a user tries to pass additional SQL, they receive a syntax error because the WHERE clause filters on id, which expects an integer value.

  1. Create a prepared statement using the same query against the users table:
PREPARE get_user FROM SELECT * FROM users WHERE id = ?
  2. Set the parameter to a legitimate value:
EXECUTE get_user USING 1

The expected result returns, as shown in the following screenshot.

Graphical user interface of Athena running the query EXECUTE get_user USING 1. Only the user with ID 1 is returned.

  3. Now, attempt to pass a malicious value:
EXECUTE get_user USING 1 OR 1=1

Running this prepared statement produces a syntax error: the parameter expects an integer value, but it receives 1 OR 1=1, which cannot be evaluated as one. The query and syntax error are shown in the following screenshot.

Graphical user interface of Athena running the query EXECUTE get_user USING 1 OR 1=1. The query fails with the error "SYNTAX_ERROR: line 1:24: Left side of logical expression must evaluate to a boolean (actual: integer). This query ran against the "default" database, unless qualified by the query. Please post the error message in our forum."
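
To see how this plays out in application code, the following is a minimal sketch that contrasts the two approaches. The user_input value is a hypothetical untrusted input; the point is that string concatenation splices the input into the statement itself, while the prepared statement treats the same input as a single literal argument.

user_input = "1 OR 1=1"  # hypothetical untrusted input from a caller

# Vulnerable: the input is spliced into the SQL text, so the injected
# fragment becomes part of the statement and returns every row
unsafe_sql = f"SELECT * FROM users WHERE id = {user_input}"

# With the prepared statement, the same input must bind to a single
# integer parameter, so Athena rejects the query with a syntax error
# instead of returning unauthorized rows
prepared_sql = f"EXECUTE get_user USING {user_input}"

for label, sql in (("concatenated", unsafe_sql), ("prepared", prepared_sql)):
    print(f"{label}: {sql}")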

Working with prepared statements

This section describes administrative functions to make it easier to work with prepared statements.

List all prepared statements in my AWS account

To list all prepared statements in an Athena workgroup from the AWS Command Line Interface (AWS CLI), you can run the following command:

aws athena list-prepared-statements --work-group workgroup_name --region region_name

If you’re following the example above, the command returns the following response.

{
  "PreparedStatements": [
    {
      "StatementName": "product_helpful_reviews",
      "LastModifiedTime": "2022-01-14T15:33:07.935000+00:00"
    }
  ]
}

To list all available prepared statements across your AWS account, you can use the AWS APIs. This post provides a sample script that uses the AWS SDK for Python (Boto3) to loop through all Regions in your account and list the prepared statements in each Athena workgroup.

Make sure you have AWS credentials configured where you plan to run the Python script. For more information, refer to Credentials.

Clone the GitHub repo or copy the Python script list-prepared-statements.py from the repo and run the script:

python3 list-prepared-statements.py

Replace <my-profile-name> with your AWS profile name when prompted, or leave it empty to use your default local credentials.

Enter the AWS CLI profile name or leave blank if using instance profile: <my-profile-name>

The following text is sample output of the script. If you’re following along, the response includes only the product_helpful_reviews prepared statement.

eu-north-1:
ap-south-1:
eu-west-3:
eu-west-2:
eu-west-1:
ap-northeast-3:
ap-northeast-2:
ap-northeast-1:
sa-east-1:
ca-central-1:
ap-southeast-1:
ap-southeast-2:
eu-central-1:
us-east-1:
        athena-v2-wg: my_select
        PreparedStatementsWG: get_user
        PreparedStatementsWG: get_contacts_by_company
        PreparedStatementsWG: product_helpful_reviews
        PreparedStatementsWG: count_stars
        PreparedStatementsWG: helpful_reviews
        PreparedStatementsWG: get_product_info
        PreparedStatementsWG: check_avg_stars_of_category
        PreparedStatementsWG: my_select_v1
        PreparedStatementsWG: my_select_v2
us-east-2:
us-west-1:
us-west-2:
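
The repository script handles more detail, but as a minimal sketch (assuming default credentials and omitting pagination for brevity), the core loop looks like the following: enumerate the Regions, then list each workgroup’s prepared statements.

import boto3

# Enumerate Regions enabled for the account
regions = [
    r["RegionName"]
    for r in boto3.client("ec2", region_name="us-east-1").describe_regions()["Regions"]
]

for region in regions:
    athena = boto3.client("athena", region_name=region)
    print(f"{region}:")
    # List every workgroup, then the prepared statements within it
    for wg in athena.list_work_groups()["WorkGroups"]:
        name = wg["Name"]
        statements = athena.list_prepared_statements(WorkGroup=name)
        for stmt in statements["PreparedStatements"]:
            print(f"\t{name}: {stmt['StatementName']}")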

Update prepared statements

You have a few options for updating prepared statements. You may want to do this to optimize your query performance, change the values you select, or for several other reasons.

  1. Rerun the PREPARE statement with the changes in the Athena query editor or via the StartQueryExecution API.
  2. Use the UpdatePreparedStatement API via the AWS CLI or SDK.

You can use this API to add a description to your prepared statements or update your queries. To update your query statement via this method, you must provide the statement name, workgroup name, updated query statement, and optionally a new description. For more information about the UpdatePreparedStatement API, refer to update-prepared-statement.
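
For example, a minimal Boto3 sketch of this call looks like the following; the updated query text and description are illustrative.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Replace the stored query text and attach an optional description
athena.update_prepared_statement(
    StatementName="product_helpful_reviews",
    WorkGroup="PreparedStatementsWG",
    QueryStatement=(
        "SELECT product_id, product_title, star_rating, helpful_votes "
        "FROM amazon_reviews_parquet "
        "WHERE product_id = ? AND star_rating = ? AND helpful_votes > ? "
        "ORDER BY helpful_votes DESC LIMIT 25"
    ),
    Description="Returns the top 25 helpful reviews for a product",
)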

You may want to roll out versions of your query. To maintain backward-compatibility for users, you could create a new prepared statement with a different name. For example, the prepared statement could have a version number in its name (such as my_select_v1 and my_select_v2). When necessary, you could communicate changes to teams who rely on the prepared statement, and later deallocate the old prepared statement versions.

Delete prepared statements

To delete a prepared statement, you can use the following query syntax with the StartQueryExecution API or from within the Athena query editor:

DEALLOCATE PREPARE product_helpful_reviews

You can also use the DeletePreparedStatement API via the AWS CLI or an AWS SDK. For example, to delete your prepared statement from the AWS CLI, run the following command:

aws athena delete-prepared-statement --statement-name product_helpful_reviews --work-group PreparedStatementsWG --region region
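
A minimal Boto3 sketch of the same deletion looks like this:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Remove the stored statement from the workgroup
athena.delete_prepared_statement(
    StatementName="product_helpful_reviews",
    WorkGroup="PreparedStatementsWG",
)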

Clean up

If you followed along with this post, you created several components that may incur costs. To avoid future charges, remove the resources with the following steps:

  1. Delete the S3 bucket’s results prefix created after you run a query on your workgroup.

With the default template, it’s named <S3QueryResultsBucketName>/athena-results. Use caution in this step. Unless you are using versioning on your S3 bucket, deleting S3 objects cannot be undone.

  2. Delete the Athena prepared statements in the PreparedStatementsWG workgroup.

You can follow the steps in the Delete prepared statements section of this post using either the DEALLOCATE PREPARE statement or delete-prepared-statement API for each prepared statement you created.

  3. To remove the CloudFormation stack, select the stack on the AWS CloudFormation console, choose Delete, and confirm.

Conclusion

Athena parameterized queries make it easy to decouple your code base from your queries by providing a way to store common queries within your Athena workgroup. This post provided information about how Athena parameterized queries can improve your code reusability and data lake security. We showed how you can set up a sample data lake and start using parameterized queries today, provided an example of the protections parameterized queries offer, and detailed additional administrative functions.

You can get started with Athena prepared statements via the Athena console, the AWS CLI, or the AWS SDK. To learn more about Athena, refer to the Amazon Athena User Guide.

Thanks for reading this post! If you have questions about Athena parameterized queries, don’t hesitate to leave a comment in the comments section.


About the Authors

Blayze Stefaniak is a Senior Solutions Architect at AWS who works with public sector, federal financial, and healthcare organizations. Blayze is based out of Pittsburgh. He is passionate about breaking down complex situations into something practical and actionable. His interests include artificial intelligence, distributed systems, and Excel formula gymnastics. Blayze holds a B.S.B.A. in Accounting and B.S. in Information Systems from Clarion University of Pennsylvania. In his spare time, you can find Blayze listening to Star Wars audiobooks, trying to make his dogs laugh, and probably talking on mute.

Daniel Tatarkin is a Solutions Architect at Amazon Web Services (AWS) supporting Federal Financial organizations. He is passionate about big data analytics and serverless technologies. Outside of work, he enjoys learning about personal finance, coffee, and trying out new programming languages for fun.