Auditing, inspecting, and visualizing Amazon Athena usage and cost

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/auditing-inspecting-and-visualizing-amazon-athena-usage-and-cost/

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. It’s a serverless platform with no need to set up or manage infrastructure. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries. You pay only for the queries that you run and the charges are based on the amount of data scanned by each query. You’re not charged for data definition language (DDL) statements like CREATE, ALTER, or DROP TABLE, statements for managing partitions, or failed queries. Cancelled queries are charged based on the amount of data scanned.

Typically, multiple users within an organization operate under different Athena workgroups and query various datasets in Amazon S3. In such cases, it’s beneficial to monitor Athena usage to ensure cost-effectiveness, avoid any performance bottlenecks, and adhere to security best practices. As a result, it’s desirable to have metrics that provide the following details:

  • Amount of data scanned by individual users
  • Amount of data scanned by different workgroups
  • Repetitive queries run by individual users
  • Slow-running queries
  • Most expensive queries

In this post, we provide a solution that collects detailed statistics of every query run in Athena and stores them in your data lake for auditing. We also demonstrate how to visualize the audit data collected for a few key metrics (data scanned per query and data scanned per user) using Amazon QuickSight.

Solution overview

The following diagram illustrates our solution architecture.

The solution consists of the following high-level steps:

  1. We use an Amazon CloudWatch Events rule with the following event pattern to trigger an AWS Lambda function for every Athena query run:
    {
      "detail-type": [
        "Athena Query State Change"
      ],
      "source": [
        "aws.athena"
      ],
      "detail": {
        "currentState": [
          "SUCCEEDED",
          "FAILED",
          "CANCELLED"
        ]
      }
    }
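
The CloudFormation template in this post provisions this rule and its Lambda target for you. Purely as an illustration of how the pattern is wired to the function, creating an equivalent rule by hand with boto3 might look like the following sketch; the rule name, target ID, and function ARN are placeholders:

    # Illustrative only: the CloudFormation template creates this rule and target for you.
    # The rule name, target ID, and Lambda function ARN below are placeholders.
    import json

    import boto3

    events = boto3.client("events")

    pattern = {
        "source": ["aws.athena"],
        "detail-type": ["Athena Query State Change"],
        "detail": {"currentState": ["SUCCEEDED", "FAILED", "CANCELLED"]},
    }

    events.put_rule(
        Name="athena-query-state-change",  # placeholder rule name
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )

    events.put_targets(
        Rule="athena-query-state-change",
        Targets=[{
            "Id": "query-details-function",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:athena_query_details_processor",  # placeholder
        }],
    )
    # The target function also needs a resource-based policy that allows
    # events.amazonaws.com to invoke it (lambda add-permission).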

  2. The Lambda function calls the Athena API to get details about the query and publishes them to Amazon Kinesis Data Firehose (a minimal sketch of this function follows the second event pattern below).
  3. Kinesis Data Firehose batches the data and writes it to Amazon S3.
  4. We use a second CloudWatch Events rule with the following event pattern to trigger another Lambda function for every Athena query being run:
    {
      "detail-type": [
        "AWS API Call via CloudTrail"
      ],
      "source": [
        "aws.athena"
      ],
      "detail": {
        "eventName": [
          "StartQueryExecution"
        ]
      }
    }
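
The following is a minimal sketch of the query-details function from step 2. The handler name, environment variable, and delivery stream name are illustrative assumptions; the CloudFormation template provisions the actual resources:

    # Minimal sketch of the query-details Lambda (step 2); names are illustrative.
    import json
    import os

    import boto3

    athena = boto3.client("athena")
    firehose = boto3.client("firehose")

    # Assumed environment variable; the deployed function may use a different name.
    STREAM_NAME = os.environ.get("QUERY_DETAILS_STREAM", "athena-query-details")


    def handler(event, context):
        # The "Athena Query State Change" event carries the query execution ID.
        query_execution_id = event["detail"]["queryExecutionId"]

        # Look up the full execution record, including runtime statistics.
        execution = athena.get_query_execution(QueryExecutionId=query_execution_id)["QueryExecution"]
        stats = execution.get("Statistics", {})

        record = {
            "query_execution_id": query_execution_id,
            "query": execution.get("Query"),
            "workgroup": execution.get("WorkGroup"),
            "state": execution["Status"]["State"],
            "data_scanned_bytes": stats.get("DataScannedInBytes", 0),
            "execution_time_millis": stats.get("EngineExecutionTimeInMillis", 0),
            "planning_time_millis": stats.get("QueryPlanningTimeInMillis", 0),
            "queryqueue_time_millis": stats.get("QueryQueueTimeInMillis", 0),
            "service_processing_time_millis": stats.get("ServiceProcessingTimeInMillis", 0),
        }

        # Kinesis Data Firehose batches these records and, per the delivery stream
        # configuration, converts them to Parquet before writing to Amazon S3.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )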

  5. The Lambda function extracts AWS Identity and Access Management (IAM) user details from the CloudTrail event for the query and publishes them to Kinesis Data Firehose (a minimal sketch follows the table schemas below).
  6. Kinesis Data Firehose batches the data and writes it to Amazon S3.
  7. We create and schedule an AWS Glue crawler that crawls the data written by Kinesis Data Firehose in the previous steps and creates and updates the Data Catalog tables. The following code shows the table schemas that the crawler creates:
    CREATE EXTERNAL TABLE `athena_query_details`(
      `query_execution_id` string, 
      `query` string, 
      `workgroup` string, 
      `state` string, 
      `data_scanned_bytes` bigint, 
      `execution_time_millis` int, 
      `planning_time_millis` int, 
      `queryqueue_time_millis` int, 
      `service_processing_time_millis` int)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://<S3location>/athena-query-details/'
    TBLPROPERTIES (
      'parquet.compression'='SNAPPY')
    
    CREATE EXTERNAL TABLE `athena_user_access_details`(
      `query_execution_id` string, 
      `account` string, 
      `region` string, 
      `user_detail` string, 
      `source_ip` string, 
      `event_time` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://<S3location>/athena-user-access-details/'
    TBLPROPERTIES (
      'parquet.compression'='SNAPPY')
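
The following is a minimal sketch of the user-access function from step 5. The field mapping mirrors the athena_user_access_details schema above; the handler name, environment variable, and delivery stream name are assumptions:

    # Minimal sketch of the user-access Lambda (step 5); names are illustrative.
    import json
    import os

    import boto3

    firehose = boto3.client("firehose")

    # Assumed environment variable for the second delivery stream.
    STREAM_NAME = os.environ.get("USER_ACCESS_STREAM", "athena-user-access-details")


    def handler(event, context):
        detail = event["detail"]

        record = {
            # StartQueryExecution returns the new query execution ID in responseElements.
            "query_execution_id": (detail.get("responseElements") or {}).get("queryExecutionId"),
            "account": detail.get("recipientAccountId"),
            "region": detail.get("awsRegion"),
            # The IAM principal (user or assumed role) that issued the query.
            "user_detail": (detail.get("userIdentity") or {}).get("arn"),
            "source_ip": detail.get("sourceIPAddress"),
            "event_time": detail.get("eventTime"),
        }

        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )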

  8. Athena stores a detailed history of the queries run in the last 30 days.
  9. The solution deploys a Lambda function that queries AWS CloudTrail to get the list of Athena queries run in the last 30 days and publishes the details to the Kinesis Data Firehose delivery streams created in the previous steps. This historical event processor function only needs to run one time after you deploy the solution with the provided AWS CloudFormation template (a minimal sketch follows this list).
  10. We use the data available in the athena_query_details and athena_user_access_details tables in QuickSight to build insights and create visual dashboards.
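
The following is a minimal sketch of the historical event processor from step 9. The deployed function publishes to both delivery streams; this sketch shows only the user-access path, and the stream name is a placeholder:

    # Minimal sketch of the one-time historical event processor (step 9).
    # It replays the last 30 days of Athena StartQueryExecution calls from CloudTrail;
    # only the user-access path is shown, and the stream name is a placeholder.
    import json
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudtrail = boto3.client("cloudtrail")
    firehose = boto3.client("firehose")


    def handler(event, context):
        end = datetime.now(timezone.utc)
        start = end - timedelta(days=30)

        paginator = cloudtrail.get_paginator("lookup_events")
        pages = paginator.paginate(
            LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "StartQueryExecution"}],
            StartTime=start,
            EndTime=end,
        )

        for page in pages:
            for item in page["Events"]:
                trail_event = json.loads(item["CloudTrailEvent"])
                if trail_event.get("eventSource") != "athena.amazonaws.com":
                    continue

                record = {
                    "query_execution_id": (trail_event.get("responseElements") or {}).get("queryExecutionId"),
                    "account": trail_event.get("recipientAccountId"),
                    "region": trail_event.get("awsRegion"),
                    "user_detail": (trail_event.get("userIdentity") or {}).get("arn"),
                    "source_ip": trail_event.get("sourceIPAddress"),
                    "event_time": trail_event.get("eventTime"),
                }

                firehose.put_record(
                    DeliveryStreamName="athena-user-access-details",  # assumed stream name
                    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
                )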

Setting up your resources

Click on the Launch Stack button to deploy the solution in your account.

After you deploy the CloudFormation template, manually invoke the athena_historicalquery_events_processor Lambda function. This step processes the Athena queries that ran before deploying this solution; you only need to perform this step one time. 

Reporting using QuickSight dashboards

In this section, we cover the steps to create QuickSight dashboards based on the athena_query_details and athena_user_access_details Athena tables. The first step is to create a dataset with the athena_audit_db database, which the CloudFormation template created.

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.

  3. For Create a Data Set, choose Athena.

 

  4. For Data source name, enter a name (for example, Athena_Audit).
  5. For Athena workgroup, keep the default [primary].
  6. Choose Create data source.

  7. In the Choose your table section, choose athena_audit_db.
  8. Choose Use custom SQL.

  9. Enter a name for your custom SQL (for example, Custom-SQL-Athena-Audit).
  10. Enter the following SQL query:
    SELECT a.query_execution_id, a.query, a.state, a.data_scanned_bytes, a.execution_time_millis, b.user_detail
    FROM "athena_audit_db"."athena_query_details" AS a FULL JOIN  "athena_audit_db"."athena_user_access_details" AS b
    ON a.query_execution_id=b.query_execution_id
    WHERE CAST(a.data_scanned_bytes AS DECIMAL) > 0

  11. Choose Confirm query.

This query does a full join between the athena_query_details and athena_user_access_details tables.
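
If you prefer to explore the audit data programmatically instead of (or in addition to) QuickSight, you can run a similar aggregation directly against Athena. The sketch below sums data scanned per IAM principal; the database and table names are the ones created by the solution, and the output location is a placeholder:

    # Sketch: aggregate bytes scanned per IAM principal directly in Athena.
    # The output location is a placeholder; database and table names come from the solution.
    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT b.user_detail,
           SUM(a.data_scanned_bytes) AS total_bytes_scanned
    FROM "athena_audit_db"."athena_query_details" a
    JOIN "athena_audit_db"."athena_user_access_details" b
      ON a.query_execution_id = b.query_execution_id
    GROUP BY b.user_detail
    ORDER BY total_bytes_scanned DESC
    """

    response = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "athena_audit_db"},
        ResultConfiguration={"OutputLocation": "s3://<S3location>/athena-audit-results/"},  # placeholder
    )
    print(response["QueryExecutionId"])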

  12. Select Import to SPICE for quicker analytics.
  13. Choose Edit/Preview data.

  14. Confirm that the data types of the data_scanned_bytes and execution_time_millis columns are set to Decimal.
  15. Choose Save & visualize.

We’re now ready to create visuals in QuickSight. For this post, we create the following visuals:

  • Data scanned per query
  • Data scanned per user 

Data scanned per query

To configure the chart for our first visual, complete the following steps:

  1. For Visual types, choose Horizontal bar chart.
  2. For Y axis, choose query_execution_id.
  3. For Value, choose data_scanned_bytes.
  4. For Group/Color, choose query.

If you’re interested in determining the total runtime of the queries, you can use execution_time_millis instead of data_scanned_bytes.

The following screenshot shows our visualization.

 

If you hover over one of the rows in the chart, the pop-out detail shows the Athena query execution ID, Athena query that ran, and the data scanned by the query in bytes.

Data scanned per user

To configure the chart for our second visual, complete the following steps:

  1. For Visual types, choose Horizontal bar chart.
  2. For Y axis, choose query_execution_id.
  3. For Value, choose data_scanned_bytes.
  4. For Group/Color, choose user_detail.

You can use execution_time_millis instead of data_scanned_bytes if you’re interested in determining the total runtime of the queries.

The following screenshot shows our visualization.

If you hover over one of the rows in the chart, the pop-out detail shows the Athena query execution ID, the IAM principal that ran the query, and the data scanned by the query in bytes. 

Conclusion

This solution queries CloudTrail for Athena query activity in the last 30 days and uses CloudWatch Events rules to capture statistics for future Athena queries. Furthermore, the solution uses the GetQueryExecution API to provide details about the amount of data scanned, which provides information about per-query costs and how many queries are run per user. This enables you to further understand how your organization is using Athena.

You can further improve upon this architecture by using a partitioning strategy. Instead of writing all query metrics to a single S3 prefix, you can partition the data using year, month, and day partition keys. This allows you to query the data by year, month, or day. For more information about partitioning strategies, see Partitioning Data.

For more information about improving Athena performance, see Top 10 Performance Tuning Tips for Amazon Athena.


About the Authors

Kalyan Janaki is Senior Technical Account Manager with AWS. Kalyan enjoys working with customers and helping them migrate their workloads to the cloud. In his spare time, he tries to keep up with his 2-year-old.


Kapil Shardha is an AWS Solutions Architect and supports enterprise customers with their AWS adoption. He has a background in infrastructure automation and DevOps.


Aravind Singirikonda is a Solutions Architect at Amazon Web Services. He works with AWS Enterprise customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS.