Analyze your Amazon CloudFront access logs at scale

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/

Many AWS customers are using Amazon CloudFront, a global content delivery network (CDN) service. It delivers websites, videos, and API operations to browsers and clients with low latency and high transfer speeds. Amazon CloudFront protects your backends from massive load or malicious requests by caching or a web application firewall. As a result, sometimes only a small fraction of all requests gets to your backends. You can configure Amazon CloudFront to store access logs with detailed information of every request to Amazon Simple Storage Service (S3). This lets you gain insight into your cache efficiency and learn how your customers are using your products.

A common choice to run standard SQL queries on your data in S3 is Amazon Athena. Queries analyze your data immediately without the prior setup of infrastructure or loading your data. You pay only for the queries that you run. Amazon Athena is ideal for quick, interactive querying. It supports complex analysis of your data, including large joins, unions, nested queries, and window functions.

This blog post shows you how you can restructure your Amazon CloudFront access logs storage to optimize the cost and performance for queries. It demonstrates common patterns that are also applicable to other sources of time series data.

Optimizing Amazon CloudFront access logs for Amazon Athena queries

There are two main aspects to optimize: cost and performance.

Cost should be low for both the storage of your data and the queries. Access logs are stored in S3, which is billed per GB per month. Thus, it makes sense to compress your data, especially when you want to keep your logs for a long time. Queries incur cost as well. When you optimize the storage cost, the query cost usually follows. Access logs are delivered compressed by gzip, and Amazon Athena can read compressed data. Amazon Athena is billed by the amount of compressed data scanned, so the benefits of compression are passed on to you as cost savings.

Queries further benefit from partitioning. Partitioning divides your table into parts and keeps the related data together based on column values. For time-based queries, you benefit from partitioning by year, month, day, and hour. In Amazon CloudFront access logs, these values are derived from the request time. Depending on your data and queries, you can add further dimensions to partitions. For example, for access logs it could be the domain name that was requested. When querying your data, you specify filters based on the partition columns to make Amazon Athena scan less data.

Generally, performance improves by scanning less data. Conversion of your access logs to columnar formats reduces the data to scan significantly. Columnar formats retain all information but store values by column. This allows the creation of dictionaries and the effective use of run-length encoding and other compression techniques. Amazon Athena can further reduce the amount of data to read, because it does not scan a column at all if the column is not used in a filter or in the result of a query. Columnar formats also split a file into chunks and calculate metadata at the file and chunk level, such as the range (min/max), count, or sum of values. If the metadata indicates that a file or chunk is not relevant for the query, Amazon Athena skips it. In addition, if you know your queries and the information you are looking for, you can further aggregate your data (for example, by day) to improve the performance of frequent queries.
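To make the idea of skipping chunks by metadata more tangible, here is a minimal sketch in Python. It is a toy model and not how Apache Parquet or Apache ORC lay out data internally: the chunk structure, the helper names make_chunk and count_matches, and the sample status codes are all assumptions made for illustration.

```python
def make_chunk(values):
    """Store the values of one column chunk together with min/max metadata."""
    return {"min": min(values), "max": max(values), "values": values}


def count_matches(chunks, lower, upper):
    """Count values in [lower, upper]; read only chunks whose range overlaps."""
    matches = 0
    chunks_read = 0
    for chunk in chunks:
        if chunk["max"] < lower or chunk["min"] > upper:
            # The metadata proves that no value can match, so skip the chunk.
            continue
        chunks_read += 1
        matches += sum(lower <= value <= upper for value in chunk["values"])
    return matches, chunks_read


# Hypothetical HTTP status codes, split into three chunks of one column.
status_chunks = [
    make_chunk([200, 200, 200, 204]),
    make_chunk([200, 301, 304, 304]),
    make_chunk([403, 404, 404, 500]),
]

# Count the 4xx responses: only the last chunk has to be read.
print(count_matches(status_chunks, 400, 499))  # (3, 1)
```

The more the values in a chunk cluster together, the more chunks a filter can rule out from the metadata alone.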

This blog post focuses on two measures to restructure Amazon CloudFront access logs for optimization: partitioning and conversion to columnar formats. For more details on performance tuning read the blog post about the top 10 performance tuning tips for Amazon Athena.

This blog post describes the concepts of a solution and includes code excerpts to illustrate the implementation. Visit the AWS Samples repository for a fully working implementation of the concepts. You can launch the packaged sample application from the AWS Serverless Application Repository and deploy it within minutes in one step.

Partitioning CloudFront Access Logs in S3

Amazon CloudFront delivers each access log file in CSV format to an S3 bucket of your choice. Its name adheres to the following format (for more information, see Configuring and Using Access Logs):

/optional-prefix/distribution-ID.YYYY-MM-DD-HH.unique-ID.gz

The file name includes the date and time of the period in which the requests occurred in Coordinated Universal Time (UTC). Although you can specify an optional prefix for an Amazon CloudFront distribution, all access log files for a distribution are stored with the same prefix.

When you have a large amount of access log data, this makes it hard to only scan and process parts of it efficiently. Thus, you must partition your data. Most tools in the big data space (for example, the Apache Hadoop ecosystem, Amazon Athena, AWS Glue) can deal with partitioning using the Apache Hive style. A partition is a directory that is self-descriptive. The directory name not only reflects the value of a column but also the column name. For access logs this is a desirable structure:

/optional-prefix/year=YYYY/month=MM/day=DD/hour=HH/distribution-ID.YYYY-MM-DD-HH.unique-ID.gz

To generate this structure, the sample application initiates the processing of each file by an S3 event notification. As soon as Amazon CloudFront puts a new access log file into the S3 bucket, an event triggers the AWS Lambda function moveAccessLogs. The function moves the file to the prefix that corresponds to the date and hour in its file name. Technically, the move is a copy followed by the deletion of the original file.
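The mapping from a delivered file name to its Hive style key can be sketched as follows. This is a minimal illustration in Python, not the code of the moveAccessLogs function: the function name partitioned_key, the regular expression, and the example distribution ID and unique ID are assumptions. The copy and delete calls against S3 are left out.

```python
import re

# File name part of a delivered log file:
# distribution-ID.YYYY-MM-DD-HH.unique-ID.gz
LOG_NAME = re.compile(
    r"^(?P<distribution>[^./]+)\."
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})-(?P<hour>\d{2})\."
    r"(?P<unique>[^./]+)\.gz$"
)


def partitioned_key(source_key, target_prefix="partitioned-gz/"):
    """Return the Hive style target key for the key of a delivered log file."""
    filename = source_key.rsplit("/", 1)[-1]
    match = LOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a CloudFront access log file name: {filename}")
    parts = match.groupdict()
    return (
        f"{target_prefix}year={parts['year']}/month={parts['month']}/"
        f"day={parts['day']}/hour={parts['hour']}/{filename}"
    )


# Hypothetical distribution ID and unique ID, for illustration only.
print(partitioned_key("new/E2EXAMPLE.2017-10-01-02.abcd1234.gz"))
# partitioned-gz/year=2017/month=10/day=01/hour=02/E2EXAMPLE.2017-10-01-02.abcd1234.gz
```

Because the target key depends only on the file name, the same function works for files that you copy into the bucket later.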

Migration of your Amazon CloudFront Access Logs

The deployment of the sample application contains a single S3 bucket called <StackName>-cf-access-logs. You can modify your existing Amazon CloudFront distribution configuration to deliver access logs to this bucket with the log prefix new/. Files are moved to the canonical file structure for Amazon Athena partitioning as soon as they are put into the bucket.

To migrate all previous access log files, copy them manually to the new/ folder in the bucket. For example, you could copy the files by using the AWS Command Line Interface (AWS CLI). These files are treated the same way as files newly delivered by Amazon CloudFront.

Load the Partitions and query your Access Logs

Before you can query the access logs in your bucket with Amazon Athena, the AWS Glue Data Catalog needs metadata. On deployment, the sample application creates a table with the definition of the schema and the location. The new table is created by adding the partitioning information to the CREATE TABLE statement from the Amazon CloudFront documentation (mind the PARTITIONED BY clause):

CREATE EXTERNAL TABLE IF NOT EXISTS
    cf_access_logs.partitioned_gz (
         date DATE,
         time STRING,
         location STRING,
         bytes BIGINT,
         requestip STRING,
         method STRING,
         host STRING,
         uri STRING,
         status INT,
         referrer STRING,
         useragent STRING,
         querystring STRING,
         cookie STRING,
         resulttype STRING,
         requestid STRING,
         hostheader STRING,
         requestprotocol STRING,
         requestbytes BIGINT,
         timetaken FLOAT,
         xforwardedfor STRING,
         sslprotocol STRING,
         sslcipher STRING,
         responseresulttype STRING,
         httpversion STRING,
         filestatus STRING,
         encryptedfields INT 
)
PARTITIONED BY(
         year string,
         month string,
         day string,
         hour string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://<StackName>-cf-access-logs/partitioned-gz/'
TBLPROPERTIES ( 'skip.header.line.count'='2');

You can load the partitions added so far by running the metastore check (msck) statement via the Amazon Athena query editor. It discovers the partition structure in S3 and adds partitions to the metastore.

msck repair table cf_access_logs.partitioned_gz

You are now ready for your first query on your data in the Amazon Athena query editor:

SELECT SUM(bytes) AS total_bytes
FROM cf_access_logs.partitioned_gz
WHERE year = '2017'
AND month = '10'
AND day = '01'
AND hour BETWEEN '00' AND '11';

This query filters on the columns used for partitioning, not on the request date column (called date in the table definition). The partition columns depend on date, but the table definition does not specify this relationship. When you specify only the request date column, Amazon Athena scans every file, as there is no hint about which files contain the relevant rows and which do not. By specifying the partition columns, Amazon Athena scans only a small subset of the total amount of Amazon CloudFront access log files. This optimizes both the performance and the cost of your queries. You can add further columns to the WHERE clause, such as time, to further narrow down the results.

To save cost, consider narrowing the scope of partitions down to a minimum by also putting the partitioning columns into the WHERE clause. You validate the approach by observing the amount of data that was scanned in the query execution statistics for your queries. These statistics are also displayed in the Amazon Athena query editor after your statement has run.

Adding Partitions continuously

As Amazon CloudFront continuously delivers new access log data for requests, new prefixes for partitions are created in S3. However, Amazon Athena only queries the files contained in the known partitions, that is, partitions that have already been added to the metastore. Periodically triggering the msck command is not the best solution for this. First, it is a time-consuming operation, since Amazon Athena scans all S3 paths to validate and load your partitions. More importantly, this way you only add partitions that already have data delivered. Thus, there is some time period when the data exists in S3 but is not yet visible to Amazon Athena queries.

The sample application solves this by adding the partition for each hour in advance because partitions are just dependent on the request time. This way Amazon Athena scans files as soon as they exist in S3. A scheduled AWS Lambda function runs a statement like this:

ALTER TABLE cf_access_logs.partitioned_gz
ADD IF NOT EXISTS 
PARTITION (
    year = '2017',
    month = '10',
    day = '01',
    hour = '02' );

The statement can omit the location attribute, because the location is derived automatically from the column values when the files follow the canonical Hive style path.
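As a rough sketch of how a scheduled function could build this statement, the following Python snippet derives the partition values from a timestamp. It is not the sample application's code: the helper names next_hour and add_partition_statement are made up, and it assumes that the partition for the upcoming hour is added.

```python
from datetime import datetime, timedelta, timezone


def next_hour(now):
    """Return the start of the hour that follows `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def add_partition_statement(table, when):
    """Build the ALTER TABLE statement for the hour that contains `when`."""
    when = when.astimezone(timezone.utc)  # the log file names use UTC
    return (
        f"ALTER TABLE {table}\n"
        "ADD IF NOT EXISTS\n"
        "PARTITION (\n"
        f"    year = '{when:%Y}',\n"
        f"    month = '{when:%m}',\n"
        f"    day = '{when:%d}',\n"
        f"    hour = '{when:%H}' );"
    )


# A run at 01:25 UTC prepares the partition for the hour that starts at 02:00.
now = datetime(2017, 10, 1, 1, 25, tzinfo=timezone.utc)
print(add_partition_statement("cf_access_logs.partitioned_gz", next_hour(now)))
```

Thanks to IF NOT EXISTS, running the statement more than once for the same hour is harmless.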

Conversion of the Access Logs to a Columnar Format

As mentioned previously, with columnar formats Amazon Athena skips scanning of data not relevant for a query resulting in less cost. Amazon Athena currently supports the columnar formats Apache ORC and Apache Parquet.

Key to the conversion is the Amazon Athena CREATE TABLE AS SELECT (CTAS) feature. A CTAS query creates a new table from the results of another SELECT query. Amazon Athena stores data files created by the CTAS statement in a specified location in Amazon S3. You can use CTAS to aggregate or transform the data, and to convert it into columnar formats. The sample application uses CTAS to rewrite the logs every hour from the CSV format to the Apache Parquet format. Afterwards, the resulting data is added to a single partitioned table (the target table).

Creating the Target Table in Apache Parquet Format

The target table is a slightly modified version of the partitioned_gz table. Besides a different location, the following table definition shows the Serializer/Deserializer (SerDe) configuration for Apache Parquet:

CREATE EXTERNAL TABLE `cf_access_logs.partitioned_parquet`(
  `date` date, 
  `time` string, 
  `location` string, 
  `bytes` bigint, 
  `requestip` string, 
  `method` string, 
  `host` string, 
  `uri` string, 
  `status` int, 
  `referrer` string, 
  `useragent` string, 
  `querystring` string, 
  `cookie` string, 
  `resulttype` string, 
  `requestid` string, 
  `hostheader` string, 
  `requestprotocol` string, 
  `requestbytes` bigint, 
  `timetaken` float, 
  `xforwardedfor` string, 
  `sslprotocol` string, 
  `sslcipher` string, 
  `responseresulttype` string, 
  `httpversion` string, 
  `filestatus` string, 
  `encryptedfields` int)
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<StackName>-cf-access-logs/partitioned-parquet'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'parquet.compression'='SNAPPY')

Transformation to Apache Parquet by the CTAS Query

The sample application provides a scheduled AWS Lambda function transformPartition that runs a CTAS query on a single partition per run, taking one hour of data into account. The target location for the Apache Parquet files is the Apache Hive style path in the location of the partitioned_parquet table.

The files written to S3 are important, but the table in the AWS Glue Data Catalog for this data is just a by-product. Hence, the function drops the CTAS table immediately and creates the corresponding partition in the partitioned_parquet table instead.

CREATE TABLE cf_access_logs.ctas_2017_10_01_02
WITH ( format='PARQUET',
    external_location='s3://<StackName>-cf-access-logs/partitioned-parquet/year=2017/month=10/day=01/hour=02',
    parquet_compression = 'SNAPPY')
AS SELECT *
FROM cf_access_logs.partitioned_gz
WHERE year = '2017'
    AND month = '10'
    AND day = '01'
    AND hour = '02';

DROP TABLE cf_access_logs.ctas_2017_10_01_02;

ALTER TABLE cf_access_logs.partitioned_parquet
ADD IF NOT EXISTS 
PARTITION (
    year = '2017',
    month = '10',
    day = '01',
    hour = '02' );

The statements should be run as soon as new data is written. Amazon CloudFront usually delivers the log file for a time period to your Amazon S3 bucket within an hour of the events that appear in the log. The sample application schedules the transformPartition function hourly to transform the data for the hour before the previous hour.
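The scheduling logic and the three statements can be sketched in Python as follows. This is an illustration under assumptions, not the transformPartition function itself: the helper names are made up, the bucket name keeps the <StackName> placeholder, and the external location is assumed to be the Hive style path under the LOCATION of the partitioned_parquet table.

```python
from datetime import datetime, timedelta, timezone

BUCKET = "<StackName>-cf-access-logs"  # placeholder, as in the statements above
DATABASE = "cf_access_logs"


def partition_to_transform(now):
    """Return the start of the hour before the previous hour."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=2)


def transform_statements(hour_start):
    """Build the CTAS, DROP TABLE, and ADD PARTITION statements for one hour."""
    year, month, day, hour = hour_start.strftime("%Y %m %d %H").split()
    ctas_table = f"{DATABASE}.ctas_{year}_{month}_{day}_{hour}"
    location = (
        f"s3://{BUCKET}/partitioned-parquet/"
        f"year={year}/month={month}/day={day}/hour={hour}"
    )
    ctas = (
        f"CREATE TABLE {ctas_table}\n"
        f"WITH ( format='PARQUET',\n"
        f"    external_location='{location}',\n"
        f"    parquet_compression = 'SNAPPY')\n"
        f"AS SELECT *\n"
        f"FROM {DATABASE}.partitioned_gz\n"
        f"WHERE year = '{year}' AND month = '{month}'\n"
        f"    AND day = '{day}' AND hour = '{hour}';"
    )
    drop = f"DROP TABLE {ctas_table};"
    add = (
        f"ALTER TABLE {DATABASE}.partitioned_parquet\n"
        f"ADD IF NOT EXISTS\n"
        f"PARTITION ( year = '{year}', month = '{month}',\n"
        f"    day = '{day}', hour = '{hour}' );"
    )
    return [ctas, drop, add]


# A run shortly after 04:00 UTC transforms the partition for hour 02.
now = datetime(2017, 10, 1, 4, 0, 5, tzinfo=timezone.utc)
for statement in transform_statements(partition_to_transform(now)):
    print(statement, end="\n\n")
```

The statements have to run in this order, each one after the previous one has finished.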

Some or all log file entries for a time period can sometimes be delayed by up to 24 hours. To mitigate this case, delete and recreate the partition after that period. Also, if you migrated partitions from previous Amazon CloudFront access logs, run the transformPartition function for each partition. The sample application only transforms continuously added files.

When all files of a gzip partition are converted to Apache Parquet, you can save cost by getting rid of data that you do not need. Use lifecycle policies in S3 to archive the gzip files in a cheaper storage class or to delete them after a specific number of days.

Query data over Multiple Tables

You now have two derived tables from the original Amazon CloudFront access log data:

  • partitioned_gz contains gzip compressed CSV files that are added as soon as new files are delivered.
  • Access logs in partitioned_parquet are written after one hour at the latest. A rough assumption is that the CTAS query takes a maximum of 15 minutes to transform a gzip partition. You must measure and confirm this assumption. Depending on the data size, this can be much faster.

The following diagram shows how the complete view on all data is composed of the two tables. The last complete partition of Apache Parquet files ends before the current time minus the transformation duration and the duration until Amazon CloudFront delivers the access log files.

For convenience the sample application creates the Amazon Athena view combined as a union of both tables. It includes an additional column called file. This is the file that stores the row.

CREATE OR REPLACE VIEW cf_access_logs.combined AS
SELECT *, "$path" AS file
FROM cf_access_logs.partitioned_gz
WHERE concat(year, month, day, hour) >=
       date_format(date_trunc('hour', (current_timestamp -
       INTERVAL '15' MINUTE - INTERVAL '1' HOUR)), '%Y%m%d%H')
UNION ALL SELECT *, "$path" AS file
FROM cf_access_logs.partitioned_parquet
WHERE concat(year, month, day, hour) <
       date_format(date_trunc('hour', (current_timestamp -
       INTERVAL '15' MINUTE - INTERVAL '1' HOUR)), '%Y%m%d%H')
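To see which table serves a given partition at a given moment, the boundary used by the view can be reproduced outside of Amazon Athena. The following Python sketch mirrors the date arithmetic of the view. The function names cutoff and source_table are made up, and it assumes that current_timestamp is evaluated in UTC.

```python
from datetime import datetime, timedelta, timezone


def cutoff(now):
    """Mirror date_format(date_trunc('hour', now - 15 min - 1 h), '%Y%m%d%H')."""
    shifted = now - timedelta(minutes=15) - timedelta(hours=1)
    truncated = shifted.replace(minute=0, second=0, microsecond=0)
    return truncated.strftime("%Y%m%d%H")


def source_table(partition, now):
    """Return the table that serves `partition`, given as a YYYYMMDDHH string."""
    # Fixed-width digit strings compare like the timestamps they encode.
    return "partitioned_gz" if partition >= cutoff(now) else "partitioned_parquet"


utc = timezone.utc
# At 12:10 the partition for hour 10 may still be in transformation.
print(source_table("2017100110", datetime(2017, 10, 1, 12, 10, tzinfo=utc)))
# partitioned_gz
# At 12:20 the assumed 15 minutes have passed, so Apache Parquet is used.
print(source_table("2017100110", datetime(2017, 10, 1, 12, 20, tzinfo=utc)))
# partitioned_parquet
```

If you measure a longer transformation time for your data, the 15 minute interval in the view is the value to adjust.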

Now you can query the data from the view to take advantage of the partitions in columnar format automatically. As mentioned before, you should add the partition columns (year, month, day, hour) to your statement to limit the files Amazon Athena scans.

SELECT SUM(bytes) AS total_bytes
FROM cf_access_logs.combined
WHERE year = '2017'
   AND month = '10'
   AND day = '01'

Summary

In this blog post, you learned how to optimize the cost and performance of your Amazon Athena queries with two steps. First, you divide the overall data into small partitions. This allows queries to run much faster by reducing the number of files to scan. The second step converts each partition into a columnar format to reduce storage cost and increase the efficiency of scans by Amazon Athena.

The results of both steps are combined in a single view for convenient interactive queries by you or your application. All data is partitioned by the time of the request. Thus, this format is best suited for interactive drill-downs into your logs for which the columns are limited and the time range is known. This way, it complements the Amazon CloudFront reports, for example, by providing easy access to:

  • Data from more than 60 days in the past
  • The distribution of detailed HTTP status codes (for example, 200, 403, 404) on a certain day or hour
  • Statistics based on the URI paths
  • Statistics of objects that are not listed in Amazon CloudFront’s 50 most popular objects report
  • A drill down into the attributes of each request

We hope you find this blog post and the sample application useful for other types of time series data besides Amazon CloudFront access logs as well. Feel free to submit enhancements to the example application in the source repository or provide feedback in the comments.

About the Author

Steffen Grunwald is a senior solutions architect with Amazon Web Services. Supporting German enterprise customers on their journey to the cloud, he loves to dive deep into application architectures and development processes to drive performance, operational efficiency, and increase the speed of innovation.