Tag Archives: Amazon Athena

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

2024-04-25 Takeshi Nakatani

Post Syndicated from Takeshi Nakatani original https://aws.amazon.com/blogs/big-data/optimize-data-layout-by-bucketing-with-amazon-athena-and-aws-glue-to-accelerate-downstream-queries/

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making. However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis.

One of the key challenges in data lakes is the potential for slow query performance, especially when dealing with large datasets. This can be attributed to factors such as inefficient data layout, resulting in excessive data scanning and inefficient use of compute resources. To address this challenge, common practices like partitioning and bucketing can significantly improve query performance and reduce computation costs.

Partitioning is a technique that divides a large dataset into smaller, more manageable parts based on specific criteria, such as date, region, or product category. By partitioning data, downstream analytical queries can skip irrelevant partitions, reducing the amount of data that needs to be scanned and processed. You can use partition columns in the WHERE clause in queries to scan only the specific partitions that your query needs. This can lead to faster query runtimes and more efficient resource utilization. It especially works well when columns with low cardinality are chosen as the key.

What if you have a high cardinality column that you sometimes need to filter by VIP customers? Each customer is usually identified with an ID, which can be millions. Partitioning isn’t suitable for such high cardinality columns because you end up with small files, slow partition filtering, and high Amazon Simple Storage Service (Amazon S3) API cost (one S3 prefix is created per value of partition column). Although you can use partitioning with a natural key such as city or state to narrow down your dataset to some degree, it is still necessary to query across date-based partitions if your data is time series.

This is where bucketing comes into play. Bucketing makes sure that all rows with the same values of one or more columns end up in the same file. Instead of one file per value, like partitioning, a hash function is used to distribute values evenly across a fixed number of files. By organizing data this way, you can perform efficient filtering, because only the relevant buckets need to be processed, further reducing computational overhead.

There are multiple options for implementing bucketing on AWS. One approach is to use the Amazon Athena CREATE TABLE AS SELECT (CTAS) statement, which allows you to create a bucketed table directly from a query. Alternatively, you can use AWS Glue for Apache Spark, which provides built-in support for bucketing configurations during the data transformation process. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.

In this post, we discuss how to implement bucketing on AWS data lakes, including using Athena CTAS statement and AWS Glue for Apache Spark. We also cover bucketing for Apache Iceberg tables.

Example use case

In this post, you use a public dataset, the NOAA Integrated Surface Database. Data analysts run one-time queries for data during the past 5 years through Athena. Most of the queries are for specific stations with specific report types. The queries need to complete in 10 seconds, and the cost needs to be optimized carefully. In this scenario, you’re a data engineer responsible for optimizing query performance and cost.

For example, if an analyst wants to retrieve data for a specific station (for example, station ID 123456) with a particular report type (for example, CRN01), the query might look like the following query:

SELECT station, report_type, columnA, columnB, ...
FROM table_name
WHERE
report_type = 'CRN01'
AND station = '123456'

In the case of the NOAA Integrated Surface Database, the station_id column is likely to have a high cardinality, with numerous unique station identifiers. On the other hand, the report_type column may have a relatively low cardinality, with a limited set of report types. Given this scenario, it would be a good idea to partition the data by report_type and bucket it by station_id.

With this partitioning and bucketing strategy, Athena can first eliminate partitions for irrelevant report types, and then scan only the buckets within the relevant partition that match the specified station ID, significantly reducing the amount of data processed and accelerating query runtimes. This approach not only meets the query performance requirement, but also helps optimize costs by minimizing the amount of data scanned and billed for each query.

In this post, we examine how query performance is affected by data layout, in particular, bucketing. We also compare three different ways to achieve bucketing. The following table represents conditions for the tables to be created.

.	noaa_remote_original	athena_non_bucketed	athena_bucketed	glue_bucketed	athena_bucketed_iceberg
Format	CSV	Parquet	Parquet	Parquet	Parquet
Compression	n/a	Snappy	Snappy	Snappy	Snappy
Created via	n/a	Athena CTAS	Athena CTAS	Glue ETL	Athena CTAS with Iceberg
Engine	n/a	Trino	Trino	Apache Spark	Apache Iceberg
Is partitioned?	Yes but with different way	Yes	Yes	Yes	Yes
Is bucketed?	No	No	Yes	Yes	Yes

noaa_remote_original is partitioned by the year column, but not by the report_type column. This row represents if the table is partitioned by the actual columns that are used in the queries.

Baseline table

For this post, you create several tables with different conditions: some without bucketing and some with bucketing, to showcase the performance characteristics of bucketing. First, let’s create an original table using the NOAA data. In subsequent steps, you ingest data from this table to create test tables.

There are multiple ways to define a table definition: running DDL, an AWS Glue crawler, the AWS Glue Data Catalog API, and so on. In this step, you run DDL via the Athena console.

Complete the following steps to create the "bucketing_blog"."noaa_remote_original" table in the Data Catalog:

Open the Athena console.
In the query editor, run the following DDL to create a new AWS Glue database:
```
-- Create Glue database
CREATE DATABASE bucketing_blog;
```
For Database under Data, choose bucketing_blog to set the current database.

Run the following DDL to create the original table:

-- Create original table
CREATE EXTERNAL TABLE `bucketing_blog`.`noaa_remote_original`(
  `station` STRING, 
  `date` STRING, 
  `source` STRING, 
  `latitude` STRING, 
  `longitude` STRING, 
  `elevation` STRING, 
  `name` STRING, 
  `report_type` STRING, 
  `call_sign` STRING, 
  `quality_control` STRING, 
  `wnd` STRING, 
  `cig` STRING, 
  `vis` STRING, 
  `tmp` STRING, 
  `dew` STRING, 
  `slp` STRING, 
  `aj1` STRING, 
  `gf1` STRING, 
  `mw1` STRING)
PARTITIONED BY (
    year STRING)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://noaa-global-hourly-pds/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Because the source data has quoted fields, we use OpenCSVSerde instead of the default LazySimpleSerde.

These CSV files have a header row, which we tell Athena to skip by adding skip.header.line.count and setting the value to 1.

For more details, refer to OpenCSVSerDe for processing CSV.

Run the following DDL to add partitions. We add partitions only for 5 years out of 124 years based on the use case requirement:

-- Load partitions
ALTER TABLE `bucketing_blog`.`noaa_remote_original` ADD
  PARTITION (year = '2024') LOCATION 's3://noaa-global-hourly-pds/2024/'
  PARTITION (year = '2023') LOCATION 's3://noaa-global-hourly-pds/2023/'
  PARTITION (year = '2022') LOCATION 's3://noaa-global-hourly-pds/2022/'
  PARTITION (year = '2021') LOCATION 's3://noaa-global-hourly-pds/2021/'
  PARTITION (year = '2020') LOCATION 's3://noaa-global-hourly-pds/2020/';

Run the following DML to verify if you can successfully query the data:

-- Check data 
SELECT * FROM "bucketing_blog"."noaa_remote_original" LIMIT 10;

Now you’re ready to start querying the original table to examine the baseline performance.

Run a query against the original table to evaluate the query performance as a baseline. The following query selects records for five specific stations with report type CRN05:

-- Baseline
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

We ran this query 10 times. The average query runtime for 10 queries is 27.6 seconds, which is far longer than our target of 10 seconds, and 155.75 GB data is scanned to return 1.65 million records. This is the baseline performance of the original raw table. It’s time to start optimizing data layout from this baseline.

Next, you create tables with different conditions from the original: one without bucketing and one with bucketing, and compare them.

Optimize data layout using Athena CTAS

In this section, we use an Athena CTAS query to optimize data layout and its format.

First, let’s create a table with partitioning but without bucketing. The new table is partitioned by the column report_type because most of expected queries use this column in the WHERE clause, and objects are stored as Parquet with Snappy compression.

Open the Athena query editor.

Run the following query, providing your own S3 bucket and prefix:

--CTAS, non-bucketed
CREATE TABLE "bucketing_blog"."athena_non_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-non-bucketed/',
    partitioned_by = ARRAY['report_type'],
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT
    station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
FROM "bucketing_blog"."noaa_remote_original"
;

Your data should look like the following screenshots.

There are 30 files under the partition.

Next, you create a table with Hive style bucketing. The number of buckets needs to be carefully tuned through experiments for your own use case. Generally speaking, the more buckets you have, the smaller the granularity, which might result in better performance. On the other hand, too many small files may introduce inefficiency in query planning and processing. Also, bucketing only works if you are querying a few values of the bucketing key. The more values you add to your query, the more likely that you will end up reading all buckets.

The following is the baseline query to optimize:

-- Baseline
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."noaa_remote_original"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

In this example, the table is going to be bucketed into 16 buckets by a high-cardinality column (station), which is supposed to be used for the WHERE clause in the query. All other conditions remain the same. The baseline query has five values in the station ID, and you expect queries to have around that number at most, which is less enough than the number of buckets, so 16 should work well. It is possible to specify a larger number of buckets, but CTAS can’t be used if the total number of partitions exceeds 100.

Run the following query:

-- CTAS, Hive-bucketed
CREATE TABLE "bucketing_blog"."athena_bucketed"
WITH (
    external_location = 's3://<your-s3-location>/athena-bucketed/',
    partitioned_by = ARRAY['report_type'],
    bucketed_by = ARRAY['station'],
    bucket_count = 16,
    format = 'PARQUET',
    write_compression = 'SNAPPY'
)
AS
SELECT
    station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
FROM "bucketing_blog"."noaa_remote_original"
;

The query creates S3 objects organized as shown in the following screenshots.

The table-level layout looks exactly the same between athena_non_bucketed and athena_bucketed: there are 13 partitions in each table. The difference is the number of objects under the partitions. There are 16 objects (buckets) per partition, of roughly 10–25 MB each in this case. The number of buckets is constant at the specified value regardless of the amount of data, but the bucket size depends on the amount of data.

Now you’re ready to query against each table to evaluate query performance. The query will select records with five specific stations and report type CRN05 for the past 5 years. Although you can’t see which data of a specific station is located in which bucket, it has been calculated and located correctly by Athena.

Query the non-bucketed table with the following statement:

-- No bucketing 
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."athena_non_bucketed"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

We ran this query 10 times. The average runtime of the 10 queries is 10.95 seconds, and 358 MB of data is scanned to return 2.21 million records. Both the runtime and scan size have been significantly decreased because you’ve partitioned the data, and can now read only one partition where 12 partitions of 13 are skipped. In addition, the amount of data scanned has gone down from 206 GB to 360 MB, which is a reduction of 99.8%. This is not just due to the partitioning, but also due to the change of its format to Parquet and compression with Snappy.

Query the bucketed table with the following statement:

-- Hive bucketing
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."athena_bucketed"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

We ran this query 10 times. The average runtime of the 10 queries is 7.82 seconds, and 69 MB of data is scanned to return 2.21 million records. This means a reduction of average runtime from 10.95 to 7.82 seconds (-29%), and a dramatic reduction of data scanned from 358 MB to 69 MB (-81%) to return the same number of records compared with the non-bucketed table. In this case, both runtime and data scanned were improved by bucketing. This means bucketing contributed not only to performance but also to cost reduction.

Considerations

As stated earlier, size your bucket carefully to maximize performance of your query. Bucketing only works if you are querying a few values of the bucketing key. Consider creating more buckets than the number of values expected in the actual query.

Additionally, an Athena CTAS query is limited to create up to 100 partitions at one time. If you need a large number of partitions, you may want to use AWS Glue extract, transform, and load (ETL), although there is a workaround to split into multiple SQL statements.

Optimize data layout using AWS Glue ETL

Apache Spark is an open source distributed processing framework that enables flexible ETL with PySpark, Scala, and Spark SQL. It allows you to partition and bucket your data based on your requirements. Spark has several tuning options to accelerate jobs. You can effortlessly automate and monitor Spark jobs. In this section, we use AWS Glue ETL jobs to run Spark code to optimize data layout.

Unlike Athena bucketing, AWS Glue ETL uses Spark-based bucketing as a bucketing algorithm. All you need to do is add the following table property onto the table: bucketing_format = 'spark'. For details about this table property, see Partitioning and bucketing in Athena.

Complete the following steps to create a table with bucketing through AWS Glue ETL:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Choose Create job and choose Visual ETL.
Under Add nodes, choose AWS Glue Data Catalog for Sources.
For Database, choose bucketing_blog.
For Table, choose noaa_remote_original.
Under Add nodes, choose Change Schema for Transforms.
Under Add nodes, choose Custom Transform for Transforms.
For Name, enter ToS3WithBucketing.
For Node parents, choose Change Schema.

For Code block, enter the following code snippet:

def ToS3WithBucketing (glueContext, dfc) -> DynamicFrameCollection:
    # Convert DynamicFrame to DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()
    
    # Write to S3 with bucketing and partitioning
    df.repartition(1, "report_type") \
        .write.option("path", "s3://<your-s3-location>/glue-bucketed/") \
        .mode("overwrite") \
        .partitionBy("report_type") \
        .bucketBy(16, "station") \
        .format("parquet") \
        .option("compression", "snappy") \
        .saveAsTable("bucketing_blog.glue_bucketed")

The following screenshot shows the job created using AWS Glue Studio to generate a table and data.

Each node represents the following:

The AWS Glue Data Catalog node loads the noaa_remote_original table from the Data Catalog
The Change Schema node makes sure that it loads columns registered in the Data Catalog
The ToS3WithBucketing node writes data to Amazon S3 with both partitioning and Spark-based bucketing

The job has been successfully authored in the visual editor.

Under Job details, for IAM Role, choose your AWS Identity and Access Management (IAM) role for this job.
For Worker type, choose G.8X.
For Requested number of workers, enter 5.
Choose Save, then choose Run.

After these steps, the table glue_bucketed. has been created.

Choose Tables in the navigation pane, and choose the table glue_bucketed.
On the Actions menu, choose Edit table under Manage.
In the Table properties section, choose Add.
Add a key pair with key bucketing_format and value spark.
Choose Save.

Now it’s time to query the tables.

Query the bucketed table with the following statement:

-- Spark bucketing
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."glue_bucketed"
WHERE
    report_type = 'CRN05'
    AND ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

We ran the query 10 times. The average runtime of the 10 queries is 7.09 seconds, and 88 MB of data is scanned to return 2.21 million records. In this case, both the runtime and data scanned were improved by bucketing. This means bucketing contributed not only to performance but also to cost reduction.

The reason for the larger bytes scanned compared to the Athena CTAS example is that the values were distributed differently in this table. In the AWS Glue bucketed table, the values were distributed over five files. In the Athena CTAS bucketed table, the values were distributed over four files. Remember that rows are distributed into buckets using a hash function. The Spark bucketing algorithm uses a different hash function than Hive, and in this case, it resulted in a different distribution across the files.

Considerations

Glue DynamicFrame does not support bucketing natively. You need to use Spark DataFrame instead of DynamicFrame to bucket tables.

For information about fine-tuning AWS Glue ETL performance, refer to Best practices for performance tuning AWS Glue for Apache Spark jobs.

Optimize Iceberg data layout with hidden partitioning

Apache Iceberg is a high-performance open table format for huge analytic tables, bringing the reliability and simplicity of SQL tables to big data. Recently, there has been a huge demand to use Apache Iceberg tables to achieve advanced capabilities like ACID transaction, time travel query, and more.

In Iceberg, bucketing works differently than the Hive table method we’ve seen so far. In Iceberg, bucketing is a subset of partitioning, and can be applied using the bucket partition transform. The way you use it and the end result is similar to bucketing in Hive tables. For more details about Iceberg bucket transforms, refer to Bucket Transform Details.

Complete the following steps:

Open the Athena query editor.

Run the following query to create an Iceberg table with hidden partitioning along with bucketing:

-- CTAS, Iceberg-bucketed
CREATE TABLE "bucketing_blog"."athena_bucketed_iceberg"
WITH (table_type = 'ICEBERG',
      location = 's3://<your-s3-location>/athena-bucketed-iceberg/', 
      is_external = false,
      partitioning = ARRAY['report_type', 'bucket(station, 16)'],
      format = 'PARQUET',
      write_compression = 'SNAPPY'
) 
AS
SELECT
    station, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, vis, tmp, dew, slp, aj1, gf1, mw1, report_type
FROM "bucketing_blog"."noaa_remote_original"
;

Your data should look like the following screenshot.

There are two folders: data and metadata. Drill down to data.

You see random prefixes under the data folder. Choose the first one to view its details.

You see the top-level partition based on the report_type column. Drill down to the next level.

You see the second-level partition, bucketed with the station column.

The Parquet data files exist under these folders.

Query the bucketed table with the following statement:

-- Iceberg bucketing
SELECT station, report_type, date, source, latitude, longitude, elevation, name, call_sign, quality_control, wnd, cig, tmp
FROM "bucketing_blog"."athena_bucketed_iceberg"
WHERE
    report_type = 'CRN05'
    AND
    ( station = '99999904237'
        OR station = '99999953132'
        OR station = '99999903061'
        OR station = '99999963856'
        OR station = '99999994644'
    );

With the Iceberg-bucketed table, the average runtime of the 10 queries is 8.03 seconds, and 148 MB of data is scanned to return 2.21 million records. This is less efficient than bucketing with AWS Glue or Athena, but considering the benefits of Iceberg’s various features, it is within an acceptable range.

Results

The following table summarizes all the results.

.	noaa_remote_original	athena_non_bucketed	athena_bucketed	glue_bucketed	athena_bucketed_iceberg
Format	CSV	Parquet	Parquet	Parquet	Iceberg (Parquet)
Compression	n/a	Snappy	Snappy	Snappy	Snappy
Created via	n/a	Athena CTAS	Athena CTAS	Glue ETL	Athena CTAS with Iceberg
Engine	n/a	Trino	Trino	Apache Spark	Apache Iceberg
Table size (GB)	155.8	5.0	5.0	5.8	5.0
The number of S3 Objects	53360	376	192	192	195
Is partitioned?	Yes but with different way	Yes	Yes	Yes	Yes
Is bucketed?	No	No	Yes	Yes	Yes
Bucketing format	n/a	n/a	Hive	Spark	Iceberg
Number of buckets	n/a	n/a	16	16	16
Average runtime (sec)	29.178	10.950	7.815	7.089	8.030
Scanned size (MB)	206640.0	358.6	69.1	87.8	147.7

With athena_bucketed, glue_bucketed, and athena_bucketed_iceberg, you were able to meet the latency goal of 10 seconds. With bucketing, you saw a 25–40% reduction in runtime and a 60–85% reduction in scan size, which can contribute to both latency and cost optimization.

As you can see from the result, although partitioning contributes significantly to reduce both runtime and scan size, bucketing can also contribute to reduce them further.

Athena CTAS is straightforward and fast enough to complete the bucketing process. AWS Glue ETL is more flexible and scalable to achieve advanced use cases. You can choose either method based on your requirement and use case, because you can take advantage of bucketing through either option.

Conclusion

In this post, we demonstrated how to optimize your table data layout with partitioning and bucketing through Athena CTAS and AWS Glue ETL. We showed that bucketing contributes to accelerating query latency and reducing scan size to further optimize costs. We also discussed bucketing for Iceberg tables through hidden partitioning.

Bucketing just one technique to optimize data layout by reducing data scan. For optimizing your entire data layout, we recommend considering other options like partitioning, using columnar file format, and compression in conjunction with bucketing. This can enable your data to further enhance query performance.

Happy bucketing!

About the Authors

Takeshi Nakatani is a Principal Big Data Consultant on the Professional Services team in Tokyo. He has 26 years of experience in the IT industry, with expertise in architecting data infrastructure. On his days off, he can be a rock drummer or a motorcyclist.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Multicloud data lake analytics with Amazon Athena

2024-03-18 Shoukat Ghouse

Post Syndicated from Shoukat Ghouse original https://aws.amazon.com/blogs/big-data/multicloud-data-lake-analytics-with-amazon-athena/

Many organizations operate data lakes spanning multiple cloud data stores. This could be for various reasons, such as business expansions, mergers, or specific cloud provider preferences for different business units. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. With a unified query interface, you can avoid the complexity of managing multiple query tools and gain a holistic view of your data assets regardless of where the data assets reside. You can consolidate your analytics workflows, reducing the need for extensive tooling and infrastructure management. This consolidation not only saves time and resources but also enables teams to focus more on deriving insights from data rather than navigating through various query tools and interfaces. A unified query interface promotes a holistic view of data assets by breaking down silos and facilitating seamless access to data stored across different cloud data stores. This comprehensive view enhances decision-making capabilities by empowering stakeholders to analyze data from multiple sources in a unified manner, leading to more informed strategic decisions.

In this post, we delve into the ways in which you can use Amazon Athena connectors to efficiently query data files residing across Azure Data Lake Storage (ADLS) Gen2, Google Cloud Storage (GCS), and Amazon Simple Storage Service (Amazon S3). Additionally, we explore the use of Athena workgroups and cost allocation tags to effectively categorize and analyze the costs associated with running analytical queries.

Solution overview

Imagine a fictional company named Oktank, which manages its data across data lakes on Amazon S3, ADLS, and GCS. Oktank wants to be able to query any of their cloud data stores and run analytical queries like joins and aggregations across the data stores without needing to transfer data to an S3 data lake. Oktank also wants to identify and analyze the costs associated with running analytics queries. To achieve this, Oktank envisions a unified data query layer using Athena.

The following diagram illustrates the high-level solution architecture.

Users run their queries from Athena connecting to specific Athena workgroups. Athena uses connectors to federate the queries across multiple data sources. In this case, we use the Amazon Athena Azure Synapse connector to query data from ADLS Gen2 via Synapse and the Amazon Athena GCS connector for GCS. An Athena connector is an extension of the Athena query engine. When a query runs on a federated data source using a connector, Athena invokes multiple AWS Lambda functions to read from the data sources in parallel to optimize performance. Refer to Using Amazon Athena Federated Query for further details. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

In the following sections, we demonstrate how to build this architecture.

Prerequisites

Before you configure your resources on AWS, you need to set up the necessary infrastructure required for this post in both Azure and GCP. The detailed steps and guidelines for creating the resources in Azure and GCP are beyond the scope of this post. Refer to the respective documentation for details. In this section, we provide some basic steps needed to create the resources required for the post.

You can download the sample data file cust_feedback_v0.csv.

Configure the dataset for Azure

To set up the sample dataset for Azure, log in to the Azure portal and upload the file to ADLS Gen2. The following screenshot shows the file under the container blog-container under a specific storage account on ADLS Gen2.

Set up a Synapse workspace in Azure and create an external table in Synapse that points to the relevant location. The following commands offer a foundational guide for running the necessary actions within the Synapse workspace to create the essential resources for this post. Refer to the corresponding Synapse documentation for additional details as required.

# Create Database
CREATE DATABASE azure_athena_blog_db
# Create file format
CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat]
WITH ( FORMAT_TYPE = DELIMITEDTEXT ,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
USE_TYPE_DEFAULT = FALSE,
FIRST_ROW = 2
))

# Create key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '*******;

# Create Database credential
CREATE DATABASE SCOPED CREDENTIAL dbscopedCreds
WITH IDENTITY = 'Managed Identity';

# Create Data Source
CREATE EXTERNAL DATA SOURCE athena_blog_datasource
WITH ( LOCATION = 'abfss://[email protected]/',
CREDENTIAL = dbscopedCreds
)

# Create External Table
CREATE EXTERNAL TABLE dbo.customer_feedbacks_azure (
[data_key] nvarchar(4000),
[data_load_date] nvarchar(4000),
[data_location] nvarchar(4000),
[product_id] nvarchar(4000),
[customer_email] nvarchar(4000),
[customer_name] nvarchar(4000),
[comment1] nvarchar(4000),
[comment2] nvarchar(4000)
)
WITH (
LOCATION = 'cust_feedback_v0.csv',
DATA_SOURCE = athena_blog_datasource,
FILE_FORMAT = [SynapseDelimitedTextFormat]
);

# Create User
CREATE LOGIN bloguser1 WITH PASSWORD = '****';
CREATE USER bloguser1 FROM LOGIN bloguser1;

# Grant select on the Schema
GRANT SELECT ON SCHEMA::dbo TO [bloguser1];

Note down the user name, password, database name, and the serverless or dedicated SQL endpoint you use—you need these in the subsequent steps.

This completes the setup on Azure for the sample dataset.

Configure the dataset for GCS

To set up the sample dataset for GCS, upload the file to the GCS bucket.

Create a GCP service account and grant access to the bucket.

In addition, create a JSON key for the service account. The content of the key is needed in subsequent steps.

This completes the setup on GCP for our sample dataset.

Deploy the AWS infrastructure

You can now run the provided AWS CloudFormation stack to create the solution resources. Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.

Refer to the following table for the necessary parameters that you must provide. You can leave other parameters at their default values or modify them according to your requirement.

Parameter Name	Expected Value
`AzureSynapseUserName`	User name for the Synapse database you created.
`AzureSynapsePwd`	Password for the Synapse database user.
`AzureSynapseURL`	Synapse JDBC URL, in the following format: `jdbc:sqlserver://<sqlendpoint>;databaseName=<databasename>` For example: `jdbc:sqlserver://xxxxg-ondemand.sql.azuresynapse.net;databaseName=azure_athena_blog_db`
`GCSSecretKey`	Content from the secret key file from GCP.
`UserAzureADLSOnlyUserPassword`	AWS Management Console password for the Azure-only user. This user can only query data from ADLS.
`UserGCSOnlyUserPassword`	AWS Management Console password for the GCS-only user. This user can only query data from GCP GCS.
`UserMultiCloudUserPassword`	AWS Management Console password for the multi-cloud user. This user can query data from any of the cloud stores.

The stack provisions the VPC, subnets, S3 buckets, Athena workgroups, and AWS Glue database and tables. It creates two secrets in AWS Secrets Manager to store the GCS secret key and the Synapse user name and password. You use these secrets when creating the Athena connectors.

The stack also creates three AWS Identity and Access Management (IAM) users and grants permissions on corresponding Athena workgroups, Athena data sources, and Lambda functions: AzureADLSUser, which can run queries on ADLS and Amazon S3, GCPGCSUser, which can query GCS and Amazon S3, and MultiCloudUser, which can query Amazon S3, Azure ADLS Gen2 and GCS data sources. The stack does not create the Athena data source and Lambda functions. You create these in subsequent steps when you create the Athena connectors.

The stack also attaches cost allocation tags to the Athena workgroups, the secrets in Secrets Manager, and the S3 buckets. You use these tags for cost analysis in subsequent steps.

When the stack deployment is complete, note the values of the CloudFormation stack outputs, which you use in subsequent steps.

Upload the data file to the S3 bucket created by the CloudFormation stack. You can retrieve the bucket name from the value of the key named S3SourceBucket from the stack output. This serves as the S3 data lake data for this post.

You can now create the connectors.

Create the Athena Synapse connector

To set up the Azure Synapse connector, complete the following steps:

On the Lambda console, create a new application.
In the Application settings section, enter the values for the corresponding key from the output of the CloudFormation stack, as listed in the following table.

Property Name	CloudFormation Output Key
`SecretNamePrefix`	`AzureSecretName`
`DefaultConnectionString`	`AzureSynapseConnectorJDBCURL`
`LambdaFunctionName`	`AzureADLSLambdaFunctionName`
`SecurityGroupIds`	`SecurityGroupId`
`SpillBucket`	`AthenaLocationAzure`
`SubnetIds`	`PrivateSubnetId`

Select the Acknowledgement check box and choose Deploy.

Wait for the application to be deployed before proceeding to the next step.

Create the Athena GCS connector

To create the Athena GCS connector, complete the following steps:

On the Lambda console, create a new application.
In the Application settings section, enter the values for the corresponding key from the output of the CloudFormation stack, as listed in the following table.

Property Name	CloudFormation Output Key
`SpillBucket`	`AthenaLocationGCP`
`GCSSecretName`	`GCSSecretName`
`LambdaFunctionName`	`GCSLambdaFunctionName`

Select the Acknowledgement check box and choose Deploy.

For the GCS connector, there are some post-deployment steps to create the AWS Glue database and table for the GCS data file. In this post, the CloudFormation stack you deployed earlier already created these resources, so you don’t have to create it. The stack created an AWS Glue database called oktank_multicloudanalytics_gcp and a table called customer_feedbacks under the database with the required configurations.

Next, you create the Athena data sources corresponding to these connectors.

Create the Azure data source

Complete the following steps to create your Azure data source:

On the Athena console, create a new data source.
For Data sources, select Microsoft Azure Synapse.
Choose Next.

For Data source name, enter the value for the AthenaFederatedDataSourceNameForAzure key from the CloudFormation stack output.
In the Connection details section, choose Lambda function you created earlier for Azure.

Choose Next, then choose Create data source.

You should be able to see the associated schemas for the Azure external database.

Create the GCS data source

Complete the following steps to create your Azure data source:

On the Athena console, create a new data source.
For Data sources, select Google Cloud Storage.
Choose Next.

For Data source name, enter the value for the AthenaFederatedDataSourceNameForGCS key from the CloudFormation stack output.
In the Connection details section, choose Lambda function you created earlier for GCS.

Choose Next, then choose Create data source.

This completes the deployment. You can now run the multi-cloud queries from Athena.

Query the federated data sources

In this section, we demonstrate how to query the data sources using the ADLS user, GCS user, and multi-cloud user.

Run queries as the ADLS user

The ADLS user can run multi-cloud queries on ADLS Gen2 and Amazon S3 data. Complete the following steps:

Get the value for UserAzureADLSUser from the CloudFormation stack output.

Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-azure-wg in the Athena query editor.

Choose Acknowledge to accept the workgroup settings.

Run the following query to join the S3 data lake table to the ADLS data lake table:

SELECT a.data_load_date as azure_load_date, b.data_key as s3_data_key, a.data_location as azure_data_location FROM "azure_adls_ds"."dbo"."customer_feedbacks_azure" a join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b ON cast(a.data_key as integer) = b.data_key

Run queries as the GCS user

The GCS user can run multi-cloud queries on GCS and Amazon S3 data. Complete the following steps:

Get the value for UserGCPGCSUser from the CloudFormation stack output.
Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-gcp-wg in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query to join the S3 data lake table to the GCS data lake table:

SELECT a.data_load_date as gcs_load_date, b.data_key as s3_data_key, a.data_location as gcs_data_location FROM "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" a
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b 
ON a.data_key = b.data_key

Run queries as the multi-cloud user

The multi-cloud user can run queries that can access data from any cloud store. Complete the following steps:

Get the value for UserMultiCloudUser from the CloudFormation stack output.
Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-multi-wg in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query to join data across the multiple cloud stores:

SELECT a.data_load_date as adls_load_date, b.data_key as s3_data_key, c.data_location as gcs_data_location 
FROM "azure_adls_ds"."dbo"."CUSTOMER_FEEDBACKS_AZURE" a 
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b 
on cast(a.data_key as integer) = b.data_key join "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" c 
on b.data_key = c.data_key

Cost analysis with cost allocation tags

When you run multi-cloud queries, you need to carefully consider the data transfer costs associated with each cloud provider. Refer to the corresponding cloud documentation for details. The cost reports highlighted in this section refer to the AWS infrastructure and service usage costs. The storage and other associated costs with ADLS, Synapse, and GCS are not included.

Let’s see how to handle cost analysis for the multiple scenarios we have discussed.

The CloudFormation stack you deployed earlier added user-defined cost allocation tags, as shown in the following screenshot.

Sign in to AWS Billing and Cost Management console and enable these cost allocation tags. It may take up to 24 hours for the cost allocation tags to be available and reflected in AWS Cost Explorer.

To track the cost of the Lambda functions deployed as part of the GCS and Synapse connectors, you can use the AWS generated cost allocation tags, as shown in the following screenshot.

You can use these tags on the Billing and Cost Management console to determine the cost per tag. We provide some sample screenshots for reference. These reports only show the cost of AWS resources used to access ADLS Gen2 or GCP GCS. The reports do not show the cost of GCP or Azure resources.

Athena costs

To view Athena costs, choose the tag athena-mc-analytics:athena:workgroup and filter the tags values azure, gcp, and multi.

You can also use workgroups to set limits on the amount of data each workgroup can process to track and control cost. For more information, refer to Using workgroups to control query access and costs and Separate queries and managing costs using Amazon Athena workgroups.

Amazon S3 costs

To view the costs for Amazon S3 storage (Athena query results and spill storage), choose the tag athena-mc-analytics:s3:result-spill and filter the tag values azure, gcp, and multi.

Lambda costs

To view the costs for the Lambda functions, choose the tag aws:cloudformation:stack-name and filter the tag values serverlessepo-AthenaSynapseConnector and serverlessepo-AthenaGCSConnector.

Cost allocation tags help manage and track costs effectively when you’re running multi-cloud queries. This can help you track, control, and optimize your spending while taking advantage of the benefits of multi-cloud data analytics.

Clean up

To avoid incurring further charges, delete the CloudFormation stacks to delete the resources you provisioned as part of this post. There are two additional stacks deployed for each connector: serverlessrepo-AthenaGCSConnector and serverlessrepo-AthenaSynapseConnector. Delete all three stacks.

Conclusion

In this post, we discussed a comprehensive solution for organizations looking to implement multi-cloud data lake analytics using Athena, enabling a consolidated view of data across diverse cloud data stores and enhancing decision-making capabilities. We focused on querying data lakes across Amazon S3, Azure Data Lake Storage Gen2, and Google Cloud Storage using Athena. We demonstrated how to set up resources on Azure, GCP, and AWS, including creating databases, tables, Lambda functions, and Athena data sources. We also provided instructions for querying federated data sources from Athena, demonstrating how you can run multi-cloud queries tailored to your specific needs. Lastly, we discussed cost analysis using AWS cost allocation tags.

For further reading, refer to the following resources:

About the Author

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.

Gain insights from historical location data using Amazon Location Service and AWS analytics services

2024-03-13 Alan Peaty

Post Syndicated from Alan Peaty original https://aws.amazon.com/blogs/big-data/gain-insights-from-historical-location-data-using-amazon-location-service-and-aws-analytics-services/

Many organizations around the world rely on the use of physical assets, such as vehicles, to deliver a service to their end-customers. By tracking these assets in real time and storing the results, asset owners can derive valuable insights on how their assets are being used to continuously deliver business improvements and plan for future changes. For example, a delivery company operating a fleet of vehicles may need to ascertain the impact from local policy changes outside of their control, such as the announced expansion of an Ultra-Low Emission Zone (ULEZ). By combining historical vehicle location data with information from other sources, the company can devise empirical approaches for better decision-making. For example, the company’s procurement team can use this information to make decisions about which vehicles to prioritize for replacement before policy changes go into effect.

Developers can use the support in Amazon Location Service for publishing device position updates to Amazon EventBridge to build a near-real-time data pipeline that stores locations of tracked assets in Amazon Simple Storage Service (Amazon S3). Additionally, you can use AWS Lambda to enrich incoming location data with data from other sources, such as an Amazon DynamoDB table containing vehicle maintenance details. Then a data analyst can use the geospatial querying capabilities of Amazon Athena to gain insights, such as the number of days their vehicles have operated in the proposed boundaries of an expanded ULEZ. Because vehicles that do not meet ULEZ emissions standards are subjected to a daily charge to operate within the zone, you can use the location data, along with maintenance data such as age of the vehicle, current mileage, and current emissions standards to estimate the amount the company would have to spend on daily fees.

This post shows how you can use Amazon Location, EventBridge, Lambda, Amazon Data Firehose, and Amazon S3 to build a location-aware data pipeline, and use this data to drive meaningful insights using AWS Glue and Athena.

Overview of solution

This is a fully serverless solution for location-based asset management. The solution consists of the following interfaces:

IoT or mobile application – A mobile application or an Internet of Things (IoT) device allows the tracking of a company vehicle while it is in use and transmits its current location securely to the data ingestion layer in AWS. The ingestion approach is not in scope of this post. Instead, a Lambda function in our solution simulates sample vehicle journeys and directly updates Amazon Location tracker objects with randomized locations.
Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. Data analysts are looking for answers to questions such as, “How long did a given vehicle historically spend inside a proposed zone, and how much would the fees have cost had the policy been in place over the past 12 months?”

The following diagram illustrates the solution architecture.
Architecture diagram

The workflow consists of the following key steps:

The tracking functionality of Amazon Location is used to track the vehicle. Using EventBridge integration, filtered positional updates are published to an EventBridge event bus. This solution uses distance-based filtering to reduce costs and jitter. Distanced-based filtering ignores location updates in which devices have moved less than 30 meters (98.4 feet).
Amazon Location device position events arrive on the EventBridge default bus with source: ["aws.geo"] and detail-type: ["Location Device Position Event"]. One rule is created to forward these events to two downstream targets: a Lambda function, and a Firehose delivery stream.
Two different patterns, based on each target, are described in this post to demonstrate different approaches to committing the data to a S3 bucket:
1. Lambda function – The first approach uses a Lambda function to demonstrate how you can use code in the data pipeline to directly transform the incoming location data. You can modify the Lambda function to fetch additional vehicle information from a separate data store (for example, a DynamoDB table or a Customer Relationship Management system) to enrich the data, before storing the results in an S3 bucket. In this model, the Lambda function is invoked for each incoming event.
2. Firehose delivery stream – The second approach uses a Firehose delivery stream to buffer and batch the incoming positional updates, before storing them in an S3 bucket without modification. This method uses GZIP compression to optimize storage consumption and query performance. You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches.
AWS Glue crawls both S3 bucket paths, populates the AWS Glue database tables based on the inferred schemas, and makes the data available to other analytics applications through the AWS Glue Data Catalog.
Athena is used to run geospatial queries on the location data stored in the S3 buckets. The Data Catalog provides metadata that allows analytics applications using Athena to find, read, and process the location data stored in Amazon S3.
This solution includes a Lambda function that continuously updates the Amazon Location tracker with simulated location data from fictitious journeys. The Lambda function is triggered at regular intervals using a scheduled EventBridge rule.

You can test this solution yourself using the AWS Samples GitHub repository. The repository contains the AWS Serverless Application Model (AWS SAM) template and Lambda code required to try out this solution. Refer to the instructions in the README file for steps on how to provision and decommission this solution.

Visual layouts in some screenshots in this post may look different than those on your AWS Management Console.

Data generation

In this section, we discuss the steps to manually or automatically generate journey data.

Manually generate journey data

You can manually update device positions using the AWS Command Line Interface (AWS CLI) command aws location batch-update-device-position. Replace the tracker-name, device-id, Position, and SampleTime values with your own, and make sure that successive updates are more than 30 meters in distance apart to place an event on the default EventBridge event bus:

aws location batch-update-device-position --tracker-name <tracker-name> --updates "[{\"DeviceId\": \"<device-id>\", \"Position\": [<longitude>, <latitude>], \"SampleTime\": \"<YYYY-MM-DDThh:mm:ssZ>\"}]"

Automatically generate journey data using the simulator

The provided AWS CloudFormation template deploys an EventBridge scheduled rule and an accompanying Lambda function that simulates tracker updates from vehicles. This rule is enabled by default, and runs at a frequency specified by the SimulationIntervalMinutes CloudFormation parameter. The data generation Lambda function updates the Amazon Location tracker with a randomized position offset from the vehicles’ base locations.

Vehicle names and base locations are stored in the vehicles.json file. A vehicle’s starting position is reset each day, and base locations have been chosen to give them the ability to drift in and out of the ULEZ on a given day to provide a realistic journey simulation.

You can disable the rule temporarily by navigating to the scheduled rule details on the EventBridge console. Alternatively, change the parameter State: ENABLED to State: DISABLED for the scheduled rule resource GenerateDevicePositionsScheduleRule in the template.yml file. Rebuild and re-deploy the AWS SAM template for this change to take effect.

Location data pipeline approaches

The configurations outlined in this section are deployed automatically by the provided AWS SAM template. The information in this section is provided to describe the pertinent parts of the solution.

Amazon Location device position events

Amazon Location sends device position update events to EventBridge in the following format:

{
    "version":"0",
    "id":"<event-id>",
    "detail-type":"Location Device Position Event",
    "source":"aws.geo",
    "account":"<account-number>",
    "time":"<YYYY-MM-DDThh:mm:ssZ>",
    "region":"<region>",
    "resources":[
        "arn:aws:geo:<region>:<account-number>:tracker/<tracker-name>"
    ],
    "detail":{
        "EventType":"UPDATE",
        "TrackerName":"<tracker-name>",
        "DeviceId":"<device-id>",
        "SampleTime":"<YYYY-MM-DDThh:mm:ssZ>",
        "ReceivedTime":"<YYYY-MM-DDThh:mm:ss.sssZ>",
        "Position":[
            <longitude>, 
            <latitude>
	]
    }
}

You can optionally specify an input transformation to modify the format and contents of the device position event data before it reaches the target.

Data enrichment using Lambda

Data enrichment in this pattern is facilitated through the invocation of a Lambda function. In this example, we call this function ProcessDevicePosition, and use a Python runtime. A custom transformation is applied in the EventBridge target definition to receive the event data in the following format:

{
    "EventType":<EventType>,
    "TrackerName":<TrackerName>,
    "DeviceId":<DeviceId>,
    "SampleTime":<SampleTime>,
    "ReceivedTime":<ReceivedTime>,
    "Position":[<Longitude>,<Latitude>]
}

You could apply additional transformations, such as the refactoring of Latitude and Longitude data into separate key-value pairs if this is required by the downstream business logic processing the events.

The following code demonstrates the Python application logic that is run by the ProcessDevicePosition Lambda function. Error handling has been skipped in this code snippet for brevity. The full code is available in the GitHub repo.

import json
import os
import uuid
import boto3

# Import environment variables from Lambda function.
bucket_name = os.environ["S3_BUCKET_NAME"]
bucket_prefix = os.environ["S3_BUCKET_LAMBDA_PREFIX"]

s3 = boto3.client("s3")

def lambda_handler(event, context):
    key = "%s/%s/%s-%s.json" % (bucket_prefix,
                                event["DeviceId"],
                                event["SampleTime"],
                                str(uuid.uuid4())
    body = json.dumps(event, separators=(",", ":"))
    body_encoded = body.encode("utf-8")
    s3.put_object(Bucket=bucket_name, Key=key, Body=body_encoded)
    return {
        "statusCode": 200,
        "body": "success"
    }

The preceding code creates an S3 object for each device position event received by EventBridge. The code uses the DeviceId as a prefix to write the objects to the bucket.

You can add additional logic to the preceding Lambda function code to enrich the event data using other sources. The example in the GitHub repo demonstrates enriching the event with data from a DynamoDB vehicle maintenance table.

In addition to the prerequisite AWS Identity and Access Management (IAM) permissions provided by the role AWSBasicLambdaExecutionRole, the ProcessDevicePosition function requires permissions to perform the S3 put_object action and any other actions required by the data enrichment logic. IAM permissions required by the solution are documented in the template.yml file.

{
    "Version":"2012-10-17",
    "Statement":[
        {
            "Action":[
                "s3:ListBucket"
            ],
            "Resource":[
                "arn:aws:s3:::<S3_BUCKET_NAME>"
            ],
            "Effect":"Allow"
        },
        {
            "Action":[
                "s3:PutObject"
            ],
            "Resource":[
                "arn:aws:s3:::<S3_BUCKET_NAME>/<S3_BUCKET_LAMBDA_PREFIX>/*"
            ],
            "Effect":"Allow"
        }
    ]
}

Data pipeline using Amazon Data Firehose

Complete the following steps to create your Firehose delivery stream:

On the Amazon Data Firehose console, choose Firehose streams in the navigation pane.
Choose Create Firehose stream.
For Source, choose as Direct PUT.
For Destination, choose Amazon S3.
For Firehose stream name, enter a name (for this post, ProcessDevicePositionFirehose).
Configure the destination settings with details about the S3 bucket in which the location data is stored, along with the partitioning strategy:
1. Use <S3_BUCKET_NAME> and <S3_BUCKET_FIREHOSE_PREFIX> to determine the bucket and object prefixes.
2. Use DeviceId as an additional prefix to write the objects to the bucket.
Enable Dynamic partitioning and New line delimiter to make sure partitioning is automatic based on DeviceId, and that new line delimiters are added between records in objects that are delivered to Amazon S3.

These are required by AWS Glue to later crawl the data, and for Athena to recognize individual records.
Destination settings for Firehose stream

Create an EventBridge rule and attach targets

The EventBridge rule ProcessDevicePosition defines two targets: the ProcessDevicePosition Lambda function, and the ProcessDevicePositionFirehose delivery stream. Complete the following steps to create the rule and attach targets:

On the EventBridge console, create a new rule.
For Name, enter a name (for this post, ProcessDevicePosition).
For Event bus¸ choose default.
For Rule type¸ select Rule with an event pattern.
For Event source, select AWS events or EventBridge partner events.
For Method, select Use pattern form.
In the Event pattern section, specify AWS services as the source, Amazon Location Service as the specific service, and Location Device Position Event as the event type.
For Target 1, attach the ProcessDevicePosition Lambda function as a target.
We use Input transformer to customize the event that is committed to the S3 bucket.

Configure Input paths map and Input template to organize the payload into the desired format.

The following code is the input paths map:

{
    EventType: $.detail.EventType
    TrackerName: $.detail.TrackerName
    DeviceId: $.detail.DeviceId
    SampleTime: $.detail.SampleTime
    ReceivedTime: $.detail.ReceivedTime
    Longitude: $.detail.Position[0]
    Latitude: $.detail.Position[1]
}

The following code is the input template:

{
    "EventType":<EventType>,
    "TrackerName":<TrackerName>,
    "DeviceId":<DeviceId>,
    "SampleTime":<SampleTime>,
    "ReceivedTime":<ReceivedTime>,
    "Position":[<Longitude>, <Latitude>]
}

For Target 2, choose the ProcessDevicePositionFirehose delivery stream as a target.

This target requires an IAM role that allows one or multiple records to be written to the Firehose delivery stream:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "firehose:PutRecord",
                "firehose:PutRecords"
            ],
            "Resource": [
                "arn:aws:firehose:<region>:<account-id>:deliverystream/<delivery-stream-name>"
            ],
            "Effect": "Allow"
        }
    ]
}

Crawl and catalog the data using AWS Glue

After sufficient data has been generated, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Select the crawlers that have been created, location-analytics-glue-crawler-lambda and location-analytics-glue-crawler-firehose.
Choose Run.

The crawlers will automatically classify the data into JSON format, group the records into tables and partitions, and commit associated metadata to the AWS Glue Data Catalog.

When the Last run statuses of both crawlers show as Succeeded, confirm that two tables (lambda and firehose) have been created on the Tables page.

The solution partitions the incoming location data based on the deviceid field. Therefore, as long as there are no new devices or schema changes, the crawlers don’t need to run again. However, if new devices are added, or a different field is used for partitioning, the crawlers need to run again.
Tables

You’re now ready to query the tables using Athena.

Query the data using Athena

Athena is a serverless, interactive analytics service built to analyze unstructured, semi-structured, and structured data where it is hosted. If this is your first time using the Athena console, follow the instructions to set up a query result location in Amazon S3. To query the data with Athena, complete the following steps:

On the Athena console, open the query editor.
For Data source, choose AwsDataCatalog.
For Database, choose location-analytics-glue-database.
On the options menu (three vertical dots), choose Preview Table to query the content of both tables.

The query displays 10 sample positional records currently stored in the table. The following screenshot is an example from previewing the firehose table. The firehose table stores raw, unmodified data from the Amazon Location tracker.
Query results
You can now experiment with geospatial queries.The GeoJSON file for the 2021 London ULEZ expansion is part of the repository, and has already been converted into a query compatible with both Athena tables.

Copy and paste the content from the 1-firehose-athena-ulez-2021-create-view.sql file found in the examples/firehose folder into the query editor.

This query uses the ST_Within geospatial function to determine if a recorded position is inside or outside the ULEZ zone defined by the polygon. A new view called ulezvehicleanalysis_firehose is created with a new column, insidezone, which captures whether the recorded position exists within the zone.

A simple Python utility is provided, which converts the polygon features found in the downloaded GeoJSON file into ST_Polygon strings based on the well-known text format that can be used directly in an Athena query.

Choose Preview View on the ulezvehicleanalysis_firehose view to explore its content.

You can now run queries against this view to gain overarching insights.

Copy and paste the content from the 2-firehose-athena-ulez-2021-query-days-in-zone.sql file found in the examples/firehose folder into the query editor.

This query establishes the total number of days each vehicle has entered ULEZ, and what the expected total charges would be. The query has been parameterized using the ? placeholder character. Parameterized queries allow you to rerun the same query with different parameter values.

Enter the daily fee amount for Parameter 1, then run the query.

The results display each vehicle, the total number of days spent in the proposed ULEZ, and the total charges based on the daily fee you entered.
Query results
You can repeat this exercise using the lambda table. Data in the lambda table is augmented with additional vehicle details present in the vehicle maintenance DynamoDB table at the time it is processed by the Lambda function. The solution supports the following fields:

MeetsEmissionStandards (Boolean)
Mileage (Number)
PurchaseDate (String, in YYYY-MM-DD format)

You can also enrich the new data as it arrives.

On the DynamoDB console, find the vehicle maintenance table under Tables. The table name is provided as output VehicleMaintenanceDynamoTable in the deployed CloudFormation stack.
Choose Explore table items to view the content of the table.
Choose Create item to create a new record for a vehicle.
Enter DeviceId (such as vehicle1 as a String), PurchaseDate (such as 2005-10-01 as a String), Mileage (such as 10000 as a Number), and MeetsEmissionStandards (with a value such as False as Boolean).
Choose Create item to create the record.
Duplicate the newly created record with additional entries for other vehicles (such as for vehicle2 or vehicle3), modifying the values of the attributes slightly each time.
Rerun the location-analytics-glue-crawler-lambda AWS Glue crawler after new data has been generated to confirm that the update to the schema with new fields is registered.
Copy and paste the content from the 1-lambda-athena-ulez-2021-create-view.sql file found in the examples/lambda folder into the query editor.
Preview the ulezvehicleanalysis_lambda view to confirm that the new columns have been created.

If errors such as Column 'mileage' cannot be resolved are displayed, the data enrichment is not taking place, or the AWS Glue crawler has not yet detected updates to the schema.

If the Preview table option is only returning results from before you created records in the DynamoDB table, return the query results in descending order using sampletime (for example, order by sampletime desc limit 100;).

Now we focus on the vehicles that don’t currently meet emissions standards, and order the vehicles in descending order based on the mileage per year (calculated using the latest mileage / age of vehicle in years).

Copy and paste the content from the 2-lambda-athena-ulez-2021-query-days-in-zone.sql file found in the examples/lambda folder into the query editor.

In this example, we can see that out of our fleet of vehicles, five have been reported as not meeting emission standards. We can also see the vehicles that have accumulated high mileage per year, and the number of days spent in the proposed ULEZ. The fleet operator may now decide to prioritize these vehicles for replacement. Because location data is enriched with the most up-to-date vehicle maintenance data at the time it is ingested, you can further evolve these queries to run over a defined time window. For example, you could factor in mileage changes within the past year.

Due to the dynamic nature of the data enrichment, any new data being committed to Amazon S3, along with the query results, will be altered as and when records are updated in the DynamoDB vehicle maintenance table.

Clean up

Refer to the instructions in the README file to clean up the resources provisioned for this solution.

Conclusion

This post demonstrated how you can use Amazon Location, EventBridge, Lambda, Amazon Data Firehose, and Amazon S3 to build a location-aware data pipeline, and use the collected device position data to drive analytical insights using AWS Glue and Athena. By tracking these assets in real time and storing the results, companies can derive valuable insights on how effectively their fleets are being utilized and better react to changes in the future. You can now explore extending this sample code with your own device tracking data and analytics requirements.

About the Authors

Alan Peaty is a Senior Partner Solutions Architect at AWS. Alan helps Global Systems Integrators (GSIs) and Global Independent Software Vendors (GISVs) solve complex customer challenges using AWS services. Prior to joining AWS, Alan worked as an architect at systems integrators to translate business requirements into technical solutions. Outside of work, Alan is an IoT enthusiast and a keen runner who loves to hit the muddy trails of the English countryside.

Parag Srivastava is a Solutions Architect at AWS, helping enterprise customers with successful cloud adoption and migration. During his professional career, he has been extensively involved in complex digital transformation projects. He is also passionate about building innovative solutions around geospatial aspects of addresses.

Bring your workforce identity to Amazon EMR Studio and Athena

2024-03-05 Manjit Chakraborty

Post Syndicated from Manjit Chakraborty original https://aws.amazon.com/blogs/big-data/bring-your-workforce-identity-to-amazon-emr-studio-and-athena/

Customers today may struggle to implement proper access controls and auditing at the user level when multiple applications are involved in data access workflows. The key challenge is to implement proper least-privilege access controls based on user identity when one application accesses data on behalf of the user in another application. It forces you to either give all users broad access through the application with no auditing, or try to implement complex bespoke solutions to map roles to users.

Using AWS IAM Identity Center, you can now propagate user identity to a set of AWS services and minimize the need to build and maintain complex custom systems to vend roles between applications. IAM Identity Center also provides a consolidated view of users and groups in one place that the interconnected applications can use for authorization and auditing.

IAM Identity Center enables centralized management of user access to AWS accounts and applications using identity providers (IDPs) like Okta. This allows users to log in one time with their existing corporate credentials and seamlessly access downstream AWS services supporting identity propagation. With IAM Identity Center, Okta user identities and groups can be automatically synced using SCIM 2.0 for accurate user information in AWS.

Amazon EMR Studio is a unified data analysis environment where you can develop data engineering and data science applications. You can now develop and run interactive queries on Amazon Athena from EMR Studio (for more details, refer to Amazon EMR Studio adds interactive query editor powered by Amazon Athena ). Athena users can access EMR Studio without logging in to the AWS Management Console by enabling federated access from your IdP via IAM Identity Center. This removes the complexity of maintaining different identities and mapping user roles across your IdP, EMR Studio, and Athena.

You can govern Athena workgroups based on user attributes from Okta to control query access and costs. AWS Lake Formation can also use Okta identities to enforce fine-grained access controls through granting and revoking permissions.

IAM Identity Center and Okta single sign-on (SSO) integration streamlines access to EMR Studio and Athena with centralized authentication. Users can have a familiar sign-in experience with their workforce credentials to securely run queries in Athena. Access policies on Athena workgroups and Lake Formation permissions provide governance based on Okta user profiles.

This blog post explains how to enable single sign-on to EMR Studio using IAM Identity Center integration with Okta. It shows how to propagate Okta identities to Athena and Lake Formation to provide granular access controls on queries and data. The solution streamlines access to analytics tools with centralized authentication using workforce credentials. It leverages AWS IAM Identity Center, Amazon EMR Studio, Amazon Athena, and AWS Lake Formation.

Solution overview

IAM Identity Center allows users to connect to EMR Studio without needing administrators to manually configure AWS Identity and Access Management (IAM) roles and permissions. It enables mapping of IAM Identity Center groups to existing corporate identity roles and groups. Admins can then assign privileges to roles and groups and assign users to them, enabling granular control over user access. IAM Identity Center provides a central repository of all users in AWS. You can create users and groups directly in IAM Identity Center or connect existing users and groups from providers like Okta, Ping Identity, or Azure AD. It handles authentication through your chosen identity source and maintains a user and group directory for EMR Studio access. Known user identities and logged data access facilitates compliance through auditing user access in AWS CloudTrail.

The following diagram illustrates the solution architecture.

Solution Overview

The EMR Studio workflow consists of the following high-level steps:

The end-user launches EMR Studio using the AWS access portal URL. This URL is provided by an IAM Identity Center administrator via the IAM Identity Center dashboard.
The URL redirects the end-user to the workforce IdP Okta, where the user enters workforce identity credentials.
After successful authentication, the user will be logged in to the AWS console as a federated user.
The user opens EMR Studio and navigates to the Athena query editor using the link available on EMR Studio.
The user selects the correct workgroup as per the user role to run Athena queries.
The query results are stored in separate Amazon Simple Storage Service (Amazon S3) locations with a prefix that is based on user identity.

To implement the solution, we complete the following steps:

Integrate Okta with IAM Identity Center to sync users and groups.
Integrate IAM Identity Center with EMR Studio.
Assign users or groups from IAM Identity Center to EMR Studio.
Set up Lake Formation with IAM Identity Center.
Configure granular role-based entitlements using Lake Formation on propagated corporate identities.
Set up workgroups in Athena for governing access.
Set up Amazon S3 access grants for fine-grained access to Amazon S3 resources like buckets, prefixes, or objects.
Access EMR Studio through the AWS access portal using IAM Identity Center.
Run queries on the Athena SQL editor in EMR Studio.
Review the end-to-end audit trail of workforce identity.

Prerequisites

To follow along this post, you should have the following:

An AWS account – If you don’t have one, you can sign up here.
An Okta account that has an active subscription – You need an administrator role to set up the application on Okta. If you’re new to Okta, you can sign up for a free trial or a developer account.

For instructions to configure Okta with IAM Identity Center, refer to Configure SAML and SCIM with Okta and IAM Identity Center.

Integrate Okta with IAM Identity Center to sync users and groups

After you have successfully synced users or groups from Okta to IAM Identity Center, you can see them on the IAM Identity Center console, as shown in the following screenshot. For this post, we created and synced two user groups:

Data Engineer
Data Scientists

Workforce Identity groups in IAM Identity Center

Next, create a trusted token issuer in IAM Identity Center:

On the IAM Identity Center console, choose Settings in the navigation pane.
Choose Create trusted token issuer.
For Issuer URL, enter the URL of the trusted token issuer.
For Trusted token issuer name, enter Okta.
For Map attributes¸ map the IdP attribute Email to the IAM Identity Center attribute Email.
Choose Create trusted token issuer.

The following screenshot shows your new trusted token issuer on the IAM Identity Center console.

Okta Trusted Token Issuer in Identity Center

Integrate IAM Identity Center with EMR Studio

We start with creating a trusted identity propagation enabled in EMR Studio.

An EMR Studio administrator must perform the steps to configure EMR Studio as an IAM Identity Center-enabled application. This enables EMR Studio to discover and connect to IAM Identity Center automatically to receive sign-in and user directory services.

The point of enabling EMR Studio as an IAM Identity Center-managed application is so you can control user and group permissions from within IAM Identity Center or from a source third-party IdP that’s integrated with it (Okta in this case). When your users sign in to EMR Studio, for example data-engineer or data-scientist, it checks their groups in IAM Identity Center, and these are mapped to roles and entitlements in Lake Formation. In this manner, a group can map to a Lake Formation database role that allows read access to a set of tables or columns.

The following steps show how to create EMR Studio as an AWS-managed application with IAM Identity Center, then we see how the downstream applications like Lake Formation and Athena propagate these roles and entitlements using existing corporate credentials.

On the Amazon EMR console, navigate to EMR Studio.
Choose Create a Studio.
For Setup options, select Custom.
For Studio name, enter a name.
For S3 location for Workspace storage, select Select existing location and enter the Amazon S3 location.

Create EMR Studio with Custom Set up option

6. Configure permission details for the EMR Studio.

Note that when you choose View permission details under Service role, a new pop-up window will open. You need to create an IAM role with the same policies as shown in the pop-up window. You can use the same for your service role and IAM role.

Permission details for EMR studio

On the Create a Studio page, for Authentication, select AWS IAM Identity Center.
For User role, choose your user role.
Under Trusted identity propagation, select Enable trusted identity propagation.
Under Application access, select Only assigned users and groups.
For VPC, enter your VPC.
For Subnets, enter your subnet.
For Security and access, select Default security group.
Choose Create Studio.

Enable Identity Center and Trusted Identity Propagation

You should now see an IAM Identity Center-enabled EMR Studio on the Amazon EMR console.

IAM Identity Center enabled EMR Studio

After the EMR Studio administrator finishes creating the trusted identity propagation-enabled EMR Studio and saves the configuration, the instance of the EMR Studio appears as an IAM Identity Center-enabled application on the IAM Identity Center console.

EMR Studio appears under AWS Managed app in IAM Identity Centre

Assign users or groups from IAM Identity Center to EMR Studio

You can assign users and groups from your IAM Identity Center directory to the EMR Studio application after syncing with IAM. The EMR Studio administrator decides which IAM Identity Center users or groups to include in the app. For example, if you have 10 total groups in IAM Identity Center but don’t want all of them accessing this instance of EMR Studio, you can select which groups to include in the EMR Studio-enabled IAM app.

The following steps assign groups to EMR Studio-enabled IAM Identity Center application:

On the EMR Studio console, navigate to the new EMR Studio instance.
On the Assigned groups tab, choose Assign groups.
Choose which IAM Identity Center groups you want to include in the application. For example, you may choose the Data-Scientist and Data-Engineer groups.
Choose Done.

This allows the EMR Studio administrator to choose specific IAM Identity Center groups to be assigned access to this specific instance integrated with IAM Identity Center. Only the selected groups will be synced and given access, not all groups from the IAM Identity Center directory.

Assign Trusted Identity Propagation enabled EMR studio to your user groups by selecting groups from Studio settings

Set up Lake Formation with IAM Identity Center

To set up Lake Formation with IAM Identity Center, make sure that you have configured Okta as the IdP for IAM Identity Center, and confirm that the users and groups form Okta are now available in IAM Identity Center. Then complete the following steps:

On the Lake Formation console, choose IAM Identity Center Integration under Administration in the navigation pane.

You will see the message “IAM Identity Center enabled” along with the ARN for the IAM Identity Center application.

Choose Create.

In a few minutes, you will see a message indicating that Lake Formation has been successfully integrated with your centralized IAM identities from Okta Identity Center. Specifically, the message will state “Successfully created identity center integration with application ARN,” signifying the integration is now in place between Lake Formation and the identities managed in Okta.

IAM Identity Center enabled AWS Lake Formation

Configure granular role-based entitlements using Lake Formation on propagated corporate identities

We will now set up granular entitlements for our data access in Lake Formation. For this post, we summarize the steps needed to use the existing corporate identities on the Lake Formation console to provide relevant controls and governance on the data, which we will later query through the Athena query editor. To learn about setting up databases and tables in Lake Formation, refer to Getting started with AWS Lake Formation

This post will not go into the full details about Lake Formation. Instead, we will focus on a new capability that has been introduced in Lake Formation—the ability to set up permissions based on your existing corporate identities that are synchronized with IAM Identity Center.

This integration allows Lake Formation to use your organization’s IdP and access management policies to control permissions to data lakes. Rather than defining permissions from scratch specifically for Lake Formation, you can now rely on your existing users, groups, and access controls to determine who can access data catalogs and underlying data sources. Overall, this new integration with IAM Identity Center makes it straightforward to manage permissions for your data lake workloads using your corporate identities. It reduces the administrative overhead of keeping permissions aligned across separate systems. As AWS continues enhancing Lake Formation, features like this will further improve its viability as a full-featured data lake management environment.

In this post, we created a database called zipcode-db-tip and granted full access to the user group Data-Engineer to query on the underlying table in the database. Complete the following steps:

On the Lake Formation console, choose Grant data lake permissions.
For Principals, select IAM Identity Center.
For Users and groups, select Data-Engineer.
For LF-Tags or catalog resources, select Named Data Catalog resources.
For Databases, choose zipcode-db-tip.
For Tables, choose tip-zipcode.

Similarly, we need to provide the relevant access on the underlying tables to the users and groups for them to be able to query on the data.

Repeat the preceding steps to provide access to the Data-Engineer group to be able to query on the data.
For Table permissions, select Select, Describe, and Super.
For Data permissions, select All data access.

You can grant selective access on rows and comments as per your specific requirements.

Grant Table permissions in AWS Data Lake

Set up workgroups in Athena

Athena workgroups are an AWS feature that allows you to isolate data and queries within an AWS account. It provides a way to segregate data and control access so that each group can only access the data that is relevant to them. Athena workgroups are useful for organizations that want to restrict access to sensitive datasets or help prevent queries from impacting each other. When you create a workgroup, you can assign users and roles to it. Queries launched within a workgroup will run with the access controls and settings configured for that workgroup. They enable governance, security, and resource controls at a granular level. Athena workgroups are an important feature for managing and optimizing Athena usage across large organizations.

In this post, we create a workgroup specifically for members of our Data Engineering team. Later, when logged in under Data Engineer user profiles, we run queries from within this workgroup to demonstrate how access to Athena workgroups can be restricted based on the user profile. This allows governance policies to be enforced, making sure users can only access permitted datasets and queries based on their role.

On the Athena console, choose Workgroups under Administration in the navigation pane.
Choose Create workgroup.
For Authentication, select AWS Identity Center.
For Service role to authorize Athena, select Create and use a new service role.
For Service role name, enter a name for your role.
For Location of query result, enter an Amazon S3 location for saving your Athena query results.

This is a mandatory field when you specify IAM Identity Center for authentication.

Configure location for query result and enable user identity based S3 prefix

After you create the workgroup, you need to assign users and groups to it. For this post, we create a workgroup named data-engineer and assign the group Data-Engineer (propagated through the trusted identity propagation from IAM Identity Center).

On the Groups tab on the data-engineer details page, select the user group to assign and choose Assign groups.

Set up Amazon S3 access grants to separate the query results for each workforce identity

Next, we set up Amazon S3 grants.

You can watch the following video to set up the grants or refer to Use Amazon EMR with S3 Access Grants to scale Spark access Amazon S3 for instructions.

Initiate login through AWS federated access using the IAM Identity Center access portal

Now we’re ready to connect to EMR Studio and federated login using IAM Identity Center authentication:

On the IAM Identity Center console, navigate to the dashboard and choose the AWS access portal URL.
A browser pop-up directs you to the Okta login page, where you enter your Okta credentials.
After successful authentication, you’ll be logged in to the AWS console as a federated user.
Choose the EMR Studio application.
After you federate to EMR Studio, choose Query Editor in the navigation pane to open a new tab with the Athena query editor.

The following video shows a federated user using the AWS access portal URL to access EMR Studio using IAM Identity Center authentication.

Run queries with granular access on the editor

On EMR Studio, the user can open the Athena query editor and then specify the correct workgroup in the query editor to run the queries.

Athena Query result in data-engineer workgroup

The data engineer can query only the tables on which the user has access. The query results will appear under the S3 prefix, which is separate for each workforce identity.

Review the end-to-end audit trail of workforce identity

The IAM Identity Center administrator can look into the downstream apps that are trusted for identity propagation, as shown in the following screenshot of the IAM Identity Center console.

AWS IAM Identity Center view of the trusted applications

On the CloudTrail console, the event history displays the event name and resource accessed by the specific workforce identity.

Auditors can see the workforce identity who executed the query on AWS Data Lake

When you choose an event in CloudTrail, the auditors can see the unique user ID that accessed the underlying AWS Analytics services.

Clean up

Complete the following steps to clean up your resources:

Delete the Okta applications that you created to integrate with IAM Identity Center.
Delete IAM Identity Center configuration.
Delete the EMR Studio that you created for testing.
Delete the IAM role that you created for IAM Identity Center and EMR Studio integration.

Conclusion

In this post, we showed you a detailed walkthrough to bring your workforce identity to EMR Studio and propagate the identity to connected AWS applications like Athena and Lake Formation. This solution provides your workforce with a familiar sign-in experience, without the need to remember additional credentials or maintain complex role mapping across different analytics systems. In addition, it provides auditors with end-to-end visibility into workforce identities and their access to analytics services.

To learn more about trusted identity propagation and EMR Studio, refer to Integrate Amazon EMR with AWS IAM Identity Center.

About the authors

Manjit Chakraborty is a Senior Solutions Architect at AWS. He is a Seasoned & Result driven professional with extensive experience in Financial domain having worked with customers on advising, designing, leading, and implementing core-business enterprise solutions across the globe. In his spare time, Manjit enjoys fishing, practicing martial arts and playing with his daughter.

Neeraj Roy is a Principal Solutions Architect at AWS based out of London. He works with Global Financial Services customers to accelerate their AWS journey. In his spare time, he enjoys reading and spending time with his family.

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

2024-03-04 Satyanarayana Adimula

Post Syndicated from Satyanarayana Adimula original https://aws.amazon.com/blogs/big-data/use-aws-glue-etl-to-perform-merge-partition-evolution-and-schema-evolution-on-apache-iceberg/

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. This hampers agility and time to insight.

Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. This is critical for fast-moving enterprises to augment data structures to support new use cases. For example, an ecommerce company may add new customer demographic attributes or order status flags to enrich analytics. Apache Iceberg manages these schema changes in a backward-compatible way through its innovative metadata table evolution architecture.

Similarly, partition evolution allows seamless adding, dropping, or splitting partitions. For instance, an ecommerce marketplace may initially partition order data by day. As orders accumulate, and querying by day becomes inefficient, they may split to day and customer ID partitions. Table partitioning organizes big datasets most efficiently for query performance. Iceberg gives enterprises the flexibility to incrementally adjust partitions rather than requiring tedious rebuild procedures. New partitions can be added in a fully compatible way without downtime or having to rewrite existing data files.

This post demonstrates how you can harness Iceberg, Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and AWS Identity and Access Management (IAM) to implement a transactional data lake supporting seamless evolution. By allowing for painless schema and partition adjustments as data insights evolve, you can benefit from the future-proof flexibility needed for business success.

Overview of solution

For our example use case, a fictional large ecommerce company processes thousands of orders each day. When orders are received, updated, cancelled, shipped, delivered, or returned, the changes are made in their on-premises system, and those changes need to be replicated to an S3 data lake so that data analysts can run queries through Amazon Athena. The changes can contain schema updates as well. Due to the security requirements of different organizations, they need to manage fine-grained access control for the analysts through Lake Formation.

The following diagram illustrates the solution architecture.

The solution workflow includes the following key steps:

Ingest data from on premises into a Dropzone location using a data ingestion pipeline.
Merge the data from the Dropzone location into Iceberg using AWS Glue.
Query the data using Athena.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up the infrastructure with AWS CloudFormation

To create your infrastructure with an AWS CloudFormation template, complete the following steps:

Log in as an administrator to your AWS account.
Open the AWS CloudFormation console.
Choose Launch Stack:
For Stack name, enter a name (for this post, icebergdemo1).
Choose Next.
Provide information for the following parameters:
1. DatalakeUserName
2. DatalakeUserPassword
3. DatabaseName
4. TableName
5. DatabaseLFTagKey
6. DatabaseLFTagValue
7. TableLFTagKey
8. TableLFTagValue
Choose Next.
Choose Next again.
In the Review section, review the values you entered.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Submit.

In a few minutes, the stack status will change to CREATE_COMPLETE.

You can go to the Outputs tab of the stack to see all the resources it has provisioned. The resources are prefixed with the stack name you provided (for this post, icebergdemo1).

Create an Iceberg table using Lambda and grant access using Lake Formation

To create an Iceberg table and grant access on it, complete the following steps:

Navigate to the Resources tab of the CloudFormation stack icebergdemo1 and search for logical ID named LambdaFunctionIceberg.
Choose the hyperlink of the associated physical ID.

You’re redirected to the Lambda function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.

On the Configuration tab, choose Environment variables in the left pane.

On the Code tab, you can inspect the function code.

The function uses the AWS SDK for Python (Boto3) APIs to provision the resources. It assumes the provisioned data lake admin role to perform the following tasks:

Grant DATA_LOCATION_ACCESS access to the data lake admin role on the registered data lake location
Create Lake Formation Tags (LF-Tags)
Create a database in the AWS Glue Data Catalog using the AWS Glue create_database API
Assign LF-Tags to the database
Grant DESCRIBE access on the database using LF-Tags to the data lake IAM user and AWS Glue ETL IAM role
Create an Iceberg table using the AWS Glue create_table API:

response_create_table = glue_client.create_table(
DatabaseName= 'icebergdb1',
OpenTableFormatInput= { 
 'IcebergInput': { 
 'MetadataOperation': 'CREATE',
 'Version': '2'
 }
},
TableInput={
    'Name': ‘ecomorders’,
    'StorageDescriptor': {
        'Columns': [
            {'Name': 'ordernum', 'Type': 'int'},
            {'Name': 'sku', 'Type': 'string'},
            {'Name': 'quantity','Type': 'int'},
            {'Name': 'category','Type': 'string'},
            {'Name': 'status','Type': 'string'},
            {'Name': 'shipping_id','Type': 'string'}
        ],  
        'Location': 's3://icebergdemo1-s3bucketiceberg-vthvwwblrwe8/iceberg/'
    },
    'TableType': 'EXTERNAL_TABLE'
    }
)

Assign LF-Tags to the table
Grant DESCRIBE and SELECT on the Iceberg table LF-Tags for the data lake IAM user
Grant ALL, DESCRIBE, SELECT, INSERT, DELETE, and ALTER access on the Iceberg table LF-Tags to the AWS Glue ETL IAM role

On the Test tab, choose Test to run the function.

When the function is complete, you will see the message “Executing function: succeeded.”

Lake Formation helps you centrally manage, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon S3 and its metadata in the Data Catalog.

To add an Amazon S3 location as Iceberg storage in your data lake, register the location with Lake Formation. You can then use Lake Formation permissions for fine-grained access control to the Data Catalog objects that point to this location, and to the underlying data in the location.

The CloudFormation stack registered the data lake location.

Data location permissions in Lake Formation enable principals to create and alter Data Catalog resources that point to the designated registered Amazon S3 locations. Data location permissions work in addition to Lake Formation data permissions to secure information in your data lake.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes. In Lake Formation, these attributes are called LF-Tags. You can attach LF-Tags to Data Catalog resources, Lake Formation principals, and table columns. You can assign and revoke permissions on Lake Formation resources using these LF-Tags. Lake Formation allows operations on those resources when the principal’s tag matches the resource tag.

Verify the Iceberg table from the Lake Formation console

To verify the Iceberg table, complete the following steps:

On the Lake Formation console, choose Databases in the navigation pane.
Open the details page for icebergdb1.

You can see the associated database LF-Tags.

Choose Tables in the navigation pane.
Open the details page for ecomorders.

In the Table details section, you can observe the following:

Table format shows as Apache Iceberg
Table management shows as Managed by Data Catalog
Location lists the data lake location of the Iceberg table

In the LF-Tags section, you can see the associated table LF-Tags.

In the Table details section, expand Advanced table properties to view the following:

metadata_location points to the location of the Iceberg table’s metadata file
table_type shows as ICEBERG

On the Schema tab, you can view the columns defined on the Iceberg table.

Integrate Iceberg with the AWS Glue Data Catalog and Amazon S3

Iceberg tracks individual data files in a table instead of directories. When there is an explicit commit on the table, Iceberg creates data files and adds them to the table. Iceberg maintains the table state in metadata files. Any change in table state creates a new metadata file that atomically replaces the older metadata. Metadata files track the table schema, partitioning configuration, and other properties.

Iceberg requires file systems that support the operations to be compatible with object stores like Amazon S3.

Iceberg creates snapshots for the table contents. Each snapshot is a complete set of data files in the table at a point in time. Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics.

The following diagram illustrates this hierarchy.

When you create an Iceberg table, it creates the metadata folder first and a metadata file in the metadata folder. The data folder is created when you load data into the Iceberg table.

Contents of the Iceberg metadata file

The Iceberg metadata file contains a lot of information, including the following:

format-version –Version of the Iceberg table
Location – Amazon S3 location of the table
Schemas – Name and data type of all columns on the table
partition-specs – Partitioned columns
sort-orders – Sort order of columns
properties – Table properties
current-snapshot-id – Current snapshot
refs – Table references
snapshots – List of snapshots, each containing the following information:
- sequence-number – Sequence number of snapshots in chronological order (the highest number represents the current snapshot, 1 for the first snapshot)
- snapshot-id – Snapshot ID
- timestamp-ms – Timestamp when the snapshot was committed
- summary – Summary of changes committed
- manifest-list – List of manifests; this file name starts with snap-< snapshot-id >
schema-id – Sequence number of the schema in chronological order (the highest number represents the current schema)
snapshot-log – List of snapshots in chronological order
metadata-log – List of metadata files in chronological order

The metadata file has all the historical changes to the table’s data and schema. Reviewing the contents on the metafile file directly can be a time-consuming task. Fortunately, you can query the Iceberg metadata using Athena.

Iceberg framework in AWS Glue

AWS Glue 4.0 supports Iceberg tables registered with Lake Formation. In the AWS Glue ETL jobs, you need the following code to enable the Iceberg framework:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
aws_account_id = boto3.client('sts').get_caller_identity().get('Account')

args = getResolvedOptions(sys.argv, ['JOB_NAME','warehouse_path']
    
# Set up configuration for AWS Glue to work with Apache Iceberg
conf = SparkConf()
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.warehouse", args['warehouse_path'])
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.glue.lakeformation-enabled", "true")
conf.set("spark.sql.catalog.glue_catalog.glue.id", aws_account_id)

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

For read/write access to underlying data, in addition to Lake Formation permissions, the AWS Glue IAM role to run the AWS Glue ETL jobs was granted lakeformation: GetDataAccess IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.

The CloudFormation stack provisioned the four AWS Glue ETL jobs for you. The name of each job starts with your stack name (icebergdemo1). Complete the following steps to view the jobs:

Log in as an administrator to your AWS account.
On the AWS Glue console, choose ETL jobs in the navigation pane.
Search for jobs with icebergdemo1 in the name.

Merge data from Dropzone into the Iceberg table

For our use case, the company ingests their ecommerce orders data daily from their on-premises location into an Amazon S3 Dropzone location. The CloudFormation stack loaded three files with sample orders for 3 days, as shown in the following figures. You see the data in the Dropzone location s3://icebergdemo1-s3bucketdropzone-kunftrcblhsk/data.

The AWS Glue ETL job icebergdemo1-GlueETL1-merge will run daily to merge the data into the Iceberg table. It has the following logic to add or update the data on Iceberg:

Create a Spark DataFrame from input data:

df = spark.read.format(dropzone_dataformat).option("header", True).load(dropzone_path)
df = df.withColumn("ordernum", df["ordernum"].cast(IntegerType())) \
    .withColumn("quantity", df["quantity"].cast(IntegerType()))
df.createOrReplaceTempView("input_table")

For a new order, add it to the table
If the table has a matching order, update the status and shipping_id:

stmt_merge = f"""
    MERGE INTO glue_catalog.{database_name}.{table_name} AS t
    USING input_table AS s 
    ON t.ordernum= s.ordernum
    WHEN MATCHED 
            THEN UPDATE SET 
                t.status = s.status,
                t.shipping_id = s.shipping_id
    WHEN NOT MATCHED THEN INSERT *
    """
spark.sql(stmt_merge)

Complete the following steps to run the AWS Glue merge job:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Select the ETL job icebergdemo1-GlueETL1-merge.
On the Actions dropdown menu, choose Run with parameters.
On the Run parameters page, go to Job parameters.
For the --dropzone_path parameter, provide the S3 location of the input data (icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge1).
Run the job to add all the orders: 1001, 1002, 1003, and 1004.
For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge2.
Run the job again to add orders 2001 and 2002, and update orders 1001, 1002, and 1003.
For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge3.
Run the job again to add order 3001 and update orders 1001, 1003, 2001, and 2002.

Go to the data folder of table to see the data files written by Iceberg when you merged the data into the table using the Glue ETL job icebergdemo1-GlueETL1-merge.

Query Iceberg using Athena

The CloudFormation stack created the IAM user iceberguser1, which has read access on the Iceberg table using LF-Tags. To query Iceberg using Athena via this user, complete the following steps:

Log in as iceberguser1 to the AWS Management Console.
On the Athena console, choose Workgroups in the navigation pane.
Locate the workgroup that CloudFormation provisioned (icebergdemo1-workgroup)
Verify Athena engine version 3.

The Athena engine version 3 supports Iceberg file formats, including Parquet, ORC, and Avro.

Go to the Athena query editor.
Choose the workgroup icebergdemo1-workgroup on the dropdown menu.
For Database, choose icebergdb1. You will see the table ecomorders.
Run the following query to see the data in the Iceberg table:
```
SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;
```
Run the following query to see table’s current partitions:
```
DESCRIBE icebergdb1.ecomorders ;
```

Partition-spec describes how table is partitioned. In this example, there are no partitioned fields because you didn’t define any partitions on the table.

Iceberg partition evolution

You may need to change your partition structure; for example, due to trend changes of common query patterns in downstream analytics. A change of partition structure for traditional tables is a significant operation that requires an entire data copy.

Iceberg makes this straightforward. When you change the partition structure on Iceberg, it doesn’t require you to rewrite the data files. The old data written with earlier partitions remains unchanged. New data is written using the new specifications in a new layout. Metadata for each of the partition versions is kept separately.

Let’s add the partition field category to the Iceberg table using the AWS Glue ETL job icebergdemo1-GlueETL2-partition-evolution:

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ADD PARTITION FIELD category ;

On the AWS Glue console, run the ETL job icebergdemo1-GlueETL2-partition-evolution. When the job is complete, you can query partitions using Athena.

DESCRIBE icebergdb1.ecomorders ;

SELECT * FROM "icebergdb1"."ecomorders$partitions";

You can see the partition field category, but the partition values are null. There are no new data files in the data folder, because partition evolution is a metadata operation and doesn’t rewrite data files. When you add or update data, you will see the corresponding partition values populated.

Iceberg schema evolution

Iceberg supports in-place table evolution. You can evolve a table schema just like SQL. Iceberg schema updates are metadata changes, so no data files need to be rewritten to perform the schema evolution.

To explore the Iceberg schema evolution, run the ETL job icebergdemo1-GlueETL3-schema-evolution via the AWS Glue console. The job runs the following SparkSQL statements:

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ADD COLUMNS (shipping_carrier string) ;

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    RENAME COLUMN shipping_id TO tracking_number ;

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ALTER COLUMN ordernum TYPE bigint ;

In the Athena query editor, run the following query:

SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum asc ;

You can verify the schema changes to the Iceberg table:

A new column has been added called shipping_carrier
The column shipping_id has been renamed to tracking_number
The data type of the column ordernum has changed from int to bigint
```
DESCRIBE icebergdb1.ecomorders;
```

Positional update

The data in tracking_number contains the shipping carrier concatenated with the tracking number. Let’s assume that we want to split this data in order to keep the shipping carrier in the shipping_carrier field and the tracking number in the tracking_number field.

On the AWS Glue console, run the ETL job icebergdemo1-GlueETL4-update-table. The job runs the following SparkSQL statement to update the table:

UPDATE glue_catalog.icebergdb1.ecomorders
SET shipping_carrier = substring(tracking_number,1,3),
    tracking_number = substring(tracking_number,4,50)
WHERE tracking_number != '' ;

Query the Iceberg table to verify the updated data on tracking_number and shipping_carrier.

SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;

Now that the data has been updated on the table, you should see the partition values populated for category:

SELECT * FROM "icebergdb1"."ecomorders$partitions"
ORDER BY partition;

Clean up

To avoid incurring future charges, clean up the resources you created:

On the Lambda console, open the details page for the function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.
In the Environment variables section, choose the key Task_To_Perform and update the value to CLEANUP.
Run the function, which drops the database, table, and their associated LF-Tags.
On the AWS CloudFormation console, delete the stack icebergdemo1.

Conclusion

In this post, you created an Iceberg table using the AWS Glue API and used Lake Formation to control access on the Iceberg table in a transactional data lake. With AWS Glue ETL jobs, you merged data into the Iceberg table, and performed schema evolution and partition evolution without rewriting or recreating the Iceberg table. With Athena, you queried the Iceberg data and metadata.

Based on the concepts and demonstrations from this post, you can now build a transactional data lake in an enterprise using Iceberg, AWS Glue, Lake Formation, and Amazon S3.

About the Author

Satya Adimula is a Senior Data Architect at AWS based in Boston. With over two decades of experience in data and analytics, Satya helps organizations derive business insights from their data at scale.

Enhance container software supply chain visibility through SBOM export with Amazon Inspector and QuickSight

2024-02-28 Jason Ng

Post Syndicated from Jason Ng original https://aws.amazon.com/blogs/security/enhance-container-software-supply-chain-visibility-through-sbom-export-with-amazon-inspector-and-quicksight/

In this post, I’ll show how you can export software bills of materials (SBOMs) for your containers by using an AWS native service, Amazon Inspector, and visualize the SBOMs through Amazon QuickSight, providing a single-pane-of-glass view of your organization’s software supply chain.

The concept of a bill of materials (BOM) originated in the manufacturing industry in the early 1960s. It was used to keep track of the quantities of each material used to manufacture a completed product. If parts were found to be defective, engineers could then use the BOM to identify products that contained those parts. An SBOM extends this concept to software development, allowing engineers to keep track of vulnerable software packages and quickly remediate the vulnerabilities.

Today, most software includes open source components. A Synopsys study, Walking the Line: GitOps and Shift Left Security, shows that 8 in 10 organizations reported using open source software in their applications. Consider a scenario in which you specify an open source base image in your Dockerfile but don’t know what packages it contains. Although this practice can significantly improve developer productivity and efficiency, the decreased visibility makes it more difficult for your organization to manage risk effectively.

It’s important to track the software components and their versions that you use in your applications, because a single affected component used across multiple organizations could result in a major security impact. According to a Gartner report titled Gartner Report for SBOMs: Key Takeaways You Should know, by 2025, 60 percent of organizations building or procuring critical infrastructure software will mandate and standardize SBOMs in their software engineering practice, up from less than 20 percent in 2022. This will help provide much-needed visibility into software supply chain security.

Integrating SBOM workflows into the software development life cycle is just the first step—visualizing SBOMs and being able to search through them quickly is the next step. This post describes how to process the generated SBOMs and visualize them with Amazon QuickSight. AWS also recently added SBOM export capability in Amazon Inspector, which offers the ability to export SBOMs for Amazon Inspector monitored resources, including container images.

Why is vulnerability scanning not enough?

Scanning and monitoring vulnerable components that pose cybersecurity risks is known as vulnerability scanning, and is fundamental to organizations for ensuring a strong and solid security posture. Scanners usually rely on a database of known vulnerabilities, the most common being the Common Vulnerabilities and Exposures (CVE) database.

Identifying vulnerable components with a scanner can prevent an engineer from deploying affected applications into production. You can embed scanning into your continuous integration and continuous delivery (CI/CD) pipelines so that images with known vulnerabilities don’t get pushed into your image repository. However, what if a new vulnerability is discovered but has not been added to the CVE records yet? A good example of this is the Apache Log4j vulnerability, which was first disclosed on Nov 24, 2021 and only added as a CVE on Dec 1, 2021. This means that for 7 days, scanners that relied on the CVE system weren’t able to identify affected components within their organizations. This issue is known as a zero-day vulnerability. Being able to quickly identify vulnerable software components in your applications in such situations would allow you to assess the risk and come up with a mitigation plan without waiting for a vendor or supplier to provide a patch.

In addition, it’s also good hygiene for your organization to track usage of software packages, which provides visibility into your software supply chain. This can improve collaboration between developers, operations, and security teams, because they’ll have a common view of every software component and can collaborate effectively to address security threats.

In this post, I present a solution that uses the new Amazon Inspector feature to export SBOMs from container images, process them, and visualize the data in QuickSight. This gives you the ability to search through your software inventory on a dashboard and to use natural language queries through QuickSight Q, in order to look for vulnerabilities.

Solution overview

Figure 1 shows the architecture of the solution. It is fully serverless, meaning there is no underlying infrastructure you need to manage. This post uses a newly released feature within Amazon Inspector that provides the ability to export a consolidated SBOM for Amazon Inspector monitored resources across your organization in commonly used formats, including CycloneDx and SPDX.

Figure 1: Solution architecture diagram

The workflow in Figure 1 is as follows:

The image is pushed into Amazon Elastic Container Registry (Amazon ECR), which sends an Amazon EventBridge event.
This invokes an AWS Lambda function, which starts the SBOM generation job for the specific image.
When the job completes, Amazon Inspector deposits the SBOM file in an Amazon Simple Storage Service (Amazon S3) bucket.
Another Lambda function is invoked whenever a new JSON file is deposited. The function performs the data transformation steps and uploads the new file into a new S3 bucket.
Amazon Athena is then used to perform preliminary data exploration.
A dashboard on Amazon QuickSight displays SBOM data.

Implement the solution

This section describes how to deploy the solution architecture.

In this post, you’ll perform the following tasks:

Create S3 buckets and AWS KMS keys to store the SBOMs
Create an Amazon Elastic Container Registry (Amazon ECR) repository
Deploy two AWS Lambda functions to initiate the SBOM generation and transformation
Set up Amazon EventBridge rules to invoke Lambda functions upon image push into Amazon ECR
Run AWS Glue crawlers to crawl the transformed SBOM S3 bucket
Run Amazon Athena queries to review SBOM data
Create QuickSight dashboards to identify libraries and packages
Use QuickSight Q to identify libraries and packages by using natural language queries

Deploy the CloudFormation stack

The AWS CloudFormation template we’ve provided provisions the S3 buckets that are required for the storage of raw SBOMs and transformed SBOMs, the Lambda functions necessary to initiate and process the SBOMs, and EventBridge rules to run the Lambda functions based on certain events. An empty repository is provisioned as part of the stack, but you can also use your own repository.

To deploy the CloudFormation stack

Download the CloudFormation template.
Browse to the CloudFormation service in your AWS account and choose Create Stack.
Upload the CloudFormation template you downloaded earlier.
For the next step, Specify stack details, enter a stack name.
You can keep the default value of sbom-inspector for EnvironmentName.
Specify the Amazon Resource Name (ARN) of the user or role to be the admin for the KMS key.
Deploy the stack.

Set up Amazon Inspector

If this is the first time you’re using Amazon Inspector, you need to activate the service. In the Getting started with Amazon Inspector topic in the Amazon Inspector User Guide, follow Step 1 to activate the service. This will take some time to complete.

Figure 2: Activate Amazon Inspector

SBOM invocation and processing Lambda functions

This solution uses two Lambda functions written in Python to perform the invocation task and the transformation task.

Invocation task — This function is run whenever a new image is pushed into Amazon ECR. It takes in the repository name and image tag variables and passes those into the create_sbom_export function in the SPDX format. This prevents duplicated SBOMs, which helps to keep the S3 data size small.
Transformation task — This function is run whenever a new file with the suffix .json is added to the raw S3 bucket. It creates two files, as follows:
1. It extracts information such as image ARN, account number, package, package version, operating system, and SHA from the SBOM and exports this data to the transformed S3 bucket under a folder named sbom/.
2. Because each package can have more than one CVE, this function also extracts the CVE from each package and stores it in the same bucket in a directory named cve/. Both files are exported in Apache Parquet so that the file is in a format that is optimized for queries by Amazon Athena.

Populate the AWS Glue Data Catalog

To populate the AWS Glue Data Catalog, you need to generate the SBOM files by using the Lambda functions that were created earlier.

To populate the AWS Glue Data Catalog

You can use an existing image, or you can continue on to create a sample image.
Open an AWS Cloudshell terminal.

Run the follow commands

# Pull the nginx image from a public repo
docker pull public.ecr.aws/nginx/nginx:1.19.10-alpine-perl

docker tag public.ecr.aws/nginx/nginx:1.19.10-alpine-perl <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl

# Authenticate to ECR, fill in your account id
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com

# Push the image into ECR
docker push <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl

An image is pushed into the Amazon ECR repository in your account. This invokes the Lambda functions that perform the SBOM export by using Amazon Inspector and converts the SBOM file to Parquet.
Verify that the Parquet files are in the transformed S3 bucket:
1. Browse to the S3 console and choose the bucket named sbom-inspector-<ACCOUNT-ID>-transformed. You can also track the invocation of each Lambda function in the Amazon CloudWatch log console.
2. After the transformation step is complete, you will see two folders (cve/ and sbom/)in the transformed S3 bucket. Choose the sbom folder. You will see the transformed Parquet file in it. If there are CVEs present, a similar file will appear in the cve folder.
The next step is to run an AWS Glue crawler to determine the format, schema, and associated properties of the raw data. You will need to crawl both folders in the transformed S3 bucket and store the schema in separate tables in the AWS Glue Data Catalog.
On the AWS Glue Service console, on the left navigation menu, choose Crawlers.
On the Crawlers page, choose Create crawler. This starts a series of pages that prompt you for the crawler details.
In the Crawler name field, enter sbom-crawler, and then choose Next.
Under Data sources, select Add a data source.
Now you need to point the crawler to your data. On the Add data source page, choose the Amazon S3 data store. This solution in this post doesn’t use a connection, so leave the Connection field blank if it’s visible.
For the option Location of S3 data, choose In this account. Then, for S3 path, enter the path where the crawler can find the sbom and cve data, which is s3://sbom-inspector-<ACCOUNT-ID>-transformed/sbom/ and s3://sbom-inspector-<ACCOUNT-ID>-transformed/cve/. Leave the rest as default and select Add an S3 data source.

Figure 3: Data source for AWS Glue crawler
The crawler needs permissions to access the data store and create objects in the Data Catalog. To configure these permissions, choose Create an IAM role. The AWS Identity and Access Management (IAM) role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter sbomcrawler, and then choose Next.
Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. To create a database, choose Add database. In the pop-up window, enter sbom-db for the database name, and then choose Create.
Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes. After you’ve reviewed the information, choose Finish to create the crawler.

Figure 4: Creation of the AWS Glue crawler
Select the newly created crawler and choose Run.
After the crawler runs successfully, verify that the table is created and the data schema is populated.

Figure 5: Table populated from the AWS Glue crawler

Set up Amazon Athena

Amazon Athena performs the initial data exploration and validation. Athena is a serverless interactive analytics service built on open source frameworks that supports open-table and file formats. Athena provides a simplified, flexible way to analyze data in sources like Amazon S3 by using standard SQL queries. If you are SQL proficient, you can query the data source directly; however, not everyone is familiar with SQL. In this section, you run a sample query and initialize the service so that it can used in QuickSight later on.

To start using Amazon Athena

In the AWS Management Console, navigate to the Athena console.
For Database, select sbom-db (or select the database you created earlier in the crawler).
Navigate to the Settings tab located at the top right corner of the console. For Query result location, select the Athena S3 bucket created from the CloudFormation template, sbom-inspector-<ACCOUNT-ID>-athena.
Keep the defaults for the rest of the settings. You can now return to the Query Editor and start writing and running your queries on the sbom-db database.

You can use the following sample query.

select package, packageversion, cve, sha, imagearn from sbom
left join cve
using (sha, package, packageversion)
where cve is not null;

Your Athena console should look similar to the screenshot in Figure 6.

Figure 6: Sample query with Amazon Athena

This query joins the two tables and selects only the packages with CVEs identified. Alternatively, you can choose to query for specific packages or identify the most common package used in your organization.

Sample output:

# package packageversion cve sha imagearn
<PACKAGE_NAME> <PACKAGE_VERSION> <CVE> <IMAGE_SHA> <ECR_IMAGE_ARN>

Visualize data with Amazon QuickSight

Amazon QuickSight is a serverless business intelligence service that is designed for the cloud. In this post, it serves as a dashboard that allows business users who are unfamiliar with SQL to identify zero-day vulnerabilities. This can also reduce the operational effort and time of having to look through several JSON documents to identify a single package across your image repositories. You can then share the dashboard across teams without having to share the underlying data.

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) is an in-memory engine that QuickSight uses to perform advanced calculations. In a large organization where you could have millions of SBOM records stored in S3, importing your data into SPICE helps to reduce the time to process and serve the data. You can also use the feature to perform a scheduled refresh to obtain the latest data from S3.

QuickSight also has a feature called QuickSight Q. With QuickSightQ, you can use natural language to interact with your data. If this is the first time you are initializing QuickSight, subscribe to QuickSight and select Enterprise + Q. It will take roughly 20–30 minutes to initialize for the first time. Otherwise, if you are already using QuickSight, you will need to enable QuickSight Q by subscribing to it in the QuickSight console.

Finally, in QuickSight you can select different data sources, such as Amazon S3 and Athena, to create custom visualizations. In this post, we will use the two Athena tables as the data source to create a dashboard to keep track of the packages used in your organization and the resulting CVEs that come with them.

Prerequisites for setting up the QuickSight dashboard

This process will be used to create the QuickSight dashboard from a template already pre-provisioned through the command line interface (CLI). It also grants the necessary permissions for QuickSight to access the data source. You will need the following:

AWS Command Line Interface (AWS CLI) programmatic access with read and write permissions to QuickSight.
A QuickSight + Q subscription (only if you want to use the Q feature).
QuickSight permissions to Amazon S3 and Athena (enable these through the QuickSight security and permissions interface).
Set the default AWS Region where you want to deploy the QuickSight dashboard. This post assumes that you’re using the us-east-1 Region.

Create datasets

In QuickSight, create two datasets, one for the sbom table and another for the cve table.

In the QuickSight console, select the Dataset tab.
Choose Create dataset, and then select the Athena data source.
Name the data source sbom and choose Create data source.
Select the sbom table.
Choose Visualize to complete the dataset creation. (Delete the analyses automatically created for you because you will create your own analyses afterwards.)
Navigate back to the main QuickSight page and repeat steps 1–4 for the cve dataset.

Merge datasets

Next, merge the two datasets to create the combined dataset that you will use for the dashboard.

On the Datasets tab, edit the sbom dataset and add the cve dataset.
Set three join clauses, as follows:
1. Sha : Sha
2. Package : Package
3. Packageversion : Packageversion
Perform a left merge, which will append the cve ID to the package and package version in the sbom dataset.

Figure 7: Combining the sbom and cve datasets

Next, you will create a dashboard based on the combined sbom dataset.

Prepare configuration files

In your terminal, export the following variables. Substitute <QuickSight username> in the QS_USER_ARN variable with your own username, which can be found in the Amazon QuickSight console.

export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export TEMPLATE_ID=”sbom_dashboard”
export QS_USER_ARN=$(aws quicksight describe-user --aws-account-id $ACCOUNT_ID --namespace default --user-name <QuickSight username> | jq .User.Arn)
export QS_DATA_ARN=$(aws quicksight search-data-sets --aws-account-id $ACCOUNT_ID --filters Name="DATASET_NAME",Operator="StringLike",Value="sbom" | jq .DataSetSummaries[0].Arn)

Validate that the variables are set properly. This is required for you to move on to the next step; otherwise you will run into errors.

echo ACCOUNT_ID is $ACCOUNT_ID || echo ACCOUNT_ID is not set
echo TEMPLATE_ID is $TEMPLATE_ID || echo TEMPLATE_ID is not set
echo QUICKSIGHT USER ARN is $QS_USER_ARN || echo QUICKSIGHT USER ARN is not set
echo QUICKSIGHT DATA ARN is $QS_DATA_ARN || echo QUICKSIGHT DATA ARN is not set

Next, use the following commands to create the dashboard from a predefined template and create the IAM permissions needed for the user to view the QuickSight dashboard.

cat < ./dashboard.json
{
    "SourceTemplate": {
      "DataSetReferences": [
        {
          "DataSetPlaceholder": "sbom",
          "DataSetArn": $QS_DATA_ARN
        }
      ],
      "Arn": "arn:aws:quicksight:us-east-1:293424211206:template/sbom_qs_template"
    }
}
EOF

cat < ./dashboardpermissions.json
[
    {
      "Principal": $QS_USER_ARN,
      "Actions": [
        "quicksight:DescribeDashboard",
        "quicksight:ListDashboardVersions",
        "quicksight:UpdateDashboardPermissions",
        "quicksight:QueryDashboard",
        "quicksight:UpdateDashboard",
        "quicksight:DeleteDashboard",
        "quicksight:DescribeDashboardPermissions",
        "quicksight:UpdateDashboardPublishedVersion"
      ]
    }
]
EOF

Run the following commands to create the dashboard in your QuickSight console.

aws quicksight create-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --name sbom-dashboard --source-entity file://dashboard.json

Note: Run the following describe-dashboard command, and confirm that the response contains a status code of 200. The 200-status code means that the dashboard exists.

aws quicksight describe-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID

Use the following update-dashboard-permissions AWS CLI command to grant the appropriate permissions to QuickSight users.

aws quicksight update-dashboard-permissions --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --grant-permissions file://dashboardpermissions.json

You should now be able to see the dashboard in your QuickSight console, similar to the one in Figure 8. It’s an interactive dashboard that shows you the number of vulnerable packages you have in your repositories and the specific CVEs that come with them. You can navigate to the specific image by selecting the CVE (middle right bar chart) or list images with a specific vulnerable package (bottom right bar chart).

Note: You won’t see the exact same graph as in Figure 8. It will change according to the image you pushed in.

Figure 8: QuickSight dashboard containing SBOM information

Alternatively, you can use QuickSight Q to extract the same information from your dataset through natural language. You will need to create a topic and add the dataset you added earlier. For detailed information on how to create a topic, see the Amazon QuickSight User Guide. After QuickSight Q has completed indexing the dataset, you can start to ask questions about your data.

Figure 9: Natural language query with QuickSight Q

Conclusion

This post discussed how you can use Amazon Inspector to export SBOMs to improve software supply chain transparency. Container SBOM export should be part of your supply chain mitigation strategy and monitored in an automated manner at scale.

Although it is a good practice to generate SBOMs, it would provide little value if there was no further analysis being done on them. This solution enables you to visualize your SBOM data through a dashboard and natural language, providing better visibility into your security posture. Additionally, this solution is also entirely serverless, meaning there are no agents or sidecars to set up.

To learn more about exporting SBOMs with Amazon Inspector, see the Amazon Inspector User Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

2024-02-12 Venkata Kampana

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/automate-aws-clean-rooms-querying-and-dashboard-publishing-using-aws-step-functions-and-amazon-quicksight-part-2/

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. For example, during the COVID-19 pandemic, access to timely data insights was critically important for public health agencies worldwide as they coordinated emergency response efforts. Up-to-date information and analysis empowered organizations to monitor the rapidly changing situation and direct resources accordingly.

This is the second post in this series; we recommend that you read this first post before diving deep into this solution. In our first post, Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1 , we showed how public health agencies can create AWS Clean Room collaborations, invite other stakeholders to join the collaboration, and run queries on their collective data without either party having to share or copy underlying data with each other. As mentioned in the previous blog, AWS Clean Rooms enables multiple organizations to analyze their data and unlock insights they can act upon, without having to share sensitive, restricted, or proprietary records.

However, public health organizations leaders and decision-making officials don’t directly access data collaboration outputs from their Amazon Simple Storage Service (Amazon S3) buckets. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

To ensure these dashboards showcase the most updated insights, the organization builders and data architects need to catalog and update AWS Clean Rooms collaboration outputs on an ongoing basis, which often involves repetitive and manual processes that, if not done well, could delay your organization’s access to the latest data insights.

Manually handling repetitive daily tasks at scale poses risks like delayed insights, miscataloged outputs, or broken dashboards. At a large volume, it would require around-the-clock staffing, straining budgets. This manual approach could expose decision-makers to inaccurate or outdated information.

Automating repetitive workflows, validation checks, and programmatic dashboard refreshes removes human bottlenecks and help decrease inaccuracies. Automation helps ensure continuous, reliable processes that deliver the most current data insights to leaders without delays, all while streamlining resources.

In this post, we explain an automated workflow using AWS Step Functions and Amazon QuickSight to help organizations access the most current results and analyses, without delays from manual data handling steps. This workflow implementation will empower decision-makers with real-time visibility into the evolving collaborative analysis outputs, ensuring they have up-to-date, relevant insights that they can act upon quickly

Solution overview

The following reference architecture illustrates some of the foundational components of clean rooms query automation and publishing dashboards using AWS services. We automate running queries using Step Functions with Amazon EventBridge schedules, build an AWS Glue Data Catalog on query outputs, and publish dashboards using QuickSight so they automatically refresh with new data. This allows public health teams to monitor the most recent insights without manual updates.

The architecture consists of the following components, as numbered in the preceding figure:

A scheduled event rule on EventBridge triggers a Step Functions workflow.
The Step Functions workflow initiates the run of a query using the StartProtectedQuery AWS Clean Rooms API. The submitted query runs securely within the AWS Clean Rooms environment, ensuring data privacy and compliance. The results of the query are then stored in a designated S3 bucket, with a unique protected query ID serving as the prefix for the stored data. This unique identifier is generated by AWS Clean Rooms for each query run, maintaining clear segregation of results.
When the AWS Clean Rooms query is successfully complete, the Step Functions workflow calls the AWS Glue API to update the location of the table in the AWS Glue Data Catalog with the Amazon S3 location where the query results were uploaded in Step 2.
Amazon Athena uses the catalog from the Data Catalog to query the information using standard SQL.
QuickSight is used to query, build visualizations, and publish dashboards using the data from the query results.

Prerequisites

For this walkthrough, you need the following:

An AWS account.
AWS Management Console access to launch AWS CloudFormation templates.
A QuickSight account.
An AWS Clean rooms collaboration. For this post, we use the membership ID for the collaboration created in Part 1 of this series. You can locate this on the AWS Clean Rooms console, on your collaboration Details tab.

Launch the CloudFormation stack

In this post, we provide a CloudFormation template to create the following resources:

An EventBridge rule that triggers the Step Functions state machine on a schedule
An AWS Glue database and a catalog table
An Athena workgroup
Three S3 buckets:
- For AWS Clean Rooms to upload the results of query runs
- For Athena to upload the results for the queries
- For storing access logs of other buckets
A Step Functions workflow designed to run the AWS Clean Rooms query, upload the results to an S3 bucket, and update the table location with the S3 path in the AWS Glue Data Catalog
An AWS Key Management Service (AWS KMS) customer-managed key to encrypt the data in S3 buckets
AWS Identity and Access Management (IAM) roles and policies with the necessary permissions

To create the necessary resources, complete the following steps:

Choose Launch Stack:

Enter cleanrooms-query-automation-blog for Stack name.
Enter the membership ID from the AWS Clean Rooms collaboration you created in Part 1 of this series.
Choose Next.

Choose Next again.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.

After you run the CloudFormation template and create the resources, you can find the following information on the stack Outputs tab on the AWS CloudFormation console:

AthenaWorkGroup – The Athena workgroup
EventBridgeRule – The EventBridge rule triggering the Step Functions state machine
GlueDatabase – The AWS Glue database
GlueTable – The AWS Glue table storing metadata for AWS Clean Rooms query results
S3Bucket – The S3 bucket where AWS Clean Rooms uploads query results
StepFunctionsStateMachine – The Step Functions state machine

Test the solution

The EventBridge rule named cleanrooms_query_execution_Stepfunctions_trigger is scheduled to trigger every 1 hour. When this rule is triggered, it initiates the run of the CleanRoomsBlogStateMachine-XXXXXXX Step Functions state machine. Complete the following steps to test the end-to-end flow of this solution:

On the Step Functions console, navigate to the state machine you created.
On the state machine details page, locate the latest query run.

The details page lists the completed steps:

The state machine submits a query to AWS Clean Rooms using the startProtectedQuery API. The output of the API includes the query run ID and its status.
The state machine waits for 30 seconds before checking the status of the query run.
After 30 seconds, the state machine checks the query status using the getProtectedQuery API. When the status changes to SUCCESS, it proceeds to the next step to retrieve the AWS Glue table metadata information. The output of this step contains the S3 location to which the query run results are uploaded.
The state machine retrieves the metadata of the AWS Glue table named patientimmunization, which was created via the CloudFormation stack.
The state machine updates the S3 location (the location to which AWS Clean Rooms uploaded the results) in the metadata of the AWS Glue table.
After a successful update of the AWS Glue table metadata, the state machine is complete.

On the Athena console, switch the workgroup to CustomWorkgroup.
Run the following query:

“SELECT * FROM "cleanrooms_patientdb "."patientimmunization" limit 10;"

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Let’s start by granting QuickSight access to the S3 bucket where your AWS Clean Rooms query results are stored.

Grant QuickSight access to Athena and your S3 bucket

First, grant QuickSight access to the S3 bucket:

Sign in to the QuickSight console.
Choose your user name, then choose Manage QuickSight.
Choose Security and permissions.
For QuickSight access to AWS services, choose Manage.
For Amazon S3, choose Select S3 buckets, and choose the S3 bucket named cleanrooms-query-execution-results -XX-XXXX-XXXXXXXXXXXX (XXXXX represents the AWS Region and account number where the solution is deployed).
Choose Save.

Create your datasets and publish visuals

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena tables.

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Select Athena.
Enter a name for your dataset.
Choose Create data source.
Choose the AWS Glue database cleanrooms_patientdb and select the table PatientImmunization.
Select Directly query your data.
Choose Visualize.

On the Analysis tab, choose the visual type of your choice and add visuals.

Clean up

Complete the following steps to clean up your resources when you no longer need this solution:

Manually delete the S3 buckets and the data stored in the bucket.
Delete the CloudFormation templates.
Delete the QuickSight analysis.
Delete the data source.

Conclusion

In this post, we demonstrated how to automate running AWS Clean Rooms queries using an API call from Step Functions. We also showed how to update the query results information on the existing AWS Glue table, query the information using Athena, and create visuals using QuickSight.

The automated workflow solution delivers real-time insights from AWS Clean Rooms collaborations to decision makers through automated checks for new outputs, processing, and Amazon QuickSight dashboard refreshes. This eliminates manual handling tasks, enabling faster data-driven decisions based on latest analyses. Additionally, automation frees up staff resources to focus on more strategic initiatives rather than repetitive updates.

Contact the public sector team directly to learn more about how to set up this solution, or reach out to your AWS account team to engage on a proof of concept of this solution for your organization.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes, and collaborate with any other company on the AWS Cloud to generate unique insights about advertising campaigns, investment decisions, and research and development.

The AWS Clean Rooms team is continually building new features to help you collaborate. Watch this video to learn more about privacy-enhanced collaboration with AWS Clean Rooms.

Check out more AWS Partners or contact an AWS Representative to know how we can help accelerate your business.

Additional resources

About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

Mastering market dynamics: Transforming transaction cost analytics with ultra-precise Tick History – PCAP and Amazon Athena for Apache Spark

2024-01-31 Pramod Nayak

Post Syndicated from Pramod Nayak original https://aws.amazon.com/blogs/big-data/mastering-market-dynamics-transforming-transaction-cost-analytics-with-ultra-precise-tick-history-pcap-and-amazon-athena-for-apache-spark/

This post is cowritten with Pramod Nayak, LakshmiKanth Mannem and Vivek Aggarwal from the Low Latency Group of LSEG.

Transaction cost analysis (TCA) is widely used by traders, portfolio managers, and brokers for pre-trade and post-trade analysis, and helps them measure and optimize transaction costs and the effectiveness of their trading strategies. In this post, we analyze options bid-ask spreads from the LSEG Tick History – PCAP dataset using Amazon Athena for Apache Spark. We show you how to access data, define custom functions to apply on data, query and filter the dataset, and visualize the results of the analysis, all without having to worry about setting up infrastructure or configuring Spark, even for large datasets.

Background

Options Price Reporting Authority (OPRA) serves as a crucial securities information processor, collecting, consolidating, and disseminating last sale reports, quotes, and pertinent information for US Options. With 18 active US Options exchanges and over 1.5 million eligible contracts, OPRA plays a pivotal role in providing comprehensive market data.

On February 5, 2024, the Securities Industry Automation Corporation (SIAC) is set to upgrade the OPRA feed from 48 to 96 multicast channels. This enhancement aims to optimize symbol distribution and line capacity utilization in response to escalating trading activity and volatility in the US options market. SIAC has recommended that firms prepare for peak data rates of up to 37.3 GBits per second.

Despite the upgrade not immediately altering the total volume of published data, it enables OPRA to disseminate data at a significantly faster rate. This transition is crucial for addressing the demands of the dynamic options market.

OPRA stands out as one the most voluminous feeds, with a peak of 150.4 billion messages in a single day in Q3 2023 and a capacity headroom requirement of 400 billion messages over a single day. Capturing every single message is critical for transaction cost analytics, market liquidity monitoring, trading strategy evaluation, and market research.

About the data

LSEG Tick History – PCAP is a cloud-based repository, exceeding 30 PB, housing ultra-high-quality global market data. This data is meticulously captured directly within the exchange data centers, employing redundant capture processes strategically positioned in major primary and backup exchange data centers worldwide. LSEG’s capture technology ensures lossless data capture and uses a GPS time-source for nanosecond timestamp precision. Additionally, sophisticated data arbitrage techniques are employed to seamlessly fill any data gaps. Subsequent to capture, the data undergoes meticulous processing and arbitration, and is then normalized into Parquet format using LSEG’s Real Time Ultra Direct (RTUD) feed handlers.

The normalization process, which is integral to preparing the data for analysis, generates up to 6 TB of compressed Parquet files per day. The massive volume of data is attributed to the encompassing nature of OPRA, spanning multiple exchanges, and featuring numerous options contracts characterized by diverse attributes. Increased market volatility and market making activity on the options exchanges further contribute to the volume of data published on OPRA.

The attributes of Tick History – PCAP enable firms to conduct various analyses, including the following:

Pre-trade analysis – Evaluate potential trade impact and explore different execution strategies based on historical data
Post-trade evaluation – Measure actual execution costs against benchmarks to assess the performance of execution strategies
Optimized execution – Fine-tune execution strategies based on historical market patterns to minimize market impact and reduce overall trading costs
Risk management – Identify slippage patterns, identify outliers, and proactively manage risks associated with trading activities
Performance attribution – Separate the impact of trading decisions from investment decisions when analyzing portfolio performance

The LSEG Tick History – PCAP dataset is available in AWS Data Exchange and can be accessed on AWS Marketplace. With AWS Data Exchange for Amazon S3, you can access PCAP data directly from LSEG’s Amazon Simple Storage Service (Amazon S3) buckets, eliminating the need for firms to store their own copy of the data. This approach streamlines data management and storage, providing clients immediate access to high-quality PCAP or normalized data with ease of use, integration, and substantial data storage savings.

Athena for Apache Spark

For analytical endeavors, Athena for Apache Spark offers a simplified notebook experience accessible through the Athena console or Athena APIs, allowing you to build interactive Apache Spark applications. With an optimized Spark runtime, Athena helps the analysis of petabytes of data by dynamically scaling the number of Spark engines is less than a second. Moreover, common Python libraries such as pandas and NumPy are seamlessly integrated, allowing for the creation of intricate application logic. The flexibility extends to the importation of custom libraries for use in notebooks. Athena for Spark accommodates most open-data formats and is seamlessly integrated with the AWS Glue Data Catalog.

Dataset

For this analysis, we used the LSEG Tick History – PCAP OPRA dataset from May 17, 2023. This dataset comprises the following components:

Best bid and offer (BBO) – Reports the highest bid and lowest ask for a security at a given exchange
National best bid and offer (NBBO) – Reports the highest bid and lowest ask for a security across all exchanges
Trades – Records completed trades across all exchanges

The dataset involves the following data volumes:

Trades – 160 MB distributed across approximately 60 compressed Parquet files
BBO – 2.4 TB distributed across approximately 300 compressed Parquet files
NBBO – 2.8 TB distributed across approximately 200 compressed Parquet files

Analysis overview

Analyzing OPRA Tick History data for Transaction Cost Analysis (TCA) involves scrutinizing market quotes and trades around a specific trade event. We use the following metrics as part of this study:

Quoted spread (QS) – Calculated as the difference between the BBO ask and the BBO bid
Effective spread (ES) – Calculated as the difference between the trade price and the midpoint of the BBO (BBO bid + (BBO ask – BBO bid)/2)
Effective/quoted spread (EQF) – Calculated as (ES / QS) * 100

We calculate these spreads before the trade and additionally at four intervals after the trade (just after, 1 second, 10 seconds, and 60 seconds after the trade).

Configure Athena for Apache Spark

To configure Athena for Apache Spark, complete the following steps:

On the Athena console, under Get started, select Analyze your data using PySpark and Spark SQL.
If this is your first time using Athena Spark, choose Create workgroup.
For Workgroup name¸ enter a name for the workgroup, such as tca-analysis.
In the Analytics engine section, select Apache Spark.
In the Additional configurations section, you can choose Use defaults or provide a custom AWS Identity and Access Management (IAM) role and Amazon S3 location for calculation results.
Choose Create workgroup.
After you create the workgroup, navigate to the Notebooks tab and choose Create notebook.
Enter a name for your notebook, such as tca-analysis-with-tick-history.
Choose Create to create your notebook.

Launch your notebook

If you have already created a Spark workgroup, select Launch notebook editor under Get started.

After your notebook is created, you will be redirected to the interactive notebook editor.

Now we can add and run the following code to our notebook.

Create an analysis

Complete the following steps to create an analysis:

Import common libraries:

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

Create our data frames for BBO, NBBO, and trades:

bbo_quote = spark.read.parquet(f"s3://<bucket>/mt=bbo_quote/f=opra/dt=2023-05-17/*")
bbo_quote.createOrReplaceTempView("bbo_quote")
nbbo_quote = spark.read.parquet(f"s3://<bucket>/mt=nbbo_quote/f=opra/dt=2023-05-17/*")
nbbo_quote.createOrReplaceTempView("nbbo_quote")
trades = spark.read.parquet(f"s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/29_1.parquet")
trades.createOrReplaceTempView("trades")

Now we can identify a trade to use for transaction cost analysis:

filtered_trades = spark.sql("select Product, Price,Quantity, ReceiptTimestamp, MarketParticipant from trades")

We get the following output:

+---------------------+---------------------+---------------------+-------------------+-----------------+ 
|Product |Price |Quantity |ReceiptTimestamp |MarketParticipant| 
+---------------------+---------------------+---------------------+-------------------+-----------------+ 
|QQQ 230518C00329000|1.1700000000000000000|10.0000000000000000000|1684338565538021907,NYSEArca|
|QQQ 230518C00329000|1.1700000000000000000|20.0000000000000000000|1684338576071397557,NASDAQOMXPHLX|
|QQQ 230518C00329000|1.1600000000000000000|1.0000000000000000000|1684338579104713924,ISE|
|QQQ 230518C00329000|1.1400000000000000000|1.0000000000000000000|1684338580263307057,NASDAQOMXBX_Options|
|QQQ 230518C00329000|1.1200000000000000000|1.0000000000000000000|1684338581025332599,ISE|
+---------------------+---------------------+---------------------+-------------------+-----------------+

We use the highlighted trade information going forward for the trade product (tp), trade price (tpr), and trade time (tt).

Here we create a number of helper functions for our analysis

def calculate_es_qs_eqf(df, trade_price):
    df['BidPrice'] = df['BidPrice'].astype('double')
    df['AskPrice'] = df['AskPrice'].astype('double')
    df["ES"] = ((df["AskPrice"]-df["BidPrice"])/2) - trade_price
    df["QS"] = df["AskPrice"]-df["BidPrice"]
    df["EQF"] = (df["ES"]/df["QS"])*100
    return df

def get_trade_before_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].groupby(groupby_col).last()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_trade_after_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].groupby(groupby_col).first()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_nbbo_trade_before_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].iloc[-1:]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

def get_nbbo_trade_after_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].iloc[:1]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

In the following function, we create the dataset that contains all the quotes before and after the trade. Athena Spark automatically determines how many DPUs to launch for processing our dataset.

def get_tca_analysis_via_df_single_query(trade_product, trade_price, trade_time):
    # BBO quotes
    bbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, MarketParticipant FROM bbo_quote where Product = '{trade_product}';")
    bbos = bbos.toPandas()

    bbo_just_before = get_trade_before_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_just_after = get_trade_after_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_1s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=1, groupby_col='MarketParticipant')
    bbo_10s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=10, groupby_col='MarketParticipant')
    bbo_60s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=60, groupby_col='MarketParticipant')
    
    all_bbos = pd.concat([bbo_just_before, bbo_just_after, bbo_1s_after, bbo_10s_after, bbo_60s_after], ignore_index=True, sort=False)
    bbos_calculated = calculate_es_qs_eqf(all_bbos, trade_price)

    #NBBO quotes
    nbbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, BestBidParticipant, BestAskParticipant FROM nbbo_quote where Product = '{trade_product}';")
    nbbos = nbbos.toPandas()

    nbbo_just_before = get_nbbo_trade_before_n_seconds(trade_time,nbbos, seconds=0)
    nbbo_just_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_1s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=1)
    nbbo_10s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=10)
    nbbo_60s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=60)

    all_nbbos = pd.concat([nbbo_just_before, nbbo_just_after, nbbo_1s_after, nbbo_10s_after, nbbo_60s_after], ignore_index=True, sort=False)
    nbbos_calculated = calculate_es_qs_eqf(all_nbbos, trade_price)

    calc = pd.concat([bbos_calculated, nbbos_calculated], ignore_index=True, sort=False)
    
    return calc

Now let’s call the TCA analysis function with the information from our selected trade:

tp = "QQQ 230518C00329000"
tpr = 1.16
tt = 1684338579104713924
c = get_tca_analysis_via_df_single_query(tp, tpr, tt)

Visualize the analysis results

Now let’s create the data frames we use for our visualization. Each data frame contains quotes for one of the five time intervals for each data feed (BBO, NBBO):

bbo = c[c['MarketParticipant'].isin(['BBO'])]
bbo_bef = bbo[bbo['ReceiptTimestamp'] < tt]
bbo_aft_0 = bbo[bbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
bbo_aft_1 = bbo[bbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
bbo_aft_10 = bbo[bbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
bbo_aft_60 = bbo[bbo['ReceiptTimestamp'] > (tt+60000000000)]

nbbo = c[~c['MarketParticipant'].isin(['BBO'])]
nbbo_bef = nbbo[nbbo['ReceiptTimestamp'] < tt]
nbbo_aft_0 = nbbo[nbbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
nbbo_aft_1 = nbbo[nbbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
nbbo_aft_10 = nbbo[nbbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
nbbo_aft_60 = nbbo[nbbo['ReceiptTimestamp'] > (tt+60000000000)]

In the following sections, we provide example code to create different visualizations.

Plot QS and NBBO before the trade

Use the following code to plot the quoted spread and NBBO before the trade:

fig = px.bar(title="Quoted Spread Before The Trade",
    x=bbo_bef.MarketParticipant,
    y=bbo_bef['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_bef.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each market and NBBO after the trade

Use the following code to plot the quoted spread for each market and NBBO immediately after the trade:

fig = px.bar(title="Quoted Spread After The Trade",
    x=bbo_aft_0.MarketParticipant,
    y=bbo_aft_0['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(
    y=nbbo_aft_0.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each time interval and each market for BBO

Use the following code to plot the quoted spread for each time interval and each market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['QS']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['QS']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['QS']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['QS']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['QS'])])
fig.update_layout(barmode='group',title="BBO Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Quoted Spread'})
%plotly fig

Plot ES for each time interval and market for BBO

Use the following code to plot the effective spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['ES']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['ES']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['ES']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['ES']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['ES'])])
fig.update_layout(barmode='group',title="BBO Effective Spread Per Market/TimeFrame",
    xaxis={'title':'Market'}, 
    yaxis={'title':'Effective Spread'})
%plotly fig

Plot EQF for each time interval and market for BBO

Use the following code to plot the effective/quoted spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['EQF']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['EQF']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['EQF']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['EQF']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['EQF'])])
fig.update_layout(barmode='group',title="BBO Effective/Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'}, 
    yaxis={'title':'Effective/Quoted Spread'})
%plotly fig

Athena Spark calculation performance

When you run a code block, Athena Spark automatically determines how many DPUs it requires to complete the calculation. In the last code block, where we call the tca_analysis function, we are actually instructing Spark to process the data, and we then convert the resulting Spark dataframes into Pandas dataframes. This constitutes the most intensive processing part of the analysis, and when Athena Spark runs this block, it shows the progress bar, elapsed time, and how many DPUs are processing data currently. For example, in the following calculation, Athena Spark is utilizing 18 DPUs.

When you configure your Athena Spark notebook, you have the option of setting the maximum number of DPUs that it can use. The default is 20 DPUs, but we tested this calculation with 10, 20, and 40 DPUs to demonstrate how Athena Spark automatically scales to run our analysis. We observed that Athena Spark scales linearly, taking 15 minutes and 21 seconds when the notebook was configured with a maximum of 10 DPUs, 8 minutes and 23 seconds when the notebook was configured with 20 DPUs, and 4 minutes and 44 seconds when the notebook was configured with 40 DPUs. Because Athena Spark charges based on DPU usage, at a per-second granularity, the cost of these calculations is similar, but if you set a higher maximum DPU value, Athena Spark can return the result of the analysis much faster. For more details on Athena Spark pricing please click here.

Conclusion

In this post, we demonstrated how you can use high-fidelity OPRA data from LSEG’s Tick History-PCAP to perform transaction cost analytics using Athena Spark. The availability of OPRA data in a timely manner, complemented with accessibility innovations of AWS Data Exchange for Amazon S3, strategically reduces the time to analytics for firms looking to create actionable insights for critical trading decisions. OPRA generates about 7 TB of normalized Parquet data each day, and managing the infrastructure to provide analytics based on OPRA data is challenging.

Athena’s scalability in handling large-scale data processing for Tick History – PCAP for OPRA data makes it a compelling choice for organizations seeking swift and scalable analytics solutions in AWS. This post shows the seamless interaction between the AWS ecosystem and Tick History-PCAP data and how financial institutions can take advantage of this synergy to drive data-driven decision-making for critical trading and investment strategies.

About the Authors

Pramod Nayak is the Director of Product Management of the Low Latency Group at LSEG. Pramod has over 10 years of experience in the financial technology industry, focusing on software development, analytics, and data management. Pramod is a former software engineer and passionate about market data and quantitative trading.

LakshmiKanth Mannem is a Product Manager in the Low Latency Group of LSEG. He focuses on data and platform products for the low-latency market data industry. LakshmiKanth helps customers build the most optimal solutions for their market data needs.

Vivek Aggarwal is a Senior Data Engineer in the Low Latency Group of LSEG. Vivek works on developing and maintaining data pipelines for processing and delivery of captured market data feeds and reference data feeds.

Alket Memushaj is a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy, working with partners and customers to deploy even the most demanding capital markets workloads to the AWS Cloud.

Use Amazon Athena with Spark SQL for your open-source transactional table formats

2024-01-24 Pathik Shah

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/use-amazon-athena-with-spark-sql-for-your-open-source-transactional-table-formats/

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data consistent with business events. To ensure files are updated in a transactionally consistent manner, a growing number of customers are using open-source transactional table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake that help you store data with high compression rates, natively interface with your applications and frameworks, and simplify incremental data processing in data lakes built on Amazon S3. These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, and advanced features such as time travel and snapshots that were previously only available in data warehouses. Each storage format implements this functionality in slightly different ways; for a comparison, refer to Choosing an open table format for your transactional data lake on AWS.

In 2023, AWS announced general availability for Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake in Amazon Athena for Apache Spark, which removes the need to install a separate connector or associated dependencies and manage versions, and simplifies the configuration steps required to use these frameworks.

In this post, we show you how to use Spark SQL in Amazon Athena notebooks and work with Iceberg, Hudi, and Delta Lake table formats. We demonstrate common operations such as creating databases and tables, inserting data into the tables, querying data, and looking at snapshots of the tables in Amazon S3 using Spark SQL in Athena.

Prerequisites

Complete the following prerequisites:

Make sure you meet all the prerequisites mentioned in Run Spark SQL on Amazon Athena Spark.
Create a database called sparkblogdb and a table called noaa_pq created in the AWS Glue Data Catalog as detailed in the post Run Spark SQL on Amazon Athena Spark.
Grant the AWS Identity and Access Management (IAM) role used in the Athena workgroup read and write permissions to an S3 bucket and prefix. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.
Grant the IAM role used in the Athena workgroup s3:DeleteObject permission to an S3 bucket and prefix for cleanup. For more information, refer to the Delete Object permissions section in Amazon S3 actions.

Download and import example notebooks from Amazon S3

To follow along, download the notebooks discussed in this post from the following locations:

Iceberg tutorial notebook: s3://athena-examples-us-east-1/athenasparksqlblog/notebooks/SparkSQL_iceberg.ipynb
Hudi tutorial notebook: s3://athena-examples-us-east-1/athenasparksqlblog/notebooks/SparkSQL_hudi.ipynb
Delta tutorial notebook: s3://athena-examples-us-east-1/athenasparksqlblog/notebooks/SparkSQL_delta.ipynb

After you download the notebooks, import them into your Athena Spark environment by following the To import a notebook section in Managing notebook files.

Navigate to specific Open Table Format section

If you are interested in Iceberg table format, navigate to Working with Apache Iceberg tables section.

If you are interested in Hudi table format, navigate to Working with Apache Hudi tables section.

If you are interested in Delta Lake table format, navigate to Working with Linux foundation Delta Lake tables section.

Working with Apache Iceberg tables

When using Spark notebooks in Athena, you can run SQL queries directly without having to use PySpark. We do this by using cell magics, which are special headers in a notebook cell that change the cell’s behavior. For SQL, we can add the %%sql magic, which will interpret the entire cell contents as a SQL statement to be run on Athena.

In this section, we show how you can use SQL on Apache Spark for Athena to create, analyze, and manage Apache Iceberg tables.

Set up a notebook session

In order to use Apache Iceberg in Athena, while creating or editing a session, select the Apache Iceberg option by expanding the Apache Spark properties section. It will pre-populate the properties as shown in the following screenshot.

This image shows the Apache Iceberg properties set while creating Spak session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section is available in the SparkSQL_iceberg.ipynb file to follow along.

Create a database and Iceberg table

First, we create a database in the AWS Glue Data Catalog. With the following SQL, we can create a database called icebergdb:

%%sql
CREATE DATABASE icebergdb

Next, in the database icebergdb, we create an Iceberg table called noaa_iceberg pointing to a location in Amazon S3 where we will load the data. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE icebergdb.noaa_iceberg(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string)
USING iceberg
PARTITIONED BY (year string)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaaiceberg/'

Insert data into the table

To populate the noaa_iceberg Iceberg table, we insert data from the Parquet table sparkblogdb.noaa_pq that was created as part of the prerequisites. You can do this using an INSERT INTO statement in Spark:

%%sql
INSERT INTO icebergdb.noaa_iceberg select * from sparkblogdb.noaa_pq

Alternatively, you can use CREATE TABLE AS SELECT with the USING iceberg clause to create an Iceberg table and insert data from a source table in one step:

%%sql
CREATE TABLE icebergdb.noaa_iceberg
USING iceberg
PARTITIONED BY (year)
AS SELECT * FROM sparkblogdb.noaa_pq

Query the Iceberg table

Now that the data is inserted in the Iceberg table, we can start analyzing it. Let’s run a Spark SQL to find the minimum recorded temperature by year for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, min(MIN) as minimum_temperature
from icebergdb.noaa_iceberg
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

We get following output.

Image shows output of first select query

Update data in the Iceberg table

Let’s look at how to update data in our table. We want to update the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea-Tac'. Using Spark SQL, we can run an UPDATE statement against the Iceberg table:

%%sql
UPDATE icebergdb.noaa_iceberg
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We can then run the previous SELECT query to find the minimum recorded temperature for the 'Sea-Tac' location:

%%sql
select name, year, min(MIN) as minimum_temperature
from icebergdb.noaa_iceberg
where name = 'Sea-Tac'
group by 1,2

We get the following output.

Image shows output of second select query

Compact data files

Open table formats like Iceberg work by creating delta changes in file storage, and tracking the versions of rows through manifest files. More data files leads to more metadata stored in manifest files, and small data files often cause an unnecessary amount of metadata, resulting in less efficient queries and higher Amazon S3 access costs. Running Iceberg’s rewrite_data_files procedure in Spark for Athena will compact data files, combining many small delta change files into a smaller set of read-optimized Parquet files. Compacting files speeds up the read operation when queried. To run compaction on our table, run the following Spark SQL:

%%sql
CALL spark_catalog.system.rewrite_data_files
(table => 'icebergdb.noaa_iceberg', strategy=>'sort', sort_order => 'zorder(name)')

rewrite_data_files offers options to specify your sort strategy, which can help reorganize and compact data.

List table snapshots

Each write, update, delete, upsert, and compaction operation on an Iceberg table creates a new snapshot of a table while keeping the old data and metadata around for snapshot isolation and time travel. To list the snapshots of an Iceberg table, run the following Spark SQL statement:

%%sql
SELECT *
FROM spark_catalog.icebergdb.noaa_iceberg.snapshots

Expire old snapshots

Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small. It will never remove files that are still required by a non-expired snapshot. In Spark for Athena, run the following SQL to expire snapshots for the table icebergdb.noaa_iceberg that are older than a specific timestamp:

%%sql
CALL spark_catalog.system.expire_snapshots
('icebergdb.noaa_iceberg', TIMESTAMP '2023-11-30 00:00:00.000')

Note that the timestamp value is specified as a string in format yyyy-MM-dd HH:mm:ss.fff. The output will give a count of the number of data and metadata files deleted.

Drop the table and database

You can run the following Spark SQL to clean up the Iceberg tables and associated data in Amazon S3 from this exercise:

%%sql
DROP TABLE icebergdb.noaa_iceberg PURGE

Run the following Spark SQL to remove the database icebergdb:

%%sql
DROP DATABASE icebergdb

To learn more about all the operations you can perform on Iceberg tables using Spark for Athena, refer to Spark Queries and Spark Procedures in the Iceberg documentation.

Working with Apache Hudi tables

Next, we show how you can use SQL on Spark for Athena to create, analyze, and manage Apache Hudi tables.

Set up a notebook session

In order to use Apache Hudi in Athena, while creating or editing a session, select the Apache Hudi option by expanding the Apache Spark properties section.

This image shows the Apache Hudi properties set while creating Spak session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section should be available in the SparkSQL_hudi.ipynb file to follow along.

Create a database and Hudi table

First, we create a database called hudidb that will be stored in the AWS Glue Data Catalog followed by Hudi table creation:

%%sql
CREATE DATABASE hudidb

We create a Hudi table pointing to a location in Amazon S3 where we will load the data. Note that the table is of copy-on-write type. It is defined by type= 'cow' in the table DDL. We have defined station and date as the multiple primary keys and preCombinedField as year. Also, the table is partitioned on year. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE hudidb.noaa_hudi(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string,
year string)
USING HUDI
PARTITIONED BY (year)
TBLPROPERTIES(
primaryKey = 'station, date',
preCombineField = 'year',
type = 'cow'
)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaahudi/'

Insert data into the table

Like with Iceberg, we use the INSERT INTO statement to populate the table by reading data from the sparkblogdb.noaa_pq table created in the previous post:

%%sql
INSERT INTO hudidb.noaa_hudi select * from sparkblogdb.noaa_pq

Query the Hudi table

Now that the table is created, let’s run a query to find the maximum recorded temperature for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

Update data in the Hudi table

Let’s change the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea–Tac'. We can run an UPDATE statement on Spark for Athena to update the records of the noaa_hudi table:

%%sql
UPDATE hudidb.noaa_hudi
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We run the previous SELECT query to find the maximum recorded temperature for the 'Sea-Tac' location:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi
where name = 'Sea-Tac'
group by 1,2

Run time travel queries

We can use time travel queries in SQL on Athena to analyze past data snapshots. For example:

%%sql
select name, year, max(MAX) as maximum_temperature
from hudidb.noaa_hudi timestamp as of '2023-12-01 23:53:43.100'
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

This query checks the Seattle Airport temperature data as of a specific time in the past. The timestamp clause lets us travel back without altering current data. Note that the timestamp value is specified as a string in format yyyy-MM-dd HH:mm:ss.fff.

Optimize query speed with clustering

To improve query performance, you can perform clustering on Hudi tables using SQL in Spark for Athena:

%%sql
CALL run_clustering(table => 'hudidb.noaa_hudi', order => 'name')

Compact tables

Compaction is a table service employed by Hudi specifically in Merge On Read (MOR) tables to merge updates from row-based log files to the corresponding columnar-based base file periodically to produce a new version of the base file. Compaction is not applicable to Copy On Write (COW) tables and only applies to MOR tables. You can run the following query in Spark for Athena to perform compaction on MOR tables:

%%sql
CALL run_compaction(op => 'run', table => 'hudi_table_mor');

Drop the table and database

Run the following Spark SQL to remove the Hudi table you created and associated data from the Amazon S3 location:

%%sql
DROP TABLE hudidb.noaa_hudi PURGE

Run the following Spark SQL to remove the database hudidb:

%%sql
DROP DATABASE hudidb

To learn about all the operations you can perform on Hudi tables using Spark for Athena, refer to SQL DDL and Procedures in the Hudi documentation.

Working with Linux foundation Delta Lake tables

Next, we show how you can use SQL on Spark for Athena to create, analyze, and manage Delta Lake tables.

Set up a notebook session

In order to use Delta Lake in Spark for Athena, while creating or editing a session, select Linux Foundation Delta Lake by expanding the Apache Spark properties section.

This image shows the Delta Lake properties set while creating Spak session in Athena.

For steps, see Editing session details or Creating your own notebook.

The code used in this section should be available in the SparkSQL_delta.ipynb file to follow along.

Create a database and Delta Lake table

In this section, we create a database in the AWS Glue Data Catalog. Using following SQL, we can create a database called deltalakedb:

%%sql
CREATE DATABASE deltalakedb

Next, in the database deltalakedb, we create a Delta Lake table called noaa_delta pointing to a location in Amazon S3 where we will load the data. Run the following statement and replace the location s3://<your-S3-bucket>/<prefix>/ with your S3 bucket and prefix:

%%sql
CREATE TABLE deltalakedb.noaa_delta(
station string,
date string,
latitude string,
longitude string,
elevation string,
name string,
temp string,
temp_attributes string,
dewp string,
dewp_attributes string,
slp string,
slp_attributes string,
stp string,
stp_attributes string,
visib string,
visib_attributes string,
wdsp string,
wdsp_attributes string,
mxspd string,
gust string,
max string,
max_attributes string,
min string,
min_attributes string,
prcp string,
prcp_attributes string,
sndp string,
frshtt string)
USING delta
PARTITIONED BY (year string)
LOCATION 's3://<your-S3-bucket>/<prefix>/noaadelta/'

Insert data into the table

We use an INSERT INTO statement to populate the table by reading data from the sparkblogdb.noaa_pq table created in the previous post:

%%sql
INSERT INTO deltalakedb.noaa_delta select * from sparkblogdb.noaa_pq

You can also use CREATE TABLE AS SELECT to create a Delta Lake table and insert data from a source table in one query.

Query the Delta Lake table

Now that the data is inserted in the Delta Lake table, we can start analyzing it. Let’s run a Spark SQL to find the minimum recorded temperature for the 'SEATTLE TACOMA AIRPORT, WA US' location:

%%sql
select name, year, max(MAX) as minimum_temperature
from deltalakedb.noaa_delta
where name = 'SEATTLE TACOMA AIRPORT, WA US'
group by 1,2

Update data in the Delta lake table

Let’s change the station name 'SEATTLE TACOMA AIRPORT, WA US' to 'Sea–Tac'. We can run an UPDATE statement on Spark for Athena to update the records of the noaa_delta table:

%%sql
UPDATE deltalakedb.noaa_delta
SET name = 'Sea-Tac'
WHERE name = 'SEATTLE TACOMA AIRPORT, WA US'

We can run the previous SELECT query to find the minimum recorded temperature for the 'Sea-Tac' location, and the result should be the same as earlier:

%%sql
select name, year, max(MAX) as minimum_temperature
from deltalakedb.noaa_delta
where name = 'Sea-Tac'
group by 1,2

Compact data files

In Spark for Athena, you can run OPTIMIZE on the Delta Lake table, which will compact the small files into larger files, so the queries are not burdened by the small file overhead. To perform the compaction operation, run the following query:

%%sql
OPTIMIZE deltalakedb.noaa_delta

Refer to Optimizations in the Delta Lake documentation for different options available while running OPTIMIZE.

Remove files no longer referenced by a Delta Lake table

You can remove files stored in Amazon S3 that are no longer referenced by a Delta Lake table and are older than the retention threshold by running the VACCUM command on the table using Spark for Athena:

%%sql
VACUUM deltalakedb.noaa_delta

Refer to Remove files no longer referenced by a Delta table in the Delta Lake documentation for options available with VACUUM.

Drop the table and database

Run the following Spark SQL to remove the Delta Lake table you created:

%%sql
DROP TABLE deltalakedb.noaa_delta

Run the following Spark SQL to remove the database deltalakedb:

%%sql
DROP DATABASE deltalakedb

Running DROP TABLE DDL on the Delta Lake table and database deletes the metadata for these objects, but doesn’t automatically delete the data files in Amazon S3. You can run the following Python code in the notebook’s cell to delete the data from the S3 location:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('<your-S3-bucket>')
bucket.objects.filter(Prefix="<prefix>/noaadelta/").delete()

To learn more about the SQL statements that you can run on a Delta Lake table using Spark for Athena, refer to the quickstart in the Delta Lake documentation.

Conclusion

This post demonstrated how to use Spark SQL in Athena notebooks to create databases and tables, insert and query data, and perform common operations like updates, compactions, and time travel on Hudi, Delta Lake, and Iceberg tables. Open table formats add ACID transactions, upserts, and deletes to data lakes, overcoming limitations of raw object storage. By removing the need to install separate connectors, Spark on Athena’s built-in integration reduces configuration steps and management overhead when using these popular frameworks for building reliable data lakes on Amazon S3. To learn more about selecting an open table format for your data lake workloads, refer to Choosing an open table format for your transactional data lake on AWS.

About the Authors

Pathik Shah is a Sr. Analytics Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Product Manager at AWS on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.

Design a data mesh on AWS that reflects the envisioned organization

2024-01-22 Claudia Chitu

Post Syndicated from Claudia Chitu original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-on-aws-that-reflects-the-envisioned-organization/

This post is written in collaboration with Claudia Chitu and Spyridon Dosis from ACAST.

Founded in 2014, Acast is the world’s leading independent podcast company, elevating podcast creators and podcast advertisers for the ultimate listening experience. By championing an independent and open ecosystem for podcasting, Acast aims to fuel podcasting with the tools and monetization needed to thrive.

The company uses AWS Cloud services to build data-driven products and scale engineering best practices. To ensure a sustainable data platform amid growth and profitability phases, their tech teams adopted a decentralized data mesh architecture.

In this post, we discuss how Acast overcame the challenge of coupled dependencies between teams working with data at scale by employing the concept of a data mesh.

The problem

With an accelerated growth and expansion, Acast encountered a challenge that resonates globally. Acast found itself with diverse business units and a vast amount of data generated across the organization. The existing monolith and centralized architecture was struggling to meet the growing demands of data consumers. Data engineers were finding it increasingly challenging to maintain and scale the data infrastructure, resulting in data access, data silos, and inefficiencies in data management. A key objective was to enhance the end-to-end user experience, starting from the business needs.

Acast needed to address these challenges in order to get to an operational scale, meaning a global maximum of the number of people that can independently operate and deliver value. In this case, Acast tried to tackle the challenge of this monolith structure and the high time to value for product teams, tech teams, end consumers. It’s worth mentioning that they also have other product and tech teams, including operational or business teams, without AWS accounts.

Acast has a variable number of product teams, continuously evolving by merging existing ones, splitting them, adding new people, or simply creating new teams. In the last 2 years, they have had between 10–20 teams, consisting of 4–10 people each. Each team owns at least two AWS accounts, up to 10 accounts, depending on the ownership. The majority of data produced by these accounts is used downstream for business intelligence (BI) purposes and in Amazon Athena, by hundreds of business users every day.

The solution Acast implemented is a data mesh, architected on AWS. The solution mirrors the organizational structure rather than an explicit architectural decision. As per the Inverse Conway Maneuver, Acast’s technology architecture displays isomorphism with the business architecture. In this case, the business users are enabled through the data mesh architecture to get faster time to insights and know directly who the domain specific owners are, speeding up collaboration. This will be further detailed when we discuss the AWS Identity and Access Management (IAM) roles used, because one of the roles is dedicated to the business group.

Parameters of success

Acast succeeded in bootstrapping and scaling a new team- and domain-oriented data product and its corresponding infrastructure and setup, resulting in less friction in gathering insights and happier users and consumers.

The success of the implementation meant assessing various aspects of the data infrastructure, data management, and business outcomes. They classified the metrics and indicators in the following categories:

Data usage – A clear understanding of who is consuming what data source, materialized with a mapping of consumers and producers. Discussions with users showed they were happier to have faster access to data in a simpler way, a more structured data organization, and a clear mapping of who the producer is. A lot of progress has been made to advance their data-driven culture (data literacy, data sharing, and collaboration across business units).
Data governance – With their service-level object stating when the data sources are available (among other details), teams know whom to notify and can do so in a shorter time when there is late data coming in or other issues with the data. With a data steward role in place, the ownership has been strengthened.
Data team productivity – Through engineering retrospectives, Acast found that their teams appreciate autonomy to make decisions regarding their data domains.
Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore cost reduction (in some accounts, removing the copy of data 100%), by reading data across accounts while enabling scaling.

Data mesh overview

A data mesh is a sociotechnical approach to build a decentralized data architecture by using a domain-oriented, self-serve design (in a software development perspective), and borrows Eric Evans’ theory of domain-driven design and Manuel Pais’ and Matthew Skelton’s theory of team topologies. It’s important to establish the context to understand what data mesh is because it sets the stage for the technical details that follow and can help you understand how the concepts discussed in this post fit into the broader framework of a data mesh.

To recap before diving deeper into Acast’s implementation, the data mesh concept is based on the following principles:

It’s domain driven, as opposed to pipelines as a first-class concern
It serves data as a product
It’s a good product that delights users (data is trustworthy, documentation is available, and it’s easily consumable)
It offers federated computational governance and decentralized ownership—a self-serve data platform

Domain-driven architecture

In Acast’s approach of owning the operational and analytical datasets, teams are structured with ownership based on domain, reading directly from the producer of the data, via an API or programmatically from Amazon S3 storage or using Athena as a SQL query engine. Some examples of Acast’s domains are presented in the following figure.

As illustrated in the preceding figure, some domains are loosely coupled to other domains’ operational or analytical endpoints, with a different ownership. Others might have stronger dependency, which is expected, for business (some podcasters can be also advertisers, creating sponsorship creatives and running campaigns for their own shows, or transacting ads using Acast’s software as a service).

Data as a product

Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers. These producer teams possess in-depth knowledge about their consumers, understanding how their data product is utilized. Any changes planned by the data producers are communicated in advance to all consumers. This proactive notification ensures that downstream processes are not disrupted. By providing consumers with advance notice, they have sufficient time to prepare for and adapt to the upcoming changes, maintaining a smooth and uninterrupted workflow. The producers run a new version of the initial dataset in parallel, notify the consumers individually, and discuss with them their necessary timeframe to start consuming the new version. When all consumers are using the new version, the producers make the initial version unavailable.

Data schemas are inferred from the common agreed-upon format to share files between teams, which is Parquet in the case of Acast. Data can be shared in files, batched or stream events, and more. Each team has its own AWS account acting as an independent and autonomous entity with its own infrastructure. For orchestration, they use the AWS Cloud Development Kit (AWS CDK) for infrastructure as code (IaC) and AWS Glue Data Catalogs for metadata management. Users can also raise requests to producers to improve the way the data is presented or to enrich the data with new data points for generating a higher business value.

With each team owning an AWS account and a data catalog ID from Athena, it’s straightforward to see this through the lenses of a distributed data lake on top of Amazon S3, with a common catalog mapping all the catalogs from all the accounts.

At the same time, each team can also map other catalogs to their own account and use their own data, which they produce along with the data from other accounts. Unless it is sensitive data, the data can be accessed programmatically or from the AWS Management Console in a self-service manner without being dependent on the data infrastructure engineers. This is a domain-agnostic, shared way to self-serve data. The product discovery happens through the catalog registration. Using only a few standards commonly agreed upon and adopted across the company, for the purpose of interoperability, Acast addressed the fragmented silos and friction to exchange data or consume domain-agnostic data.

With this principle, teams get assurance that the data is secure, trustworthy, and accurate, and appropriate access controls are managed at each domain level. Moreover, on the central account, roles are defined for different types of permissions and access, using AWS IAM Identity Center permissions. All datasets are discoverable from a single central account. The following figure illustrates how it’s instrumented, where two IAM roles are assumed by two types of user (consumer) groups: one that has access to a limited dataset, which is restricted data, and one that has access to non-restricted data. There is also a way to assume any of these roles, for service accounts, such as those used by data processing jobs in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), for example.

How Acast solved for high alignment and a loosely coupled architecture

The following diagram shows a conceptual architecture of how Acast’s teams are organizing data and collaborating with each other.

Acast used the Well-Architected Framework for the central account to improve its practice running analytical workloads in the cloud. Through the lenses of the tool, Acast was able to address better monitoring, cost optimization, performance, and security. It helped them understand the areas where they could improve their workloads and how to address common issues, with automated solutions, as well as how to measure the success, defining KPIs. It saved them time to get the learnings that otherwise would have been taking longer to find. Spyridon Dosis, Acast’s Information Security Officer, shares, “We are happy AWS is always ahead with releasing tools that enable the configuration, assessment, and review of multi-account setup. This is a big plus for us, working in a decentralized organization.” Spyridon also adds, “A very important concept we value is the AWS security defaults (e.g. default encryption for S3 buckets).”

In the architecture diagram, we can see that each team can be a data producer, except the team owning the central account, which serves as the central data platform, modeling the logic from multiple domains to paint the full business picture. All other teams can be data producers or data consumers. They can connect to the central account and discover datasets via the cross-account AWS Glue Data Catalog, analyze them in the Athena query editor or with Athena notebooks, or map the catalog to their own AWS account. Access to the central Athena catalog is implemented with IAM Identity Center, with roles for open data and restricted data access.

For non-sensitive data (open data), Acast uses a template where the datasets are by default open to the entire organization to read from, using a condition to provide the organization-assigned ID parameter, as shown in the following code snippet:

{
    "Version": "2012-10-17",
    "Statement": [
        
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
               "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*"  
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET",
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "ORG-ID-NUMBER"
                }
            }
        }
    ]
}

When handling sensitive data like financials, the teams use a collaborative data steward model. The data steward works with the requester to evaluate access justification for the intended use case. Together, they determine appropriate access methods to meet the need while maintaining security. This could include IAM roles, service accounts, or specific AWS services. This approach enables business users outside the tech organization (which means they don’t have an AWS account) to independently access and analyze the information they need. By granting access through IAM policies on AWS Glue resources and S3 buckets, Acast provides self-serve capabilities while still governing delicate data through human review. The data steward role has been valuable for understanding use cases, assessing security risks, and ultimately facilitating access that accelerates the business through analytical insights.

For Acast’s use case, granular row- or column-level access controls weren’t needed, so the approach sufficed. However, other organizations may require more fine-grained governance over sensitive data fields. In those cases, solutions like AWS Lake Formation could implement permissions needed, while still providing a self-serve data access model. For more information, refer to Design a data mesh architecture using AWS Lake Formation and AWS Glue.

At the same time, teams can read from other producers directly, from Amazon S3 or via an API, keeping the dependency at minimum, which enhances the velocity of development and delivery. Therefore, an account can be a producer and a consumer in parallel. Each team is autonomous, and is accountable for their own tech stack.

Additional learnings

What did Acast learn? So far, we’ve discussed that the architectural design is an effect of the organizational structure. Because the tech organization consists of multiple cross-functional teams, and it’s straightforward to bootstrap a new team, following the common principles of data mesh, Acast learned this doesn’t go seamlessly every time. To set up a fully new account in AWS, teams go through the same journey, but slightly different, considering their own set of particularities.

This can create certain frictions, and it’s difficult to get all data producing teams to reach a high maturity of being data producers. This can be explained by the different data competencies in those cross-functional teams and not being dedicated data teams.

By implementing the decentralized solution, Acast effectively tackled the scalability challenge by adapting their teams to align with evolving business needs. This approach ensures high decoupling and alignment. Furthermore, they strengthened ownership, significantly reducing the time needed to identify and resolve issues because the upstream source is readily known and easily accessible with specified SLAs. The volume of data support inquiries has seen a reduction of over 50%, because business users are empowered to gain faster insights. Notably, they successfully eliminated tens of terabytes of redundant storage that were previously copied solely to fulfill downstream requests. This achievement was made possible through the implementation of cross-account reading, leading to the removal of associated development and maintenance costs for these pipelines.

Conclusion

Acast used the Inverse Conway Maneuver law and employed AWS services where each cross-functional product team has its own AWS account to build a data mesh architecture that allows scalability, high ownership, and self-service data consumption. This has been working well for the company, regarding how data ownership and operations were approached, to meet their engineering principles, resulting in having the data mesh as an effect rather than a deliberate intent. For other organizations, the desired data mesh might look different and the approach might have other learnings.

To conclude, a modern data architecture on AWS allows you to efficiently construct data products and data mesh infrastructure at a low cost without compromising on performance.

The following are some examples of AWS services you can use to design your desired data mesh on AWS:

Use AWS Lake Formation to build a data mesh pattern at scale with tag-based access control
Use AWS Data Exchange to integrate third-party data into your data mesh
Use AWS Glue for sharing, hosting, and cataloging data products
Use AWS DataZone to govern data access across the organization and automate data discovery

About the Authors

Claudia Chitu is a Data strategist and an influential leader in the Analytics space. Focused on aligning data initiatives with the overall strategic goals of the organization, she employs data as a guiding force for long-term planning and sustainable growth.

Spyridon Dosis is an Information Security Professional in Acast. Spyridon supports the organization in designing, implementing and operating its services in a secure manner protecting the company and users’ data.

Srikant Das is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 13 years of experience in Big Data analytics and Data Engineering, where he enjoys building reliable, scalable, and efficient solutions. Outside of work, he enjoys traveling and blogging his experiences in social media.

Orchestrate Amazon EMR Serverless Spark jobs with Amazon MWAA, and data validation using Amazon Athena

2023-12-12 Gaurav Parekh

Post Syndicated from Gaurav Parekh original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-emr-serverless-spark-jobs-with-amazon-mwaa-and-data-validation-using-amazon-athena/

As data engineering becomes increasingly complex, organizations are looking for new ways to streamline their data processing workflows. Many data engineers today use Apache Airflow to build, schedule, and monitor their data pipelines.

However, as the volume of data grows, managing and scaling these pipelines can become a daunting task. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) can help simplify the process of building, running, and managing data pipelines. By providing Apache Airflow as a fully managed platform, Amazon MWAA allows data engineers to focus on building data workflows instead of worrying about infrastructure.

Today, businesses and organizations require cost-effective and efficient ways to process large amounts of data. Amazon EMR Serverless is a cost-effective and scalable solution for big data processing that can handle large volumes of data. The Amazon Provider in Apache Airflow comes with EMR Serverless operators and is already included in Amazon MWAA, making it easy for data engineers to build scalable and reliable data processing pipelines. You can use EMR Serverless to run Spark jobs on the data, and use Amazon MWAA to manage the workflows and dependencies between these jobs. This integration can also help reduce costs by automatically scaling the resources needed to process data.

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. You can use standard SQL to interact with data. Athena, a serverless and interactive analytics service, makes this possible without the need to manage complex infrastructure.

In this post, we use Amazon MWAA, EMR Serverless, and Athena to build a complete end-to-end data processing pipeline.

Solution overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

Create an Amazon MWAA workflow that retrieves data from your input Amazon Simple Storage Service (Amazon S3) bucket.
Use EMR Serverless to process the data stored in Amazon S3. EMR Serverless automatically scales up or down based on the workload, so you don’t need to worry about provisioning or managing any infrastructure.
Use EMR Serverless to transform the data using PySpark code and then store the transformed data back in your S3 bucket.
Use Athena to create an external table based on the S3 dataset and run queries to analyze the transformed data. Athena uses the AWS Glue Data Catalog to store the table metadata.

Prerequisites

You should have the following prerequisites:

An AWS account.
A basic understanding of Amazon S3, Athena for running SQL queries, Amazon MWAA to create an environment, and EMR Serverless.
A VPC with two private subnets.
An AWS Identity and Access Management (IAM) role with permissions to create an Amazon MWAA cluster, create the AWS Glue Data Catalog, and run SQL queries using Athena.
An IAM emr_serverless_execution_role job execution role with permissions to run an EMR Serverless application and read from S3 buckets, and iam:PassRole permission to the Amazon MWAA cluster execution role. For more information, refer to EMR Serverless Samples, Job runtime roles, and Adding and removing IAM identity permissions.

Data preparation

To illustrate using EMR Serverless jobs with Apache Spark via Amazon MWAA and data validation using Athena, we use the publicly available NYC taxi dataset. Download the following datasets to your local machine:

Green taxi and Yellow taxi trip records – Trip records for yellow and green taxis, which include information such as pick-up and drop-off dates and times, locations, trip distances, and payment types. In our example, we use the latest Parquet files for 2022.
Dataset for Taxi zone lookup – A dataset that provides location IDs and corresponding zone details for taxis.

In later steps, we upload these datasets to Amazon S3.

Create solution resources

This section outlines the steps for setting up data processing and transformation.

Create an EMR Serverless application

You can create one or more EMR Serverless applications that use open source analytics frameworks like Apache Spark or Apache Hive. Unlike EMR on EC2, you do not need to delete or terminate EMR Serverless applications. EMR Serverless application is only a definition and once created, can be re-used as long as needed. This makes the MWAA pipeline simpler as now you just have to submit jobs to a pre-created EMR Serverless application.

By default, EMR Serverless application will auto-start on job submission and auto-stop when idle for 15 minutes by default to ensure cost efficiency. You can modify the amount of idle time or choose to turn the feature off.

To create an application using EMR Serverless console, follow the instructions in “Create an EMR Serverless application”. Note down the application ID as we will use it in following steps.

Create an S3 bucket and folders

Complete the following steps to set up your S3 bucket and folders:

On the Amazon S3 console, create an S3 bucket to store the dataset.
Note the name of the S3 bucket to use in later steps.
Create an input_data folder for storing input data.
Within that folder, create three separate folders, one for each dataset: green, yellow, and zone_lookup.

You can download and work with the latest datasets available. For our testing, we use the following files:

The green/ folder has the file green_tripdata_2022-06.parquet
The yellow/ folder has the file yellow_tripdata_2022-06.parquet
The zone_lookup/ folder has the file taxi_zone_lookup.csv

Set up the Amazon MWAA DAG scripts

Complete the following steps to set up your DAG scripts:

Download the following scripts to your local machine:
1. requirements.txt – A Python dependency is any package or distribution that is not included in the Apache Airflow base install for your Apache Airflow version on your Amazon MWAA environment. For this post, we use Boto3 version >=1.23.9.
2. blog_dag_mwaa_emrs_ny_taxi.py – This script is a part of the Amazon MWAA DAG and consists of the following tasks: yellow_taxi_zone_lookup, green_taxi_zone_lookup, and ny_taxi_summary,. These tasks involve running Spark jobs to lookup taxi zones, and generating a data summary .
3. green_zone.py – This PySpark script reads data files for green taxi rides and zone lookup, performs a join operation to combine them, and generates an output file containing green taxi rides with zone information. It utilizes temporary views for the df_green and df_zone data frames, performs column-based joins, and aggregates data like passenger count, trip distance, and fare amount. Lastly, it creates the output_data folder in the specified S3 bucket to write the resulting data frame, df_green_zone, as Parquet files.
4. yellow_zone.py – This PySpark script processes yellow taxi ride and zone lookup data files by joining them to generate an output file containing yellow taxi rides with zone information. The script accepts a user-provided S3 bucket name and initiates a Spark session with the application name yellow_zone. It reads the yellow taxi files and zone lookup file from the specified S3 bucket, creates temporary views, performs a join based on location ID, and calculates statistics such as passenger count, trip distance, and fare amount. Lastly, it creates the output_data folder in the specified S3 bucket to write the resulting data frame, df_yellow_zone, as Parquet files.
5. ny_taxi_summary.py – This PySpark script processes the green_zone and yellow_zone files to aggregate statistics on taxi rides, grouping data by service zones and location IDs. It requires an S3 bucket name as a command line argument, creates a SparkSession named ny_taxi_summary, reads the files from S3, performs a join, and generates a new data frame named ny_taxi_summary. It creates an output_data folder in the specified S3 bucket to write the resulting data frame to new Parquet files.

On your local machine, update the blog_dag_mwaa_emrs_ny_taxi.py script with the following information:

Update your S3 bucket name in the following two lines:

S3_LOGS_BUCKET = "<<bucket_name_here>>"
S3_BASE_BUCKET = "<<bucket_name_here>>"

Update your role name ARN:

JOB_ROLE_ARN = “<<emr_serverless_execution_role ARN here>>”
e.g. arn:aws:iam::<<ACCOUNT_ID>>:role/<<ROLE_NAME>>

Update EMR Serverless Application ID. Use the Application ID created earlier.

EMR_SERVERLESS_APPLICATION_ID  = “<<emr serverless application ID here>>”

Upload the requirements.txt file to the S3 bucket created earlier
In the S3 bucket, create a folder named dags and upload the updated blog_dag_mwaa_emrs_ny_taxi.py file from your local machine.
On the Amazon S3 console, create a new folder named scripts inside the S3 bucket and upload the scripts to this folder from your local machine.

Create an Amazon MWAA environment

To create an Airflow environment, complete the following steps:

On the Amazon MWAA console, choose Create environment.
For Name, enter mwaa_emrs_athena_pipeline.
For Airflow version, choose the latest version (for this post, 2.5.1).
For S3 Bucket, enter the path to your S3 bucket.
For DAGs folder, enter the path to your dags folder.
For Requirements file, enter the path to the requirements.txt file.
Choose Next.
For Virtual private cloud (VPC), choose a VPC that has a minimum of two private subnets.

This will populate two of the private subnets in your VPC.

Under Web server access, select Public network.

This allows the Apache Airflow UI to be accessed over the internet by users granted access to the IAM policy for your environment.

For Security group(s), select Create new security group.
For Environment class, select mw1.small.
For Execution role, choose Create a new role.
For Role name, enter a name.
Leave the other configurations as default and choose Next.
On the next page, choose Create environment.

It may take about 20–30 minutes to create your Amazon MWAA environment.

When the Amazon MWAA environment status changes to Available, navigate to the IAM console and update cluster execution role to add pass role privileges to emr_serverless_execution_role.

Trigger the Amazon MWAA DAG

To trigger the DAG, complete the following steps:

On the Amazon MWAA console, choose Environments in the navigation pane.
Open your environment and choose Open Airflow UI.
Select blog_dag_mwaa_emr_ny_taxi, choose the play icon, and choose Trigger DAG.
When the DAG is running, choose the DAG blog_dag_mwaa_emrs_ny_taxi and choose Graph to locate your DAG run workflow.

The DAG will take approximately 4–6 minutes to run all the scripts. You will see all the complete tasks and the overall status of the DAG will show as success.

To rerun the DAG, remove s3://<<your_s3_bucket here >>/output_data/.

Optionally, to understand how Amazon MWAA runs these tasks, choose the task you want to inspect.

Choose Run to view the task run details.

The following screenshot shows an example of the task logs.

If you like to dive deep in the execution logs, then on the EMR Serverless console, navigate to “Applications”. The Apache Spark driver logs will indicate the initiation of your job along with the details for executors, stages and tasks that were created by EMR Serverless. These logs can be helpful to monitor your job progress and troubleshoot failures.

By default, EMR Serverless will store application logs securely in Amazon EMR managed storage for a period of 30 days. However, you can also specify Amazon S3 or Amazon CloudWatch as your log delivery options during job submission.

Validate the final result set with Athena

Let’s validate the data loaded by the process using Athena SQL queries.

On the Athena console, choose Query editor in the navigation pane.
If you’re using Athena for the first time, under Settings, choose Manage and enter the S3 bucket location that you created earlier (<S3_BUCKET_NAME>/athena), then choose Save.
In the query editor, enter the following query to create an external table:

CREATE EXTERNAL TABLE default.ny_taxi_summary(
  pu_service_zone string, 
  pulocationid bigint, 
  do_service_zone string, 
  dolocationid bigint, 
  passenger_count bigint, 
  trip_distance double, 
  fare_amount double, 
  extra double, 
  mta_tax double, 
  tip_amount double, 
  tolls_amount double, 
  improvement_surcharge double, 
  total_amount double, 
  congestion_surcharge double, 
  airport_fee double)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<<YOUR-S3-BUCKET Here>>/output_data/ny_taxi_summary/' -- *** Change bucket name to your bucket***
TBLPROPERTIES (
  'classification'='parquet', 
  'compressionType'='none');

Run the following query on the recently created ny_taxi_summary table to retrieve the first 10 rows to validate the data:

select * from default.ny_taxi_summary limit 10;

Clean up

To prevent future charges, complete the following steps:

On the Amazon S3 console, delete the S3 bucket you created to store the Amazon MWAA DAG, scripts, and logs.
On the Athena console, drop the table you created:
```
drop table default.ny_taxi_summary;
```
On the Amazon MWAA console, navigate to the environment that you created and choose Delete.
On the EMR Studio console, delete the application.

To delete the application, navigate to the List applications page. Select the application that you created and choose Actions → Stop to stop the application. After the application is in the STOPPED state, select the same application and choose Actions → Delete.

Conclusion

Data engineering is a critical component of many organizations, and as data volumes continue to grow, it’s essential to find ways to streamline data processing workflows. The combination of Amazon MWAA, EMR Serverless, and Athena provides a powerful solution to build, run, and manage data pipelines efficiently. With this end-to-end data processing pipeline, data engineers can easily process and analyze large amounts of data quickly and cost-effectively without the need to manage complex infrastructure. The integration of these AWS services provides a robust and scalable solution for data processing, helping organizations make informed decisions based on their data insights.

Now that you’ve seen how to submit Spark jobs on EMR Serverless via Amazon MWAA, we encourage you to use Amazon MWAA to create a workflow that will run PySpark jobs via EMR Serverless.

We welcome your feedback and inquiries. Please feel free to reach out to us if you have any questions or comments.

About the authors

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Gaurav Parekh is a Solutions Architect helping AWS customers build large scale modern architecture. He specializes in data analytics and networking. Outside of work, Gaurav enjoys playing cricket, soccer and volleyball.

Audit History

December 2023: This post was reviewed for technical accuracy by Santosh Gantaram, Sr. Technical Account Manager.

Enhance query performance using AWS Glue Data Catalog column-level statistics

2023-11-22 Sandeep Adwankar

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets. When talking with our customers, we learned that one the challenging aspect of data lake performance is how to optimize these analytics queries to execute faster.

The data lake performance optimization is especially important for queries with multiple joins and that is where cost-based optimizers helps the most. In order for CBO to work, column statistics need to be collected and updated based on changes in the data. We’re launching capability of generating column-level statistics such as number of distinct, number of nulls, max, and min on files such as Parquet, ORC, JSON, Amazon ION, CSV, XML on AWS Glue tables. With this launch, customers now have integrated end-to-end experience where statistics on Glue tables are collected and stored in the AWS Glue Catalog, and made available to analytics services for improved query planning and execution.

Using these statistics, cost-based optimizers improves query run plans and boosts the performance of queries run in Amazon Athena and Amazon Redshift Spectrum. For example, CBO can use column statistics such as number of distinct values and number of nulls to improve row prediction. Row prediction is the number of rows from a table that will be returned by a certain step during the query planning stage. The more accurate the row predictions are, the more efficient query execution steps are. This leads to faster query execution and potentially reduced cost. Some of the specific optimizations that CBO can employ include join reordering and push-down of aggregations based on the statistics available for each table and column.

For customers using data mesh with AWS Lake Formation permissions, tables from different data producers are cataloged in the centralized governance accounts. As they generate statistics on tables on centralized catalog and share those tables with consumers, queries on those tables in consumer accounts will see query performance improvements automatically. In this post, we’ll demonstrate the capability of AWS Glue Data Catalog to generate column statistics for our sample tables.

Solution overview

To demonstrate the effectiveness of this capability, we employ the industry-standard TPC-DS 3 TB dataset stored in an Amazon Simple Storage Service (Amazon S3) public bucket. We’ll compare the query performance before and after generating column statistics for the tables, by running queries in Amazon Athena and Amazon Redshift Spectrum. We are providing queries that we used in this post and we encourage to try out your own queries following workflow as illustrated in the following details.

The workflow consists of the following high level steps:

Cataloging the Amazon S3 Bucket: Utilize AWS Glue Crawler to crawl the designated Amazon S3 bucket, extracting metadata, and seamlessly storing it in the AWS Glue data catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum.
Generating column statistics: Employ the enhanced capabilities of AWS Glue Data Catalog to generate comprehensive column statistics for the crawled data, thereby providing valuable insights into the dataset.
Querying with Amazon Athena and Amazon Redshift Spectrum: Evaluate the impact of column statistics on query performance by utilizing Amazon Athena and Amazon Redshift Spectrum to execute queries on the dataset.

The following diagram illustrates the solution architecture.

Walkthrough

To implement the solution, we complete the following steps:

Set up resources with AWS CloudFormation.
Run AWS Glue Crawler on Public Amazon S3 bucket to list the 3TB TPC-DS dataset.
Run queries on Amazon Athena and Amazon Redshift and note down query duration
Generate statistics for AWS Glue Data Catalog tables
Run queries on Amazon Athena and Amazon Redshift and compare query duration with previous run
Optional: Schedule AWS Glue column statistics jobs using AWS Lambda and the Amazon EventBridge Scheduler

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

An Amazon Virtual Private Cloud (Amazon VPC), public subnet, private subnets and route tables.
An Amazon Redshift Serverless workgroup and namespace.
An AWS Glue crawler to crawl the public Amazon S3 bucket and create a table for the Glue Data Catalog for TPC-DS dataset
AWS Glue catalog databases and tables
An Amazon S3 bucket to store athena result.
AWS Identity and Access Management (AWS IAM) users and policies.
AWS Lambda and Amazon Event Bridge scheduler to schedule the AWS Glue Column statistics

To launch the AWS CloudFormation stack, complete the following steps:

Note: The AWS Glue data catalog tables are generated using the public bucket s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/, hosted in the us-east-1 region. If you intend to deploy this AWS CloudFormation template in a different region, it is necessary to either copy the data to the corresponding region or share the data within your deployed region for it to be accessible from Amazon Redshift.

Log in to the AWS Management Console as AWS Identity and Access Management (AWS IAM) administrator.
Choose Launch Stack to deploy a AWS CloudFormation template.
Choose Next.
On the next page, keep all the option as default or make appropriate changes based on your requirement choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Run the AWS Glue Crawlers created by the AWS CloudFormation stack

To run your crawlers, complete the following steps:

On the AWS Glue console to AWS Glue Console, choose Crawlers under Data Catalog in the navigation pane.
Locate and run two crawlers tpcdsdb-without-stats and tpcdsdb-with-stats. It may take few mins to complete.

Once the crawler completes successfully, it would create two identical databases tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have No Stats and we’ll use them as reference. We generate statistics on tables in tpcdsdbwithstats. Please verify that you have those two databases and underlying tables from the AWS Glue Console. The tpcdsdbnostats database will look like below. At this time there are no statistics generated on these tables.

Run provided query using Amazon Athena on no-stats tables

To run your query in Amazon Athena on tables without statistics, complete the following steps:

Download the athena queries from here.
On the Amazon Athena Console, choose the provided query one at a time for tables in database tpcdsdbnostats.
Run the query and note down the Run time for each query.

Run provided query using Amazon Redshift Spectrum on no-stats tables

To run your query in Amazon Redshift, complete the following steps:

Download the Amazon Redshift queries from here.
On the Redshift query editor v2, execute the Redshift Query for tables without stats section from downloaded query.
Run the query and note down the query execution of each query.

Generate statistics on AWS Glue Catalog tables

To generate statistics on AWS Glue Catalog tables, complete the following steps:

Navigate to the AWS Glue Console and choose the databases under Data Catalog.
Click on tpcdsdbwithstats database and it will list all the available tables.
Select any of these tables (e.g., call_center).
Go to Column statistics – new tab and choose Generate statistics.
Keep the default option. Under Choose columns keep Table (All columns) and Under Row sampling options Keep All rows, Under IAM role choose AWSGluestats-blog and select Generate statistics.

You’ll be able to see status of the statistics generation run as shown in the following illustration:

After generate statistics on AWS Glue Catalog tables, you should be able to see detailed column statistics for that table:

Reiterate steps 2–5 to generate statistics for all necessary tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, web_returns. Alternatively, you can follow the “Schedule AWS Glue Statistics Runs” section near the end of this blog to generate statistics for all tables. Once done, assess query performance for each query.

Run provided query using Athena Console on stats tables

On the Amazon Athena console, execute the Athena Query for tables with stats section from downloaded query.
Run and note down the query execution of each query.

In our sample run of the queries on the tables, we observed the query execution time as per the below table. We saw clear improvement in the query performance, ranging from 13 to 55%.

Athena query time improvement

TPC-DS 3T Queries	without glue stats (sec)	with glue stats (sec)	performance improvement (%)
Query 2	33.62	15.17	55%
Query 4	132.11	72.94	45%
Query 14	134.77	91.48	32%
Query 28	55.99	39.36	30%
Query 38	29.32	25.58	13%

Run the provided query using Amazon Redshift Spectrum on statistics tables

On the Amazon Redshift query editor v2, execute the Redshift Query for tables with stats section from downloaded query.
Run the query and note down the query execution of each query.

In our sample run of the queries on the tables, we observed the query execution time as per the below table. We saw clear improvement in the query performance, ranging from 13 to 89%.

Amazon Redshift Spectrum query time improvement

TPC-DS 3T Queries	without glue stats (sec)	with glue stats (sec)	performance improvement (%)
Query 40	124.156	13.12	89%
Query 60	29.52	16.97	42%
Query 66	18.914	16.39	13%
Query 95	308.806	200	35%
Query 99	20.064	16	20%

Schedule AWS Glue statistics Runs

In this segment of the post, we’ll guide you through the steps of scheduling AWS Glue column statistics runs using AWS Lambda and the Amazon EventBridge Scheduler. To streamline this process, a AWS Lambda function and an Amazon EventBridge scheduler were created as part of the CloudFormation stack deployment.

AWS Lambda function setup:

To begin, we utilize an AWS Lambda function to trigger the execution of the AWS Glue column statistics job. The AWS Lambda function invokes the start_column_statistics_task_run API through the boto3 (AWS SDK for Python) library. This sets the groundwork for automating the column statistics update.

Let’s explore the AWS Lambda function:

- Go to the AWS Glue Lambda Console.
- Select Functions and locate the GlueTableStatisticsFunctionv1.
- For a clearer understanding of the AWS Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.

Amazon EventBridge scheduler configuration

The next step involves scheduling the AWS Lambda function invocation using the Amazon EventBridge Scheduler. The scheduler is configured to trigger the AWS Lambda function daily at a specific time – in this case, 08:00 PM. This ensures that the AWS Glue column statistics job runs on a regular and predictable basis.

Now, let’s explore how you can update the schedule:

- Open the Amazon EventBridge Console in your AWS Management Console.
- Once in the Amazon EventBridge Console, select Schedules under the Scheduler section. This is where you manage and configure the schedules for your events.

Cleaning up

To avoid unwanted charges to your AWS account, delete the AWS resources:

Sign into the AWS CloudFormation console as the AWS IAM administrator used for creating the AWS CloudFormation stack.
Delete the AWS CloudFormation stack you created.

Conclusion

In this post, we showed you how you can use AWS Glue Data Catalog to generate column-level statistics for AWS Glue tables. These statistics are now integrated with cost-based optimizer from Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential costs savings. Refer to Docs for support for Glue Catalog Statistics across various AWS analytical services.

If you have questions or suggestions, submit them in the comments section.

About the Authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled Data Wrangling on AWS. He can be reached via LinkedIn.

AWS Weekly Roundup – EC2 DL2q instances, PartyRock, Amplify’s 6th birthday, and more – November 20, 2023

2023-11-20 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-ec2-dl2q-instances-partyrock-amplifys-6th-birthday-and-more-november-20-2023/

Last week I saw an astonishing 160+ new service launches. There were so many updates that we decided to publish a weekly roundup again. This continues the same innovative pace of the previous week as we are getting closer to AWS re:Invent 2023.

Our News Blog team is also finalizing new blog posts for re:Invent to introduce awesome launches with service teams for your reading pleasure. Jeff Barr shared The Road to AWS re:Invent 2023 to explain our blogging journey and process. Please stay tuned in the next week!

Last week’s launches
Here are some of the launches that caught my attention last week:

Amazon EC2 DL2q instances – New DL2q instances are powered by Qualcomm AI 100 Standard accelerators and are the first to feature Qualcomm’s AI technology in the public cloud. With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, you can run popular generative artificial intelligence (AI) applications and extend to edge devices across smartphones, autonomous driving, personal compute, and extended reality headsets to develop and validate these AI workloads before deploying.

PartyRock for Amazon Bedrock – We introduced PartyRock, a fun and intuitive hands-on, generative AI app-building playground powered by Amazon Bedrock. You can experiment, learn all about prompt engineering, build mini-apps, and share them with your friends—all without writing any code or creating an AWS account.

You also can now access the Meta Llama 2 Chat 13B foundation model and Cohere Command Light, Embed English, and multilingual models for Amazon Bedrock.

AWS Amplify celebrates its sixth birthday – We announced six new launches; a new documentation site, support for Next.js 14 with our hosting and JavaScript library, added custom token providers and an automatic React Native social sign-in update to Amplify Auth, new ChangePassword and DeleteUser account settings components, and updated all Amplify UI packages to use new Amplify JavaScript v6. You can also use wildcard subdomains when using a custom domain with your Amplify application deployed to AWS Amplify Hosting.

Amplify docs site UI

Also check out other News Blog posts about major launches published in the past week:

Other AWS service launches
Here are some other bundled feature launches per AWS service:

Amazon Athena – You can use a new cost-based optimizer (CBO) to enhance query performance based on table and column statistics, collected by AWS Glue Data Catalog and Athena JDBC 3.x driver, a new alternative that supports almost all authentication plugins. You can also use Amazon EMR Studio to develop and run interactive queries on Amazon Athena.

Amazon CloudWatch – You can use a new CloudWatch metric called EBS Stalled I/O Check to monitor the health of your Amazon EBS volumes, the regular expression for Amazon CloudWatch Logs Live Tail filter pattern syntax to search and match relevant log events, observability of SAP Sybase ASE database in CloudWatch Application Insights, and up to two stats commands in a Log Insights query to perform aggregations on the results.

Amazon CodeCatalyst – You can connect to a Amazon Virtual Private Cloud (Amazon VPC) from CodeCatalyst Workflows, provision infrastructure using Terraform within CodeCatalyst Workflows, access CodeCatalyst with your workforce identities configured in IAM Identity Center, and create teams made up of members of the CodeCatalyst space.

Amazon Connect – You can use a pre-built queue performance dashboard and Contact Lens conversational analytics dashboard to view and compare real-time and historical aggregated queue performance. You can use quick responses for chats, previously written formats such as typing in ‘/#greet’ to insert a personalized response, and scanning attachments to detect malware or other unwanted content.

AWS Glue – AWS Glue for Apache Spark added new six database connectors: Teradata, SAP HANA, Azure SQL, Azure Cosmos DB, Vertica, and MongoDB, as well as the native connectivity to Amazon OpenSearch Service.

AWS Lambda – You can see single pane view of metrics, logs, and traces in the AWS Lambda console and advanced logging controls to natively capture logs in JSON structured format. You can view the SAM template on the Lambda console and export the function’s configuration to AWS Application Composer. AWS Lambda also supports Java 21 and NodeJS 20 versions built on the new Amazon Linux 2023 runtime.

AWS Local Zones in Dallas – You can enable the new Local Zone in Dallas, Texas, us-east-1-dfw-2a, with Amazon EC2 C6i, M6i, R6i, C6gn, and M6g instances and Amazon EBS volume types gp2, gp3, io1, sc1, and st1. You can also access Amazon ECS, Amazon EKS, Application Load Balancer, and AWS Direct Connect in this new Local Zone to support a broad set of workloads at the edge.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) – You can standardize access control to Kafka resources using AWS Identity and Access Management (IAM) and build Kafka clients for Amazon MSK Serverless written in all programming languages. These are open source client helper libraries and code samples for popular languages, including Java, Python, Go, and JavaScript. Also, Amazon MSK now supports an enhanced version of Apache Kafka 3.6.0 that offers generally available Tiered Storage and automatically sends you storage capacity alerts when you are at risk of exhausting your storage.

Amazon OpenSearch Service Ingestion – You can migrate your data from Elasticsearch version 7.x clusters to the latest versions of Amazon OpenSearch Service and use persistent buffering to protect the durability of incoming data.

Amazon RDS –Amazon RDS for MySQL now supports creating active-active clusters using the Group Replication plugin, upgrading MySQL 5.7 snapshots to MySQL 8.0, and Innovation Release version of MySQL 8.1.

Amazon RDS Custom for SQL Server extends point-in-time recovery support for up to 1,000 databases, supports Service Master Key Retention to use transparent data encryption (TDE), table- and column-level encryption, DBMail and linked servers, and use SQL Server Developer edition with the bring your own media (BYOM).

Additionally, Amazon RDS Multi-AZ deployments with two readable standbys now supports minor version upgrades and system maintenance updates with typically less than one second of downtime when using Amazon RDS Proxy.

AWS Partner Central – You can use an improved user experience in AWS Partner Central to build and promote your offerings and the new Investments tab in the Partner Analytics Dashboard to gain actionable insights. You can now link accounts and associated users between Partner Central and AWS Marketplace and use an enhanced co-sell experience with APN Customer Engagements (ACE) manager.

Amazon QuickSight – You can programmatically manage user access and custom permissions support for roles to restrict QuickSight functionality to the QuickSight account for IAM Identity Center and Active Directory using APIs. You can also use shared restricted folders, a Contributor role and support for data source asset types in folders and the Custom Week Start feature, an addition designed to enhance the data analysis experience for customers across diverse industries and social contexts.

AWS Trusted Advisor – You can use new APIs to programmatically access Trusted Advisor best practices checks, recommendations, and prioritized recommendations and 37 new Amazon RDS checks that provide best practices guidance by analyzing DB instance configuration, usage, and performance data.

There’s a lot more launch news that I haven’t covered. See AWS What’s New for more details.

See you virtually in AWS re:Invent
Next week we’ll hear the latest from AWS, learn from experts, and connect with the global cloud community in Las Vegas. If you come, check out the agenda, session catalog, and attendee guides before your departure.

If you’re not able to attend re:Invent in person this year, we’re offering the option to livestream our Keynotes and Innovation Talks. With the registration for online pass, you will have access to on-demand keynote, Innovation Talks, and selected breakout sessions after the event.

– Channy

Speed up queries with the cost-based optimizer in Amazon Athena

2023-11-17 Darshit Thakkar

Post Syndicated from Darshit Thakkar original https://aws.amazon.com/blogs/big-data/speed-up-queries-with-cost-based-optimizer-in-amazon-athena/

Amazon Athena is a serverless, interactive analytics service built on open source frameworks, supporting open table file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. By using these statistics, CBO improves query run plans and boosts the performance of queries run in Athena. Some of the specific optimizations CBO can employ include join reordering and pushing aggregations down based on the statistics available for each table and column.

TPC-DS benchmarks These benchmarks demonstrate the power of the cost-based optimizer—queries run up to 2x times faster with CBO enabled compared to running the same TPC-DS queries without CBO.

Performance and cost comparison on TPC-DS benchmarks

We used the industry-standard TPC-DS 3 TB to represent different customer use cases. These are representative of workloads with 10 times the stated benchmark size. This means a 3 TB benchmark dataset accurately represents customer workloads on 30–50 TB datasets.

In our testing, the dataset was stored in Amazon S3 in non-compressed Parquet format and the AWS Glue Data Catalog was used to store metadata for databases and tables. Fact tables were partitioned on the date column used for join operations, and each fact table consisted of 2,000 partitions. To help illustrate the performance of CBO, we compare the behavior of various queries and highlight the performance differences between running with CBO enabled vs. disabled.

The following graph illustrates the runtime of queries on the engine with and without CBO.

The following graph presents the top 10 queries from the TPC-DS benchmark with the greatest performance improvement.

Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.

Cost-based join reordering

Join reordering, an optimization technique used by cost-based SQL optimizers, analyzes different join sequences to select the order that minimizes query runtime by reducing intermediate data processed at each step, lowering memory and CPU requirements.

Let’s talk about query 82 of the TPC-DS 3TB dataset. The query performs inner joins on four tables: item, inventory, date_dim, and store_sales. The store_sales table has 8.6 billion rows and is partitioned by date. The inventory table has 1 billion rows and is also partitioned by date. The item table contains 360,000 rows, and the date_dim table holds 73,000 rows.

Query 82

select  i_item_id ,i_item_desc ,i_current_price
from item, inventory, date_dim, store_sales
where i_current_price between 30 and 30+30
and inv_item_sk = i_item_sk
and d_date_sk=inv_date_sk
and cast(d_date as date) between cast('2002-05-30' as date) and (cast('2002-05-30' as date) +  interval '60' day)
and i_manufact_id in (437,129,727,663)
and inv_quantity_on_hand between 100 and 500
and ss_item_sk = i_item_sk
group by i_item_id,i_item_desc,i_current_price
order by i_item_id
limit 100

Without CBO

Without using CBO, the engine will determine the join order based on the sequence of tables defined in the input query with internal heuristics. The FROM clause of the input query is "from item, inventory, date_dim, store_sales" (all inner joins). After passing through internal heuristics, Athena chose the join order as ((item ⋈ (inventory ⋈ date_dim)) ⋈ store_sales). Despite store_sales being the largest fact table, it’s defined last in the FROM clause and therefore gets joined last. This plan fails to reduce the intermediate join sizes as early as possible, resulting in an increased query runtime. The following diagram shows the join order without CBO and the number of rows flowing through different stages.

With CBO

When using CBO, the optimizer determines the best join order using a variety of data, including statistics as well as join size estimation, join build side, and join type. In this instance, Athena’s selected join order is ((store_sales ⋈ item) ⋈ (inventory ⋈ date_dim)). The largest fact table, store_sales, without being shuffled, is first joined with the item dimension table. The other partitioned table, inventory, is also first joined in-place with the date_dim dimension table. The join with the dimension table acts as a filter on the fact table, which dramatically reduces the input data size of the join that follows. Note that which side a table resides for a join is significant in Athena, because it’s the table on the right that will be built into memory for the join operation. Therefore, we always want to keep the larger table on the left and the smaller table on the right. CBO chose a plan that the left side was 8.6 billion before, and now it’s 13.6 million.

With CBO, the query runtime improved by 25% (from 15 seconds down to 11 seconds) by choosing the optimal join order.

Next, let’s discuss another CBO technique.

Cost-based aggregation pushdown

Aggregation pushdown is an optimization technique used by query optimizers to improve performance. It involves pushing aggregation operations like SUM, COUNT, and AVG into an earlier stage in the query plan, while maintaining the same query semantics. This reduces the amount of data transferred between the stages. By minimizing data processing, aggregation pushdown decreases memory usage, I/O costs, and network traffic.

However, pushing down aggregation is not always beneficial. It depends on the data distribution. For example, grouping on a column with many rows but few distinct values (like gender) before joins works better. Grouping first means aggregating a large number of records into fewer records (just male, female, for example). Grouping after joining means a large number of records have to participate the join before being aggregated. On the other hand, grouping on a high cardinality column is better done after joins. Doing it before risks unnecessary aggregation overhead because each value is likely unique anyway and that step will not result in an earlier reduction in the amount of data transferred between intermediate stages.

Therefore, whether to push down aggregation should be a cost-based decision. Let’s take example of the query 2 run on a 3TB TPC-DS dataset, showing how the aggregation pushdown’s value depends on data distribution. The web_sales table has 2.1 billion rows and the catalog_sales table has 4.23 billion rows. Both tables are partitioned on the date column.

Query 2

with wscs as
 (select sold_date_sk
        ,sales_price
  from (select ws_sold_date_sk sold_date_sk
              ,ws_ext_sales_price sales_price
        from web_sales 
        union all
        select cs_sold_date_sk sold_date_sk
              ,cs_ext_sales_price sales_price
        from catalog_sales)),
 wswscs as 
 (select d_week_seq,
        sum(case when (d_day_name='Sunday') then sales_price else null end) sun_sales,
        sum(case when (d_day_name='Monday') then sales_price else null end) mon_sales,
        sum(case when (d_day_name='Tuesday') then sales_price else  null end) tue_sales,
        sum(case when (d_day_name='Wednesday') then sales_price else null end) wed_sales,
        sum(case when (d_day_name='Thursday') then sales_price else null end) thu_sales,
        sum(case when (d_day_name='Friday') then sales_price else null end) fri_sales,
        sum(case when (d_day_name='Saturday') then sales_price else null end) sat_sales
 from wscs
     ,date_dim
 where d_date_sk = sold_date_sk
 group by d_week_seq)
 select d_week_seq1
       ,round(sun_sales1/sun_sales2,2)
       ,round(mon_sales1/mon_sales2,2)
       ,round(tue_sales1/tue_sales2,2)
       ,round(wed_sales1/wed_sales2,2)
       ,round(thu_sales1/thu_sales2,2)
       ,round(fri_sales1/fri_sales2,2)
       ,round(sat_sales1/sat_sales2,2)
 from
 (select wswscs.d_week_seq d_week_seq1
        ,sun_sales sun_sales1
        ,mon_sales mon_sales1
        ,tue_sales tue_sales1
        ,wed_sales wed_sales1
        ,thu_sales thu_sales1
        ,fri_sales fri_sales1
        ,sat_sales sat_sales1
  from wswscs,date_dim 
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001) y,
 (select wswscs.d_week_seq d_week_seq2
        ,sun_sales sun_sales2
        ,mon_sales mon_sales2
        ,tue_sales tue_sales2
        ,wed_sales wed_sales2
        ,thu_sales thu_sales2
        ,fri_sales fri_sales2
        ,sat_sales sat_sales2
  from wswscs
      ,date_dim 
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001+1) z
 where d_week_seq1=d_week_seq2-53
 order by d_week_seq1

Without CBO

Athena first joins the result of the union all operation on the web_sales table and the catalog_sales table with another table. Only then does it perform aggregation on the joined results. In this example, the amount of data that needed to be joined was huge, resulting in a longer query runtime.

With CBO

Athena utilizes one of the statistics values, the distinct value count, to evaluate the cost implications of pushing down the aggregation vs. not doing so. When a column has many rows but few distinct values, CBO is more likely to push aggregation down. This shrank the qualified rows from web_sales and catalog_sales tables to 2,590 and 3,590 rows, respectively. These aggregated records were then unioned and used to join with the tables. Comparing to the plan without CBO, the records participating in the join from the two large tables dropped from 6.33 billion rows (2.1 billion + 4.23 billion) to just 6,180 rows (2,590 + 3,590). This significantly decreased query runtime.

With CBO, the query runtime improved by 50% (from 37 seconds down to 18 seconds). In summary, CBO helped Athena choose an optimal aggregation pushdown plan, cutting the query time in half compared to not using cost-based optimization.

Conclusion

In this post, we discussed how Athena uses a cost-based optimizer (CBO) in its engine v3 to use table statistics for generating more efficient query run plans. Testing on the TPC-DS benchmark showed an 11% improvement in overall query performance when using CBO compared to without it.

Two key optimization employed by CBO are join reordering and aggregate pushdown. Join reordering reduces intermediate data by intelligently picking the order to join tables based on statistics. Aggregate pushdown decreases intermediate data by pushing aggregations earlier in the plan when beneficial.

In summary, Athena’s new cost-based optimizer significantly speeds up queries by choosing superior run plans. CBO optimizes based on table statistics stored in the AWS Glue Data Catalog. This automatic optimization improves productivity for Athena users through more responsive query performance. To take advantage of optimization techniques of CBO, refer to working with column statistics to generate statistics on the tables and columns in the AWS Glue Data Catalog.

About the Authors

Darshit Thakkar is a Technical Product Manager with AWS and works with the Amazon Athena team based out of Boston, Massachusetts.

Wei Zheng is a Sr. Software Development Engineer with Amazon Athena. He joined AWS in 2021 and has been working on multiple performance improvements on Athena.

Chuho Chang is a Software Development Engineer with Amazon Athena. He has been working on query optimizers for over a decade.

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

2023-11-17 Antonio Samaniego Jurado

Post Syndicated from Antonio Samaniego Jurado original https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools. The scalability and flexible data schema of DynamoDB make it well-suited for a variety of use cases. These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more.

Data stored in DynamoDB is the basis for valuable business intelligence (BI) insights. To make this data accessible to data analysts and other consumers, you can use Amazon Athena. Athena is a serverless, interactive service that allows you to query data from a variety of sources in heterogeneous formats, with no provisioning effort. Athena accesses data stored in DynamoDB via the open source Amazon Athena DynamoDB connector. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog.

Finally, to visualize BI insights, you can use Amazon QuickSight, a cloud-powered business analytics service. QuickSight makes it straightforward for organizations to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, anytime, on any device. Its generative BI capabilities enable you to ask questions about your data using natural language, without having to write SQL queries or learn a BI tool.

This post shows how you can use the Athena DynamoDB connector to easily query data in DynamoDB with SQL and visualize insights in QuickSight.

Solution overview

The following diagram illustrates the solution architecture.

The Athena DynamoDB connector runs in a pre-built, serverless AWS Lambda function. You don’t need to write any code.
AWS Glue provides supplemental metadata from the DynamoDB table. In particular, an AWS Glue crawler is run to infer and store the DynamoDB table format, schema, and associated properties in the Glue Data Catalog.
The Athena editor is used to test the connector and perform analysis via SQL queries.
QuickSight uses the Athena connector to visualize BI insights from DynamoDB.

This walkthrough uses data from the ProductCatalog table, part of the DynamoDB developer guide sample data files.

Prerequisites

Before you get started, you should meet the following prerequisites:

An AWS account (refer to How do I create and activate a new AWS account? to create a new account)
A QuickSight account (see Signing up for an Amazon QuickSight subscription for sign up instructions)
Familiarity with AWS Identity and Access Management (IAM)

Set up the Athena DynamoDB connector

The Athena DynamoDB connector comprises a pre-built, serverless Lambda function provided by AWS that communicates with DynamoDB so you can query your tables with SQL using Athena. The connector is available in the AWS Serverless Application Repository, and is used to create the Athena data source for later use in data analysis and visualization. To set up the connector, complete the following steps:

On the Athena console, choose Data sources in the navigation pane.
Choose Create data source.
In the search bar, search for and choose Amazon DynamoDB.
Choose Next.
Under Data source details, enter a name. Note that this name should be unique and will be referenced in your SQL statements when you query your Athena data source.
Under Connection details, choose Create Lambda function.

This will take you to the Lambda applications page on the Lambda console. Do not close the Athena data source creation tab; you will return to it in a later step.

Scroll down to Application settings and enter a value for the following parameters (leave the other parameters as default):
- SpillBucket – Specifies the Amazon Simple Storage Service (Amazon S3) bucket name for storing data that exceeds Lambda function response size limits. To create an S3 bucket, refer to Creating a bucket.
- AthenaCatalogName – A lowercase name for the Lambda function to be created.
Select the acknowledgement check box and choose Deploy.

Wait for deployment to complete before moving to the next step.

Return to the Athena data source creation tab.
Under Connection details, choose the refresh icon and choose the Lambda function you created.
Choose Next.
Review and choose Create data source.

Provide supplemental metadata via AWS Glue

The Athena connector already comes with a built-in inference capability to discover the schema and table properties of your data source. However, this capability is limited. To accurately discover the metadata of your DynamoDB table and centralize schema management as your data evolves over time, the connector integrates with AWS Glue.

To achieve this, an AWS Glue crawler is run to automatically determine the format, schema, and associated properties of the raw data stored in your DynamoDB table, writing the resulting metadata to a Glue database. Glue databases contain tables, which hold metadata from different data stores, independent from the actual location of the data. The Athena connector then references the Glue table and retrieves the corresponding DynamoDB metadata to enable queries.

Create the AWS Glue database

Complete the following steps to create the Glue database:

On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
Choose Add database (you can also edit an existing database if you already have one).
For Name, enter a database name.
For Location, enter the string literal dynamo-db-flag. This keyword indicates that the database contains tables that the connector can use for supplemental metadata.
Choose Create database.

Following security best practices, it is also recommended that you enable encryption at rest for your Data Catalog. For details, refer to Encrypting your Data Catalog.

Create the AWS Glue crawler

Complete the following steps to create and run the Glue crawler:

On the AWS Glue console, under Data Catalog in the navigation pane, choose Crawlers.
Choose Create crawler.
Enter a crawler name and choose Next.
For Data sources, choose Add a data source.
On the Data source drop-down menu, choose DynamoDB. For Table name, enter the name of your DynamoDB table (string literal).
Choose Add a DynamoDB data source.
Choose Next.
For IAM Role, choose Create new IAM role.
Enter a role name and choose Create. This will automatically create an IAM role that trusts AWS Glue and has permissions to access the crawler targets.
Choose Next.
For Target database, choose the database previously created.
Choose Next.
Review and choose Create crawler.
On the newly created crawler page, choose Run crawler.

Crawler runtimes depend on your DynamoDB table size and properties. You can find crawler run details under Crawler runs.

Validate the output metadata

When your crawler run status shows as Completed, follow the below steps to validate the output metadata:

On the AWS Glue console, choose Tables in the navigation pane. Here, you can confirm a new table has been added to the database as a result of the crawler run.
Navigate to the newly created table and take a look at the Schema tab. This tab shows the column names, data types, and other parameters inferred from your DynamoDB table.
If needed, edit the schema by choosing Edit schema.
Choose Advanced properties.
Under Table properties, verify the crawler automatically created and set the classification key to dynamodb. This indicates to the Athena connector that the table can be used for supplemental metadata.
Optionally, add the following properties to correctly catalog and reference DynamoDB data in AWS Glue and Athena queries. This is due to capital letters not being permitted in AWS Glue table and column names, but being permitted in DynamoDB table and attribute names.
1. If your DynamoDB table name contains any capital letters, choose Actions and Edit Table and add an extra table property as follows:
  - Key: sourceTable
  - Value: YourDynamoDBTableName
2. If your DynamoDB table has attributes that contain any capital letters, add an extra table property as follows:
  - Key: columnMapping
  - Value: yourcolumn1=YourColumn1, yourcolumn2=YourColumn2, …

Test the connector with the Athena SQL editor

After the Athena DynamoDB connector is deployed and the AWS Glue table is populated with supplemental metadata, the DynamoDB table is ready for analysis. The example in this post uses the Athena editor to make SQL queries to the ProductCatalog table. For further options to interact with Athena, see Accessing Athena.

Complete the following steps to test the connector:

Open the Athena query editor.
If this is your first time visiting the Athena console in your current AWS Region, complete the following steps. This is a prerequisite before you can run Athena queries. See Getting Started for more details.
1. Choose Query editor in the navigation pane to open the editor.
2. Navigate to Settings and choose Manage to set up a query result location in Amazon S3.
Under Data, select the data source and database you created (you may need to choose the refresh icon for them to sync up with Athena).
Tables belonging to the selected database appear under Tables. You can choose a table name for Athena to show the table column list and data types.
Test the connector by pulling data from your table via a SELECT statement. When you run Athena queries, you can reference Athena data sources, databases, and tables as <datasource_name>.<database>.<table_name>. Retrieved records are shown under Results.

For increased security, refer to Encrypting Athena query results stored in Amazon S3 to encrypt query results at rest.

For this post, we run a SELECT statement to validate the process. You can refer to the SQL reference for Athena to build more complex queries and analyses.

Visualize in QuickSight

QuickSight allows for building modern interactive dashboards, paginated reports, embedded analytics, and natural language queries through a unified BI solution. In this step, we use QuickSight to generate visual insights from the DynamoDB table by connecting to the Athena data source previously created.

Allow QuickSight to access to resources

Complete the following steps to grant QuickSight access to resources:

On the QuickSight console, choose the profile icon and choose Manage QuickSight.
In the navigation pane, choose Security & Permissions.
Under QuickSight access to AWS services, choose Manage.
QuickSight may ask you to switch to the Region in which users and groups in your account are managed. To change the current Region, navigate to the profile icon on the QuickSight console and choose the Region you want to switch to.
For IAM Role, choose Use QuickSight-managed role (default).

Subsequent instructions assume that the default QuickSight-managed role is being used. If this is not the case, make sure to update the existing role to the same effect.

Under Allow access and autodiscovery for these resources, select IAM and Amazon S3.
For Amazon S3, choose Select S3 buckets.
Choose the spill bucket you specified in earlier when deploying the Lambda function for the connector and the bucket you specified as the Athena query result location in Amazon S3.
For both buckets, select Write permission for Athena Workgroup.
Choose Amazon Athena.
In the pop-up window, choose Next.
Choose Lambda and choose the Amazon Resource Name (ARN) of the Lambda function previously used for the Athena data source connector.
Choose Finish.
Choose Save.

Create the Athena dataset

To create the Athena dataset, complete the following steps:

On the QuickSight console, choose the user profile and switch to the Region you deployed the Athena data source to.
Return to the QuickSight home page.
In the navigation pane, choose Datasets.
Choose New dataset.
For Create a Dataset, select Athena.
For Data source name, enter a name and choose Validate connection.
When the connection shows as Validated, choose Create data source.
Under Catalog, Database, and Tables, select the Athena data source, AWS Glue database, and AWS Glue table previously created.
Choose Select.
On the Finish dataset creation page, select Import to SPICE for quicker analytics.
Choose Visualize.

For additional information on QuickSight query modes, see Importing data into SPICE and Using SQL to customize data.

Build QuickSight visualizations

Once the DynamoDB data is available in QuickSight via the Athena DynamoDB connector, it is ready to be visualized. The QuickSight analysis in the below example shows a vertical stacked bar chart with the average price per product category for the ProductCatalog sample dataset. In addition, it shows a donut chart with the proportion of products by product category, and a tree map containing the count of bicycles per bicycle type.

If you use data imported to SPICE in a QuickSight analysis, the dataset will only be available after the import is complete. For further details, see Using SPICE data in an analysis.

For comprehensive information on how to create and share visualizations in QuickSight, refer to Visualizing data in Amazon QuickSight and Sharing and subscribing to data in Amazon QuickSight.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all resources created as part of this walkthrough.

Delete the Athena data source:
1. On the Athena console, switch to the Region you deployed your resources in.
2. Choose Data sources in the navigation pane.
3. Select the data source you created and on the Actions menu, choose Delete.
Delete the Lambda application:
1. On the AWS CloudFormation console, switch to the Region you deployed your resources in.
2. Choose Stacks in the navigation pane.
3. Select serverlessrepo-AthenaDynamoDBConnector and choose Delete.
Delete the AWS Glue resources:
1. On the AWS Glue console, switch to the Region you deployed your resources in.
2. Choose Databases in the navigation pane.
3. Select the database you created and choose Delete.
4. Choose Crawlers in the navigation pane.
5. Select the crawler you created and on the Action menu, choose Delete crawler.
Delete the QuickSight resources:
1. On the QuickSight console, switch to the Region you deployed your resources in.
2. Delete the analysis created for this walkthrough.
3. Delete the Athena dataset created for this walkthrough.
4. If you no longer need the Athena data source to create other datasets, delete the data source.

Summary

This post demonstrated how you can use the Athena DynamoDB connector to query data in DynamoDB with SQL and build visualizations in QuickSight.

Learn more about the Athena DynamoDB connector in the Amazon Athena User Guide. Discover more available data source connectors to query and visualize a variety of data sources without setting up or managing any infrastructure while only paying for the queries you run.

For advanced QuickSight capabilities powered by AI, see Gaining insights with machine learning (ML) in Amazon QuickSight and Answering business questions with Amazon QuickSight Q.

About the Authors

Antonio Samaniego Jurado is a Solutions Architect at Amazon Web Services. With a strong passion for modern technology, Antonio helps customers build state-of-the-art applications on AWS. A creator at heart, he loves community-driven learning and sharing of best practices across the AWS service portfolio to make the best of customers cloud journey.

Pascal Vogel is a Solutions Architect at Amazon Web Services. Pascal helps startups and enterprises build cloud-native solutions. As a cloud enthusiast, Pascal loves learning new technologies and connecting with like-minded customers who want to make a difference in their cloud journey.

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

2023-11-15 Phillip Karg

Post Syndicated from Phillip Karg original https://aws.amazon.com/blogs/big-data/bmw-cloud-efficiency-analytics-powered-by-amazon-quicksight-and-amazon-athena/

This post is written in collaboration with Philipp Karg and Alex Gutfreund from BMW Group.

Bayerische Motoren Werke AG (BMW) is a motor vehicle manufacturer headquartered in Germany with 149,475 employees worldwide and the profit before tax in the financial year 2022 was € 23.5 billion on revenues amounting to € 142.6 billion. BMW Group is one of the world’s leading premium manufacturers of automobiles and motorcycles, also providing premium financial and mobility services.

BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. BMW Cloud Efficiency Analytics (CLEA) is a homegrown tool developed within the BMW FinOps CoE (Center of Excellence) aiming to optimize and reduce costs across all these accounts.

In this post, we explore how the BMW Group FinOps CoE implemented their Cloud Efficiency Analytics tool (CLEA), powered by Amazon QuickSight and Amazon Athena. With this tool, they effectively reduced costs and optimized spend across all their AWS Cloud accounts, utilizing a centralized cost monitoring system and using key AWS services. The CLEA dashboards were built on the foundation of the Well-Architected Lab. For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard.

CLEA gives full transparency into cloud costs, usage, and efficiency from a high-level overview to granular service, resource, and operational levels. It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer. Additionally, it incorporates BMW Group’s internal system to integrate essential metadata, offering a comprehensive view of the data across various dimensions, such as group, department, product, and applications.

The ultimate goal is to raise awareness of cloud efficiency and optimize cloud utilization in a cost-effective and sustainable manner. The dashboards, which offer a holistic view together with a variety of cost and BMW Group-related dimensions, were successfully launched in May 2023 and became accessible to users within the BMW Group.

Overview of the BMW Cloud Data Hub

At the BMW Group, Cloud Data Hub (CDH) is the central platform for managing company-wide data and data solutions. It works as a bundle for resources that are bound to a specific staging environment and Region to store data on Amazon Simple Storage Service (Amazon S3), which is renowned for its industry-leading scalability, data availability, security, and performance. Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.

Data providers and consumers are the two fundamental users of a CDH dataset. Providers create datasets within assigned domain and as the owner of a dataset, they are responsible for the actual content and for providing appropriate metadata. They can use their own toolsets or rely on provided blueprints to ingest the data from source systems. Once released, consumers use datasets from different providers for analysis, machine learning (ML) workloads, and visualization.

Each CDH dataset has three processing layers: source (raw data), prepared (transformed data in Parquet), and semantic (combined datasets). It is possible to define stages (DEV, INT, PROD) in each layer to allow structured release and test without affecting PROD. Within each stage, it’s possible to create resources for storing actual data. Two resource types are associated with each database in a layer:

File store – S3 buckets for data storage
Database – AWS Glue databases for metadata sharing

Overview of the CLEA Landscape

The following diagram is a high-level overview of some of the technologies used for the extract, load, and transform (ELT) stages, as well as the final visualization and analysis layer. You might notice that this differs slightly from traditional ETL. The difference lies in when and where data transformation takes place. In ETL, data is transformed before it’s loaded into the data warehouse. In ELT, raw data is loaded into the data warehouse first, then it’s transformed directly within the warehouse. The ELT process has gained popularity with the rise of cloud-based, high-performance data warehouses, where transformation can be done more efficiently after loading.

Regardless of the method used, the goal is to provide high-quality, reliable data that can be used to drive business decisions.

CLEA Architecture

In this section, we take a closer look at the three essential stages mentioned previously: extract, load and transform.

Extract

The extract stage plays a pivotal role in the CLEA, serving as the initial step where data related to cost and usage and optimization is collected from a diverse range of sources within AWS. These sources encompass the AWS Cost and Usage Reports, Cost Explorer (and forecasting with Cost Explorer), Trusted Advisor, and Compute Optimizer. Furthermore, it fetches essential metadata from BMW Group’s internal system, offering a comprehensive view of the data across various dimensions, such as group, department, product, and applications in the later stages of data transformation.

The following diagram illustrates one of the data collection architectures that we use to collect Trusted Advisor data from nearly 4,500 AWS accounts and subsequently load that into Cloud Data Hub.

Let’s go through each numbered step as outlined in the architecture:

A time-based rule in Amazon EventBridge triggers the CLEA Shared Workflow AWS Step Functions state machine.
Based on the inputs, the Shared Workflow state machine invokes the Account Collector AWS Lambda function to retrieve AWS account details from AWS Organizations.
The Account Collector Lambda function assumes an AWS Identity and Access Management (IAM) role to access linked account details via the Organizations API and writes them to Amazon Simple Queue Service (Amazon SQS) queues.
The SQS queues trigger the Data Collector Lambda function using SQS Lambda triggers.
The Data Collector Lambda function assumes an IAM role in each linked account to retrieve the relevant data and load it into the CDH source S3 bucket.
When all linked accounts data is collected, the Shared Workflow state machine triggers an AWS Glue job for further data transformation.
The AWS Glue job reads raw data from the CDH source bucket and transforms it into a compact Parquet format.

Load and transform

For the data transformations, we used an open-source data transformation tool called dbt (Data Build Tool), modifying and preprocessing the data through a number of abstract data layers:

Source – This layer contains the raw data the data source provides. The preferred data format is Parquet, but JSON, CSV, or plain text file are also allowed.
Prepared – The source layer is transformed and stored as the prepared layer in Parquet format for optimized columnar access. Initial cleaning, filtering, and basic transformations are performed in this layer.
Semantic – A semantic layer combines several prepared layer datasets to a single dataset that contains transformations, calculations, and business logic to deliver business-friendly insights.
QuickSight – QuickSight is the final presentation layer, which is directly ingested into QuickSight SPICE from Athena via incremental daily ingestion queries. These ingested datasets are used as a source in CLEA dashboards.

Overall, using dbt’s data modeling and the pay-as-you-go pricing of Athena, BMW Group can control costs by running efficient queries on demand. Furthermore, with the serverless architecture of Athena and dbt’s structured transformations, you can scale data processing without worrying about infrastructure management. In CLEA there are currently more than 120 dbt models implemented with complex transformations. The semantic layer is incrementally materialized and partially ingested into QuickSight with up to 4 TB of SPICE capacity. For dbt deployment and scheduling, we use GitHub Actions which allows us to introduce new dbt models and changes easily with automatic deployments and tests.

CLEA Access control

In this section, we explain how we implemented access control using row-level security in QuickSight and QuickSight embedding for authentication and authorization.

RLS for QuickSight

Row-level security (RLS) is a key feature that governs data access and privacy, which we implemented for CLEA. RLS is a mechanism that allows us to control the visibility of data at the row level based on user attributes. In essence, it ensures that users can only access the data that they are authorized to view, adding an additional layer of data protection within the QuickSight environment.

Understanding the importance of RLS requires a broader view of the data landscape. In organizations where multiple users interact with the same datasets but require different access levels due to their roles, RLS becomes a pivotal tool. It ensures data security and compliance with privacy regulations, preventing unauthorized access to sensitive data. Additionally, it offers a tailored user experience by displaying only relevant data to the user, thereby enhancing the effectiveness of data analysis.

For CLEA, we collected BMW Group metadata such as department, application, and organization, which are quite important to allow users to only see the accounts within their department, application, organization, and so on. This is achieved using both a user name and group name for access control. We use the user name for user-specific access control and the group name for adding some users to a specific group to extend their permissions for different use cases.

Lastly, because there are many dashboards created by CLEA, we also control which users a unique user can see and also the data itself in the dashboard. This is done at the group level. By default, all users are assigned to CLEA-READER, which is granted access to core dashboards that we want to share with users, but there are different groups that allow users to see additional dashboards after they’re assigned to that group.

The RLS dataset is refreshed daily to catch recent changes regarding new user additions, group changes, or any other user access changes. This dataset is also ingested to SPICE daily, which automatically updates all datasets restricted via this RLS dataset.

QuickSight embedding

CLEA is a cross-platform application that provides secure access to QuickSight embedded content with custom-built authentication and authorization logic that sits on top of BMW Group identity and role management services (referred to as BMW IAM).

CLEA provides access to sensitive data to multiple users with different permissions, which is why it is designed with fine-grained access control rules. It enforces access control using role-based access control (RBAC) and attribute-based access control (ABAC) models at two different levels:

At the dashboard level via QuickSight user groups (RBAC)
At the dashboard data level via QuickSight RLS (RBAC and ABAC)

Dashboard-level permissions define the list of dashboards users are able to visualize.

Dashboard data-level permissions define the subsets of dashboard data shown to the user and are applied using RLS with the user attributes mentioned earlier. Although the majority of roles defined in CLEA are used for dashboard-level permissions, some specific roles are strategically defined to grant permissions at the dashboard data level, taking priority over the ABAC model.

BMW has a defined set of guidelines suggesting the usage of their IAM services as the single source of truth for identity and access control, which the team took into careful consideration when designing the authentication and authorization processes for CLEA.

Upon their first login, users are automatically registered in CLEA and assigned a base role that grants them access to a basic set of dashboards.

The process of registering users in CLEA consists of mapping a user’s identity as retrieved from BMW’s identity provider (IdP) to a QuickSight user, then assigning the newly created user to the respective QuickSight user group.

For users that require more extensive permissions (at one of the levels mentioned before), it is possible to order additional role assignments via BMW’s self-service portal for role management. Authorized reviewers will then review it and either accept or reject the role assignments.

Role assignments will take effect the next time the user logs in, at which time the user’s assigned roles in BMW Group IAM are synced to the user’s QuickSight groups—internally referred to as the identity and permissions sync. As shown in the following diagram, the sync groups step calculates which users’ group memberships should be kept, created, and deleted following the logic.

Usage Insights

Amazon CloudWatch plays an indispensable role in enhancing the efficiency and usability of CLEA dashboards. Not only does CloudWatch offer real-time monitoring of AWS resources, but it also allows to track user activity and dashboard utilization. By analyzing usage metrics and logs, we can see who has logged in to the CLEA dashboards, what features are most frequently accessed, and how long users interact with various elements. These insights are invaluable for making data-driven decisions on how to improve the dashboards for a better user experience. Through the intuitive interface of CloudWatch, it’s possible to set up alarms for alerting about abnormal activities or performance issues. Ultimately, employing CloudWatch for monitoring offers a comprehensive view of both system health and user engagement, helping us refine and enhance our dashboards continually.

Conclusion

BMW Group’s CLEA platform offers a comprehensive and effective solution to manage and optimize cloud resources. By providing full transparency into cloud costs, usage, and efficiency, CLEA offers insights from high-level overviews to granular details at the service, resource, and operational level.

CLEA aggregates data from various sources, enabling a detailed roadmap of the cloud operations, tracking footprints across primes, departments, products, applications, resources, and tags. This dynamic vision helps identify trends, anticipate future needs, and make strategic decisions.

Future plans for CLEA include enhancing capabilities with data consistency and accuracy, integrating additional sources like Amazon S3 Storage Lens for deeper insights, and introducing Amazon QuickSight Q for intelligent recommendations powered by machine learning, further streamlining cloud operations.

By following the practices here, you can unlock the potential of efficient cloud resource management by implementing Cloud Intelligence Dashboards, providing you with precise insights into costs, savings, and operational effectiveness.

About the Authors

Philipp Karg is Lead FinOps Engineer at BMW Group and founder of the CLEA platform. He focus on boosting cloud efficiency initiatives and establishing a cost-aware culture within the company to ultimately leverage the cloud in a sustainable way.

Alex Gutfreund is Head of Product and Technology Integration at the BMW Group. He spearheads the digital transformation with a particular focus on platforms ecosystems and efficiencies. With extensive experience at the interface of business and IT, he drives change and makes an impact in various organizations. His industry knowledge spans from automotive, semiconductor, public transportation, and renewable energies.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for Cloud Native and DevOps, and in his free time, he also enjoys contributing to open-source projects.

Selman Ay is a Data Architect in the AWS Professional Services team. He has worked with customers from various industries such as e-commerce, pharma, automotive and finance to build scalable data architectures and generate insights from the data. Outside of work, he enjoys playing tennis and engaging in outdoor activities.

Nick McCarthy is a Senior Machine Learning Engineer in the AWS Professional Services team. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work Nick loves to travel, exploring new cuisines and cultures in the process.

Miguel Henriques is a Cloud Application Architect in the AWS Professional Services team with 4 years of experience in the automotive industry delivering cloud native solutions. In his free time, he is constantly looking for advancements in the web development space and searching for the next great pastel de nata.

Managing AWS Lambda runtime upgrades

2023-11-15 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/managing-aws-lambda-runtime-upgrades/

This post is written by Julian Wood, Principal Developer Advocate, and Dan Fox, Principal Specialist Serverless Solutions Architect.

AWS Lambda supports multiple programming languages through the use of runtimes. A Lambda runtime provides a language-specific execution environment, which provides the OS, language support, and additional settings, such as environment variables and certificates that you can access from your function code.

You can use managed runtimes that Lambda provides or build your own. Each major programming language release has a separate managed runtime, with a unique runtime identifier, such as python3.11 or nodejs20.x.

Lambda automatically applies patches and security updates to all managed runtimes and their corresponding container base images. Automatic runtime patching is one of the features customers love most about Lambda. When these patches are no longer available, Lambda ends support for the runtime. Over the next few months, Lambda is deprecating a number of popular runtimes, triggered by end of life of upstream language versions and of Amazon Linux 1.

Runtime	Deprecation
Node.js 14	Nov 27, 2023
Node.js 16	Mar 11, 2024
Python 3.7	Nov 27, 2023
Java 8 (Amazon Linux 1)	Dec 31, 2023
Go 1.x	Dec 31, 2023
Ruby 2.7	Dec 07, 2023
Custom Runtime (provided)	Dec 31, 2023

Runtime deprecation is not unique to Lambda. You must upgrade code using Python 3.7 or Node.js 14 when those language versions reach end of life, regardless of which compute service your code is running on. Lambda can help make this easier by tracking which runtimes you are using and providing deprecation notifications.

This post contains considerations and best practices for managing runtime deprecations and upgrades when using Lambda. Adopting these techniques makes managing runtime upgrades easier, especially when working with a large number of functions.

Specifying Lambda runtimes

When you deploy your function as a .zip file archive, you choose a runtime when you create the function. To change the runtime, you can update your function’s configuration.

Lambda keeps each managed runtime up to date by taking on the operational burden of patching the runtimes with security updates, bug fixes, new features, performance enhancements, and support for minor version releases. These runtime updates are published as runtime versions. Lambda applies runtime updates to functions by migrating the function from an earlier runtime version to a new runtime version.

You can control how your functions receive these updates using runtime management controls. Runtime versions and runtime updates apply to patch updates for a given Lambda runtime. Lambda does not automatically upgrade functions between major language runtime versions, for example, from nodejs14.x to nodejs18.x.

For a function defined as a container image, you choose a runtime and the Linux distribution when you create the container image. Most customers start with one of the Lambda base container images, although you can also build your own images from scratch. To change the runtime, you create a new container image from a different base container image.

Why does Lambda deprecate runtimes?

Lambda deprecates a runtime when upstream runtime language maintainers mark their language end-of-life or security updates are no longer available.

In almost all cases, the end-of-life date of a language version or operating system is published well in advance. The Lambda runtime deprecation policy gives end-of-life schedules for each language that Lambda supports. Lambda notifies you by email and via your Personal Health Dashboard if you are using a runtime that is scheduled for deprecation.

Lambda runtime deprecation happens in several stages. Lambda first blocks creating new functions that use a given runtime. Lambda later also blocks updating existing functions using the unsupported runtime, except to update to a supported runtime. Lambda does not block invocations of functions that use a deprecated runtime. Function invocations continue indefinitely after the runtime reaches end of support.

Lambda is extending the deprecation notification period from 60 days before deprecation to 180 days. Previously, blocking new function creation happened at deprecation and blocking updates to existing functions 30 days later. Blocking creation of new functions now happens 30 days after deprecation, and blocking updates to existing functions 60 days after.

Lambda occasionally delays deprecation of a Lambda runtime for a limited period beyond the end of support date of the language version that the runtime supports. During this period, Lambda only applies security patches to the runtime OS. Lambda doesn’t apply security patches to programming language runtimes after they reach their end of support date.

Can Lambda automatically upgrade my runtime?

Moving from one major version of the language runtime to another has a significant risk of being a breaking change. Some libraries and dependencies within a language have deprecation schedules and do not support versions of a language past a certain point. Moving functions to new runtimes could potentially impact large-scale production workloads that customers depend on.

Since Lambda cannot guarantee backward compatibility between major language versions, upgrading the Lambda runtime used by a function is a customer-driven operation.

Lambda function versions

You can use function versions to manage the deployment of your functions. In Lambda, you make code and configuration changes to the default function version, which is called $LATEST. When you publish a function version, Lambda takes a snapshot of the code, runtime, and function configuration to maintain a consistent experience for users of that function version. When you invoke a function, you can specify the version to use or invoke the $LATEST version. Lambda function versions are required when using Provisioned Concurrency or SnapStart.

Some developers use an auto-versioning process by creating a new function version each time they deploy a change. This results in many versions of a function, with only a single version actually in use.

While Lambda applies runtime updates to published function versions, you cannot update the runtime major version for a published function version, for example from Node.js 16 to Node.js 20. To update the runtime for a function, you must update the $LATEST version, then create a new published function version if necessary. This means that different versions of a function can use different runtimes. The following shows the same function with version 1 using Node.js 14.x and version 2 using Node.js 18.x.

Version 1 using Node.js 14.x

Version 2 using Node.js 18.x

Ensure you create a maintenance process for deleting unused function versions, which also impact your Lambda storage quota.

Managing function runtime upgrades

Managing function runtime upgrades should be part of your software delivery lifecycle, in a similar way to how you treat dependencies and security updates. You need to understand which functions are being actively used in your organization. Organizations can create prioritization based on security profiles and/or function usage. You can use the same communication mechanisms you may already be using for handling security vulnerabilities.

Implement preventative guardrails to ensure that developers can only create functions using supported runtimes. Using infrastructure as code, CI/CD pipelines, and robust testing practices makes updating runtimes easier.

Identifying impacted functions

There are tools available to check Lambda runtime configuration and to identify which functions and what published function versions are actually in use. Deleting a function or function version that is no longer in use is the simplest way to avoid runtime deprecations.

You can identify functions using deprecated or soon to be deprecated runtimes using AWS Trusted Advisor. Use the AWS Lambda Functions Using Deprecated Runtimes check, in the Security category that provides 120 days’ notice.

AWS Trusted Advisor Lambda functions using deprecated runtimes

Trusted Advisor scans all versions of your functions, including $LATEST and published versions.

The AWS Command Line Interface (AWS CLI) can list all functions in a specific Region that are using a specific runtime. To find all functions in your account, repeat the following command for each AWS Region and account. Replace the <REGION> and <RUNTIME> parameters with your values. The --function-version ALL parameter causes all function versions to be returned; omit this parameter to return only the $LATEST version.

aws lambda list-functions --function-version ALL --region <REGION> --output text —query "Functions[?Runtime=='<RUNTIME>'].FunctionArn"

You can use AWS Config to create a view of the configuration of resources in your account and also store configuration snapshot data in Amazon S3. AWS Config queries do not support published function versions, they can only query the $LATEST version.

You can then use Amazon Athena and Amazon QuickSight to make dashboards to visualize AWS Config data. For more information, see the Implementing governance in depth for serverless applications learning guide.

Dashboard showing AWS Config data

There are a number of ways that you can track Lambda function usage.

You can use Amazon CloudWatch metrics explorer to view Lambda by runtime and track the Invocations metric within the default CloudWatch metrics retention period of 15 months.

Track invocations in Amazon CloudWatch metrics

You can turn on AWS CloudTrail data event logging to log an event every time Lambda functions are invoked. This helps you understand what identities are invoking functions and the frequency of their invocations.

AWS Cost and Usage Reports can show which functions are incurring cost and in use.

Limiting runtime usage

AWS CloudFormation Guard is an open-source evaluation tool to validate infrastructure as code templates. Create policy rules to ensure that developers only chose approved runtimes. For more information, see Preventative Controls with AWS CloudFormation Guard.

AWS Config rules allow you to check that Lambda function settings for the runtime match expected values. For more information on running these rules before deployment, see Preventative Controls with AWS Config. You can also reactively flag functions as non-compliant as your governance policies evolve. For more information, see Detective Controls with AWS Config.

Lambda does not currently have service control policies (SCP) to block function creation based on the runtime

Upgrade best practices

Use infrastructure as code tools to build and manage your Lambda functions, which can make it easier to manage upgrades.

Ensure you run tests against your functions when developing locally. Include automated tests as part of your CI/CD pipelines to provide confidence in your runtime upgrades. When rolling out function upgrades, you can use weighted aliases to shift traffic between two function versions as you monitor for errors and failures.

Using runtimes after deprecation

AWS strongly advises you to upgrade your functions to a supported runtime before deprecation to continue to benefit from security patches, bug-fixes, and the latest runtime features. While deprecation does not affect function invocations, you will be using an unsupported runtime, which may have unpatched security vulnerabilities. Your function may eventually stop working, for example, due to a certificate expiry.

Lambda blocks function creation and updates for functions using deprecated runtimes. To create or update functions after these operations are blocked, contact AWS Support.

Conclusion

Lambda is deprecating a number of popular runtimes over the next few months, reflecting the end-of-life of upstream language versions and Amazon Linux 1. This post covers considerations for managing Lambda function runtime upgrades.

For more serverless learning resources, visit Serverless Land.

Deploy Amazon QuickSight dashboards to monitor AWS Glue ETL job metrics and set alarms

2023-11-03 Michael Hamilton

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/deploy-amazon-quicksight-dashboards-to-monitor-aws-glue-etl-job-metrics-and-set-alarms/

No matter the industry or level of maturity within AWS, our customers require better visibility into their AWS Glue usage. Better visibility can lend itself to gains in operational efficiency, informed business decisions, and further transparency into your return on investment (ROI) when using the various features available through AWS Glue.

As your company grows, you should be able to answer simple questions about your AWS Glue usage, such as the following:

Where am I spending the most with AWS Glue?
Where can I save the most by taking advantage of new AWS Glue features?
What does my overall usage look like using AWS Glue?

AWS offers services such as Amazon QuickSight, a serverless business intelligence (BI) service that lets you centralize this view and even ask natural language questions of your data, using Amazon QuickSight Q. QuickSight can give business leaders and their technology counterparts a common landscape for reporting important details of their usage, providing automated narratives to bridge communication gaps.

In this post, we explore how to combine AWS Glue usage information and metrics with centralized reporting and visualization using QuickSight. This can provide you with a more comprehensive view of your usage and tools to help you dive deep into your AWS Glue job run environment. You have metrics available per job run within the AWS Glue console, but they don’t cover all available AWS Glue job metrics, and the visuals aren’t as interactive compared to the QuickSight dashboard.

Although we don’t cover optimizing your jobs for costs in this post, you can refer to Monitor and optimize cost on AWS Glue for Apache Spark to learn how to fine-tune your AWS Glue jobs for performance, efficiency ,and cost-optimization.

Let’s dive in!

Solution overview

The following diagram illustrates the architecture for the given solution. At a high level, a scheduled event triggers an orchestration flow consisting of multiple data, compute, and analytics resources—the output of which culminates as a set of visuals in a BI dashboard.

solution architecture

Now let’s dig into the technical details involved in this solution.

An AWS Step Functions workflow is scheduled to run once per hour through Amazon EventBridge, which triggers an AWS Lambda function that calls the AWS Glue GetJob and GetJobRun APIs. We parse this data to check for jobs that have succeeded, stopped, or failed in the past hour, as well as any streaming jobs. The metadata is extracted from each job run, including information like runtime, start time, end time, auto scaling, number of workers, and worker type, and is written to an Amazon DynamoDB table with TTL (time to live) enabled to ensure the table doesn’t grow too large.

We move into a parallel state to check two tables that Amazon Athena writes the output of the federated queries to. Athena first checks to make sure the tables exist in Amazon Simple Storage Service (Amazon S3), where the data will be stored. If the tables don’t exist, Athena creates them. One federated query gathers AWS Glue metric data from Amazon CloudWatch metrics; the other gathers data from the DynamoDB table where Lambda writes the AWS Glue job metadata it’s collecting. Both federated queries utilize appropriate filtering in order to only scan the necessary data from each source.

There is a choice state for each branch. If there is no new data to be added to a table in Amazon S3, the state ends and waits for the other to complete. For example, there could be an AWS Glue job that is running while the step is evaluating. In this case, the metrics for the job would be inserted in the table on Amazon S3, but the metadata from DynamoDB wouldn’t arrive until the following hour after the job has succeeded, stopped, or failed.

When new metrics or metadata are found, Athena inserts this data to the metrics or metadata tables in Amazon S3, which are both partitioned by the hour. After the data is inserted, the final steps call the QuickSight CreateIngestion API, which triggers data ingestion into QuickSight SPICE to power interactive analysis. At this point, the workflow has finished running and will run again the following hour.

In the following sections, we show you how to set up the solution, explore the dashboards, and configure alarms.

The code for this solution can be found at the AWS samples GitHub repository.

Prerequisites

You should have the following prerequisites:

An AWS account with AWS Identity and Access Management (IAM) privileges sufficient to create the solution resources
QuickSight Standard or Enterprise Edition with a QuickSight user created
An AWS Cloud9 integrated development environment (IDE) or your local machine using your preferred IDE with the following packages installed:
- Node Package Manager (npm)
- Git CLI
- AWS CDK v2
- AWS CDK Library
- TypeScript
The AWS Cloud Development Kit (AWS CDK) bootstrapped in your target AWS account and Region

Deploy solution resources with the AWS CDK

To provision the resources that build the dashboard and keep it up to date, we provide steps to download and deploy the solution via the AWS CDK. The solution was developed with cost-optimization as a priority, but some resources in the stack will incur costs once deployed.

This solution generates the following resources:

IAM role
EventBridge rule
Step Functions state machine
Lambda function
S3 bucket
Two AWS Glue tables and one AWS Glue database
DynamoDB table
Athena queries invoked by Step Functions
QuickSight data source, dataset, analysis, and dashboard

To deploy the solution, complete the following steps:

Clone the source code from AWS samples GitHub repository to the client:
```
git clone https://github.com/aws-samples/glue-metrics-in-quicksight
```

Bootstrap your AWS CDK app:

cd glue-metrics-in-quicksight
npm i aws-cdk-lib
cdk bootstrap

Deploy the solution with the required parameters:
1. The first parameter is for a new S3 bucket to be created, which holds the AWS Glue metrics and metadata.
2. The second parameter is required in order for QuickSight to assign permissions to the user who will manage the assets. Refer to Managing user access inside Amazon QuickSight to find your existing QuickSight users.
```
cdk deploy --parameters BucketName=New-Unique-Bucket-Name --parameters QuicksightUsername=QuickSight-Existing-User
```

If your deployment fails, make sure you installed the AWS CDK library and rerun cdk deploy after installing:

npm i aws-cdk-lib

The deployment may take up to 10 minutes.

After the solution is deployed, the Step Functions state machine will evaluate once per hour if it should ingest data into QuickSight. You can run some AWS Glue jobs after the stack is deployed and check the QuickSight dashboard in the next hour or two, where the job metadata and metrics will be populated for your analysis.

Explore the dashboard

The dashboard contains two sheets: Glue Jobs and Glue Metrics.

The Glue Jobs sheet includes all of the metadata about your AWS Glue job runs, including AWS Glue for Apache Spark, AWS Glue for Ray, and AWS Glue streaming ETL. Most of the visuals also have a hierarchy that you can drill down into with QuickSight, going as low as each specific job run ID. You can use controls to filter by date, job name, and job run ID.

In the following demonstration, you will see the pivot table, which is a simple view of all our job metadata, including estimated cost per job and job run. We open up a job name and see the different job runs. There is one individual job run that we would like to inspect the metrics on, so we choose the job name and choose View metrics for job run id: <my job run id>. This will take us to the Glue Metrics sheet and automatically filter for the job run ID we want to view.

glue information sheet

The Glue Metrics sheet is built to reflect the documentation we provide in AWS Glue resource monitoring. This documentation helps explain each visual in the dashboard. You can use the Glue Metrics sheet to view aggregated metrics across all jobs, a single job, or down to the job run ID.

To populate the Glue Metrics sheet, your AWS Glue jobs must be enabled to capture metrics in CloudWatch.

glue metrics sheet

Set up alerts

Setting up alerts on measures is also straightforward to do in QuickSight. To do so, choose (right-click) one of the tracked measures on either worksheet and choose Create Alarm. This will bring you to the configuration page to set up the metric you’d like to be alerted on.

quicksight alarm

The dashboard is designed to give you the freedom to alter it and make your own visualizations with the metadata and metrics that are provided to you. If you want even more insight into cost, consider deploying the CUDOS dashboard as well!

Clean up

If you no longer need the dashboard, delete the CDK app:

cdk destroy

Conclusion

In this post, we talked about the importance of having observability of your AWS Glue jobs and provided an AWS CDK app that deploys a QuickSight dashboard for you. We hope this helps you optimize your AWS Glue environment using the insights the dashboard provides. To learn about event-based alerting for your AWS Glue for Apache Spark and Ray jobs, refer to Automate alerting and reporting for AWS Glue job resource usage.

About the authors

Michael Hamilton is a Sr Analytics Solutions Architect focusing on helping enterprise customers in the south east modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Cody Penta is a Solutions Architect at Amazon Web Services and is based out of Charlotte, NC. He has a focus in security and CDK, and enjoys solving the really difficult problems in the technology world. Off the clock, he loves relaxing in the mountains, coding personal projects, and gaming.

Angus Ferguson is a Solutions Architect at AWS who is passionate about meeting customers across the world, helping them solve their technical challenges. Angus specializes in Data & Analytics with a focus on customers in the financial services industry.

Aggregating, searching, and visualizing log data from distributed sources with Amazon Athena and Amazon QuickSight

2023-11-02 Pratima Singh

Post Syndicated from Pratima Singh original https://aws.amazon.com/blogs/security/aggregating-searching-and-visualizing-log-data-from-distributed-sources-with-amazon-athena-and-amazon-quicksight/

Customers using Amazon Web Services (AWS) can use a range of native and third-party tools to build workloads based on their specific use cases. Logs and metrics are foundational components in building effective insights into the health of your IT environment. In a distributed and agile AWS environment, customers need a centralized and holistic solution to visualize the health and security posture of their infrastructure.

You can effectively categorize the members of the teams involved using the following roles:

Executive stakeholder: Owns and operates with their support staff and has total financial and risk accountability.
Data custodian: Aggregates related data sources while managing cost, access, and compliance.
Operator or analyst: Uses security tooling to monitor, assess, and respond to related events such as service disruptions.

In this blog post, we focus on the data custodian role. We show you how you can visualize metrics and logs centrally with Amazon QuickSight irrespective of the service or tool generating them. We use Amazon Simple Storage Service (Amazon S3) for storage, AWS Glue for cataloguing, and Amazon Athena for querying the data and creating structured query language (SQL) views for QuickSight to consume.

Target architecture

This post guides you towards building a target architecture in line with the AWS Well-Architected Framework. The tiered and multi-account target architecture, shown in Figure 1, uses account-level isolation to separate responsibilities across the various roles identified above and makes access management more defined and specific to those roles. The workload accounts generate the telemetry around the applications and infrastructure. The data custodian account is where the data lake is deployed and collects the telemetry. The operator account is where the queries and visualizations are created.

Throughout the post, I mention AWS services that reduce the operational overhead in one or more stages of the architecture.

Figure 1: Data visualization architecture

Ingestion

Irrespective of the technology choices, applications and infrastructure configurations should generate metrics and logs that report on resource health and security. The format of the logs depends on which tool and which part of the stack is generating the logs. For example, the format of log data generated by application code can capture bespoke and additional metadata deemed useful from a workload perspective as compared to access logs generated by proxies or load balancers. For more information on types of logs and effective logging strategies, see Logging strategies for security incident response.

Amazon S3 is a scalable, highly available, durable, and secure object storage that you will use as the storage layer. To build a solution that captures events agnostic of the source, you must forward data as a stream to the S3 bucket. Based on the architecture, there are multiple tools you can use to capture and stream data into S3 buckets. Some tools support integration with S3 and directly stream data to S3. Resources like servers and virtual machines need forwarding agents such as Amazon Kinesis Agent, Amazon CloudWatch agent, or Fluent Bit.

Amazon Kinesis Data Streams provides a scalable data streaming environment. Using on-demand capacity mode eliminates the need for capacity provisioning and capacity management for streaming workloads. For log data and metric collection, you should use on-demand capacity mode, because log data generation can be unpredictable depending on the requests that are being handled by the environment. Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet before storing the data in Amazon S3. Parquet is naturally compressed, and using Parquet native partitioning and compression allows for faster queries compared to JSON formatted objects.

Scalable data lake

Use AWS Lake Formation to build, secure, and manage the data lake to store log and metric data in S3 buckets. We recommend using tag-based access control and named resources to share the data in your data store to share data across accounts to build visualizations. Data custodians should configure access for relevant datasets to the operators who can use Athena to perform complex queries and build compelling data visualizations with QuickSight, as shown in Figure 2. For cross-account permissions, see Use Amazon Athena and Amazon QuickSight in a cross-account environment. You can also use Amazon DataZone to build additional governance and share data at scale within your organization. Note that the data lake is different to and separate from the Log Archive bucket and account described in Organizing Your AWS Environment Using Multiple Accounts.

Figure 2: Account structure

Amazon Security Lake

Amazon Security Lake is a fully managed security data lake service. You can use Security Lake to automatically centralize security data from AWS environments, SaaS providers, on-premises, and third-party sources into a purpose-built data lake that’s stored in your AWS account. Using Security Lake reduces the operational effort involved in building a scalable data lake, as the service automates the configuration and orchestration for the data lake with Lake Formation. Security Lake automatically transforms logs into a standard schema—the Open Cybersecurity Schema Framework (OCSF) — and parses them into a standard directory structure, which allows for faster queries. For more information, see How to visualize Amazon Security Lake findings with Amazon QuickSight.

Querying and visualization

Figure 3: Data sharing overview

After you’ve configured cross-account permissions, you can use Athena as the data source to create a dataset in QuickSight, as shown in Figure 3. You start by signing up for a QuickSight subscription. There are multiple ways to sign in to QuickSight; this post uses AWS Identity and Access Management (IAM) for access. To use QuickSight with Athena and Lake Formation, you first must authorize connections through Lake Formation. After permissions are in place, you can add datasets. You should verify that you’re using QuickSight in the same AWS Region as the Region where Lake Formation is sharing the data. You can do this by checking the Region in the QuickSight URL.

You can start with basic queries and visualizations as described in Query logs in S3 with Athena and Create a QuickSight visualization. Depending on the nature and origin of the logs and metrics that you want to query, you can use the examples published in Running SQL queries using Amazon Athena. To build custom analytics, you can create views with Athena. Views in Athena are logical tables that you can use to query a subset of data. Views help you to hide complexity and minimize maintenance when querying large tables. Use views as a source for new datasets to build specific health analytics and dashboards.

You can also use Amazon QuickSight Q to get started on your analytics journey. Powered by machine learning, Q uses natural language processing to provide insights into the datasets. After the dataset is configured, you can use Q to give you suggestions for questions to ask about the data. Q understands business language and generates results based on relevant phrases detected in the questions. For more information, see Working with Amazon QuickSight Q topics.

Conclusion

Logs and metrics offer insights into the health of your applications and infrastructure. It’s essential to build visibility into the health of your IT environment so that you can understand what good health looks like and identify outliers in your data. These outliers can be used to identify thresholds and feed into your incident response workflow to help identify security issues. This post helps you build out a scalable centralized visualization environment irrespective of the source of log and metric data.

This post is part 1 of a series that helps you dive deeper into the security analytics use case. In part 2, How to visualize Amazon Security Lake findings with Amazon QuickSight, you will learn how you can use Security Lake to reduce the operational overhead involved in building a scalable data lake and centralizing log data from SaaS providers, on-premises, AWS, and third-party sources into a purpose-built data lake. You will also learn how you can integrate Athena with Security Lake and create visualizations with QuickSight of the data and events captured by Security Lake.

Part 3, How to share security telemetry per Organizational Unit using Amazon Security Lake and AWS Lake Formation, dives deeper into how you can query security posture using AWS Security Hub findings integrated with Security Lake. You will also use the capabilities of Athena and QuickSight to visualize security posture in a distributed environment.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Create, train, and deploy Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store

2023-10-26 Anirban Sinha

Post Syndicated from Anirban Sinha original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-amazon-redshift-ml-model-integrating-features-from-amazon-sagemaker-feature-store/

Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.

Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), KMEANS, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identifying (device type, browser) and transaction (amount, credit or debit, and so on) related features. As a data scientist building an ML model, you may have access to the identifying information but not the transaction information, and having access to a feature store solves this.

In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.

We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features and use these models to make in-database predictions on new data for use cases such as fraud risk scoring.

Overview of solution

For this post, we create an ML model to predict if a transaction is fraudulent or not, given the transaction record. To build this, we need to engineer features that describe an individual credit card’s spending pattern, such as the number of transactions or the average transaction amount, and also information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.

To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.

The following diagram illustrates solution architecture.

The workflow contains the following steps:

Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
Create a Redshift table and load local feature data into the table.
Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
Use the offline feature store and local store for inference.

Dataset

To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. They can both be joined by the TransactionID column. The transaction table contains information about a particular transaction, such as amount, credit or debit card, and so on, and the identity table contains information about the user, such as device type and browser. The transaction must exist in the transaction table, but might not always be available in the identity table.

The following is an example of the transactions dataset.

The following is an example of the identity dataset.

Let’s assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes transaction data in a Redshift table, owned by them.

We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join it with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.

Create the offline feature group and ingest data

To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store it in Amazon S3 for further loading into the Redshift table.

Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.

Let’s explore some of the key steps in this section.

Download the sample notebook.
On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
Locate your notebook instance and choose Open Jupyter.
Choose Upload and upload the notebook you just downloaded.
Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
Follow the instructions and run all the cells up to the Cleanup Resources section.

The following are key steps from the notebook:

We create a Pandas DataFrame with the initial CSV data. We apply feature transformations for this dataset.

identity_data = pd.read_csv(io.BytesIO(identity_data_object["Body"].read()))
transaction_data = pd.read_csv(io.BytesIO(transaction_data_object["Body"].read()))

identity_data = identity_data.round(5)
transaction_data = transaction_data.round(5)

identity_data = identity_data.fillna(0)
transaction_data = transaction_data.fillna(0)

# Feature transformations for this dataset are applied 
# One hot encode card4, card6
encoded_card_bank = pd.get_dummies(transaction_data["card4"], prefix="card_bank")
encoded_card_type = pd.get_dummies(transaction_data["card6"], prefix="card_type")

transformed_transaction_data = pd.concat(
    [transaction_data, encoded_card_type, encoded_card_bank], axis=1
)

We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later in the Redshift table for building the local feature store.

transformed_transaction_data.to_csv("transformed_transaction_data.csv", header=False, index=False)
s3_client.upload_file("transformed_transaction_data.csv", default_s3_bucket_name, prefix + "/training_input/transformed_transaction_data.csv")

Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID.EventTime can be appended to your data when no timestamp is available. In the following code, you can see how these variables are set, and then EventTime is appended to both features’ data.
```
# record identifier and event time feature names
record_identifier_feature_name = "TransactionID"
event_time_feature_name = "EventTime"

# append EventTime feature
identity_data[event_time_feature_name] = pd.Series(
    [current_time_sec] * len(identity_data), dtype="float64"
)
```

We then create and ingest the data into the feature group using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and therefore can be loaded into a Pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.

identity_feature_group.create(
    s3_uri=<S3_Path_Feature_Store>,
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=<role_arn>,
    enable_online_store=False,
)

identity_feature_group_name = "identity-feature-group"

# load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
identity_feature_group.load_feature_definitions(data_frame=identity_data)
identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)

We can verify that data has been ingested into the feature group by running Athena queries in the notebook or running queries on the Athena console.

At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.

Create a Redshift table and load local feature data

To build a Redshift ML model, we build a training dataset joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records are persisted in Amazon S3. The transaction data is a local feature for training data that needs to made available in the Redshift table.

Let’s explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.

Create the customer_transaction table and load daily transaction data into the table, which you’ll use to train the ML model:

DROP TABLE customer_transaction;
CREATE TABLE customer_transaction (
  TransactionID INT,    
  isFraud INT,  
  TransactionDT INT,    
  TransactionAmt decimal(10,2), 
  card1 INT,    
  card2 decimal(10,2),card3 decimal(10,2),  
  card4 VARCHAR(20),card5 decimal(10,2),    
  card6 VARCHAR(20),    
  B1 INT,B2 INT,B3 INT,B4 INT,B5 INT,B6 INT,
  B7 INT,B8 INT,B9 INT,B10 INT,B11 INT,B12 INT,
  F1 INT,F2 INT,F3 INT,F4 INT,F5 INT,F6 INT,
  F7 INT,F8 INT,F9 INT,F10 INT,F11 INT,F12 INT,
  F13 INT,F14 INT,F15 INT,F16 INT,F17 INT,  
  N1 VARCHAR(20),N2 VARCHAR(20),N3 VARCHAR(20), 
  N4 VARCHAR(20),N5 VARCHAR(20),N6 VARCHAR(20), 
  N7 VARCHAR(20),N8 VARCHAR(20),N9 VARCHAR(20), 
  card_type_0  boolean,
  card_type_credit boolean,
  card_type_debit  boolean,
  card_bank_0  boolean,
  card_bank_american_express boolean,
  card_bank_discover  boolean,
  card_bank_mastercard  boolean,
  card_bank_visa boolean  
);

Load the sample data by using the following command. Replace your Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section in the notebook or by checking the dataset_uri_prefix in the notebook.
```
COPY customer_transaction
FROM '<s3path>/transformed_transaction_data.csv' 
IAM_ROLE default delimiter ',' 
region 'your-region';
```

Now that we have created a local feature store for the transaction data, we focus on integrating a centralized feature store with Amazon Redshift to access the identity data.

Create an external schema for Redshift Spectrum to access the offline store data

We have created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.

This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.

First, check that the identity-feature-group table is present in the Data Catalog under the sagemamker_featurestore database.

Using Redshift Query Editor V2, create an external schema using the following command:

CREATE EXTERNAL SCHEMA sagemaker_featurestore
FROM DATA CATALOG
DATABASE 'sagemaker_featurestore'
IAM_ROLE default
create external database if not exists;

All the tables, including identity-feature-group external tables, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can check the contents of the external schema.

Run the following query to sample a few records—note that your table name may be different:
```
Select * from sagemaker_featurestore.identity_feature_group_1680208535 limit 10;
```

Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name to match your external table name:

create or replace view public.credit_fraud_detection_v
AS select  "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
         case when "card_type_credit" = 'False' then 0 else 1 end as card_type_credit,
         case when "card_type_debit" = 'False' then 0 else 1 end as card_type_debit,
         case when "card_bank_american_express" = 'False' then 0 else 1 end as card_bank_american_express,
         case when "card_bank_discover" = 'False' then 0 else 1 end as card_bank_discover,
         case when "card_bank_mastercard" = 'False' then 0 else 1 end as card_bank_mastercard,
         case when "card_bank_visa" = 'False' then 0 else 1 end as card_bank_visa,
        "id_01","id_02","id_03","id_04","id_05"
from public.customer_transaction ct left join sagemaker_featurestore.identity_feature_group_1680208535 id
on id.transactionid = ct.transactionid with no schema binding;

Train and validate the fraud risk scoring ML model

Redshift ML gives you the flexibility to specify your own algorithms and model types and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a customer model by specifying AUTO OFF and the model type of XGBOOST. By turning AUTO OFF and using XGBoost, we are providing the necessary inputs for SageMaker to train the model. A benefit of this can be faster training times. XGBoost is as open-source version of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.

We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplementing data for training requests. Note that you will need to provide an S3 bucket name in the create model statement. It will take approximately 10 minutes to create the model.

CREATE MODEL frauddetection_xgboost
FROM (select  "isfraud",
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05"
from credit_fraud_detection_v where transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetection_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS (S3_BUCKET <s3_bucket>);

When you run the create model command, it will complete quickly in Amazon Redshift while the model training is happening in the background using SageMaker. You can check the status of the model by running a show model command:

show model frauddetection_xgboost;

The output of the show model command shows that the model state is TRAINING. It also shows other information such as the model type and the training job name that SageMaker assigned.
After a few minutes, we run the show model command again:

show model frauddetection_xgboost;

Now the output shows the model state is READY. We can also see the train:error score here, which at 0 tells us we have a good model. Now that the model is trained, we can use it for running inference queries.

Use the offline feature store and local store for inference

We can use the SQL function to apply the ML model to data in queries, reports, and dashboards. Let’s use the function ml_fn_frauddetection_xgboost created by our model against our test dataset by filtering where transactiondt >=12517618, to predict whether a transaction is fraudulent or not. SageMaker Feature Store can be useful in supplementing data for inference requests.

Run the following query to predict whether transactions are fraudulent or not:

select  "isfraud" as "Actual",
        ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") as "Predicted"
from credit_fraud_detection_v where transactiondt >= 12517618;

For binary and multi-class classification problems, we compute the accuracy as the model metric. Accuracy can be calculated based on the following:

accuracy = (sum (actual == predicted)/total) *100

Let’s apply the preceding code to our use case to find the accuracy of the model. We use the test data (transactiondt >= 12517618) to test the accuracy, and use the newly created function ml_fn_frauddetection_xgboost to predict and take the columns other than the target and label as the input:

-- check accuracy 
WITH infer_data AS (
SELECT "isfraud" AS label,
ml_fn_frauddetection_xgboost(
        "transactiondt",
        "transactionamt",
        "card1","card2","card3","card5",
        "card_type_credit",
        "card_type_debit",
        "card_bank_american_express",
        "card_bank_discover",
        "card_bank_mastercard",
        "card_bank_visa",
        "id_01","id_02","id_03","id_04","id_05") AS predicted,
CASE 
   WHEN label IS NULL
       THEN 0
   ELSE label
   END AS actual,
CASE 
   WHEN actual = predicted
       THEN 1::INT
   ELSE 0::INT
   END AS correct
FROM credit_fraud_detection_v where transactiondt >= 12517618),
aggr_data AS (
SELECT SUM(correct) AS num_correct,
COUNT(*) AS total
FROM infer_data) 

SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy FROM aggr_data;

Clean up

As a final step, clean up the resources:

Delete the Redshift cluster.
Run the Cleanup Resources section of your notebook.

Conclusion

Redshift ML enables you to bring machine learning to your data, powering fast and informed decision-making. SageMaker Feature Store provides a purpose-built feature management solution to help organizations scale ML development across business units and data science teams.

In this post, we showed how you can train an XGBoost model using Redshift ML with data spread across SageMaker Feature Store and a Redshift table. Additionally, we showed how you can make inferences on a trained model to detect fraud using Amazon Redshift SQL commands.

About the authors

Anirban Sinha is a Senior Technical Account Manager at AWS. He is passionate about building scalable data warehouses and big data solutions working closely with customers. He works with large ISVs customers, in helping them build and operate secure, resilient, scalable, and high-performance SaaS applications in the cloud.

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and using the power of ML within their data warehouse.

Gaurav Singh is a Senior Solutions Architect at AWS, specializing in AI/ML and Generative AI. Based in Pune, India, he focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. In his spare time, Gaurav loves to explore nature, read, and run.