All posts by Darshit Thakkar

From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud

Post Syndicated from Darshit Thakkar original https://aws.amazon.com/blogs/big-data/from-data-lakes-to-insights-dbt-adapter-for-amazon-athena-now-supported-in-dbt-cloud/

At AWS, we are committed to empowering organizations with tools that streamline data analytics and transformation processes. We are excited to announce that the dbt adapter for Amazon Athena is now officially supported in dbt Cloud. This integration enables data teams to efficiently transform and manage data using Athena with dbt Cloud’s robust features, enhancing the overall data workflow experience.

In this post, we discuss the advantages of dbt Cloud over dbt Core, common use cases, and how to get started with Amazon Athena using the dbt adapter.

The need for streamlined data transformations

As organizations increasingly adopt cloud-based data lakes and warehouses, the demand for efficient data transformation tools has grown. Athena plays a critical role in this ecosystem by providing a serverless, interactive query service that simplifies analyzing vast amounts of data stored in Amazon Simple Storage Service (Amazon S3) using standard SQL. This enables you to extract insights from your data without the complexity of managing infrastructure.

dbt has emerged as a leading framework, allowing data teams to transform and manage data pipelines effectively. With the dbt adapter for Athena adapter now supported in dbt Cloud, you can seamlessly integrate your AWS data architecture with dbt Cloud, taking advantage of the scalability and performance of Athena to simplify and scale your data workflows efficiently.

Benefits of the dbt adapter for Athena

We have collaborated with dbt Labs and the open source community on an adapter for dbt that enables dbt to interface directly with Athena. Previously, the dbt adapter for Athena was only compatible with dbt Core, requiring teams to manually manage configurations and execute transformations locally or through custom setups. Now, with support for dbt Cloud, you can access a managed, cloud-based environment that automates and enhances your data transformation workflows. This upgrade allows you to build, test, and deploy data models in dbt with greater ease and efficiency, using all the features that dbt Cloud provides.

The support of the dbt adapter for Athena in dbt Cloud offers several advantages over using it with dbt Core:

  • Managed infrastructure – dbt Cloud provides a fully managed environment for running dbt projects, eliminating the need for local setup, maintenance, and configuration. This saves time and effort, especially for teams looking to minimize infrastructure management and focus solely on data modeling.
  • Scheduling and automation – dbt Cloud comes with a job scheduler, allowing you to automate the execution of dbt models. This feature makes sure your datasets are always up to date without needing to set up and maintain external scheduling systems like Apache Airflow. You can also set up dependencies between jobs easily within dbt Cloud, making sure that transformations run in the correct sequence without manual oversight.
  • Enhanced collaboration and version control – You can use a web-based interface for editing and reviewing dbt models, enabling collaboration among data teams. You can review code changes directly on the platform, facilitating efficient teamwork. Additionally, dbt Cloud integrates with Git providers, making version control and code collaboration more streamlined. This makes sure your data models are well-documented, versioned, and straightforward to manage within a collaborative environment.
  • Monitoring and alerting – You get built-in tools for monitoring job executions and performance to set up alerts and notifications for job failures, providing quick response times and minimizing disruptions. Furthermore, you can gain insights into the performance of your data transformations with detailed execution logs and metrics, all accessible through the dbt Cloud interface.

Common use cases for using the dbt adapter with Athena

The following are common use cases for using the dbt adapter with Athena:

  • Building a data warehouse – Many organizations are moving towards a data warehouse architecture, combining the flexibility of data lakes with the performance and structure of data warehouses. Using Athena and the dbt adapter, you can transform raw data in Amazon S3 into well-structured tables suitable for analytics. This setup allows businesses to build a scalable and efficient data lakehouse where they can perform SQL-based transformations and make sure data is clean and ready for analytics without investing heavily in data warehouse infrastructure.
  • Incremental data processing – The adapter allows for incremental data processing, where only new or updated data is transformed and processed. This feature reduces the amount of data scanned by Athena, resulting in faster query performance and lower costs. For example, instead of processing an entire dataset daily, dbt can be configured to transform only the data ingested in the last 24 hours, making data operations more efficient and cost-effective.
  • Cost management and optimization – Because Athena charges based on the amount of data scanned by each query, cost optimization is critical. The adapter enables data teams to optimize transformations by creating efficient data models, such as partitioning and compressing data to minimize scan costs. Additionally, dbt’s automated scheduling in dbt Cloud can be used to manage the frequency of data transformations, making sure queries are run only when necessary, helping to control costs effectively.
  • Data archiving and tiered storage – Organizations with a large amount of historical data can use Athena to query archived data stored in the lower-cost storage classes of Amazon S3 (such as Amazon S3 Glacier). With the adapter, data teams can build models that segment and process data based on usage patterns, making sure frequently accessed data is optimized for quick queries while older data remains accessible but cost-efficient. Alternatively, you can use Amazon S3 Intelligent-Tiering to optimize storage costs by moving data between two access tiers when access patterns change. This approach helps in managing storage costs while maintaining the flexibility to analyze historical trends when needed.
  • Event-driven data transformations – In scenarios where organizations need to process data in near real time, such as for streaming event logs or Internet of Things (IoT) data, you can integrate the adapter into an event-driven architecture. For example, event data can be continuously loaded into Amazon S3, and dbt models can be configured to run incrementally, transforming the new data into structured formats for immediate analysis. This setup supports agile data processing while taking advantage of the serverless architecture of Athena to keep operational costs low.
  • Compliance and data governance – For organizations managing sensitive or regulated data, you can use Athena and the adapter to enforce data governance rules. With dbt, teams can define data quality checks and access controls as part of their transformation workflow. This makes sure that only compliant, high-quality data is made available for analytics, and costs are optimized by processing only the data that meets governance standards. Additionally, dbt’s documentation features help maintain a clear record of data transformations, supporting audit and compliance efforts.

How to use the dbt adapter for Athena

To get started, create a project and set up a connection with Athena in dbt Cloud. The following figure shows the steps to create a project using dbt Cloud and configure the Athena connection.

Next, use the dbt Cloud interactive development environment (IDE) to deploy your project. The following figure demonstrates how to build dbt runs and deploy changes to Athena using the dbt Cloud interface.

Conclusion

At AWS, we are committed to providing you with the best possible tools and services to help you succeed in the cloud. dbt has emerged as a leading data transformation platform, trusted by thousands of organizations worldwide. By partnering with dbt Labs, we are able to bring the power of dbt directly to the AWS Cloud, empowering you to seamlessly integrate your data transformation workflows into the broader cloud infrastructure. This partnership is a testament to our shared vision of making data more accessible, reliable, and valuable for organizations of all sizes.

We are excited to see how you will use the dbt Cloud compatible dbt adapter for Athena to drive your data-driven initiatives forward. The combination of dbt and Athena creates a powerful and efficient environment for transforming and analyzing data in a serverless architecture. This synergy allows you to take advantage of the strengths of both tools, making it straightforward to manage complex data pipelines, reduce costs, and scale your operations.


About the Authors

Darshit Thakkar is a Technical Product Manager with AWS and works with the Amazon Athena team.

Selman Ay is a Data Architect in the AWS Professional Services team.

BP Yau is a Sr Partner Solutions Architect at AWS helping customers architect big data solutions to process data at scale

Speed up queries with the cost-based optimizer in Amazon Athena

Post Syndicated from Darshit Thakkar original https://aws.amazon.com/blogs/big-data/speed-up-queries-with-cost-based-optimizer-in-amazon-athena/

Amazon Athena is a serverless, interactive analytics service built on open source frameworks, supporting open table file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. By using these statistics, CBO improves query run plans and boosts the performance of queries run in Athena. Some of the specific optimizations CBO can employ include join reordering and pushing aggregations down based on the statistics available for each table and column.

TPC-DS benchmarks These benchmarks demonstrate the power of the cost-based optimizer—queries run up to 2x times faster with CBO enabled compared to running the same TPC-DS queries without CBO.

Performance and cost comparison on TPC-DS benchmarks

We used the industry-standard TPC-DS 3 TB to represent different customer use cases. These are representative of workloads with 10 times the stated benchmark size. This means a 3 TB benchmark dataset accurately represents customer workloads on 30–50 TB datasets.

In our testing, the dataset was stored in Amazon S3 in non-compressed Parquet format and the AWS Glue Data Catalog was used to store metadata for databases and tables. Fact tables were partitioned on the date column used for join operations, and each fact table consisted of 2,000 partitions. To help illustrate the performance of CBO, we compare the behavior of various queries and highlight the performance differences between running with CBO enabled vs. disabled.

The following graph illustrates the runtime of queries on the engine with and without CBO.

The following graph presents the top 10 queries from the TPC-DS benchmark with the greatest performance improvement.

Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.

Cost-based join reordering

Join reordering, an optimization technique used by cost-based SQL optimizers, analyzes different join sequences to select the order that minimizes query runtime by reducing intermediate data processed at each step, lowering memory and CPU requirements.

Let’s talk about query 82 of the TPC-DS 3TB dataset. The query performs inner joins on four tables: item, inventory, date_dim, and store_sales. The store_sales table has 8.6 billion rows and is partitioned by date. The inventory table has 1 billion rows and is also partitioned by date. The item table contains 360,000 rows, and the date_dim table holds 73,000 rows.

Query 82

select  i_item_id ,i_item_desc ,i_current_price
from item, inventory, date_dim, store_sales
where i_current_price between 30 and 30+30
and inv_item_sk = i_item_sk
and d_date_sk=inv_date_sk
and cast(d_date as date) between cast('2002-05-30' as date) and (cast('2002-05-30' as date) +  interval '60' day)
and i_manufact_id in (437,129,727,663)
and inv_quantity_on_hand between 100 and 500
and ss_item_sk = i_item_sk
group by i_item_id,i_item_desc,i_current_price
order by i_item_id
limit 100

Without CBO

Without using CBO, the engine will determine the join order based on the sequence of tables defined in the input query with internal heuristics. The FROM clause of the input query is "from item, inventory, date_dim, store_sales" (all inner joins). After passing through internal heuristics, Athena chose the join order as ((item ⋈ (inventorydate_dim)) ⋈ store_sales). Despite store_sales being the largest fact table, it’s defined last in the FROM clause and therefore gets joined last. This plan fails to reduce the intermediate join sizes as early as possible, resulting in an increased query runtime. The following diagram shows the join order without CBO and the number of rows flowing through different stages.

With CBO

When using CBO, the optimizer determines the best join order using a variety of data, including statistics as well as join size estimation, join build side, and join type. In this instance, Athena’s selected join order is ((store_salesitem) ⋈ (inventorydate_dim)). The largest fact table, store_sales, without being shuffled, is first joined with the item dimension table. The other partitioned table, inventory, is also first joined in-place with the date_dim dimension table. The join with the dimension table acts as a filter on the fact table, which dramatically reduces the input data size of the join that follows. Note that which side a table resides for a join is significant in Athena, because it’s the table on the right that will be built into memory for the join operation. Therefore, we always want to keep the larger table on the left and the smaller table on the right. CBO chose a plan that the left side was 8.6 billion before, and now it’s 13.6 million.

With CBO, the query runtime improved by 25% (from 15 seconds down to 11 seconds) by choosing the optimal join order.

Next, let’s discuss another CBO technique.

Cost-based aggregation pushdown

Aggregation pushdown is an optimization technique used by query optimizers to improve performance. It involves pushing aggregation operations like SUM, COUNT, and AVG into an earlier stage in the query plan, while maintaining the same query semantics. This reduces the amount of data transferred between the stages. By minimizing data processing, aggregation pushdown decreases memory usage, I/O costs, and network traffic.

However, pushing down aggregation is not always beneficial. It depends on the data distribution. For example, grouping on a column with many rows but few distinct values (like gender) before joins works better. Grouping first means aggregating a large number of records into fewer records (just male, female, for example). Grouping after joining means a large number of records have to participate the join before being aggregated. On the other hand, grouping on a high cardinality column is better done after joins. Doing it before risks unnecessary aggregation overhead because each value is likely unique anyway and that step will not result in an earlier reduction in the amount of data transferred between intermediate stages.

Therefore, whether to push down aggregation should be a cost-based decision. Let’s take example of the query 2 run on a 3TB TPC-DS dataset, showing how the aggregation pushdown’s value depends on data distribution. The web_sales table has 2.1 billion rows and the catalog_sales table has 4.23 billion rows. Both tables are partitioned on the date column.

Query 2

with wscs as
 (select sold_date_sk
        ,sales_price
  from (select ws_sold_date_sk sold_date_sk
              ,ws_ext_sales_price sales_price
        from web_sales 
        union all
        select cs_sold_date_sk sold_date_sk
              ,cs_ext_sales_price sales_price
        from catalog_sales)),
 wswscs as 
 (select d_week_seq,
        sum(case when (d_day_name='Sunday') then sales_price else null end) sun_sales,
        sum(case when (d_day_name='Monday') then sales_price else null end) mon_sales,
        sum(case when (d_day_name='Tuesday') then sales_price else  null end) tue_sales,
        sum(case when (d_day_name='Wednesday') then sales_price else null end) wed_sales,
        sum(case when (d_day_name='Thursday') then sales_price else null end) thu_sales,
        sum(case when (d_day_name='Friday') then sales_price else null end) fri_sales,
        sum(case when (d_day_name='Saturday') then sales_price else null end) sat_sales
 from wscs
     ,date_dim
 where d_date_sk = sold_date_sk
 group by d_week_seq)
 select d_week_seq1
       ,round(sun_sales1/sun_sales2,2)
       ,round(mon_sales1/mon_sales2,2)
       ,round(tue_sales1/tue_sales2,2)
       ,round(wed_sales1/wed_sales2,2)
       ,round(thu_sales1/thu_sales2,2)
       ,round(fri_sales1/fri_sales2,2)
       ,round(sat_sales1/sat_sales2,2)
 from
 (select wswscs.d_week_seq d_week_seq1
        ,sun_sales sun_sales1
        ,mon_sales mon_sales1
        ,tue_sales tue_sales1
        ,wed_sales wed_sales1
        ,thu_sales thu_sales1
        ,fri_sales fri_sales1
        ,sat_sales sat_sales1
  from wswscs,date_dim 
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001) y,
 (select wswscs.d_week_seq d_week_seq2
        ,sun_sales sun_sales2
        ,mon_sales mon_sales2
        ,tue_sales tue_sales2
        ,wed_sales wed_sales2
        ,thu_sales thu_sales2
        ,fri_sales fri_sales2
        ,sat_sales sat_sales2
  from wswscs
      ,date_dim 
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001+1) z
 where d_week_seq1=d_week_seq2-53
 order by d_week_seq1

Without CBO

Athena first joins the result of the union all operation on the web_sales table and the catalog_sales table with another table. Only then does it perform aggregation on the joined results. In this example, the amount of data that needed to be joined was huge, resulting in a longer query runtime.

With CBO

Athena utilizes one of the statistics values, the distinct value count, to evaluate the cost implications of pushing down the aggregation vs. not doing so. When a column has many rows but few distinct values, CBO is more likely to push aggregation down. This shrank the qualified rows from web_sales and catalog_sales tables to 2,590 and 3,590 rows, respectively. These aggregated records were then unioned and used to join with the tables. Comparing to the plan without CBO, the records participating in the join from the two large tables dropped from 6.33 billion rows (2.1 billion + 4.23 billion) to just 6,180 rows (2,590 + 3,590). This significantly decreased query runtime.

With CBO, the query runtime improved by 50% (from 37 seconds down to 18 seconds). In summary, CBO helped Athena choose an optimal aggregation pushdown plan, cutting the query time in half compared to not using cost-based optimization.

Conclusion

In this post, we discussed how Athena uses a cost-based optimizer (CBO) in its engine v3 to use table statistics for generating more efficient query run plans. Testing on the TPC-DS benchmark showed an 11% improvement in overall query performance when using CBO compared to without it.

Two key optimization employed by CBO are join reordering and aggregate pushdown. Join reordering reduces intermediate data by intelligently picking the order to join tables based on statistics. Aggregate pushdown decreases intermediate data by pushing aggregations earlier in the plan when beneficial.

In summary, Athena’s new cost-based optimizer significantly speeds up queries by choosing superior run plans. CBO optimizes based on table statistics stored in the AWS Glue Data Catalog. This automatic optimization improves productivity for Athena users through more responsive query performance. To take advantage of optimization techniques of CBO, refer to working with column statistics to generate statistics on the tables and columns in the AWS Glue Data Catalog.


About the Authors

Darshit Thakkar is a Technical Product Manager with AWS and works with the Amazon Athena team based out of Boston, Massachusetts.

Wei Zheng is a Sr. Software Development Engineer with Amazon Athena. He joined AWS in 2021 and has been working on multiple performance improvements on Athena.

Chuho Chang is a Software Development Engineer with Amazon Athena. He has been working on query optimizers for over a decade.

Pathik Shah is a Sr. Analytics Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.