All posts by Kishore Dhamodaran

Interact with Apache Iceberg tables using Amazon Athena and cross account fine-grained permissions using AWS Lake Formation

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/big-data/interact-with-apache-iceberg-tables-using-amazon-athena-and-cross-account-fine-grained-permissions-using-aws-lake-formation/

We recently announced support for AWS Lake Formation fine-grained access control policies in Amazon Athena queries for data stored in any supported file format using table formats such as Apache Iceberg, Apache Hudi and Apache Hive. AWS Lake Formation allows you to define and enforce database, table, and column-level access policies to query Iceberg tables stored in Amazon S3. Lake Formation provides an authorization and governance layer on data stored in Amazon S3. This capability requires that you upgrade to Athena engine version 3.

Large organizations often have lines of business (LoBs) that operate with autonomy in managing their business data, which makes sharing data across LoBs non-trivial. These organizations have adopted a federated model, with each LoB having the autonomy to make decisions on its data. They use a publisher/consumer model with a centralized governance layer to enforce access controls. If you are interested in learning more about data mesh architecture, visit Design a data mesh architecture using AWS Lake Formation and AWS Glue. With Athena engine version 3, customers can use the same fine-grained controls for open data frameworks such as Apache Iceberg, Apache Hudi, and Apache Hive.

In this post, we deep dive into a use case where you have a producer/consumer model with data sharing enabled to give restricted access to an Apache Iceberg table that the consumer can query. We’ll discuss row filtering to restrict access to certain rows, column filtering to restrict column-level access, schema evolution, and time travel.

Solution overview

To illustrate the functionality of fine-grained permissions for Apache Iceberg tables with Athena and Lake Formation, we set up the following components:

  • In the producer account:
    • An AWS Glue Data Catalog to register the schema of a table in Apache Iceberg format
    • Lake Formation to provide fine-grained access to the consumer account
    • Athena to verify data from the producer account
  • In the consumer account:
    • AWS Resource Access Manager (AWS RAM) to create a handshake between the producer Data Catalog and consumer
    • Lake Formation to provide fine-grained access to the consumer account
    • Athena to verify data from producer account

The following diagram illustrates the architecture.

Cross-account fine-grained permissions architecture

Prerequisites

Before you get started, make sure you have the following:

  • Two AWS accounts: one to act as the data producer and one as the data consumer
  • Access to AWS Lake Formation, Amazon Athena, Amazon S3, and AWS RAM in both accounts, with Lake Formation administrator permissions

Data producer setup

In this section, we present the steps to set up the data producer.

Create an S3 bucket to store the table data

We create a new S3 bucket to save the data for the table:

  1. On the Amazon S3 console, create an S3 bucket with unique name (for this post, we use iceberg-athena-lakeformation-blog).
  2. Create the producer folder inside the bucket to use for the table.

Amazon S3 bucket and folder creation

Register the S3 path storing the table using Lake Formation

We register the full S3 path in Lake Formation:

  1. Navigate to the Lake Formation console.
  2. If you’re logging in for the first time, you’re prompted to create an admin user.
  3. In the navigation pane, under Register and ingest, choose Data lake locations.
  4. Choose Register location, and provide the S3 bucket path that you created earlier.
  5. Choose AWSServiceRoleForLakeFormationDataAccess for IAM role.

For additional information about roles, refer to Requirements for roles used to register locations.

If you enabled encryption of your S3 bucket, you have to provide permissions for Lake Formation to perform encryption and decryption operations. Refer to Registering an encrypted Amazon S3 location for guidance.

  1. Choose Register location.

Register Lake Formation location

Create an Iceberg table using Athena

Now let’s create the table using Athena backed by Apache Iceberg format:

  1. On the Athena console, choose Query editor in the navigation pane.
  2. If you’re using Athena for the first time, under Settings, choose Manage and enter the S3 bucket location that you created earlier (iceberg-athena-lakeformation-blog/producer).
  3. Choose Save.
  4. In the query editor, enter the following query (replace the location with the S3 bucket that you registered with Lake Formation). Note that this post uses a database named lf-demo-db, which later steps reference; you can use any other database as long as you adjust the queries accordingly.
CREATE TABLE consumer_iceberg (
customerid bigint,
customername string,
email string,
city string,
country string,
territory string,
contactfirstname string,
contactlastname string)
LOCATION 's3://YOUR-BUCKET/producer/' -- *** Change bucket name to your bucket***
TBLPROPERTIES ('table_type'='ICEBERG')
  1. Choose Run.

Athena query editor to create Iceberg table

Share the table with the consumer account

To illustrate functionality, we implement the following scenarios:

  • Provide access to selected columns
  • Provide access to selected rows based on a filter

Complete the following steps:

  1. On the Lake Formation console, in the navigation pane under Data catalog, choose Data filters.
  2. Choose Create new filter.
  3. For Data filter name, enter blog_data_filter.
  4. For Target database, enter lf-demo-db.
  5. For Target table, enter consumer_iceberg.
  6. For Column-level access, select Include columns.
  7. Choose the columns to share with the consumer: country, address, contactfirstname, city, customerid, and customername.
  8. For Row filter expression, enter the filter country='France'.
  9. Choose Create filter.

create data filter

Now let’s grant access to the consumer account on the consumer_iceberg table.

  1. In the navigation pane, choose Tables.
  2. Select the consumer_iceberg table, and choose Grant on the Actions menu.
    Grant access to consumer account on consumer_iceberg table
  3. Select External accounts.
  4. Enter the external account ID.
    Grant data permissions
  5. Select Named data catalog resources.
  6. Choose your database and table.
  7. For Data filters, choose the data filter you created.
    Add data filter
  8. For Data filter permissions and Grantable permissions, select Select.
  9. Choose Grant.

Permissions for creating grant

Data consumer setup

To set up the data consumer, we accept the resource share and create a table using AWS RAM and Lake Formation. Complete the following steps:

  1. Log in to the consumer account and navigate to the AWS RAM console.
  2. Under Shared with me in the navigation pane, choose Resource shares.
  3. Choose your resource share.
    Resource share in consumer account
  4. Choose Accept resource share.
  5. Note the name of the resource share to use in the next steps.
    Accept resource share
  6. Navigate to the Lake Formation console.
  7. If you’re logging in for the first time, you’re prompted to create an admin user.
  8. Choose Databases in the navigation pane, then choose your database.
  9. On the Actions menu, choose Create resource link.
    Create a resource link
  10. For Resource link name, enter the name of your resource link (for example, consumer_iceberg).
  11. Choose your database and shared table.
  12. Choose Create.
    Create table with resource link

Validate the solution

Now we can run different operations on the tables to validate the fine-grained access controls.

Insert operation

Let’s insert data into the consumer_iceberg table in the producer account, and validate the data filtering works as expected in the consumer account.

  1. Log in to the producer account.
  2. On the Athena console, choose Query editor in the navigation pane.
  3. Use the following SQL to insert data into the Iceberg table. Run the queries one at a time: highlight a single query in the query editor and choose Run (or Run again).
INSERT INTO consumer_iceberg VALUES (1, 'Land of Toys Inc.', '[email protected]',
'NYC','USA', 'NA', 'James', 'xxxx 118th NE');

INSERT INTO consumer_iceberg VALUES (2, 'Reims Collectables', '[email protected]',
'Reims','France', 'EMEA', 'Josephine', 'Darakjy');

INSERT INTO consumer_iceberg VALUES (3, 'Lyon Souveniers', '[email protected]',
'Paris', 'France', 'EMEA','Art', 'Venere');

Insert data into consumer_iceberg table in the producer account

  1. Use the following SQL to read and select data in the Iceberg table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Run select query to validate rows were inserted

  1. Log in to the consumer account.
  2. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Run same query in consumer account

Based on the filters, the consumer has visibility to a subset of columns, and rows where the country is France.
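
Conceptually, the grant with this data filter behaves as if the consumer were running a restricted query similar to the following sketch (shown only for illustration, using a subset of the columns included in the data filter; Lake Formation enforces the filter automatically, so the consumer still issues the plain SELECT shown above):

SELECT customerid, customername, city, country, contactfirstname
FROM "lf-demo-db"."consumer_iceberg"
WHERE country = 'France';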

Update/Delete operations

Now let’s update one of the rows and delete one from the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Update the row where city='Reims' to set city='Paris', then delete the row where customerid = 3:
    UPDATE consumer_iceberg SET city= 'Paris' WHERE city= 'Reims' ;

    Run update query in producer account

DELETE FROM consumer_iceberg WHERE customerid =3;

Run delete query in producer account

  1. Verify the updated and deleted dataset:
SELECT * FROM consumer_iceberg;

Verify update and delete reflected in producer account

  1. Log in to the consumer account.
  2. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Verify update and delete in consumer account

We can observe that only one row is available and the city is updated to Paris.

Schema evolution: Add a new column

Let’s add a new column to the table in the producer account and make it available in the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Add a new column called geo_loc to the Iceberg table. Run the queries one at a time: highlight a single query in the query editor and choose Run (or Run again).
ALTER TABLE consumer_iceberg ADD COLUMNS (geo_loc string);

INSERT INTO consumer_iceberg VALUES (5, 'Test_user', '[email protected]',
'Reims','France', 'EMEA', 'Test_user', 'Test_user', 'test_geo');

SELECT * FROM consumer_iceberg;

Add a new column in producer account

To provide visibility to the newly added geo_loc column, we need to update the Lake Formation data filter.

  1. On the Lake Formation console, choose Data filters in the navigation pane.
  2. Select your data filter and choose Edit.
    Update data filter
  3. Under Column-level access, add the new column (geo_loc).
  4. Choose Save.
    Add new column to data filter
  5. Log in to the consumer account.
  6. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Validate new column appears in consumer account

The new column geo_loc is visible, along with the additional row that was inserted.

Schema evolution: Delete column

Let’s drop a column from the table in the producer account and verify the change in the dataset shared with the consumer.

  1. Log in to the producer account.
  2. Alter the table to drop the address column from the Iceberg table. Run the queries one at a time: highlight a single query in the query editor and choose Run (or Run again).
ALTER TABLE consumer_iceberg DROP COLUMN address;

SELECT * FROM consumer_iceberg;

Delete a column in producer account

We can observe that the column address is not present in the table.

  1. Log in to the consumer account.
  2. In the Athena query editor, run the following SELECT query on the shared table:
SELECT * FROM "lf-demo-db"."consumer_iceberg" limit 10;

Validate column deletion in consumer account

The column address is not present in the table.

Time travel

We have now changed the Iceberg table multiple times. The Iceberg table keeps track of the snapshots. Complete the following steps to explore the time travel functionality:

  1. Log in to the producer account.
  2. Query the system table:
SELECT * FROM "lf-demo-db"."consumer_iceberg$snapshots" limit 10;

We can observe that we have generated multiple snapshots.

  1. Note down one of the committed_at values to use in the next steps (for this example, 2023-01-29 21:35:02.176 UTC).
    Time travel query in producer account
  2. Use time travel to query the table as of that timestamp:
SELECT * FROM consumer_iceberg FOR TIMESTAMP
AS OF TIMESTAMP '2023-01-29 21:35:02.176 UTC';

Find table snapshot using time travel
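
Athena can also travel to a specific snapshot by ID instead of by timestamp. The following is a minimal sketch; the snapshot ID shown is a placeholder, so substitute a snapshot_id value returned by the $snapshots query above:

SELECT * FROM consumer_iceberg
FOR VERSION AS OF 1234567890123456789; -- placeholder: replace with a snapshot_id from the $snapshots query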

Clean up

Complete the following steps to avoid incurring future charges:

  1. On the Amazon S3 console, delete the table storage bucket (for this post, iceberg-athena-lakeformation-blog).
  2. In the producer account on the Athena console, run the following commands to delete the tables you created:
DROP TABLE "lf-demo-db"."consumer_iceberg";
DROP DATABASE lf-demo-db;
  1. In the producer account on the Lake Formation console, revoke permissions to the consumer account.
    Clean up - Revoke permissions to consumer account
  2. Delete the S3 bucket used for the Athena query result location from the consumer account.

Conclusion

With support for cross-account, fine-grained access control policies for table formats such as Iceberg, you have the flexibility to work with any format supported by Athena. The ability to perform CRUD operations against the data in your S3 data lake, combined with Lake Formation fine-grained access controls for all tables and formats supported by Athena, provides opportunities to innovate and simplify your data strategy. We’d love to hear your feedback!


About the authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Jack Ye is a software engineer of the Athena Data Lake and Storage team at AWS. He is an Apache Iceberg Committer and PMC member.

Chris Olson is a Software Development Engineer at AWS.

Xiaoxuan Li is a Software Development Engineer at AWS.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Proactive Insights with Amazon DevOps Guru for RDS

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/proactive-insights-with-amazon-devops-guru-for-rds/

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully managed service, powered by machine learning (ML), that uses the data collected by RDS Performance Insights to detect and alert customers of anomalous behaviors within Amazon Aurora databases. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and to take corrective actions. Now, Proactive Insights adds recommendations related to operational issues, helping you prevent potential problems in the future.

Proactive Insights requires no additional setup for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

  • Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.
  • Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.
  • Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).
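
If you want to check these conditions yourself before (or after) DevOps Guru for RDS raises an insight, a quick sketch on Aurora MySQL-Compatible, separate from the DevOps Guru workflow, is to query the InnoDB metrics and the currently open transactions:

-- current history list length
SELECT name, `count` FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';

-- oldest open transactions, the usual cause of a growing history list
SELECT trx_id, trx_started, trx_mysql_thread_id
FROM information_schema.INNODB_TRX
ORDER BY trx_started;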

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that this Proactive Insight provides an “Insight overview”, “Metrics”, and “Recommendations”.

The Insight overview provides you with basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.
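
If you want to inspect these counters manually on an Aurora MySQL-Compatible instance, a simple sketch, separate from anything DevOps Guru does for you, is to compare the on-disk temporary table counter with the overall counter:

SHOW GLOBAL STATUS LIKE 'Created_tmp%';
-- Created_tmp_disk_tables divided by Created_tmp_tables gives the on-disk ratio
-- accumulated since the last restart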

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Temporary Tables On Disk Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics”, and “Recommendations”.

The Insight overview provides you with basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific to this insight. You are presented with the evolution of the number of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and the overall rate at which temporary tables are created (per second).

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning the tmp_table_size and max_heap_table_size database parameters, reducing the data set, indexing columns, and more.

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction for too long can impact performance by holding locks, blocking other queries, or preventing VACUUM (including autovacuum) from cleaning up dead rows.
A PostgreSQL database requires periodic maintenance, which is known as vacuuming. Autovacuum in PostgreSQL automates the execution of the VACUUM and ANALYZE commands. This process gathers the table statistics and deletes the dead rows. When vacuuming does not occur, the database performance suffers: table and index bloat increases (the disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), statistics become stale, and the database can even end up in transaction wraparound (when the number of unique transaction IDs reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time spent by sessions in an Aurora PostgreSQL database in the idle in transaction state and initially raises a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).
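
To see which sessions are currently sitting idle in transaction on an Aurora PostgreSQL-Compatible instance yourself, a minimal sketch, again independent of DevOps Guru, is to query pg_stat_activity:

SELECT pid, usename, state,
       now() - xact_start AS transaction_age,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;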

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Idle In Transaction Max Time Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics”, and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOps Guru for RDS enhances its ability to help you monitor your databases by notifying you about potential operational issues before they become bigger problems down the road. To get started, ensure that you have enabled Performance Insights on the database instance(s) you want monitored, and confirm that DevOps Guru is enabled to monitor those instances (for example, by enabling it at the account level, by monitoring specific CloudFormation stacks, or by using AWS tags for specific Aurora resources). Proactive Insights is available in all Regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/big-data/build-an-apache-iceberg-data-lake-using-amazon-athena-amazon-emr-and-aws-glue/

Most businesses store their critical data in a data lake, where you can bring data from various sources to a centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, Amazon SageMaker for machine learning, and Amazon QuickSight for data visualization.

Apache Iceberg is an open-source table format for data stored in data lakes. It is optimized for data access patterns in Amazon Simple Storage Service (Amazon S3) cloud object storage. Iceberg helps data engineers tackle complex challenges in data lakes such as managing continuously evolving datasets while maintaining query performance. Iceberg allows you to do the following:

  • Maintain transactional consistency where files can be added, removed, or modified atomically with full read isolation and multiple concurrent writes
  • Implement full schema evolution to process safe table schema updates as the table data evolves
  • Organize tables into flexible partition layouts with partition evolution, enabling updates to partition schemes as queries and data volumes change, without relying on physical directories
  • Perform row-level update and delete operations to satisfy new regulatory requirements such as the General Data Protection Regulation (GDPR)
  • Provide versioned tables and support time travel queries to query historical data and verify changes between updates
  • Roll back tables to prior versions to return tables to a known good state in case of any issues

In 2021, AWS teams contributed the Apache Iceberg integration with the AWS Glue Data Catalog to open source, which enables you to use open-source compute engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support of Iceberg and Amazon EMR added support of Iceberg starting with version 6.5.0.

In this post, we show you how to use Amazon EMR Spark to create an Iceberg table, load sample books review data, and use Athena to query, perform schema evolution, row-level update and delete, and time travel, all coordinated through the AWS Glue Data Catalog.

Solution overview

We use the Amazon Customer Reviews public dataset as our source data. The dataset contains data files in Apache Parquet format on Amazon S3. We load all the book-related Amazon review data as an Iceberg table to demonstrate the advantages of using the Iceberg table format on top of raw Parquet files. The following diagram illustrates our solution architecture.

Architecture that shows the flow from Amazon EMR loading data into Amazon S3, and queried by Amazon Athena through AWS Glue Data Catalog.

To set up and test this solution, we complete the following high-level steps:

  1. Create an S3 bucket.
  2. Create an EMR cluster.
  3. Create an EMR notebook.
  4. Configure a Spark session.
  5. Load data into the Iceberg table.
  6. Query the data in Athena.
  7. Perform a row-level update in Athena.
  8. Perform a schema evolution in Athena.
  9. Perform time travel in Athena.
  10. Consume Iceberg data across Amazon EMR and Athena.

Prerequisites

To follow along with this walkthrough, you must have the following:

  • An AWS Account with a role that has sufficient access to provision the required resources.

Create an S3 bucket

To create an S3 bucket that holds your Iceberg data, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. For Bucket name, enter a name (for this post, we enter aws-lake-house-iceberg-blog-demo).

Because S3 bucket names are globally unique, choose a different name when you create your bucket.

  1. For AWS Region, choose your preferred Region (for this post, we use us-east-1).

Create a new Amazon S3 bucket. Choose us-east-1 as region

  1. Complete the remaining steps to create your bucket.
  2. If this is the first time that you’re using Athena to run queries, create another globally unique S3 bucket to hold your Athena query output.

Create an EMR cluster

Now we’re ready to start an EMR cluster to run Iceberg jobs using Spark.

  1. On the Amazon EMR console, choose Create cluster.
  2. Choose Advanced options.
  3. For Software Configuration, choose your Amazon EMR release version.

Iceberg requires release 6.5.0 and above.

  1. Select JupyterEnterpriseGateway and Spark as the software to install.
  2. For Edit software settings, select Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}].
  3. Leave other settings at their default and choose Next.

Choose Amazon EMR release 6.6.0 and JupyterEnterpriseGateway and Spark. Enter configuration information.

  1. You can change the hardware used by the Amazon EMR cluster in this step. In this demo, we use the default setting.
  2. Choose Next.
  3. For Cluster name, enter Iceberg Spark Cluster.
  4. Leave the remaining settings unchanged and choose Next.

Provide Iceberg Spark Cluster as the Cluster name

  1. You can configure security settings such as adding an EC2 key pair to access your EMR cluster locally. In this demo, we use the default setting.
  2. Choose Create cluster.

You’re redirected to the cluster detail page, where you wait for the EMR cluster to transition from Starting to Waiting.

Create an EMR notebook

When the cluster is active and in the Waiting state, we’re ready to run Spark programs in the cluster. For this demo, we use an EMR notebook to run Spark commands.

  1. On the Amazon EMR console, choose Notebooks in the navigation pane.
  2. Choose Create notebook.
  3. For Notebook name, enter a name (for this post, we enter iceberg-spark-notebook).
  4. For Cluster, select Choose an existing cluster and choose Iceberg Spark Cluster.
  5. For AWS service role, choose Create a new role to create EMR_Notebook_DefaultRole or choose a different role to access resources in the notebook.
  6. Choose Create notebook.

Create an Amazon EMR notebook. Use EMR_Notebooks_DefaultRole

You’re redirected to the notebook detail page.

  1. Choose Open in JupyterLab next to your notebook.
  2. Choose to create a new notebook.
  3. Under Notebook, choose Spark.

Choose Spark from the options provided in the Launcher

Configure a Spark session

In your notebook, run the following code:

%%configure -f
{
  "conf": {
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.demo.warehouse": "s3://<your-iceberg-blog-demo-bucket>",
    "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }
}

This sets the following Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which uses the Iceberg Spark catalog plugin
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog uses AWS Glue as the physical catalog to store Iceberg database and table information
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog stores all Iceberg metadata and data files under the root path s3://<your-iceberg-blog-demo-bucket>
  • spark.sql.extensions – Adds support to Iceberg Spark SQL extensions, which allows you to run Iceberg Spark procedures and some Iceberg-only SQL commands (you use this in a later step)

Load data into the Iceberg table

In our Spark session, run the following commands to load data:

// create a database in AWS Glue named reviews if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.reviews")

// load reviews related to books
val book_reviews_location = "s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet"
val book_reviews = spark.read.parquet(book_reviews_location)

// write book reviews data to an Iceberg v2 table
book_reviews.writeTo("demo.reviews.book_reviews").tableProperty("format-version", "2").createOrReplace()

Iceberg format v2 is needed to support row-level updates and deletes. See Format Versioning for more details.

It may take up to 15 minutes for the commands to complete. When they complete, you should be able to see the table on the AWS Glue console, under the reviews database, with the table_type property shown as ICEBERG.

Shows the table properties for book_reviews table

The table schema is inferred from the source Parquet data files. You can also create the table with a specific schema before loading data using Spark SQL, Athena SQL, or Iceberg Java and Python SDKs.
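
For example, if you prefer to define the schema up front in Athena, the DDL could look like the following sketch. The table name and the column subset here are illustrative (the Amazon Customer Reviews dataset has additional columns), and LOCATION should point at your own bucket:

CREATE TABLE reviews.book_reviews_explicit (
  review_id string,
  product_title string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  review_date date)
LOCATION 's3://<your-iceberg-blog-demo-bucket>/book_reviews_explicit/'
TBLPROPERTIES ('table_type'='ICEBERG');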

Query in Athena

Navigate to the Athena console and choose Query editor. If this is your first time using the Athena query editor, you need to configure it to use the S3 bucket that you created earlier to store the query results.

The table book_reviews is available for querying. Run the following query:

SELECT * FROM reviews.book_reviews LIMIT 5;

The following screenshot shows the first five records from the table being displayed.

Amazon Athena query the first 5 rows and show the results

Perform a row-level update in Athena

In the next few steps, let’s focus on a record in the table with review ID RZDVOUQG1GBG7. Currently, it has no total votes when we run the following query:

SELECT total_votes FROM reviews.book_reviews 
WHERE review_id = 'RZDVOUQG1GBG7'

Query total_votes for a particular review which shows a value of 0

Let’s update the total_votes value to 2 using the following query:

UPDATE reviews.book_reviews
SET total_votes = 2
WHERE review_id = 'RZDVOUQG1GBG7'

Update query to set the total_votes for the previous review_id to 2

After your update command runs successfully, run the below query and note the updated result showing a total of two votes:

SELECT total_votes FROM reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Athena enforces ACID transaction guarantees for all write operations against an Iceberg table. It does this through the Iceberg format’s optimistic locking specification. When concurrent attempts are made to update the same record, a commit conflict occurs. In this scenario, Athena displays a transaction conflict error, as shown in the following screenshot.

Concurrent updates causes a failure. This shows the TRANSACTION_CONFLICT error during this scenario.

Delete queries work in a similar way; see DELETE for more details.
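
As a quick sketch, a row-level delete has the same shape as the update. The review ID below is a hypothetical placeholder rather than a value used elsewhere in this post:

DELETE FROM reviews.book_reviews
WHERE review_id = 'R0EXAMPLE000000'; -- hypothetical review_id, used only for illustration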

Perform a schema evolution in Athena

Suppose the review suddenly goes viral and gets 10 billion votes:

UPDATE reviews.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Based on the AWS Glue table information, the total_votes is an integer column. If you try to update a value of 10 billion, which is greater than the maximum allowed integer value, you get an error reporting a type mismatch.

Updating to a very large value greater than maximum allowed integer value results in an error

Iceberg supports most schema evolution features as metadata-only operations, which don’t require a table rewrite. This includes adding, dropping, renaming, and reordering columns, and promoting column types. To solve this issue, you can change the integer column total_votes to a BIGINT type by running the following DDL:

ALTER TABLE reviews.book_reviews
CHANGE COLUMN total_votes total_votes BIGINT;

You can now update the value successfully:

UPDATE reviews.book_reviews
SET total_votes = 10000000000
WHERE review_id = 'RZDVOUQG1GBG7'

Querying the record now gives us the expected result in BIGINT:

SELECT total_votes FROM reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'

Perform time travel in Athena

In Iceberg, the transaction history is retained, and each transaction commit creates a new version. You can perform time travel to look at a historical version of a table. In Athena, you can use the following syntax to travel to a point in time after the first version was committed:

SELECT total_votes FROM reviews.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Query an earlier snapshot using time travel feature

Consume Iceberg data across Amazon EMR and Athena

One of the most important features of a data lake is for different systems to seamlessly work together through the Iceberg open-source protocol. After all the operations are performed in Athena, let’s go back to Amazon EMR and confirm that Amazon EMR Spark can consume the updated data.

First, run the same Spark SQL and see if you get the same result for the review used in the example:

val select_votes = """SELECT total_votes FROM demo.reviews.book_reviews
WHERE review_id = 'RZDVOUQG1GBG7'"""

spark.sql(select_votes).show()

Spark shows 10 billion total votes for the review.

Shows the latest value of total_votes when querying using the Amazon EMR notebook

Check the transaction history of the operations you performed in Athena through Iceberg’s history system table in Spark:

val select_history = "SELECT * FROM demo.reviews.book_reviews.history"

spark.sql(select_history).show()

This shows three transactions: one for the initial data load from Amazon EMR, and two corresponding to the updates you ran in Athena.

Shows snapshots corresponding to the two updates you ran in Athena

Iceberg offers a variety of Spark procedures to optimize the table. For example, you can run an expire_snapshots procedure to remove old snapshots, and free up storage space in Amazon S3:

import java.util.Calendar
import java.text.SimpleDateFormat

val now = Calendar.getInstance().getTime()
val form = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val now_formatted = form.format(now.getTime())
val procedure = s"""CALL demo.system.expire_snapshots(
  table => 'reviews.book_reviews',
  older_than => TIMESTAMP '$now_formatted',
  retain_last => 1)"""

spark.sql(procedure)

Note that, after running this procedure, time travel can no longer be performed against expired snapshots.

Examine the history system table again and notice that it shows you only the most recent snapshot.
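
As a cross-check from the Athena side, you can query the Iceberg $snapshots metadata table (the same style of metadata query that Athena supports for Iceberg tables); after expire_snapshots with retain_last => 1, it should return a single row:

SELECT committed_at, snapshot_id, operation
FROM "reviews"."book_reviews$snapshots";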

Running the following query in Athena results in an error “No table snapshot found before timestamp…” as older snapshots were deleted, and you are no longer able to time travel to the older snapshot:

SELECT total_votes FROM reviews.book_reviews
FOR SYSTEM_TIME AS OF localtimestamp + interval '-20' minute
WHERE review_id = 'RZDVOUQG1GBG7'

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

  1. Run the following code in your notebook to drop the AWS Glue table and database:
// DROP the table 
spark.sql("DROP TABLE demo.reviews.book_reviews") 
// DROP the database 
spark.sql("DROP DATABASE demo.reviews")
  1. On the Amazon EMR console, choose Notebooks in the navigation pane.
  2. Select the notebook iceberg-spark-notebook and choose Delete.
  3. Choose Clusters in the navigation pane.
  4. Select the cluster Iceberg Spark Cluster and choose Terminate.
  5. Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we showed you an example of using Amazon S3, AWS Glue, Amazon EMR, and Athena to build an Iceberg data lake on AWS. An Iceberg table can seamlessly work across two popular compute engines, and you can take advantage of both to design your customized data production and consumption use cases.

With AWS Glue, Amazon EMR, and Athena, you can already use many features through AWS integrations, such as SageMaker Athena integration for machine learning, or QuickSight Athena integration for dashboard and reporting. AWS Glue also offers the Iceberg connector, which you can use to author and run Iceberg data pipelines.

In addition, Iceberg supports a variety of other open-source compute engines that you can choose from. For example, you can use Apache Flink on Amazon EMR for streaming and change data capture (CDC) use cases. The strong transaction guarantee and the efficient row-level update, delete, time travel, and schema evolution experience offered by Iceberg provide a sound foundation and infinite possibilities for users to unlock the power of big data.


About the Authors

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Jack Ye is a software engineer of the Athena Data Lake and Storage team. He is an Apache Iceberg Committer and PMC member.

Mohit Mehta is a Principal Architect at AWS with expertise in AI/ML and data analytics. He holds 12 AWS certifications and is passionate about helping customers implement cloud enterprise strategies for digital transformation. In his free time, he trains for marathons and plans hikes across major peaks around the world.

Giovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.

Jared Keating is a Senior Cloud Consultant with AWS Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements, drawing from his 20+ years of IT experience.

Leverage DevOps Guru for RDS to detect anomalies and resolve operational issues

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/leverage-devops-guru-for-rds-to-detect-anomalies-and-resolve-operational-issues/

The Relational Database Management System (RDBMS) is a popular choice among organizations running critical applications that support online transaction processing (OLTP) use cases. But managing an RDBMS database comes with its own challenges. AWS has made it easier for organizations to operate these databases in the cloud, thereby addressing the undifferentiated heavy lifting with managed databases (Amazon Aurora, Amazon RDS). Although using managed services has freed up engineering from provisioning hardware, database setup, patching, and backups, they still face the challenges that come with running a highly performant database. As applications scale in size and sophistication, it becomes increasingly challenging for customers to detect and resolve relational database performance bottlenecks and other operational issues quickly.

Amazon RDS Performance Insights is a database performance tuning and monitoring feature, that lets you quickly assess your database load and determine when and where to take action. Performance Insights lets non-experts in database administration diagnose performance problems with an easy-to-understand dashboard that visualizes database load. Furthermore, Performance Insights expands on the existing Amazon RDS monitoring features to illustrate database performance and help analyze any issues that affect it. The Performance Insights dashboard also lets you visualize the database load and filter the load by waits, SQL statements, hosts, or users.

On Dec 1st, 2021, we announced Amazon DevOps Guru for RDS, a new capability for Amazon DevOps Guru. It’s a fully-managed machine learning (ML)-powered service that detects operational and performance related issues for Amazon Aurora engines. It uses the data that it collects from Performance Insights, and then automatically detects and alerts customers of application issues, including database problems. When DevOps Guru detects an issue in an RDS database, it publishes an insight in the DevOps Guru dashboard. The insight contains an anomaly for the resource AWS/RDS. If DevOps Guru for RDS is turned on for your instances, then the anomaly contains a detailed analysis of the problem. DevOps Guru for RDS also recommends that you perform an investigation, or it provides a specific corrective action. For example, the recommendation might be to investigate a specific high-load SQL statement or to scale database resources.

In this post, we’ll deep-dive into some of the common issues that you may encounter while running your workloads against Amazon Aurora MySQL-Compatible Edition databases, with simulated performance issues. We’ll also look at how DevOps Guru for RDS can help identify and resolve these issues. Simulating a performance issue is resource intensive, and it will cost you money to run these tests. If you choose the default options that are provided, and clean up your resources using the following clean-up instructions, then it will cost you approximately $15 to run the first test only. If you wish to run all of the tests, then you can choose “all” in the Tests parameter choice. This will cost you approximately $28 to run all three tests.

Prerequisites

To follow along with this walkthrough, you must have the following prerequisites:

  • An AWS account with a role that has sufficient access to provision the required infrastructure. The account should also not have exceeded its quota for the resources being deployed (VPCs, Amazon Aurora, etc.).
  • Credentials that enable you to interact with your AWS account.
  • If you already have Amazon DevOps Guru turned on, then make sure that it’s tagged properly to detect issues for the resource being deployed.

Solution overview

You will clone the project from GitHub and deploy an AWS CloudFormation template, which will set up the infrastructure required to run the tests. If you choose to use the defaults, then you can run only the first test. If you would like to run all of the tests, then choose the “all” option under Tests parameter.

We simulate some common scenarios that your database might encounter when running enterprise applications. The first test simulates locking issues. The second test simulates the behavior when the AUTOCOMMIT property of the database driver is set to True, which can result in increased statement latency. The third test simulates performance issues when an index is missing on a large table.

Solution walkthrough

Clone the repo and deploy resources

  1. Utilize the following command to clone the GitHub repository that contains the CloudFormation template and the scripts necessary to simulate the database load. Note that by default, we’ve provided the command to run only the first test.
    git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
    cd amazon-devops-guru-rds
    
    aws cloudformation create-stack --stack-name DevOpsGuru-Stack \
        --template-body file://DevOpsGuruMySQL.yaml \
        --capabilities CAPABILITY_IAM \
        --parameters ParameterKey=Tests,ParameterValue=one \
    ParameterKey=EnableDevOpsGuru,ParameterValue=y

    If you wish to run all of the tests, then flip the ParameterValue of the Tests ParameterKey to “all”.

    If Amazon DevOps Guru is already enabled in your account, then change the ParameterValue of the EnableDevOpsGuru ParameterKey to “n”.

    It may take up to 30 minutes for CloudFormation to provision the necessary resources. Visit the CloudFormation console (make sure to choose the region where you have deployed your resources), and make sure that DevOpsGuru-Stack is in the CREATE_COMPLETE state before proceeding to the next step.

  2. Navigate to AWS Cloud9, then choose Your environments. Next, choose DevOpsGuruMySQLInstance followed by Open IDE. This opens a cloud-based IDE environment where you will be running your tests. Note that in this setup, AWS Cloud9 inherits the credentials that you used to deploy the CloudFormation template.
  3. Open a new terminal window which you will be using to clone the repository where the scripts are located.

  1. Clone the repo into your Cloud9 environment, then navigate to the directory where the scripts are located, and run initial setup.
git clone https://github.com/aws-samples/amazon-devops-guru-rds.git
cd amazon-devops-guru-rds/scripts
sh setup.sh 
# NOTE: If you are running all test cases, use sh setup.sh all command instead. 
source ~/.bashrc
  1. Initialize databases for all of the test cases, and add random data into them. The script to insert random data takes approximately five hours to complete. Your AWS Cloud9 instance is set up to run for up to 24 hours before shutting down. You can exit the browser and return between 5–24 hours to validate that the script ran successfully, then continue to the next step.
source ./connect.sh test 1
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py

If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, open two new terminal windows and run the following commands to insert random data for test cases 2 and 3.

# Test case 2 – Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py
# Test case 3 - Open a new terminal window to run the commands
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
USE devopsgurusource;
CREATE TABLE IF NOT EXISTS test1 (id int, filler char(255), timer timestamp);
exit;
python3 ct.py
  1. Return between 5-24 hours to run the next set of commands.
  1. Add an index to the first database.
source ./connect.sh test 1
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);
exit;
  1. If you chose to run all test cases, and you ran the sh setup.sh all command in Step 4, add an index to the second database. NOTE: Do not add an index to the third database.
source ./connect.sh test 2
CREATE UNIQUE INDEX test1_pk ON test1(id);
INSERT INTO test1 VALUES (-1, 'locker', current_timestamp);
exit;

DevOps Guru for RDS uses Performance Insights, and it establishes a baseline for the database metrics. Baselining involves analyzing the database performance metrics over a period of time to establish a “normal” behavior. DevOps Guru for RDS then uses ML to detect anomalies against the established baseline. If your workload pattern changes, then DevOps Guru for RDS establishes a new baseline that it uses to detect anomalies against the new “normal”. For new database instances, DevOps Guru for RDS takes up to two days to establish an initial baseline, as it requires an analysis of the database usage patterns and establishing what is considered a normal behavior.

  1. Allow two days before you start running the following tests.

Scenario 1: Locking Issues

In this scenario, multiple sessions compete for the same (“locked”) record, and they must wait for each other.
In real life, this often happens when:

  • A database session gets disconnected due to a malfunction (for example, a temporary network issue) while still holding a critical lock.
  • Other sessions become stuck while waiting for the lock to be released.
  • The problem is often exacerbated by the application connection manager that keeps spawning additional sessions (because the existing sessions don’t complete the work on time), thus creating a distinct “inclined slope” pattern that you’ll see in this scenario.

Here’s how you can reproduce it:

  1. Connect to the database.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
  1. In your MySQL, enter the following SQL, and don’t exit the shell.
START TRANSACTION;
UPDATE test1 SET timer=current_timestamp WHERE id=-1;
-- Do NOT exit!
  1. Open a new terminal, and run the command to simulate competing transactions. Give it approximately five minutes before you run the commands in this step.
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 1
exit;
python3 locking_scenario.py 1 1200 2
  1. After the program completes its execution, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. You’ll notice a summary of the insight under Description.

Shows navigation to Amazon DevOps Guru Insights and RDS DB Load Anomalous screen to find the summary description of the anomaly.

  1. Choose the View Recommendations link on the top right, and observe the databases for which it’s showing the recommendations.
  2. Next, choose View detailed analysis for database performance anomaly for the following resources.
  3. Under To view a detailed analysis, choose a resource name, choose the database associated with the first test.

 Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen, and a graphical representation of how the Average active sessions (AAS) spikes, which Amazon DevOps Guru is able to identify.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

In this example, DevOps Guru for RDS has detected a high and unusual spike of database load, and then marked it as “performance anomaly”.

Note that the relative size of the anomaly is significant: 490 times higher than the “typical” database load, which is why it’s deemed: “HIGH severity”.

In the analysis section, note that a single “wait event”, wait/synch/mutex/innodb/aurora_lock_thread_slot_futex, is dominating the entire spike. Moreover, a single SQL is “responsible” (or more precisely: “suffering”) from this wait event at the time of the problem. Select the wait event name and see a simple explanation of what’s happening in the database. For example, it’s “record locking”, where multiple sessions are competing for the same database records. Additionally, you can select the SQL hash and see the exact text of the SQL that’s responsible for the issue.

If you’re interested in why DevOps Guru for RDS detected this problem, and why these particular wait events and an SQL were selected, the Why is this a problem? and Why do we recommend this? links will provide the answer.

Finally, the most relevant part of this analysis is a View troubleshooting doc link. It references a document that contains a detailed explanation of the likely causes for this problem, as well as the actions that you can take to troubleshoot and address it.

Scenario 2: Autocommit: ON

In this scenario, we must run multiple batch updates, and we’re using a fairly popular driver setting: AUTOCOMMIT: ON.

This setting can sometimes lead to performance issues as it causes each UPDATE statement in a batch to be “encased” in its own “transaction”. This leads to data changes being frequently synchronized to disk, thus dramatically increasing batch latency.
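
A common mitigation, sketched here separately from the test scripts in this post, is to turn autocommit off and wrap the batch in a single explicit transaction, so the redo log is flushed once per batch instead of once per statement (the id values are illustrative):

SET autocommit = 0;
START TRANSACTION;
UPDATE test1 SET timer = current_timestamp WHERE id = 1;
UPDATE test1 SET timer = current_timestamp WHERE id = 2;
-- ... remaining UPDATE statements in the batch ...
COMMIT;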

Here’s how you can reproduce the scenario:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 2
exit;
python3 batch_autocommit.py 50 1200 1000 10000000
  1. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under To view a detailed analysis, choose a resource name, choose the database associated with the second test.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

Note that DevOps Guru for RDS detected a significant (and unusual) spike of database load and marked it as a HIGH severity anomaly.

The spike looks similar to the previous example (albeit, “smaller”), but it describes a different database problem (“COMMIT slowdowns”). This is because of a different database wait event that dominates the spike: wait/io/aurora_redo_log_flush.

As in the previous example, you can select the wait event name to see a simple description of what’s going on, and you can select the SQL hash to see the actual statement that is slow. Furthermore, just as before, the View troubleshooting doc link references the document that describes what you can do to troubleshoot the problem further and address it.

Scenario 3: Missing index

Have you ever wondered what would happen if you drop a frequently accessed index on a large table?

In this relatively simple scenario, we’re testing exactly that – an index gets dropped causing queries to switch from fast index lookups to slow full table scans, thus dramatically increasing latency and resource use.

Here’s how you can reproduce this problem and see it for yourself:

  1. On your Cloud9 terminal, run the following commands:
cd amazon-devops-guru-rds/scripts
source ./connect.sh test 3
exit;
python3 no_index.py 50 1200 1000 10000000
  1. Once the program completes its execution, or after an hour, navigate to the Amazon DevOps Guru console, choose Insights, and then choose RDS DB Load Anomalous. Then choose Recommendations and choose View detailed analysis for database performance anomaly for the following resources. Under To view a detailed analysis, choose a resource name, choose the database associated with the third test.

Shows the detailed analysis of the database performance anomaly. The database experiencing load is chosen and a graphical representation of how the Average active sessions (AAS) spikes which Amazon DevOps Guru is able to identify.

  1. Observe the recommendations under Analysis and recommendations. It provides you with analysis, recommendations, and links to troubleshooting documentation.

Shows a different section of the detailed analysis screen that provides Analysis and recommendations and links to the troubleshooting documentation.

As with the previous examples, DevOps Guru for RDS detected a high and unusual spike of database load (in this case, ~ 50 times larger than the “typical” database load). It also identified that a single wait event, wait/io/table/sql/handler, and a single SQL, are responsible for this issue.

The analysis highlights the SQL that you must pay attention to, and it links a detailed troubleshooting document that lists the likely causes and recommended actions for the problems that you see. While it doesn’t tell you that the “missing index” is the real root cause of the issue (this is planned in future versions), it does offer many relevant details that can help you come to that conclusion yourself.
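
One way to reach that conclusion yourself, sketched here beyond what the DevOps Guru console shows, is to run EXPLAIN on the slow SQL surfaced in the insight and then recreate the index that the other test databases use (the id value below is illustrative):

EXPLAIN SELECT * FROM test1 WHERE id = 1000;  -- type: ALL indicates a full table scan
CREATE UNIQUE INDEX test1_pk ON test1(id);    -- the same index created for tests 1 and 2
EXPLAIN SELECT * FROM test1 WHERE id = 1000;  -- should now show an index lookup (type: const)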

Cleanup

On your terminal where you originally ran the AWS Command Line Interface (AWS CLI) command to create the CloudFormation resources, run the following command:

aws cloudformation delete-stack --stack-name DevOpsGuru-Stack

Conclusion

In this post, you learned how to leverage DevOps Guru for RDS to alert you of any operational issues with recommendations. You simulated some of the commonly encountered, real-world production issues, such as locking contentions, AUTOCOMMIT, and missing indexes. Moreover, you saw how DevOps Guru for RDS helped you detect and resolve these issues. Try this out, and let us know how DevOps Guru for RDS was able to address your use-case.

Authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.

Simsek Mert

Simsek Mert is a Cloud Application Architect with AWS Professional Services.
Simsek helps customers with their application architecture, containers, serverless applications, leveraging his over 20 years of experience.

Maxym Kharchenko

Maxym Kharchenko is a Principal Database Engineer at AWS. He builds automated monitoring tools that use machine learning to discover and explain performance problems in relational databases.

Jared Keating

Jared Keating is a Senior Cloud Consultant with Amazon Web Services Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements drawing from his over 20 years of experience in IT.