All posts by Luis Gerardo Baeza

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

Post Syndicated from Luis Gerardo Baeza original https://aws.amazon.com/blogs/big-data/automate-replication-of-relational-sources-into-a-transactional-data-lake-with-apache-iceberg-and-aws-glue/

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all the data generated by their different teams, across business domains and formats, and even over its full history. According to one study, the average company sees its data volume grow at a rate that exceeds 50% per year, and typically manages an average of 33 unique data sources for analysis.

Teams often end up replicating thousands of jobs from relational databases with the same extract, transform, and load (ETL) pattern, and maintaining the state of and scheduling all of these individual jobs takes considerable effort. The approach described in this post lets teams add tables with few changes and maintain job status with minimal effort, which can lead to a big improvement in the development timeline and make the jobs easy to track.

In this post, we show you how to easily replicate all your relational data stores into a transactional data lake in an automated fashion with a single ETL job using Apache Iceberg and AWS Glue.

Solution architecture

Data lakes are usually organized using separate S3 buckets for three layers of data: the raw layer containing data in its original form, the stage layer containing intermediate processed data optimized for consumption, and the analytics layer containing aggregated data for specific use cases. In the raw layer, tables usually are organized based on their data sources, whereas tables in the stage layer are organized based on the business domains they belong to.

This post provides an AWS CloudFormation template that deploys an AWS Glue job that reads an Amazon S3 path for one data source of the data lake raw layer, and ingests the data into Apache Iceberg tables on the stage layer using AWS Glue support for data lake frameworks. The job expects tables in the raw layer to be structured in the way AWS Database Migration Service (AWS DMS) ingests them: schema, then table, then data files.

This solution uses AWS Systems Manager Parameter Store for table configuration. You should modify this parameter to specify the tables you want to process and how to process them, including information such as the primary key, partition columns, and the associated business domain. The job uses this information to automatically create a database (if it doesn’t already exist) for every business domain, create the Iceberg tables, and perform the data loading.

Finally, we can use Amazon Athena to query the data in the Iceberg tables.

The following diagram illustrates this architecture.

Solution architecture

This implementation has the following considerations:

  • All tables from the data source must have a primary key to be replicated using this solution. The primary key can be a single column or a composite key with more than one column.
  • If the data lake contains tables that don’t need upserts or don’t have a primary key, you can exclude them from the parameter configuration and implement traditional ETL processes to ingest them into the data lake. That’s outside of the scope of this post.
  • If there are additional data sources that need to be ingested, you can deploy multiple CloudFormation stacks, one to handle each data source.
  • The AWS Glue job is designed to process data in two phases: the initial load that runs after AWS DMS finishes the full load task, and the incremental load that runs on a schedule that applies change data capture (CDC) files captured by AWS DMS. Incremental processing is performed using an AWS Glue job bookmark.

There are nine steps to complete this tutorial:

  1. Set up a source endpoint for AWS DMS.
  2. Deploy the solution using AWS CloudFormation.
  3. Review the AWS DMS replication task.
  4. Optionally, add permissions for encryption and decryption or AWS Lake Formation.
  5. Review the table configuration on Parameter Store.
  6. Perform initial data loading.
  7. Perform incremental data loading.
  8. Monitor table ingestion.
  9. Schedule incremental batch data loading.

Prerequisites

Before starting this tutorial, you should already be familiar with Iceberg. If you’re not, you can get started by replicating a single table following the instructions in Implement a CDC-based UPSERT in a data lake using Apache Iceberg and AWS Glue. Additionally, set up the following:

Set up a source endpoint for AWS DMS

Before we create our AWS DMS task, we need to set up a source endpoint to connect to the source database:

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Choose Create endpoint.
  3. If your database is running on Amazon RDS, choose Select RDS DB instance, then choose the instance from the list. Otherwise, choose the source engine and provide the connection information either through AWS Secrets Manager or manually.
  4. For Endpoint identifier, enter a name for the endpoint; for example, source-postgresql.
  5. Choose Create endpoint.
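If you prefer to script this step, a minimal boto3 sketch that creates an equivalent PostgreSQL source endpoint might look like the following (hostname, credentials, and database name are placeholders you must replace):

import boto3

dms = boto3.client('dms')

response = dms.create_endpoint(
    EndpointIdentifier='source-postgresql',  # same name used in the console steps
    EndpointType='source',
    EngineName='postgres',
    ServerName='your-db-hostname',           # placeholder: your database endpoint
    Port=5432,
    DatabaseName='dms_sample',               # placeholder: your source database
    Username='your-username',                # placeholder
    Password='your-password',                # placeholder
)
print(response['Endpoint']['EndpointArn'])   # you need this ARN for the CloudFormation stack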

Deploy the solution using AWS CloudFormation

Create a CloudFormation stack using the provided template. Complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. Provide a stack name, such as transactionaldl-postgresql.
  4. Enter the required parameters:
    1. DMSS3EndpointIAMRoleARN – The IAM role ARN for AWS DMS to write data into Amazon S3.
    2. ReplicationInstanceArn – The AWS DMS replication instance ARN.
    3. S3BucketStage – The name of the existing bucket used for the stage layer of the data lake.
    4. S3BucketGlue – The name of the existing S3 bucket for storing AWS Glue scripts.
    5. S3BucketRaw – The name of the existing bucket used for the raw layer of the data lake.
    6. SourceEndpointArn – The AWS DMS endpoint ARN that you created earlier.
    7. SourceName – The arbitrary identifier of the data source to replicate (for example, postgres). This is used to define the S3 path of the data lake (raw layer) where data will be stored.
  5. Do not modify the following parameters:
    1. SourceS3BucketBlog – The bucket name where the provided AWS Glue script is stored.
    2. SourceS3BucketPrefix – The bucket prefix name where the provided AWS Glue script is stored.
  6. Choose Next twice.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.

After approximately 5 minutes, the CloudFormation stack is deployed.
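If you would rather deploy the stack with the AWS SDK instead of the Launch Stack button, a hedged boto3 sketch passing the same parameters might look like the following (the template URL and all ARNs and bucket names are placeholders):

import boto3

cfn = boto3.client('cloudformation')

cfn.create_stack(
    StackName='transactionaldl-postgresql',
    TemplateURL='https://YOUR-TEMPLATE-BUCKET.s3.amazonaws.com/transactionaldl.yaml',  # placeholder
    Capabilities=['CAPABILITY_NAMED_IAM'],
    Parameters=[
        {'ParameterKey': 'DMSS3EndpointIAMRoleARN', 'ParameterValue': 'arn:aws:iam::111122223333:role/dms-s3-role'},
        {'ParameterKey': 'ReplicationInstanceArn', 'ParameterValue': 'arn:aws:dms:us-east-1:111122223333:rep:EXAMPLE'},
        {'ParameterKey': 'S3BucketStage', 'ParameterValue': 'my-stage-bucket'},
        {'ParameterKey': 'S3BucketGlue', 'ParameterValue': 'my-glue-scripts-bucket'},
        {'ParameterKey': 'S3BucketRaw', 'ParameterValue': 'my-raw-bucket'},
        {'ParameterKey': 'SourceEndpointArn', 'ParameterValue': 'arn:aws:dms:us-east-1:111122223333:endpoint:EXAMPLE'},
        {'ParameterKey': 'SourceName', 'ParameterValue': 'postgres'},
    ],
)
cfn.get_waiter('stack_create_complete').wait(StackName='transactionaldl-postgresql')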

Review the AWS DMS replication task

The AWS CloudFormation deployment created an AWS DMS target endpoint for you. Two specific endpoint settings ensure the data is ingested into Amazon S3 the way we need it.

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Search for and choose the endpoint that begins with dmsIcebergs3endpoint.
  3. Review the endpoint settings:
    1. DataFormat is specified as parquet.
    2. TimestampColumnName will add the column last_update_time with the date of creation of the records on Amazon S3.

AWS DMS endpoint settings
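The two settings called out above correspond to the S3Settings of the DMS target endpoint. The CloudFormation stack already creates this endpoint for you, so the following boto3 sketch is illustrative only (the role ARN and bucket name are placeholders):

import boto3

dms = boto3.client('dms')

dms.create_endpoint(
    EndpointIdentifier='dmsIcebergs3endpoint-example',  # the stack creates a similarly named endpoint
    EndpointType='target',
    EngineName='s3',
    S3Settings={
        'ServiceAccessRoleArn': 'arn:aws:iam::111122223333:role/dms-s3-role',  # placeholder
        'BucketName': 'my-raw-bucket',               # raw layer bucket
        'BucketFolder': 'postgres',                  # the SourceName prefix
        'DataFormat': 'parquet',                     # setting highlighted above
        'TimestampColumnName': 'last_update_time',   # adds the capture timestamp column
    },
)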

The deployment also creates an AWS DMS replication task that begins with dmsicebergtask.

  1. Choose Replication tasks in the navigation pane and search for the task.

You will see that the Task Type is marked as Full load, ongoing replication. AWS DMS will perform an initial full load of existing data, and then create incremental files with changes performed to the source database.

On the Mapping Rules tab, there are two types of rules:

  • A selection rule with the name of the source schema and tables to ingest from the source database. By default, it uses the sample database provided in the prerequisites, dms_sample, and all of its tables via the % wildcard.
  • Two transformation rules that add the schema name and the table name as columns to the target files on Amazon S3. Our AWS Glue job uses these columns to know which table each file in the data lake corresponds to.

To learn more about how to customize this for your own data sources, refer to Selection rules and actions.

AWS mapping rules
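For a concrete picture, the two rule types described above correspond to a table-mapping document similar to the following sketch, expressed here as a Python dictionary. The rule IDs, column lengths, and dms_sample object locators are illustrative; adapt them to your own source. The add-column expressions use the AWS DMS metadata variables for the source schema and table names.

import json

table_mappings = {
    "rules": [
        {   # Selection rule: include every table in the dms_sample schema
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-dms-sample",
            "object-locator": {"schema-name": "dms_sample", "table-name": "%"},
            "rule-action": "include",
        },
        {   # Transformation rule: add the source schema name as a column in the target files
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "add-schema-name",
            "rule-target": "column",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "add-column",
            "value": "schema_name",
            "expression": "$AR_M_SOURCE_SCHEMA_NAME",
            "data-type": {"type": "string", "length": 128},
        },
        {   # Transformation rule: add the source table name as a column in the target files
            "rule-type": "transformation",
            "rule-id": "3",
            "rule-name": "add-table-name",
            "rule-target": "column",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "add-column",
            "value": "table_name",
            "expression": "$AR_M_SOURCE_TABLE_NAME",
            "data-type": {"type": "string", "length": 128},
        },
    ]
}
print(json.dumps(table_mappings, indent=2))  # paste into the task's table mappings (JSON editor)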

Let’s change some configurations to finish our task preparation.

  1. On the Actions menu, choose Modify.
  2. In the Task Settings section, under Stop task after full load completes, choose Stop after applying cached changes.

This way, we can control the initial load and incremental file generation as two different steps. We use this two-step approach to run the AWS Glue job once for each step.

  1. Under Task logs, choose Turn on CloudWatch logs.
  2. Choose Save.
  3. Wait about 1 minute for the database migration task status to show as Ready.

Add permissions for encryption and decryption or Lake Formation

Optionally, you can add permissions for encryption and decryption or Lake Formation.

Add encryption and decryption permissions

If your S3 buckets used for the raw and stage layers are encrypted using AWS Key Management Service (AWS KMS) customer managed keys, you need to add permissions to allow the AWS Glue job to access the data:

Add Lake Formation permissions

If you’re managing permissions using Lake Formation, you need to allow your AWS Glue job to create your domain’s databases and tables through the IAM role GlueJobRole.

  1. Grant permissions to create databases (for instructions, refer to Creating a Database).
  2. Grant SUPER permissions to the default database.
  3. Grant data location permissions.
  4. If you create databases manually, grant permissions on all databases to create tables. Refer to Granting table permissions using the Lake Formation console and the named resource method or Granting Data Catalog permissions using the LF-TBAC method according to your use case.

After you perform the initial data load later in this post, make sure to also add permissions for consumers to query the tables. The job role becomes the owner of all the tables it creates, and the data lake admin can then grant access to additional users.

Review table configuration in Parameter Store

The AWS Glue job that performs the data ingestion into Iceberg tables uses the table specification provided in Parameter Store. Complete the following steps to review the parameters that were configured automatically for you, and modify them if needed.

  1. On the Parameter Store console, choose My parameters in the navigation pane.

The CloudFormation stack created two parameters:

  • iceberg-config for job configurations
  • iceberg-tables for table configuration
  1. Choose the parameter iceberg-tables.

The JSON structure contains information that AWS Glue uses to read data and write the Iceberg tables on the target domain:

  • One object per table – The name of the object is created using the schema name, a period, and the table name; for example, schema.table.
  • primaryKey – This should be specified for every source table. You can provide a single column or a comma-separated list of columns (without spaces).
  • partitionCols – Optionally, the partition columns for the target tables. If you don’t want to create partitioned tables, provide an empty string. Otherwise, provide a single column or a comma-separated list of columns to be used (without spaces).
  1. If you want to use your own data source, use the following JSON code and replace the text in CAPS from the template provided. If you’re using the sample data source provided, keep the default settings:
{
    "SCHEMA_NAME.TABLE_NAME_1": {
        "primaryKey": "ONLY_PRIMARY_KEY",
        "domain": "TARGET_DOMAIN",
        "partitionCols": ""
    },
    "SCHEMA_NAME.TABLE_NAME_2": {
        "primaryKey": "FIRST_PRIMARY_KEY,SECOND_PRIMARY_KEY",
        "domain": "TARGET_DOMAIN",
        "partitionCols": "PARTITION_COLUMN_ONE,PARTITION_COLUMN_TWO"
    }
}
  1. Choose Save changes.
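If you prefer to update the table configuration programmatically rather than through the console, a minimal boto3 sketch might look like the following. The parameter name follows the <stack-name>-iceberg-tables convention the job expects, and the table entry below is a placeholder:

import json
import boto3

ssm = boto3.client('ssm')

tables_config = {
    "dms_sample.sporting_event": {
        "primaryKey": "id",        # placeholder: use your table's real primary key
        "domain": "sports",
        "partitionCols": ""
    }
}

ssm.put_parameter(
    Name='transactionaldl-postgresql-iceberg-tables',  # <stack-name>-iceberg-tables
    Value=json.dumps(tables_config),
    Type='String',
    Overwrite=True,
)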

Perform initial data loading

Now that the required configuration is finished, we ingest the initial data. This step includes three parts: ingesting the data from the source relational database into the raw layer of the data lake, creating the Iceberg tables on the stage layer of the data lake, and verifying results using Athena.

Ingest data into the raw layer of the data lake

To ingest data from the relational data source (PostgreSQL if you are using the sample provided) to our transactional data lake using Iceberg, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the replication task you created and on the Actions menu, choose Restart/Resume.
  3. Wait about 5 minutes for the replication task to complete. You can monitor the tables ingested on the Statistics tab of the replication task.

AWS DMS full load statistics

After some minutes, the task finishes with the message Full load complete.

  1. On the Amazon S3 console, choose the bucket you defined as the raw layer.

Under the S3 prefix defined on AWS DMS (for example, postgres), you should see a hierarchy of folders with the following structure:

  • Schema
    • Table name
      • LOAD00000001.parquet
      • LOAD0000000N.parquet

AWS DMS full load objects created on S3

If your S3 bucket is empty, review Troubleshooting migration tasks in AWS Database Migration Service before running the AWS Glue job.

Create and ingest data into Iceberg tables

Before running the job, let’s navigate the script of the AWS Glue job provided as part of the CloudFormation stack to understand its behavior.

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Search for the job that starts with IcebergJob- and a suffix of your CloudFormation stack name (for example, IcebergJob-transactionaldl-postgresql).
  3. Choose the job.

AWS Glue ETL job review

The job script gets the configuration it needs from Parameter Store. The function getConfigFromSSM() returns job-related configurations such as the source and target buckets from which the data is read and to which it is written. The variable ssm_param_table_values contains table-related information such as the data domain, table name, partition columns, and primary key of the tables that need to be ingested. See the following Python code:

# Main application
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'stackName'])
SSM_PARAMETER_NAME = f"{args['stackName']}-iceberg-config"
SSM_TABLE_PARAMETER_NAME = f"{args['stackName']}-iceberg-tables"

# Parameters for job
rawS3BucketName, rawBucketPrefix, stageS3BucketName, warehouse_path = getConfigFromSSM(SSM_PARAMETER_NAME)
ssm_param_table_values = json.loads(ssmClient.get_parameter(Name = SSM_TABLE_PARAMETER_NAME)['Parameter']['Value'])
dropColumnList = ['db','table_name', 'schema_name','Op', 'last_update_time', 'max_op_date']
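The getConfigFromSSM() helper itself isn’t reproduced in the post. A minimal sketch of what it might look like, assuming the job configuration parameter stores a JSON object whose keys map to the four values the script unpacks (the key names below are assumptions):

import json
import boto3

ssmClient = boto3.client('ssm')

def getConfigFromSSM(parameter_name):
    # Read the job configuration JSON from Parameter Store and return the values the job unpacks.
    config = json.loads(ssmClient.get_parameter(Name=parameter_name)['Parameter']['Value'])
    return (
        config['rawS3BucketName'],     # assumed key names; the deployed parameter may differ
        config['rawBucketPrefix'],
        config['stageS3BucketName'],
        config['warehousePath'],
    )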

The script uses an arbitrary catalog name for Iceberg that is defined as my_catalog. This is implemented on the AWS Glue Data Catalog using Spark configurations, so a SQL operation pointing to my_catalog will be applied on the Data Catalog. See the following code:

catalog_name = 'my_catalog'
errored_table_list = []

# Iceberg configuration
spark = SparkSession.builder \
    .config('spark.sql.warehouse.dir', warehouse_path) \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.warehouse', warehouse_path) \
    .config(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()
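Continuing from the configuration above, any Spark SQL statement that qualifies a table name with my_catalog resolves it through the Data Catalog. For example (the sports database and sporting_event table come from the sample data source):

# Create a domain database in the Data Catalog if it doesn't exist yet.
spark.sql(f'CREATE DATABASE IF NOT EXISTS {catalog_name}.sports')

# Query an Iceberg table registered in the Data Catalog through the configured catalog.
spark.sql(f'SELECT * FROM {catalog_name}.sports.sporting_event LIMIT 10').show()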

The script iterates over the tables defined in Parameter Store and performs the logic for detecting if the table exists and if the incoming data is an initial load or an upsert:

# Iteration over tables stored on Parameter Store
for key in ssm_param_table_values:
    # Get table data
    isTableExists = False
    schemaName, tableName = key.split('.')
    logger.info(f'Processing table : {tableName}')

The initialLoadRecordsSparkSQL() function loads initial data when no operation column is present in the S3 files. AWS DMS adds this column only to Parquet data files produced by the continuous replication (CDC). The data loading is performed using the INSERT INTO command with SparkSQL. See the following code:

sqltemp = Template("""
    INSERT INTO $catalog_name.$dbName.$tableName  ($insertTableColumnList)
    SELECT $insertTableColumnList FROM insertTable $partitionStrSQL
""")
SQLQUERY = sqltemp.substitute(
    catalog_name = catalog_name, 
    dbName = dbName, 
    tableName = tableName,
    insertTableColumnList = insertTableColumnList[ : -1],
    partitionStrSQL = partitionStrSQL)

logger.info(f'****SQL QUERY IS : {SQLQUERY}')
spark.sql(SQLQUERY)
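The job decides between the initial-load and upsert paths based on whether the Op column is present, as described above. A minimal sketch of that check (the helper name is hypothetical; the provided script may implement it inline):

from pyspark.sql import DataFrame

def isInitialLoad(df: DataFrame) -> bool:
    # AWS DMS adds the Op column only to CDC files, so its absence signals a full-load file.
    return 'Op' not in df.columns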

Now we run the AWS Glue job to ingest the initial data into the Iceberg tables. The CloudFormation stack adds the --datalake-formats parameter, adding the required Iceberg libraries to the job.

  1. Choose Run job.
  2. Choose Job Runs to monitor the status. Wait until the status is Run Succeeded.

Verify the data loaded

To confirm that the job processed the data as expected, complete the following steps:

  1. On the Athena console, choose Query Editor in the navigation pane.
  2. Verify AwsDataCatalog is selected as the data source.
  3. Under Database, choose the data domain that you want to explore, based on the configuration you defined in the parameter store. If using the sample database provided, use sports.

Under Tables and views, we can see the list of tables that were created by the AWS Glue job.

  1. Choose the options menu (three dots) next to the first table name, then choose Preview Data.

You can see the data loaded into Iceberg tables.

Amazon Athena review initial data loaded

Perform incremental data loading

Now we start capturing changes from our relational database and applying them to the transactional data lake. This step is also divided into three parts: capturing the changes, applying them to the Iceberg tables, and verifying the results.

Capture changes from the relational database

Due to the configuration we specified, the replication task stopped after running the full load phase. Now we restart the task to add incremental files with changes into the raw layer of the data lake.

  1. On the AWS DMS console, select the task we created and ran before.
  2. On the Actions menu, choose Resume.
  3. Choose Start task to start capturing changes.
  4. To trigger new file creation on the data lake, perform inserts, updates, or deletes on the tables of your source database using your preferred database administration tool. If using the sample database provided, you could run the following SQL commands:
UPDATE dms_sample.nfl_stadium_data_upd
SET seatin_capacity=93703
WHERE team = 'Los Angeles Rams' and sport_location_id = '31';

update  dms_sample.mlb_data 
set bats = 'R'
where mlb_id=506560 and bats='L';

update dms_sample.sporting_event 
set start_date  = current_date 
where id=11 and sold_out=0;
  1. On the AWS DMS task details page, choose the Table statistics tab to see the changes captured.
    AWS DMS CDC statistics
  2. Open the raw layer of the data lake to find a new file holding the incremental changes inside every table’s prefix, for example under the sporting_event prefix.

The record with changes for the sporting_event table looks like the following screenshot.

AWS DMS objects migrated into S3 with CDC

Notice the Op column at the beginning, which identifies the record as an update (U). Also, the second date/time value is the control column added by AWS DMS with the time the change was captured.

CDC file schema on Amazon S3

Apply changes on the Iceberg tables using AWS Glue

Now we run the AWS Glue job again, and it will automatically process only the new incremental files since the job bookmark is enabled. Let’s review how it works.

The dedupCDCRecords() function performs deduplication of data because multiple changes to a single record ID could be captured within the same data file on Amazon S3. Deduplication is performed based on the last_update_time column added by AWS DMS that indicates the timestamp of when the change was captured. See the following Python code:

def dedupCDCRecords(inputDf, keylist):
    IDWindowDF = Window.partitionBy(*keylist).orderBy(inputDf.last_update_time).rangeBetween(-sys.maxsize, sys.maxsize)
    inputDFWithTS = inputDf.withColumn('max_op_date', max(inputDf.last_update_time).over(IDWindowDF))
    
    NewInsertsDF = inputDFWithTS.filter('last_update_time=max_op_date').filter("op='I'")
    UpdateDeleteDf = inputDFWithTS.filter('last_update_time=max_op_date').filter("op IN ('U','D')")
    finalInputDF = NewInsertsDF.unionAll(UpdateDeleteDf)

    return finalInputDF

On line 99, the upsertRecordsSparkSQL() function performs the upsert in a similar fashion to the initial load, but this time with a SQL MERGE command.
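The upsert function itself isn’t reproduced in the post, but a MERGE statement built in the same templated style might look like the following sketch. It assumes the deduplicated CDC records are registered as a temporary view named updateTable and shows a single-column primary key; the real script derives these pieces from the Parameter Store configuration.

from string import Template

sqltemp = Template("""
    MERGE INTO $catalog_name.$dbName.$tableName AS t
    USING updateTable AS s
    ON t.$primaryKey = s.$primaryKey
    WHEN MATCHED AND s.Op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
SQLQUERY = sqltemp.substitute(
    catalog_name = catalog_name,
    dbName = dbName,
    tableName = tableName,
    primaryKey = 'id')   # single-column key shown for brevity

logger.info(f'****SQL QUERY IS : {SQLQUERY}')
spark.sql(SQLQUERY)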

Review the applied changes

Open the Athena console and run a query that selects the changed records on the source database. If using the provided sample database, run the following SQL query:

SELECT * FROM "sports"."nfl_stadiu_data_upd"
WHERE team = 'Los Angeles Rams' and sport_location_id = 31
LIMIT 1;

Amazon Athena review cdc data loaded

Monitor table ingestion

The AWS Glue job script includes simple Python exception handling to catch errors while processing a specific table. The job bookmark is saved after each table finishes processing successfully, so that when the job run is retried for the tables that had errors, the tables that succeeded aren't reprocessed.

The AWS Command Line Interface (AWS CLI) provides a get-job-bookmark command for AWS Glue that provides insight into the status of the bookmark for each table processed.

  1. On the AWS Glue Studio console, choose the ETL job.
  2. Choose the Job Runs tab and copy the job run ID.
  3. Run the following command on a terminal authenticated for the AWS CLI, replacing <GLUE_JOB_RUN_ID> on line 1 with the value you copied. If your CloudFormation stack is not named transactionaldl-postgresql, provide the name of your job on line 2 of the script:
jobrun=<GLUE_JOB_RUN_ID>
jobname=IcebergJob-transactionaldl-postgresql
aws glue get-job-bookmark --job-name $jobname --run-id $jobrun
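The same information is available through the AWS SDK; for example, a short boto3 sketch equivalent to the CLI command above:

import boto3

glue = boto3.client('glue')

bookmark = glue.get_job_bookmark(
    JobName='IcebergJob-transactionaldl-postgresql',  # replace if your stack has a different name
    RunId='<GLUE_JOB_RUN_ID>',                        # the job run ID you copied
)
print(bookmark['JobBookmarkEntry'])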

With this logic, when processing a table raises an exception, the AWS Glue job doesn't fail immediately. Instead, the table is added to an array that is printed after the job completes. In that scenario, the job is marked as failed only after it has tried to process the rest of the tables detected in the raw data source. This way, tables without errors don't have to wait until the user identifies and solves the problem with the conflicting tables. The user can quickly detect job runs that had issues using the AWS Glue job run status, and identify which specific tables caused the problem using the CloudWatch logs for the job run.

  1. The job script implements this feature with the following Python code:
# Performed for every table
        try:
            # Table processing logic
            ...
        except Exception as e:
            logger.info(f'There is an issue with table: {tableName}')
            logger.info(f'The exception is: {e}')
            errored_table_list.append(tableName)
            continue
        job.commit()

if len(errored_table_list):
    logger.info(f'Total number of errored tables: {len(errored_table_list)}')
    logger.info(f"Tables that failed during processing: {', '.join(errored_table_list)}")
    raise Exception('***** Some tables failed to process.')

The following screenshot shows how the CloudWatch logs look for tables that cause errors on processing.

AWS Glue job monitoring with logs

Aligned with the AWS Well-Architected Framework Data Analytics Lens practices, you can add more sophisticated control mechanisms to identify and notify stakeholders when errors appear in the data pipelines. For example, you can use an Amazon DynamoDB control table to store all tables and job runs with errors, or use Amazon Simple Notification Service (Amazon SNS) to send alerts to operators when certain criteria are met.

Schedule incremental batch data loading

The CloudFormation stack deploys an Amazon EventBridge rule (disabled by default) that can trigger the AWS Glue job to run on a schedule. To provide your own schedule and enable the rule, complete the following steps:

  1. On the EventBridge console, choose Rules in the navigation pane.
  2. Search for the rule prefixed with the name of your CloudFormation stack followed by JobTrigger (for example, transactionaldl-postgresql-JobTrigger-randomvalue).
  3. Choose the rule.
  4. Under Event Schedule, choose Edit.

The default schedule is configured to trigger every hour.

  1. Provide the schedule you want to run the job.
  2. Additionally, you can use an EventBridge cron expression by selecting A fine-grained schedule.
    Amazon EventBridge schedule ETL job
  3. When you finish setting up the cron expression, choose Next three times, and finally choose Update Rule to save changes.

The rule is created disabled by default to allow you to run the initial data load first.

  1. Activate the rule by choosing Enable.

You can view rule invocations on the Monitoring tab, or check the details of the AWS Glue job runs directly.
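If you prefer to manage the schedule as code instead of through the console, a minimal boto3 sketch that sets a cron expression on the rule and enables it might look like the following (the rule name is the placeholder generated for your stack):

import boto3

events = boto3.client('events')
rule_name = 'transactionaldl-postgresql-JobTrigger-randomvalue'  # use your rule's actual name

# Run the incremental load every day at 01:00 UTC; adjust the cron expression as needed.
events.put_rule(
    Name=rule_name,
    ScheduleExpression='cron(0 1 * * ? *)',
    State='ENABLED',
)

# Alternatively, keep the existing schedule and only enable the rule:
# events.enable_rule(Name=rule_name)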

Conclusion

After deploying this solution, you have automated the ingestion of your tables from a single relational data source. Organizations using a data lake as their central data platform usually need to handle multiple data sources, sometimes tens of them. Also, more and more use cases require organizations to implement transactional capabilities in the data lake. You can use this solution to accelerate the adoption of such capabilities across all your relational data sources to enable new business use cases, automating the implementation process to derive more value from your data.


About the Authors

Luis Gerardo Baeza is a Big Data Architect in the Amazon Web Services (AWS) Data Lab. He has 12 years of experience helping organizations in the healthcare, financial, and education sectors to adopt enterprise architecture programs, cloud computing, and data analytics capabilities. Luis currently helps organizations across Latin America to accelerate strategic data initiatives.

SaiKiran Reddy Aenugu is a Data Architect in the Amazon Web Services (AWS) Data Lab. He has 10 years of experience implementing data loading, transformation, and visualization processes. SaiKiran currently helps organizations in North America to adopt modern data architectures such as data lakes and data mesh. He has experience in the retail, airline, and finance sectors.

Narendra Merla is a Data Architect in the Amazon Web Services (AWS) Data Lab. He has 12 years of experience in designing and productionalizing both real-time and batch-oriented data pipelines and building data lakes on both cloud and on-premises environments. Narendra currently helps organizations in North America to build and design robust data architectures, and has experience in the telecom and finance sectors.

Power operational insights with Amazon QuickSight

Post Syndicated from Luis Gerardo Baeza original https://aws.amazon.com/blogs/big-data/power-operational-insights-with-amazon-quicksight/

Organizations need a consolidated view of their applications, but typically application health status is siloed: end-users complain on social media platforms, operational data from application logs is stored in complex monitoring tools, formal ticketing systems track reported issues, and synthetic monitoring data is only available to the tool administrators.

In this post, we show how to use Amazon QuickSight, AWS’ fully managed, cloud-native business intelligence service, to quickly build a dashboard that consolidates:

  • Operational data from application logs coming from Amazon CloudWatch.
  • Issues reported on Jira software as a service (SaaS) edition.
  • Public posts on Twitter.
  • Synthetic monitoring performed using a CloudWatch Synthetics canary.

This dashboard can provide a holistic view of the health status of a workload. For example, you can:

  • Trace end-users complaining on Twitter or creating issues on Jira back to the application logs where the error occurred.
  • Identify when customers are complaining about system performance or availability on social media.
  • Correlate reports with monitoring metrics (such as availability and latency).
  • Track down errors in application code using log information.
  • Prioritize issues already being addressed based on Jira information.

Solution overview

The following architecture for the solution consists of a subscription filter for CloudWatch Logs to continuously send the application logs to an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue crawler to update Amazon S3 log table metadata, a view on Amazon Athena that formats the data on the bucket, and three QuickSight datasets: Athena, Jira, and Twitter.
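The piece that continuously streams the application logs to the delivery stream is a CloudWatch Logs subscription filter, which the CloudFormation template used later in this post sets up for the sample log group. For reference, a hedged boto3 sketch of the equivalent call (all names and ARNs below are placeholders):

import boto3

logs = boto3.client('logs')

logs.put_subscription_filter(
    logGroupName='/insights/application-logs',   # placeholder: your application log group
    filterName='insights-to-firehose',
    filterPattern='',                            # empty pattern forwards every log event
    destinationArn='arn:aws:firehose:us-east-1:111122223333:deliverystream/insights-logs',  # placeholder
    roleArn='arn:aws:iam::111122223333:role/cwlogs-to-firehose-role',                       # placeholder
)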

To implement this solution, complete the following steps:

  1. Set up CloudWatch and AWS Glue resources.
  2. Set up a QuickSight dataset for Athena.
  3. Set up a QuickSight dataset for CloudWatch Synthetics.
  4. Set up a QuickSight dataset for Twitter.
  5. Set up a QuickSight dataset for Jira.
  6. Create a QuickSight overview analysis.
  7. Create a QuickSight detailed analysis.
  8. Publish your QuickSight dashboard.

Prerequisites

To get started, make sure you have the following prerequisites:

  • An AWS account.
  • Previous experience working with the AWS Management Console.
  • A Twitter account.
  • A Jira SaaS account. Make sure that the DNS name of your Jira Cloud is accessible to QuickSight.
  • Access to the Athena engine v2.

Set up CloudWatch and AWS Glue resources

Start by deploying a Lambda transformation function to use with your Amazon Kinesis Data Firehose delivery stream:

  1. On the Lambda console, launch a new function using the kinesis-firehose-cloudwatch-logs-processor blueprint.
  2. Enter a function name and choose Create function.
  3. Modify the transformLogEvent function in the Lambda code:
function transformLogEvent(logEvent) {
    // Emit each log event as "timestamp,message" terminated by a newline.
    return Promise.resolve(`${logEvent.timestamp},${logEvent.message}\n`);
}
  1. Choose Deploy to update the function.

As part of these steps, you create a new AWS Identity and Access Management (IAM) role with basic permissions; later, you attach an IAM policy created by AWS CloudFormation to it.

  1. Choose Copy ARN and save the ARN temporarily for later use.

To create the sample resources, complete the following steps:

  1. Choose Launch Stack:

  1. Choose Next.
  2. Enter a stack name.
  3. For TransformationLambdaArn, enter the function ARN you copied earlier.
  4. Choose Next twice, then acknowledge the message about IAM capabilities.
  5. Choose Create stack.

By default, when you launch the template, you’re taken to the AWS CloudFormation Events page. After 5 minutes, the stack launch is complete.

Test the Lambda function

We can test the function to write sample logs into the log group.

  1. On the Resources page of the AWS CloudFormation console search for LogGenerator.
  2. Choose the physical ID of the Lambda function.
  3. On the Lambda console, choose the function you created.
  4. Choose Test, enter sample for the Event name and leave the other configurations at their default.
  5. Choose Create.
  6. Choose Test again and you should receive the message “Logs generated.”
  7. On the Functions page of the Lambda console, choose the transformation function you created.

Change the AWS Lambda function configuration

  1. On the Configuration tab, choose General configuration.
  2. Choose Edit.
  3. For Memory, set to 256 MB.
  4. For Timeout, increase to 5 minutes.
  5. Choose Save.
  6. Choose Permissions, then choose the role name ID to open the IAM console.
  7. Choose Attach policies.
  8. Search for and select InsightsTransformationFunctionPolicy.
  9. Choose Attach policy.

The policy you attached allows your Lambda transformation function to put records into your Kinesis Data Firehose delivery stream.

Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. The Firehose delivery stream automatically partitions data by date.

  1. On the Amazon S3 console, locate and choose the bucket insightslogsbucket.
  2. Choose the cwlogs prefix and navigate through the partitions created by Kinesis Data Firehose (year/month/day/hour).

Make sure the bucket contains at least one file. (It may take up to 5 minutes to show because Kinesis Data Firehose buffers the data by default.)

To optimize the log storage, you can later enable record transformation to Parquet, a columnar data format. For more information, see Converting Input Record Format (Console).

Run the AWS Glue crawler

To complete this section, run the AWS Glue crawler to discover bucket metadata:

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler InsightsLogCrawler and choose Run crawler.

  1. Wait for the crawler to complete, then choose Tables.
  2. Choose the filter bar and choose the resource attribute Database.
  3. Enter insightsdb and press Enter.
  4. You should see a table with a CSV classification.
  5. On the Athena console, enter the following into the query editor:
select * from cwlogs limit 5;
  1. Choose Run query.

You should see a table like the following screenshot.

Set up a QuickSight dataset for Athena

If you haven’t signed up for a QuickSight subscription, do so before creating your dataset.

To use the table created in the AWS Glue Data Catalog, you have to authorize connections to Athena.

  1. On the QuickSight console, choose your QuickSight username and choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Choose Add or remove to set up access to Amazon S3.

  1. Choose your S3 bucket insightslogsbucket created by the CloudFormation template to allow QuickSight to access it.
  2. Choose the QuickSight logo to exit the management screen.

Add the Athena dataset

Now we can set up the Athena dataset.

  1. On the QuickSight console, choose Datasets in the navigation pane, then choose New dataset.
  2. Choose Athena from the available options.
  3. For Data source name, enter cwlogs.
  4. Leave the default workgroup selected (primary).
  5. Choose Create data source.
  6. Open the list of databases and choose insightsdb.
  7. Choose the cwlogs table and choose Use custom SQL.
  8. Replace New custom SQL with cwlogs and enter the following SQL code:
SELECT col1 as logmessage, col0 as datetime FROM "insightsdb"."cwlogs"
  1. Choose Confirm query.

You receive a confirmation of the dataset creation.

  1. Choose Edit/Preview Data.
  2. For Dataset name, enter cwlogs.

Separate severity levels

The logs contain a severity level (INFO, WARN, ERRR) embedded in the message. To enable analysis of the logs based on this value, you separate the severity level from the message using QuickSight calculated fields.

  1. Choose Dataset below the query editor and choose the datetime field.
  2. Change the type to Date.

  1. Open the Fields panel on the left and choose Add calculated field.
  2. For Add name, enter level.
  3. Enter the following code:
substring(logmessage,1,4)
  1. Choose Save.
  2. Choose Add calculated field.
  3. For Add name, enter message.
  4. Enter the following code:
replace(logmessage,concat(level," - "),"")
  1. Choose Save.
  2. Choose the options icon (three dots) next to the logmessage field and choose Exclude field.

The final dataset should be similar to the following screenshot.

  1. Choose Save.

Set up a QuickSight dataset for CloudWatch Synthetics

You add Synthetics monitoring metrics to your QuickSight dashboard to gain visibility into the availability of your website. For this, you use the Athena CloudWatch connector.

Create a CloudWatch Synthetics canary

To create a CloudWatch Synthetics canary that monitors your website using heartbeats (sample requests) to test availability, complete the following steps:

  1. On the CloudWatch Synthetics console, choose Create canary.
  2. Make sure the blueprint Heartbeat monitoring is selected.
  3. For Name, enter webstatus.
  4. For Application or endpoint URL, enter your website’s URL.
  5. Under Schedule, enter a frequency that works best for you (1–60 minutes).

The default setup is every 5 minutes.

  1. Enter an S3 location to store artifacts.
  2. Choose Create canary.

Set up the Athena CloudWatch connector

CloudWatch Synthetics sends availability data from the canary to CloudWatch Metrics. To query the metrics, you can use the Athena CloudWatch Metrics connector.

  1. On the Athena console, choose Data sources.
  2. Choose Connect data source and choose Query a data source.
  3. Choose Amazon CloudWatch Metrics and choose Next.
  4. Choose Configure new AWS Lambda function.

 A new tab opens in the browser, which you return to after deploying the Lambda function.

  1. For SpillBucket, enter the name of your S3 bucket insightslogsbucket created by the CloudFormation template you deployed.
  2. For AthenaCatalogName, enter cwmetrics.
  3. Select I acknowledge that this app creates custom IAM roles.
  4. Choose Deploy.
  5. Close the browser tab and go to the Athena tab you were on before.
  6. Refresh the list of Lambda functions by choosing the refresh icon.
  7. Under Lambda function, choose the Lambda function you just created.
  8. For Catalog name, enter cwmetrics.
  9. Choose Connect.

Set up permissions for QuickSight to use the connector

The CloudWatch connector runs on Lambda, which uses a spill bucket to handle large queries, so QuickSight needs permission to invoke the Lambda function and write to the spill bucket. For more information, see the GitHub repo. Let’s set up permissions to allow QuickSight to use the CloudWatch connector.

  1. On the QuickSight console admin page, choose Security & permissions.
  2. Choose Add or remove.
  3. Choose Athena.
  4. On the S3 tab, specify write permissions for your insightslogsbucket S3 bucket.
  5. On the Lambda tab, choose your Lambda function cwmetrics.

Create the new QuickSight dataset

Now we set up the QuickSight dataset for CloudWatch Synthetics.

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset and choose Athena.
  3. For Data source name, enter cwmetrics.
  4. Choose Create data source.
  5. Open the list of catalogs and choose cwmetrics.
  6. Choose the metric_samples table and choose Use custom SQL.
  7. For New custom SQL field, enter cwmetrics.
  8. Enter the following SQL code:
SELECT timestamp, value, dimension.dim_name, dimension.dim_value
    FROM cwmetrics.default.metric_samples CROSS JOIN UNNEST(dimensions) as  t(dimension)
    WHERE namespace='CloudWatchSynthetics' AND metric_name='SuccessPercent'
        AND statistic='Average' and dim_name='StepName' 
        AND dimension.dim_name='CanaryName' AND dimension.dim_value='webstatus'
        AND timestamp BETWEEN To_unixtime(Now() - INTERVAL '7' DAY) 
        AND To_unixtime(Now())
  1. Choose Confirm query.

You receive a confirmation of the dataset creation.

  1. Choose Edit/Preview data.
  2. Choose Dataset below the query editor and choose the timestamp field.
  3. Change the type to Date.
  4. Choose Save.

Set up a QuickSight dataset for Twitter

To set up the Twitter dataset, complete the following steps:

  1. On the QuickSight console, create a new dataset with Twitter as the source.
  2. For Data source name, enter twitterds.
  3. For Query, enter a hashtag or keyword to analyze from Twitter posts.
  4. Choose Create data source.

A new window opens requesting you to give QuickSight OAuth authorization for Twitter.

  1. Sign in to Twitter and choose Authorize application.

  1. Choose the table Twitt, then choose Edit/Preview data.

It might take a couple of minutes for the records to be imported into SPICE, the QuickSight Super-fast, Parallel, In-memory Calculation Engine, which is engineered to rapidly perform advanced calculations and serve data. You can continue with the tutorial while SPICE finishes in the background.

Let’s create a calculated field that classifies tweets as a Good experience or Bad experience by searching for the words “error,” “problem,” or “expensive.” You can choose other words that fit your use case.

  1. For Dataset name, enter twitterds.
  2. Choose Add calculated field.
  3. For Add name, enter Experience.
  4. Enter the following code:
ifelse(locate(toLower({Text}),"error")<>0,"BAD",locate(toLower({Text}),"problem")<>0,"BAD",locate(toLower({Text}),"expensive")<>0,"BAD","GOOD")
  1. Choose Save and then choose Save & visualize.

The Twitter Standard Search API returns data for 7 days only. For more information, see Supported Data Sources.

Set up a QuickSight dataset for Jira Cloud

QuickSight can connect to SaaS data sources, including Jira Cloud. To set up the Jira dataset, complete the following steps:

  1. To create an API token, open the Jira website within an authenticated browser.
  2. Choose Create API token.
  3. For Label, enter QuickSight.
  4. Choose Create.

  1. Create a new dataset in QuickSight and choose Jira as the data source.
  2. For Data source name, enter jirads.
  3. For Site base URL, enter the URL you use to access Jira.
  4. For Username, enter your Jira username.
  5. For API token or password, enter the token you created.
  6. Choose Create and choose the Issues  table.
  7. Choose Select and then choose Visualize.

You’re redirected to a newly created QuickSight analysis.

Create a QuickSight overview analysis

We use the previously created analysis as a starting point for our overview analysis.

  1. Choose the Edit data icon (pencil).

  1. Choose Add dataset.
  2. Choose the Issues dataset and choose Select.
  3. Repeat the steps for the twitterds, cwlogs, and cwmetrics datasets.

You should see the four datasets added to the QuickSight visual.

In the navigation pane, you can find the available fields for the selected dataset and on the right, the visuals. Each visual is tied to a particular dataset. When the tutorial instructs you to create a new visual from a dataset, you must choose the dataset from the list first, then choose + Add.

  1. Choose the visual, then the options icon (three dots).
  2. Choose Delete.

  1. Update the QuickSight analysis name to Operational Dashboard.

Add a visual for the Twitter dataset

To add your first visual, complete the following steps:

  1. Add a new visual from the Twitter dataset.
  2. Change the visual type to KPI.
  3. Choose the field RetweetCount.
  4. Name the visual Bad experience retweet count.
  5. Use the resize button to set the size to a fourth of the screen width.

  1. In the navigation pane, choose Filter and then choose Create one.
  2. Choose the Created field and then choose the filter.
  3. Change the filter type to relative dates.
  4. Choose Last N hours and enter 24 for Number of hours.
  5. Choose Apply.

This filter allows us to see only the most up-to-date information from the last 24 hours.

  1. Add a second filter and choose the Experience field.
  2. Leave BAD selected and choose Apply.

Now you only see information about Twitter customers with a bad experience.

Add a visual for CloudWatch metrics

Next, we add a new visual for our CloudWatch metrics.

  1. Choose Visualize and then choose the cwmetrics dataset.
  2. Add a new gauge chart.
  3. Choose Field wells to open the visual field configuration.
  4. Choose Value and change Aggregate to Average.

  1. Drag and drop the value field from the field list into Target value and change the aggregate to
  2. Name the visual System health status.

  1. Similar to what you did on the previous visual, add a filter to include only the last 24 hours based on the timestamp field.

Add a visual for CloudWatch error logs

Next, we create a visual for our CloudWatch logs dataset.

  1. Choose the cwlogs dataset and add a new KPI visual.
  2. Drag and drop the message field into the Value box.
  3. Name the visual Error log count.
  4. Create a filter using the level field, filtering on the list of values.
  5. Deselect all but ERRR.
  6. Choose Apply.

This filters to only the logs where an error was found.

  1. Create a filter for the last 24 hours.

Add a visual for Jira issues

Now we add a visual for open Jira issues.

  1. Choose the Issues dataset.
  2. Create a KPI visual using the Id field for Value.
  3. Add a filter for issues of type bug.
    1. Use the IssueType_Name field and select only the records with value Error.
  4. Add a filter for issues open.
    1. Use the Status_Name field and select only the records with the value On-going or To-do.
  5. Add a filter for the last 24 hours using the Date_Created field.
  6. Name the filter Bug Issues open.
  7. Resize the visuals and organize them as needed.

Complete the overview dashboard

We’re almost done with our overview dashboard.

  1. Create four new visuals as specified in the following table.
Dataset   | Visual type          | Field on X-axis                | Field on Value                | Field on Color
twitterds | Stacked line chart   | Created (aggregate: hour)      | RetweetCount (aggregate: sum) | –
cwmetrics | Stacked line chart   | timestamp (aggregate: hour)    | Value (aggregate: avg)        | –
cwlogs    | Stacked line chart   | datetime (aggregate: hour)     | message (aggregate: count)    | –
Issues    | Vertical stacked bar | Date_Created (aggregate: hour) | Id (aggregate: count)         | Status_Name
  1. Modify every filter you created to apply it to all the visuals.

  1. Choose the tab Sheet 1 twice to edit it.
  2. Enter Overview.
  3. Choose Enter.

The overview analysis of our application health dashboard is complete.

QuickSight provides a drill-up and drill-down feature to view data at different levels of a hierarchy; the feature is added automatically for date fields. For more information, see Adding Drill-Downs to Visual Data in Amazon QuickSight.

In the previous step, you applied a drill-down when you changed the aggregation to hour.

Create a QuickSight detailed analysis

To create a detailed analysis, we create new tabs for CloudWatch logs, Tweets, and Jira issues.

Create a tab for CloudWatch logs

To create a tab to analyze our CloudWatch logs, complete the following steps:

  1. Choose the add icon next to the Overview tab and name it Logs.
  2. Create three visuals from the cwlogs dataset:
    1. Donut chart with the level field for Group/Color and message (count) for Value.
    2. Stacked combo bar with the datetime (aggregate: hour) field for X axis box, level (count) for Bars, and level for Group/Color for bars.
    3. Table with the fields level, message, and datetime for Value.

To improve the analysis of log data, format the level field based on its content using QuickSight conditional formatting: red for ERRR and green for DEBG.

  1. Choose the table visual, choose the visual options icon (three dots), and then choose Conditional formatting.
  2. Choose the add icon and select the level field, then choose Add Text color.
  3. For Value, enter ERRR.
  4. For Color, choose red.
  5. Choose Add condition.
  6. For Value, enter DEBG.
  7. For Color, choose green.

  1. Choose Apply.
  2. Resize the visuals and update their titles as needed.

To enable data exploration, let’s set up an action on the Logs tab.

  1. Choose the Top Logs visual and choose Actions.
  2. Choose Filter same-sheet visuals.
  3. Choose ERRR on the donut chart.

The Latest Logs table now shows only the ERRR level rows.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Tweets

To create a tab for Twitter analysis and add visuals, complete the following steps:

  1. Choose the add icon next to the Logs tab and name the new tab Tweets.
  2. Delete the visual added and create a new donut chart from the twitterds dataset.
  3. Choose the field Source and name the visual Twit Source.
  4. Create a new word cloud visual and choose the Username field.

With QuickSight, you can exclude visual elements to focus the analysis on certain data. If you see a big “Other” username on the word cloud, choose it and then choose Hide “other” categories.

  1. To narrow down the elements on the word cloud to the top 50, choose the mesh icon.
  2. Under the Group by panel, enter 50 for Number of words.
  3. Choose the mesh icon again and choose Allow vertical words.
  4. Name the visual Top 50 Users.

Let’s create a table with the Twitter details.

  1. Add a new table visual.
  2. Drag and drop the field Text into the Group by box and RetweetCount into Value.
  3. Name the visual Top Retweet.
  4. Resize the columns on the table using the headers border as needed.
  5. To sort the table from the top retweeted posts, choose the header of the field RetweetCount and choose the sort descending icon.

Let’s add a color and an icon based on number of retweets.

  1. Choose the configuration icon (three dots) and choose Conditional formatting.
  2. Choose the RetweetCount field, then choose the add icon and choose the three bars icon set.

  1. Choose the Custom conditions option and enter the Value field as follows:
    1. Condition #1 – Value: 10000; color: red
    2. Condition #2 – Start Value: 2000; End Value: 10000; color: orange
    3. Condition #3 – Value: 2000; color: keep the default

Now you can see the field RetweetCount formatted with an icon and color based on the value.

Now we add the user location to the analysis.

  1. Add a new horizontal bar chart visual.
  2. Use the UserLocation field for Y axis and the RetweetCount as Value.
  3. Sort descending by RetweetCount.
  4. Choose the mesh icon to expand the Y-axis panel and enter 10 for Number of data points to show.
  5. If you see an empty country, choose it and choose Exclude empty.
  6. Name the visual Top 10 Locations.

To complete this tab of your analysis, resize the visuals and organize them as needed.

  1. Similarly, as you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one location or source and the Top users and Retweet tables are filtered.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Jira issues

Finally, we create a tab for Jira issue analysis.

  1. Choose the add icon next to the Tweets tab and name the new tab Issues.
  2. Delete the visual created and create a new horizontal stacked 100% bar chart visual from the Issues dataset.
  3. Drag and drop the fields as follows (because this dataset has many fields, you can find them using the Field list search bar):
    1. Y-axis – Status_Name
    2. Value – Id (count)
    3. Group/Color – Assigne_DisplayName

This visual shows you how issues have progressed among assignees.

  1. Add a new area line chart visual with the field Date_Updated for X axis and TimeEstimate for Value.
  2. Add another word cloud visual to find out who the top issue reporters are; use Reporter_DisplayName for Group by and Id (count) for Size.
  3. The last visual you add for this tab is a table; include all the fields you need to investigate in the Value box. I suggest you include Id, Key, Summary, Votes, WatchCount, Priority, and Reporter_DisplayName.
  4. Resize and rearrange the visuals as needed.

  1. As you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one reporter display name or a status and the other visuals are filtered.

  1. Create the filter for the last 24 hours applicable to all visuals from the dataset.

Publish your QuickSight dashboard

The analysis that you’ve been working on is automatically saved. To enable other people to view your findings with read-only capabilities, publish your analysis as a dashboard.

  1. Choose Share and choose Publish dashboard.
  2. Enter a name for your dashboard, such as holistic health status, and choose Publish dashboard.
  3. Optionally, select a person or group to share the dashboard with by entering a name in the search bar.
  4. Choose Share.

Your dashboard is now published and ready to use. You can easily correlate errors in application logs with posts on Twitter and availability data from your website, and quickly identify which errors are already being addressed based on open Jira bug issues.

By default, this dashboard can only be accessed by you, but you can share your dashboard with other people in your QuickSight account.

When you created the QuickSight datasets for Twitter and Jira, the data was automatically imported into SPICE, accelerating the time to query. You can also set up SPICE for other dataset types. Remember that data is imported into SPICE and must be refreshed.
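If you need to refresh SPICE data on your own schedule, you can also trigger a refresh programmatically. A hedged boto3 sketch (the account ID and dataset ID are placeholders):

import time
import boto3

quicksight = boto3.client('quicksight')

quicksight.create_ingestion(
    AwsAccountId='111122223333',                       # placeholder: your AWS account ID
    DataSetId='your-dataset-id',                       # placeholder: for example, the twitterds dataset ID
    IngestionId=f'manual-refresh-{int(time.time())}',  # must be unique for each ingestion
)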

Conclusion

In this post, you created a dashboard with a holistic view of your workload health status, including application logs, issue tracking on Jira, social media comments on Twitter, and monitoring data from CloudWatch Synthetics. To expand on this solution, you can include data from Amazon CloudFront logs or Application Load Balancer access logs so you have a complete view of your application. You could also easily embed your dashboard into a custom application.

You can also use machine learning to discover hidden data trends, saving hours of manual analysis with QuickSight ML Insights, or use QuickSight Q to power data discovery using natural language questions on your dashboards. Both features are ready to use in QuickSight without machine learning experience required.


About the author

Luis Gerardo Baeza is an Amazon Web Services solutions architect with 10 years of experience in business process transformation, enterprise architecture, agile methodologies adoption, and cloud technologies integration. Luis has worked with education, healthcare, and financial companies in México and Chile.