All posts by Virendhar Sivaraman

Monitor data pipelines in a serverless data lake

2023-08-09 Virendhar Sivaraman

Post Syndicated from Virendhar Sivaraman original https://aws.amazon.com/blogs/big-data/monitor-data-pipelines-in-a-serverless-data-lake/

AWS serverless services, including but not limited to AWS Lambda, AWS Glue, AWS Fargate, Amazon EventBridge, Amazon Athena, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3), have become the building blocks for any serverless data lake, providing key mechanisms to ingest and transform data without fixed provisioning and the persistent need to patch the underlying servers. Combining a data lake with a serverless paradigm brings significant cost and performance benefits. However, the rapid adoption of serverless data lake architectures, with ever-growing datasets that are ingested from a variety of sources and then run through complex data transformation and machine learning (ML) pipelines, can present a challenge. Similarly, in a serverless paradigm, application logs in Amazon CloudWatch are sourced from a variety of participating services, and traversing the lineage across logs can also present challenges. To successfully manage a serverless data lake, you require mechanisms to perform the following actions:

Reinforce data accuracy with every data ingestion
Holistically measure and analyze ETL (extract, transform, and load) performance at the individual processing component level
Proactively capture log messages and notify failures as they occur in near-real time

In this post, we walk you through a solution to efficiently track and analyze ETL jobs in a serverless data lake environment. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly, which helps ensure the overall health and reliability of data pipelines.

Overview of solution

The serverless monitoring solution focuses on achieving the following goals:

Capture state changes across all steps and tasks in the data lake
Measure service reliability across a data lake
Quickly notify operations of failures as they happen

To illustrate the solution, we create a serverless data lake with a monitoring solution. For simplicity, we create a serverless data lake with the following components:

Storage layer – Amazon S3 is the natural choice, in this case with the following buckets:
- Landing – Where raw data is stored
- Processed – Where transformed data is stored
Ingestion layer – For this post, we use Lambda and AWS Glue for data ingestion, with the following resources:
- Lambda functions – Two Lambda functions that run to simulate a success state and failure state, respectively
- AWS Glue crawlers – Two AWS Glue crawlers that run to simulate a success state and failure state, respectively
- AWS Glue jobs – Two AWS Glue jobs that run to simulate a success state and failure state, respectively
Reporting layer – An Athena database to persist the tables created via the AWS Glue crawlers and AWS Glue jobs
Alerting layer – Slack is used to notify stakeholders

The serverless monitoring solution is designed as loosely coupled, plug-and-play components that complement an existing data lake. State changes for the Lambda-based ETL tasks are tracked using AWS Lambda destinations, and an SNS topic routes both success and failure states for those tasks. For the AWS Glue-based tasks, EventBridge rules capture the state changes and route them to the same SNS topic. For demonstration purposes, this post only provides state monitoring for Lambda and AWS Glue, but you can extend the solution to other AWS services.
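To make the EventBridge side more concrete, the following is a minimal sketch of an event pattern that would select AWS Glue job and crawler state changes, together with a simplified matcher that shows how such a pattern is evaluated. The pattern is an assumption for illustration; the rule deployed by the stack may be narrower or broader, and the matcher covers only exact-value matching, not the full EventBridge pattern syntax.

```python
# Hypothetical event pattern for a rule such as glue-monitor-rule.
# The pattern actually deployed by the CDK stack may differ.
GLUE_STATE_CHANGE_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change", "Glue Crawler State Change"],
}


def matches(pattern, event):
    """Simplified EventBridge-style matching (exact values only).

    Every key in the pattern must be present in the event. A list in the
    pattern means "any of these values"; a dict means "match recursively".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Example event shaped like an AWS Glue job state change notification.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "glue-job-fail", "state": "FAILED"},
}
```

Calling matches(GLUE_STATE_CHANGE_PATTERN, sample_event) returns True, whereas an event from another source would not be selected.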

The following figure illustrates the architecture of the solution.

The architecture contains the following components:

EventBridge rules – Rules that capture state changes for the ETL tasks (in this case, AWS Glue tasks). This can be extended to other supported services as the data lake grows.
SNS topic – An SNS topic that serves to catch all state events from the data lake.
Lambda function – The Lambda function is the subscriber to the SNS topic. It’s responsible for analyzing the state of the task run to do the following:
- Persist the status of the task run.
- Notify any failures to a Slack channel.
Athena database – The database where the monitoring metrics are persisted for analysis.
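As a rough illustration of what the subscriber Lambda function has to do, the following sketch flattens the two kinds of messages that arrive on the SNS topic (EventBridge events for AWS Glue and Lambda destination records) into one record shape. The column names service_type and event_type come from the Athena query later in this post; the task_name field, the service_type values, and the function names are assumptions, and the function in the GitHub repo may be structured differently.

```python
import json


def normalize(message):
    """Map one decoded SNS message to a flat monitoring record."""
    # EventBridge events for AWS Glue carry the state under "detail".
    if message.get("source") == "aws.glue":
        detail = message.get("detail", {})
        is_job = "jobName" in detail
        return {
            "service_type": "glue_job" if is_job else "glue_crawler",
            "task_name": detail.get("jobName") or detail.get("crawlerName"),
            # Job states are upper case (FAILED), crawler states are title
            # case (Failed); lower-casing gives one spelling for both.
            "event_type": str(detail.get("state", "unknown")).lower(),
        }
    # Lambda destination records carry the outcome under "requestContext".
    if "requestContext" in message:
        context = message["requestContext"]
        arn = context.get("functionArn", "")
        succeeded = context.get("condition") == "Success"
        return {
            "service_type": "lambda",
            # Drop the ARN prefix and any trailing version qualifier.
            "task_name": arn.split(":function:")[-1].split(":")[0],
            "event_type": "succeeded" if succeeded else "failed",
        }
    return {"service_type": "unknown", "task_name": None, "event_type": "unknown"}


def handler(event, context=None):
    """Entry point for an SNS-triggered Lambda function."""
    records = [normalize(json.loads(r["Sns"]["Message"])) for r in event["Records"]]
    failures = [r for r in records if r["event_type"] == "failed"]
    # A full implementation would persist `records` and alert on `failures`.
    return {"records": records, "failures": failures}
```

Note that in this sketch other terminal AWS Glue job states, such as TIMEOUT, are recorded under their own lower-cased name and are not counted as failed.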

Deploy the solution

The source code to implement this solution uses AWS Cloud Development Kit (AWS CDK) and is available on the GitHub repo monitor-serverless-datalake. This AWS CDK stack provisions required network components and the following:

Three S3 buckets (the bucket names are prefixed with the AWS account number and Region; for example, the landing bucket is <aws-account-number>-<aws-region>-landing):
- Landing
- Processed
- Monitor
Three Lambda functions:
- datalake-monitoring-lambda
- lambda-success
- lambda-fail
Two AWS Glue crawlers:
- glue-crawler-success
- glue-crawler-fail
Two AWS Glue jobs:
- glue-job-success
- glue-job-fail
An SNS topic named datalake-monitor-sns
Three EventBridge rules:
- glue-monitor-rule
- event-rule-lambda-fail
- event-rule-lambda-success
An AWS Secrets Manager secret named datalake-monitoring
Athena artifacts:
- monitor database
- monitor-table table

You can also follow the instructions in the GitHub repo to deploy the serverless monitoring solution. It takes about 10 minutes to deploy this solution.

Connect to a Slack channel

We still need a Slack channel to which the alerts are delivered. Complete the following steps:

Set up a workflow automation to route messages to the Slack channel using webhooks.
Note the webhook URL.

The following screenshot shows the field names to use.

The following is a sample message for the preceding template.

On the Secrets Manager console, navigate to the datalake-monitoring secret.
Add the webhook URL to the slack_webhook secret.
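The following sketch shows one way the monitoring function could read the webhook URL and post a failure to Slack. It assumes the datalake-monitoring secret is stored as JSON with a slack_webhook key, and the payload field names are hypothetical; they must match the variables defined in your Slack workflow. The Secrets Manager client is passed in (for example, boto3.client("secretsmanager")) so that the helper functions stay easy to test.

```python
import json
import urllib.request


def build_slack_payload(record):
    """Build the JSON body for the Slack workflow webhook.

    The keys below are hypothetical; use the variable names that you
    defined when setting up the workflow.
    """
    return {
        "service_type": record["service_type"],
        "task_name": record["task_name"],
        "event_type": record["event_type"],
    }


def get_webhook_url(secrets_client, secret_id="datalake-monitoring"):
    """Read the webhook URL, assuming a JSON secret with a slack_webhook key."""
    secret = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(secret["SecretString"])["slack_webhook"]


def notify_slack(webhook_url, record, opener=urllib.request.urlopen):
    """POST the payload to the webhook and return the HTTP status code."""
    body = json.dumps(build_slack_payload(record)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=10) as response:
        return response.status
```

Keeping the webhook URL in Secrets Manager rather than in the function's environment means it can be rotated without redeploying the stack.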

Load sample data

The next step is to load some sample data. Copy the sample data files to the landing bucket using the following command:

aws s3 cp --recursive s3://awsglue-datasets/examples/us-legislators s3://<AWS_ACCOUNT>-<AWS_REGION>-landing/legislators

In the next sections, we show how Lambda functions, AWS Glue crawlers, and AWS Glue jobs work for data ingestion.

Test the Lambda functions

On the EventBridge console, enable the rules that trigger the lambda-success and lambda-fail functions every 5 minutes:

event-rule-lambda-fail
event-rule-lambda-success

After a few minutes, the failure events are relayed to the Slack channel. The following screenshot shows an example message.

Disable the rules after testing to avoid repeated messages.
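If you prefer to script this step instead of using the console, the rules can be toggled with the EventBridge API. The following is a minimal sketch; the helper name is hypothetical, and the client is passed in (for example, boto3.client("events")) rather than created inside the function.

```python
LAMBDA_TEST_RULES = ("event-rule-lambda-fail", "event-rule-lambda-success")


def set_rules_enabled(events_client, enabled, rule_names=LAMBDA_TEST_RULES):
    """Enable or disable each scheduled rule; return the names touched."""
    for name in rule_names:
        if enabled:
            events_client.enable_rule(Name=name)
        else:
            events_client.disable_rule(Name=name)
    return list(rule_names)


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   events = boto3.client("events")
#   set_rules_enabled(events, True)    # start the test
#   set_rules_enabled(events, False)   # stop the repeated messages
```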

Test the AWS Glue crawlers

On the AWS Glue console, navigate to the Crawlers page. Here you can start the following crawlers:

glue-crawler-success
glue-crawler-fail

In a minute, the glue-crawler-fail crawler’s status changes to Failed, which triggers a notification in Slack in near-real time.

Test the AWS Glue jobs

On the AWS Glue console, navigate to the Jobs page, where you can start the following jobs:

glue-job-success
glue-job-fail

In a few minutes, the glue-job-fail job status changes to Failed, which triggers a notification in Slack in near-real time.
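The crawler and job tests can also be started programmatically. The following sketch uses the AWS Glue API through a client that is passed in (for example, boto3.client("glue")); the helper name is hypothetical, and the sketch does not handle the error that AWS Glue raises if a crawler is already running.

```python
CRAWLERS = ("glue-crawler-success", "glue-crawler-fail")
JOBS = ("glue-job-success", "glue-job-fail")


def start_glue_tests(glue_client, crawlers=CRAWLERS, jobs=JOBS):
    """Start the test crawlers and jobs; return the run ID of each job."""
    for crawler in crawlers:
        glue_client.start_crawler(Name=crawler)
    # start_job_run returns the ID of the new run, which is useful when
    # looking up the run on the console or in the logs.
    return {job: glue_client.start_job_run(JobName=job)["JobRunId"] for job in jobs}
```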

Analyze the monitoring data

The monitoring metrics are persisted in Amazon S3 and can be used for historical analysis.

On the Athena console, navigate to the monitor database and run the following query to find the service that failed the most often:

SELECT service_type, count(*) as "fail_count"
FROM "monitor"."monitor"
WHERE event_type = 'failed'
group by service_type
order by fail_count desc;
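The same query can be run outside the console, for example from a scheduled report. The following is a minimal sketch that uses the Athena API through a client that is passed in (for example, boto3.client("athena")). The output location is an assumption that you supply (an S3 path where Athena can write results), and the sketch reads only the first page of results.

```python
import time

FAIL_COUNT_QUERY = """
SELECT service_type, count(*) AS "fail_count"
FROM "monitor"."monitor"
WHERE event_type = 'failed'
GROUP BY service_type
ORDER BY fail_count DESC
"""


def run_query(athena_client, query, output_location, poll_seconds=1.0):
    """Run a query, wait for it to finish, and return rows as dicts."""
    execution_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "monitor"},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"Query {execution_id} ended in state {status['State']}")

    result = athena_client.get_query_results(QueryExecutionId=execution_id)
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
    if not rows:
        return []
    # For SELECT queries, the first row holds the column names.
    header, *data = rows
    return [dict(zip(header, values)) for values in data]
```

All values come back as strings, so convert fail_count to an integer before doing arithmetic on it.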

Over time, as rich observability data accumulates, time series analysis of the monitoring data will yield interesting findings.

Clean up

The overall cost of the solution is less than one dollar, but to avoid future costs, make sure to clean up the resources created as part of this post.

Summary

The post provided an overview of a serverless data lake monitoring solution that you can configure and deploy to integrate with enterprise serverless data lakes in just a few hours. With this solution, you can monitor a serverless data lake, send alerts in near-real time, and analyze performance metrics for all ETL tasks operating in the data lake. The design was intentionally kept simple to demonstrate the idea; you can further extend this solution with Athena and Amazon QuickSight to generate custom visuals and reporting. Check out the GitHub repo for a sample solution and further customize it for your monitoring needs.

About the Authors

Virendhar (Viru) Sivaraman is a strategic Senior Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Bolster security with role-based access control in Amazon MWAA

2021-08-09 Virendhar Sivaraman

Post Syndicated from Virendhar Sivaraman original https://aws.amazon.com/blogs/big-data/bolster-security-with-role-based-access-control-in-amazon-mwaa/

Amazon Studios invests in content that drives global growth of Amazon Prime Video and IMDb TV. Amazon Studios has a number of internal-facing applications that aim to streamline end-to-end business processes and information workflows for the entire content creation lifecycle. The Amazon Studios Data Infrastructure (ASDI) is a centralized, curated, and secure data lake that stores data, both in its original form and processed for analysis and machine learning (ML). The centralized ASDI is essential to break down data silos and combine different types of analytics, thereby allowing Amazon Studios to gain valuable insights, guide better business decisions, and innovate using the latest ML concepts.

What are the primary goals for Amazon MWAA adoption?

Amazon Managed Workflows for Apache Airflow (MWAA) is a fully managed service that makes it easier to run open-source versions of Apache Airflow on AWS. Builders at Amazon.com are engineering Amazon MWAA Directed Acyclic Graphs (DAGs) with prerequisites for provisioning the least privilege access model to the underlying services and resources, and restricting the blast radius of a given task.

Apache Airflow connections provide mechanisms for securely accessing the resources during DAG execution and are intended for coarse-grained access. Incorporating fine-grained access requires different mechanisms for implementation and code review prior to deployment. The additional challenge of codifying the infrastructure and stitching multiple systems together can also inject redundant activities when implementing fine-grained access patterns in Airflow.

How did Amazon achieve this goal?

The objective is to enforce security for DAGs at the lowest possible granularity, which is the DAG’s task level. The solution integrates Amazon MWAA task security with AWS Identity and Access Management (IAM) and AWS Security Token Service (AWS STS). The engineers customized the existing Airflow PythonOperator to tightly couple task access requirements to separately deployed IAM roles. The customized Airflow operator takes advantage of AWS STS to assume the associated IAM role. The temporary session created from AWS STS is used within PythonOperator to access the underlying resources required to run the task.

In this post, we discuss how to strengthen security in Amazon MWAA with role-based access control.

Prerequisites

To implement this solution, complete the following prerequisites:

Create an AWS account with admin access.
Create an Amazon MWAA environment.
1. Note down the execution role ARN associated with the Amazon MWAA environment. This is available in the Permissions section of the environment.

Create two Amazon Simple Storage Service (Amazon S3) buckets:
1. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/
2. s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/
Create two IAM roles; one for each of the buckets:
1. write_access_processed_bucket with the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/*"
        }
    ]
}

2. write_access_published_bucket with the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/*"
        }
    ]
}

Update the trust relationship for the preceding two roles with the Amazon MWAA execution role obtained from the Amazon MWAA environment page:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<AWS_ACCOUNT_ID>:assumed-role/<MWAA-EXECUTION_ROLE>/AmazonMWAA-airflow"
        ],
        "Service": "s3.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In the preceding policy, replace AWS_ACCOUNT_ID and MWAA-EXECUTION_ROLE with your account number and the name of the Amazon MWAA execution role, respectively.

Run the DAG

The proposed DAG has two tasks, each of which accesses one of the buckets created earlier:

Process task – Performs a task in the processed S3 bucket, which mocks a transformation using the Python sleep() function. The last step in this task adds a control file with the current timestamp.
Publish task – Performs a similar transformation in the published S3 bucket, which again mocks a transformation using the Python sleep() function. The last step in this task adds a control file with the current timestamp.

The fine-grained access restriction is enforced by a custom implementation of a widely used Airflow operator: PythonOperator. The custom PythonOperator negotiates with AWS STS to trade a session using the IAM role. The session is exclusively used by the tasks’ callable to access the underlying AWS resources. The following diagram shows the sequence of events.

The source code for the preceding implementation is available in the mwaa-rbac-task GitHub repository.
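To illustrate the exchange with AWS STS, the following is a minimal sketch written as a plain decorator rather than as the custom PythonOperator used in the repository. All names here are hypothetical. The STS client and the session factory are passed in; in a real DAG they would typically be boto3.client("sts") and boto3.Session.

```python
import functools


def session_kwargs_for_role(sts_client, role_arn, session_name):
    """Assume the role and return keyword arguments for a boto3.Session."""
    credentials = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


def with_task_role(sts_client, role_arn, session_factory):
    """Run the decorated task callable with a session scoped to role_arn.

    The callable receives the temporary session as its first argument and
    should create all of its AWS clients from that session.
    """
    def decorator(task_callable):
        @functools.wraps(task_callable)
        def wrapper(*args, **kwargs):
            # The function name doubles as the STS session name, which
            # makes the task identifiable in AWS CloudTrail.
            session = session_factory(
                **session_kwargs_for_role(sts_client, role_arn, task_callable.__name__)
            )
            return task_callable(session, *args, **kwargs)
        return wrapper
    return decorator
```

Because the role is assumed each time the task runs, the temporary credentials exist only for that task run, and a task can reach only the resources that its own role allows.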

The code base is set up in the following location in Amazon S3, as seen from the Amazon MWAA environment on the Amazon MWAA console.

Run the DAG and monitor its progress, as shown in the following screenshot.

After you run the DAG, the following files are created with timestamps updated:

s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-processed/control_file/processed.json 
	{
		"processed_dt": "03/05/2021 01:03:58"
	}

s3://<AWS_ACCOUNT_ID>-<AWS_REGION>-mwaa-published/control_file/published.json
	{
		"published_dt": "03/05/2021 01:04:12"
	}

The change in the preceding control files reflects that the tasks in the DAGs enforced the policies defined for these tasks.
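For reference, a task callable that produces a control file like the ones above could look like the following sketch. The function names are hypothetical, and the timestamp format (month first, UTC) is an assumption, because the sample values do not show which convention the original code uses. The S3 client is passed in and would normally be created from the temporary session of the task's role.

```python
import json
import time
from datetime import datetime, timezone


def build_control_file(field_name, now=None):
    """Return the control file body as a JSON string."""
    now = now or datetime.now(timezone.utc)
    # Assumed format: MM/DD/YYYY HH:MM:SS.
    return json.dumps({field_name: now.strftime("%m/%d/%Y %H:%M:%S")}, indent=4)


def process_task(s3_client, bucket, sleep_seconds=5, now=None):
    """Mock a transformation, then write the control file to the bucket."""
    time.sleep(sleep_seconds)  # stands in for the real transformation work
    key = "control_file/processed.json"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=build_control_file("processed_dt", now).encode("utf-8"),
    )
    return key
```

If this callable ran with a role that lacks s3:PutObject on the target bucket, the put_object call would fail with an access denied error, which is the behavior that the per-task roles are meant to enforce.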

Create custom Airflow Operators to support least privilege access

You can extend the demonstrated methodology for enabling fine-grained access using a customized PythonOperator to other Airflow operators and sensors as needed. For more information about how to customize operators, see Creating a custom Operator.

Conclusion

In this post, we presented a solution to bolster security in Amazon MWAA with role-based access controls. You can extend the concept to other Airflow operators in order to enhance workflow security at the task level. In addition, using the AWS Cloud Development Kit (AWS CDK) can make provisioning the Amazon MWAA environment and fine-grained IAM task roles seamless. We look forward to sharing more about fine-grained access patterns for Airflow tasks in a future post.

About the Author

Kishan Desai is a Data Engineer at Amazon Studios building a data platform to support the content creation process. He is passionate about building flexible and modular systems on AWS using serverless paradigms. Outside of work, Kishan enjoys learning new technologies, watching sports, experiencing SoCal’s great food, and spending quality time with friends and family.

Virendhar (Viru) Sivaraman is a strategic Big Data & Analytics Architect with Amazon Web Services. He is passionate about building scalable big data and analytics solutions in the cloud. Besides work, he enjoys spending time with family, hiking & mountain biking.
