All posts by Shibu Nair

Boosting Resiliency with an ML-based Telemetry Analytics Architecture

Post Syndicated from Shibu Nair original https://aws.amazon.com/blogs/architecture/boosting-resiliency-with-an-ml-based-telemetry-analytics-architecture/

Data proliferation has become a norm and as organizations become more data driven, automating data pipelines that enable data ingestion, curation, and processing is vital. Since many organizations have thousands of time-bound, automated, complex pipelines, monitoring their telemetry information is critical. Keeping track of telemetry data helps businesses monitor and recover their pipelines faster which results in better customer experiences.

In our blog post, we explain how you can collect telemetry from your data pipeline jobs and use machine learning (ML) to build a lower- and upper-bound threshold to help operators identify anomalies in near-real time.

The applications of anomaly detection on telemetry data from job pipelines are wide-ranging, including these and more:

  • Detecting abnormal runtimes
  • Detecting jobs running slower than expected
  • Proactive monitoring
  • Notifications

Key tenets of telemetry analytics

There are five key tenets of telemetry analytics, as in Figure 1.

Key tenets of telemetry analytics

Figure 1. Key tenets of telemetry analytics

The key tenets for near real-time telemetry analytics for data pipelines are:

  1. Collecting the metrics
  2. Aggregating the metrics
  3. Identify anomaly
  4. Notify and resolve issues
  5. Persist for compliance reasons, historical trend analysis, and to visualize

This blog post describes how customers can easily implement these steps by using AWS native no-code, low-code (AWS LCNC) solutions.

ML-based telemetry analytics solution architecture

The architecture defined here helps customers incrementally enable features with AWS LCNC solutions by leveraging AWS managed services to avoid the overhead of infrastructure provisioning. Most of the steps are configurations of the features provided by AWS services. This enables customers to make their applications resilient by tracking and resolving anomalies in near real time, as in Figure 2.

ML-based telemetry analytics solution architecture

Figure 2. ML-based telemetry analytics solution architecture

Let’s explore each of the architecture steps in detail.

1. Indicative AWS data analytics services: Choose from a broad range of AWS analytics services, including data movement, data storage, data lakes, big data analytics, log analytics, and streaming analytics to business intelligence, ML, and beyond. This diagram shows a subset of these data analytics services. You may use one or a combination of many, depending on your use case.

2. Amazon CloudWatch metrics for telemetry analytics: Collecting and visualizing real-time logs, metrics, and event data is a key step in any process. CloudWatch helps you accomplish these tasks without any infrastructure provisioning. Almost every AWS data analytics service is integrated with CloudWatch to enable automatic capturing of the detailed metrics needed for telemetry analytics.

3. Near real-time use case examples: Step three presents practical, near real-time use cases that represent a range of real-world applications, one or more of which may apply to your own business needs.

Use case 1: Anomaly detection

CloudWatch provides the functionality to apply anomaly detection for a metric. The key business use case of this feature is to apply statistical and ML algorithms on a per-metrics basis of business critical applications to proactively identify issues and raise alarms.

The focus is on a single set of metrics that will be important for the application’s functioning—for example, AWS Lambda metrics of a 24/7 credit card company’s fraud monitoring application.

Use case 2: Unified metrics using Amazon Managed Grafana

For proper insights into telemetry data, it is important to unify metrics and collaboratively identify and troubleshoot issues in analytical systems. Amazon Managed Grafana helps to visualize, query, and corelate metrics from CloudWatch in near real-time.

For example, Amazon Managed Grafana can be used to monitor container metrics for Amazon EMR running on Amazon Elastic Kubernetes Service (Amazon EKS), which supports processing high-volume data from business critical Internet of Things (IoT) applications like connected factories, offsite refineries, wind farms, and more.

Use case 3: Combined business and metrics data using Amazon OpenSearch Service

Amazon OpenSearch Service provides the capability to perform near real-time, ML-based interactive log analytics, application monitoring, and search by combining business and telemetry data.

As an example, customers can combine AWS CloudTrail logs for AWS logins, Amazon Athena, and Amazon RedShift query access times with employee reference data to detect insider threats.

This log analytics use case architecture integrates into OpenSearch, as in Figure 3.

Log analytics use case architecture overview with OpenSearch

Figure 3. Log analytics use case architecture overview with OpenSearch

Use case 4: ML-based advanced analytics

Using Amazon Simple Storage Service (Amazon S3) as data storage, data lake customers can tap into AWS analytics services such as the AWS Glue Catalog, AWS Glue DataBrew, and Athena for preparing and transforming data, as well as build trend analysis using ML models in Amazon SageMaker. This mechanism helps with performing ML-based advanced analytics to identify and resolve recurring issues.

4. Anomaly resolution: When an alert is generated either by CloudWatch alarm, OpenSearch, or Amazon Managed Grafana, you have the option to act on the alert in near-real time. Amazon Simple Notification Service (Amazon SNS) and Lambda can help build workflows. Lambda also helps integrate with ServiceNow ticket creation, Slack channel notifications, or other ticketing systems.

Simple data pipeline example

Let’s explore another practical example using an architecture that demonstrates how AWS Step Functions orchestrates Lambda, AWS Glue jobs, and crawlers.

To report an anomaly on AWS Glue jobs based on total number of records processed, you can leverage the glue.driver.aggregate.recordsRead CloudWatch metric and set up a CloudWatch alarm based on anomaly detection, Amazon SNS topic for notifications, and Lambda for resolution, as in Figure 4.

AWS Step Functions orchestrating Lamba, AWS Glue jobs, and crawlers

Figure 4. AWS Step Functions orchestrating Lamba, AWS Glue jobs, and crawlers

Here are the steps involved in the architecture proposed:

  • CloudWatch automatically captures the metric glue.driver.aggregate.recordsRead from AWS Glue jobs.
  • Customers set a CloudWatch alarm based on the anomaly detection of glue.driver.aggregate.recordsRead metric and set a notification to Amazon SNS topic.
  • CloudWatch applies a ML algorithm to the metric’s past data and creates a model of metric’s expected values.
  • When the number of records increases significantly, the metric from the CloudWatch anomaly detection model notifies the Amazon SNS topic.
  • Customers can notify an email group and trigger a Lambda function to resolve the issue, or create tickets in their operational monitoring system.
  • Customers can also unify all the AWS Glue metrics using Amazon Managed Grafana. Using Amazon S3, data lake customers can crawl and catalog the data in the AWS Glue catalog and make it available for ad-hoc querying. Amazon SageMaker can be used for custom model training and inferencing.

Conclusion

In this blog post, we covered a recommended architecture to enable near-real time telemetry analytics for data pipelines, anomaly detection, notification, and resolution. This provides resiliency to the customer applications by proactively identifying and resolving issues.

Detecting data drift using Amazon SageMaker

Post Syndicated from Shibu Nair original https://aws.amazon.com/blogs/architecture/detecting-data-drift-using-amazon-sagemaker/

As companies continue to embrace the cloud and digital transformation, they use historical data in order to identify trends and insights. This data is foundational to power tools, such as data analytics and machine learning (ML), in order to achieve high quality results.

This is a time where major disruptions are not only lasting longer, but also happening more frequently, as discussed in a McKinsey article on risk and resilience. Any disruption—a pandemic, hurricane, or even blocked sailing routes—has a major impact on the patterns of data and can create anomalous behavior.

ML models are dependent on data insights to help plan and support production-ready applications. With any disruptions, data drift can occur. Data drift is unexpected and undocumented changes to data structure, semantics, and/or infrastructure. If there is data drift, the model performance will degrade and no longer provide an accurate guidance. To mitigate the effects of the disruption, data drift needs to be detected and the ML models quickly trained and adjusted accordingly.

This blog post explains how to approach changing data patterns in the age of disruption and how to mitigate its effects on ML models. We also discuss the steps of building a feedback loop to capture the request data in the production environment and create a data pipeline to store the data for profiling and baselining. Then, we explain how Amazon SageMaker Clarify can help detect data drift.

How to detect data drift

There are three stages to detecting data drift: data quality monitoring, model quality monitoring, and drift evaluation (see Figure 1).

Stages in detecting data drift

Figure 1. Stages in detecting data drift

Data quality monitoring establishes a profile of the input data during model training, and then continuously compares incoming data with the profile. Deviations in the data profile signal a drift in the input data.

You can also detect drift through model quality monitoring, which requires capturing actual values that can be compared with the predictions. For example, using weekly demand forecasting, you can compare the forecast quantities one week later with the actual demand. Some use cases can require extra steps to collect actual values. For example, product recommendations may require you to ask a selected group of consumers for their feedback to the recommendation.

SageMaker Clarify provides insights into your trained models, including importance of model features and any biases towards certain segments of the input data. Changes of these attributes between re-trained models also signal drift. Drift evaluation constitutes the monitoring data and mechanisms to detect changes and triggering consequent actions. With Amazon CloudWatch, you can define rules and thresholds that prompt drift notifications.

Figure 2 illustrates a basic architecture with the data sources for training and production (on the left) and the observed data concerning drift (on the right). You can use Amazon SageMaker Data Wrangler, a visual data preparation tool, to clean and normalize your input data for your ML task. You can store the features that you defined for your models in the Amazon SageMaker Feature Store, a fully managed, purpose-built repository to store, update, retrieve, and share ML features.

The white, rectangular boxes in the architecture diagram represent the tasks for detecting data and model drift. You can integrate those tasks into your ML workflow with Amazon SageMaker Pipelines.

Basic architecture on how data drift is detected using Amazon SageMaker

Figure 2. Basic architecture on how data drift is detected using Amazon SageMaker

The drift observation data can be captured in tabular format, such as comma-separated values or Parquet, on Amazon Simple Storage Service (S3) and analyzed with Amazon Athena and Amazon QuickSight.

How to build a feedback loop

The baselining task establishes a data profile from training data. It uses Amazon SageMaker Model Monitor and runs before training or re-training the model. The baseline profile is stored on Amazon S3 to be referenced by the data drift monitoring job.

The data drift monitoring task continuously profiles the input data, compares it with baseline, and the results are captured in CloudWatch. This tasks runs on its own computation resources using Deequ, which checks that the monitoring job does not slow down your ML inference flow and scales with the data. The frequency of running this task can be adjusted to control cost, which can depend on how rapidly you anticipate that the data may change.

The model quality monitoring task computes model performance metrics from actuals and predicted values. The origin of these data points depends on the use case. Demand forecasting use cases naturally capture actuals that can be used to validate past predictions. Other use cases can require extra steps to acquire ground-truth data.

CloudWatch is a monitoring and observability service with which you can define rules to act on deviation in model performance or data drift. With CloudWatch, you can setup alerts to users via e-mail or SMS, and it can automatically start the ML model re-training process.

Run the baseline task on your updated data set before re-training your model. Use the SageMaker model registry to catalog your ML models for production, manage model versions, and control the associate training metrics.

Gaining insight into data and models

SageMaker Clarify provides greater visibility into your training data and models, helping identify and limit bias and explain predictions. For example, the trained models may consider some features more strongly than others when generating predictions. Compare the feature importance and bias between model-provided versions for a better understanding of the changes.

Conclusion

As companies continue to use data analytics and ML to inform daily activity, data drift may become a more common occurrence. Recognizing that drift can have a direct impact on models and production-ready applications, it is important to architect to identify potential data drift and avoid downgrading the models and negatively impacting results. Failure to capture changes in data can result in loss of process confidence, downgraded model accuracy, or a bottom-line impact to the business.