Tag Archives: Anomaly detection

Unsupervised graph anomaly detection – Catching new fraudulent behaviours

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-anomaly-model

Earlier in this series, we covered the importance of graph networks, graph concepts, graph visualisation, and graph-based fraud detection methods. In this article, we will discuss how to automatically detect new types of fraudulent behaviour and swiftly take action on them.

One of the challenges in fraud detection is that fraudsters are incentivised to always adversarially innovate their way of conducting frauds, i.e., their modus operandi (MO in short). Machine learning models trained using historical data may not be able to pick up new MOs, as they are new patterns that are not available in existing training data. To enhance Grab’s existing security defences and protect our users from these new MOs, we needed a machine learning model that is able to detect them quickly without the need for any label supervision, i.e., an unsupervised learning model rather than the regular supervised learning model.

To address this, we developed an in-house machine learning model for detecting anomalous patterns in graphs, which has led to the discovery of new fraud MOs. Our focus was initially on GrabFood and GrabMart verticals, where we monitored the interactions between consumers and merchants. We modelled these interactions as a bipartite graph (a type of graph for modelling interactions between two groups) and then performed anomaly detection on the graph. Our in-house anomaly detection model was also presented at the International Joint Conference on Neural Networks (IJCNN) 2023, a premier academic conference in the area of neural networks, machine learning, and artificial intelligence.

In this blog, we discuss the model and its application within Grab. Readers who want the full details of our model can access the paper here. Note that even though we implemented our model for anomaly detection in GrabFood and GrabMart, the model is designed for general purposes and is applicable to interaction graphs between any two groups.

Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs
By Rizal Fathony, Jenn Ng, Jia Chen
Presented at International Joint Conference on Neural Networks (IJCNN) 2023

Before we dive into how our model works, it is important to understand the process of graph construction in our application as the model assumes the availability of the graphs in a standardised format.

Graph construction 

We modelled the interactions between consumers and merchants in GrabFood and GrabMart platforms as bipartite graphs (G), where the first group of nodes (U) represents the consumers, the second group of nodes (V) represents the merchants, and the edges (E) connecting them mean that the consumers have placed food/mart orders with the merchants. The graph is also supplied with rich transactional information about the consumers and the merchants in the form of node features (Xu and Xv), as well as order information in the form of edge features (Xe).
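As a rough illustration of this structure, the sketch below assembles U, V, E, Xu, Xv and Xe from a handful of order records. The record fields (consumer_id, merchant_id, amount, promo_used) and the chosen features are assumptions made for this example; they are not Grab's actual schema or feature set.

```python
from collections import defaultdict

# Hypothetical order records; the field names are illustrative only.
orders = [
    {"consumer_id": "c1", "merchant_id": "m1", "amount": 12.5, "promo_used": True},
    {"consumer_id": "c1", "merchant_id": "m2", "amount": 30.0, "promo_used": False},
    {"consumer_id": "c2", "merchant_id": "m1", "amount": 8.0, "promo_used": True},
    {"consumer_id": "c1", "merchant_id": "m1", "amount": 7.5, "promo_used": True},
]


def build_bipartite_graph(orders):
    """Return (U, V, E, Xu, Xv, Xe) for a consumer-merchant order graph."""
    edge_orders = defaultdict(list)
    for o in orders:
        edge_orders[(o["consumer_id"], o["merchant_id"])].append(o)

    U = sorted({u for u, _ in edge_orders})  # consumer nodes
    V = sorted({v for _, v in edge_orders})  # merchant nodes
    E = sorted(edge_orders)                  # one edge per (consumer, merchant) pair

    # Edge features Xe: order count, total amount, share of promo orders.
    Xe = {}
    for e, group in edge_orders.items():
        n = len(group)
        total = sum(o["amount"] for o in group)
        promo_share = sum(o["promo_used"] for o in group) / n
        Xe[e] = [n, total, promo_share]

    # Node features: number of distinct counterparts and total order amount.
    def node_features(nodes, side):
        feats = {}
        for node in nodes:
            incident = [e for e in E if e[side] == node]
            feats[node] = [len(incident), sum(Xe[e][1] for e in incident)]
        return feats

    Xu = node_features(U, 0)
    Xv = node_features(V, 1)
    return U, V, E, Xu, Xv, Xe


U, V, E, Xu, Xv, Xe = build_bipartite_graph(orders)
```

Repeated orders between the same consumer and merchant are collapsed into a single edge here, with the repetition captured in the edge features. That is one possible design choice, not something the source specifies.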

Fig 1. Graph construction process

The goal of our anomaly model is to detect anomalous and suspicious behaviours from the consumers or merchants (node-level anomaly detection), as well as anomalous order interactions (edge-level anomaly detection). As mentioned, this detection needs to be done without any label supervision.

Model architecture

We designed our graph anomaly model as a type of autoencoder, with an encoder and two decoders – a feature decoder and a structure decoder. The key feature of our model is that it accepts a bipartite graph with both node and edge attributes as the input. This is important as both node and edge attributes encode essential information for determining if certain behaviours are suspicious. Many previous works on graph anomaly detection only support node attributes. In addition, our model can produce both node and edge level anomaly scores, unlike most of the previous works that produce node-level scores only. We named our model GraphBEAN, which is short for Bipartite Node-and-Edge-Attributed Networks.

From the input, the encoder processes the attributed bipartite graph through a series of graph convolution layers to produce latent representations for both node groups. Our graph convolution layers produce new representations for each node in both node groups (U and V), as well as for each edge in the graph. Note that the last convolution layer in the encoder only produces the latent representations for nodes, without producing edge representations. The reason for this design is that we only keep latent representations for the active actors, that is, the nodes representing consumers and merchants, and not for their interactions.

Fig 2. GraphBEAN architecture

From the nodes’ latent representations, the feature decoder is tasked to reconstruct the original graph with both node and edge attributes via a series of graph convolution layers. As the graph structure is provided to the feature decoder, we task the structure decoder to learn the graph structure by predicting if there exists an edge connecting two nodes. Both this edge prediction and the graph reconstructed by the feature decoder are then compared to the original input graph via a reconstruction loss function.

The model is then trained using the bipartite graph constructed from GrabFood and GrabMart transactions. We use a reconstruction-based loss function as the training objective of the model. After the training is completed, we compute the anomaly score of each node and edge in the graph using the trained model.

Anomaly score computation

Our anomaly scores are reconstruction-based. The score design assumes that normal behaviours are common in the dataset and thus, can be easily reconstructed by the model. On the other hand, anomalous behaviours are rare. Therefore the model will have a hard time reconstructing them, hence producing high errors.

Fig 3. Edge-level and node-level anomaly scores computation

The model produces two types of anomaly scores. First, the edge-level anomaly scores, which are calculated from the edge reconstruction error. Second, the node-level anomaly scores, which are calculated from node reconstruction error plus an aggregate over the edge scores from the edges connected to the node. This aggregate could be a mean or max aggregate.
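The two score types can be sketched as follows. This is a minimal illustration that assumes mean squared error as the reconstruction error and uses toy numbers in place of real model output; the function names are made up for the example and the exact error terms used in GraphBEAN may differ.

```python
def mse(x, x_hat):
    """Mean squared error between an original and a reconstructed feature vector."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def edge_anomaly_scores(Xe, Xe_hat):
    """Edge-level score: reconstruction error of each edge's features."""
    return {e: mse(Xe[e], Xe_hat[e]) for e in Xe}


def node_anomaly_scores(X, X_hat, edge_scores, side, agg="max"):
    """Node-level score: node reconstruction error plus an aggregate of the
    scores of the edges connected to the node (side=0 for U, side=1 for V)."""
    scores = {}
    for node in X:
        incident = [s for e, s in edge_scores.items() if e[side] == node]
        if not incident:
            edge_part = 0.0
        elif agg == "max":
            edge_part = max(incident)
        else:  # mean aggregate
            edge_part = sum(incident) / len(incident)
        scores[node] = mse(X[node], X_hat[node]) + edge_part
    return scores


# Toy values standing in for the input graph and the model's reconstruction.
Xe = {("c1", "m1"): [1.0, 2.0], ("c1", "m2"): [0.0, 4.0]}
Xe_hat = {("c1", "m1"): [1.0, 2.0], ("c1", "m2"): [2.0, 4.0]}
Xu = {"c1": [3.0, 1.0]}
Xu_hat = {"c1": [2.0, 1.0]}

e_scores = edge_anomaly_scores(Xe, Xe_hat)
u_scores_max = node_anomaly_scores(Xu, Xu_hat, e_scores, side=0, agg="max")
u_scores_mean = node_anomaly_scores(Xu, Xu_hat, e_scores, side=0, agg="mean")
```

With the max aggregate, a single highly anomalous order is enough to raise a node's score, while the mean aggregate rewards nodes whose interactions are mostly normal.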

Actioning system

In our implementation of GraphBEAN within Grab, we designed a full pipeline of anomaly detection and actioning systems. It is a fully-automated system for constructing a bipartite graph from GrabFood and GrabMart transactions, training a GraphBEAN model using the graph, and computing anomaly scores. After computing anomaly scores for all consumers and merchants (node-level), as well as all of their interactions (edge-level), it automatically passes the scores to our actioning system. But before that, it also passes them through a system we call the fraud type tagger. This is also a fully-automated heuristic-based system that tags some of the detected anomalies with fraud types. The purpose of this tagging is to provide context, such as the likely types of the detected anomalies. Some examples of these tags are promo abuse or possible collusion.

Fig 4. Pipeline in our actioning system

Both the anomaly scores and the fraud type tags are then forwarded to our actioning system. The system consists of two subsystems:

  • Human expert actioning system: Our fraud experts analyse the detected anomalies and take action on them, like suspending certain transaction features from suspicious merchants.
  • Automatic actioning system: Combines the anomaly scores and fraud type tags with other external signals to automatically act on the detected anomalies, like preventing promos from being used by fraudsters or preventing fraudulent transactions from occurring. These actions vary depending on the type of fraud and the scores.

What’s next?

The GraphBEAN model enables the detection of suspicious behaviour on graph data without the need for label supervision. By implementing the model on GrabFood and GrabMart platforms, we learnt that having such a system enables us to quickly identify new types of fraudulent behaviours and then swiftly act on them. This also allows us to enhance Grab’s defence against fraudulent activity and actively protect our users.

We are currently working on extending the model to more generic heterogeneous (multi-entity) graphs. In addition, we are also working on applying it to more use cases within Grab.

Machine Learning for Fraud Detection in Streaming Services

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/machine-learning-for-fraud-detection-in-streaming-services-b0b4ef3be3f6

By Soheil Esmaeilzadeh, Negin Salajegheh, Amir Ziai, Jeff Boote

Introduction

Streaming services serve content to millions of users all over the world. These services allow users to stream or download content across a broad category of devices including mobile phones, laptops, and televisions. However, some restrictions are in place, such as the number of active devices, the number of streams, and the number of downloaded titles. Many users across many platforms make for a uniquely large attack surface that includes content fraud, account fraud, and abuse of terms of service. Detection of fraud and abuse at scale and in real-time is highly challenging.

Data analysis and machine learning techniques are great candidates to help secure large-scale streaming platforms. Even though such techniques can scale security solutions proportional to the service size, they bring their own set of challenges such as requiring labeled data samples, defining effective features, and finding appropriate algorithms. In this work, by relying on the knowledge and experience of streaming security experts, we define features based on the expected streaming behavior of the users and their interactions with devices. We present a systematic overview of the unexpected streaming behaviors together with a set of model-based and data-driven anomaly detection strategies to identify them.

Background on Anomaly Detection

Anomalies (also known as outliers) are defined as certain patterns (or incidents) in a set of data samples that do not conform to an agreed-upon notion of normal behavior in a given context.

There are two main anomaly detection approaches, namely, (i) rule-based, and (ii) model-based. Rule-based anomaly detection approaches use a set of rules which rely on the knowledge and experience of domain experts. Domain experts specify the characteristics of anomalous incidents in a given context and develop a set of rule-based functions to discover the anomalous incidents. As a result of this reliance, the deployment and use of rule-based anomaly detection methods become prohibitively expensive and time-consuming at scale, and such methods cannot be used for real-time analyses. Furthermore, the rule-based anomaly detection approaches require constant supervision by experts in order to keep the underlying set of rules up-to-date for identifying novel threats. Reliance on experts can also make rule-based approaches biased or limited in scope and efficacy.

On the other hand, in model-based anomaly detection approaches, models are built and used to detect anomalous incidents in a fairly automated manner. Although model-based anomaly detection approaches are more scalable and suitable for real-time analysis, they rely heavily on the availability of (often labeled) context-specific data. Model-based anomaly detection approaches, in general, are of three kinds, namely, (i) supervised, (ii) semi-supervised, and (iii) unsupervised. Given a labeled dataset, a supervised anomaly detection model can be built to distinguish between anomalous and benign incidents. In semi-supervised anomaly detection models, only a set of benign examples is required for training. These models learn the distributions of benign samples and leverage that knowledge for identifying anomalous samples at inference time. Unsupervised anomaly detection models do not require any labeled data samples, but it is not straightforward to reliably evaluate their efficacy.

Figure 1. Schematic of a streaming service platform: (a) illustrates device types that can be used for streaming, (b) designates the set of authentication and authorization systems such as license and manifest servers for providing encrypted contents as well as decryption keys and manifests, and (c) shows the streaming service provider, as a surrogate entity for digital content providers, that interacts with the other two components.

Streaming Platforms

Commercial streaming platforms shown in Figure 1 mainly rely on Digital Rights Management (DRM) systems. DRM is a collection of access control technologies that are used for protecting the copyrights of digital media such as movies and music tracks. DRM helps the owners of digital products prevent illegal access, modification, and distribution of their copyrighted work. DRM systems provide continuous content protection against unauthorized actions on digital content and restrict it to streaming and in-time consumption. The backbone of DRM is the use of digital licenses, which specify a set of usage rights for the digital content and contain the permissions from the owner to stream the content via an on-demand streaming service.

On the client’s side, a request is sent to the streaming server to obtain the protected encrypted digital content. In order to stream the digital content, the user requests a license from the clearinghouse that verifies the user’s credentials. Once a license gets assigned to a user, using a Content Decryption Module (CDM), the protected content gets decrypted and becomes ready for preview according to the usage rights enforced by the license. A decryption key gets generated using the license, which is specific to a certain movie title, can only be used by a particular account on a given device, has a limited lifetime, and enforces a limit on how many concurrent streams are allowed.

Another relevant component that is involved in a streaming experience is the manifest. A manifest is a list of video, audio, subtitle, and other streams, which comes in the form of a few Uniform Resource Locators (URLs) that are used by the clients to get the movie streams. The manifest is requested by the client and gets delivered to the player before the license request, and it itemizes the available streams.

Data

Data Labeling

For the task of anomaly detection in streaming platforms, as we have neither an already trained model nor any labeled data samples, we use structural, a priori, domain-specific, rule-based assumptions for data labeling. Accordingly, we define a set of rule-based heuristics used for identifying anomalous streaming behaviors of clients and label them as anomalous or benign. The fraud categories that we consider in this work are (i) content fraud, (ii) service fraud, and (iii) account fraud. With the help of security experts, we have designed and developed heuristic functions in order to discover a wide range of suspicious behaviors. We then use such heuristic functions for automatically labeling the data samples. In order to label a set of benign (non-anomalous) accounts, we use a group of vetted users that are highly trusted to be free of any form of fraud.

Next, we share three examples as a subset of our in-house heuristics that we have used for tagging anomalous accounts:

  • (i) Rapid license acquisition: a heuristic based on the fact that benign users usually watch one title at a time and take a while to move on to another, resulting in a relatively low rate of license acquisition. Based on this reasoning, we tag all the accounts that acquire licenses very quickly as anomalous.
  • (ii) Too many failed attempts at streaming: a heuristic that relies on the fact that most devices stream without errors, while a device in trial-and-error mode, searching for the “right” parameters, leaves a long trail of errors behind. Abnormally high levels of errors are an indicator of a fraud attempt.
  • (iii) Unusual combinations of device types and DRMs: a heuristic that is based on the fact that a device type (e.g., a browser) is normally matched with a certain DRM system (e.g., Widevine). Unusual combinations could be a sign of compromised devices that attempt to bypass security enforcements.
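The three heuristics above could be expressed as simple labeling functions along the following lines. This is only a sketch: the thresholds, the account fields, and the device/DRM names are all hypothetical, since the in-house rules are not published.

```python
# Hypothetical thresholds and mappings; the source does not publish its rules.
MAX_LICENSES_PER_HOUR = 20
MAX_ERROR_RATE = 0.5
EXPECTED_DRM = {"browser": {"drm_x"}, "tv": {"drm_y"}}  # placeholder names


def rapid_license_acquisition(account):
    """(i) Licenses are acquired much faster than normal viewing would need."""
    hours = max(account["active_hours"], 1)
    return account["license_cnt"] / hours > MAX_LICENSES_PER_HOUR


def too_many_failed_attempts(account):
    """(ii) An abnormally high share of streaming attempts ended in errors."""
    total = account["stream_attempts"]
    return total > 0 and account["failed_attempts"] / total > MAX_ERROR_RATE


def unusual_device_drm(account):
    """(iii) A device type appears with a DRM system it is not normally paired with.
    Unknown device types are treated as unusual in this sketch."""
    return any(
        drm not in EXPECTED_DRM.get(device, set())
        for device, drm in account["device_drm_pairs"]
    )


HEURISTICS = {
    "rapid_license_acquisition": rapid_license_acquisition,
    "too_many_failed_attempts": too_many_failed_attempts,
    "unusual_device_drm": unusual_device_drm,
}


def tag_account(account):
    """Return the names of all heuristics that fire; an empty list means no tag."""
    return [name for name, rule in HEURISTICS.items() if rule(account)]


acct_ok = {"license_cnt": 6, "active_hours": 3, "stream_attempts": 10,
           "failed_attempts": 1, "device_drm_pairs": [("browser", "drm_x")]}
acct_bad = {"license_cnt": 300, "active_hours": 2, "stream_attempts": 50,
            "failed_attempts": 40, "device_drm_pairs": [("browser", "drm_y")]}

tags_ok = tag_account(acct_ok)
tags_bad = tag_account(acct_bad)
```

An account can trigger several heuristics at once, which is what later makes the labeling task multi-label.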

It should be noted that the heuristics, even though they work as a great proxy to embed the knowledge of security experts in tagging anomalous accounts, may not be completely accurate, and they might wrongly tag accounts as anomalous (i.e., false-positive incidents), for example in the case of a buggy client or device. It is up to the machine learning model to discover and avoid such false-positive incidents.

Data Featurization

A complete list of features used in this work is presented in Table 1. The features mainly belong to two distinct classes. One class accounts for the number of distinct occurrences of a certain parameter/activity/usage in a day. For instance, the dist_title_cnt feature characterizes the number of distinct movie titles streamed by an account. The second class of features on the other hand captures the percentage of a certain parameter/activity/usage in a day.

For confidentiality reasons, we have partially obfuscated the features. For instance, dev_type_a_pct, drm_type_a_pct, and end_frmt_a_pct are intentionally obfuscated, and we do not explicitly name the devices, DRM types, and encoding formats.
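To make the two feature classes concrete, here is a small sketch that derives two count features and one percentage feature from one account's events in a day. The feature names dist_title_cnt, dist_dev_id_cnt and dev_type_a_pct come from the text; the event fields and the 0 to 100 percentage scale are assumptions for illustration.

```python
def featurize(events):
    """Compute daily features for one account.

    events: list of dicts with hypothetical keys 'title', 'device_id'
    and 'device_type', one dict per streaming event in the day.
    """
    n = len(events)
    type_a = sum(e["device_type"] == "a" for e in events)
    return {
        # Count features: number of distinct values seen in the day.
        "dist_title_cnt": len({e["title"] for e in events}),
        "dist_dev_id_cnt": len({e["device_id"] for e in events}),
        # Percentage feature: share of events coming from type (a) devices.
        "dev_type_a_pct": 100.0 * type_a / n if n else 0.0,
    }


events = [
    {"title": "t1", "device_id": "d1", "device_type": "a"},
    {"title": "t2", "device_id": "d1", "device_type": "a"},
    {"title": "t2", "device_id": "d2", "device_type": "b"},
    {"title": "t3", "device_id": "d1", "device_type": "a"},
]
features = featurize(events)
```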

Table 1. The list of streaming related features with the suffixes pct and cnt respectively referring to percentage and count

Data Statistics

In this part, we present the statistics of the features presented in Table 1. Over 30 days, we have gathered 1,030,005 benign and 28,045 anomalous accounts. The anomalous accounts have been identified (labeled) using the heuristic-aware approach. Figure 2(a) shows the number of anomalous samples as a function of fraud categories with 8,741 (31%), 13,299 (47%), 6,005 (21%) data samples being tagged as content fraud, service fraud, and account fraud, respectively. Figure 2(b) shows that out of 28,045 data samples being tagged as anomalous by the heuristic functions, 23,838 (85%), 3,365 (12%), and 842 (3%) are respectively considered as incidents of one, two, and three fraud categories.

Figure 3 presents the correlation matrix of the 23 data features described in Table 1 for clean and anomalous data samples. As we can see in Figure 3 there are positive correlations between features that correspond to device signatures, e.g., dist_cdm_cnt and dist_dev_id_cnt, and between features that refer to title acquisition activities, e.g., dist_title_cnt and license_cnt.

Figure 2. Number of anomalous samples as a function of (a) fraud categories and (b) number of tagged categories.
Figure 3. Correlation matrix of the features presented in Table 1 for (a) clean and (b) anomalous data samples.

Label Imbalance Treatment

It is well known that class imbalance can compromise the accuracy and robustness of the classification models. Accordingly, in this work, we use the Synthetic Minority Over-sampling Technique (SMOTE) to over-sample the minority classes by creating a set of synthetic samples.

Figure 4 shows a high-level schematic of the Synthetic Minority Over-sampling Technique (SMOTE) with two classes shown in green and red, where the red class has fewer samples, i.e., is the minority class, and gets synthetically upsampled.
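The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The version below is a simplified, standard-library illustration of that idea, not the implementation used in this work; the function name and defaults are chosen for the example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Because every synthetic point is a convex combination of two real minority samples, the upsampled class stays inside the region already occupied by the minority class.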

Figure 4. Synthetic Minority Over-sampling Technique

Evaluation Metrics

For evaluating the performance of the anomaly detection models we consider a set of evaluation metrics and report their values. For the one-class as well as binary anomaly detection task, such metrics are accuracy, precision, recall, f0.5, f1, and f2 scores, and area under the curve of the receiver operating characteristic (ROC AUC). For the multi-class multi-label task we consider accuracy, precision, recall, f0.5, f1, and f2 scores together with a set of additional metrics, namely, exact match ratio (EMR) score, Hamming loss, and Hamming score.
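For reference, the less common metrics in this list can be computed as in the sketch below. The F-beta formula is standard. For the Hamming score, the sketch assumes the usual definition (the per-sample ratio of correctly predicted labels to the union of true and predicted labels, averaged over samples); the source does not spell out its definition. The label vectors are toy values.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def exact_match_ratio(y_true, y_pred):
    """Share of samples whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Share of individual label slots that are predicted wrongly."""
    n_labels = len(y_true[0])
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * n_labels)


def hamming_score(y_true, y_pred):
    """Mean over samples of |true AND pred| / |true OR pred| (1.0 if both empty)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(a and b for a, b in zip(t, p))
        union = sum(a or b for a, b in zip(t, p))
        total += inter / union if union else 1.0
    return total / len(y_true)


# Toy multi-label data: columns are content, service, and account fraud.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

emr = exact_match_ratio(y_true, y_pred)
h_loss = hamming_loss(y_true, y_pred)
h_score = hamming_score(y_true, y_pred)
f05, f1, f2 = (f_beta(0.5, 1.0, b) for b in (0.5, 1, 2))
```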

Model Based Anomaly Detection

In this section, we briefly describe the modeling approaches that are used in this work for anomaly detection. We consider two model-based anomaly detection approaches, namely, (i) semi-supervised, and (ii) supervised as presented in Figure 5.

Figure 5. Model-based anomaly detection approaches: (a) semi-supervised and (b) supervised.

Semi-Supervised Anomaly Detection

The key point about the semi-supervised model is that at the training step the model is supposed to learn the distribution of the benign data samples so that at inference time it is able to distinguish between the benign samples (which it has been trained on) and the anomalous samples (which it has not observed). Then at the inference stage, the anomalous samples are simply those that fall outside the distribution of the benign samples. The performance of One-Class methods could become sub-optimal when dealing with complex and high-dimensional datasets. However, as supported by the literature, deep neural autoencoders can perform better than One-Class methods on complex and high-dimensional anomaly detection tasks.

For the One-Class anomaly detection approaches, in addition to a deep auto-encoder, we use the One-Class SVM, Isolation Forest, Elliptic Envelope, and Local Outlier Factor approaches.
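The train-on-benign, flag-what-falls-outside workflow can be illustrated with a deliberately simple stand-in. The sketch below is neither the deep auto-encoder nor any of the One-Class methods named above: it fits a per-feature mean and standard deviation on benign samples only, scores a sample by its mean squared z-score, and flags anything above a threshold taken from the benign scores. All names, values, and the 0.95 quantile are assumptions.

```python
import statistics


def fit_benign(benign):
    """Learn per-feature mean and standard deviation from benign samples only."""
    cols = list(zip(*benign))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return means, stds


def score(x, means, stds):
    """Mean squared z-score: large when x is far from the benign distribution."""
    return sum(((v - m) / s) ** 2 for v, m, s in zip(x, means, stds)) / len(x)


def fit_threshold(benign, means, stds, quantile=0.95):
    """Pick the threshold as a quantile of the scores of the benign samples."""
    scores = sorted(score(x, means, stds) for x in benign)
    return scores[min(int(quantile * len(scores)), len(scores) - 1)]


# Toy benign training data (two features per account).
benign = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [1.0, 10.0], [1.1, 10.5]]
means, stds = fit_benign(benign)
threshold = fit_threshold(benign, means, stds)


def is_anomalous(x):
    return score(x, means, stds) > threshold


flag_far = is_anomalous([5.0, 40.0])    # far from every benign sample
flag_near = is_anomalous([1.0, 10.0])   # looks like the training data
```

An auto-encoder follows the same pattern, with the reconstruction error of each sample playing the role of the score.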

Supervised Anomaly Detection

Binary Classification: In the anomaly detection task using binary classification, we only consider two classes of samples, namely benign and anomalous, and we do not make distinctions between the types of the anomalous samples, i.e., the three fraud categories. For the binary classification task we use multiple supervised classification approaches, namely, (i) Support Vector Classification (SVC), (ii) K-Nearest Neighbors classification, (iii) Decision Tree classification, (iv) Random Forest classification, (v) Gradient Boosting, (vi) AdaBoost, (vii) Nearest Centroid classification, (viii) Quadratic Discriminant Analysis (QDA) classification, (ix) Gaussian Naive Bayes classification, (x) Gaussian Process Classifier, (xi) Label Propagation classification, and (xii) XGBoost. Finally, upon doing stratified k-fold cross-validation, we carry out an efficient grid search to tune the hyper-parameters in each of the aforementioned models for the binary classification task and only report the performance metrics for the optimally tuned hyper-parameters.

Multi-Class Multi-Label Classification: In the anomaly detection task using multi-class multi-label classification, we consider the three fraud categories as the possible anomalous classes (hence multi-class), and each data sample is assigned one or more than one of the fraud categories as its set of labels (hence multi-label) using the heuristic-aware data labeling strategy presented earlier. For the multi-class multi-label classification task we use multiple supervised classification techniques, namely, (i) K-Nearest Neighbors, (ii) Decision Tree, (iii) Extra Trees, (iv) Random Forest, and (v) XGBoost.

Results and Discussion

Table 2 shows the values of the evaluation metrics for the semi-supervised anomaly detection methods. As we see from Table 2, the deep auto-encoder model performs the best among the semi-supervised anomaly detection approaches with an accuracy of around 96% and f1 score of 94%. Figure 6(a) shows the distribution of the Mean Squared Error (MSE) values for the anomalous and benign samples at the inference stage.

Table 2. The values of the evaluation metrics for a set of semi-supervised anomaly detection models.
Figure 6. For the deep auto-encoder model: (a) distribution of the Mean Squared Error (MSE) values for anomalous and benign samples at the inference stage, (b) confusion matrix across benign and anomalous samples, and (c) Mean Squared Error (MSE) values averaged across the anomalous and benign samples for each of the 23 features.
Table 3. The values of the evaluation metrics for a set of supervised binary anomaly detection classifiers.
Table 4. The values of the evaluation metrics for a set of supervised multi-class multi-label anomaly detection approaches. The values in parentheses refer to the performance of the models trained on the original (not upsampled) dataset.

Table 3 shows the values of the evaluation metrics for a set of supervised binary anomaly detection models. Table 4 shows the values of the evaluation metrics for a set of supervised multi-class multi-label anomaly detection models.

In Figure 7(a), for the content fraud category, the three most important features are the count of distinct encoding formats (dist_enc_frmt_cnt), the count of distinct devices (dist_dev_id_cnt), and the count of distinct DRMs (dist_drm_cnt). This implies that for content fraud the use of multiple devices, as well as encoding formats, stands out from the other features. For the service fraud category in Figure 7(b) we see that the three most important features are the count of content licenses associated with an account (license_cnt), the count of distinct devices (dist_dev_id_cnt), and the percentage use of type (a) devices by an account (dev_type_a_pct). This shows that in the service fraud category the counts of content licenses and distinct devices of type (a) stand out from the other features. Finally, for the account fraud category in Figure 7(c), we see that the count of distinct devices (dist_dev_id_cnt) dominantly stands out from the other features.

Figure 7. The normalized feature importance values (NFIV) for the multi-class multi-label anomaly detection task using the XGBoost approach in Table 4 across the three anomaly classes, i.e., (a) content fraud, (b) service fraud, and (c) account fraud.

You can find more technical details in our paper here.


Detect anomalies on one million unique entities with Amazon OpenSearch Service

Post Syndicated from Kaituo Li original https://aws.amazon.com/blogs/big-data/detect-anomalies-on-one-million-unique-entities-with-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) supports a highly performant, integrated anomaly detection engine that enables the real-time identification of anomalies in streaming data. Last year, we released high-cardinality anomaly detection (HCAD) to detect individual entities’ anomalies. With the 1.1 release, you can monitor a million entities with steady, predictable performance.

HCAD is easiest to understand when described in contrast to the non-HCAD single-stream solution. In a single-stream detector, we detect anomalies for an aggregate entity. For example, we can use a single-stream detector to sift through aggregated traffic across all IP addresses so that users can be notified when unusual spikes occur. However, we often need to identify anomalies in entities, such as individual hosts and IP addresses. Each entity may work on a different baseline, which means its time series’ distribution (measured in parameters such as magnitude, trend, and seasonality, to name a few) is different. The different baselines make it inaccurate to detect anomalies using a single monolithic model. HCAD distinguishes itself from single-stream detectors by customizing anomaly detection models to entities.

Example use cases of HCAD include the following:

  • Internet of things – Continuously tracking the temperature of fridges and warning users of temperatures at which food or medicine longevity is at risk, so users can take measures to avoid them. Each entity has specific categorical fields that describe it, and you can think of the categorical fields as characteristics for those entities. A fridge’s serial number is the categorical field that uniquely identifies the fridges. Using a single model generates a lot of false alarms because ambient temperatures can be different. A temperature of 5° C is normal during winter in Seattle, US, but such a temperature in a tropical place during winter is likely anomalous. Also, users may open the door to a fridge several times, triggering a spike in the temperature. The duration and frequency of spikes can vary according to user behavior. HCAD can group temperature data into geographies and users to detect varying local temperatures and user behavior.
  • Security – An intrusion detection system identifying an increase in failed login attempts in authentication logs. The user name and host IP are the categorical fields used to determine the user accessing from the host. Hackers might guess user passwords by brute force, and not all users on the same host IP may be targeted. The number of failed login counts varies on a host for a particular user at a specific time of day. HCAD creates a representative baseline per user on each host and adapts to changes in the baseline.
  • IT operations – Monitoring access traffic by shard in a distributed service. The shard ID is the categorical field, and the entity is the shard. A modern distributed system usually consists of shards linked together. When a shard experiences an outage, the traffic increases significantly for dependent shards due to retry storms. It’s hard to discover the increase because only a limited number of shards are affected. For example, traffic on the related shards might be as much as 64 times that of normal levels, whereas average traffic across all shards might just grow by a small constant factor (less than 2).
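The arithmetic behind the IT operations example shows why an aggregate view hides the problem. The fleet size and the number of affected shards below are hypothetical; only the 64x spike and the "less than 2" average growth come from the text.

```python
n_shards = 1000     # hypothetical fleet size
affected = 10       # hypothetical number of shards hit by retry storms
spike_factor = 64   # from the text: up to 64 times normal traffic

# Normal shards carry 1 unit of traffic each; affected shards carry 64 units.
total_traffic = (n_shards - affected) * 1 + affected * spike_factor
avg_growth = total_traffic / n_shards
```

Under these assumptions the average traffic grows by a factor of about 1.63, so a single-stream detector watching the aggregate sees a modest change, while a per-shard model sees a 64x jump on each affected shard.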

Making HCAD real time and performant while achieving completeness and scalability is a formidable challenge:

  • Completeness – Model all or as many entities as possible.
  • Scalability – Horizontal and vertical scaling without changing model fidelity. That is, when scaling the machine up or out, an anomaly detector can add models monotonically. HCAD uses the same model and gives the same answer for an entity’s time series as in single-stream detection.
  • Performance – Low impact to system resource usage and high overall throughput.

The first release of HCAD in Amazon OpenSearch Service traded completeness and scalability for performance: the anomaly detector limited the number of entities to 1,000. You can change the setting plugins.anomaly_detection.max_entities_per_query to increase the number of monitored entities per interval. However, such a change incurs a non-negligible cost, which opens the door to cluster instability. Each entity uses memory to host models, disk I/O to read and write model checkpoints and anomaly results, CPU cycles for metadata maintenance and model training and inference, and garbage collection for deleted models and metadata. The more entities, the more resource usage. Furthermore, HCAD could suffer a combinatorial explosion of entities when supporting multiple categorical fields (a feature released in Amazon OpenSearch Service 1.1). Imagine a detector with only one categorical field geolocation. Geolocation has 1,000 possible values. Adding another categorical field product with 1,000 allowed values gives the detector 1 million entities.
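The entity explosion in this example is just the product of the field cardinalities, which is a worst case: the entities a detector actually has to model are the distinct value combinations that appear in the data. The sketch below shows both numbers; the sample records are made up for illustration.

```python
from math import prod

# Cardinalities from the example in the text.
cardinalities = {"geolocation": 1000, "product": 1000}
worst_case_entities = prod(cardinalities.values())

# Hypothetical records; an entity is one distinct (geolocation, product) pair.
records = [
    {"geolocation": "geo_1", "product": "prod_7"},
    {"geolocation": "geo_1", "product": "prod_7"},
    {"geolocation": "geo_2", "product": "prod_7"},
]
observed_entities = {(r["geolocation"], r["product"]) for r in records}
```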

For the next version of HCAD, we devoted much effort to improving completeness and scalability. Our approach covers sizing a cluster correctly and combines in-memory model hosting with on-disk model loading. Performance metrics show that HCAD doesn’t saturate the cluster or incur substantial cost, and still leaves plenty of room for other tasks. As a result, HCAD can analyze one million entities in 10 minutes and flag anomalies in different patterns. In this post, we explore how HCAD can analyze one million entities and the technical implementations behind the improvements.

How to size domains

Model management is a trade-off: disk-based solutions that reload, use, stop, and store models on every interval save memory but suffer high overhead and are hard to scale. Memory-based solutions offer lower overhead and higher throughput but typically increase memory requirements. We exploit the trade-off by implementing an adaptive mechanism that hosts as many models in memory as allowed (capped via the cluster setting plugins.anomaly_detection.model_max_size_percent) for best performance. When models don’t fit in memory, we process the extra model requests by loading models from disk.

The use of memory whenever possible is responsible for HCAD’s scalability. Therefore, it is crucial to size a cluster correctly so that it offers enough memory for HCAD. The main factors to consider when sizing a cluster are:

  • Sum of all detectors’ total entity count – A detector’s total entity count is the cardinality of the categorical fields. If there are multiple categorical fields, the number counts all unique combinations of values of these fields present in the data. You can determine the cardinality via a cardinality aggregation in Amazon OpenSearch Service. If the detector is a single-stream detector, the number of entities is one because there is no defined categorical field.
  • Heap size – Amazon OpenSearch Service sets aside 50% of RAM for heap. To determine the heap size of an instance type, refer to Amazon OpenSearch Service pricing. For example, an r5.2xlarge host has 64 GB RAM. Therefore, the host’s heap size is 32 GB.
  • Anomaly detection (AD) maximum memory percentage – AD can use up to 10% of the heap by default. You can customize the percentage via the cluster setting plugins.anomaly_detection.model_max_size_percent. The following update allows AD to use half of the heap via the aforementioned setting:
PUT /_cluster/settings
{
	"persistent": {
		"plugins.anomaly_detection.model_max_size_percent": "0.5"
	}
}
  • Entity in-memory model size – An entity’s in-memory model size varies according to shingle size, the number of features, and Amazon OpenSearch Service version as we’re constantly improving. All entity models of the same detector configuration in the same software version have the same size. A safe way to obtain the size is to run a profile API on the same detector configuration on an experimental cluster before creating a production cluster. In the following case, each entity model of detector fkzfBX0BHok1ZbMqLMdu is of size 470,491 bytes:

Enter the following profile request:

GET /_plugins/_anomaly_detection/detectors/fkzfBX0BHok1ZbMqLMdu/_profile/models

We get the following response:

{
	...{
		"model_id": "fkzfBX0BHok1ZbMqLMdu_entity_GOIubzeHCXV-k6y_AA4K3Q",
		"entity": [{
				"name": "host",
				"value": "host141"
			},
			{
				"name": "process",
				"value": "process54"
			}
		],
		"model_size_in_bytes": 470491,
		"node_id": "OcxBDJKYRYKwCLDtWUKItQ"
	}
	...
}
  • Storage requirement for result indexes – Real-time detectors store as many detection results as possible when the indexing pressure isn’t high, including both anomalous and non-anomalous results. When the indexing pressure is high, we save anomalous results and a random subset of non-anomalous results. OpenSearch Dashboards uses non-anomalous results as the context for abnormal results and plots the results as a function of time. Additionally, AD stores the history of all generated results for a configurable number of days. This result retention period is 30 days by default, and is adjustable via the cluster setting plugins.anomaly_detection.ad_result_history_retention_period. To ensure enough disk space is available to store the results, multiply the amount of data generated per day by the retention period. For example, consider a 10-minute interval detector with 1 million entities, which generates 1 million result documents per interval. One document’s size is about 1 KB. That’s roughly 1 GB per interval, or 144 GB per day over 144 intervals, and 4,320 GB over a 30-day retention period. The total disk requirement should also be multiplied by the number of shard copies. Currently, AD chooses one primary shard per node (up to 10) and one replica when called for the first time. Because the number of replicas is 1, every shard has two copies, and the total disk requirement is closer to 8,640 GB for the million entities in our example.
  • Anomaly detection overhead – AD incurs memory overhead for historical analyses and internal operations. We recommend reserving 20% more memory for the overhead to keep running models uninterrupted.
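The storage arithmetic in the list above can be sketched as a small helper. This is a minimal illustration rather than part of AD: the function name and parameters are hypothetical, and it assumes one result document per entity per interval and decimal units (1 GB = 1,000,000 KB), which is what reproduces the figures in the text.

```python
def result_index_disk_gb(entities, interval_minutes, doc_size_kb=1.0,
                         retention_days=30, replicas=1):
    """Estimate the disk space (in decimal GB) needed for AD result indexes."""
    intervals_per_day = 24 * 60 / interval_minutes
    # Assumption: every entity produces one result document per interval.
    gb_per_day = entities * intervals_per_day * doc_size_kb / 1_000_000
    shard_copies = 1 + replicas  # the primary plus its replicas
    return gb_per_day * retention_days * shard_copies


# 1 million entities, 10-minute interval, 1 KB documents, 30 days, 1 replica
print(result_index_disk_gb(1_000_000, 10))  # 8640.0
```

Under high indexing pressure AD keeps only a subset of non-anomalous results, so this estimate is best read as an upper bound.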

In order to derive the required number of data nodes D, we must first derive an expression for the number of entity models N that a node can host in memory. We define Si to be the entity model size of detector i. If we use an instance type with heap size H where the maximum AD memory percentage is P, then N is equal to the AD memory allowance divided by the maximum entity model size among all detectors:

$$N = \frac{H \times P}{\max_{i} S_i}$$

We consider the required number of data nodes D as a function of N. Let’s denote by Ci the total entity count of detector i. Given n detectors, it follows that:

$$D = \operatorname{ceil}\left(\frac{1.2 \times \sum_{i=1}^{n} C_i}{N}\right)$$

The extra 20% memory overhead that AD needs is expressed by the factor of 1.2 in the formula. The ceil function returns the smallest integer greater than or equal to its argument.

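For example, an r5.2xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance has 64 GB RAM, so the heap size is 32 GB. We configure AD to use at most half of the allowed heap size. We have two HCAD detectors, whose model sizes are 471 KB and 403 KB, respectively. To host 500,000 entities for each detector, we need a 36-data-node cluster according to the following calculation:

$$N = \frac{32\ \text{GB} \times 0.5}{471\ \text{KB}} \approx 33{,}970, \qquad D = \operatorname{ceil}\left(\frac{1.2 \times (500{,}000 + 500{,}000)}{33{,}970}\right) = \operatorname{ceil}(35.33) = 36$$
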
We also need to ensure there is enough disk space. In the end, we used a 39-node r5.2xlarge cluster (3 primary and 36 data nodes) with 4 TB Amazon Elastic Block Store (EBS) storage on each node.
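As a cross-check, the node count described above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) are assumed because they reproduce the 36-node figure.

```python
import math

GB = 1_000_000_000  # decimal units, an assumption that matches the example
KB = 1_000


def models_per_node(heap_bytes, ad_memory_fraction, model_sizes_bytes):
    """N: AD memory allowance divided by the largest entity model size."""
    return heap_bytes * ad_memory_fraction / max(model_sizes_bytes)


def required_data_nodes(entity_counts, n_per_node, overhead=1.2):
    """D: data nodes needed to host all entity models plus 20% overhead."""
    return math.ceil(overhead * sum(entity_counts) / n_per_node)


n = models_per_node(32 * GB, 0.5, [471 * KB, 403 * KB])
d = required_data_nodes([500_000, 500_000], n)
print(d)  # 36
```

The larger of the two model sizes drives N, so the estimate is conservative for the detector with the smaller model.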

What if a detector’s entity count is unknown?

Sometimes, it’s hard to know a detector’s entity count. We can check historical data and estimate the cardinality. But it’s impossible to predict the future accurately. A general guideline is to allocate buffer memory during planning. Appropriately used, buffer memory provides room for small changes. If the changes are significant, you can adjust the number of data nodes because HCAD can scale in and out horizontally.

What if the number of active entities is changing?

The total number of entities created can be higher than the number of active entities, as evident from the following two figures. The total number of entities in the HTTP logs dataset is 2 million within 2 months, but each entity only appears seven times on average. The number of active entities within a time-boxed interval is much less than 2 million. The following figure presents an example time series of network size of IP addresses from the HTTP logs dataset.

http log data distribution

The KPI dataset also shows similar behavior, where entities often appear in a short amount of time during bursts of entity activities.

kpi data distribution

AD requires large sample sizes to create a comprehensive picture of the data patterns, making it suitable for dense time series that can be uniformly sampled. AD can still train models and produce predictions if the preceding bursty behavior can last a while and provide at least 400 points. However, training becomes more difficult, and prediction accuracy is lower as data gets more sparse.

It’s wasteful to preallocate memory according to the total number of entities in this case. Instead of the total number of entities, we need to consider the maximal active entities within an interval. You can get an approximate number by using a date_histogram and cardinality aggregation pipeline, and sorting during a representative period. You can run the following query if you’re indexing host-cloudwatch and want to find out the maximal number of active hosts within a 10-minute interval throughout 10 days:

GET /host-cloudwatch/_search?size=0
{
	"query": {
		"range": {
			"@timestamp": {
				"gte": "2021-11-17T22:21:48",
				"lte": "2021-11-27T22:22:48"
			}
		}
	},
	"aggs": {
		"by_10m": {
			"date_histogram": {
				"field": "@timestamp",
				"fixed_interval": "10m"
			},
			"aggs": {
				"dimension": {
					"cardinality": {
						"field": "host"
					}
				},
				"multi_buckets_sort": {
					"bucket_sort": {
						"sort": [{
							"dimension": {
								"order": "desc"
							}
						}],
						"size": 1
					}
				}
			}
		}
	}
}

The query result shows that at most about 1,000 hosts are active during a ten-minute interval:

{
	...
	"aggregations": {
		"by_10m": {
			"buckets": [{
				"key_as_string": "2021-11-17T22:30:00.000Z",
				"key": 1637188200000,
				"doc_count": 1000000,
				"dimension": {
					"value": 1000
				}
			}]
		}
	}
	...
}
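If you run the query over a longer period or keep more than one bucket, the maximum can be read from the response programmatically. The following is a minimal sketch that works on the response body as a Python dict; the helper name is hypothetical, and the aggregation names by_10m and dimension are the ones used in the preceding query.

```python
def max_active_entities(response, histogram="by_10m", metric="dimension"):
    """Return the largest per-interval cardinality found in the response."""
    buckets = response["aggregations"][histogram]["buckets"]
    return max((bucket[metric]["value"] for bucket in buckets), default=0)


# The response shown above, reduced to the fields the helper reads
response = {
    "aggregations": {
        "by_10m": {
            "buckets": [{
                "key_as_string": "2021-11-17T22:30:00.000Z",
                "key": 1637188200000,
                "doc_count": 1000000,
                "dimension": {"value": 1000},
            }]
        }
    }
}
print(max_active_entities(response))  # 1000
```

Cardinality aggregations are approximate, so it is reasonable to treat the result as an estimate and leave some headroom.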

HCAD has a cache to store models and maintain a timestamp of last access for each model. For each model, an hourly job checks the time of inactivity and invalidates the model if the time of inactivity is longer than 1 hour. Depending on the timing of the hourly check and the cache capacity, the elapsed time a model is cached varies. If the cache capacity isn’t large enough to hold all non-expired models, we have an adapted least frequently used (LFU) cache policy to evict models (more on this in a later section), and the cache time of those invalidated models is less than 1 hour. If the last access time of a model is reset immediately after the hourly check, when the next hourly check happens, the model doesn’t expire. The model can take another hour to expire when the next hourly check comes. So the max cache time is 2 hours.

The upper bound of active entities that detector i can observe is:

$$B_i = A_i \times \operatorname{ceil}\left(\frac{120}{\Delta T_i}\right)$$

This equation has the following parameters:

  • Ai is the maximum number of active entities per interval of detector i. We get the number from the preceding query.
  • 120 is the number of minutes in 2 hours. ∆Ti denotes detector i’s interval in minutes. The ceil function returns the smallest integer greater than or equal to its argument. ceil(120÷∆Ti) is the maximum number of intervals for which a model is cached.

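Accordingly, we should account for Bi in the sizing formula by using it in place of Ci:

$$D = \operatorname{ceil}\left(\frac{1.2 \times \sum_{i=1}^{n} B_i}{N}\right)$$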
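The bound on active entities can be sketched in Python as follows. The function names are hypothetical, and the node-count helper assumes that Bi simply takes the place of the total entity count in the sizing calculation, with the same 20% overhead.

```python
import math


def active_entity_upper_bound(max_active_per_interval, interval_minutes,
                              max_cache_minutes=120):
    """Bi: max active entities per interval times the max cached intervals."""
    cached_intervals = math.ceil(max_cache_minutes / interval_minutes)
    return max_active_per_interval * cached_intervals


def required_data_nodes_for_active(bounds, models_per_node, overhead=1.2):
    """Data nodes needed when sizing by active entities (assumed formula)."""
    return math.ceil(overhead * sum(bounds) / models_per_node)


# About 1,000 active hosts per 10-minute interval, as in the query result
bound = active_entity_upper_bound(1000, 10)
print(bound)  # 12000
```

With roughly 34,000 models per node, as in the earlier r5.2xlarge example, 12,000 active entities fit on a single data node.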

Sizing calculation flow chart

With the definitions of calculating the number of data nodes in place, we can use the following flow chart to make decisions under different scenarios.

sizing flowchart

What if the cluster is underscaled?

If the cluster is underscaled, AD prioritizes more frequent and recent entities. AD makes its best effort to accommodate extra entities by loading their models on demand from disk without hosting them in the in-memory cache. Loading the models on demand means reloading, using, stopping, and storing models at every interval, which carries high overhead. The overhead mostly comes from network or disk I/O rather than from the cost of model inference. Therefore, AD processes these requests in a steady, controlled manner. If the system resource usage isn’t heavy and there is enough time, HCAD may finish processing the extra entities. Otherwise, HCAD doesn’t necessarily find all the anomalies it could otherwise find.

Example: Analysis of 1 million entities

In the following example, you will learn how to set up a detector to analyze one million entities.

Ingest data

We generated 10 billion documents for 1 million entities in our evaluation of scalability and completeness improvement. Each entity has a cosine wave time series with randomly injected anomalies. With help from the tips in this post, we created the index host-cloudwatch and ingested the documents into the cluster. host-cloudwatch records elapsed CPU and JVM garbage collection (GC) time by a process within a host. Index mapping is as follows:

{
	...
	"mappings": {
		"properties": {
			"@timestamp": {
				"type": "date"
			},
			"cpuTime": {
				"type": "double"
			},
			"jvmGcTime": {
				"type": "double"
			},
			"host": {
				"type": "keyword"
			},
			"process": {
				"type": "keyword"
			}
		}
	}
	...
}

Create a detector

Consider the following factors before you create a detector:

  • Indexes to monitor – You can use a group of index names, aliases, or patterns. Here we use the host-cloudwatch index created in the last step.
  • Timestamp field – A detector monitors time series data. Each document in the provided index must be associated with a timestamp. In our example, we use the @timestamp field.
  • Filter – A filter selects the data you want to analyze based on some condition. One example filter selects requests with status code 400 or above from HTTP request logs. The 4xx and 5xx classes of HTTP status code indicate that a request is returned with an error. Then you can create an anomaly detector for the number of error requests. In our running example, we analyze all of the data, and thus no filter is used.
  • Category field – Every entity has specific characteristics that describe it. Category fields provide categories of those characteristics. An entity can have up to two category fields as of Amazon OpenSearch Service 1.1. Here we monitor a specific process of a particular host by specifying the process and host field.
  • Detector interval – The detector interval is typically application-defined. We aggregate data within an interval and run models on the aggregated data. As mentioned earlier, AD is suitable for dense time series that can be uniformly sampled. You should at least make sure most intervals have data. Also, different detector intervals require different trade-offs between delay and accuracy. Long intervals smooth out long-term and short-term workload fluctuations and, therefore, may be less prone to noise, but they increase the delay in detection. Short intervals lead to quicker detection but may flag anticipated workload fluctuations instead of anomalies. You can plot your time series with various intervals and observe which interval keeps relevant anomalies while reducing noise. For this example, we use the default 10-minute interval.
  • Feature – A feature is an aggregated value extracted from the monitored data. It gets sent to models to measure the degrees of abnormality. Forming a feature can be as simple as picking a field to monitor and the aggregation function that summarizes the field data as metrics. We provide a suite of functions such as min and average. You can also use a runtime field via scripting. We’re interested in the garbage collection time field aggregated via the average function in this example.
  • Window delay – Ingestion delay. If the value isn’t configured correctly, a detector might analyze data before the late data arrives at the cluster. Because we ingested all the data in advance, the window delay is 0 in this case.

Our detector’s configuration aggregates average garbage collection processing time every 10 minutes and analyzes the average at the granularity of processes on different hosts. The API request to create such a detector is as follows. You can also use our streamlined UI to create and start a detector.

POST _plugins/_anomaly_detection/detectors
{
	"name": "detect_gc_time",
	"description": "detect gc processing time anomaly",
	"time_field": "@timestamp",
	"indices": [
		"host-cloudwatch"
	],
	"category_field": ["host", "process"],
	"feature_attributes": [{
		"feature_name": "jvmGcTime average",
		"feature_enabled": true,
		"importance": 1,
		"aggregation_query": {
			"gc_time_average": {
				"avg": {
					"field": "jvmGcTime"
				}
			}
		}
	}],
	"detection_interval": {
		"period": {
			"interval": 10,
			"unit": "MINUTES"
		}
	},
	"schema_version": 2
}

After the initial training is complete, all models of the 1 million entities are loaded in memory. After a few hours, 1 million results are generated every detector interval. To verify the number of active models in the cache, you can run the profile API:

GET /_plugins/_anomaly_detection/detectors/fkzfBX0BHok1ZbMqLMdu/_profile/models

We get the following response:

{
	...
	"model_count": 1000000
}

You can observe how many results are generated every detector interval (in our case 10 minutes) by invoking the result search API:

GET /_plugins/_anomaly_detection/detectors/results/_search
{
	"query": {
		"range": {
			"execution_start_time": {
				"gte": 1636501467000,
				"lte": 1636502067000
			}
		}
	},
	"track_total_hits": true
}

We get the following response:

{
	...
	"hits": {
		"total": {
			"value": 1000000,
			"relation": "eq"
		},
		...
	}
	...
}
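The two epoch-millisecond bounds in the request above are exactly one detector interval (600,000 ms) apart. If you want to check other intervals, a small helper can build the request body for you. This is a sketch with a hypothetical function name; it only constructs the JSON body and leaves sending it to the result search API to your HTTP client of choice.

```python
from datetime import datetime, timedelta, timezone


def results_in_interval_query(interval_end, interval_minutes=10):
    """Build a search body that counts results whose execution started
    within the interval ending at interval_end (timezone-aware datetime)."""
    interval_start = interval_end - timedelta(minutes=interval_minutes)

    def to_epoch_ms(moment):
        return round(moment.timestamp() * 1000)

    return {
        "query": {"range": {"execution_start_time": {
            "gte": to_epoch_ms(interval_start),
            "lte": to_epoch_ms(interval_end),
        }}},
        "track_total_hits": True,
    }


end = datetime.fromtimestamp(1636502067, tz=timezone.utc)
body = results_in_interval_query(end)
print(body["query"]["range"]["execution_start_time"])
# {'gte': 1636501467000, 'lte': 1636502067000}
```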

OpenSearch Dashboards shows the top entities producing the most severe anomalies or the greatest number of anomalies.

anomaly overview

You can choose a colored cell to review the details of anomalies occurring within that given period.

press anomaly

You can view anomaly grade, confidence, and the corresponding features in a shaded area.

feature graph

Create a monitor

You can create an alerting monitor to notify you of anomalies based on the defined anomaly detector, as shown in the following screenshot.

create monitor

We use anomaly grade and confidence to define a trigger. Both anomaly grade and confidence are values between 0 and 1.

Anomaly grade represents the severity of an anomaly. The closer the grade is to 1, the higher the severity. 0 grade means the corresponding prediction isn’t an anomaly.

Confidence measures whether an entity’s model has observed enough data such that the model contains enough unique, real-world data points. If the confidence value from one model is larger than the confidence from a different model, then the first model has observed more data.

Because we want to receive high fidelity alerts, we configured the grade threshold to be 0 and the confidence threshold to be 0.99.
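The trigger logic amounts to a simple predicate over the two values. The following Python sketch illustrates the idea only; the actual trigger is configured in the alerting UI, and the function name and the use of strict comparisons are assumptions.

```python
def should_alert(anomaly_grade, confidence,
                 grade_threshold=0.0, confidence_threshold=0.99):
    """Fire only for real anomalies reported by well-trained models."""
    # Assumption: both comparisons are strict ("above the threshold").
    return anomaly_grade > grade_threshold and confidence > confidence_threshold


print(should_alert(0.4, 0.995))  # True: an anomaly from a confident model
print(should_alert(0.0, 0.995))  # False: a grade of 0 is not an anomaly
print(should_alert(0.9, 0.80))   # False: the model has not seen enough data
```

A grade threshold of 0 lets every anomaly through, so the confidence threshold alone controls alert fidelity in this configuration.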

edit trigger

The final step of creating a monitor is to add an action on what to include in the notification. Our example detector finds anomalies at a particular process in a host. The notification message should contain the entity identity. In this example, we use ctx.results.0.hits.hits.0._source.entity to grab the entity identity.

edit action

A monitor based on a detector extracts the maximum grade anomaly and triggers an alert based on the configured grade and confidence threshold. The following is an example alert message:

Attention

Monitor detect_cpu_gc_time2-Monitor just entered alert status. Please investigate the issue.
- Trigger: detect_cpu_gc_time2-trigger
- Severity: 1
- Period start: 2021-12-08T01:01:15.919Z
- Period end: 2021-12-08T01:21:15.919Z
- Entity: {0={name=host, value=host107}, 1={name=process, value=process622}}

You can customize the extraction query and trigger condition by changing the monitor defining method to Extraction query monitor and modifying the corresponding query and condition. Here is the explanation of all anomaly result index fields you can query.

edit monitor

Evaluation

In this section, we evaluate HCAD’s precision, recall, and overall performance.

Precision and recall

We evaluated precision and recall over the cosine wave data, as mentioned earlier. Such evaluations aren’t easy in the context of real-time processing because only one point is available per entity during each detector interval (10 minutes in the example). Processing all the points takes a long time. Instead, we simulated real-time processing by fast-forwarding the processing in a script. The results are an average of 100 runs. The standard deviation is around 0.12.

The overall average precision, including the effects of cold start using linear interpolation, for the synthetic data is 0.57. The recall is 0.61. We note that no transformations were applied; it’s possible and likely that transformations improve these numbers. The precision is 0.09, and recall is 0.34 for the first 300 points due to interpolated cold start data for training. The numbers pick up as the model observes more real data. After another 5,000 real data points, the precision and recall improve to 0.57 and 0.63, respectively. We reiterate that the exact numbers vary based on the data characteristics—a different benchmark or detection configuration would have other numbers. Further, if there is no missing data, the fidelity of the HCAD model would be the same as that of a single-stream detector.
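For readers who want to run a similar evaluation on their own labeled data, precision and recall can be computed from per-point predictions as in the following sketch. The function name and the toy labels are illustrative and aren’t taken from the benchmark described above.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from parallel sequences of 0/1 flags."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    flagged = true_pos + false_pos
    real = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / real if real else 0.0
    return precision, recall


predicted = [1, 1, 0, 1, 0, 0]  # points the detector flagged
actual = [1, 0, 0, 1, 1, 0]     # points where an anomaly was injected
precision, recall = precision_recall(predicted, actual)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```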

Performance

We ran HCAD on an idle cluster without ingestion or search traffic. Metrics such as JVM memory pressure and CPU of each node are well within the safe zone, as shown in the following screenshots. JVM memory pressure varies between 23–39%. CPU is mostly around 1%, with hourly spikes up to 65%. An internal hourly maintenance job accounts for the spikes: it saves hundreds of thousands of model checkpoints, clears unused models, and performs bookkeeping for internal states. Reducing these spikes is a possible future improvement.

jvm memory pressure

cpu

Implementation

We next discuss the specifics of the technical work that is germane to HCAD’s completeness and scalability.

RCF 2.0

In Amazon OpenSearch Service 1.1, we integrated with Random Cut Forest library (RCF) 2.0. RCF is based on partitioning data into different bounding boxes. The previous RCF version maintains bounding boxes in memory. However, a real-time detector only uses the bounding boxes when processing a new data point and leaves them dormant most of the time. RCF 2.0 allows for recreating those bounding boxes when required so that bounding boxes are present in memory when processing the corresponding input. The on-demand recreation has led to a ninefold reduction in memory overhead, so a node can host nine times as many models. In addition, RCF 2.0 revamps the serialization module. The new module serializes and deserializes a model 26 times faster and uses 20 times less disk space.

Pagination

Regarding feature aggregation, we switched from getting top hits using terms aggregation to pagination via composite aggregation. We evaluated multiple pagination implementations using a generated dataset with 1 million entities. Each entity has two documents. The experiment configurations can vary according to the number of data nodes, primary shards, and categorical fields. We believe composite queries are the right choice because even though they may not be the fastest in all cases, they’re the most stable on average (40 seconds).

Amortize expensive operations

HCAD can face thundering herd traffic, in which many entities make requests like reading checkpoints from disks at approximately the same time. Therefore, we create various queues to buffer pent-up requests. These queues amortize expensive costs by performing a small and bounded amount of work steadily. Therefore, HCAD can offer predictable performance and availability at scale.
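The buffering idea can be illustrated with a bounded queue that releases a fixed-size batch on every scheduling tick. This is a simplified sketch and not HCAD’s implementation: the class name, the capacity and batch parameters, and the choice to reject requests when the buffer is full are all assumptions.

```python
from collections import deque


class BoundedRequestQueue:
    """Buffers requests and releases a bounded batch per scheduling tick."""

    def __init__(self, capacity, batch_size):
        self.capacity = capacity
        self.batch_size = batch_size
        self._pending = deque()

    def offer(self, request):
        """Queue a request; return False when the buffer is full."""
        if len(self._pending) >= self.capacity:
            return False  # assumption: overflow is rejected, not blocked
        self._pending.append(request)
        return True

    def drain(self):
        """Return at most batch_size requests, oldest first."""
        count = min(self.batch_size, len(self._pending))
        return [self._pending.popleft() for _ in range(count)]


queue = BoundedRequestQueue(capacity=4, batch_size=2)
accepted = [queue.offer(f"checkpoint-read-{i}") for i in range(5)]
print(accepted)       # [True, True, True, True, False]
print(queue.drain())  # ['checkpoint-read-0', 'checkpoint-read-1']
```

Because each tick performs at most batch_size units of work, a burst of requests turns into a steady stream instead of a spike.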

In-memory cache

HCAD relies on caching to process entities whose memory requirement is larger than the configured memory size. At first, we tried a least recently used (LRU) cache but experienced thrashing when running the HTTP logs workload: with 100 1-minute interval detectors and millions of entities for each detector, we saw few cache hits (many hundreds) within 7 hours. We were wasting CPU cycles swapping models in and out of memory all the time. As a general rule, caching isn’t worth considering for quick model access when the hit-to-miss ratio is worse than 3:1.

Instead, we turned to a modified LFU cache, augmented with a heavy-hitter approximation. A decayed count is maintained for each model in the cache and is incremented when the model is accessed. The model with the smallest decayed count is the least frequently used model. When the cache reaches its capacity, it invalidates and removes the least frequently used model if the new entity’s frequency is no smaller than that of the least frequently used entity. This connection between heavy-hitter approximation and traditional LFU allows us to keep the more frequent and recent models sticky in memory and phase out models with lower cache hit probabilities.
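The eviction rule can be sketched as follows. This is an illustration of the idea rather than HCAD’s code: the class name, the multiplicative decay applied on each tick, and the unbounded dictionary of counts (a real heavy-hitter approximation would bound it) are assumptions.

```python
class DecayedLFUCache:
    """LFU cache that admits a new entity only if it is at least as
    frequent as the least frequently used cached entity."""

    def __init__(self, capacity, decay=0.5):
        self.capacity = capacity
        self.decay = decay
        self.models = {}  # entity -> model held in memory
        self.counts = {}  # entity -> decayed access count (assumed unbounded)

    def tick(self):
        """Age all counts, for example once per detector interval."""
        for entity in self.counts:
            self.counts[entity] *= self.decay

    def access(self, entity, load_model):
        """Return (model, cache_hit), loading from disk on a miss."""
        self.counts[entity] = self.counts.get(entity, 0.0) + 1.0
        if entity in self.models:
            return self.models[entity], True
        model = load_model(entity)  # stands in for a checkpoint read
        if len(self.models) < self.capacity:
            self.models[entity] = model
        else:
            victim = min(self.models, key=self.counts.__getitem__)
            if self.counts[entity] >= self.counts[victim]:
                del self.models[victim]
                self.models[entity] = model
        return model, False


cache = DecayedLFUCache(capacity=2)
load = lambda entity: f"model-{entity}"
for entity in ["a", "a", "a", "b", "b"]:
    cache.access(entity, load)
cache.access("c", load)      # count 1 < 2: "c" is served but not cached
print(sorted(cache.models))  # ['a', 'b']
cache.access("c", load)      # count 2 >= 2: "c" replaces "b"
print(sorted(cache.models))  # ['a', 'c']
```

A one-off entity never displaces a frequently used model, which is what prevents the thrashing seen with the least recently used policy.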

Fault tolerance

Unrecoverable in-memory state is limited, and enough information about models is stored on disk for crash resilience. Models are recovered on a different host after a crash is detected.

High performance

HCAD builds on asynchronous I/O: all I/O requests such as network calls or disk accesses are non-blocking. In addition, model distribution is balanced across the cluster using a consistent hash ring.
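For readers unfamiliar with consistent hashing, the following sketch shows how model IDs can be mapped to nodes so that removing a node only moves the models it owned. It is a generic illustration, not the plugin’s implementation; the class name, the use of SHA-256, and the number of virtual nodes are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps model IDs to nodes using a hash ring with virtual nodes."""

    def __init__(self, nodes, virtual_nodes=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, model_id):
        """Return the owner of the first ring point after the model's hash."""
        index = bisect.bisect(self._points, self._hash(model_id))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
owner = ring.node_for("fkzfBX0BHok1ZbMqLMdu_entity_GOIubzeHCXV-k6y_AA4K3Q")
print(owner in {"node-1", "node-2", "node-3"})  # True
```

Virtual nodes spread each physical node over many points on the ring, which keeps the model distribution roughly even.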

Summary

We enhanced HCAD to improve its scalability and completeness without altering the fidelity of the computation. In this post, we showed how to size an OpenSearch domain and use HCAD to monitor 1 million entities in 10 minutes. To learn more about HCAD, see the anomaly detection documentation.

If you have feedback about this post, submit comments in the comments section below. If you have questions about this post, start a new thread on the Machine Learning forum.


About the Author


Kaituo Li is an engineer in Amazon OpenSearch Service. He has worked on distributed systems, applied machine learning, monitoring, and database storage in Amazon. Before Amazon, Kaituo was a PhD student in Computer Science at University of Massachusetts, Amherst. He likes reading and sports.

Analyze AWS WAF logs using Amazon OpenSearch Service anomaly detection built on Random Cut Forests

Post Syndicated from Umesh Ramesh original https://aws.amazon.com/blogs/security/analyze-aws-waf-logs-using-amazon-opensearch-service-anomaly-detection-built-on-random-cut-forests/

This blog post shows you how to use the machine learning capabilities of Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) to detect and visualize anomalies in AWS WAF logs. AWS WAF logs are streamed to Amazon OpenSearch Service using Amazon Kinesis Data Firehose. Kinesis Data Firehose invokes an AWS Lambda function to transform incoming source data and deliver the transformed data to Amazon OpenSearch Service. You can implement this solution without any machine learning expertise. AWS WAF logs capture a number of attributes about the incoming web request, and you can analyze these attributes to detect anomalous behavior. This blog post focuses on the following two scenarios:

  • Identifying anomalous behavior based on a high number of web requests coming from an unexpected country (Country Code is one of the request fields captured in AWS WAF logs).
  • Identifying anomalous behavior based on HTTP method for a read-heavy application like a content media website that receives unexpected write requests.

Log analysis is essential for understanding the effectiveness of any security solution. It helps with day-to-day troubleshooting, and also with long-term understanding of how your security environment is performing.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits which could affect application availability, compromise security, or consume excessive resources. AWS WAF gives you control over which traffic sent to your web applications is allowed or blocked, by defining customizable web security rules. AWS WAF lets you define multiple types of rules to block unauthorized traffic.

Machine learning can assist in identifying unusual or unexpected behavior. Amazon OpenSearch Service is one of the commonly used services which offer log analytics for monitoring service logs, using dashboards and alerting mechanisms. Static, rule‑based analytics approaches are slow to adapt to evolving workloads, and can miss critical issues. With the announcement of real-time anomaly detection support in Amazon OpenSearch Service, you can use machine learning to detect anomalies in real‑time streaming data, and identify issues as they evolve so you can mitigate them quickly. Real‑time anomaly detection support uses Random Cut Forest (RCF), an unsupervised algorithm, which continuously adapts to evolving data patterns. Simply stated, RCF takes a set of random data points, divides them into multiple groups, each with the same number of points, and then builds a collection of models. As an unsupervised algorithm, RCF uses cluster analysis to detect spikes in time series data, breaks in periodicity or seasonality, and data point exceptions. The anomaly detection feature is lightweight, with the computational load distributed across Amazon OpenSearch Service nodes. Figure 1 shows the architecture of the solution described in this blog post.

Figure 1: End-to-end architecture


The architecture flow shown in Figure 1 includes the following high-level steps:

  1. AWS WAF streams logs to Kinesis Data Firehose.
  2. Kinesis Data Firehose invokes a Lambda function to add attributes to the AWS WAF logs.
  3. Kinesis Data Firehose sends the transformed source records to Amazon OpenSearch Service.
  4. Amazon OpenSearch Service automatically detects anomalies.
  5. Amazon OpenSearch Service delivers anomaly alerts via Amazon Simple Notification Service (Amazon SNS).

Solution

Figure 2 shows examples of both an original and a modified AWS WAF log. The solution in this blog post focuses on Country and httpMethod. It uses a Lambda function to transform the AWS WAF log by adding fields, as shown in the snippet on the right side. The values of the newly added fields are evaluated based on the values of country and httpMethod in the AWS WAF log.

Figure 2: Sample processing done by a Lambda function


In this solution, you will use a Lambda function to introduce new fields to the incoming AWS WAF logs through Kinesis Data Firehose. You will introduce additional fields by using one-hot encoding to represent the incoming categorical values as a “1” or “0”.

Scenario 1

In this scenario, the goal is to detect traffic from unexpected countries when serving user traffic expected to be from the US and UK. The function adds three new fields:

usTraffic
ukTraffic
otherTraffic

As shown in the Lambda function code, we use the traffic_from_country function, which considers only requests whose action is ALLOW. For those requests, we use conditions to check the country code. If the value of the country field in the web request captured in the AWS WAF log is US, the usTraffic field in the transformed data is assigned the value 1 while otherTraffic and ukTraffic are assigned the value 0. The other two fields are transformed as shown in Table 1.

Original AWS WAF log       Transformed AWS WAF log with new fields after one-hot encoding
Country                    usTraffic    ukTraffic    otherTraffic
US                         1            0            0
UK                         0            1            0
All other country codes    0            0            1

Table 1: One-hot encoding field mapping for country

Scenario 2

In the second scenario, you detect anomalous requests that use the POST HTTP method.

As shown in the Lambda function code, we use the filter_http_request_method function, which considers only requests whose action is ALLOW. For those requests, we use conditions to check the HTTP request method. If the value of the HTTP method in the AWS WAF log is GET, the getHttpMethod field is assigned the value 1 while headHttpMethod and postHttpMethod are assigned the value 0. The other two fields are transformed as shown in Table 2.

Original AWS WAF log    Transformed AWS WAF log with new fields after one-hot encoding
HTTP method             getHttpMethod    headHttpMethod    postHttpMethod
GET                     1                0                 0
HEAD                    0                1                 0
POST                    0                0                 1

Table 2: One-hot encoding field mapping for HTTP method

After adding these new fields, the transformed record from Lambda must contain the following parameters before the data is sent back to Kinesis Data Firehose:

recordId The transformed record must contain the same original record ID as is received from the Kinesis Data Firehose.
result The status of the data transformation of the record (the status can be Ok, Dropped, or ProcessingFailed).
data The transformed data payload.
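
The three parameters above can be sketched as a per-record helper. This is a minimal illustration, not the code used in this post: transform_record and the enrich callable are hypothetical names, and returning ProcessingFailed for an unreadable record is one possible design choice.

```python
import base64
import json


def transform_record(record, enrich):
    """Return a Firehose-style output record for one input record.

    `enrich` is a hypothetical callable that takes the decoded log (a dict)
    and returns the dict with the extra one-hot fields added.
    """
    try:
        message = json.loads(base64.b64decode(record["data"]))
        payload = json.dumps(enrich(message)).encode("utf-8")
        return {
            "recordId": record["recordId"],  # must match the incoming ID
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        }
    except (KeyError, ValueError, TypeError):
        # Hand the original payload back, flagged as failed, so one bad
        # record does not affect the rest of the batch.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "data": record.get("data", ""),
        }
```

Handling each record separately means a single malformed log line is reported as failed while the other records in the batch are still delivered.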

AWS WAF logs are JSON files, and this anomaly detection feature works only on numeric data. This means that to use this feature for detecting anomalies in logs, you must pre-process your logs using a Lambda function.

Lambda function for one-hot encoding

Use the following Lambda function to transform the AWS WAF log by adding new attributes, as explained in Scenario 1 and Scenario 2.

import base64
import json

def lambda_handler(event,context):
    output = []
    
    try:
        # loop through records in incoming Event
        for record in event["records"]:
            # extract message
            message = json.loads(base64.b64decode(record["data"]))
            
            print('Country: ', message["httpRequest"]["country"])
            print('Action: ', message["action"])
            print('User Agent: ', message["httpRequest"]["headers"][1]["value"])
             
            timestamp = message["timestamp"]
            action = message["action"]
            country = message["httpRequest"]["country"]
            user_agent = message["httpRequest"]["headers"][1]["value"]
            http_method = message["httpRequest"]["httpMethod"]
            
            mobileUserAgent, browserUserAgent = filter_user_agent(user_agent)
            usTraffic, ukTraffic, otherTraffic = traffic_from_country(country, action)
            getHttpMethod, headHttpMethod, postHttpMethod = filter_http_request_method(http_method, action)
            
            # append new fields in message dict
            message["usTraffic"] = usTraffic
            message["ukTraffic"] = ukTraffic
            message["otherTraffic"] = otherTraffic
            message["mobileUserAgent"] = mobileUserAgent
            message["browserUserAgent"] = browserUserAgent
            message["getHttpMethod"] = getHttpMethod
            message["headHttpMethod"] = headHttpMethod
            message["postHttpMethod"] = postHttpMethod
            
            # base64-encoding
            data = base64.b64encode(json.dumps(message).encode('utf-8'))
            
            output_record = {
                "recordId": record['recordId'], # retain same record id from the Kinesis data Firehose
                "result": "Ok",
                "data": data.decode('utf-8')
            }
            output.append(output_record)
        return {"records": output}
    except Exception as e:
        print(e)
        raise # re-raise so the failure is reported instead of silently returning None
        
def filter_user_agent(user_agent):
    # returns one-hot encoding (mobileUserAgent, browserUserAgent) based on user agent
    if "Mobile" in user_agent:
        return (1, 0)
    return (0, 1) # anomaly recorded

def traffic_from_country(country_code, action):
    # returns one-hot encoding (usTraffic, ukTraffic, otherTraffic) for allowed traffic
    if action == "ALLOW":
        if "US" in country_code:
            return (1, 0, 0)
        elif "UK" in country_code:
            return (0, 1, 0)
        else:
            return (0, 0, 1) # anomaly recorded
    return (0, 0, 0) # requests that were not allowed are not encoded

def filter_http_request_method(http_method, action):
    # returns one-hot encoding (getHttpMethod, headHttpMethod, postHttpMethod) for allowed traffic
    if action == "ALLOW":
        if "GET" in http_method:
            return (1, 0, 0)
        elif "HEAD" in http_method:
            return (0, 1, 0)
        elif "POST" in http_method:
            return (0, 0, 1) # anomaly recorded
    return (0, 0, 0) # request not allowed, or a method other than GET, HEAD, or POST
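
To try the function locally before deploying it, you can wrap a sample log in the record envelope that Kinesis Data Firehose passes to Lambda. The following is a minimal sketch; make_firehose_event, decode_records, the record IDs, and the sample log values are hypothetical names and data chosen for illustration.

```python
import base64
import json


def make_firehose_event(messages):
    """Wrap log dicts in the record envelope that Firehose passes to Lambda."""
    return {
        "records": [
            {
                "recordId": f"test-{i}",  # hypothetical IDs for local testing
                "data": base64.b64encode(json.dumps(m).encode("utf-8")).decode("utf-8"),
            }
            for i, m in enumerate(messages)
        ]
    }


def decode_records(response):
    """Decode the 'data' payload of each record back into a dict."""
    return [json.loads(base64.b64decode(r["data"])) for r in response["records"]]


# A trimmed-down log containing only the fields the handler reads.
sample_log = {
    "timestamp": 1600000000000,
    "action": "ALLOW",
    "httpRequest": {
        "country": "US",
        "httpMethod": "GET",
        "headers": [
            {"name": "Host", "value": "example.com"},
            {"name": "User-Agent", "value": "Mozilla/5.0 (Mobile)"},
        ],
    },
}
event = make_firehose_event([sample_log])
# With the handler in scope: decode_records(lambda_handler(event, None))
```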

After the transformation, the data that’s delivered to Amazon OpenSearch Service will have additional fields, as described in Table 1 and Table 2 above. You can configure an anomaly detector in Amazon OpenSearch Service to monitor these additional fields. The algorithm computes an anomaly grade and confidence score value for each incoming data point. Anomaly detection uses these values to differentiate an anomaly from normal variations in your data. Anomaly detection and alerting are plugins that are included in the available set of Amazon OpenSearch Service plugins. You can use these two plugins to generate a notification as soon as an anomaly is detected.

Deployment steps

In this section, you complete five high-level steps to deploy the solution. In this blog post, we are deploying this solution in the us-east-1 Region. The solution assumes you already have an active web application protected by AWS WAF rules. If you’re looking for details on creating AWS WAF rules, refer to Working with web ACLs and sample examples for more information.

Note: When you associate a web ACL with Amazon CloudFront as a protected resource, make sure that the Kinesis Firehose Delivery Stream is deployed in the us-east-1 Region.

The steps are:

  1. Deploy an AWS CloudFormation template
  2. Enable AWS WAF logs
  3. Create an anomaly detector
  4. Set up alerts in Amazon OpenSearch Service
  5. Create a monitor for the alerts

Deploy a CloudFormation template

To start, deploy a CloudFormation template to create the following AWS resources:

  • Amazon OpenSearch Service and Kibana (versions 1.5 to 7.10) with built-in AWS WAF dashboards.
  • Kinesis Data Firehose streams
  • A Lambda function for data transformation and an Amazon SNS topic with email subscription. 

To deploy the CloudFormation template

  1. Download the CloudFormation template and save it locally as Amazon-ES-Stack.yaml.
  2. Go to the AWS Management Console and open the CloudFormation console.
  3. Choose Create Stack.
  4. On the Specify template page, choose Upload a template file. Then select Choose File, and select the template file that you downloaded in step 1.
  5. Choose Next.
  6. Provide the Parameters:
    1. Enter a unique name for your CloudFormation stack.
    2. Update the email address for UserEmail with the address you want alerts sent to.
    3. Choose Next.
  7. Review and choose Create stack.
  8. When the CloudFormation stack status changes to CREATE_COMPLETE, go to the Outputs tab and make note of the DashboardLinkOutput value. Also note the credentials you’ll receive by email (Subject: Your temporary password) and subscribe to the SNS topic for which you’ll also receive an email confirmation request.

Enable AWS WAF logs

Before enabling the AWS WAF logs, you should have AWS WAF web ACLs set up to protect your web application traffic. From the console, open the AWS WAF service and choose your existing web ACL. Open your web ACL resource, which can either be deployed on an Amazon CloudFront distribution or on an Application Load Balancer.

To enable AWS WAF logs

  1. From the AWS WAF home page, choose Create web ACL.
  2. From the AWS WAF home page, choose  Logging and metrics
  3. From the AWS WAF home page, choose the web ACL for which you want to enable logging, as shown in Figure 3:
    Figure 3 – Enabling WAF logging

  4. Go to the Logging and metrics tab, and then choose Enable Logging. The next page displays all the delivery streams that start with aws-waf-logs. Choose the Kinesis Data Firehose delivery stream that was created by the CloudFormation template, as shown in Figure 3 (in this example, aws-waf-logs-useast1). Don’t redact any fields or add filters. Select Save.

Create an Index template

Index templates let you initialize new indices with a predefined mapping. For example, in this case you predefine the mapping for the timestamp field.

To create an Index template

  • Log into the Kibana dashboard. You can find the Kibana dashboard link in the Outputs tab of the CloudFormation stack. You should have received the username and temporary password (Ignore the period (.) at the end of the temporary password) by email, at the email address you entered as part of deploying the CloudFormation template. You will be logged in to the Kibana dashboard after setting a new password.
  • Choose Dev Tools in the left menu panel to access Kibana’s console.
  • The left pane in the console is the request pane, and the right pane is the response pane.
  • Select the green arrow at the end of the command line to execute the following PUT command.
    PUT _template/awswaf
    {
      "index_patterns": ["awswaf-*"],
      "settings": {
        "number_of_shards": 1
      },
      "mappings": {
        "properties": {
          "timestamp": {
            "type": "date",
            "format": "epoch_millis"
          }
        }
      }
    }

  • You should see the following response:
    {
      "acknowledged": true
    }

The command creates a template named awswaf and applies it to any new index whose name matches the wildcard pattern awswaf-*.
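
To illustrate which index names the pattern covers, the sketch below approximates the wildcard match with Python's fnmatch module. The helper template_applies and the index names are hypothetical, and fnmatch is only a stand-in for the matching that the service performs.

```python
from fnmatch import fnmatchcase


def template_applies(index_name, patterns=("awswaf-*",)):
    """Return True if any wildcard pattern matches the index name."""
    return any(fnmatchcase(index_name, p) for p in patterns)


for name in ("awswaf-2021-06-01", "awswaf", "applogs-awswaf-1"):
    print(name, template_applies(name))
```

Only names that begin with the awswaf- prefix match, so an index named awswaf (without the hyphen) would not receive the timestamp mapping.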

Create an anomaly detector

A detector is an individual anomaly detection task. You can create multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources.

To create an anomaly detector

  1. Select Anomaly Detection from the menu bar, select Detectors and Create Detector.
    Figure 4- Home page view with menu bar on the left

  2. To create a detector, enter the following values and features:

    Name and description

    Name: aws-waf-country
    Description: Detect anomalies on other country values apart from “US” and “UK”

    Data Source

    Index: awswaf*
    Timestamp field: timestamp
    Data filter: Visual editor
    Figure 5 – Detector features and their values

  3. For Detector operation settings, enter a value in minutes for the Detector interval to set the time interval at which the detector collects data. To add extra processing time for data collection, set a Window delay value (also in minutes). This tells the detector that the data isn’t ingested into Amazon OpenSearch Service in real time, but with a delay. The example in Figure 6 uses a 1-minute interval and a 2-minute delay.
    Figure 6 – Detector operation settings

  4. Next, select Create.
  5. Once you create a detector, select Configure Model and add the following values to Model configuration:

    Feature Name: waf-country-other
    Feature State: Enable feature
    Find anomalies based on: Field value
    Aggregation method: sum()
    Field: otherTraffic

    The aggregation method determines what constitutes an anomaly. For example, if you choose min(), the detector focuses on finding anomalies based on the minimum values of your feature. If you choose average(), the detector finds anomalies based on the average values of your feature. For this scenario, you will use sum(). The value otherTraffic for Field is the transformed field in the Amazon OpenSearch Service logs that was added by the Lambda function.

    Figure 7 – Detector Model configuration

  6. Under Advanced Settings on the Model configuration page, update the Window size to an appropriate interval (1 equals 1 minute) and choose Save and Start detector and Automatically start detector.

    We recommend you choose this value based on your actual data. If you expect missing values in your data, or if you want the anomalies based on the current value, choose 1. If your data is continuously ingested and you want the anomalies based on multiple intervals, choose a larger window size.

    Note: The detector takes 4 to 5 minutes to start. 

    Figure 8 – Detector window size
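
The detector interval, window delay, and sum() aggregation described in the steps above can be pictured with a small sketch. This is a simplified model for illustration only, not the service's implementation; detection_window and sum_feature are hypothetical helpers, and the defaults mirror the 1-minute interval and 2-minute delay from Figure 6.

```python
from datetime import datetime, timedelta


def detection_window(run_time, interval_minutes=1, delay_minutes=2):
    """Return the (start, end) of the data window analyzed at run_time."""
    end = run_time - timedelta(minutes=delay_minutes)
    start = end - timedelta(minutes=interval_minutes)
    return start, end


def sum_feature(docs, start, end, field="otherTraffic"):
    """Sum a one-hot field over documents whose timestamp is in [start, end)."""
    return sum(d.get(field, 0) for d in docs if start <= d["timestamp"] < end)


run = datetime(2021, 1, 1, 12, 0)
start, end = detection_window(run)
docs = [
    {"timestamp": datetime(2021, 1, 1, 11, 57, 10), "otherTraffic": 1},
    {"timestamp": datetime(2021, 1, 1, 11, 57, 40), "otherTraffic": 0},
    {"timestamp": datetime(2021, 1, 1, 11, 59, 0), "otherTraffic": 1},
]
print(start, end, sum_feature(docs, start, end))
```

Because otherTraffic is 1 only for requests from unexpected countries, the per-interval sum is effectively a count of such requests, which is the series the detector evaluates.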

Set up alerts

You’ll use Amazon SNS as a destination for alerts from Amazon OpenSearch Service.

Note: A destination is a reusable location for an action.

To set up alerts:
  1. Go to the Kibana main menu bar and select Alerting, and then navigate to the Destinations tab.
  2. Select Add destination and enter a unique name for the destination.
  3. For Type, choose Amazon SNS and provide the topic ARN that was created as part of the CloudFormation resources (captured in the Outputs tab).
  4. Provide the ARN for an IAM role that was created as part of the CloudFormation outputs (SNSAccessIAMRole-********) that has the following trust relationship and permissions (at a minimum):
    {"Version": "2012-10-17",
      "Statement": [{"Effect": "Allow",
        "Principal": {"Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }]
    }
    {"Version": "2012-10-17",
      "Statement": [{"Effect": "Allow",
        "Action": "sns:Publish",
        "Resource": "sns-topic-arn"
      }]
    }

    Figure 9 – Destination

    Note: For more information, see Adding IAM Identity Permissions in the IAM user guide.

  5. Choose Create.

Create a monitor

A monitor can be defined as a job that runs on a defined schedule and queries Amazon OpenSearch Service. The results of these queries are then used as input for one or more triggers.

To create a monitor for the alert

  1. Select Alerting on the Kibana main menu and navigate to the Monitors tab. Select Create monitor
  2. Create a new record with the following values:

    Monitor Name: aws-waf-country-monitor
    Method of definition: Define using anomaly detector
    Detector: aws-waf-country
    Monitor schedule: Every 2 minutes
  3. Select Create.
    Figure 10 – Create monitor

  4. Choose Create Trigger to connect monitoring alert with the Amazon SNS topic using the below values:

    Trigger Name: SNS_Trigger
    Severity Level: 1
    Trigger Type: Anomaly Detector grade and confidence

    Under Configure Actions, set the following values:

    Action Name: SNS-alert
    Destination: select the destination name you chose when you created the destination in the Set up alerts section
    Message Subject: “Anomaly detected – Country”
    Message: <Use the default message displayed>
  5. Select Create to create the trigger.
    Figure 11 – Create trigger

    Figure 12 – Configure actions
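
The trigger type used above evaluates the anomaly grade and confidence that the detector reports. As a rough sketch of that kind of condition, the function below fires only when both values exceed a threshold. The function name and the 0.7 thresholds are assumptions for illustration; use whatever condition your trigger is actually configured with.

```python
def should_alert(grade, confidence, min_grade=0.7, min_confidence=0.7):
    """Return True when both the anomaly grade and the confidence are high enough.

    Both inputs are expected in the range 0 to 1. The thresholds are
    illustrative assumptions, not values prescribed by this post.
    """
    return grade > min_grade and confidence > min_confidence


print(should_alert(0.9, 0.95))  # high grade, high confidence
print(should_alert(0.9, 0.50))  # high grade, but the model is not yet confident
```

Requiring both conditions helps suppress alerts while the model is still learning and its confidence is low.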

Test the solution

Now that you’ve deployed the solution, the AWS WAF logs will be sent to Amazon OpenSearch Service.

Kinesis Data Generator sample template

When testing the environment covered in this blog outside a production context, we used Kinesis Data Generator to generate sample user traffic with the template below, changing the country strings in different runs to reflect expected records or anomalous ones. Other tools are also available.

{
"timestamp":"[{{date.now("DD/MMM/YYYY:HH:mm:ss Z")}}]",
"formatVersion":1,
"webaclId":"arn:aws:wafv2:us-east-1:066931718055:regional/webacl/FMManagedWebACLV2test-policy1596636761038/3b9e0dde-812c-447f-afe7-2dd16658e746",
"terminatingRuleId":"Default_Action",
"terminatingRuleType":"REGULAR",
"action":"ALLOW",
"terminatingRuleMatchDetails":[
],
"httpSourceName":"ALB",
"httpSourceId":"066931718055-app/Webgoat-ALB/d1b4a2c257e57f2f",
"ruleGroupList":[
{
"ruleGroupId":"AWS#AWSManagedRulesAmazonIpReputationList",
"terminatingRule":null,
"nonTerminatingMatchingRules":[
],
"excludedRules":null
}
],
"rateBasedRuleList":[
],
"nonTerminatingMatchingRules":[
],
"httpRequest":{
"clientIp":"{{internet.ip}}",
"country":"{{random.arrayElement(
["US","UK"]
)}}",
"headers":[
{
"name":"Host",
"value":"34.225.62.38"
},
{
"name":"User-Agent",
"value":"{{internet.userAgent}}"
},
{
"name":"Accept",
"value":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
},
{
"name":"Accept-Language",
"value":"en-GB,en;q=0.5"
},
{
"name":"Accept-Encoding",
"value":"gzip, deflate"
},
{
"name":"Upgrade-Insecure-Requests",
"value":"1"
}
],
"uri":"/config/getuser",
"args":"index=0",
"httpVersion":"HTTP/1.1",
"httpMethod":"{{random.arrayElement(
["GET","HEAD"]
)}}",
"requestId":null
}
}
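
If you prefer to script your test traffic rather than use Kinesis Data Generator, a similar record can be produced in Python. The sketch below is an assumption-laden stand-in for the template above: sample_waf_record is a hypothetical helper, it emits only the fields the Lambda function reads, and the country code FR is just an example of an unexpected value.

```python
import json
import random
import time


def sample_waf_record(countries=("US", "UK"), methods=("GET", "HEAD"), rng=random):
    """Build a trimmed-down AWS WAF log record with random country and method."""
    return {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "action": "ALLOW",
        "httpRequest": {
            "country": rng.choice(countries),
            "httpMethod": rng.choice(methods),
            "headers": [
                {"name": "Host", "value": "example.com"},
                {"name": "User-Agent", "value": "Mozilla/5.0"},
            ],
        },
    }


expected = sample_waf_record()  # traffic the detector should treat as normal
anomalous = sample_waf_record(countries=("FR",), methods=("POST",))
print(json.dumps(anomalous))
```

Sending mostly expected records with an occasional burst of anomalous ones gives the detector a baseline to learn from before the unusual values appear.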

You will receive an email alert via Amazon SNS if the traffic contains any anomalous data. You should also be able to view the anomalies recorded in Amazon OpenSearch Service by selecting the detector and choosing Anomaly results for the detector, as shown in Figure 13.

Figure 13 – Anomaly results

Conclusion

In this post, you learned how you can discover anomalies in AWS WAF logs across parameters like Country and httpMethod defined by the attribute values. You can further expand your anomaly detection use cases with application logs and other AWS Service logs. To learn more about this feature with Amazon OpenSearch Service, we suggest reading the Amazon OpenSearch Service documentation. We look forward to hearing your questions, comments, and feedback. 

If you found this post interesting and useful, you may be interested in https://aws.amazon.com/blogs/security/how-to-improve-visibility-into-aws-waf-with-anomaly-detection/ and https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/ as further alternative approaches.

 
Umesh Kumar Ramesh

Umesh is a senior cloud infrastructure architect with AWS who delivers proof-of-concept projects and topical workshops, and leads implementation projects. He holds a bachelor’s degree in computer science and engineering from the National Institute of Technology, Jamshedpur (India). Outside of work, he enjoys watching documentaries, biking, practicing meditation, and discussing spirituality.

Anuj Butail

Anuj is a solutions architect at AWS. He is based out of San Francisco and helps customers in San Francisco and Silicon Valley design and build large scale applications on AWS. He has expertise in the area of AWS, edge services, and containers. He enjoys playing tennis, watching sitcoms, and spending time with his family.

Mahek Pavagadhi

Mahek is a cloud infrastructure architect at AWS in San Francisco, CA. She has a master’s degree in software engineering with a major in cloud computing. She is passionate about cloud services and building solutions with it. Outside of work, she is an avid traveler who loves to explore local cafés.

Summary of Zabbix Summit Online 2021, Zabbix 6.0 LTS release date and Zabbix Workshops

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/summary-of-zabbix-summit-online-2021-zabbix-6-0-lts-release-date-and-zabbix-workshops/17155/

Now that the Zabbix Summit Online 2021 has concluded, we are thrilled to report we hosted attendees from over 3000 organizations from more than 130 countries all across the globe.

This year, the speeches focused mainly on the upcoming Zabbix 6.0 LTS release, as well as on automating Zabbix data collection and configuration, integrating Zabbix within existing company infrastructures, and migrating from legacy tools to Zabbix. In total, 21 speakers presented their use cases and talked about new Zabbix features during the Summit, with over 8 hours of content.

In case you missed the Summit or wish to come back to some of the speeches – both the presentations (in PDF format) and the videos of the speeches are available on the Zabbix Summit Online 2021 Event page.

Zabbix 6.0 LTS release date

As for Zabbix 6.0 LTS – as per our statement during the event, you can expect Zabbix 6.0 LTS to release in early 2022. At the time of this post, the latest pre-release version is Zabbix 6.0 Alpha 7, with the first Beta version scheduled for release VERY soon. Feel free to deploy the latest pre-release version and take a look at features such as Geomaps, Business Service monitoring, improved Audit log, UX improvements, Anomaly detection with Machine Learning, and more! The list of the latest released Zabbix 6.0 versions as well as the improvements and fixes they contain is available in the Release notes section of our website.

Zabbix 6.0 LTS Workshops

The workshops will focus on particular Zabbix 6.0 LTS features and will be available once the Zabbix 6.0 LTS is released. The workshops will provide a unique chance to learn and practice the configuration of specific Zabbix 6.0 LTS features under the guidance of a certified Zabbix trainer at absolutely no cost! Some of the topics covered in the workshops will include – Deploying Zabbix server HA cluster, Creating triggers for Baseline monitoring and Anomaly detection, Displaying your infrastructure status on Geomaps, Deploying Business Service monitoring with root cause analysis, and more!

Upcoming events

But there’s more! On December 9, 2021, Zabbix will host PostgreSQL Monitoring Day with Zabbix & Postgres Pro. The speeches will focus on monitoring PostgreSQL databases, running Zabbix on PostgreSQL DB backends with TimescaleDB, and securing your Zabbix + PostgreSQL instances. If you’re currently using PostgreSQL DB backends or plan to do so in the future – you definitely don’t want to miss out!

As for 2022 – you can expect multiple meetups regarding Zabbix 6.0 LTS features and use cases, as well as events focused on specific monitoring use cases. More information will be publicly available with the release of Zabbix 6.0 LTS.

How to improve visibility into AWS WAF with anomaly detection

Post Syndicated from Cyril Soler original https://aws.amazon.com/blogs/security/how-to-improve-visibility-into-aws-waf-with-anomaly-detection/

When your APIs are exposed on the internet, they naturally face unpredictable traffic. AWS WAF helps protect your application’s API against common web exploits, such as SQL injection and cross-site scripting. In this blog post, you’ll learn how to automatically detect anomalies in the AWS WAF metrics to improve your visibility into AWS WAF activity, identify malicious activity, and simplify your investigations. The service that this solution uses to detect anomalies is Amazon Lookout for Metrics.

Lookout for Metrics is a service you can use to monitor business or operational metrics such as successful or failed HTTP requests and detect anomalies by using machine learning (ML). You can configure Lookout for Metrics to monitor different data sources that contain AWS WAF metrics, including Amazon CloudWatch. Lookout for Metrics can also take actions such as publishing findings in AWS Security Hub.

Solution overview

The solution in this blog post uses Amazon API Gateway to serve a simple REST API. AWS WAF protects API Gateway with AWS Managed Rules for AWS WAF. Amazon Lookout for Metrics actively detects unusual patterns in AWS WAF rule actions and sends a finding to Security Hub when suspicious activity is detected. Figure 1 shows the solution architecture.

Because AWS WAF integrates with Application Load Balancer, Amazon CloudFront distributions, or AWS AppSync GraphQL APIs, this solution also applies to these services.
 

Figure 1: Solution architecture

The workflow of the solution is as follows:

  1. An HTTP request reaches the API Gateway endpoint.
  2. AWS WAF analyzes the HTTP request using the configured rules.
  3. Amazon CloudWatch collects action metrics for each rule that is configured in AWS WAF.
  4. Amazon Lookout for Metrics monitors CloudWatch metrics, selects the best ML algorithm, and trains the ML model.
  5. Lookout for Metrics detects outliers and provides a severity score to diagnose the issue.
  6. Lookout for Metrics invokes an AWS Lambda function when an anomaly is detected.
  7. The Lambda function sends a finding to Security Hub for further analysis.

Let’s take a detailed look at the AWS services that you will use in this solution.

Amazon API Gateway

Amazon API Gateway is a serverless API management service that supports mock integrations for API methods. This is the easiest and the most cost-effective way to implement this solution. But you can also use Amazon CloudFront, AWS AppSync GraphQL API, and Application Load Balancer to implement this solution in your workload.

AWS WAF

AWS WAF is a web application firewall you can associate with API Gateway for REST APIs, Amazon CloudFront, AWS AppSync for GraphQL API, or Application Load Balancer. AWS WAF is integrated with other AWS services such as CloudWatch. AWS WAF uses rules to detect common web exploits in the incoming HTTP requests. You can configure your own rules, or use managed rulesets from AWS or from a third-party vendor. In this solution, you use AWS Managed Rules, which contains the CrossSiteScripting_QUERYARGUMENTS rule.

Amazon CloudWatch

Amazon CloudWatch is a monitoring and observability service. CloudWatch receives specific metrics from AWS WAF every 5 minutes. In particular, for each AWS WAF rule, CloudWatch provides PassedRequests, BlockedRequests, and CountedRequests metrics.

Amazon Lookout for Metrics

Amazon Lookout for Metrics uses machine learning (ML) algorithms to automatically detect and diagnose anomalies in your metrics. By using CloudWatch metrics as a data source for Lookout for Metrics, you can apply one of the Lookout for Metrics ML models to detect anomalies more quickly. In addition, you can provide feedback on detected anomalies to help improve the model accuracy over time. Lookout for Metrics is available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions.

AWS Lambda

In this solution, you use an AWS Lambda function as an alert mechanism for Lookout for Metrics. When the machine learning model detects an outlier, it invokes the Lambda function, which runs custom code. The Lambda function then imports the anomaly as a finding to Security Hub.

AWS Security Hub

In this solution, you use AWS Security Hub as a centralized way to manage security findings. This integration has the advantage of providing a common place for the security team to diagnose security findings from various sources, and uniformly integrates with your existing Security Information and Event Management (SIEM) system.

Prerequisites

This solution uses Security Hub to collect anomaly detection findings. Before you deploy the solution, you need to enable Security Hub in your AWS account by following the instructions to enable Security Hub manually. After you enable Security Hub, you can optionally select the security standards that are relevant for your workload, as shown in Figure 2.
 

Figure 2: Manually enabling Security Hub in the AWS Management Console

Deploy the solution

A ready-to-use solution is provided as an AWS Cloud Development Kit (AWS CDK) application in the AWS WAF Anomaly Detection CDK project GitHub code repository. You can clone the GitHub repository and deploy the application by using the AWS CDK for Python.

Important: After you successfully deploy the solution, you should activate the Lookout for Metrics detector. This is not done as part of the CDK deployment. To activate the detector, in the AWS Management Console navigate to Amazon Lookout for Metrics, select the detector the solution created (WAFBlockingRequestDetector), and choose Activate. Alternatively, you can use the following AWS CLI command to activate your detector.

aws lookoutmetrics activate-anomaly-detector --anomaly-detector-arn arn:aws:lookoutmetrics:<REGION_ID>:<ACCOUNT_ID>:AnomalyDetector:WAFBlockingRequestDetector
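
The same activation can also be scripted with the AWS SDK for Python. The following is a sketch under the assumption that boto3 is installed and credentials are configured; detector_arn and activate_detector are hypothetical helper names, and the ARN layout simply follows the CLI command above.

```python
def detector_arn(region, account_id, name="WAFBlockingRequestDetector"):
    """Build the detector ARN in the same layout as the CLI example."""
    return f"arn:aws:lookoutmetrics:{region}:{account_id}:AnomalyDetector:{name}"


def activate_detector(region, account_id):
    """Activate the detector (requires boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so detector_arn works without boto3

    client = boto3.client("lookoutmetrics", region_name=region)
    client.activate_anomaly_detector(
        AnomalyDetectorArn=detector_arn(region, account_id)
    )


# Example with a placeholder account ID:
print(detector_arn("us-east-1", "123456789012"))
```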

If you don’t want to run the CDK application, you can implement the same solution by using the AWS Management Console. In the following sections, I’ll go through the manual steps you can follow to achieve this.

Create an API to demonstrate the solution

First, you need an HTTP endpoint to protect. AWS WAF is integrated with CloudFront, Application Load Balancer, API Gateway, and AWS AppSync GraphQL API. In this blog post, I recommend a REST API Gateway because it’s a fully managed service to create and manage APIs. In addition, API Gateway provides a mechanism to implement mock APIs.

To build a REST API, follow the instructions for creating a REST API in Amazon API Gateway. After you create the API, create a GET method at the API root level and associate it to a mock endpoint, as shown in Figure 3. This is just enough to return an HTTP 200 status code to any GET requests.
 

Figure 3: Creating an API with mock integration

Finally, deploy the API under the “prod” stage and keep all the default settings.

Create an AWS WAF web ACL to deploy the managed rules

Now that you’ve created an API in API Gateway, you need to create an AWS WAF web access control list (web ACL) by following the instructions in Creating a web ACL. A web ACL is the top-level configuration object of AWS WAF. This is the collection of AWS WAF rules that you will apply to your API. API Gateway is a regional service, so make sure to create a web ACL in the same AWS Region as the API. After you create the web ACL, add the Core rule set (CRS) rule group from AWS Managed Rules, also called AWSManagedRulesCommonRuleSet, as shown in Figure 4. This rule group contains the CrossSiteScripting_QUERYARGUMENTS rule, which you will use later to demonstrate the anomaly detection.
 

Figure 4: Adding AWSManagedRulesCommonRuleSet to the AWS WAF web ACL

By observing Web ACL rule capacity units used, you can see that the Core rule set is consuming 700 web ACL capacity units (WCUs). The maximum capacity for a web ACL is 1,500, which is sufficient for most use cases. If you need more capacity, contact the AWS Support Center.

Associate the web ACL with the API deployment

After you create the web ACL, you associate it with the API. To do this, in the AWS WAF console, navigate to the web ACL you just created. On the Associated AWS resources tab, choose Add AWS resources. When prompted, choose the API you created earlier, and then choose Add.
 

Figure 5: Associating the web ACL with the API

Create a Lambda function to forward the anomaly to Security Hub

It’s useful to get visibility into the anomalies that are detected by the solution, and there are various ways to do that. In this solution, you provide such visibility as findings to Security Hub. Security Hub provides a centralized place to manage different findings from your AWS solutions. It also provides graphical tools to help with diagnostics.

You use a Lambda function that receives each anomaly and imports it into Security Hub. You can find the lookout_alarm Lambda function on GitHub, or follow the instructions to build a Lambda function with Python. You will use this Lambda function to provide additional context enrichment in the finding.

import boto3

securityHub = boto3.client('securityhub')

def lambda_handler(event, context):
    # submit the finding to Security Hub
    result = securityHub.batch_import_findings(Findings = [...])
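
The contents of the Findings list are elided in the snippet above. As a hedged illustration of what one entry could look like, the sketch below builds a finding with the fields that the AWS Security Finding Format requires. It is not the lookout_alarm code: build_finding, severity_label, the title, the finding type, and the mapping from the anomaly score to a severity label are all assumptions.

```python
from datetime import datetime, timezone


def severity_label(score):
    """Map a 0-100 anomaly score to a severity label (assumed mapping)."""
    if score >= 90:
        return "CRITICAL"
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    if score >= 1:
        return "LOW"
    return "INFORMATIONAL"


def build_finding(account_id, region, anomaly_id, score, now=None):
    """Build one finding dict; anomaly_id is a hypothetical identifier."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"waf-anomaly/{anomaly_id}",
        # Default product ARN for findings imported by your own account
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": "lookout-for-metrics",
        "AwsAccountId": account_id,
        "Types": ["Unusual Behaviors/Network Flow"],
        "CreatedAt": ts,
        "UpdatedAt": ts,
        "Severity": {"Label": severity_label(score)},
        "Title": "AWS WAF anomaly detected",
        "Description": f"Lookout for Metrics reported an anomaly with score {score}.",
        "Resources": [{"Type": "Other", "Id": anomaly_id}],
    }
```

A list of such dictionaries is what batch_import_findings expects in its Findings parameter.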

Before you use this Lambda function, make sure you enable Security Hub.

Create the Lookout for Metrics detector, dataset, and alarm

Now you have an API that is protected by an AWS WAF web ACL. You also have configured a way to integrate with Security Hub through a Lambda function. The next step is to create a Lookout for Metrics detector and connect all these elements together. The key concepts and terminology of Lookout for Metrics are:

  • Detector – A Lookout for Metrics resource that monitors a dataset and identifies anomalies.
  • Dataset – The detector’s copy of the data that Lookout for Metrics is analyzing.
  • Alert – A mechanism to send a notification or initiate a processing workflow when the detector finds an anomaly.

First, follow the instructions to create a detector. The only information you need to provide is a name and an interval. The interval is the amount of time between two analyses. Your choice of the interval depends upon criteria such as the metrics you are processing, or the retention time of your data. For more information on the detector interval, see Lookout for Metrics quotas. In the example in Figure 6, I chose an interval of 5 minutes, which is the minimum.
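
The same step can be sketched with the AWS SDK for Python. This is an illustration under assumptions, not the procedure used in this post: the helper names and the detector name are hypothetical, and the mapping reflects the interval values I understand the service to accept (5 minutes, 10 minutes, 1 hour, and 1 day).

```python
# ISO 8601 durations used by Lookout for Metrics for the detector interval.
FREQUENCIES = {5: "PT5M", 10: "PT10M", 60: "PT1H", 1440: "P1D"}


def detector_params(name: str, interval_minutes: int) -> dict:
    """Build the arguments for create_anomaly_detector."""
    if interval_minutes not in FREQUENCIES:
        raise ValueError(f"unsupported interval: {interval_minutes} minutes")
    return {
        "AnomalyDetectorName": name,
        "AnomalyDetectorConfig": {
            "AnomalyDetectorFrequency": FREQUENCIES[interval_minutes]
        },
    }


def create_detector(name: str, interval_minutes: int) -> str:
    """Create the detector and return its ARN."""
    import boto3  # imported lazily so detector_params() can be tested offline

    client = boto3.client("lookoutmetrics")
    response = client.create_anomaly_detector(**detector_params(name, interval_minutes))
    return response["AnomalyDetectorArn"]


# Hypothetical detector name, using the 5-minute interval chosen in this post.
print(detector_params("waf-anomaly-detector", 5))
```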
 

Figure 6: Creating an Amazon Lookout for Metrics detector


After you create the detector, follow the instructions to configure a dataset that uses CloudWatch as a data source. For Service role, select Create a role, choose Next, and then enter the following parameters:

  • For the CloudWatch namespace, choose AWS/WAFV2.
  • For Dimensions, choose Region, Rule, and WebACL.
  • For Measure, choose BlockedRequests.
  • For Aggregation function, choose SUM.

Figure 7 shows the data source fields that the detector will check for anomalies.
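
Expressed as API parameters, the dataset configuration above might look like the following sketch. It assumes the create_metric_set call of the AWS SDK for Python and an existing IAM role that Lookout for Metrics can assume to read CloudWatch; the dataset name and function names are hypothetical.

```python
def metric_set_params(detector_arn: str, role_arn: str) -> dict:
    """Build create_metric_set arguments that mirror the console settings."""
    return {
        "AnomalyDetectorArn": detector_arn,
        "MetricSetName": "waf-blocked-requests",  # hypothetical name
        "MetricSetFrequency": "PT5M",  # matches the 5-minute detector interval
        "MetricSource": {"CloudWatchConfig": {"RoleArn": role_arn}},
        "MetricList": [
            {
                "MetricName": "BlockedRequests",
                "Namespace": "AWS/WAFV2",
                "AggregationFunction": "SUM",
            }
        ],
        "DimensionList": ["Region", "Rule", "WebACL"],
    }


def create_dataset(detector_arn: str, role_arn: str) -> str:
    """Create the dataset and return its ARN."""
    import boto3  # imported lazily so metric_set_params() can be tested offline

    client = boto3.client("lookoutmetrics")
    response = client.create_metric_set(**metric_set_params(detector_arn, role_arn))
    return response["MetricSetArn"]
```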
 

Figure 7: Creating an Amazon Lookout for Metrics dataset


Next, create a Lookout for Metrics alert to invoke the Lambda function. To do so, follow the instructions for working with alerts. You provide a name, a channel (the Lambda function), and a severity threshold. One of the main advantages of Lookout for Metrics is that it scores each detected anomaly to indicate its severity. Anomalies have a score from 0 to 100. You can set up several alerts with different thresholds that are associated with the same detector. This way, you can provide alerts for different severity levels. In the example in Figure 8, I created a single alert with a severity threshold of 10.
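
To make the threshold idea concrete, here is a small sketch. The alert names and the second threshold are hypothetical, and the comparison assumes that an alert fires when the anomaly score meets or exceeds its threshold. The create_alert helper shows how one such alert could be created with the AWS SDK for Python, assuming an IAM role that allows Lookout for Metrics to invoke the function.

```python
def alerts_to_fire(score: float, thresholds: dict) -> list:
    """Return the names of the alerts whose threshold the score reaches."""
    return sorted(name for name, limit in thresholds.items() if score >= limit)


def create_alert(name: str, threshold: int, detector_arn: str,
                 lambda_arn: str, role_arn: str) -> str:
    """Create an alert that invokes a Lambda function; return the alert ARN."""
    import boto3  # imported lazily so alerts_to_fire() can be tested offline

    client = boto3.client("lookoutmetrics")
    response = client.create_alert(
        AlertName=name,
        AlertSensitivityThreshold=threshold,
        AnomalyDetectorArn=detector_arn,
        Action={"LambdaConfiguration": {"RoleArn": role_arn, "LambdaArn": lambda_arn}},
    )
    return response["AlertArn"]


# Hypothetical setup: one low threshold (as in this post) and one high threshold.
thresholds = {"notify-security-hub": 10, "page-on-call": 80}
print(alerts_to_fire(50, thresholds))
print(alerts_to_fire(100, thresholds))
```

With these hypothetical thresholds, a moderate anomaly reaches only the low-threshold alert, while a severe one reaches both.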
 

Figure 8: Creating an Amazon Lookout for Metrics alert


The last step is to activate the detector, which prompts Lookout for Metrics to select an ML model and train it. To do so, choose Activate on the detector details page.
 

Figure 9: Activating the Amazon Lookout for Metrics detector


Why does this solution use Lookout for Metrics anomaly detection?

Amazon CloudWatch offers native anomaly detection on a given metric. This feature applies statistical and ML algorithms that continuously analyze the metric, determine normal baselines, and identify anomalies with minimal user intervention.

Lookout for Metrics provides a more sophisticated version of anomaly detection, which makes it the better choice for this solution. First, Lookout for Metrics supports a collection of ML algorithms. Because no single algorithm works for all kinds of data, Lookout for Metrics inspects the data and applies the appropriate algorithm to detect anomalies accurately. In addition, Lookout for Metrics groups concurrent anomalies into logical groups and sends a single alert for the anomaly group rather than separate alerts, so you can see the full picture. Finally, Lookout for Metrics allows you to provide feedback on the detected anomalies, which AWS uses to continuously improve the accuracy and performance of the models.

Publish the value zero in CloudWatch metrics

AWS WAF reports a metric only when its value is nonzero. This means that the BlockedRequests metric isn’t updated if AWS WAF isn’t blocking any requests. In the absence of real HTTP traffic, typically in a testing environment, you must publish the value zero yourself. In production, because AWS WAF is actively blocking illegitimate requests, this publication is not required. To train the ML model in the absence of blocked requests, you need to publish the value zero by calling the PutMetricData CloudWatch API method every 5 minutes.

In my example, I selected a 5-minute period to align with the Lookout for Metrics interval. You can publish a zero value every five minutes by using the CloudWatch metrics API, as shown in the following code. The zero value doesn’t impact the SUM and ensures that at least one value is published every five minutes. You can use the cloudwatch_zero Lambda function on GitHub to publish the value zero by using the AWS SDK for Python.

import boto3

cloudwatch = boto3.client('cloudwatch')

def lambda_handler(event, context):

    result = cloudwatch.put_metric_data(
        Namespace='AWS/WAFV2',
        MetricData=[{
                'MetricName': 'BlockedRequests',
                'Dimensions': [...],
                'Value': 0
        }]
    )
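
The Dimensions list is elided in the snippet above. One plausible way to fill it, shown here only as a sketch, is to derive the dimensions from the scheduled event. The sketch assumes the event carries the WebACLId and RuleId keys, that the dimension names are the Region, Rule, and WebACL dimensions selected for the dataset, and that the code runs in Lambda, where the AWS_REGION environment variable is set. The zero_metric helper is hypothetical; compare it with the cloudwatch_zero function on GitHub.

```python
import os


def zero_metric(event: dict, region: str) -> dict:
    """Build a zero-valued BlockedRequests datapoint for one web ACL rule."""
    return {
        "MetricName": "BlockedRequests",
        "Dimensions": [
            {"Name": "WebACL", "Value": event["WebACLId"]},
            {"Name": "Rule", "Value": event["RuleId"]},
            {"Name": "Region", "Value": region},
        ],
        "Value": 0,
    }


def lambda_handler(event, context):
    import boto3  # imported lazily so zero_metric() can be tested offline

    cloudwatch = boto3.client("cloudwatch")
    return cloudwatch.put_metric_data(
        Namespace="AWS/WAFV2",
        MetricData=[zero_metric(event, os.environ["AWS_REGION"])],
    )
```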

To create a CloudWatch Events rule to schedule the call every 5 minutes

  1. Navigate to the CloudWatch Events console and choose Create Rule.
  2. Choose Schedule, keep the 5-minute default interval, and choose Add target.
  3. Select the name of the function you previously created, and then expand the Configure input section.
  4. Choose Constant (JSON text), as shown in Figure 10. In the text field, paste the following configuration:
    {"WebACLId":"WebACLForWAFDemo","RuleId":"AWS-AWSManagedRulesCommonRuleSet"}
    

  5. Choose Configure details.
  6. Enter a name for your rule, and then choose Create rule.
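
The console steps above can also be sketched with the AWS SDK for Python. This is an illustration under assumptions: the target ID and function names are hypothetical, and unlike the console, this code does not grant the rule permission to invoke the function, so you would need to add that permission separately.

```python
import json


def schedule_target(function_arn: str, web_acl_id: str, rule_id: str) -> dict:
    """Build the rule target, passing the constant JSON input to the function."""
    return {
        "Id": "publish-zero",  # hypothetical target ID
        "Arn": function_arn,
        "Input": json.dumps({"WebACLId": web_acl_id, "RuleId": rule_id}),
    }


def create_schedule(rule_name: str, function_arn: str,
                    web_acl_id: str, rule_id: str) -> None:
    """Create a rule that invokes the function every 5 minutes."""
    import boto3  # imported lazily so schedule_target() can be tested offline

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression="rate(5 minutes)", State="ENABLED")
    events.put_targets(
        Rule=rule_name,
        Targets=[schedule_target(function_arn, web_acl_id, rule_id)],
    )
```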

 

Figure 10: Creating a CloudWatch Events rule scheduled every 5 minutes


Training time

Before the activated detector attempts to find anomalies, it uses data from several intervals to learn. If no historical data is available, the training process takes approximately one day for a five-minute interval. When you first deploy this solution, you have no historical data in CloudWatch for your AWS WAF resources, and you’re facing a cold start of Lookout for Metrics anomaly detection. Because the Lookout for Metrics detector interval is set to 5 minutes, you have to wait for 25 hours before being able to detect an anomaly. If you deploy the solution against an AWS WAF resource that’s been in production for days, you’ll have a reduced training time.
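
To put the 25-hour figure in perspective, the following short calculation converts it into a number of detector intervals. It is plain arithmetic on the values given in this post; the function name is hypothetical.

```python
from datetime import timedelta


def intervals_in(training_time: timedelta, interval: timedelta) -> int:
    """Number of whole detector intervals that fit in the training time."""
    return training_time // interval


# 25 hours of training at a 5-minute interval.
print(intervals_in(timedelta(hours=25), timedelta(minutes=5)))  # prints 300
```

In other words, with a cold start the detector observes 300 consecutive 5-minute intervals before it starts to report anomalies.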

Test the anomaly detection

After 25 hours, Lookout for Metrics has selected an ML model that fits the behavior of your metrics and has trained it on your actual data. You can then start to test the anomaly detection. You can use a simple curl command that injects a JavaScript alert() call in a query parameter, as described in the AWS WAF documentation, to trigger the CrossSiteScripting_QUERYARGUMENTS managed rule. Make sure to send a significant number of requests so that the spike in blocked requests registers as an anomaly.

for i in {1..150}
do
  curl "https://<api_gateway_endpoint>?test=%3Cscript%3Ealert%28%22hello%22%29%3C%2Fscript%3E"
done
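
If you would rather drive the test from Python, the following sketch sends the same request with the standard library and counts how many responses were blocked. It assumes that AWS WAF answers blocked requests with the default HTTP 403 status; the function names are hypothetical and the endpoint is a placeholder that you replace with your own.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

PAYLOAD = '<script>alert("hello")</script>'


def build_test_url(endpoint: str) -> str:
    """Append the URL-encoded script payload as the 'test' query parameter."""
    return f"{endpoint}?test={quote(PAYLOAD, safe='')}"


def send_requests(endpoint: str, count: int = 150) -> int:
    """Send the payload 'count' times and return how many requests got a 403."""
    blocked = 0
    for _ in range(count):
        try:
            urllib.request.urlopen(build_test_url(endpoint), timeout=10).close()
        except urllib.error.HTTPError as err:
            if err.code == 403:  # assumed default AWS WAF block response
                blocked += 1
    return blocked
```

The URL built by build_test_url matches the percent-encoded query string used in the curl loop above.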

After you run the injection script, wait for the system to detect the anomaly. The CloudWatch BlockedRequests metric takes up to 5 minutes to update, and Lookout for Metrics is configured to detect anomalies in the CloudWatch data every 5 minutes. For those reasons, it can take up to 10 minutes to detect the simulated anomaly.

After detection and processing time, the finding is visible in Security Hub. To view the finding, go to the AWS Management Console, choose Services, choose Security Hub, and then choose Findings.
 

Figure 11: AWS Security Hub findings


In Figure 11, you can see the new finding, coming from Lookout for Metrics, with a Low severity and an anomaly score of 100. You can use the remediation field to open the Lookout for Metrics console, where you can give feedback on the anomaly detection to improve the model for future detections.
 

Figure 12: Lookout for Metrics console, Finding view


Figure 12 shows the Lookout for Metrics graphical interface, where you can see the metrics related to the finding. The previous injection script impacted only one metric, but the same setup works to observe anomalies that arise between two or more metrics together. This feature makes diagnosis of issues easier.

For each of the impacted metrics, to confirm that the anomaly is relevant, choose the Yes button next to Is this relevant? above the graph.

Extend the solution

The solution in this post detects anomalies in the AWS WAF blocked request behavior. But you can also configure AWS WAF rule actions to count your requests instead of blocking them. This is usually done on legacy systems or for some particular rules of a managed ruleset that present an incompatibility with your workload. When you configure the rule action as a count, you increase the need for a comprehensive observability approach. By implementing anomaly detection against counted requests, this solution will help you to achieve better observability for your system.

Concerning the remediation, it’s possible to modify this solution by integrating it with different AWS services. As an example, you can integrate the anomaly detection with your own SIEM system, or simply notify your security team distribution list by using Amazon Simple Notification Service (Amazon SNS).
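
As an illustration of the Amazon SNS option, here is a minimal sketch of a notification helper. It assumes an existing SNS topic to which your security team is subscribed and an alert event that contains an anomalyScore key; the subject text and function names are hypothetical.

```python
import json


def sns_message(event: dict) -> dict:
    """Build the subject and body of the notification from the alert event."""
    score = event.get("anomalyScore", "n/a")  # assumed payload key
    return {
        "Subject": f"AWS WAF anomaly detected (score {score})",
        "Message": json.dumps(event, indent=2, default=str),
    }


def notify(topic_arn: str, event: dict) -> dict:
    """Publish the notification to the SNS topic."""
    import boto3  # imported lazily so sns_message() can be tested offline

    return boto3.client("sns").publish(TopicArn=topic_arn, **sns_message(event))
```

You could call notify from the same Lambda function that imports the finding into Security Hub, or from a separate function attached to a second alert.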

AWS WAF provides additional information in its logs, such as the IP address for the client. To detect anomalies in AWS WAF logs, you can ingest the AWS WAF logs to Amazon Simple Storage Service (Amazon S3), and then use Lookout for Metrics with Amazon S3 as a data source.

Conclusion

AWS WAF is integrated with CloudWatch and provides metrics for passed, blocked, and counted requests. With Lookout for Metrics, you can detect unexpected behavior in CloudWatch metrics by using a machine learning (ML) model. In this blog post, I showed you how to integrate both services to provide AWS WAF with an ML-based anomaly detection mechanism. ML is a way to gain more visibility into your AWS WAF behavior. In addition, you can be notified when the system detects abnormal levels of blocked (or counted) requests, so that you can take the right remediation action.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS WAF forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Cyril Soler

Cyril is a Senior Solutions Architect at AWS, working with Spain-based enterprises. His interests include security and data protection. He has been passionate about computer science since he was 7. When he’s far from a keyboard, he enjoys mechanics. Cyril holds a Master’s degree from Polytech Marseille, School of Engineering.