Tag Archives: alerts

How to Receive Alerts When Your IAM Configuration Changes

2023-07-31 Dylan Souvage

Post Syndicated from Dylan Souvage original https://aws.amazon.com/blogs/security/how-to-receive-alerts-when-your-iam-configuration-changes/

July 27, 2023: This post was originally published February 5, 2015, and received a major update July 31, 2023.

As an Amazon Web Services (AWS) administrator, it’s crucial for you to implement robust protective controls to maintain your security configuration. Employing a detective control mechanism to monitor changes to the configuration serves as an additional safeguard in case the primary protective controls fail. Although some changes are expected, you might want to review unexpected changes or changes made by a privileged user. AWS Identity and Access Management (IAM) is a service that primarily helps manage access to AWS services and resources securely. It does provide detailed logs of its activity, but it doesn’t inherently provide real-time alerts or notifications. Fortunately, you can use a combination of AWS CloudTrail, Amazon EventBridge, and Amazon Simple Notification Service (Amazon SNS) to alert you when changes are made to your IAM configuration. In this blog post, we walk you through how to set up EventBridge to initiate SNS notifications for IAM configuration changes. You can also have SNS push messages directly to ticketing or tracking services, such as Jira, ServiceNow, or your preferred method of receiving notifications, but that is not discussed here.

In any AWS environment, many activities can take place at every moment. CloudTrail records IAM activities, EventBridge filters and routes event data, and Amazon SNS provides notification functionality. This post will guide you through identifying and setting alerts for IAM changes, modifications in authentication and authorization configurations, and more. The power is in your hands to make sure you’re notified of the events you deem most critical to your environment. Here’s a quick overview of how you can invoke a response, shown in Figure 1.

Figure 1: Simple architecture diagram of actors and resources in your account and the process for sending notifications through IAM, CloudTrail, EventBridge, and SNS.

Log IAM changes with CloudTrail

Before we dive into implementation, let’s briefly understand the function of AWS CloudTrail. It records and logs activity within your AWS environment, tracking actions such as IAM role creation, deletion, or modification, thereby offering an audit trail of changes.

With this in mind, we’ll discuss the first step in tracking IAM changes: establishing a log for each modification. In this section, we’ll guide you through using CloudTrail to create these pivotal logs.

For an in-depth understanding of CloudTrail, refer to the AWS CloudTrail User Guide.

In this post, you’re going to start by creating a CloudTrail trail with the Management events type selected, and read and write API activity selected. If you already have a CloudTrail trail set up with those attributes, you can use that CloudTrail trail instead.

To create a CloudTrail trail

Open the AWS Management Console and select CloudTrail, and then choose Dashboard.
In the CloudTrail dashboard, choose Create Trail.

Figure 2: Use the CloudTrail dashboard to create a trail
In the Trail name field, enter a display name for your trail and then select Create a new S3 bucket. Leave the default settings for the remaining trail attributes.

Figure 3: Set the trail name and storage location
Under Event type, select Management events. Under API activity, select Read and Write.
Choose Next.

Figure 4: Choose which events to log

Set up notifications with Amazon SNS

Amazon SNS is a managed service that provides message delivery from publishers to subscribers. Publishers communicate asynchronously with subscribers by sending messages to a topic, which is a logical access point and communication channel. Subscribers can receive these messages using supported endpoint types, including email, which you will use in the blog example today.

For further reading on Amazon SNS, refer to the Amazon SNS Developer Guide.

Now that you’ve set up CloudTrail to log IAM changes, the next step is to establish a mechanism to notify you about these changes in real time.

To set up notifications

Open the Amazon SNS console and choose Topics.
Create a new topic. Under Type, select Standard and enter a name for your topic. Keep the defaults for the rest of the options, and then choose Create topic.

Figure 5: Select Standard as the topic type
Navigate to your topic in the topic dashboard, choose the Subscriptions tab, and then choose Create subscription.

Figure 6: Choose Create subscription
For Topic ARN, select the topic you created previously, then under Protocol, select Email and enter the email address you want the alerts to be sent to.

Figure 7: Select the topic ARN and add an endpoint to send notifications to
After your subscription is created, go to the mailbox you designated to receive notifications and check for a verification email from the service. Open the email and select Confirm subscription to verify the email address and complete setup.

Initiate events with EventBridge

Amazon EventBridge is a serverless service that uses events to connect application components. EventBridge receives an event (an indicator of a change in environment) and applies a rule to route the event to a target. Rules match events to targets based on either the structure of the event, called an event pattern, or on a schedule.

Events that come to EventBridge are associated with an event bus. Rules are tied to a single event bus, so they can only be applied to events on that event bus. Your account has a default event bus that receives events from AWS services, and you can create custom event buses to send or receive events from a different account or AWS Region.

For a more comprehensive understanding of EventBridge, refer to the Amazon EventBridge User Guide.

In this part of our post, you’ll use EventBridge to devise a rule for initiating SNS notifications based on IAM configuration changes.

To create an EventBridge rule

Go to the EventBridge console and select EventBridge Rule, and then choose Create rule.

Figure 8: Use the EventBridge console to create a rule
Enter a name for your rule, keep the defaults for the rest of rule details, and then choose Next.

Figure 9: Rule detail screen
Under Target 1, select AWS service.
In the dropdown list for Select a target, select SNS topic, select the topic you created previously, and then choose Next.

Figure 10: Target with target type of AWS service and target topic of SNS topic selected
Under Event source, select AWS events or EventBridge partner events.

Figure 11: Event pattern with AWS events or EventBridge partner events selected
Under Event pattern, verify that you have the following selected.
1. For Event source, select AWS services.
2. For AWS service, select IAM.
3. For Event type, select AWS API Call via CloudTrail.
4. Select the radio button for Any operation.
Figure 12: Event pattern details selected

Now that you’ve set up EventBridge to monitor IAM changes, test it by creating a new user or adding a new policy to an IAM role and see if you receive an email notification.

Centralize EventBridge alerts by using cross-account alerts

If you have multiple accounts, you should be evaluating using AWS Organizations. (For a deep dive into best practices for using AWS Organizations, we recommend reading this AWS blog post.)

By standardizing the implementation to channel alerts from across accounts to a primary AWS notification account, you can use a multi-account EventBridge architecture. This allows aggregation of notifications across your accounts through sender and receiver accounts. Figure 13 shows how this works. Separate member accounts within an AWS organizational unit (OU) have the same mechanism for monitoring changes and sending notifications as discussed earlier, but send notifications through an EventBridge instance in another account.

Figure 13: Multi-account EventBridge architecture aggregating notifications between two AWS member accounts to a primary management account

You can read more and see the implementation and deep dive of the multi-account EventBridge solution on the AWS samples GitHub, and you can also read more about sending and receiving Amazon EventBridge notifications between accounts.

Monitor calls to IAM

In this blog post example, you monitor calls to IAM.

The filter pattern you selected while setting up EventBridge matches CloudTrail events for calls to the IAM service. Calls to IAM have a CloudTrail eventSource of iam.amazonaws.com, so IAM API calls will match this pattern. You will find this simple default filter pattern useful if you have minimal IAM activity in your account or to test this example. However, as your account activity grows, you’ll likely receive more notifications than you need. This is when filtering only the relevant events becomes essential to prioritize your responses. Effectively managing your filter preferences allows you to focus on events of significance and maintain control as your AWS environment grows.
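To make the matching behavior concrete, here is a minimal Python sketch. It assumes the console selections above produce an event pattern like `IAM_PATTERN` below, and it uses a deliberately simplified matcher (exact values only). The `matches` and `sample_event` helpers and the trimmed-down event shape are illustrative assumptions, not EventBridge's actual implementation.

```python
import json

# Event pattern the console selections are assumed to produce.
IAM_PATTERN = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventSource": ["iam.amazonaws.com"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present in the event and
    the event value must equal one of the listed values (nested dicts recurse)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def sample_event(event_source: str, event_name: str) -> dict:
    """Build a trimmed-down, hypothetical CloudTrail event as seen by a rule."""
    service = event_source.split(".")[0]
    return {
        "source": f"aws.{service}",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {"eventSource": event_source, "eventName": event_name},
    }


if __name__ == "__main__":
    print(json.dumps(IAM_PATTERN, indent=2))
    print(matches(IAM_PATTERN, sample_event("iam.amazonaws.com", "ListUsers")))
    print(matches(IAM_PATTERN, sample_event("s3.amazonaws.com", "PutBucketPolicy")))
```

Because the pattern filters only on the event source, read-only calls such as ListUsers match as well, which is why this default pattern can become noisy.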

Monitor changes to IAM

If you’re interested only in changes to your IAM configuration, you can modify the event pattern in EventBridge (the one you used to set up IAM notifications) by adding an eventName filter pattern such as the following.

    "eventName": [
      "Add*",
      "Attach*",
      "Change*",
      "Create*",
      "Deactivate*",
      "Delete*",
      "Detach*",
      "Enable*",
      "Put*",
      "Remove*",
      "Set*",
      "Update*",
      "Upload*"
    ]

This filter pattern will only match events from the IAM service whose names begin with Add, Attach, Change, Create, Deactivate, Delete, Detach, Enable, Put, Remove, Set, Update, or Upload. For more information about APIs matching these patterns, see the IAM API Reference.
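As a quick way to reason about which API calls the list covers, the following Python sketch treats each entry as a shell-style wildcard, which is how the post describes them. This is an assumption made for illustration: how EventBridge itself evaluates a trailing `*` should be checked against its event pattern documentation. The `is_change` helper is hypothetical.

```python
from fnmatch import fnmatchcase

# The eventName entries from the filter pattern above.
CHANGE_PATTERNS = [
    "Add*", "Attach*", "Change*", "Create*", "Deactivate*", "Delete*",
    "Detach*", "Enable*", "Put*", "Remove*", "Set*", "Update*", "Upload*",
]


def is_change(event_name: str) -> bool:
    """Return True if the IAM API name matches any wildcard in the list."""
    return any(fnmatchcase(event_name, pattern) for pattern in CHANGE_PATTERNS)


if __name__ == "__main__":
    for name in ["CreateUser", "AttachRolePolicy", "DeleteAccessKey", "ListUsers", "GetRole"]:
        print(f"{name}: {'match' if is_change(name) else 'no match'}")
```

Read-only calls such as ListUsers and GetRole fall outside the list, so they no longer generate notifications.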

To edit the filter pattern to monitor only changes to IAM

Open the EventBridge console, navigate to the Event pattern, and choose Edit pattern.

Figure 14: Modifying the event pattern
Add the eventName filter pattern from above to your event pattern.

Figure 15: Use the JSON editor to add the eventName filter pattern
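If you prefer to prepare the edited pattern outside the console, a small Python sketch like the following can generate the JSON to paste into the editor. It assumes the base pattern shown as `BASE_PATTERN` and places eventName under the `detail` section, alongside eventSource, because that is where CloudTrail fields appear in the event. The `with_event_names` helper is hypothetical.

```python
import copy
import json

# Assumed starting point: the pattern created by the console steps earlier.
BASE_PATTERN = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventSource": ["iam.amazonaws.com"]},
}


def with_event_names(pattern: dict, event_names: list) -> dict:
    """Return a copy of the pattern whose detail section also filters on eventName."""
    updated = copy.deepcopy(pattern)  # leave the original pattern untouched
    updated.setdefault("detail", {})["eventName"] = list(event_names)
    return updated


if __name__ == "__main__":
    names = [
        "Add*", "Attach*", "Change*", "Create*", "Deactivate*", "Delete*",
        "Detach*", "Enable*", "Put*", "Remove*", "Set*", "Update*", "Upload*",
    ]
    print(json.dumps(with_event_names(BASE_PATTERN, names), indent=2))
```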

Monitor changes to authentication and authorization configuration

Monitoring changes to authentication (security credentials) and authorization (policy) configurations is critical, because it can alert you to potential security vulnerabilities or breaches. For instance, unauthorized changes to security credentials or policies could indicate malicious activity, such as an attempt to gain unauthorized access to your AWS resources. If you’re only interested in these types of changes, use the preceding steps to implement the following filter pattern.

    "eventName": [
      "Put*Policy",
      "Attach*",
      "Detach*",
      "Create*",
      "Update*",
      "Upload*",
      "Delete*",
      "Remove*",
      "Set*"
    ]

This filter pattern matches calls to IAM that put, attach, or detach policies, or that create, update, upload, delete, remove, or set IAM elements.
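To see how this narrower list differs from the broader change filter, the following Python sketch compares the two against a handful of IAM API names. As before, it assumes shell-style wildcard semantics purely for illustration, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Broad "any change" filter and the narrower authentication/authorization filter.
ALL_CHANGES = [
    "Add*", "Attach*", "Change*", "Create*", "Deactivate*", "Delete*",
    "Detach*", "Enable*", "Put*", "Remove*", "Set*", "Update*", "Upload*",
]
AUTH_CHANGES = [
    "Put*Policy", "Attach*", "Detach*", "Create*", "Update*",
    "Upload*", "Delete*", "Remove*", "Set*",
]


def matches_any(event_name: str, patterns: list) -> bool:
    """True if the API name matches at least one wildcard in the list."""
    return any(fnmatchcase(event_name, p) for p in patterns)


def only_in_broad_filter(event_names: list) -> list:
    """Names the broad filter alerts on but the narrower filter ignores."""
    return [
        name for name in event_names
        if matches_any(name, ALL_CHANGES) and not matches_any(name, AUTH_CHANGES)
    ]


if __name__ == "__main__":
    sample = [
        "PutRolePolicy", "PutUserPermissionsBoundary", "AddUserToGroup",
        "AttachUserPolicy", "ChangePassword", "EnableMFADevice",
    ]
    print(only_in_broad_filter(sample))
```

Under these assumptions, calls such as AddUserToGroup or EnableMFADevice would stop generating alerts with the narrower list, so check that the events you drop are ones you can live without.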

Conclusion

Monitoring IAM security configuration changes gives you another layer of defense against the unexpected. To balance productivity and security, you might grant a user broad permissions in order to facilitate their work, such as exploring new AWS services. Although preventive measures are crucial, they can potentially restrict necessary actions. For example, a developer may need to modify an IAM role for their task, an alteration that could pose a security risk. This change, while essential for their work, may be undesirable from a security standpoint. Thus, it’s critical to have monitoring systems alongside preventive measures, allowing necessary actions while maintaining security.

Create an event rule for IAM events that are important to you and have a response plan ready. You can refer to Security best practices in IAM for further reading on this topic.

If you have questions or feedback about this or any other IAM topic, please visit the IAM re:Post forum. You can also read about the multi-account EventBridge solution on the AWS samples GitHub and learn more about sending and receiving Amazon EventBridge notifications between accounts.


Cloudflare Radar’s new BGP origin hijack detection system

2023-07-28 Mingwei Zhang

Post Syndicated from Mingwei Zhang original http://blog.cloudflare.com/bgp-highjack-detection/

Border Gateway Protocol (BGP) is the de facto inter-domain routing protocol used on the Internet. It enables networks and organizations to exchange reachability information for blocks of IP addresses (IP prefixes) among each other, thus allowing routers across the Internet to forward traffic to its destination. BGP was designed with the assumption that networks do not intentionally propagate falsified information, but unfortunately that’s not a valid assumption on today’s Internet.

Malicious actors on the Internet who control BGP routers can perform BGP hijacks by falsely announcing ownership of groups of IP addresses that they do not own, control, or route to. By doing so, an attacker is able to redirect traffic destined for the victim network to itself, and monitor and intercept its traffic. A BGP hijack is much like if someone were to change out all the signs on a stretch of freeway and reroute automobile traffic onto incorrect exits.

You can learn more about BGP and BGP hijacking and its consequences in our learning center.

At Cloudflare, we have long been monitoring suspicious BGP anomalies internally. With our recent efforts, we are bringing BGP origin hijack detection to the Cloudflare Radar platform, sharing our detection results with the public. In this blog post, we will explain how we built our detection system and how people can use Radar and its APIs to integrate our data into their own workflows.

What is BGP origin hijacking?

Services and devices on the Internet locate each other using IP addresses. A block of IP addresses is called an IP prefix (or just prefix for short), and multiple prefixes from the same organization are aggregated into an autonomous system (AS).

Using the BGP protocol, ASes announce which routes can be imported or exported to other ASes and routers from their routing tables. This is called the AS routing policy. Without this routing information, operating the Internet on a large scale would quickly become impractical: data packets would get lost or take too long to reach their destinations.

During a BGP origin hijack, an attacker creates fake announcements for a targeted prefix, falsely identifying an autonomous system (AS) under their control as the origin of the prefix.

In the following graph, we show an example where AS 4 announces the prefix P that was previously originated by AS 1. The receiving parties, i.e. AS 2 and AS 3, accept the hijacked routes and forward traffic toward prefix P to AS 4 instead.

As you can see, the normal and hijacked traffic flows back in the opposite direction of the BGP announcements we receive.

If successful, this type of attack will result in the dissemination of the falsified prefix origin announcement throughout the Internet, causing network traffic previously intended for the victim network to be redirected to the AS controlled by the attacker. As an example of a famous BGP hijack attack, in 2018 someone was able to convince parts of the Internet to reroute traffic for AWS to malicious servers where they used DNS to redirect MyEtherWallet.com, a popular crypto wallet, to a hacked page.

Prevention mechanisms and why they’re not perfect (yet)

The key difficulty in preventing BGP origin hijacks is that the BGP protocol itself does not provide a mechanism to validate the announcement content. In other words, the original BGP protocol does not provide any authentication or ownership safeguards; any route can be originated and announced by any random network, independent of its rights to announce that route.

To address this problem, operators and researchers have proposed the Resource Public Key Infrastructure (RPKI) to store and validate prefix-to-origin mapping information. With RPKI, operators can prove the ownership of their network resources and create ROAs, short for Route Origin Authorisations, cryptographically signed objects that define which Autonomous System (AS) is authorized to originate a specific prefix.

Cloudflare has been committed to supporting RPKI since the early days of the RFC. With RPKI, IP prefix owners can store and share ownership information securely, and other operators can validate BGP announcements by checking the prefix origin against the information stored in RPKI. Any hijacking attempt that announces an IP prefix with an incorrect origin AS will produce an invalid validation result, and such invalid BGP messages will be discarded. This validation process is referred to as route origin validation (ROV).
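The validation step can be sketched in a few lines of Python. This is a simplified illustration of the valid / invalid / not-found outcomes in the spirit of RFC 6811, not a production validator. The `VRP` tuple and `validate_origin` function are hypothetical names, and the prefixes and ASNs are documentation values.

```python
import ipaddress
from typing import List, NamedTuple


class VRP(NamedTuple):
    """A validated ROA payload: prefix, maximum allowed length, authorized origin."""
    prefix: str
    max_length: int
    asn: int


def validate_origin(prefix: str, origin_asn: int, vrps: List[VRP]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp in vrps:
        roa_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route is inside the VRP's prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by some VRP but never matched: invalid. No covering VRP: not-found.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    vrps = [VRP("192.0.2.0/24", 24, 64496)]
    print(validate_origin("192.0.2.0/24", 64496, vrps))     # legitimate origin
    print(validate_origin("192.0.2.0/24", 64511, vrps))     # wrong origin AS
    print(validate_origin("198.51.100.0/24", 64511, vrps))  # no ROA covers it
```

The not-found outcome is the important one for the discussion that follows: a prefix with no covering ROA gives validators nothing to check against.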

In order to further advocate for RPKI deployment and filtering of RPKI invalid announcements, Cloudflare has been providing an RPKI test service, Is BGP Safe Yet?, allowing users to test whether their ISP filters RPKI invalid announcements. We also provide rich information with regard to the RPKI status of individual prefixes and ASes at https://rpki.cloudflare.com/.

However, the effectiveness of RPKI on preventing BGP origin hijacks depends on two factors:

The ratio of prefix owners who register their prefixes on RPKI;
The ratio of networks performing route origin validation.

Unfortunately, neither ratio is at a satisfactory level yet. As of today, July 27, 2023, only about 45% of the IP prefixes routable on the Internet are covered by some ROA on RPKI. The remaining prefixes are highly vulnerable to BGP origin hijacks. Even the 45% of prefixes that are covered by some ROA can still be affected by origin hijack attempts, due to the low ratio of networks that perform route origin validation (ROV). Based on our recent study, only 6.5% of Internet users are protected by ROV from BGP origin hijacks.

Despite the benefits of RPKI and RPKI ROAs, their effectiveness in preventing BGP origin hijacks is limited by the slow adoption and deployment of these technologies. Until we achieve a high rate of RPKI ROA registration and RPKI invalid filtering, BGP origin hijacks will continue to pose a significant threat to the daily operations of the Internet and the security of everyone connected to it. Therefore, it’s also essential to prioritize developing and deploying BGP monitoring and detection tools to enhance the security and stability of the Internet's routing infrastructure.

Design of Cloudflare’s BGP hijack detection system

Our system comprises multiple data sources and three distinct modules that work together to detect and analyze potential BGP hijack events: prefix origin change detection, hijack detection and the storage and notification module.

The Prefix Origin Change Detection module provides the data, the Hijack Detection module analyzes the data, and the Alerts Storage and Delivery module stores and provides access to the results. Together, these modules work in tandem to provide a comprehensive system for detecting and analyzing potential BGP hijack events.

Prefix origin change detection module

At its core, the BGP protocol involves:

Exchanging prefix reachability (routing) information;
Deciding where to forward traffic based on the reachability information received.

The reachability change information is encoded in BGP update messages while the routing decision results are encoded as a route information base (RIB) on the routers, also known as the routing table.

In our origin hijack detection system, we focus on investigating BGP update messages that contain changes to the origin ASes of any IP prefixes. There are two types of BGP update messages that could indicate prefix origin changes: announcements and withdrawals.

Announcements include an AS-level path toward one or more prefixes. The path tells the receiving parties through which sequence of networks (ASes) one can reach the corresponding prefixes. The last hop of an AS path is the origin AS. In the following diagram, AS 1 is the origin AS of the announced path.

Withdrawals, on the other hand, simply inform the receiving parties that the prefixes are no longer reachable.

Both types of messages are stateless. They inform us of the current route changes, but provide no information about the previous states. As a result, detecting origin changes is not as straightforward as one may think. Our system needs to keep track of historical BGP updates and build some sort of state over time so that we can verify if a BGP update contains origin changes.

We didn't want to deal with a complex system like a database to manage the state of every prefix we see across all the BGP updates we receive. Fortunately, there's a data structure in computer science called a prefix trie that you can use to store and look up string-indexed data, which is ideal for our use case. We ended up developing a fast Rust-based custom IP prefix trie that holds the relevant information, such as the origin ASN and the AS path for each IP prefix, and allows that information to be updated based on BGP announcements and withdrawals.

The figure below shows an example of the AS path information for prefix 192.0.2.0/24 stored on a prefix trie. When updating the information on the prefix trie, if we see a change of origin ASN for any given prefix, we record the BGP message as well as the change and create an Origin Change Signal.
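The idea can be sketched in Python as a small binary trie keyed by the bits of the prefix. This is an illustrative sketch only, not the Rust implementation described above: the `PrefixTrie` class, its method names, and the shape of the returned signal are assumptions, and it handles a single address family (IPv4 in the example) per trie.

```python
import ipaddress
from typing import List, Optional


class _Node:
    __slots__ = ("children", "as_path")

    def __init__(self) -> None:
        self.children = [None, None]            # child for bit 0 and bit 1
        self.as_path: Optional[List[int]] = None  # AS path stored for this prefix


class PrefixTrie:
    """Binary trie holding the latest AS path per prefix (one address family)."""

    def __init__(self) -> None:
        self.root = _Node()

    @staticmethod
    def _bits(prefix: str) -> List[int]:
        net = ipaddress.ip_network(prefix)
        addr, width = int(net.network_address), net.max_prefixlen
        return [(addr >> (width - 1 - i)) & 1 for i in range(net.prefixlen)]

    def _find(self, prefix: str, create: bool = False) -> Optional[_Node]:
        node = self.root
        for bit in self._bits(prefix):
            if node.children[bit] is None:
                if not create:
                    return None
                node.children[bit] = _Node()
            node = node.children[bit]
        return node

    def announce(self, prefix: str, as_path: List[int]) -> Optional[dict]:
        """Store the path; return an origin change signal if the origin changed."""
        node = self._find(prefix, create=True)
        old_path, node.as_path = node.as_path, list(as_path)
        if old_path is not None and old_path[-1] != as_path[-1]:
            return {"prefix": prefix, "old_origin": old_path[-1], "new_origin": as_path[-1]}
        return None

    def withdraw(self, prefix: str) -> None:
        """Forget the stored path for a withdrawn prefix."""
        node = self._find(prefix)
        if node is not None:
            node.as_path = None


if __name__ == "__main__":
    trie = PrefixTrie()
    trie.announce("192.0.2.0/24", [64500, 64496])
    print(trie.announce("192.0.2.0/24", [64501, 64511]))
```

The last hop of the stored path is compared with the last hop of the new path, mirroring the definition of the origin AS given earlier.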

The prefix origin change detection module collects and processes live-stream and historical BGP data from various sources. For live streams, our system applies a thin layer of data processing to translate BGP messages into our internal data structure. At the same time, for historical archives, we use a dedicated deployment of the BGPKIT broker and parser to convert MRT files from RouteViews and RIPE RIS into BGP message streams as they become available.

After the data is collected, consolidated, and normalized, the module creates, maintains, and destroys the prefix tries so that we know what changed relative to previous BGP announcements from the same peers. Based on these calculations, we then send enriched messages downstream to be analyzed.

Hijack detection module

Determining whether BGP messages suggest a hijack is a complex task, and no common scoring mechanism can be used to provide a definitive answer. Fortunately, there are several types of data sources that can collectively provide a relatively good idea of whether a BGP announcement is legitimate or not. These data sources can be categorized into two types: inter-AS relationships and prefix-origin binding.

The inter-AS relationship datasets include AS2org and AS2rel datasets from CAIDA/UCSD, AS2rel datasets from BGPKIT, AS organization datasets from PeeringDB, and per-prefix AS relationship data built at Cloudflare. These datasets provide information about the relationship between autonomous systems, such as whether they are upstream or downstream from one another, or if the origins of any change signal belong to the same organization.

Prefix-to-origin binding datasets include live RPKI validated ROA payload (VRP) from the Cloudflare RPKI portal, daily Internet Routing Registry (IRR) dumps curated and cleaned up by MANRS, and prefix and AS bogon lists (private and reserved addresses defined by RFC 1918, RFC 5735, and RFC 6598). These datasets provide information about the ownership of prefixes and the ASes that are authorized to originate them.

By combining all these data sources, it is possible to collect information about each BGP announcement and answer questions programmatically. For this, we have a scoring function that takes all the evidence gathered for a specific BGP event as the input and runs that data through a sequence of checks. Each condition returns a neutral, positive, or negative weight that keeps adding to the final score. The higher the score, the more likely it is that the event is a hijack attempt.

The following diagram illustrates this sequence of checks:

As you can see, for each event, several checks are involved that help calculate the final score: RPKI, Internet Routing Registry (IRR), bogon prefixes and ASNs lists, AS relationships, and AS path.

Our guiding principles are: if the newly announced origins are RPKI or IRR invalid, it’s more likely that it’s a hijack, but if the old origins are also invalid, then it’s less likely. We discard events about private and reserved ASes and prefixes. If the new and old origins have a direct business relationship, then it’s less likely that it’s a hijack. If the new AS path indicates that the traffic still goes through the old origin, then it’s probably not a hijack.

Signals that are deemed legitimate are discarded, while signals with a high enough confidence score are flagged as potential hijacks and sent downstream for further analysis.
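As a rough illustration of how such a sequence of checks could add up to a score, here is a minimal Python sketch. The `Evidence` fields, the `hijack_score` function, and every weight in it are hypothetical values chosen for the example. The post does not publish the actual weights or conditions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical evidence gathered for one origin change signal."""
    new_origin_rpki_invalid: bool = False
    new_origin_irr_invalid: bool = False
    old_origin_invalid: bool = False
    involves_bogon: bool = False
    direct_relationship: bool = False
    path_via_old_origin: bool = False


def hijack_score(ev: Evidence) -> Optional[int]:
    """Sum illustrative weights; higher means more likely a hijack.
    Returns None when the event is discarded (private or reserved resources)."""
    if ev.involves_bogon:
        return None
    score = 0
    new_invalid = ev.new_origin_rpki_invalid or ev.new_origin_irr_invalid
    if ev.new_origin_rpki_invalid:
        score += 4  # made-up weight
    if ev.new_origin_irr_invalid:
        score += 2  # made-up weight
    if new_invalid and ev.old_origin_invalid:
        score -= 3  # the old origin was invalid too, so less suspicious
    if ev.direct_relationship:
        score -= 3  # origins have a direct business relationship
    if ev.path_via_old_origin:
        score -= 4  # traffic still passes through the old origin
    return score


if __name__ == "__main__":
    print(hijack_score(Evidence(new_origin_rpki_invalid=True, new_origin_irr_invalid=True)))
    print(hijack_score(Evidence(new_origin_rpki_invalid=True, path_via_old_origin=True)))
```

Keeping each condition as an independent weight is what makes it practical to tune the scoring over time without restructuring the checks.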

It's important to reiterate that the decision is not binary but a score. There will be situations where we find false negatives or false positives. The advantage of this framework is that we can easily monitor the results, learn from additional datasets and conduct the occasional manual inspection, which allows us to adjust the weights, add new conditions and continue improving the score precision over time.

Aggregating BGP hijack events

Our BGP hijack detection system provides fast response time and requires minimal resources by operating on a per-message basis.

However, when a hijack is happening, the number of hijack signals can be overwhelming for operators to manage. To address this issue, we designed a method to aggregate individual hijack messages into BGP hijack events, thereby reducing the number of alerts triggered.

An event aggregates BGP messages that are coming from the same hijacker related to prefixes from the same victim. The start date is the same as the date of the first suspicious signal. To calculate the end of an event we look for one of the following conditions:

A BGP withdrawal message for the hijacked prefix: regardless of who sends the withdrawal, the route towards the prefix is no longer via the hijacker, and thus this hijack message is considered finished.
A new BGP announcement message with the previous (legitimate) network as the origin: this indicates that the route towards the prefix is reverted to the state before the hijack, and the hijack is therefore considered finished.

If all BGP messages for an event have been withdrawn or reverted, and there are no more new suspicious origin changes from the hijacker ASN for six hours, we mark the event as finished and set the end date.
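The aggregation rules above can be sketched as a small Python class. The `HijackEvent` class and its method names are hypothetical, and using the time of the last withdrawal or revert as the end date is an assumption. The post only says that an end date is set once the six-hour quiet period has passed.

```python
from datetime import datetime, timedelta
from typing import Optional, Set

QUIET_PERIOD = timedelta(hours=6)  # no new suspicious signals for this long


class HijackEvent:
    """Aggregates signals from one hijacker ASN against one victim ASN."""

    def __init__(self, hijacker_asn: int, victim_asn: int, first_seen: datetime) -> None:
        self.hijacker_asn = hijacker_asn
        self.victim_asn = victim_asn
        self.start = first_seen            # date of the first suspicious signal
        self.last_signal = first_seen
        self.last_resolved: Optional[datetime] = None
        self.open_prefixes: Set[str] = set()
        self.end: Optional[datetime] = None

    def add_signal(self, prefix: str, seen_at: datetime) -> None:
        """Record a new suspicious origin change for a prefix."""
        self.open_prefixes.add(prefix)
        self.last_signal = max(self.last_signal, seen_at)

    def resolve(self, prefix: str, seen_at: datetime) -> None:
        """The prefix was withdrawn or re-announced by the legitimate origin."""
        if prefix in self.open_prefixes:
            self.open_prefixes.remove(prefix)
            self.last_resolved = seen_at

    def try_finish(self, now: datetime) -> bool:
        """Mark the event finished once everything is resolved and quiet."""
        quiet = now - self.last_signal >= QUIET_PERIOD
        if (self.end is None and not self.open_prefixes
                and self.last_resolved is not None and quiet):
            self.end = self.last_resolved  # assumption: end at the last resolution
        return self.end is not None


if __name__ == "__main__":
    t0 = datetime(2023, 7, 27, 12, 0)
    event = HijackEvent(64511, 64496, t0)
    event.add_signal("192.0.2.0/24", t0)
    event.resolve("192.0.2.0/24", t0 + timedelta(minutes=30))
    print(event.try_finish(t0 + timedelta(hours=1)))
    print(event.try_finish(t0 + timedelta(hours=7)))
```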

Hijack events can capture both small-scale and large-scale attacks. Alerts are then based on these aggregated events, not individual messages, making it easier for operators to manage and respond appropriately.

Alerts, Storage and Notifications module

This module provides access to detected BGP hijack events and sends out notifications to relevant parties. It handles storage of all detected events and provides a user interface for easy access and search of historical events. It also generates notifications and delivers them to the relevant parties, such as network administrators or security analysts, when a potential BGP hijack event is detected. Additionally, this module can build dashboards to display high-level information and visualizations of detected events to facilitate further analysis.

Lightweight and portable implementation

Our BGP hijack detection system is implemented as a Rust-based command line application that is lightweight and portable. The whole detection pipeline runs off a single binary application that connects to a PostgreSQL database and essentially runs a complete self-contained BGP data pipeline. And if you are wondering, yes, the full system, including the database, can run well on a laptop.

The runtime cost mainly comes from maintaining the in-memory prefix tries for each full-feed router, each costing roughly 200 MB RAM. For the beta deployment, we use about 170 full-feed peers and the whole system runs well on a single 32 GB node with 12 threads.

Using the BGP Hijack Detection

The BGP Hijack Detection results are now available on both the Cloudflare Radar website and the Cloudflare Radar API.

Cloudflare Radar

Under the “Security & Attacks” section of the Cloudflare Radar for both global and ASN view, we now display the BGP origin hijacks table. In this table, we show a list of detected potential BGP hijack events with the following information:

The detected and expected origin ASes;
The start time and event duration;
The number of BGP messages and route collector peers that saw the event;
The announced prefixes;
Evidence tags and confidence level (on the likelihood of the event being a hijack).

For each BGP event, our system generates relevant evidence tags to indicate why the event is considered suspicious or not. These tags are used to inform the confidence score assigned to each event. Red tags indicate evidence that increases the likelihood of a hijack event, while green tags indicate the opposite.

For example, the red tag "RPKI INVALID" indicates an event is likely a hijack, as it suggests that the RPKI validation failed for the announcement. Conversely, the tag "SIBLING ORIGINS" is a green tag that indicates the detected and expected origins belong to the same organization, making it less likely for the event to be a hijack.

Users can now access the BGP hijacks table in the following ways:

Global view under Security & Attacks page without location filters. This view lists the most recent 150 detected BGP hijack events globally.
When filtered by a specific ASN, the table appears on the Overview, Traffic, and Security & Attacks tabs.

Cloudflare Radar API

We also provide programmable access to the BGP hijack detection results via the Cloudflare Radar API, which is freely available under CC BY-NC 4.0 license. The API documentation is available at the Cloudflare API portal.

The following curl command fetches the most recent 10 BGP hijack events relevant to AS64512.

curl -X GET "https://api.cloudflare.com/client/v4/radar/bgp/hijacks/events?involvedAsn=64512&format=json&per_page=10" \
    -H "Authorization: Bearer <API_TOKEN>"

Users can further filter events with high confidence by specifying the minConfidence parameter with a 0-10 value, where a higher value indicates higher confidence of the events being a hijack. The following example expands on the previous example by adding the minimum confidence score of 8 to the query:

curl -X GET "https://api.cloudflare.com/client/v4/radar/bgp/hijacks/events?involvedAsn=64512&format=json&per_page=10&minConfidence=8" \
    -H "Authorization: Bearer <API_TOKEN>"
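For scripted use, the same query can be assembled in Python with only the standard library. This is a minimal sketch: the `build_hijack_request` helper is hypothetical, the ASN filter is spelled `involvedAsn` here, and the request is only constructed, not sent, so you can inspect it before adding your own token.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.cloudflare.com/client/v4/radar/bgp/hijacks/events"


def build_hijack_request(asn: int, api_token: str, per_page: int = 10,
                         min_confidence: Optional[int] = None) -> Request:
    """Build (but do not send) a GET request for hijack events involving an ASN."""
    params = {"involvedAsn": asn, "format": "json", "per_page": per_page}
    if min_confidence is not None:
        params["minConfidence"] = min_confidence  # 0-10, higher is more confident
    return Request(
        f"{BASE_URL}?{urlencode(params)}",
        headers={"Authorization": f"Bearer {api_token}"},
    )


if __name__ == "__main__":
    request = build_hijack_request(64512, "<API_TOKEN>", min_confidence=8)
    print(request.get_method(), request.full_url)
    # To send it, pass the request to urllib.request.urlopen() and parse the JSON body.
```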

Additionally, users can also quickly build custom hijack alerters using a Cloudflare Workers + KV combination. We have a full tutorial on building alerters that send out webhook-based messages or emails (with Email Routing) available on the Cloudflare Radar documentation site.

More routing security on Cloudflare Radar

As we continue improving Cloudflare Radar, we are planning to introduce additional Internet routing and security data. For example, Radar will soon get a dedicated routing section to provide digestible BGP information for given networks or regions, such as distinct routable prefixes, RPKI valid/invalid/unknown routes, distribution of IPv4/IPv6 prefixes, etc. Our goal is to provide the best data and tools for routing security to the community, so that we can build a better and more secure Internet together.

Visit Cloudflare Radar for additional insights around Internet disruptions, routing issues, Internet traffic trends, attacks, Internet quality, and more. Follow us on social media at @CloudflareRadar (Twitter), cloudflare.social/@radar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via e-mail.

Introducing thresholds in Security Event Alerting: a z-score love story

2022-08-30 Kristina Galicova

Post Syndicated from Kristina Galicova original https://blog.cloudflare.com/introducing-thresholds-in-security-event-alerting-a-z-score-love-story/

Today we are excited to announce thresholds for our Security Event Alerts: a new and improved way of detecting anomalous spikes of security events on your Internet properties. Previously, our calculations were based on z-score methodology alone, which was able to determine most of the significant spikes. By introducing a threshold, we are able to make alerts more accurate and only notify you when it truly matters. One can think of it as a romance between the two strategies. This is the story of how they met.

Author’s note: as an intern at Cloudflare I got to work on this project from start to finish from investigation all the way to the final product.

Once upon a time

In the beginning, there were Security Event Alerts. Security Event Alerts are notifications that are sent whenever we detect a threat to your Internet property. As the name suggests, they track the number of security events, which are requests to your application that match security rules. For example, you can configure a security rule that blocks access from certain countries. Every time a user from one of those countries tries to access your Internet property, the request is logged as a security event. While a security event may be harmless and fired as a result of the natural flow of traffic, it is important to alert on instances when a rule is fired more times than usual. Anomalous spikes of too many security events in a short period of time can indicate an attack. To find these anomalies and distinguish between the natural number of security events and that which poses a threat, we need a good strategy.

The lonely life of a z-score

Before a threshold entered the picture, our strategy worked only on the basis of a z-score. Z-score is a methodology that looks at the number of standard deviations a certain data point is from the mean. In our current configuration, if a spike crosses the z-score value of 3.5, we send you an alert. This value was decided on after careful analysis of our customers’ data, finding it the most effective in determining a legitimate alert. Any lower and notifications will get noisy for smaller spikes. Any higher and we may miss out on significant events. You can read more about our z-score methodology in this blog post.

The following graphs are an example of how the z-score method works. The first graph shows the number of security events over time, with a recent spike.

To determine whether this spike is significant, we calculate the z-score and check if the value is above 3.5:

As the graph shows, the deviation is above 3.5 and so an alert is triggered.

However, relying on z-score becomes tricky for domains that experience no security events for a long period of time. With many security events at zero, the mean and standard deviation depress to zero as well. When a non-zero value finally appears, it will always be infinite standard deviations away from the mean. As a result, it will always trigger an alert even on spikes that do not pose any threat to your domain, such as the below:

With five security events, you are likely going to ignore this spike, as it is too low to indicate a meaningful threat. However, the z-score in this instance will be infinite:

Since a z-score of infinity is greater than 3.5, an alert will be triggered. This means that customers with few security events would often be overwhelmed by event alerts that are not worth worrying about.
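The behavior described so far can be reproduced with a few lines of Python. This is a minimal sketch, not the production algorithm: it assumes the mean and standard deviation are taken over the history preceding the current data point, it uses the population standard deviation, and the event counts are made up for illustration.

```python
import math
import statistics

Z_SCORE_LIMIT = 3.5  # alerting value quoted in the post


def z_score(history, current):
    """Return how many standard deviations `current` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population standard deviation (an assumption)
    if std == 0:
        # A flat history has no spread, so any different value is infinitely far away.
        if current == mean:
            return 0.0
        return math.inf if current > mean else -math.inf
    return (current - mean) / std


busy_history = [100, 110, 90, 105, 95, 100]  # hypothetical counts per interval
quiet_history = [0, 0, 0, 0, 0, 0]           # a domain with no security events

print(z_score(busy_history, 150) > Z_SCORE_LIMIT)  # a real spike on a busy domain
print(z_score(quiet_history, 5) > Z_SCORE_LIMIT)   # five events after a run of zeros
```

Both checks come out true: the busy domain alerts on a genuine spike, and the quiet domain alerts on a harmless five events because its z-score is infinite.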

Letting go of zeros

To avoid the mean and standard deviation becoming zero and thus alerting on every non-zero spike, zero values can be ignored in the calculation. In other words, to calculate the mean and standard deviation, only data points that are higher than zero will be considered.

With those conditions, the same spike to five security events will now generate a different z-score:

Great! With the z-score at zero, it will no longer trigger an alert on the harmless spike!

But what about spikes that could be harmful? When calculations ignore zeros, we need enough non-zero data points to accurately determine the mean and standard deviation. If only one non-zero value is present, that data point determines the mean and standard deviation. As such, the mean will always be equal to the spike, the z-score will always be zero, and an alert will never be triggered:

For a spike of 1000 events, we can tell that there is something wrong and we should trigger an alert. However, because there is only one non-zero data point, the z-score will remain zero:

The z-score does not cross the value 3.5 and an alert will not be triggered.
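A short Python sketch makes the problem with ignoring zeros concrete. It assumes, as the description above implies, that the current data point is part of the window used for the mean and standard deviation, and that a lone non-zero value is scored as zero; the function name and the sample windows are hypothetical.

```python
import statistics


def z_score_ignoring_zeros(window):
    """Z-score of the last value in `window`, computed over its non-zero values only."""
    current = window[-1]
    if current <= 0:
        return 0.0  # the latest interval had no events, so there is nothing to score
    non_zero = [value for value in window if value > 0]
    mean = statistics.fmean(non_zero)
    std = statistics.pstdev(non_zero)
    if std == 0:
        # Every non-zero value equals the current one, so the deviation is zero.
        return 0.0
    return (current - mean) / std


print(z_score_ignoring_zeros([0, 0, 0, 0, 0, 5]))     # harmless spike
print(z_score_ignoring_zeros([0, 0, 0, 0, 0, 1000]))  # spike that deserves an alert
```

Both calls return a z-score of zero, so the harmless spike and the 1000-event spike are indistinguishable under this scheme.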

So what’s better? Including zeros in our calculations can skew the results for domains with too many zero events and alert them every time a spike appears. Not including zeros is mathematically wrong and will never alert on these spikes.

Threshold, the prince charming

Clearly, a z-score is not enough on its own.

Instead, we paired up the z-score with a threshold. The threshold represents the raw number of security events an Internet property can have, below which an alert will not be sent. While z-score checks whether the spike is at least 3.5 standard deviations above the mean, the threshold makes sure it is above a certain static value. If both of these conditions are met, we will send you an alert:

The above spike crosses the threshold of 200 security events. We now have to check that the z-score is above 3.5:

The z-score value crosses 3.5 and an alert will be sent.

A threshold for the number of security events comes as the perfect complement. By itself, the threshold cannot determine whether something is a spike, and would simply alert on any value crossing it. This blog post describes in more detail why thresholds alone do not work. However, when paired with z-score, they are able to share their strengths and cover for each other’s weaknesses. If the z-score falsely detects an insignificant spike, the threshold will stop the alert from triggering. Conversely, if a value does cross the security events threshold, the z-score ensures there is a reasonable variance from the data average before allowing an alert to be sent.
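Put together, the pairing can be sketched as a single check in Python. This is an illustration under stated assumptions, not the production logic: the history excludes the current data point, the standard deviation is the population one, and "crossing" the threshold is read as strictly greater than 200.

```python
import math
import statistics

Z_SCORE_LIMIT = 3.5    # z-score value quoted in the post
EVENT_THRESHOLD = 200  # raw security event threshold quoted in the post


def should_alert(history, current):
    """Alert only when `current` crosses both the raw threshold and the z-score limit."""
    if current <= EVENT_THRESHOLD:
        return False  # too few events to matter, whatever the z-score says
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        z = math.inf if current > mean else 0.0
    else:
        z = (current - mean) / std
    return z > Z_SCORE_LIMIT


quiet_history = [0, 0, 0, 0, 0, 0]                  # hypothetical quiet domain
busy_history = [900, 1000, 1100, 950, 1050, 1000]   # hypothetical busy domain

print(should_alert(quiet_history, 5))     # infinite z-score, but below the threshold
print(should_alert(quiet_history, 1000))  # above the threshold and a clear spike
print(should_alert(busy_history, 1020))   # above the threshold, but normal for this domain
```

Only the second case alerts. The threshold silences the five-event blip, and the z-score silences the busy domain whose normal traffic already sits above 200.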

The invaluable value

To foster a successful relationship between the z-score and security events threshold, we needed to determine the most effective threshold value. After careful analysis of previous attacks on our customers, we set the value to 200. This number is high enough to filter out the smaller, noisier spikes, but low enough to expose any threats.

Am I invited to the wedding?

Yes, you are! The z-score and threshold relationship is already enabled for all WAF customers, so all you need to do is sit back and relax. For enterprise customers, the threshold will be applied to each type of alert enabled on your domain.

Happily ever after

The story certainly does not end here. We are constantly iterating on our alerts, so keep an eye out for future updates on the road to make our algorithms even more personalized for your Internet properties!

Cloudflare Observability

2022-03-18 Tanushree Sharma

Post Syndicated from Tanushree Sharma original https://blog.cloudflare.com/vision-for-observability/

Whether you’re a software engineer deploying a new feature, a network engineer updating routes, or a security engineer configuring a new firewall rule, you need visibility to know if your system is behaving as intended and, if it’s not, to know how to fix it.

Cloudflare is committed to helping our customers get visibility into the services they have protected behind Cloudflare. Being a single pane of glass for all network activity has always been one of Cloudflare’s goals. Today, we’re outlining the future vision for Cloudflare observability.

What is observability?

Observability means gaining visibility into the internal state of a system. It’s used to give users the tools to figure out what’s happening, where it’s happening, and why.

At Cloudflare, we believe that observability has three core components: monitoring, analytics, and forensics. Monitoring measures the health of a system – it tells you when something is going wrong. Analytics give you the tools to visualize data to identify patterns and insights. Forensics helps you answer very specific questions about an event.

Observability becomes particularly important in the context of security to validate that any mitigating actions performed by our security products, such as Firewall or Bot Management, are not false positives. Was that request correctly classified as malicious? And if it wasn’t, which detection system classified it as such?

Cloudflare, additionally, has products to improve performance of applications and corporate networks and allow developers to write lightning fast code that runs on our global network. We want to be able to provide our customers with insights into every request, packet, and fetch that goes through Cloudflare’s network.

Monitoring and Notifying

Analytics are fantastic for summarizing data, but how do you know when to look at them? No one wants to sit on the dashboard clicking refresh over and over again just in case something looks off. That’s where notifications come in.

When we talk about something “looking off” on an analytics page, what we really mean is that there’s a significant change in your traffic or network which is reflected by spikes or drops in our analytics. Availability and performance directly affect end users, and our goal is to monitor and notify our customers as soon as we see things going wrong.

Today, we have many different types of notifications from Origin Error Rates, Security Events, and Advanced Security Events to Usage Based Billing and Health Checks. We’re continuously adding more notification types to have them correspond with our awesome analytics. As our analytics get more customizable, our notifications will as well.

There are tons of different algorithms that can be used to detect spikes, including burn rates and z-scores. We’re continuing to iterate on the algorithms that we use for detections to offer more variations, make them smarter, and make sure that our notifications are both accurate and not too noisy.

Analytics

So, you’ve received an alert from Cloudflare. What comes next?

Analytics can be used to get a bird’s-eye view of traffic or focus on specific types of events by adding filters and time ranges. After you receive an alert, we want to show you exactly what’s been triggered through graphs, high level metrics, and top Ns on the Cloudflare dashboard.

Whether you’re a developer, security analyst, or network engineer, the Cloudflare dashboard should be the spot for you to see everything you need. We want to make the dashboard more customizable to serve the diverse use cases of our customers. Analyze data by specifying a timeframe and filter through dropdowns on the dashboard, or build your own metrics and graphs that work alongside the raw logs to give you a clear picture of what’s happening.

Focusing on security, we believe analytics are the best tool to build confidence before deploying security policies. Moving forward, we plan to layer all of our security related detection signals on top of HTTP analytics so you can use the dashboard to answer questions such as: if I were to block all requests that the WAF identifies as an XSS attack, what would I block?

Customers using our enterprise Bot Management may already be familiar with this experience, and as we improve it and build upon it further, all of our other security products will follow.

Analytics are a powerful tool to see high level patterns and identify anomalies that indicate that something unusual is happening. We’re working on new dashboards, customizations, and features that widen the use cases for our customers. Stay tuned!

Logs

Logs are used when you want to examine specific details about an event. They consist of a timestamp and fields that describe the event and are used to get visibility on a granular level when you need a play-by-play.

In each of our datasets, an event measures something different. For example, in HTTP request logs, an event is when an end user requests content from or sends content to a server. For Firewall logs, an event occurs when the Firewall takes an action on an HTTP request. There can be multiple Firewall events for each HTTP request.

Today, our customers access logs using Logpull, Logpush, or Instant Logs. Logpull and Logpush are great for customers that want to send their logs to third parties (like our Analytics Partners) to store, analyze, and correlate with other data sources. With Instant Logs, our customers can monitor and troubleshoot their traffic in real-time straight from the dashboard or CLI. We’re planning on building out more capabilities to dig into logs on Cloudflare. We’re hard at work on building log storage on R2 – but what’s next?

We’ve heard from customers that the activity log on the Firewall analytics dashboard is incredibly useful. We want to continue to bring the power of logs to the dashboard by adding the same functionality across our products. For customers that will store their logs on Cloudflare R2, this means that we can minimize the use of sampled data.

If you’re looking for something very specific, querying logs is also important, which is where forensics comes in. The goal is to let you investigate from high level analytics all the way down to the individual log lines that make them up. Given a unique identifier, such as the ray ID, you should be able to look up a single request, and then correlate it with all other related activity. Find out the client IP of that ray ID and from there, use cases are plentiful: what other requests from this IP are malicious? What paths did the client follow?
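As a rough illustration of that workflow, here is a small Python sketch that looks up a request by ray ID and then pulls every other request from the same client IP. The records are invented, the helper names are hypothetical, and the field names (RayID, ClientIP, ClientRequestPath) are assumed to match the HTTP request log fields you have exported.

```python
def find_request(records, ray_id):
    """Return the single log record with the given ray ID, or None."""
    for record in records:
        if record.get("RayID") == ray_id:
            return record
    return None


def related_requests(records, ray_id):
    """Return every other record that shares the client IP of the given ray ID."""
    target = find_request(records, ray_id)
    if target is None:
        return []
    client_ip = target.get("ClientIP")
    return [
        record
        for record in records
        if record.get("ClientIP") == client_ip and record.get("RayID") != ray_id
    ]


# Hypothetical log records; real ones would come from your exported logs.
logs = [
    {"RayID": "ray-aaa", "ClientIP": "192.0.2.10", "ClientRequestPath": "/login"},
    {"RayID": "ray-bbb", "ClientIP": "198.51.100.7", "ClientRequestPath": "/"},
    {"RayID": "ray-ccc", "ClientIP": "192.0.2.10", "ClientRequestPath": "/admin"},
]

paths = [record["ClientRequestPath"] for record in related_requests(logs, "ray-aaa")]
print(paths)  # paths the same client requested elsewhere
```

A linear scan is fine for a handful of records; at real log volumes you would push the same two filters into whatever query engine holds your logs.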

Tracing

Logs are really useful, but they don’t capture the context around a request. Traces show the end-to-end life cycle of a request from when a user requests a resource to each of the systems that are involved in its delivery. They’re another way of applying forensics to help you find something very specific.

These are used to differentiate each part of the application to identify where errors or bottlenecks are occurring. Let’s say that you have a Worker that performs a fetch event to your origin and a third party API. Analytics can show you average execution times and error rates for your Worker, but it doesn’t give you visibility into each of these operations.

Using wrangler dev and console.log statements are really helpful ways to test and debug your code. They bring some of the visibility that’s needed, but it can be tedious to instrument your code like this.

As a developer, you should have the tools to understand what’s going on in your applications so you can deliver the best experience to your end users. We can help you answer questions like: Where is my Worker execution failing? Which operation is causing a spike in latency in my application?

Putting it all together

Notifications, analytics, logs, and tracing each have their distinct use cases, but together, these are powerful tools to provide analysts and developers visibility. Looking forward, we’re excited to bring more and more of these capabilities on the Cloudflare dashboard.

We would love to hear from you as we build these features out. If you’re interested in sharing use cases and helping shape our roadmap, contact your account team!

Cloudflare customers on Free plans can now also get real-time DDoS alerts

2022-01-17 Omer Yoachimik

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/free-ddos-alerts/

We’re excited to announce that customers using our Free plan can now get real-time alerts about HTTP DDoS attacks that were automatically detected and mitigated by Cloudflare. The real-time DDoS alerts were originally announced over a year ago but were made available only to customers on the Pro plan or higher. This announcement extends the DDoS alerts feature to Free plan users. You can read the original announcement blog post here.

What is a DDoS attack?

A Distributed Denial of Service (DDoS) attack is a cyber-attack that attempts to disrupt your online business. Whether your business relies on VoIP servers, UDP-based gaming servers, or HTTP servers, DDoS attacks can be used to disrupt any type of Internet property, server, or network.

In this blog post, we’ll focus on DDoS attacks that target HTTP servers. Whether your HTTP server is powering a mobile app, an eCommerce website, an API gateway, or any other HTTP application, if an attacker sends you more requests than it can handle, your server won’t be able to serve your real users. A flood of requests can cause service disruptions or even take your entire server offline. DDoS attacks can have real-world consequences such as a blow to your revenue and reputation.

How Cloudflare detects and mitigates DDoS attacks

Protecting your server against DDoS attacks requires two main capabilities:

The bandwidth to absorb both your users’ requests and the attack requests
The ability to differentiate between your users’ requests and the attack requests

Using our home-grown systems, we do just that, regardless of the size, frequency and duration of the attacks. All Cloudflare customers, including those using the Free plan, are protected by our unmetered DDoS mitigation commitment.

To protect against DDoS attacks, first, we route your traffic to our network of data centers. Our network spans more than 250 cities in over 100 countries around the world. Its capacity is over 100 Tbps — fifty times larger than the largest attack we’ve ever seen. Our bandwidth is more than enough to absorb both your users’ traffic and attack traffic.

Cloudflare’s global network

Second, once your traffic reaches our data centers, it goes through state-of-the-art analysis mechanisms that constantly scan for DDoS attacks. Once an attack is detected, a real-time mitigation rule is automatically generated to surgically mitigate the attack requests based on the attack pattern, whilst leaving your users’ requests untouched. Using the HTTP DDoS Managed Ruleset you can customize the settings of the mitigation system to tailor it to your needs and specific traffic patterns.

Not sure what to do? That’s ok. For the most part, you won’t need to do anything and our system will automatically keep your servers protected. You can read more about it in our Get Started guide or in the original blog post. If you’re interested, you can also read more about how our mitigation system works in this technical blog post: A deep-dive into Cloudflare’s autonomous edge DDoS protection

Configuring a DDoS alert

Once our system detects and mitigates a DDoS attack, you’ll receive a real-time alert. To receive an alert, make sure you, first, configure a notification policy by following these steps:

Log in to the Cloudflare dashboard and select your account.
In the Home Screen, go to Notifications.
Click Add and choose the HTTP DDoS Attack Alerter.
Give your alert a name, an optional description, add the recipients’ email addresses and click Create.

To learn more about DDoS alerts and supported delivery methods, check out our guide Understanding Cloudflare DDoS Alerts.

Free DDoS protection, control, and visibility

Cloudflare’s mission is to help build a better Internet, and it guides everything we do. As part of this mission, we believe that a better Internet is one where enterprise-grade DDoS protection is available for everyone, not just bigger organizations.

Furthermore, we’ve also made our DDoS Managed Ruleset available for everyone to make sure that even non-paying customers can tailor and optimize their DDoS protection settings. Taking a step further, we want all of our users to be able to react as fast as possible when needed. This is why we’re providing real-time alerts for free. Knowledge is power, and notifying our users of attacks in real-time empowers them to ensure their website is safe, available, and performant.

Not using Cloudflare yet? Start now.

What’s new with Notifications?

2021-12-11 Natasha Wissmann

Post Syndicated from Natasha Wissmann original https://blog.cloudflare.com/whats-new-with-notifications/

Back in 2019, we blogged about our brand new Notification center as a centralized hub for configuring notifications on your account. Since then, we’ve talked a lot about new types of notifications you can set up, but not as much about updates to the notification platform itself. So what’s new with Notifications?

Why we care about notifications

We know that notifications are incredibly important to our customers. Cloudflare sits in between your Internet property and the rest of the world. When something goes wrong, you want to know right away because it could have a huge impact on your end users. However, you don’t want to have to sit on the Cloudflare Dashboard all day, pressing refresh on analytics pages over and over just to make sure that you don’t miss anything important. This is where Notifications come in. Instead of requiring you to actively monitor your Internet properties, you want Cloudflare to be able to directly inform you when something might be going wrong.

Cloudflare has many different notification types to ensure that you don’t miss anything important. We have notifications to inform you that you’ve been DDoS’d, or that the Firewall is blocking more requests than normal, or that your origin is seeing high levels of 5xx errors, or even that your Workers script’s CPU usage is above average. We’re constantly adding new notifications, so make sure to check our Cloudflare Development Docs to see what’s new!

Emails are out, webhooks are in

So we have all of these super great notifications, but how do we actually inform you of an event? The classic answer is “we email you.” All of our customers have the ability to configure notifications to send to the email addresses of their choosing.

However, email isn’t always the optimal choice. What happens when an email gets sent to spam, or filtered out into another folder that you rarely check? What if you’re a person who never cleans out their inbox and has four thousand unread emails that can drown out new important emails that come in? You want a way for notifications to go directly to the messaging platform that you check the most, whether that’s Slack or Microsoft Teams or Discord or something else entirely. For customers on our Professional, Business, and Enterprise plans, this is where webhooks come in.

Webhooks are incredibly powerful! They’re a type of API with a simple, standardized behavior. They allow one service (Cloudflare) to send events directly to another service. This destination service can be nearly anything: messaging platforms, data management systems, workflow automation systems, or even your own internal APIs.

While Cloudflare has had first class support for webhooking into Slack, Microsoft Teams, Google Chat, and customers’ own APIs for a while, we’ve recently added support for DataDog, Discord, OpsGenie, and Splunk as well. You can read about how to set up each of those types of webhooks in our Cloudflare Development Docs.
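If the destination is your own internal API, the receiving side can be very small. The sketch below uses only the Python standard library and assumes the notification arrives as a JSON body in an HTTP POST; the "text" field, the handler names, and the port are all hypothetical, so check the payload your webhook actually receives before relying on any field.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_notification(body):
    """Decode a JSON webhook body and return a one-line summary of it."""
    payload = json.loads(body.decode("utf-8"))
    # "text" is an assumed field name; fall back to the whole payload if it is absent.
    return str(payload.get("text", payload))


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POSTed notifications and print a summary of each one."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_notification(self.rfile.read(length)))
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port=8080):
    """Run the receiver locally until interrupted."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver on localhost. In practice you would put it behind HTTPS and verify that each request really comes from the sender you expect before acting on it.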

Because webhooks are so versatile, more and more customers are using them! The number of webhooks configured within Cloudflare’s notification system doubles, on average, every three months. Customers can configure webhooks in the Notifications tab in the dashboard.

Those who forget history are doomed to repeat it

Webhooks are cool, but they still leave room for error. What happens when you receive a notification but accidentally delete it? Or when someone new starts at your company, but you forget to update the notification settings to send to the new employee?

Before now, Cloudflare notifications were entirely point in time. We sent you a notification via your preferred method, and we no longer had any visibility into that notification. If that notification got lost on your end, we didn’t have any way to help recover the information it contained.

Notification history fixes that exact issue. Users are able to see a log of the notifications that were sent, when they were sent, and who they were sent to. Customers on Free, Professional, or Business plans are able to see notification history for the past 30 days. Customers on Enterprise plans are able to see notification history for the past 90 days.

Right now, notification history is only available via API, but stay tuned for updates about viewing directly in the Cloudflare Dashboard!
