Tag Archives: observability

Intelligent, automatic restarts for unhealthy Kafka consumers

Post Syndicated from Chris Shepherd original https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/

At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.

We have learned a lot about keeping our applications that leverage Kafka healthy, so they can always be operational. Application health checks are notoriously hard to implement: what determines whether an application is healthy? How can we keep services operational at all times?

Health checks can be implemented in many ways. We’ll talk about an approach that allows us to considerably reduce incidents involving unhealthy applications while requiring less manual intervention.

Kafka at Cloudflare

Cloudflare is a big adopter of Kafka. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in this post.

Kafka is used to send and receive messages. Messages represent some kind of event like a credit card payment or details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro and so on.

Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an external system writes an event, it is appended to the end of that topic. These events are not deleted from the topic by default (although retention can be applied).

Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking one topic’s log file into many logs, each of which can be hosted on separate servers – enabling topics to scale.

Topics are managed by brokers–nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads and replicating partitions among themselves.

Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.

Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.

Each topic can be read by any number of different consumers, as long as they use a different id. Each consumer can replay the same messages as many times as they want.

When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst each consumer in the group. This division is determined by the consumer group leader. This leader will receive information about the other consumers in the group and will decide which consumers will receive messages from which partitions (partition strategy).

The offset of a consumer’s commit can demonstrate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.

A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are. Lag tracks the time elapsed between messages being written to and read from a topic. When a service is lagging behind, it means that messages are being consumed at a slower rate than they are being produced.

Due to Cloudflare’s scale, message rates typically end up being very large, and a lot of requests are time-sensitive, so monitoring lag is vital.

At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.

Health checks for Kubernetes apps

Kubernetes uses probes to understand if a service is healthy and is ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the service.

When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pod. In the case of Kafka applications this is not relevant, as they don’t run an HTTP server. For this reason, we’ll cover only liveness checks.

A classic Kafka liveness check done on a consumer checks the status of the connection with the broker. It’s often best practice to keep these checks simple and perform some basic operations – in this case, something like listing topics. If this check fails consistently for any reason, for instance because the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, thereby forcing a new connection. Simple Kafka liveness checks do a good job of understanding when the connection with the broker is unhealthy.
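To make this concrete, here is a minimal sketch of how such a liveness check might be declared in a Kubernetes Deployment, assuming the consumer’s container ships a small health check command; the binary path and timings below are illustrative, not Cloudflare’s actual configuration:

livenessProbe:
  exec:
    # Hypothetical helper that performs the basic broker check
    # (for example, listing topics) and exits non-zero on failure.
    command: ["/usr/local/bin/kafka-healthcheck"]
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10
  failureThreshold: 3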

Problems with Kafka health checks

Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!) and in many cases the replica count of our consuming service doesn’t necessarily match the number of partitions on the Kafka topic. This can mean that in a lot of scenarios this simple approach to health checking is not quite enough!

Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals when messages are being published to a topic. When such services are not committing offsets as expected, it means that the consumer is in a bad state, and it will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes, which forces a reconnection and rebalance.

When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.

When a rebalance happens, each consumer is notified to stop consuming. Some consumers might get their assigned partitions taken away and re-assigned to another consumer. We noticed that, within our library implementation, if the consumer doesn’t acknowledge this command it will wait indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.

Intelligent health checks

As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.

Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.

The PagerDuty approach

PagerDuty wrote an excellent blog on this topic which we used as inspiration when coming up with our approach.

Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.

We check that the consumer is moving forward by ensuring that the latest offset is changing (new messages are being received) and that the committed offset is changing as well (the new messages are being processed).

Therefore, the solution we came up with:

  • If we cannot read the current offset, fail liveness probe.
  • If we cannot read the committed offset, fail liveness probe.
  • If the committed offset == the current offset, pass liveness probe.
  • If the value for the committed offset has not changed since the last run of the health check, fail liveness probe.

To measure whether the committed offset is changing, we need to store the value from the previous run. We do this using an in-memory map where the partition number is the key. This means each instance of our service only has a view of the partitions it is currently consuming from, and will run the health check for each.
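To illustrate, here is a minimal sketch of that health check logic, assuming a small helper interface for reading offsets from the broker; the names are illustrative rather than our actual implementation:

// OffsetReader is a hypothetical abstraction over the Kafka client.
type OffsetReader interface {
  Latest(partition int32) (int64, error)    // newest offset in the partition
  Committed(partition int32) (int64, error) // last offset committed by our group
}

// healthy returns false if any partition fails the checks described above.
// lastCommitted holds the committed offsets recorded on the previous run.
func healthy(offsets OffsetReader, lastCommitted map[int32]int64) bool {
  for partition, previous := range lastCommitted {
     current, err := offsets.Latest(partition)
     if err != nil {
        return false // cannot read the current offset
     }
     committed, err := offsets.Committed(partition)
     if err != nil {
        return false // cannot read the committed offset
     }
     if committed == current {
        continue // caught up with the topic, nothing left to process
     }
     if committed == previous {
        return false // no forward progress since the last check
     }
     lastCommitted[partition] = committed // remember for the next run
  }
  return true
}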

Problems

When we first rolled out our smart health checks we started to notice cascading failures some time after release. After initial investigations we realised this was happening whenever a rebalance occurred. It would initially affect one replica, then quickly result in the others reporting as unhealthy.

What we observed came down to storing the previous value of the committed offset in memory: when a rebalance happens, the service may get re-assigned a different partition. When this happened, our service incorrectly assumed that the committed offset for that partition had not changed (as this specific replica was no longer updating the latest value), so it would start to report the service as unhealthy. The failing liveness probe would then cause it to restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.

Solution

To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.

This is handled by receiving the signal from the session context channel:

for {
  select {
  case message, ok := <-claim.Messages(): // <-- Message received
     if !ok {
        // Messages channel was closed, stop consuming this claim
        return nil
     }

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset

     // Handle message
     handleMessage(ctx, message)

     // Commit message offset
     session.MarkMessage(message, "")

  case <-session.Context().Done(): // <-- Rebalance happened

     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())

     // Stop consuming so the consumer group can rebalance
     return nil
  }
}
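For context, this loop lives inside Sarama’s ConsumeClaim callback. A rough sketch of how it might be wired into a consumer group, with illustrative names and error handling kept to a minimum (this is not Cloudflare’s actual implementation), could look like this:

import (
  "context"

  "github.com/Shopify/sarama"
)

type consumer struct {
  offsetMap map[int32]int64 // partition -> last received offset
}

func (c *consumer) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (c *consumer) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (c *consumer) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
  // The for/select loop shown above goes here, reading from
  // claim.Messages() and watching session.Context().Done() for rebalances.
  return nil
}

func run(ctx context.Context, brokers []string, group, topic string) error {
  cfg := sarama.NewConfig()
  cfg.Version = sarama.V2_1_0_0 // match your broker version

  client, err := sarama.NewConsumerGroup(brokers, group, cfg)
  if err != nil {
     return err
  }
  defer client.Close()

  handler := &consumer{offsetMap: make(map[int32]int64)}
  for ctx.Err() == nil {
     // Consume blocks until a rebalance or an error; calling it in a
     // loop re-joins the group after each rebalance.
     if err := client.Consume(ctx, []string{topic}, handler); err != nil {
        return err
     }
  }
  return ctx.Err()
}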

Verifying this solution was straightforward: we just needed to trigger a rebalance. To test that this worked in all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, then proceeded to scale up the number of replicas until it matched the partition count, and then scaled back down to a single replica. By doing this we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.

Takeaways

Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside of working hours) and a service which is self-healing.

However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this was to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.

Monitor your own network with free network flow analytics from Cloudflare

Post Syndicated from Chris Draper original https://blog.cloudflare.com/free-magic-network-monitoring/

As a network engineer or manager, answering questions about the traffic flowing across your infrastructure is a key part of your job. Cloudflare built Magic Network Monitoring (previously called Flow Based Monitoring) to give you better visibility into your network and to answer questions like, “What is my network’s peak traffic volume? What are the sources of that traffic? When does my network see that traffic?” Today, Cloudflare is excited to announce early access to a free version of Magic Network Monitoring that will be available to everyone. You can request early access by filling out this form.

Magic Network Monitoring now features a powerful analytics dashboard, self-serve configuration, and a step-by-step onboarding wizard. You’ll have access to a tool that helps you visualize your traffic and filter by packet characteristics including protocols, source IPs, destination IPs, ports, TCP flags, and router IP. Magic Network Monitoring also includes network traffic volume alerts for specific IP addresses or IP prefixes on your network.

Making Network Monitoring easy

Magic Network Monitoring allows customers to collect network analytics without installing a physical device like a network TAP (Test Access Point) or setting up overly complex remote monitoring systems. Our product works with any hardware that exports network flow data, and customers can quickly configure any router to send flow data to Cloudflare’s network. From there, our network flow analyzer will aggregate your traffic data and display it in Magic Network Monitoring analytics.

Analytics dashboard

In Magic Network Monitoring analytics, customers can take a deep dive into their network traffic data. You can filter traffic data by protocol, source IP, destination IP, TCP flags, and router IP. Customers can combine these filters together to answer questions like, “How much ICMP data was requested from my speed test server over the past 24 hours?” Visibility into traffic analytics is a key part of understanding your network’s operations and proactively improving your security. Let’s walk through some cases where Magic Network Monitoring analytics can answer your network visibility and security questions.

Create network volume alert thresholds per IP address or IP prefix

Magic Network Monitoring is incredibly flexible, and it can be customized to meet the needs of any network hobbyist or business. You can monitor your traffic volume trends over time via the analytics dashboard and build an understanding of your network’s traffic profile. After gathering historical network data, you can set custom volumetric threshold alerts for one IP prefix or a group of IP prefixes. As your network traffic changes over time, or your network expands, you can easily update your Magic Network Monitoring configuration to receive data from new routers or destinations within your network.

Monitoring a speed test server in a home lab

Let’s run through an example where you’re running a network home lab. You decide to use Magic Network Monitoring to track the volume of requests a speed test server you’re hosting receives and check for potential bad actors. Your goal is to identify when your speed test server experiences peak traffic, and the volume of that traffic. You set up Magic Network Monitoring and create a rule that analyzes all traffic destined for your speed test server’s IP address. After collecting data for seven days, the analytics dashboard shows that peak traffic occurs on weekdays in the morning, and that during this time, your traffic volume ranges from 450 – 550 Mbps.

As you’re checking over the analytics data, you also notice strange traffic spikes of 300 – 350 Mbps in the middle of the night that occur at the same time. As you investigate further, the analytics dashboard shows the source of this traffic spike is from the same IP prefix. You research some source IPs, and find they’re associated with malicious activity. As a result, you update your firewall to block traffic from this problematic source.

Identifying a network layer DDoS attack

Magic Network Monitoring can also be leveraged to identify a variety of L3, L4, and L7 DDoS attacks. Let’s run through an example of how ACME Corp, a small business using Magic Network Monitoring, can identify a Ping (ICMP) Flood attack on their network. Ping Flood attacks aim to overwhelm the targeted network’s ability to respond to a high number of requests or overload the network connection with bogus traffic.

At the start of a Ping Flood attack, your server’s traffic volume will begin to ramp up. Magic Network Monitoring will analyze traffic across your network, and send an email, webhook, or PagerDuty alert once an unusual volume of traffic is identified. Your network and security team can respond to the volumetric alert by checking the data in Magic Network Monitoring analytics and identifying the attack type. In this case, they’ll notice the following traffic characteristics:

  1. Network traffic volume above your historical traffic averages
  2. An unusually large amount of ICMP traffic
  3. ICMP traffic coming from a specific set of source IPs

Now, your network security team has confirmed the traffic is malicious by identifying the attack type, and can begin taking steps to mitigate the attack.

Magic Network Monitoring and Magic Transit

If your business is impacted by DDoS attacks, Magic Network Monitoring will identify attacks, and Magic Transit can be used to mitigate those DDoS attacks. Magic Transit protects customers’ entire network from DDoS attacks by placing our network in front of theirs. You can use Magic Transit Always On to reduce latency and mitigate attacks all the time, or Magic Transit On Demand to protect your network during active attacks. With Magic Transit, you get DDoS protection, traffic acceleration, and other network functions delivered as a service from every Cloudflare data center. Magic Transit works by allowing Cloudflare to advertise customers’ IP prefixes to the Internet with BGP to route the customer’s traffic through our network for DDoS protection. If you’re interested in protecting your network with Magic Transit, you can visit the Magic Transit product page and request a demo today.

Sign up for early access and what’s next

The free version of Magic Network Monitoring (MNM) will be released in the next few weeks. You can request early access by filling out this form.

This is just the beginning for Magic Network Monitoring. In the future, you can look forward to features like advanced DDoS attack identification, network incident history and trends, and volumetric alert threshold recommendations.

Monitoring our monitoring: how we validate our Prometheus alert rules

Post Syndicated from Lukasz Mierzwa original https://blog.cloudflare.com/monitoring-our-monitoring/

Background

We use Prometheus as our core monitoring system. We’ve been heavy Prometheus users since 2017 when we migrated off our previous monitoring system which used a customized Nagios setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original architecture of Prometheus (see Monitoring Cloudflare’s Planet-Scale Edge Network with Prometheus for an in depth walk through) remains virtually unchanged, proving that Prometheus is a solid foundation for building observability into your services.

One of the key responsibilities of Prometheus is to alert us when something goes wrong and in this blog post we’ll talk about how we make those alerts more reliable – and we’ll introduce an open source tool we’ve developed to help us with that, and share how you can use it too. If you’re not familiar with Prometheus you might want to start by watching this video to better understand the topic we’ll be covering here.

Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then query these metrics using Prometheus’s query language, PromQL, either with ad-hoc queries (for example to power Grafana dashboards) or via alerting or recording rules. A rule is basically a query that Prometheus will run for us in a loop, and when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).

Prometheus alerts

Since we’re talking about improving our alerting we’ll be focusing on alerting rules.

To create alerts we first need to have some metrics collected. For the purposes of this blog post let’s assume we’re working with http_requests_total metric, which is used on the examples page. Here are some examples of how our metrics will look:

http_requests_total{job="myserver", handler="/", method="get", status="200"}
http_requests_total{job="myserver", handler="/", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts", method="get", status="200"}
http_requests_total{job="myserver", handler="/posts", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="201"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="401"}

Let’s say we want to alert if our HTTP server is returning errors to customers.

Since all we need to do is check our metric that tracks how many responses with HTTP status code 500 there were, a simple alerting rule could look like this:

- alert: Serving HTTP 500 errors
  expr: http_requests_total{status="500"} > 0

This will alert us if we have any 500 errors served to our customers. Prometheus will run our query looking for a time series named http_requests_total that also has a status label with value “500”. Then it will filter all those matched time series and only return ones with value greater than zero.

If our alert rule returns any results, an alert will be triggered – one for each returned result.

If our rule doesn’t return anything, meaning there are no matched time series, then the alert will not trigger.

The whole flow from metric to alert is pretty simple here as we can see on the diagram below.

[Diagram: the flow from a metric collected by Prometheus to an alert triggered by an alerting rule]

If we want to provide more information in the alert we can do so by setting additional labels and annotations, but the alert and expr fields are all we need to get a working rule.

But the problem with the above rule is that our alert starts when we have our first error, and then it will never go away.

After all, our http_requests_total is a counter, so it gets incremented every time there’s a new request, which means that it will keep growing as we receive more requests. What this means for us is that our alert is really telling us “was there ever a 500 error?” and even if we fix the problem causing 500 errors we’ll keep getting this alert.

A better alert would be one that tells us if we’re serving errors right now.

For that we can use the rate() function to calculate the per second rate of errors.

Our modified alert would be:

- alert: Serving HTTP 500 errors
  expr: rate(http_requests_total{status="500"}[2m]) > 0

The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.

This is great because if the underlying issue is resolved the alert will resolve too.

We can improve our alert further by, for example, alerting on the percentage of errors, rather than absolute numbers, or even calculate error budget, but let’s stop here for now.
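For reference, a minimal sketch of what a percentage-based version could look like (later in the post we split this into recording rules):

- alert: Serving HTTP 500 errors
  expr: sum(rate(http_requests_total{status="500"}[2m])) / sum(rate(http_requests_total[2m])) > 0.01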

It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?

What could go wrong?

We can craft a valid YAML file with a rule definition that has a perfectly valid query that will simply not work how we expect it to. When it comes to alerting rules, this might mean that the alert we rely upon to tell us when something is not working correctly will fail to alert us when it should. To better understand why that might happen let’s first explain how querying works in Prometheus.

Prometheus querying basics

There are two basic types of queries we can run against Prometheus. The first one is an instant query. It allows us to ask Prometheus for a point in time value of some time series. If we write our query as http_requests_total we’ll get all time series named http_requests_total along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status=”500”}.

Let’s say we have two instances of our server, green and red, each of which is scraped (Prometheus collects metrics from it) every minute, independently of the other.

This is what happens when we issue an instant query:

[Diagram: an instant query returning the most recent value for each of the green and red server time series]

There’s obviously more to it as we can use functions and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.

The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.

The second type of query is a range query – it works similarly to instant queries, the difference is that instead of returning us the most recent value it gives us a list of values from the selected time range. That time range is always relative so instead of providing two timestamps we provide a range, like “20 minutes”. When we ask for a range query with a 20 minutes range it will return us all values collected for matching time series from 20 minutes ago until now.

An important distinction between those two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.

If we modify our example to request a [3m] range query, we should expect Prometheus to return three data points for each time series:

[Diagram: a 3m range query returning three data points for each time series]

When queries don’t return anything

Knowing a bit more about how queries work in Prometheus we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.

If our query doesn’t match any time series, or if they’re considered stale, then Prometheus will return an empty result. This might be because we’ve made a typo in the metric name or label filter, the metric we ask for is no longer being exported, or it was never there in the first place, or we’ve added some condition that wasn’t satisfied, like requiring the value to be non-zero in our http_requests_total{status="500"} > 0 example.

Prometheus will not return any error in any of the scenarios above because none of them are really problems; it’s just how querying works. If you ask for something that doesn’t match your query then you get empty results. This means that there’s no distinction between “all systems are operational” and “you’ve made a typo in your query”. So if you’re not receiving any alerts from your service, it’s either a sign that everything is working fine or that you’ve made a typo and have no working monitoring at all – and it’s up to you to verify which one it is.

For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.

Range queries can add another twist – they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series – after all, it’s impossible to calculate a rate from a single number.

Since the number of data points depends on the time range we passed to the range query, which we then pass to our rate() function, if we provide a time range that only contains a single value then rate won’t be able to calculate anything and once again we’ll return empty results.

The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this here and here if you want to better understand how rate() works in Prometheus.

For example if we collect our metrics every one minute then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:

[Diagram: a 1m range query finding only a single data point per time series]

Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.

There are more potential problems we can run into when writing Prometheus queries, for example any operations between two metrics will only work if both have the same set of labels; you can read about this here. But for now we’ll stop, as listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn’t return anything then it might be that everything is ok and there’s no need to alert, but it might also be that you’ve mistyped your metric name, your label filter cannot match anything, your metric disappeared from Prometheus, you are using too small a time range for your range queries, and so on.

Renaming metrics can be dangerous

We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.

A lot of metrics come from metrics exporters maintained by the Prometheus community, like node_exporter, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.

A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Or the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of the alerting queries we have, making such an alerting rule no longer useful.

It’s worth noting that Prometheus does have a way of unit testing rules, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.

Chaining rules

When writing alerting rules we try to limit alert fatigue by ensuring that, among many things, alerts are only generated when there’s an action needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.

For example, we might alert if the rate of HTTP errors in a datacenter is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use a recording rule:

- record: job:http_requests_total:rate2m
  expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

- record: job:http_requests_status500:rate2m
  expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

The first rule tells Prometheus to calculate the per-second rate of all requests and sum it across all instances of our server. The second rule does the same but only sums time series with a status label equal to "500". Both rules produce new metrics named after the value of the record field.

Now we can modify our alert rule to use those new metrics we’re generating with our recording rules:

- alert: Serving HTTP 500 errors
  expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

If we have a data center wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality of life improvement for our on-call engineers.

But at the same time we’ve added two new rules that we need to maintain and ensure they produce results. To make things more complicated we could have recording rules producing metrics based on other recording rules, and then we have even more rules that we need to ensure are working correctly.

What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because that’s needed by one of the teams? Problems like that can easily crop up now and then if your environment is sufficiently complex, and when they do, they’re not always obvious – after all, the only sign that something stopped working is, well, silence: your alerts no longer trigger. If you’re lucky you’re plotting your metrics on a dashboard somewhere and hopefully someone will notice if they become empty, but it’s risky to rely on this.

We definitely felt that we needed something better than hope.

Introducing pint: a Prometheus rule linter

To avoid running into such problems in the future we’ve decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos easier. We also wanted to allow new engineers, who might not necessarily have all the in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.

Since we believe that such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use – say hello to pint!

You can find the source code on GitHub, and there’s also online documentation that should help you get started.

Pint works in 3 different ways:

  • You can run it against one or more files with Prometheus rules
  • It can run as a part of your CI pipeline
  • Or you can deploy it as a side-car to all your Prometheus servers

It doesn’t require any configuration to run, but in most cases it will provide the most value if you create a configuration file for it and define some Prometheus servers it should use to validate all rules against. Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics.

The first mode is where pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.

The second mode is optimized for validating git based pull requests. Instead of testing all rules from all files, pint will only test rules that were modified and report only problems affecting modified lines.

The third mode is where pint runs as a daemon and tests all rules on a regular basis. If it detects any problems it will expose them as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problem. This way you can basically use Prometheus to monitor itself.
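For example, assuming the daemon exposes a gauge such as pint_problem for each detected issue (check the pint documentation for the exact metric names and labels your version exports), a sketch of such an alert could look like this:

- alert: Pint detected a problem with our Prometheus rules
  expr: pint_problem > 0
  for: 1h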

What kind of checks can it run for us and what kind of problems can it detect?

All the checks are documented here, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.

As mentioned above, the main motivation was to catch rules that try to query metrics that are missing or when the query was simply mistyped. To do that, pint will run each query from every alerting and recording rule to see if it returns any result; if it doesn’t, it will break down the query to identify all individual metrics and check for the existence of each of them. If any of them is missing, or if the query tries to filter using labels that aren’t present on any time series for a given metric, then it will report that back to us.

So if someone tries to add a new alerting rule with http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged. Which takes care of validating rules as they are being added to our configuration management system.

Another useful check will try to estimate the number of times a given alerting rule would trigger an alert. Which is useful when raising a pull request that’s adding new alerting rules – nobody wants to be flooded with alerts from a rule that’s too sensitive so having this information on a pull request allows us to spot rules that could lead to alert fatigue.

Similarly, another check will provide information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. 40 megabytes might not sound like much, but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that might add a substantial amount of new time series, which pint helps us notice before such a rule gets added to Prometheus.

On top of all the Prometheus query checks, pint allows us also to ensure that all the alerting rules comply with some policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.

We also require all alerts to have priority labels, so that high priority alerts generate pages for responsible teams, while low priority ones are only routed to our karma dashboard or create tickets using jiralert. It’s easy to forget about one of these required fields, and that’s not something which can be enforced using unit testing, but pint allows us to do that with a few configuration lines.
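As a sketch of what such a policy could look like in pint’s configuration file (the exact options are described in pint’s documentation; the label and annotation names below are just examples):

rule {
  match {
    kind = "alerting"
  }
  # Require a link to a runbook on every alerting rule.
  annotation "runbook_url" {
    required = true
    severity = "bug"
  }
  # Require a priority label on every alerting rule.
  label "priority" {
    required = true
    severity = "bug"
  }
}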

With pint running on all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.

GitHub: https://github.com/cloudflare/pint

Putting it all together

Let’s see how we can use pint to validate our rules as we work on them.

We can begin by creating a file called “rules.yml” and adding both recording rules there.

The goal is to write new rules that we want to add to Prometheus, but before we actually add those, we want pint to validate it all for us.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

Next we’ll download the latest version of pint from GitHub and run it to check our rules.

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:8: syntax error: unclosed left parenthesis (promql/syntax)
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

level=info msg="Problems found" Fatal=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Whoops, we have “sum(rate(…)” and so we’re missing one of the closing brackets. Let’s fix that and try again.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2

Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines a Prometheus server to test our rules against; it should be the same server we’re planning to deploy our rules to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:

prometheus "prom1" {
  uri     = "http://localhost:9090"
  timeout = "1m"
}

Now we can re-run our check using this configuration file:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"

Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.

Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:

scrape_configs:
  - job_name: webserver
    static_configs:
      - targets: ['localhost:8080']

Let’s re-run our checks once more:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2

This time everything works!

Now let’s add our alerting rule to our file, so it now looks like this:

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

- name: Demo alerting rules
  rules:
  - alert: Serving HTTP 500 errors
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

And let’s re-run pint once again:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Information=2

It all works according to pint, and so we now can safely deploy our new rules file to Prometheus.

Notice that pint recognised that both metrics used in our alert come from recording rules, which aren’t yet added to Prometheus, so there’s no point querying Prometheus to verify if they exist there.

Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Luckily pint will notice this and report it, so we can adapt our rule to match the new name.

But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon periodically checking all rules.

Please note that validating all metrics used in a query will eventually produce some false positives. In our example, metrics with the status="500" label might not be exported by our server until there’s at least one request ending in an HTTP 500 error.

The promql/series check responsible for validating the presence of all metrics has some documentation on how to deal with this problem. In most cases you’ll want to add a comment that instructs pint to ignore some missing metrics entirely or stop checking label values (only check if there’s a "status" label present, without checking if there are time series with status="500").
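For example, a control comment placed inside the rule can disable a specific check for that rule; assuming the promql/series check name shown in the output above, a sketch could look like this (see the pint documentation for the full comment syntax and more fine-grained options):

- alert: Serving HTTP 500 errors
  # pint disable promql/series
  expr: rate(http_requests_total{status="500"}[2m]) > 0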

Summary

Prometheus metrics don’t follow any strict schema – whatever services expose will be collected. At the same time a lot of problems with queries hide behind empty results, which makes noticing these problems non-trivial.

We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly, and we have confidence that a lack of alerts means our infrastructure really is reliable.

How US federal agencies can use AWS to improve logging and log retention

Post Syndicated from Derek Doerr original https://aws.amazon.com/blogs/security/how-us-federal-agencies-can-use-aws-to-improve-logging-and-log-retention/

This post is part of a series about how Amazon Web Services (AWS) can help your US federal agency meet the requirements of the President’s Executive Order on Improving the Nation’s Cybersecurity. You will learn how you can use AWS information security practices to help meet the requirement to improve logging and log retention practices in your AWS environment.

Improving the security and operational readiness of applications relies on improving the observability of the applications and the infrastructure on which they operate. For our customers, this translates to questions of how to gather the right telemetry data, how to securely store it over its lifecycle, and how to analyze the data in order to make it actionable. These questions take on more importance as our federal customers seek to improve their collection and management of log data in all their IT environments, including their AWS environments, as mandated by the executive order.

Given the interest in the technologies used to support logging and log retention, we’d like to share our perspective. This starts with an overview of logging concepts in AWS, including log storage and management, and then proceeds to how to gain actionable insights from that logging data. This post will address how to improve logging and log retention practices consistent with the Security and Operational Excellence pillars of the AWS Well-Architected Framework.

Log actions and activity within your AWS account

AWS provides you with extensive logging capabilities to provide visibility into actions and activity within your AWS account. A security best practice is to establish a wide range of detection mechanisms across all of your AWS accounts. Starting with services such as AWS CloudTrail, AWS Config, Amazon CloudWatch, Amazon GuardDuty, and AWS Security Hub provides a foundation upon which you can base detective controls, remediation actions, and forensics data to support incident response. Here is more detail on how these services can help you gain more security insights into your AWS workloads:

  • AWS CloudTrail provides event history for all of your AWS account activity, including API-level actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. You can use CloudTrail to identify who or what took which action, what resources were acted upon, when the event occurred, and other details. If your agency uses AWS Organizations, you can automate this process for all of the accounts in the organization.
  • CloudTrail logs can be delivered from all of your accounts into a centralized account. This places all logs in a tightly controlled, central location, making it easier to protect them as well as to store and analyze them. As with AWS CloudTrail, you can automate this process for all of the accounts in the organization using AWS Organizations. CloudTrail can also be configured to emit metric data into the CloudWatch monitoring service, giving near real-time insights into the usage of various services.
  • CloudTrail log file integrity validation produces and cryptographically signs a digest file that contains references and hashes for every CloudTrail file that was delivered in that hour. This makes it computationally infeasible to modify, delete, or forge CloudTrail log files without detection. Validated log files are invaluable in security and forensic investigations. For example, a validated log file enables you to assert positively that the log file itself has not changed, or that particular user credentials performed specific API activity.
  • AWS Config monitors and records your AWS resource configurations and allows you to automate the evaluation of recorded configurations against desired configurations. For example, you can use AWS Config to verify that resources are encrypted, multi-factor authentication (MFA) is enabled, and logging is turned on, and you can use AWS Config rules to identify noncompliant resources. Additionally, you can review changes in configurations and relationships between AWS resources and dive into detailed resource configuration histories, helping you to determine when compliance status changed and the reason for the change.
  • Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads. Amazon GuardDuty analyzes and processes the following data sources: VPC Flow Logs, AWS CloudTrail management event logs, CloudTrail Amazon Simple Storage Service (Amazon S3) data event logs, and DNS logs. It uses threat intelligence feeds, such as lists of malicious IP addresses and domains, and machine learning to identify potential threats within your AWS environment.
  • AWS Security Hub provides a single place that aggregates, organizes, and prioritizes your security alerts, or findings, from multiple AWS services and optional third-party products to give you a comprehensive view of security alerts and compliance status.

You should be aware that most AWS services do not charge you for enabling logging (for example, AWS WAF) but the storage of logs will incur ongoing costs. Always consult the AWS service’s pricing page to understand cost impacts. Related services such as Amazon Kinesis Data Firehose (used to stream data to storage services), and Amazon Simple Storage Service (Amazon S3), used to store log data, will incur charges.

Turn on service-specific logging as desired

After you have the foundational logging services enabled and configured, next turn your attention to service-specific logging. Many AWS services produce service-specific logs that include additional information. These services can be configured to record and send out information that is necessary to understand their internal state, including application, workload, user activity, dependency, and transaction telemetry. Here’s a sampling of key services with service-specific logging features:

  • Amazon CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. You can gain additional operational insights from your AWS compute instances (Amazon Elastic Compute Cloud, or EC2) as well as on-premises servers using the CloudWatch agent. Additionally, you can use CloudWatch to detect anomalous behavior in your environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
  • Amazon CloudWatch Logs is a component of Amazon CloudWatch which you can use to monitor, store, and access your log files from Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS CloudTrail, Route 53, and other sources. CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can then easily view them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. CloudWatch Logs enables you to see all of your logs, regardless of their source, as a single and consistent flow of events ordered by time, and you can query them and sort them based on other dimensions, group them by specific fields, create custom computations with a powerful query language, and visualize log data in dashboards.
  • Traffic Mirroring allows you to achieve full packet capture (as well as filtered subsets) of network traffic from an elastic network interface of EC2 instances inside your VPC. You can then send the captured traffic to out-of-band security and monitoring appliances for content inspection, threat monitoring, and troubleshooting.
  • The Elastic Load Balancing service provides access logs that capture detailed information about requests that are sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. The specific information logged varies by load balancer type:
  • Amazon S3 access logs record the S3 bucket and account that are being accessed, the API action, and requester information.
  • AWS Web Application Firewall (WAF) logs record web requests that are processed by AWS WAF, and indicate whether the requests matched AWS WAF rules and what actions, if any, were taken. These logs are delivered to Amazon S3 by using Amazon Kinesis Data Firehose.
  • Amazon Relational Database Service (Amazon RDS) log files can be downloaded or published to Amazon CloudWatch Logs. Log settings are specific to each database engine. Agencies use these settings to apply their desired logging configurations and choose which events are logged. Amazon Aurora and Amazon RDS for Oracle also support a real-time logging feature called “database activity streams” which provides even more detail, and cannot be accessed or modified by database administrators.
  • Amazon Route 53 provides options for logging for both public DNS query requests as well as Route53 Resolver DNS queries:
    • Route 53 Resolver DNS query logs record DNS queries and responses that originate from your VPC, that use an inbound Resolver endpoint, that use an outbound Resolver endpoint, or that use a Route 53 Resolver DNS Firewall.
    • Route 53 DNS public query logs record queries to Route 53 for domains you have hosted with AWS, including the domain or subdomain that was requested; the date and time of the request; the DNS record type; the Route 53 edge location that responded to the DNS query; and the DNS response code.
  • Amazon Elastic Compute Cloud (Amazon EC2) instances can use the unified CloudWatch agent to collect logs and metrics from Linux, macOS, and Windows EC2 instances and publish them to the Amazon CloudWatch service.
  • Elastic Beanstalk logs can be streamed to CloudWatch Logs. You can also use the AWS Management Console to request the last 100 log entries from the web and application servers, or request a bundle of all log files that is uploaded to Amazon S3 as a ZIP file.
  • Amazon CloudFront logs record user requests for content that is cached by CloudFront.
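
As an illustration of enabling one of the log sources above from the command line, S3 server access logging can be turned on with a single CLI call. This is only a sketch: the bucket names are placeholders, and the target bucket must already allow the S3 log delivery service to write to it.

aws s3api put-bucket-logging \
  --bucket my-data-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-central-log-bucket",
      "TargetPrefix": "s3-access-logs/my-data-bucket/"
    }
  }'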

Store and analyze log data

Now that you’ve enabled foundational and service-specific logging in your AWS accounts, that data needs to be persisted and managed throughout its lifecycle. AWS offers a variety of solutions and services to consolidate your log data and store it, secure access to it, and perform analytics.

Store log data

The primary service for storing all of this logging data is Amazon S3. Amazon S3 is ideal for this role, because it’s a highly scalable, highly resilient object storage service. AWS provides a rich set of multi-layered capabilities to secure log data that is stored in Amazon S3, including encrypting objects (log records), preventing deletion (the S3 Object Lock feature), and using lifecycle policies to transition data to lower-cost storage over time (for example, to S3 Glacier). Access to data in Amazon S3 can also be restricted through AWS Identity and Access Management (IAM) policies, AWS Organizations service control policies (SCPs), S3 bucket policies, Amazon S3 Access Points, and AWS PrivateLink interfaces. While S3 is particularly easy to use with other AWS services given its various integrations, many customers also centralize their storage and analysis of their on-premises log data, or log data from other cloud environments, on AWS using S3 and the analytic features described below.
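
As a brief sketch of the lifecycle capability mentioned above, the following call transitions log objects to S3 Glacier after 90 days and expires them after 400 days. The bucket name, prefix, and retention periods are placeholders; adjust them to your own requirements.

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-central-log-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire-logs",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 400}
    }]
  }'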

If your AWS accounts are organized in a multi-account architecture, you can make use of the AWS Centralized Logging solution. This solution enables organizations to collect, analyze, and display CloudWatch Logs data in a single dashboard. AWS services generate log data, such as audit logs for access, configuration changes, and billing events. In addition, web servers, applications, and operating systems all generate log files in various formats. This solution uses the Amazon Elasticsearch Service (Amazon ES) and Kibana to deploy a centralized logging solution that provides a unified view of all the log events. In combination with other AWS-managed services, this solution provides you with a turnkey environment to begin logging and analyzing your AWS environment and applications.

You can also make use of services such as Amazon Kinesis Data Firehose, which you can use to transport log information to S3, Amazon ES, or any third-party service that is provided with an HTTP endpoint, such as Datadog, New Relic, or Splunk.

Finally, you can use Amazon EventBridge to route and integrate event data between AWS services and to third-party solutions such as software as a service (SaaS) providers or help desk ticketing systems. EventBridge is a serverless event bus service that allows you to connect your applications with data from a variety of sources. EventBridge delivers a stream of real-time data from your own applications, SaaS applications, and AWS services, and then routes that data to targets such as AWS Lambda.
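
For example, a hypothetical EventBridge rule that forwards Amazon GuardDuty findings to a Lambda function for triage might look like the following. The rule name and function ARN are placeholders, and the function must separately grant events.amazonaws.com permission to invoke it.

# Match GuardDuty findings on the default event bus.
aws events put-rule \
  --name guardduty-findings-to-lambda \
  --event-pattern '{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"]}'

# Send matching events to a Lambda function for triage.
aws events put-targets \
  --rule guardduty-findings-to-lambda \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:111122223333:function:triage-findings"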

Analyze log data and respond to incidents

As the final step in managing your log data, you can use AWS services such as Amazon Detective, Amazon ES, CloudWatch Logs Insights, and Amazon Athena to analyze your log data and gain operational insights.

  • Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of security findings or suspicious activities. Detective automatically collects log data from your AWS resources. It then uses machine learning, statistical analysis, and graph theory to help you visualize and conduct faster and more efficient security investigations.
  • Incident Manager is a component of AWS Systems Manager which enables you to automatically take action when a critical issue is detected by an Amazon CloudWatch alarm or Amazon EventBridge event. Incident Manager executes pre-configured response plans to engage responders via SMS and phone calls, enable chat commands and notifications using AWS Chatbot, and execute AWS Systems Manager Automation runbooks. The Incident Manager console integrates with AWS Systems Manager OpsCenter to help you track incidents and post-incident action items from a central place that also synchronizes with popular third-party incident management tools such as Jira Service Desk and ServiceNow.
  • Amazon Elasticsearch Service (Amazon ES) is a fully managed service that collects, indexes, and unifies logs and metrics across your environment to give you unprecedented visibility into your applications and infrastructure. With Amazon ES, you get the scalability, flexibility, and security you need for the most demanding log analytics workloads. You can configure a CloudWatch Logs log group to stream data it receives to your Amazon ES cluster in near real time through a CloudWatch Logs subscription.
  • CloudWatch Logs Insights enables you to interactively search and analyze your log data in CloudWatch Logs.
  • Amazon Athena is an interactive query service that you can use to analyze data in Amazon S3 by using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
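
As a small example of the interactive analysis described above, a CloudWatch Logs Insights query can also be run from the CLI. This is a sketch: the log group name is a placeholder, and the query simply looks for recent console sign-in events in a CloudTrail log group.

query_id=$(aws logs start-query \
  --log-group-name "CloudTrail/logs" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter eventName = "ConsoleLogin" | sort @timestamp desc | limit 20' \
  --query 'queryId' --output text)

# Insights queries run asynchronously; fetch the results once the query completes.
aws logs get-query-results --query-id "$query_id"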

Conclusion

As called out in the executive order, information from network and systems logs is invaluable for both investigation and remediation services. AWS provides a broad set of services to collect an unprecedented amount of data at very low cost, optionally store it for long periods of time in tiered storage, and analyze that telemetry information from your cloud-based workloads. These insights will help you improve your organization’s security posture and operational readiness and, as a result, improve your organization’s ability to deliver on its mission.

Next steps

To learn more about how AWS can help you meet the requirements of the executive order, see the other posts in this series.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Derek Doerr

Derek is a Senior Solutions Architect with the Public Sector team at AWS. He has been working with AWS technology for over four years. Specializing in enterprise management and governance, he is passionate about helping AWS customers navigate their journeys to the cloud. In his free time, he enjoys time with family and friends, as well as scuba diving.

Gaining operational insights with AIOps using Amazon DevOps Guru

Post Syndicated from Nikunj Vaidya original https://aws.amazon.com/blogs/devops/gaining-operational-insights-with-aiops-using-amazon-devops-guru/

Amazon DevOps Guru offers a fully managed AIOps platform service that enables developers and operators to improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of AWS’s expertise in operating highly available applications for the world’s largest e-commerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

This post walks you through how to enable DevOps Guru for your account in a typical serverless environment and observe the insights and recommendations generated for various activities. These insights are generated for operational events that could pose a risk to your application availability. DevOps Guru uses AWS CloudFormation stacks as the application boundary to detect resource relationships and correlate them with deployment events.

Solution overview

The goal of this post is to demonstrate the insights generated for anomalies detected by DevOps Guru from DevOps operations. If you don't have a test environment and want to build out infrastructure to test the generation of insights, then you can follow along through this post. However, if you have a test or production environment (preferably spawned from CloudFormation stacks), you can skip the first section and jump directly to the section Enabling DevOps Guru and injecting traffic.

The solution includes the following steps:

1. Deploy a serverless infrastructure.

2. Enable DevOps Guru and inject traffic.

3. Review the generated DevOps Guru insights.

4. Inject another failure to generate a new insight.

As depicted in the following diagram, we use a CloudFormation stack to create a serverless infrastructure comprising Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, and then inject HTTP requests at a high rate toward the API published to list records.

Serverless infrastructure monitored by DevOps Guru

When DevOps Guru is enabled to monitor your resources within the account, it uses a combination of vended Amazon CloudWatch metrics and specific patterns from its ML models to detect anomalies. When an anomaly is detected, it generates an insight with the recommendations.

The generation of insights is dependent upon several factors. Although this post provides a canned environment to reproduce insights, the results may vary depending upon the traffic pattern, the timing of traffic injection, and so on.

Prerequisites

To complete this tutorial, you should have access to an AWS Cloud9 environment and the AWS Command Line Interface (AWS CLI).

Deploying a serverless infrastructure

To deploy your serverless infrastructure, you complete the following high-level steps:

1.   Create your IDE environment.

2.   Launch the CloudFormation template to deploy the serverless infrastructure.

3.   Populate the DynamoDB table.

 

1. Creating your IDE environment

We recommend using AWS Cloud9 to create an environment to get access to the AWS CLI from a bash terminal. AWS Cloud9 is a browser-based IDE that provides a development environment in the cloud. While creating the new environment, ensure you choose Amazon Linux 2 as the operating system. Alternatively, you can use a bash terminal in your favorite IDE and configure your AWS credentials in your terminal.

When access is available, run the following command to confirm that you can see the Amazon Simple Storage Service (Amazon S3) buckets in your account:

aws s3 ls

Install the following prerequisite packages and ensure you have Python3 installed:

sudo yum install jq -y

export AWS_REGION=$(curl -s \
169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')

sudo pip3 install requests

Clone the git repository to download the required CloudFormation templates:

git clone https://github.com/aws-samples/amazon-devopsguru-samples
cd amazon-devopsguru-samples/generate-devopsguru-insights/

2. Launching the CloudFormation template to deploy your serverless infrastructure

To deploy your infrastructure, complete the following steps:

  • Run the CloudFormation template using the following command:
aws cloudformation create-stack --stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

The AWS CloudFormation deployment creates an API Gateway, a DynamoDB table, and a Lambda function with sample code.

  • When it’s complete, go to the Outputs tab of the stack on the AWS CloudFormation console.
  • Record the links for the two APIs: one of them lists the table contents and the other one populates the contents.

3. Populating the DynamoDB table

Run the following commands (simply copy and paste) to populate the DynamoDB table. The first two commands identify the name of the DynamoDB table from the CloudFormation stack and insert it into the populate-shops-dynamodb-table.json file; the third writes those items to the table.

dynamoDBTableName=$(aws cloudformation list-stack-resources \
--stack-name myServerless-Stack | \
jq '.StackResourceSummaries[]|select(.ResourceType == "AWS::DynamoDB::Table").PhysicalResourceId' | tr -d '"')
sudo sed -i s/"<YOUR-DYNAMODB-TABLE-NAME>"/$dynamoDBTableName/g \
populate-shops-dynamodb-table.json
aws dynamodb batch-write-item \
--request-items file://populate-shops-dynamodb-table.json

The last command gives the following output:

{
"UnprocessedItems": {}
}

This populates the DynamoDB table with a few sample records, which you can verify by accessing the ListRestApiEndpointMonitorOper API URL published on the Outputs tab of the CloudFormation stack. The following screenshot shows the output.
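
If you prefer to verify from the terminal instead of the browser, the following sketch fetches the URL from the stack outputs and calls the API. It assumes the output key is named ListRestApiEndpointMonitorOper; adjust it to the key shown on your stack's Outputs tab.

apiUrl=$(aws cloudformation describe-stacks --stack-name myServerless-Stack \
  --query "Stacks[0].Outputs[?OutputKey=='ListRestApiEndpointMonitorOper'].OutputValue" \
  --output text)
curl -s "$apiUrl" | jq .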

Screenshot showing the API output

Enabling DevOps Guru and injecting traffic

In this section, you complete the following high-level steps:

1.   Enable DevOps Guru for the CloudFormation stack.

2.   Wait for the serverless stack to complete.

3.   Update the stack.

4.   Inject HTTP requests into your API.

 

1. Enabling DevOps Guru for the CloudFormation stack

To enable DevOps Guru for CloudFormation, complete the following steps:

  • Run the CloudFormation template to enable DevOps Guru for this CloudFormation stack:
aws cloudformation create-stack \
--stack-name EnableDevOpsGuruForServerlessCfnStack \
--template-body file:///$PWD/EnableDevOpsGuruForServerlessCfnStack.yaml \
--parameters ParameterKey=CfnStackNames,ParameterValue=myServerless-Stack \
--region ${AWS_REGION}
  • When the stack is created, navigate to the Amazon DevOps Guru console.
  • Choose Settings.
  • Under CloudFormation stacks, locate myServerless-Stack.

If you don’t see it, your CloudFormation stack has not been successfully deployed. You may remove and redeploy the EnableDevOpsGuruForServerlessCfnStack stack.

Optionally, you can configure Amazon Simple Notification Service (Amazon SNS) topics or enable AWS Systems Manager integration to create OpsItem entries for every insight created by DevOps Guru. If you need to deploy as a stack set across multiple accounts and Regions, see Easily configure Amazon DevOps Guru across multiple accounts and Regions using AWS CloudFormation StackSets.
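
For instance, if you already have an SNS topic, one way to attach it from the CLI is shown below. The topic ARN is a placeholder, and the same option is available in the DevOps Guru console.

# Hypothetical example: notify an existing SNS topic whenever DevOps Guru creates an insight.
aws devops-guru add-notification-channel \
  --config '{"Sns": {"TopicArn": "arn:aws:sns:us-east-1:111122223333:devops-guru-alerts"}}' \
  --region ${AWS_REGION}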

2. Waiting for baselining of resources

This is a necessary step to allow DevOps Guru to complete baselining the resources and benchmark their normal behavior. For our serverless stack with three resources, we recommend waiting for two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours to complete baselining.

Note: Unlike many monitoring tools, DevOps Guru does not expect the dashboard to be monitored continuously. Under normal system health, the dashboard simply shows zeroed counters for ongoing insights. Only when an anomaly is detected does it raise an alert and display an insight on the dashboard.

3. Updating the CloudFormation stack

When enough time has elapsed, we will make a configuration change to simulate a typical operational event. As shown below, update the CloudFormation template to change the read capacity for the DynamoDB table from 5 to 1.

CloudFormation showing read capacity settings to modify

Run the following command to deploy the updated CloudFormation template:

aws cloudformation update-stack --stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

4. Injecting HTTP requests into your API

We now inject ingress traffic in the form of HTTP requests toward the ListRestApiEndpointMonitorOper API, either manually or using the Python script provided in the current directory (sendAPIRequest.py). Due to the reduced read capacity, the traffic will oversubscribe the DynamoDB table, thus inducing an anomaly. However, before triggering the script, populate the url parameter with the correct API link for the ListRestApiEndpointMonitorOper API, listed on the Outputs tab of the CloudFormation stack.

After the script is saved, trigger the script using the following command:

python sendAPIRequest.py
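
If you would rather not edit the Python script, a rough curl-based loop achieves the same effect. Replace the placeholder with the ListRestApiEndpointMonitorOper URL from the stack outputs; this is only a sketch, not the provided script.

# Continuously call the API and print the response plus its HTTP status code.
while true; do
  curl -s -w "\nHTTP %{http_code}\n" "<ListRestApiEndpointMonitorOper-URL>"
done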

Make sure you're getting status 200 responses containing the records that we fed into the DynamoDB table (see the following screenshot). You may have to run the script in multiple terminal tabs (preferably four) to inject a high rate of traffic.

Terminal showing script executing and injecting traffic to API

After approximately 10 minutes of the script running in a loop, an operational insight is generated in DevOps Guru.

 

Reviewing DevOps Guru insights

While these operations are being run, DevOps Guru monitors for anomalies, generates insights that provide details about the metrics indicating an anomaly, and offers actionable recommendations to mitigate the anomaly. In this section, we review some of these insights.

Under normal conditions, the DevOps Guru dashboard shows the ongoing insights counter at zero. DevOps Guru monitors a large number of metrics behind the scenes, freeing the operator from manually watching counters or graphs, and raises an alert in the form of an insight only when an anomaly occurs.

The following screenshot shows an ongoing reactive insight for the specific CloudFormation stack. When you choose the insight, you see further details. The number under the Total resources analyzed last hour may vary, so for this post, you can ignore this number.

DevOps Guru dashboard showing an ongoing reactive insight

The insight is divided into four sections: Insight overview, Aggregated metrics, Relevant events, and Recommendations. Let’s take a closer look into these sections.

The following screenshot shows the Aggregated metrics section, which lists metrics for all the resources in which anomalies were detected (DynamoDB, Lambda, and API Gateway). Note that the list of metrics may vary depending upon your traffic pattern, Lambda settings, and baseline traffic. In the example below, the timeline shows that the anomaly for DynamoDB started first and was followed by API Gateway and Lambda. This helps us separate cause from symptoms and prioritize the right anomaly to investigate.

The listing of metrics inside an Insight

Initially, you may see only two metrics listed; over time, more metrics that showed anomalies are populated. You can see that the anomaly for DynamoDB started earlier than the anomalies for API Gateway and Lambda, indicating that the latter are aftereffects. In addition to the information in the preceding screenshot, you may also see the Duration p90 and IntegrationLatency p90 metrics (for Lambda and API Gateway, respectively) listed, due to the increased backend latency.

Now we check the Relevant events section, which lists potential triggers for the issue. The events listed here depend on the set of operations carried out on this CloudFormation stack's resources in this Region. This makes it easy for the operator to be reminded of a change that may have caused this issue. The dots (representing events) near the start of the insight timeline are of particular interest.

Related Events shown inside the Insight

If you need to delve into any of these events, just click one of these points to see more details, as shown in the following screenshot.

Delving into details of the related event listed in Insight

You can choose the link for an event to view specific details about the operational event (configuration change via CloudFormation update stack operation).

Now we move to the Recommendations section, which provides prescriptive guidance for mitigating this anomaly. As seen in the following screenshot, it recommends rolling back the configuration change related to the read capacity for the DynamoDB table. It also lists the specific metrics and the event as part of the recommendation for reference.

Recommendations provided inside the Insight

Going back to the metrics section of the insight, when we select Graphed anomalies, it shows the graphs of all related resources. The following screenshot shows a snippet of the anomaly for the DynamoDB ReadThrottleEvents metric. As the graph pattern shows, read operations on the table are exceeding the provisioned read capacity, which clearly indicates an anomaly.

Graphed anomalies in DevOps Guru

Let's navigate to the DynamoDB table and check our table configuration. Checking the table properties, we notice that our read capacity is reduced to 1. This is the root cause of the anomaly.

Checking the DynamoDB table capacity settings

If we change it to 5, we fix this anomaly. Alternatively, if the traffic is stopped, the anomaly moves to a Resolved state.
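
If you want to restore the capacity directly from the CLI rather than by rolling back the stack update, something like the following works. Note that it causes drift from the CloudFormation template and assumes the table's write capacity is also 5.

# Restore the provisioned read capacity using the table name captured earlier.
aws dynamodb update-table --table-name "$dynamoDBTableName" \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5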

The ongoing reactive insight takes a few minutes after resolution to move to a Closed state.

Note: When the insight is active, you may see more metrics get populated over time as we detect further anomalies. When carrying out the preceding tests, if you don’t see all the metrics as listed in the screenshots, you may have to wait longer.

 

Injecting another failure to generate a new DevOps Guru insight

Let’s create a new failure and generate an insight for that.

1.   After you close the insight from the previous section, trigger the HTTP traffic generation loop from the AWS Cloud9 terminal.

We modify the Lambda function's resource-based policy by removing the permissions for API Gateway to access this function.

2.   On the Lambda console, open the function ScanFunctionMonitorOper.

3.   On the Permissions tab, access the policy.

Accessing the permissions tab for the Lambda

 

4.   Save a copy of the policy offline as a backup before making any changes.

5.   Note down the "Sid" values of the statements whose "AWS:SourceArn" ends with prod/*/ and prod/*/*.

Checking the Resource-based policy for the Lambda

6.   Run the following command in your Cloud9 terminal to remove the first of these statements:

aws lambda remove-permission --function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/>

7.   Run the same command for the second Sid value:

aws lambda remove-permission --function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/*>

You should see several 5XX errors, as in the following screenshot.

Terminal output now showing 500 errors for the script output

After less than 8 minutes, you should see a new ongoing reactive insight on the DevOps Guru dashboard.

Let’s take a closer look at the insight. The following screenshot shows the anomalous metric 5XXError Average of API Gateway and its duration. (This insight shows as closed because I had already restored permissions.)

Insight showing 5XX errors for API-Gateway and link to OpsItem

If you have configured Systems Manager integration to create OpsItems, you will see a link to the OpsItem ID created for the insight, as shown above. This optional configuration enables you to track insights in the form of open tickets (OpsItems) in Systems Manager OpsCenter.

The recommendations provide guidance based upon the related events and anomalous metrics.

After the insight has been generated, reviewed, and verified, restore the permissions by running the following command:

aws lambda add-permission --function-name ScanFunctionMonitorOper  \
--statement-id APIGatewayProdPerm --action lambda:InvokeFunction \
--principal apigateway.amazonaws.com

If needed, you can insert the condition to point to the API Gateway ARN to allow only specific API Gateways to access the Lambda function.

 

Cleaning up

After you walk through this post, you should clean up and un-provision the resources to avoid incurring any further charges.

1.   To un-provision the CloudFormation stacks, on the AWS CloudFormation console, choose Stacks.

2.   Select each stack (EnableDevOpsGuruForServerlessCfnStack and myServerless-Stack) and choose Delete.

3.   Check to confirm that the DynamoDB table created with the stacks is cleaned up. If not, delete the table manually.

4.   Un-provision the AWS Cloud9 environment.

 

Conclusion

This post reviewed how DevOps Guru can continuously monitor the resources in your AWS account in a typical production environment. When it detects an anomaly, it generates an insight, which includes the vended CloudWatch metrics that breached the threshold, the CloudFormation stack in which the resource existed, relevant events that could be potential triggers, and actionable recommendations to mitigate the condition.

DevOps Guru generates insights that are relevant to you based upon the pre-trained machine-learning models, removing the undifferentiated heavy lifting of manually monitoring several events, metrics, and trends.

I hope this post was useful to you and that you will consider DevOps Guru for your production needs.

 

Soar: Simulation for Observability, reliAbility, and secuRity

Post Syndicated from Yan Zhai original https://blog.cloudflare.com/soar-simulation-for-observability-reliability-and-security/

Soar: Simulation for Observability, reliAbility, and secuRity

Soar: Simulation for Observability, reliAbility, and secuRity

Serving more than 25 million Internet properties is not an easy thing, and neither is serving 20 million requests per second on average. At Cloudflare, we achieve this by running a homogeneous edge environment: almost every Cloudflare server runs all Cloudflare products.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 1. Typical Cloudflare service model: when an end-user (a browser/mobile/etc) visits an origin (a Cloudflare customer), traffic is routed via the Internet to the Cloudflare edge network, and Cloudflare communicates with the origin servers from that point.

As we offer more and more products and enjoy the benefit of horizontal scalability, our edge stack continues to grow in complexity. Originally, we only operated at the application layer with our CDN service and DoS protection. Then we launched transport layer products, such as Spectrum and Argo. Now we have further expanded our footprint into the IP layer and physical link with Magic Transit. They all run on every machine we have. The work of our engineers enables our products to evolve at a fast pace, and to serve our customers better.

However, such software complexity presents a sheer challenge to operation: the more changes you make, the more likely it is that something is going to break. And we don’t tolerate any of our mistakes slipping into the production environment and affecting our customers.

In this article, we will discuss one of the techniques we use to fight such software complexity: simulations. Simulations are basically system tests that run with synthesized customer traffic and applications. We would like to introduce our simulation system, SOAR, i.e. Simulation for Observability, reliAbility, and secuRity.

What is SOAR? Simply put, it’s a data center built specifically for simulations. It runs the same software stack as our production data centers, but without any production traffic. Within SOAR, there are end-user servers, product servers, and origin servers (Figure 2). The product servers behave exactly the same as servers in our production edge network, and they are the targets that we want to test. End-user servers and origin servers run applications that try to simulate customer behaviors. The simplest case is to run network benchmarks through product servers, in order to evaluate how effective the corresponding products are. Instead of sending test traffic over the Internet, everything happens in the LAN environment that Cloudflare tightly controls. This gives us great flexibility in testing network features such as bring-your-own-IP (BYOIP) products.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 2. SOAR architectural view: by simulating the end users and origin on Cloudflare servers in the same VLAN, we can focus on examining the problems occurring in our edge network.

To demonstrate how this works, let’s go through a running example using Magic Transit.

Magic Transit is a product that provides IP layer protection and acceleration. One of the main functions of Magic Transit is to shield customers from DDoS attacks.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 3. Magic Transit workflow in a nutshell

Customers bring their IP ranges to advertise from Cloudflare edge. When attackers initiate a DoS attack, Cloudflare absorbs all the customer’s traffic, drops the attack traffic, and encapsulates clean traffic to customers. For this product, operational concerns are multifold, and here are some examples:

  • Have we properly configured our data plane so that traffic can reach customers? Is BGP ready? Are ECMP routes programmed correctly? Are health probes working correctly?
  • Do we have any untested corner cases that only manifest with a large amount of traffic?
  • Is our DoS system dropping malicious traffic as intended? How effective is it?
  • Will any other team’s changes break Magic Transit as our edge keeps growing?

To ease these concerns, we run simulated customers with SOAR. Yes, simulated, not real. For example, assume a customer Alice onboarded an IP range 192.0.2.0/24 to Magic Transit. To simulate this customer, in SOAR we can configure a test application (e.g. iperf server) on one origin server to represent Alice's service. We also bring up a product server to run Magic Transit. This product server will filter traffic toward 192.0.2.0/24 and GRE-encapsulate cleansed traffic to Alice's specified GRE endpoint. To make it work, we also add routing rules to forward packets destined for 192.0.2.0/24 through the product server above. Similarly, we add routing rules to deliver GRE packets from the product server to the origin servers. Lastly, we start running test clients as eyeballs to evaluate functional correctness, performance metrics, and resource usage.
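
To make the example concrete, the following is a rough sketch of the kind of system tasks such a simulation could apply. It is purely illustrative rather than our actual automation, and the addresses are placeholders from documentation ranges.

# On the origin server: terminate the GRE tunnel and host the simulated service.
ip tunnel add mt-sim0 mode gre local 198.51.100.20 remote 198.51.100.10 ttl 255
ip addr add 192.0.2.1/24 dev mt-sim0
ip link set mt-sim0 up
iperf3 -s &

# On the eyeball server: steer traffic for the simulated prefix through the product server.
ip route add 192.0.2.0/24 via 198.51.100.10

# On the eyeball server: generate test traffic toward the simulated customer service.
iperf3 -c 192.0.2.1 -t 60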

For the rest of this article, we will talk about the design and implementation of this simulation system, as well as several real cases in which it helped us catch problems early or avoid problems altogether.

System Design

From performance simulation to config simulation

Before we created SOAR, we had already built a “performance simulation” for our layer 7 services. It is based on SaltStack, our configuration management software. All the simulation cases are system test cases against Cloudflare-owned HTTP sites. These cases are statically configured and run non-stop. Each simulation case produces multiple Prometheus metrics such as requests per second and latency. We monitor these metrics daily on our Grafana dashboard.

While this simulation system is very useful, it becomes less efficient as we have more and more simulation cases and products to run and analyze.

Isolation and Coordination

As more types of simulations are onboarded, it is critical to ensure each simulation runs in a clean environment, and all tasks of a simulation run together. This challenge is specific to providers like Cloudflare, whose products are not virtualized because we want to maximize our edge performance. As a result, we have to isolate simulations and clean up by ourselves; otherwise, different simulations may cross-affect each other.

For example, for Magic Transit simulations, we need to create a GRE tunnel on an origin server and set up several routes on all three servers, to make sure simulated traffic can flow as real Magic Transit customers' traffic would. We cannot leave these routes in place after the simulation finishes, or there might be a conflict. We once ran into a situation where different simulations required different source IP addresses to the same destination. In our original performance simulation environment, we would have had to modify the simulation applications to avoid these conflicts. This approach is less desirable, as different engineering teams would have to pay attention to other teams' code.

Moreover, the performance simulation addresses only the most basic system test cases: a client sends traffic to a server and measures performance characteristics like requests per second and latency quantiles. But the situations we want to simulate and validate in our production environment can be far more complex.

In our previous example of Magic Transit, customers can configure complicated network topology. Figure 4 is one simplified case. Let’s say Alice establishes four GRE tunnels with Cloudflare; two connect to her data center 1, and traffic will be ECMP hashed between these two tunnels. Similarly, she also establishes another two tunnels to her data center 2 with similar ECMP settings. She would like to see traffic hit data center 1 normally and fail over to data center 2 when tunnels 1 and 2 are both down.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 4. The customer configured Magic Transit to establish four tunnels to her two data centers. Traffic to data center 1 is hashed between tunnel 1 and 2 using ECMP, and traffic to data center 2 is hashed between tunnel 3 and 4. Data center 1 is the primary one, and traffic should failover to data center 2 if tunnels 1 and 2 are both down. Note the number “2” is purely symbolic, as real customers can have more than just 2 data centers, or 2 paths per ECMP route.

In order to examine the effectiveness of route failover, we would need to inject errors on the product servers only after the traffic on the eyeball server has started. But this type of coordination is not achievable with statically defined simulations.

Engineer friendliness and interactivity

Our performance simulation is not engineer-friendly, not just because it is all statically configured in SaltStack (most engineering teams do not possess Salt expertise), but also because it is not integrated with an engineer's daily routine. Can engineers trigger a simulation on every branch build? Can simulation results come back in time to signal that a performance problem has occurred? The answer is no; neither is possible with our static configuration. An engineer can submit a Salt PR to configure a new simulation, but that simulation may have to wait several hours because all other unfinished simulations need to complete first (recall it is just a static loop). Nor can an engineer add a test to the team's own repository to run on every build, because it needs to reside in the SRE-managed Salt repository, which becomes unmanageable as the number of simulations grows.

The Architecture

To address the above limitations, we designed SOAR.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 5. The Architecture of SOAR

The architecture is an extension of the performance simulation structure. We created an internal coordinator service to:

  1. Interface with engineers, so they can submit one-time simulations from their laptops or from the build pipeline, or view previous execution results.
  2. Dispatch and coordinate simulation tasks to each simulation server, where a simulation agent executes these tasks. The coordinator will isolate simulations properly so none of them contends for system resources. For example, the simplest policy we implemented is to never run two simulations on the same server at the same time.

The coordinator is secured by Cloudflare Access, so that only employees can visit this service. The coordinator serves two types of simulations. One-time simulations run in an ad hoc way, mainly on a per-pull-request basis, to ease development testing; they are also callable from our CI system. The other type is repetitive simulations, which are stored in the coordinator's persistent storage. These simulations serve daily monitoring purposes and are executed periodically.

Each simulation server runs a simulation agent. This agent executes two types of tasks received from the coordinator: system tasks and user tasks. System tasks change system-wide configurations and are reverted after each simulation terminates. These include, but are not limited to, route changes, link changes, address changes, ipset changes, and iptables changes.

User tasks, on the other hand, run the benchmarks that we are interested in evaluating, and are terminated if they exceed an allocated execution budget. Each user task is isolated in a cgroup, and the agent ensures that all user tasks are executed with dedicated resources. The generic runtime metrics of user tasks are monitored by cAdvisor and sent to Prometheus and Alertmanager. A user task can export its own metrics to Prometheus as well.
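
As a purely illustrative sketch of this kind of isolation (not our actual agent code), a task can be confined to its own cgroup with bounded CPU and memory, plus an execution budget, using systemd on a cgroup v2 host. The unit name and benchmark command are placeholders.

# Run a hypothetical user benchmark in its own transient cgroup scope,
# capped at two CPU cores and 2 GiB of memory, and kill it after 300 seconds.
systemd-run --scope --unit=soar-user-task-42 \
  -p CPUQuota=200% -p MemoryMax=2G \
  timeout 300 ./user-benchmark --target 192.0.2.1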

For SOAR to run reliably, we provisioned a dedicated environment that enforces the same settings as the production environment, and we operate it as a production system: hardened security, standard alerts on watch, and no engineer access except through approved tools. This largely allows us to run simulations as a stable source of anomaly detection.

Simulating with customer-specific configuration

An important ability of SOAR is to simulate for a specific customer. This will provide the customer with more guarantees that both their configurations and our services are battle-tested with traffic before they go live. It can also be used to bisect problems during a customer escalation, helping customer support to rule out unrelated factors more easily.

All of our edge servers know how to dispatch an incoming customer packet. This factor greatly reduces difficulties in simulating a specific customer. What we need to do in simulation is to mock routing and domain translation on simulated eyeballs and origins, so that they will correctly send traffic to designated product servers. And the problem is solved—magic!

The actual implementation is also straightforward: as simulations run in a LAN environment, we have tight control over how packets are routed (servers are on the same broadcast domain). Any eyeball, origin, or product server can just use a direct routing rule and a static DNS entry in /etc/hosts to redirect packets to the correct destination.

Running a simulation this way allows us to separate customer configuration management from the simulation service: our products manage it, so any time a customer configuration is changed, it is already reflected in simulations without special care.

Implementation and Integration

All SOAR components are built with Golang from scratch on Linux servers. It took three engineer-months to build the prototype and onboard the first engineering use case. While there are other mature platforms for job scheduling, task isolation, and monitoring, building our own allows us to better absorb new requirements from engineering teams, which is much easier and quicker than an external dependency.

In addition, we are currently integrating the simulation service into our release pipeline. Cloudflare built a release manager internally to schedule product version changes in controlled steps: a new product is first deployed into dogfooding data centers. After the product has been trialed by Cloudflare employees, it moves to several canary data centers with limited customer traffic. If nothing bad happens, in an hour or so, it starts to land in larger data centers spread across three tiers. Tier-3 will receive the changes an hour earlier than tier-2, and the same applies to tier-2 and tier-1. This ensures a product would be battle-tested enough before it can serve the majority of Cloudflare customers.

Now we move this further by adding a simulation step even before dogfooding. In this step, all changes are deployed into the simulation environment, and engineering teams will configure which simulations to run. Dogfooding starts only when there is no performance regression or functional breakage. Here performance regression is based on Prometheus metrics, where each engineering team can define their own Prometheus query to interpret the performance results. All configured simulations will run periodically to detect problems in releases that do not tie to a specific product, e.g. a Linux kernel upgrade. SREs receive notifications asynchronously if any issue is discovered.

Simulations at Cloudflare: Case Studies

Simulations are very useful inside Cloudflare. Let’s see some real experiences we had in the past.

Detecting an anomaly on data center specific releases

In our Magic Transit example, the engineering team was about to release physical network interconnect (PNI) support. With PNI, customer data centers physically peer with Cloudflare routers.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 6. The Magic Transit service flow for a customer without PNI support. Any Cloudflare data center can receive eyeball traffic. After mitigating a DoS attack traffic, valid traffic is encapsulated to the customer data center from any of the handling Cloudflare data centers.
Soar: Simulation for Observability, reliAbility, and secuRity
Figure 7. Magic Transit with PNI support. Traffic received from any data center will be moved to the PNI data center that the customer connects to. The PNI data center becomes a choke point.‌‌

However, this PNI functionality introduces a problem in our normal release process: PNI data centers are typically different from our dogfooding and canary data centers. If we still release with the normal process, then two critical phases are skipped. What's worse, the PNI data center could be a choke point in front of that customer's traffic; if the PNI data center is taken down, no other data center can replace its role.

SOAR in this case is an important utility to help. The basic idea is to configure a server with PNI information. This server will act as if it runs in a PNI data center. Then we run a simulated eyeball and origin to examine whether there is any functional breakage:

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 8. SOAR configures a server with PNI information and runs simulated eyeball and origin on this server. If a PNI related code release has a problem, then with proper simulation traffic it will be caught before rolling into production.

With such simulation capability, we were able to detect several problems early on and before releasing. For example, we caught a problem that impacts checksum offloading, which could encapsulate TCP packets with the wrong inner checksum and cause the packets to be dropped at the origin side. This problem does not exist in our virtualized testing environment and integration tests; it only happens when production hardware comes into play. We then use this simulation as a success indicator to test various fixes until we get the packet flow running normally again.

Continuously monitor performance on the edge stack

When a team configures a simulation, it runs on the same stack where all other teams run their products as well. This means when a simulation starts to show unhealthy results, it may or may not directly relate to the product associated with that simulation.

But with continuous simulations, we will have more chances to detect issues before things go south, or at least it will serve as a hint to quickly react to emerging problems. In an example early this year, we noticed one of our performance simulation dashboards showed that some HTTP request throughput was dropping by 20%. After digging into the case, we found our bot detection system had made a change that affected related requests. Luckily enough we moved fast thanks to the hint from the simulation (and some other useful tools like Opentracing).

Our recent enhancement from just HTTP performance simulation to SOAR makes it even more useful for customers. This is because we are now able to simulate with customer-specific configurations, so we might expose customer-specific problems. We are still dogfooding this, and hopefully, we can deploy it to our customers soon.

DoS Attacks as Simulations

When we started to develop Magic Transit, two questions worth answering were how effective our mitigation pipeline is and how to apply thresholds for different customers. For our new ACK flood mitigation system, flowtrackd, we onboarded its performance simulation cases together with a tunable ACK flood. Combined with customer-specific configuration, this allows us to compare throughput results under different attack volumes and systematically tune our mitigation thresholds.

Another important capability of our "attack simulation" system is replaying attacks we have seen in the past, making sure that future development of our mitigation pipelines never lets these known attacks through to our customers.

Conclusion

In this article, we introduced Cloudflare’s simulation system, SOAR. While simulation is not a new tool, we can use it to improve reliability, observability, and security. Our adoption of SOAR is still in its early stages, but we are pretty confident that, by fully leveraging simulations, we will push our quality of service to a new level.

How Netflix Scales its API with GraphQL Federation (Part 2)

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-2-bbe71aaec44a

In our previous post and QConPlus talk, we discussed GraphQL Federation as a solution for distributing our GraphQL schema and implementation. In this post, we shift our attention to what is needed to run a federated GraphQL platform successfully — from our journey implementing it to lessons learned.

Netflix GraphQL Federation

Our Journey so Far

Over the past year, we’ve implemented the core infrastructure pieces necessary for a federated GraphQL architecture as described in our previous post:

Studio Edge Architecture Diagram
Studio Edge Architecture

The first Domain Graph Service (DGS) on the platform was the former GraphQL monolith that we discussed in our first post (Studio API). Next, we worked with a few other application teams to make DGSs that would expose their APIs alongside the former monolith. We had our first Studio applications consuming the federated graph, without any performance degradation, by the end of 2019. Once we knew that the architecture was feasible, we focused on readying it for broader usage. Our goal was to open up the Studio Edge platform for self-service in April 2020.

April 2020 was a turbulent time with the pandemic and overnight transition to working remotely. Nevertheless, teams started to jump into the graph in droves. Soon we had hundreds of engineers contributing directly to the API on a daily basis. And what about that Studio API monolith that used to be a bottleneck? We migrated the fields exposed by Studio API to individually owned DGSs without breaking the API for consumers. The original monolith is slated to be completely deprecated by the end of 2020.

This journey hasn’t been without its challenges. The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps.

Once we achieved broad alignment on the idea, we needed to ensure that adoption was seamless. This required building robust core infrastructure, ensuring a great developer experience, and solving for key cross-cutting concerns.

Core Infrastructure

Our GraphQL Gateway is based on Apollo's reference implementation and is written in Kotlin. This gives us access to Netflix's Java ecosystem, while also giving us robust language features such as coroutines for efficient parallel fetches and an expressive type system with null safety.

The schema registry is developed in-house, also in Kotlin. For storing schema changes, we use an internal library that implements the event sourcing pattern on top of the Cassandra database. Using event sourcing allows us to implement new developer experience features such as the Schema History view. The schema registry also integrates with our CI/CD systems like Spinnaker to automatically set up cloud networking for DGSs.

Developer Education & Experience

In the previous architecture, only the monolith Studio API team needed to learn GraphQL. In Studio Edge, every DGS team needs to build expertise in GraphQL. GraphQL has its own learning curve and can get especially tricky for complex cases like batching & lookahead. Also, as discussed in the previous post, understanding GraphQL Federation and implementing entity resolvers is not trivial either.

We partnered with Netflix’s Developer Experience (DevEx) team to build out documentation, training materials, and tutorials for developers. For general GraphQL questions, we lean on the open source community plus cultivate an internal GraphQL community to discuss hot topics like pagination, error handling, nullability, and naming conventions.

DGS Framework & Developer Tools

To make it easy for backend engineers to build a GraphQL DGS, the DevEx team built a "DGS Framework" on top of GraphQL Java and Spring Boot. The framework takes care of all the cross-cutting concerns of running a GraphQL service in production while also making it easier for developers to write GraphQL resolvers. In addition, DevEx built robust tooling for pushing schemas to the Schema Registry and a Self Service UI for browsing the various DGSs' schemas. Check out their conference talk and expect a future blog post from our colleagues. The DGS framework is planned to be open-sourced in early 2021.

Schema Governance

Netflix’s studio data is extremely rich and complex. Early on, we anticipated that active schema management would be crucial for schema evolution and overall health. We had a Studio Data Architect already in the org who was focused on data modeling and alignment across Studio. We engaged with them to determine graph schema best practices to best suit the needs of Studio Engineering.

Our goal was to design a GraphQL schema that was reflective of the domain itself, not the database model. UI developers should not have to build Backends For Frontends (BFF) to massage the data for their needs, rather, they should help shape the schema so that it satisfies their needs. Embracing a collaborative schema design approach was essential to achieving this goal.

Schema Design Workflow Diagram
Schema Design Workflow

The collaborative design process involves feedback and reviews across team boundaries. To streamline schema design and review, we formed a schema working group and a managed technical program for on-boarding to the federated architecture. While reviews add overhead to the product development process, we believe that prioritizing the quality of the graph model will reduce the amount of future changes and reworking needed. The level of review varies based on the entities affected; for the core federated types, more rigor is required (though tooling helps streamline that flow).

We have a deprecation workflow in place for evolving the schema. We’ve leveraged GraphQL’s deprecation feature and also track usage stats for every field in the schema. Once the stats show that a deprecated field is no longer used, we can make a backward incompatible change to remove the field from the schema.

Clients with Deprecated Field Usage
Clients with Deprecated Field Usage

We embraced a schema-first approach instead of generating our schema from existing models such as the Protobuf objects in our gRPC APIs. While Protobufs and gRPC are excellent solutions for building service APIs, we prefer decoupling our GraphQL schema from those layers to enable cleaner graph design and independent evolvability. In some scenarios, we implement generic mapping code from GraphQL resolvers to gRPC calls, but the extra boilerplate is worth the long-term flexibility of the GraphQL API.

Underlying our approach is a foundation of “context over control”, which is a key tenet of Netflix’s culture. Instead of trying to hold tight control of the entire graph, we give guidance and context to product teams so that they can apply their domain knowledge to make a flexible API for their domain. As this architecture matures, we will continue to monitor schema health and develop new tooling, processes, and best practices where needed.

Observability

In our previous architecture, observability was achieved through manual analysis and routing via the API team, which scaled poorly. For our federated architecture, we prioritized solving observability needs in a more scalable manner. We prioritized three areas:

  • Alerting — report when something goes awry
  • Discovery — easily determine what isn’t working
  • Diagnosis — debug why something isn’t working

Our guiding metrics in this space are mean time to resolution (MTTR) and service level objectives and indicators (SLO/SLI).

We teamed up with experts from Netflix’s Telemetry team. We integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale. In GraphQL, almost every response is a 200 with custom errors in the error block. We introspect these custom error codes from the response and emit them to our metrics server, Atlas. These integrations created a great foundation of rich visibility and insights for the consumers and developers of the GraphQL API.

Trace for a Federated Request Lifecycle
Edgar Trace for a Federated Request Lifecycle
Timeline View for a Federated Request lifecycle
Timeline View for a Federated Request

Distributed Log Correlation helps with debugging more complex server issues. By surfacing the application level logging details for all systems involved in processing a request, we gain deeper insights into what happened across the stack. Developers can easily see what was happening around the same time as a given request, to inspect surrounding factors that might have impacted an interaction.

Log correlation across multiple services for a request lifecycle
Logs across multiple services for a Federated Request

To solve the “who do I ask about…” routing problem, we integrated deep linking from GraphQL types and fields to their owning team’s support channels. Finding support is now as simple as clicking a link from a trace, which helps shorten MTTR and reduce the number of times the gateway team needs to get involved.

Securing the Federated Graph

Our goal is to enable robust and consistent security practices across the federated architecture. To achieve this, we partnered with the security experts at Netflix to build security into the graph. Let’s look at two essential parts of our security solution: AuthN and AuthZ.

Authentication

All of our product experiences in the Studio space require an authenticated account, so we restrict the GraphQL Gateway access to only trusted authenticated callers. Additionally, Graph Introspection is restricted to Netflix internal developers.

Authorization

Before Studio Edge, authorization logic was fragmented across teams. Some teams implemented authorization in their BFFs, some in microservices, and others did both for good measure. The result was often a different authorization story for a given piece of data depending on which UI a user was accessing it through. UI teams also found themselves needing to implement (and re-implement) authorization checks with each new frontend.

In Studio Edge, we delegated the authorization responsibility to DGS owners. This resulted in consistent authorization for the same user across different applications. Plus, Product Managers, Engineers and the Security team can easily get a bird’s eye view of who has access to each data type and how.

We have multiple authorization offerings within Netflix: from a simple system that grants access based on user identity to a more granular system that brings in the concept of roles and capabilities. DGS developers can choose a solution based on their needs. Then they simply annotate their resolvers with @Secured annotation and configure that to use one of the available systems. If needed, more complex authorization can be implemented in the resolver or in downstream systems.

Future of Authorization

We are currently prototyping a GraphQL-aware authorization solution. The Schema Registry automatically generates Access Control Groups (ACGs) for each field and its corresponding type when its schema is registered. Product managers & DGS Engineers decide membership and rules for these generated ACGs. Since the ACGs map to a field in GraphQL, the DGS framework then automatically applies the rules associated with the ACG during execution.

Architecting for Failure

The GraphQL Gateway is the single entry point for all requests; a failure on the gateway can cause significant disruptions. Following Netflix engineering best practices, we assume failures will happen and design ways to mitigate the impact of those failures. These are our design principles for ensuring the gateway layer is resilient:

  1. Single purpose
  2. Stateless service
  3. Demand controlled
  4. Multi-region
  5. Sharded by functionality

First, we focus the responsibilities of the gateway layer on a single purpose: parse client queries, then build and execute query plans. By reducing the scope, we limit the range of problems that can occur. We aim to perform any additional resource-intensive operations off-box with the exception of logging and metrics. Taking on additional unrelated logic in the gateway layer could increase surface area for failures in this critical tier.

Second, we run multiple stateless instances of the gateway service. Any gateway instance is able to generate and execute a query plan for any request. When we make code changes to the gateway layer, we rigorously test them before rolling out to production.

Third, we seek to balance the resources each request consumes through applying demand control. We rate-limit callers to avoid overloading the underlying databases that are the source of most of our domain elements. We also run a static query cost calculation on all incoming queries and reject expensive queries to avoid gridlock in gateway and DGS resources. Our partners understand these tradeoffs and work with us to meet these requirements, reworking expensive queries and reducing high volume callers.
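The exact cost model is not described here, but graphql-java ships instrumentation that aborts queries above a complexity or depth budget, which is one way to implement this kind of static guard; the limits below are arbitrary examples, not production values.

```java
import graphql.GraphQL;
import graphql.analysis.MaxQueryComplexityInstrumentation;
import graphql.analysis.MaxQueryDepthInstrumentation;
import graphql.execution.instrumentation.ChainedInstrumentation;
import graphql.schema.GraphQLSchema;

import java.util.List;

public class GatewayExecutionFactory {

    // Budgets are illustrative; real limits would be tuned with partner teams.
    private static final int MAX_COMPLEXITY = 200;
    private static final int MAX_DEPTH = 12;

    public static GraphQL build(GraphQLSchema schema) {
        // Each field contributes to the complexity score; queries over budget
        // are rejected before any resolver or downstream DGS is called.
        return GraphQL.newGraphQL(schema)
            .instrumentation(new ChainedInstrumentation(List.of(
                new MaxQueryComplexityInstrumentation(MAX_COMPLEXITY),
                new MaxQueryDepthInstrumentation(MAX_DEPTH))))
            .build();
    }
}
```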

Fourth, we deploy our gateway layer to multiple AWS regions around the world. This allows us to limit the blast radius for problems that inevitably arise. When problems happen, we can fail over to another region to ensure our clients are minimally impacted.

Last, we deploy multiple functional shards of our gateway layer. The code is the same in each shard and incoming requests are routed based on category. For example, GraphQL subscriptions generally result in long-lived connections while Queries & Mutations are short-lived. We use a separate fleet of instances for Subscriptions so “running out of connections” does not affect the availability of Queries and Mutations.

There is more we can do to improve resilience. We have plans to do canary deployments and analysis for gateway deployments and, eventually, schema changes. Today, our gateway dynamically updates its schema by polling the schema registry. We are in the process of decoupling these by storing the federation config in a versioned S3 bucket, making the gateway resilient to schema registry failures.

Closing Thoughts

GraphQL and Federation have been a productivity multiplier for Studio applications. Motivated by this, we’ve recently prototyped using GraphQL Federation for the Netflix consumer app search page on iOS & Android. To do this, we created three DGSs to provide the data for a minimal portion of the consumer graph. We are sending a small subset of users to this alternative stack and measuring high-level metrics. We are excited to see the results and explore further applicability in the Netflix consumer space.

Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration require high commitment from partner teams and seamless cross-functional collaboration. If you’re considering going in this direction, we recommend checking out Apollo’s SaaS offering for Federation and the many online resources for learning GraphQL. For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability have made the transition worth it.

In closing, we want to hear from you! If you have already implemented federation or tried to solve this problem with another approach, we would love to learn more. Sharing knowledge is one of the ways our industry learns and improves rapidly. Finally, if you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare, Edited by Philip Fisher-Ogden

Additional Credits: Stephen Spalding, Jennifer Shin, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer


How Netflix Scales its API with GraphQL Federation (Part 2) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Netflix’s Distributed Tracing Infrastructure

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304

by Maulik Pandey

Our Team — Kevin Lew, Narayanan Arunachalam, Elizabeth Carretto, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Maulik Pandey

“@Netflixhelps Why doesn’t Tiger King play on my phone?” — a Netflix member via Twitter

This is an example of a question our on-call engineers need to answer to help resolve a member issue — which is difficult when troubleshooting distributed systems. Investigating a video streaming failure consists of inspecting all aspects of a member account. In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. Now let’s look at how we designed the tracing infrastructure that powers Edgar.

Distributed Tracing: the missing context in troubleshooting services at scale

Prior to Edgar, our engineers had to sift through a mountain of metadata and logs pulled from various Netflix microservices in order to understand a specific streaming failure experienced by any of our members. Reconstructing a streaming session was a tedious and time-consuming process that involved tracing all interactions (requests) between the Netflix app, our Content Delivery Network (CDN), and backend microservices. The process started with a manual pull of member account information that was part of the session. The next step was to put all the puzzle pieces together and hope the resulting picture would help resolve the member issue. We needed to increase engineering productivity via distributed request tracing.

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs. This insight led us to build Edgar: a distributed tracing infrastructure and user experience.

Figure 1. Troubleshooting a session in Edgar

When we started building Edgar four years ago, there were very few open-source distributed tracing systems that satisfied our needs. Our tactical approach was to use Netflix-specific libraries for collecting traces from Java-based streaming services until open source tracer libraries matured. By 2017, open source projects like Open-Tracing and Open-Zipkin were mature enough for use in polyglot runtime environments at Netflix. We chose Open-Zipkin because it had better integrations with our Spring Boot based Java runtime environment. We use Mantis for processing the stream of collected traces, and we use Cassandra for storing traces. Our distributed tracing infrastructure is grouped into three sections: tracer library instrumentation, stream processing, and storage. Traces collected from various microservices are ingested in a stream processing manner into the data store. The following sections describe our journey in building these components.

Trace Instrumentation: how will it impact our service?

That is the first question our engineering teams asked us when integrating the tracer library. It is an important question because tracer libraries intercept all requests flowing through mission-critical streaming services. Safe integration and deployment of tracer libraries in our polyglot runtime environments was our top priority. We earned the trust of our engineers by developing empathy for their operational burden and by focusing on providing efficient tracer library integrations in runtime environments.

Distributed tracing relies on propagating context for local interprocess calls (IPC) and client calls to remote microservices for any arbitrary request. Passing the request context captures causal relationships between microservices during runtime. We adopted Open-Zipkin’s B3 HTTP header based context propagation mechanism. We ensure that context propagation headers are correctly passed between microservices across a variety of our “paved road” Java and Node runtime environments, which include both older environments with legacy codebases and newer environments such as Spring Boot. We execute the Freedom & Responsibility principle of our culture in supporting tracer libraries for environments like Python, NodeJS, and Ruby on Rails that are not part of the “paved road” developer experience. Our loosely coupled but highly aligned engineering teams have the freedom to choose an appropriate tracer library for their runtime environment and have the responsibility to ensure correct context propagation and integration of network call interceptors.
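For illustration, here is a minimal sketch of what B3 propagation amounts to on the client side: copying the incoming B3 headers onto an outgoing call so the downstream service joins the same trace. Real tracer libraries also create child spans and handle sampling, which is omitted here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;

public final class B3Propagation {

    // The B3 context propagation headers defined by Zipkin.
    private static final List<String> B3_HEADERS = List.of(
        "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled", "X-B3-Flags");

    private B3Propagation() {}

    // Copies any B3 headers present on the incoming request onto the outgoing
    // client call so the downstream service can attach its spans to the same trace.
    public static HttpRequest.Builder propagate(Map<String, String> incomingHeaders, URI downstreamUri) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(downstreamUri);
        for (String header : B3_HEADERS) {
            String value = incomingHeaders.get(header);
            if (value != null) {
                builder.header(header, value);
            }
        }
        return builder;
    }
}
```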

Our runtime environment integrations inject infrastructure tags like service name, auto-scaling group (ASG), and container instance identifiers. Edgar uses this infrastructure tagging schema to query and join traces with log data for troubleshooting streaming sessions. Additionally, it became easy to provide deep links to different monitoring and deployment systems in Edgar due to consistent tagging. With runtime environment integrations in place, we had to set an appropriate trace data sampling policy for building a troubleshooting experience.

Stream Processing: to sample or not to sample trace data?

This was the most important question we considered when building our infrastructure because data sampling policy dictates the amount of traces that are recorded, transported, and stored. A lenient trace data sampling policy generates a large number of traces in each service container and can lead to degraded performance of streaming services as more CPU, memory, and network resources are consumed by the tracer library. An additional implication of a lenient sampling policy is the need for scalable stream processing and storage infrastructure fleets to handle increased data volume.

We knew that a heavily sampled trace dataset is not reliable for troubleshooting because there is no guarantee that the request you want is in the gathered samples. We needed a thoughtful approach for collecting all traces in the streaming microservices while keeping low operational complexity of running our infrastructure.

Most distributed tracing systems enforce sampling policy at the request ingestion point in a microservice call graph. We took a hybrid head-based sampling approach that allows for recording 100% of traces for a specific and configurable set of requests, while continuing to randomly sample traffic per the policy set at ingestion point. This flexibility allows tracer libraries to record 100% traces in our mission-critical streaming microservices while collecting minimal traces from auxiliary systems like offline batch data processing. Our engineering teams tuned their services for performance after factoring in increased resource utilization due to tracing. The next challenge was to stream large amounts of traces via a scalable data processing platform.
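Before moving on to the streaming pipeline, here is a minimal sketch of what such a hybrid head-based decision could look like; the application names and the default probability are assumptions for illustration, not our actual policy.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class HybridHeadSampler {

    // Applications (or request types) that must always be traced; names are illustrative.
    private final Set<String> alwaysSample;
    // Random sampling probability applied to everything else.
    private final double defaultProbability;

    public HybridHeadSampler(Set<String> alwaysSample, double defaultProbability) {
        this.alwaysSample = alwaysSample;
        this.defaultProbability = defaultProbability;
    }

    // The decision is made once at the ingestion point and propagated (e.g. via
    // X-B3-Sampled) so every downstream service honors the same choice.
    public boolean shouldSample(String applicationName) {
        if (alwaysSample.contains(applicationName)) {
            return true; // mission-critical path: record 100% of traces
        }
        return ThreadLocalRandom.current().nextDouble() < defaultProbability;
    }
}
```

A hypothetical configuration such as new HybridHeadSampler(Set.of("playback-api"), 0.001) would record every trace for the critical path while randomly sampling 0.1% of everything else.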

Mantis is our go-to platform for processing operational data at Netflix. We chose Mantis as our backbone to transport and process large volumes of trace data because we needed a backpressure-aware, scalable stream processing system. Our trace data collection agent transports traces to Mantis job cluster via the Mantis Publish library. We buffer spans for a time period in order to collect all spans for a trace in the first job. A second job taps the data feed from the first job, does tail sampling of data and writes traces to the storage system. This setup of chained Mantis jobs allows us to scale each data processing component independently. An additional advantage of using Mantis is the ability to perform real-time ad-hoc data exploration in Raven using the Mantis Query Language (MQL). However, having a scalable stream processing platform doesn’t help much if you can’t store data in a cost efficient manner.

Storage: don’t break the bank!

We started with Elasticsearch as our data store due to its flexible data model and querying capabilities. As we onboarded more streaming services, the trace data volume started increasing exponentially. The increased operational burden of scaling Elasticsearch clusters due to the high data write rate became painful for us. Read queries took increasingly long to finish because Elasticsearch clusters were using heavy compute resources to index ingested traces. The high data ingestion rate eventually degraded both read and write operations. We solved this by migrating to Cassandra as our data store for handling high data ingestion rates. Using simple lookup indices in Cassandra gives us the ability to maintain acceptable read latencies while doing heavy writes.

In theory, scaling up horizontally would allow us to handle higher write rates and retain larger amounts of data in Cassandra clusters. This implies that the cost of storing traces grows linearly with the amount of data being stored. We needed to ensure storage cost growth was sub-linear to the amount of data being stored. In pursuit of this goal, we outlined the following storage optimization strategies:

  1. Use cheaper Elastic Block Store (EBS) volumes instead of SSD instance stores in EC2.
  2. Employ a better compression technique to reduce trace data size.
  3. Store only relevant and interesting traces by using simple rules-based filters.

We were adding new Cassandra nodes whenever the EC2 SSD instance stores of existing nodes reached maximum storage capacity. The use of a cheaper EBS Elastic volume instead of an SSD instance store was an attractive option because AWS allows dynamic increase in EBS volume size without re-provisioning the EC2 node. This allowed us to increase total storage capacity without adding a new Cassandra node to the existing cluster. In 2019 our stunning colleagues in the Cloud Database Engineering (CDE) team benchmarked EBS performance for our use case and migrated existing clusters to use EBS Elastic volumes. By optimizing the Time Window Compaction Strategy (TWCS) parameters, they reduced the disk write and merge operations of Cassandra SSTable files, thereby reducing the EBS I/O rate. This optimization helped us reduce the data replication network traffic amongst the cluster nodes because SSTable files were created less often than in our previous configuration. Additionally, by enabling Zstd block compression on Cassandra data files, the size of our trace data files was reduced by half. With these optimized Cassandra clusters in place, it now costs us 71% less to operate clusters and we can store 35x more data than in our previous configuration.

We observed that Edgar users explored less than 1% of collected traces. This insight led us to believe that we can reduce write pressure and retain more data in the storage system if we drop traces that users will not care about. We currently use a simple rule-based filter in our Storage Mantis job that retains interesting traces even for service call paths that are very rarely viewed in Edgar. The filter qualifies a trace as an interesting data point by inspecting all buffered spans of a trace for warnings, errors, and retry tags. This tail-based sampling approach reduced the trace data volume by 20% without impacting user experience. There is an opportunity to use machine learning based classification techniques to further reduce trace data volume.
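To make the rules-based approach concrete, here is a minimal sketch of such a filter over buffered spans; the tag names and the Span shape are assumptions for illustration, not the actual Storage Mantis job.

```java
import java.util.List;
import java.util.Map;

public class InterestingTraceFilter {

    // Hypothetical span shape: just an id and its key/value tags.
    public record Span(String spanId, Map<String, String> tags) {}

    // A trace is worth storing if any of its buffered spans carries an
    // error, warning, or retry tag; everything else can be dropped.
    public boolean isInteresting(List<Span> bufferedSpans) {
        for (Span span : bufferedSpans) {
            Map<String, String> tags = span.tags();
            if (tags.containsKey("error")
                    || tags.containsKey("warning")
                    || "true".equalsIgnoreCase(tags.get("retry"))) {
                return true;
            }
        }
        return false;
    }
}
```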

While we have made substantial progress, we are now at another inflection point in building our trace data storage system. Onboarding new user experiences on Edgar could require us to store 10x the amount of current data volume. As a result, we are currently experimenting with a tiered storage approach for a new data gateway. This data gateway provides a querying interface that abstracts the complexity of reading and writing data from tiered data stores. Additionally, the data gateway routes ingested data to the Cassandra cluster and transfers compacted data files from Cassandra cluster to S3. We plan to retain the last few hours worth of data in Cassandra clusters and keep the rest in S3 buckets for long term retention of traces.

Table 1. Timeline of Storage Optimizations

Secondary advantages

In addition to powering Edgar, trace data is used for the following use cases:

Application Health Monitoring

Trace data is a key signal used by Telltale in monitoring macro level application health at Netflix. Telltale uses the causal information from traces to infer microservice topology and correlate traces with time series data from Atlas. This approach paints a richer observability portrait of application health.

Resiliency Engineering

Our chaos engineering team uses traces to verify that failures are correctly injected while our engineers stress test their microservices via Failure Injection Testing (FIT) platform.

Regional Evacuation

The Demand Engineering team leverages tracing to improve the correctness of prescaling during regional evacuations. Traces provide visibility into the types of devices interacting with microservices such that changes in demand for these services can be better accounted for when an AWS region is evacuated.

Estimate infrastructure cost of running an A/B test

The Data Science and Product team factors in the costs of running A/B tests on microservices by analyzing traces that have relevant A/B test names as tags.

What’s next?

The scope and complexity of our software systems continue to increase as Netflix grows. We will focus on the following areas for extending Edgar:

  • Provide a great developer experience for collecting traces across all runtime environments. With an easy way to try out distributed tracing, we hope that more engineers instrument their services with traces and provide additional context for each request by tagging relevant metadata.
  • Enhance our analytics capability for querying trace data to enable power users at Netflix in building their own dashboards and systems for narrowly focused use cases.
  • Build abstractions that correlate data from metrics, logging, and tracing systems to provide additional contextual information for troubleshooting.

As we progress in building distributed tracing infrastructure, our engineers continue to rely on Edgar for troubleshooting streaming issues like “Why doesn’t Tiger King play on my phone?”. Our distributed tracing infrastructure helps in ensuring that Netflix members continue to enjoy a must-watch show like Tiger King!

We are looking for stunning colleagues to join us on this journey of building distributed tracing infrastructure. If you are passionate about Observability then come talk to us.


Building Netflix’s Distributed Tracing Infrastructure was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements, both functional and non-functional together, dictate the architectural pattern we choose to build the application. These patterns include multi-tier, event-driven architecture, microservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: Flow and agility to consistently deploy new features
  2. Observability: Feedback about the state of the application
  3. Disposability: Throwing away resources and provisioning new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with APIs, documentation, and other technologies that make the system easy to understand and use. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow: the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. The architecture also needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research prove that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

Using AWS, in order to achieve flow of deployability, we will use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.
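As a small illustration of publishing such CI/CD metrics, here is a hedged sketch using the AWS SDK for Java v2 and CloudWatch custom metrics; the namespace, metric name, and dimension are made up for the example.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class PipelineMetrics {

    // Publishes one data point per successful deployment so dashboards can
    // track deployment frequency per application over time.
    public static void recordDeployment(CloudWatchClient cloudWatch, String applicationName) {
        MetricDatum datum = MetricDatum.builder()
            .metricName("DeploymentCount")          // illustrative metric name
            .unit(StandardUnit.COUNT)
            .value(1.0)
            .dimensions(Dimension.builder()
                .name("Application")
                .value(applicationName)
                .build())
            .build();

        cloudWatch.putMetricData(PutMetricDataRequest.builder()
            .namespace("Example/CiCd")              // illustrative namespace
            .metricData(datum)
            .build());
    }
}
```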

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective of giving the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated.
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Edgar: Solving Mysteries Faster with Observability

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f

Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

by Elizabeth Carretto

Everyone loves Unsolved Mysteries. There’s always someone who seems like the surefire culprit. There’s a clear motive, the perfect opportunity, and an incriminating footprint left behind. Yet, this is Unsolved Mysteries! It’s never that simple. Whether it’s a cryptic note behind the TV or a mysterious phone call from an unknown number at a critical moment, the pieces rarely fit together perfectly. As mystery lovers, we want to answer the age-old question of whodunit; we want to understand what really happened.

For engineers, instead of whodunit, the question is often “what failed and why?” When a problem occurs, we put on our detective hats and start our mystery-solving process by gathering evidence. The more complex a system, the more places to look for clues. An engineer can find herself digging through logs, poring over traces, and staring at dozens of dashboards.

All of these sources make it challenging to know where to begin and add to the time spent figuring out what went wrong. While this abundance of dashboards and information is by no means unique to Netflix, it certainly holds true within our microservices architecture. Each microservice may be easy to understand and debug individually, but what about when combined into a request that hits tens or hundreds of microservices? Searching for key evidence becomes like digging for a needle in a group of haystacks.

Example call graph in Edgar

In some cases, the question we’re answering is, “What’s happening right now??” and every second without resolution can carry a heavy cost. We want to resolve the problem as quickly as possible so our members can resume enjoying their favorite movies and shows. For teams building observability tools, the question is: how do we make understanding a system’s behavior fast and digestible? Quick to parse, and easy to pinpoint where something went wrong even if you aren’t deeply familiar with the inner workings and intricacies of that system? At Netflix, we’ve answered that question with a suite of observability tools. In an earlier blog post, we discussed Telltale, our health monitoring system. Telltale tells us when an application is unhealthy, but sometimes we need more fine-grained insight. We need to know why a specific request is failing and where. We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

What is Edgar?

Edgar is a self-service tool for troubleshooting distributed systems, built on a foundation of request tracing, with additional context layered on top. With request tracing and additional data from logs, events, metadata, and analysis, Edgar is able to show the flow of a request through our distributed system — what services were hit by a call, what information was passed from one service to the next, what happened inside that service, how long did it take, and what status was emitted — and highlight where an issue may have occurred. If you’re familiar with platforms like Zipkin or OpenTelemetry, this likely sounds familiar. But, there are a few substantial differences in how Edgar approaches its data and its users.

  • While Edgar is built on top of request tracing, it also uses the traces as the thread to tie additional context together. Deriving meaningful value from trace data alone can be challenging, as Cindy Sridharan articulated in this blog post. In addition to trace data, Edgar pulls in additional context from logs, events, and metadata, sifting through them to determine valuable and relevant information, so that Edgar can visually highlight where an error occurred and provide detailed context.
  • Edgar captures 100% of interesting traces, as opposed to sampling a small fixed percentage of traffic. This difference has substantial technological implications, from the classification of what’s interesting to transport to cost-effective storage (keep an eye out for later Netflix Tech Blog posts addressing these topics).
  • Edgar provides a powerful and consumable user experience to both engineers and non-engineers alike. If you embrace the cost and complexity of storing vast amounts of traces, you want to get the most value out of that cost. With Edgar, we’ve found that we can leverage that value by curating an experience for additional teams such as customer service operations, and we have embraced the challenge of building a product that makes trace data easy to access, easy to grok, and easy for several user personas to gain insight from.

Tracing as a foundation

Logs, metrics, and traces are the three pillars of observability. Metrics communicate what’s happening on a macro scale, traces illustrate the ecosystem of an isolated request, and the logs provide a detail-rich snapshot into what happened within a service. These pillars have immense value and it is no surprise that the industry has invested heavily in building impressive dashboards and tooling around each. The downside is that we have so many dashboards. In one request hitting just ten services, there might be ten different analytics dashboards and ten different log stores. However, a request has its own unique trace identifier, which is a common thread tying all the pieces of this request together. The trace ID is typically generated at the first service that receives the request and then passed along from service to service as a header value. This makes the trace a great starting point to unify this data in a centralized location.

A trace is a set of segments representing each step of a single request throughout a system. Distributed tracing is the process of generating, transporting, storing, and retrieving traces in a distributed system. As a request flows between services, each distinct unit of work is documented as a span. A trace is made up of many spans, which are grouped together using a trace ID to form a single, end-to-end umbrella. A span:

  • Represents a unit of work, such as a network call from one service to another (a client/server relationship) or a purely internal action (e.g., starting and finishing a method).
  • Relates to other spans through a parent/child relationship.
  • Contains a set of key value pairs called tags, where service owners can attach helpful values such as urls, version numbers, regions, corresponding IDs, and errors. Tags can be associated with errors or warnings, which Edgar can display visually on a graph representation of the request.
  • Has a start time and an end time. Thanks to these timestamps, a user can quickly see how long the operation took.

The trace (along with its underlying spans) allows us to graphically represent the request chronologically.
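A minimal sketch of the span shape just described, following common Zipkin-style conventions; the field names are illustrative rather than Edgar’s actual storage model.

```java
import java.time.Instant;
import java.util.Map;

// One unit of work within a trace: a network call or a purely internal action.
public record Span(
        String traceId,        // shared by every span in the same request
        String spanId,
        String parentSpanId,   // null for the root span
        String serviceName,
        String operationName,
        Instant startTime,
        Instant endTime,
        Map<String, String> tags) {  // e.g. url, region, version, error details

    // Duration is derived from the two timestamps, so a timeline view can be drawn.
    public long durationMillis() {
        return endTime.toEpochMilli() - startTime.toEpochMilli();
    }
}
```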

Sample timeline view of a trace, based on Jaeger UI’s timeline view

Adding context to traces

With distributed tracing alone, Edgar is able to draw the path of a request as it flows through various systems. This centralized view is extremely helpful to determine which services were hit and when, but it lacks nuance. A tag might indicate there was an error but doesn’t fully answer the question of what happened. Adding logs to the picture can help a great deal. With logs, a user can see what the service itself had to say about what went wrong. If a data fetcher fails, the log can tell you what query it was running and what exact IDs or fields led to the failure. That alone might give an engineer the knowledge she needs to reproduce the issue. In Edgar, we parse the logs looking for error or warning values. We add these errors and warnings to our UI, highlighting them in our call graph and clearly associating them with a given service, to make it easy for users to view any errors we uncovered.

Example view of errors associated with a service, including an error parsed from a log

With the trace and additional context from logs illustrating the issue, one of the next questions is how this individual trace fits into the overall health and behavior of each service. Is this an anomaly or are we dealing with a pattern? To help answer this question, Edgar pulls in anomaly detection from a partner application, Telltale. Telltale provides Edgar with latency benchmarks that indicate if the individual trace’s latency is abnormal for this given service. A trace alone could tell you that a service took 500ms to respond, but it takes in-depth knowledge of a particular service’s typical behavior to determine whether this response time is an outlier. Telltale’s anomaly analysis looks at historic behavior and can evaluate whether the latency experienced by this trace is anomalous. With this knowledge, Edgar can then visually warn that something happened in a service that caused its latency to fall outside of normal bounds.
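Telltale’s actual analysis is more sophisticated, but the core idea can be sketched as flagging a latency that sits far outside a service’s historical baseline; the threshold here is arbitrary.

```java
public class LatencyAnomalyCheck {

    // Flags a latency as anomalous if it is more than `threshold` standard
    // deviations above the historical mean for that service.
    public static boolean isAnomalous(double observedMillis,
                                      double historicalMeanMillis,
                                      double historicalStdDevMillis,
                                      double threshold) {
        if (historicalStdDevMillis <= 0) {
            return observedMillis > historicalMeanMillis; // degenerate baseline
        }
        double zScore = (observedMillis - historicalMeanMillis) / historicalStdDevMillis;
        return zScore > threshold;
    }
}
```

For instance, a 500ms response against a historical mean of 120ms with a standard deviation of 40ms yields a z-score of 9.5 and would be flagged with a threshold of 3.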

Sample latency analysis

Edgar should reduce burden, not add to it

Presenting all of this data in one interface reduces the footwork of an engineer to uncover each source. However, discovery is only part of the path to resolution. With all the evidence presented and summarized by Edgar, an engineer may know what went wrong and where it went wrong. This is a huge step towards resolution, but not yet cause for celebration. The root cause may have been identified, but who owns the service in question? Many times, finding the right point of contact would require a jump into Slack or a company directory, which costs more time. In Edgar, we have integrated with our services to provide that information in-app alongside the details of a trace. For any service configured with an owner and support channel, Edgar provides a link to a service’s contact email and their Slack channel, smoothing the hand-off from one party to the next. If an engineer does need to pass an issue along to another team or person, Edgar’s request detail page contains all the context — the trace, logs, analysis — and is easily shareable, eliminating the need to write a detailed description or provide a cascade of links to communicate the issue.

Edgar’s request detail page

A key aspect of Edgar’s mission is to minimize the burden on both users and service owners. With all of its data sources, the sheer quantity of data could become overwhelming. It is essential for Edgar to maintain a prioritized interface, built to highlight errors and abnormalities to the user and assist users in taking the next step towards resolution. As our UI grows, it’s important to be discerning and judicious in how we handle new data sources, weaving them into our existing errors and warnings models to minimize disruption and to facilitate speedy understanding. We lean heavily on focus groups and user feedback to ensure a tight feedback loop so that Edgar can continue to meet our users’ needs as their services and use cases evolve.

As services evolve, they might change their log format or use new tags to indicate errors. We built an admin page to give our service owners that configurability and to decouple our product from in-depth service knowledge. Service owners can configure the essential details of their log stores, such as where their logs are located and what fields they use for trace IDs and span IDs. Knowing their trace and span IDs is what enables Edgar to correlate the traces and logs. Beyond that though, what are the idiosyncrasies of their logs? Some fields may be irrelevant or deprecated, and teams would like to hide them by default. Alternatively, some fields contain the most important information, and by promoting them in the Edgar UI, they are able to view these fields more quickly. This self-service configuration helps reduce the burden on service owners.

Initial log configuration in Edgar

Leveraging Edgar

In order for users to turn to Edgar in a situation when time is of the essence, users need to be able to trust Edgar. In particular, they need to be able to count on Edgar having data about their issue. Many approaches to distributed tracing involve setting a sample rate, such as 5%, and then only tracing that percentage of request traffic. Instead of sampling a fixed percentage, Edgar’s mission is to capture 100% of interesting requests. As a result, when an error happens, Edgar’s users can be confident they will be able to find it. That’s key to positioning Edgar as a reliable source. Edgar’s approach makes a commitment to have data about a given issue.

In addition to storing trace data for all requests, Edgar implemented a feature to collect additional details on-demand at a user’s discretion for a given criteria. With this fine-grained level of tracing turned on, Edgar captures request and response payloads as well as headers for requests matching the user’s criteria. This adds clarity to exactly what data is being passed from service to service through a request’s path. While this level of granularity is unsustainable for all request traffic, it is a robust tool in targeted use cases, especially for errors that prove challenging to reproduce.

As you can imagine, this comes with very real storage costs. While the Edgar team has done its best to manage these costs effectively and to optimize our storage, the cost is not insignificant. One way to strengthen our return on investment is by being a key tool throughout the software development lifecycle. Edgar is a crucial tool for operating and maintaining a production service, where reducing the time to recovery has direct customer impact. Engineers also rely on our tool throughout development and testing, and they use the Edgar request page to communicate issues across teams.

By providing our tool to multiple sets of users, we are able to leverage our cost more efficiently. Edgar has become not just a tool for engineers, but rather a tool for anyone who needs to troubleshoot a service at Netflix. In Edgar’s early days, as we strove to build valuable abstractions on top of trace data, the Edgar team first targeted streaming video use cases. We built a curated experience for streaming video, grouping requests into playback sessions, marked by starting and stopping playback for a given asset. We found this experience was powerful for customer service operations as well as engineering teams. Our team listened to customer service operations to understand which common issues caused an undue amount of support pain so that we could summarize these issues in our UI. This empowers customer service operations, as well as engineers, to quickly understand member issues with minimal digging. By logically grouping traces and summarizing the behavior at a higher level, trace data becomes extremely useful in answering questions like why a member didn’t receive 4k video for a certain title or why a member couldn’t watch certain content.

An example error viewing a playback session in Edgar

Extending Edgar for Studio

As the studio side of Netflix grew, we realized that our movie and show production support would benefit from a similar aggregation of user activity. Our movie and show production support might need to answer why someone from the production crew can’t log in or access their materials for a particular project. As we worked to serve this new user group, we sought to understand what issues our production support needed to answer most frequently and then tied together various data sources to answer those questions in Edgar.

The Edgar team built out an experience to meet this need, building another abstraction with trace data; this time, the focus was on troubleshooting production-related use cases and applications, rather than a streaming video session. Edgar provides our production support with the ability to search for a given contractor, vendor, or member of production staff by their name or email. After finding the individual, Edgar reaches into numerous log stores for their user ID, and then pulls together their login history, role access change log, and recent traces emitted from production-related applications. Edgar scans through this data for errors and warnings and then presents those errors right at the front. Perhaps a vendor tried to log in with the wrong password too many times, or they were assigned an incorrect role on a production. In this new domain, Edgar is solving the same multi-dashboarded problem by tying together information and pointing its users to the next step of resolution.

An example error for a production-related user

What Edgar is and is not

Edgar’s goal is not to be the be-all, end-all of tools or to be the One Tool to Rule Them All. Rather, our goal is to act as a concierge of troubleshooting: Edgar should quickly be able to guide users to an understanding of an issue, as well as usher them to the next location, where they can remedy the problem. Let’s say a production vendor is unable to access materials for their production due to an incorrect role/permissions assignment, and this production vendor reaches out to support for assistance troubleshooting. When a support user searches for this vendor, Edgar should be able to indicate that this vendor recently had a role change and summarize what this role change is. Instead of being assigned to Dead To Me Season 2, they were assigned to Season 1! In this case, Edgar’s goal is to help a support user come to this conclusion and direct them quickly to the role management tool where this can be rectified, not to own the full circle of resolution.

Usage at Netflix

While Edgar was created around Netflix’s core streaming video use-case, it has since evolved to cover a wide array of applications. While Netflix streaming video is used by millions of members, some applications using Edgar may measure their volume in requests per minute, rather than requests per second, and may only have tens or hundreds of users rather than millions. While we started with a curated approach to solve a pain point for engineers and support working on streaming video, we found that this pain point is scale agnostic. Getting to the bottom of a problem is costly for all engineers, whether they are building a budget forecasting application used heavily by 30 people or an SVOD application used by millions.

Today, many applications and services at Netflix, covering a wide array of type and scale, publish trace data that is accessible in Edgar, and teams ranging from service owners to customer service operations rely on Edgar’s insights. From streaming to studio, Edgar leverages its wealth of knowledge to speed up troubleshooting across applications with the same fundamental approach of summarizing request tracing, logs, analysis, and metadata.

As you settle into your couch to watch a new episode of Unsolved Mysteries, you may still find yourself with more questions than answers. Why did the victim leave his house so abruptly? How did the suspect disappear into thin air? Hang on, how many people saw that UFO?? Unfortunately, Edgar can’t help you there (trust me, we’re disappointed too). But, if your relaxing evening is interrupted by a production outage, Edgar will be behind the scenes, helping Netflix engineers solve the mystery at hand.

Keeping services up and running allows Netflix to share stories with our members around the globe. Underneath every outage and failure, there is a story to tell, and powerful observability tooling is needed to tell it. If you are passionate about observability then come talk to us.


Edgar: Solving Mysteries Faster with Observability was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telltale: Netflix Application Monitoring Simplified

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba

By Andrei U., Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell

Our Telltale Vision

An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.

Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.

So we built Telltale.

The Telltale UI shows health over time and across multiple services.
The Telltale timeline.

Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.

Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically. Especially during an incident.

That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.

A call graph with four layers of services.
An application lives in an ecosystem

The Application Health Model

A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one; call graphs can be much deeper, with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application, as can an upstream or downstream deployment.

Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health.

Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.

Telltale takes all those factors into consideration when constructing its view of application health.
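As a toy illustration of weighting signals differently, here is a sketch that folds normalized signals into a single degradation score; the signal names, weights, and scale are invented for the example and are not Telltale’s model.

```java
import java.util.Map;

public class HealthScore {

    // Illustrative weights: error rate dominates, latency matters less.
    private static final Map<String, Double> WEIGHTS = Map.of(
        "errorRate", 0.5,
        "latency", 0.2,
        "upstreamDeployment", 0.15,
        "regionalTrafficShift", 0.15);

    // Each signal is normalized to [0, 1], where 1 means fully unhealthy;
    // the result is a weighted degradation score for the application.
    public static double degradation(Map<String, Double> normalizedSignals) {
        double score = 0.0;
        for (Map.Entry<String, Double> weight : WEIGHTS.entrySet()) {
            double signal = normalizedSignals.getOrDefault(weight.getKey(), 0.0);
            score += weight.getValue() * Math.min(1.0, Math.max(0.0, signal));
        }
        return score; // 0 = healthy, 1 = maximally degraded
    }
}
```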

The application health model is the heart of Telltale.

Intelligent Monitoring

Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.

We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.

No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms including statistical, rule based, and machine learning. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.

Intelligent Alerting

Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification, alert storms are a thing of the past.

An example of a Telltale alert notification in Slack.

When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.

Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.

But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.

We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert. Or provide a reason for why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.

An example of the details found in a Telltale notification in Slack: which metrics and which triggers fired.

Why Is My Service Unhealthy?

A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services helps Telltale to detect the possible causes of an application’s degraded health. Causes such as an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.

Incident Management

An incident summary gathers relevant information for the team.
An example of a Telltale incident summary.

When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.

The cluster view groups similar incidents so teams can understand the larger patterns.

Deployment Monitoring

Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using it for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have smaller blast radius and a shorter duration.

Continuous Improvement

Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data, and benefits to employing higher-resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment, which doesn’t scale well; we can definitely improve our self-service UI. And we know there are better heuristics to help pinpoint what’s affecting your service health.

Telltale is application monitoring simplified.

A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in realtime is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.


Telltale: Netflix Application Monitoring Simplified was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.