Tag Archives: observability

Intelligent, automatic restarts for unhealthy Kafka consumers

Post Syndicated from Chris Shepherd original https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/

At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.

We have learned a lot about keeping our applications that leverage Kafka healthy, so they can always be operational. Application health checks are notoriously hard to implement: what determines whether an application is healthy? How can we keep services operational at all times?

Health checks can be implemented in many ways. We’ll talk about an approach that allows us to considerably reduce incidents involving unhealthy applications while requiring less manual intervention.

Kafka at Cloudflare

Cloudflare is a big adopter of Kafka. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in this post.

Kafka is used to send and receive messages. Messages represent some kind of event like a credit card payment or details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro and so on.

Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an external system writes an event, it is appended to the end of that topic. These events are not deleted from the topic by default (although retention can be applied).

Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking one topic’s log file into many logs, each of which can be hosted on separate servers – enabling topics to scale.

Topics are managed by brokers–nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads and replicating partitions among themselves.

Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.

Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.

Each topic can be read by any number of different consumers, as long as they use a different id. Each consumer can replay the same messages as many times as they want.

When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst each consumer in the group. This division is determined by the consumer group leader. This leader will receive information about the other consumers in the group and will decide which consumers will receive messages from which partitions (partition strategy).

The offset of a consumer’s commit can demonstrate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.

A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are. Lag tracks the time elapsed between messages being written to and read from a topic. When a service is lagging behind, it means that messages are being consumed at a slower rate than they are being produced.

Due to Cloudflare’s scale, message rates typically end up being very large, and a lot of requests are time-sensitive, so monitoring lag is vital.

At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.

Health checks for Kubernetes apps

Kubernetes uses probes to understand if a service is healthy and is ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the service.

When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pod. In the case of Kafka applications this is not relevant, as they don’t run an HTTP server. For this reason, we’ll cover only liveness checks.

A classic Kafka liveness check done on a consumer checks the status of the connection with the broker. It’s often best practice to keep these checks simple and perform some basic operations – in this case, something like listing topics. If this check fails consistently for any reason, for instance because the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, thereby forcing a new connection. Simple Kafka liveness checks do a good job of understanding when the connection with the broker is unhealthy.
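To make this concrete, here is a minimal sketch of how such a liveness check might be declared in a Kubernetes Deployment, assuming the consumer’s container ships a small health check command; the binary path and timings below are illustrative, not Cloudflare’s actual configuration:

livenessProbe:
  exec:
    # Hypothetical helper that performs the basic broker check
    # (for example, listing topics) and exits non-zero on failure.
    command: ["/usr/local/bin/kafka-healthcheck"]
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10
  failureThreshold: 3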

Problems with Kafka health checks

Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!) and in many cases the replica count of our consuming service doesn’t necessarily match the number of partitions on the Kafka topic. This can mean that in a lot of scenarios this simple approach to health checking is not quite enough!

Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals when messages are being published to a topic. When such services are not committing offsets as expected, it means that the consumer is in a bad state, and it will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes, which forces a reconnection and rebalance.

When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.

When a rebalance happens, each consumer is notified to stop consuming. Some consumers might get their assigned partitions taken away and re-assigned to another consumer. We noticed that, within our library implementation, if the consumer doesn’t acknowledge this command it will wait indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.

Intelligent health checks

As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.

Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.

The PagerDuty approach

PagerDuty wrote an excellent blog on this topic which we used as inspiration when coming up with our approach.

Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.

We check that the consumer is moving forward by ensuring that the latest offset is changing (new messages are being received) and that the committed offset is changing as well (the new messages are being processed).

Therefore, the solution we came up with:

  • If we cannot read the current offset, fail liveness probe.
  • If we cannot read the committed offset, fail liveness probe.
  • If the committed offset == the current offset, pass liveness probe.
  • If the value for the committed offset has not changed since the last run of the health check, fail liveness probe.

To measure whether the committed offset is changing, we need to store the value from the previous run. We do this using an in-memory map where the partition number is the key. This means each instance of our service only has a view of the partitions it is currently consuming from, and will run the health check for each.
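To illustrate, here is a minimal sketch of that health check logic, assuming a small helper interface for reading offsets from the broker; the names are illustrative rather than our actual implementation:

// OffsetReader is a hypothetical abstraction over the Kafka client.
type OffsetReader interface {
  Latest(partition int32) (int64, error)    // newest offset in the partition
  Committed(partition int32) (int64, error) // last offset committed by our group
}

// healthy returns false if any partition fails the checks described above.
// lastCommitted holds the committed offsets recorded on the previous run.
func healthy(offsets OffsetReader, lastCommitted map[int32]int64) bool {
  for partition, previous := range lastCommitted {
     current, err := offsets.Latest(partition)
     if err != nil {
        return false // cannot read the current offset
     }
     committed, err := offsets.Committed(partition)
     if err != nil {
        return false // cannot read the committed offset
     }
     if committed == current {
        continue // caught up with the topic, nothing left to process
     }
     if committed == previous {
        return false // no forward progress since the last check
     }
     lastCommitted[partition] = committed // remember for the next run
  }
  return true
}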

Problems

When we first rolled out our smart health checks we started to notice cascading failures some time after release. After initial investigations we realised this was happening whenever a rebalance occurred. It would initially affect one replica, then quickly result in the others reporting as unhealthy.

What we observed came down to storing the previous value of the committed offset in memory: when a rebalance happens, the service may get re-assigned a different partition. When this happened, our service incorrectly assumed that the committed offset for that partition had not changed (as this specific replica was no longer updating the latest value), so it would start to report the service as unhealthy. The failing liveness probe would then cause it to restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.

Solution

To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.

This is handled by receiving the signal from the session context channel:

for {
  select {
  case message, ok := <-claim.Messages(): // <-- Message received
     if !ok {
        // Messages channel was closed, stop consuming this claim
        return nil
     }

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset

     // Handle message
     handleMessage(ctx, message)

     // Commit message offset
     session.MarkMessage(message, "")

  case <-session.Context().Done(): // <-- Rebalance happened

     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())

     // Stop consuming so the consumer group can rebalance
     return nil
  }
}
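For context, this loop lives inside Sarama’s ConsumeClaim callback. A rough sketch of how it might be wired into a consumer group, with illustrative names and error handling kept to a minimum (this is not Cloudflare’s actual implementation), could look like this:

import (
  "context"

  "github.com/Shopify/sarama"
)

type consumer struct {
  offsetMap map[int32]int64 // partition -> last received offset
}

func (c *consumer) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (c *consumer) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (c *consumer) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
  // The for/select loop shown above goes here, reading from
  // claim.Messages() and watching session.Context().Done() for rebalances.
  return nil
}

func run(ctx context.Context, brokers []string, group, topic string) error {
  cfg := sarama.NewConfig()
  cfg.Version = sarama.V2_1_0_0 // match your broker version

  client, err := sarama.NewConsumerGroup(brokers, group, cfg)
  if err != nil {
     return err
  }
  defer client.Close()

  handler := &consumer{offsetMap: make(map[int32]int64)}
  for ctx.Err() == nil {
     // Consume blocks until a rebalance or an error; calling it in a
     // loop re-joins the group after each rebalance.
     if err := client.Consume(ctx, []string{topic}, handler); err != nil {
        return err
     }
  }
  return ctx.Err()
}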

Verifying this solution was straightforward: we just needed to trigger a rebalance. To test that this worked in all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, then proceeded to scale up the number of replicas until it matched the partition count, and then scaled back down to a single replica. By doing this we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.

Takeaways

Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside of working hours) and a service which is self-healing.

However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this was to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.

Monitor your own network with free network flow analytics from Cloudflare

Post Syndicated from Chris Draper original https://blog.cloudflare.com/free-magic-network-monitoring/

As a network engineer or manager, answering questions about the traffic flowing across your infrastructure is a key part of your job. Cloudflare built Magic Network Monitoring (previously called Flow Based Monitoring) to give you better visibility into your network and to answer questions like, “What is my network’s peak traffic volume? What are the sources of that traffic? When does my network see that traffic?” Today, Cloudflare is excited to announce early access to a free version of Magic Network Monitoring that will be available to everyone. You can request early access by filling out this form.

Magic Network Monitoring now features a powerful analytics dashboard, self-serve configuration, and a step-by-step onboarding wizard. You’ll have access to a tool that helps you visualize your traffic and filter by packet characteristics including protocols, source IPs, destination IPs, ports, TCP flags, and router IP. Magic Network Monitoring also includes network traffic volume alerts for specific IP addresses or IP prefixes on your network.

Making Network Monitoring easy

Magic Network Monitoring allows customers to collect network analytics without installing a physical device like a network TAP (Test Access Point) or setting up overly complex remote monitoring systems. Our product works with any hardware that exports network flow data, and customers can quickly configure any router to send flow data to Cloudflare’s network. From there, our network flow analyzer will aggregate your traffic data and display it in Magic Network Monitoring analytics.

Analytics dashboard

In Magic Network Monitoring analytics, customers can take a deep dive into their network traffic data. You can filter traffic data by protocol, source IP, destination IP, TCP flags, and router IP. Customers can combine these filters together to answer questions like, “How much ICMP data was requested from my speed test server over the past 24 hours?” Visibility into traffic analytics is a key part of understanding your network’s operations and proactively improving your security. Let’s walk through some cases where Magic Network Monitoring analytics can answer your network visibility and security questions.

Create network volume alert thresholds per IP address or IP prefix

Magic Network Monitoring is incredibly flexible, and it can be customized to meet the needs of any network hobbyist or business. You can monitor your traffic volume trends over time via the analytics dashboard and build an understanding of your network’s traffic profile. After gathering historical network data, you can set custom volumetric threshold alerts for one IP prefix or a group of IP prefixes. As your network traffic changes over time, or your network expands, you can easily update your Magic Network Monitoring configuration to receive data from new routers or destinations within your network.

Monitoring a speed test server in a home lab

Let’s run through an example where you’re running a network home lab. You decide to use Magic Network Monitoring to track the volume of requests a speed test server you’re hosting receives and check for potential bad actors. Your goal is to identify when your speed test server experiences peak traffic, and the volume of that traffic. You set up Magic Network Monitoring and create a rule that analyzes all traffic destined for your speed test server’s IP address. After collecting data for seven days, the analytics dashboard shows that peak traffic occurs on weekdays in the morning, and that during this time, your traffic volume ranges from 450 – 550 Mbps.

As you’re checking over the analytics data, you also notice strange traffic spikes of 300 – 350 Mbps in the middle of the night that occur at the same time. As you investigate further, the analytics dashboard shows the source of this traffic spike is from the same IP prefix. You research some source IPs, and find they’re associated with malicious activity. As a result, you update your firewall to block traffic from this problematic source.

Identifying a network layer DDoS attack

Magic Network Monitoring can also be leveraged to identify a variety of L3, L4, and L7 DDoS attacks. Let’s run through an example of how ACME Corp, a small business using Magic Network Monitoring, can identify a Ping (ICMP) Flood attack on their network. Ping Flood attacks aim to overwhelm the targeted network’s ability to respond to a high number of requests or overload the network connection with bogus traffic.

At the start of a Ping Flood attack, your server’s traffic volume will begin to ramp up. Magic Network Monitoring will analyze traffic across your network, and send an email, webhook, or PagerDuty alert once an unusual volume of traffic is identified. Your network and security team can respond to the volumetric alert by checking the data in Magic Network Monitoring analytics and identifying the attack type. In this case, they’ll notice the following traffic characteristics:

  1. Network traffic volume above your historical traffic averages
  2. An unusually large amount of ICMP traffic
  3. ICMP traffic coming from a specific set of source IPs

Now, your network security team has confirmed the traffic is malicious by identifying the attack type, and can begin taking steps to mitigate the attack.

Magic Network Monitoring and Magic Transit

If your business is impacted by DDoS attacks, Magic Network Monitoring will identify attacks, and Magic Transit can be used to mitigate those DDoS attacks. Magic Transit protects customers’ entire network from DDoS attacks by placing our network in front of theirs. You can use Magic Transit Always On to reduce latency and mitigate attacks all the time, or Magic Transit On Demand to protect your network during active attacks. With Magic Transit, you get DDoS protection, traffic acceleration, and other network functions delivered as a service from every Cloudflare data center. Magic Transit works by allowing Cloudflare to advertise customers’ IP prefixes to the Internet with BGP to route the customer’s traffic through our network for DDoS protection. If you’re interested in protecting your network with Magic Transit, you can visit the Magic Transit product page and request a demo today.

Sign up for early access and what’s next

The free version of Magic Network Monitoring (MNM) will be released in the next few weeks. You can request early access by filling out this form.

This is just the beginning for Magic Network Monitoring. In the future, you can look forward to features like advanced DDoS attack identification, network incident history and trends, and volumetric alert threshold recommendations.

Monitoring our monitoring: how we validate our Prometheus alert rules

Post Syndicated from Lukasz Mierzwa original https://blog.cloudflare.com/monitoring-our-monitoring/

Background

We use Prometheus as our core monitoring system. We’ve been heavy Prometheus users since 2017 when we migrated off our previous monitoring system which used a customized Nagios setup. Despite growing our infrastructure a lot, adding tons of new products and learning some hard lessons about operating Prometheus at scale, our original architecture of Prometheus (see Monitoring Cloudflare’s Planet-Scale Edge Network with Prometheus for an in depth walk through) remains virtually unchanged, proving that Prometheus is a solid foundation for building observability into your services.

One of the key responsibilities of Prometheus is to alert us when something goes wrong and in this blog post we’ll talk about how we make those alerts more reliable – and we’ll introduce an open source tool we’ve developed to help us with that, and share how you can use it too. If you’re not familiar with Prometheus you might want to start by watching this video to better understand the topic we’ll be covering here.

Prometheus works by collecting metrics from our services and storing those metrics inside its database, called TSDB. We can then query these metrics using Prometheus’s query language, PromQL, either with ad-hoc queries (for example to power Grafana dashboards) or via alerting or recording rules. A rule is basically a query that Prometheus will run for us in a loop, and when that query returns any results it will either be recorded as new metrics (with recording rules) or trigger alerts (with alerting rules).

Prometheus alerts

Since we’re talking about improving our alerting we’ll be focusing on alerting rules.

To create alerts we first need to have some metrics collected. For the purposes of this blog post let’s assume we’re working with http_requests_total metric, which is used on the examples page. Here are some examples of how our metrics will look:

http_requests_total{job="myserver", handler="/", method="get", status="200"}
http_requests_total{job="myserver", handler="/", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts", method="get", status="200"}
http_requests_total{job="myserver", handler="/posts", method="get", status="500"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="201"}
http_requests_total{job="myserver", handler="/posts/new", method="post", status="401"}

Let’s say we want to alert if our HTTP server is returning errors to customers.

Since all we need to do is check our metric that tracks how many responses with HTTP status code 500 there were, a simple alerting rule could look like this:

- alert: Serving HTTP 500 errors
  expr: http_requests_total{status="500"} > 0

This will alert us if we have any 500 errors served to our customers. Prometheus will run our query looking for a time series named http_requests_total that also has a status label with value “500”. Then it will filter all those matched time series and only return ones with value greater than zero.

If our alert rule returns any results, an alert will be triggered – one for each returned result.

If our rule doesn’t return anything, meaning there are no matched time series, then the alert will not trigger.

The whole flow from metric to alert is pretty simple here as we can see on the diagram below.

[Diagram: the flow from a metric collected by Prometheus to an alert triggered by an alerting rule]

If we want to provide more information in the alert we can do so by setting additional labels and annotations, but the alert and expr fields are all we need to get a working rule.

But the problem with the above rule is that our alert starts when we have our first error, and then it will never go away.

After all, our http_requests_total is a counter, so it gets incremented every time there’s a new request, which means that it will keep growing as we receive more requests. What this means for us is that our alert is really telling us “was there ever a 500 error?” and even if we fix the problem causing 500 errors we’ll keep getting this alert.

A better alert would be one that tells us if we’re serving errors right now.

For that we can use the rate() function to calculate the per second rate of errors.

Our modified alert would be:

- alert: Serving HTTP 500 errors
  expr: rate(http_requests_total{status="500"}[2m]) > 0

The query above will calculate the rate of 500 errors in the last two minutes. If we start responding with errors to customers our alert will fire, but once errors stop so will this alert.

This is great because if the underlying issue is resolved the alert will resolve too.

We can improve our alert further by, for example, alerting on the percentage of errors, rather than absolute numbers, or even calculate error budget, but let’s stop here for now.
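For reference, a minimal sketch of what a percentage-based version could look like (later in the post we split this into recording rules):

- alert: Serving HTTP 500 errors
  expr: sum(rate(http_requests_total{status="500"}[2m])) / sum(rate(http_requests_total[2m])) > 0.01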

It’s all very simple, so what do we mean when we talk about improving the reliability of alerting? What could go wrong here?

What could go wrong?

We can craft a valid YAML file with a rule definition that has a perfectly valid query that will simply not work how we expect it to. When it comes to alerting rules, this might mean that the alert we rely upon to tell us when something is not working correctly will fail to alert us when it should. To better understand why that might happen let’s first explain how querying works in Prometheus.

Prometheus querying basics

There are two basic types of queries we can run against Prometheus. The first one is an instant query. It allows us to ask Prometheus for a point in time value of some time series. If we write our query as http_requests_total we’ll get all time series named http_requests_total along with the most recent value for each of them. We can further customize the query and filter results by adding label matchers, like http_requests_total{status=”500”}.

Let’s say we have two instances of our server, green and red, each of which is scraped (Prometheus collects metrics from it) every minute, independently of the other.

This is what happens when we issue an instant query:

[Diagram: an instant query returning the most recent value for each of the green and red server time series]

There’s obviously more to it as we can use functions and build complex queries that utilize multiple metrics in one expression. But for the purposes of this blog post we’ll stop here.

The important thing to know about instant queries is that they return the most recent value of a matched time series, and they will look back for up to five minutes (by default) into the past to find it. If the last value is older than five minutes then it’s considered stale and Prometheus won’t return it anymore.

The second type of query is a range query – it works similarly to instant queries, the difference is that instead of returning us the most recent value it gives us a list of values from the selected time range. That time range is always relative so instead of providing two timestamps we provide a range, like “20 minutes”. When we ask for a range query with a 20 minutes range it will return us all values collected for matching time series from 20 minutes ago until now.

An important distinction between those two types of queries is that range queries don’t have the same “look back for up to five minutes” behavior as instant queries. If Prometheus cannot find any values collected in the provided time range then it doesn’t return anything.

If we modify our example to request a [3m] range query, we should expect Prometheus to return three data points for each time series:

[Diagram: a 3m range query returning three data points for each time series]

When queries don’t return anything

Knowing a bit more about how queries work in Prometheus we can go back to our alerting rules and spot a potential problem: queries that don’t return anything.

If our query doesn’t match any time series, or if they’re considered stale, then Prometheus will return an empty result. This might be because we’ve made a typo in the metric name or label filter, the metric we ask for is no longer being exported, or it was never there in the first place, or we’ve added some condition that wasn’t satisfied, like requiring the value to be non-zero in our http_requests_total{status="500"} > 0 example.

Prometheus will not return any error in any of the scenarios above because none of them are really problems; it’s just how querying works. If you ask for something that doesn’t match your query then you get empty results. This means that there’s no distinction between “all systems are operational” and “you’ve made a typo in your query”. So if you’re not receiving any alerts from your service, it’s either a sign that everything is working fine or that you’ve made a typo and have no working monitoring at all – and it’s up to you to verify which one it is.

For example, we could be trying to query for http_requests_totals instead of http_requests_total (an extra “s” at the end) and although our query will look fine it won’t ever produce any alert.

Range queries can add another twist – they’re mostly used in Prometheus functions like rate(), which we used in our example. This function will only work correctly if it receives a range query expression that returns at least two data points for each time series – after all, it’s impossible to calculate a rate from a single number.

Since the number of data points depends on the time range we passed to the range query, which we then pass to our rate() function, if we provide a time range that only contains a single value then rate won’t be able to calculate anything and once again we’ll return empty results.

The number of values collected in a given time range depends on the interval at which Prometheus collects all metrics, so to use rate() correctly you need to know how your Prometheus server is configured. You can read more about this here and here if you want to better understand how rate() works in Prometheus.

For example if we collect our metrics every one minute then a range query http_requests_total[1m] will be able to find only one data point. Here’s a reminder of how this looks:

[Diagram: a 1m range query finding only a single data point per time series]

Since, as we mentioned before, we can only calculate rate() if we have at least two data points, calling rate(http_requests_total[1m]) will never return anything and so our alerts will never work.

There are more potential problems we can run into when writing Prometheus queries, for example any operations between two metrics will only work if both have the same set of labels; you can read about this here. But for now we’ll stop, as listing all the gotchas could take a while. The point to remember is simple: if your alerting query doesn’t return anything then it might be that everything is ok and there’s no need to alert, but it might also be that you’ve mistyped your metric name, your label filter cannot match anything, your metric disappeared from Prometheus, you are using too small a time range for your range queries, and so on.

Renaming metrics can be dangerous

We’ve been running Prometheus for a few years now and during that time we’ve grown our collection of alerting rules a lot. Plus we keep adding new products or modifying existing ones, which often includes adding and removing metrics, or modifying existing metrics, which may include renaming them or changing what labels are present on these metrics.

A lot of metrics come from metrics exporters maintained by the Prometheus community, like node_exporter, which we use to gather some operating system metrics from all of our servers. Those exporters also undergo changes which might mean that some metrics are deprecated and removed, or simply renamed.

A problem we’ve run into a few times is that sometimes our alerting rules wouldn’t be updated after such a change, for example when we upgraded node_exporter across our fleet. Or the addition of a new label on some metrics would suddenly cause Prometheus to no longer return anything for some of the alerting queries we have, making such an alerting rule no longer useful.

It’s worth noting that Prometheus does have a way of unit testing rules, but since it works on mocked data it’s mostly useful to validate the logic of a query. Unit testing won’t tell us if, for example, a metric we rely on suddenly disappeared from Prometheus.

Chaining rules

When writing alerting rules we try to limit alert fatigue by ensuring that, among many things, alerts are only generated when there’s an action needed, they clearly describe the problem that needs addressing, they have a link to a runbook and a dashboard, and finally that we aggregate them as much as possible. This means that a lot of the alerts we have won’t trigger for each individual instance of a service that’s affected, but rather once per data center or even globally.

For example, we might alert if the rate of HTTP errors in a datacenter is above 1% of all requests. To do that we first need to calculate the overall rate of errors across all instances of our server. For that we would use a recording rule:

- record: job:http_requests_total:rate2m
  expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

- record: job:http_requests_status500:rate2m
  expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

The first rule tells Prometheus to calculate the per-second rate of all requests and sum it across all instances of our server. The second rule does the same but only sums time series with a status label equal to "500". Both rules produce new metrics named after the value of the record field.

Now we can modify our alert rule to use those new metrics we’re generating with our recording rules:

- alert: Serving HTTP 500 errors
  expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

If we have a data center wide problem then we will raise just one alert, rather than one per instance of our server, which can be a great quality of life improvement for our on-call engineers.

But at the same time we’ve added two new rules that we need to maintain and ensure they produce results. To make things more complicated we could have recording rules producing metrics based on other recording rules, and then we have even more rules that we need to ensure are working correctly.

What if all those rules in our chain are maintained by different teams? What if the rule in the middle of the chain suddenly gets renamed because that’s needed by one of the teams? Problems like that can easily crop up now and then if your environment is sufficiently complex, and when they do, they’re not always obvious – after all, the only sign that something stopped working is, well, silence: your alerts no longer trigger. If you’re lucky you’re plotting your metrics on a dashboard somewhere and hopefully someone will notice if they become empty, but it’s risky to rely on this.

We definitely felt that we needed something better than hope.

Introducing pint: a Prometheus rule linter

To avoid running into such problems in the future we’ve decided to write a tool that would help us do a better job of testing our alerting rules against live Prometheus servers, so we can spot missing metrics or typos easier. We also wanted to allow new engineers, who might not necessarily have all the in-depth knowledge of how Prometheus works, to be able to write rules with confidence without having to get feedback from more experienced team members.

Since we believe that such a tool will have value for the entire Prometheus community we’ve open-sourced it, and it’s available for anyone to use – say hello to pint!

You can find the source code on GitHub, and there’s also online documentation that should help you get started.

Pint works in 3 different ways:

  • You can run it against one or more files with Prometheus rules
  • It can run as a part of your CI pipeline
  • Or you can deploy it as a side-car to all your Prometheus servers

It doesn’t require any configuration to run, but in most cases it will provide the most value if you create a configuration file for it and define some Prometheus servers it should use to validate all rules against. Running without any configured Prometheus servers will limit it to static analysis of all the rules, which can identify a range of problems, but won’t tell you if your rules are trying to query non-existent metrics.

The first mode is where pint reads a file (or a directory containing multiple files), parses it, does all the basic syntax checks and then runs a series of checks for all Prometheus rules in those files.

The second mode is optimized for validating git based pull requests. Instead of testing all rules from all files, pint will only test rules that were modified and report only problems affecting modified lines.

The third mode is where pint runs as a daemon and tests all rules on a regular basis. If it detects any problems it will expose them as metrics. You can then collect those metrics using Prometheus and alert on them as you would for any other problem. This way you can basically use Prometheus to monitor itself.
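For example, assuming the daemon exposes a gauge such as pint_problem for each detected issue (check the pint documentation for the exact metric names and labels your version exports), a sketch of such an alert could look like this:

- alert: Pint detected a problem with our Prometheus rules
  expr: pint_problem > 0
  for: 1h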

What kind of checks can it run for us and what kind of problems can it detect?

All the checks are documented here, along with some tips on how to deal with any detected problems. Let’s cover the most important ones briefly.

As mentioned above, the main motivation was to catch rules that try to query metrics that are missing or when the query was simply mistyped. To do that, pint will run each query from every alerting and recording rule to see if it returns any result; if it doesn’t, it will break down the query to identify all individual metrics and check for the existence of each of them. If any of them is missing, or if the query tries to filter using labels that aren’t present on any time series for a given metric, then it will report that back to us.

So if someone tries to add a new alerting rule with http_requests_totals typo in it, pint will detect that when running CI checks on the pull request and stop it from being merged. Which takes care of validating rules as they are being added to our configuration management system.

Another useful check will try to estimate the number of times a given alerting rule would trigger an alert. Which is useful when raising a pull request that’s adding new alerting rules – nobody wants to be flooded with alerts from a rule that’s too sensitive so having this information on a pull request allows us to spot rules that could lead to alert fatigue.

Similarly, another check will provide information on how many new time series a recording rule adds to Prometheus. In our setup a single unique time series uses, on average, 4KiB of memory. So if a recording rule generates 10 thousand new time series it will increase Prometheus server memory usage by 10000*4KiB=40MiB. 40 megabytes might not sound like much, but our peak time series usage in the last year was around 30 million time series in a single Prometheus server, so we pay attention to anything that might add a substantial amount of new time series, which pint helps us notice before such a rule gets added to Prometheus.

On top of all the Prometheus query checks, pint allows us also to ensure that all the alerting rules comply with some policies we’ve set for ourselves. For example, we require everyone to write a runbook for their alerts and link to it in the alerting rule using annotations.

We also require all alerts to have priority labels, so that high priority alerts generate pages for responsible teams, while low priority ones are only routed to our karma dashboard or create tickets using jiralert. It’s easy to forget about one of these required fields, and that’s not something which can be enforced using unit testing, but pint allows us to do that with a few configuration lines.
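As a sketch of what such a policy could look like in pint’s configuration file (the exact options are described in pint’s documentation; the label and annotation names below are just examples):

rule {
  match {
    kind = "alerting"
  }
  # Require a link to a runbook on every alerting rule.
  annotation "runbook_url" {
    required = true
    severity = "bug"
  }
  # Require a priority label on every alerting rule.
  label "priority" {
    required = true
    severity = "bug"
  }
}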

With pint running on all stages of our Prometheus rule life cycle, from initial pull request to monitoring rules deployed in our many data centers, we can rely on our Prometheus alerting rules to always work and notify us of any incident, large or small.

GitHub: https://github.com/cloudflare/pint

Putting it all together

Let’s see how we can use pint to validate our rules as we work on them.

We can begin by creating a file called “rules.yml” and adding both recording rules there.

The goal is to write new rules that we want to add to Prometheus, but before we actually add those, we want pint to validate it all for us.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

Next we’ll download the latest version of pint from GitHub and run it to check our rules.

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:8: syntax error: unclosed left parenthesis (promql/syntax)
    expr: sum(rate(http_requests_total{status="500"}[2m]) without(method, status, instance)

level=info msg="Problems found" Fatal=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Whoops, we have “sum(rate(…)” and so we’re missing one of the closing brackets. Let’s fix that and try again.

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

$ pint lint rules.yml 
level=info msg="File parsed" path=rules.yml rules=2

Our rule now passes the most basic checks, so we know it’s valid. But to know if it works with a real Prometheus server we need to tell pint how to talk to Prometheus. For that we’ll need a config file that defines a Prometheus server to test our rules against; it should be the same server we’re planning to deploy our rules to. Here we’ll be using a test instance running on localhost. Let’s create a “pint.hcl” file and define our Prometheus server there:

prometheus "prom1" {
  uri     = "http://localhost:9090"
  timeout = "1m"
}

Now we can re-run our check using this configuration file:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2
rules.yml:5: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

rules.yml:8: prometheus "prom1" at http://localhost:9090 didn't have any series for "http_requests_total" metric in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

level=info msg="Problems found" Bug=2
level=fatal msg="Execution completed with error(s)" error="problems found"

Yikes! It’s a test Prometheus instance, and we forgot to collect any metrics from it.

Let’s fix that by starting our server locally on port 8080 and configuring Prometheus to collect metrics from it:

scrape_configs:
  - job_name: webserver
    static_configs:
      - targets: ['localhost:8080']

Let’s re-run our checks once more:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=2

This time everything works!

Now let’s add our alerting rule to our file, so it now looks like this:

groups:
- name: Demo recording rules
  rules:
  - record: job:http_requests_total:rate2m
    expr: sum(rate(http_requests_total[2m])) without(method, status, instance)

  - record: job:http_requests_status500:rate2m
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

- name: Demo alerting rules
  rules:
  - alert: Serving HTTP 500 errors
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

And let’s re-run pint once again:

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_total:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Information=2

It all works according to pint, and so we now can safely deploy our new rules file to Prometheus.

Notice that pint recognised that both metrics used in our alert come from recording rules, which aren’t yet added to Prometheus, so there’s no point querying Prometheus to verify if they exist there.

Now what happens if we deploy a new version of our server that renames the “status” label to something else, like “code”?

$ pint -c pint.hcl lint rules.yml 
level=info msg="Loading configuration file" path=pint.hcl
level=info msg="File parsed" path=rules.yml rules=3
rules.yml:8: prometheus "prom1" at http://localhost:9090 has "http_requests_total" metric but there are no series with "status" label in the last 1w (promql/series)
    expr: sum(rate(http_requests_total{status="500"}[2m])) without(method, status, instance)

rules.yml:13: prometheus "prom1" at http://localhost:9090 didn't have any series for "job:http_requests_status500:rate2m" metric in the last 1w but found recording rule that generates it, skipping further checks (promql/series)
    expr: job:http_requests_status500:rate2m / job:http_requests_total:rate2m > 0.01

level=info msg="Problems found" Bug=1 Information=1
level=fatal msg="Execution completed with error(s)" error="problems found"

Luckily pint will notice this and report it, so we can adapt our rule to match the new name.

But what if that happens after we deploy our rule? For that we can use the “pint watch” command that runs pint as a daemon periodically checking all rules.

Please note that validating all metrics used in a query will eventually produce some false positives. In our example, metrics with the status="500" label might not be exported by our server until there’s at least one request ending in an HTTP 500 error.

The promql/series check responsible for validating the presence of all metrics has some documentation on how to deal with this problem. In most cases you’ll want to add a comment that instructs pint to ignore some missing metrics entirely or stop checking label values (only check if there’s a "status" label present, without checking if there are time series with status="500").
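For example, a control comment placed inside the rule can disable a specific check for that rule; assuming the promql/series check name shown in the output above, a sketch could look like this (see the pint documentation for the full comment syntax and more fine-grained options):

- alert: Serving HTTP 500 errors
  # pint disable promql/series
  expr: rate(http_requests_total{status="500"}[2m]) > 0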

Summary

Prometheus metrics don’t follow any strict schema – whatever services expose will be collected. At the same time a lot of problems with queries hide behind empty results, which makes noticing these problems non-trivial.

We use pint to find such problems and report them to engineers, so that our global network is always monitored correctly, and we have confidence that a lack of alerts means our infrastructure really is reliable.

How US federal agencies can use AWS to improve logging and log retention

Post Syndicated from Derek Doerr original https://aws.amazon.com/blogs/security/how-us-federal-agencies-can-use-aws-to-improve-logging-and-log-retention/

This post is part of a series about how Amazon Web Services (AWS) can help your US federal agency meet the requirements of the President’s Executive Order on Improving the Nation’s Cybersecurity. You will learn how you can use AWS information security practices to help meet the requirement to improve logging and log retention practices in your AWS environment.

Improving the security and operational readiness of applications relies on improving the observability of the applications and the infrastructure on which they operate. For our customers, this translates to questions of how to gather the right telemetry data, how to securely store it over its lifecycle, and how to analyze the data in order to make it actionable. These questions take on more importance as our federal customers seek to improve their collection and management of log data in all their IT environments, including their AWS environments, as mandated by the executive order.

Given the interest in the technologies used to support logging and log retention, we’d like to share our perspective. This starts with an overview of logging concepts in AWS, including log storage and management, and then proceeds to how to gain actionable insights from that logging data. This post will address how to improve logging and log retention practices consistent with the Security and Operational Excellence pillars of the AWS Well-Architected Framework.

Log actions and activity within your AWS account

AWS provides you with extensive logging capabilities to provide visibility into actions and activity within your AWS account. A security best practice is to establish a wide range of detection mechanisms across all of your AWS accounts. Starting with services such as AWS CloudTrail, AWS Config, Amazon CloudWatch, Amazon GuardDuty, and AWS Security Hub provides a foundation upon which you can base detective controls, remediation actions, and forensics data to support incident response. Here is more detail on how these services can help you gain more security insights into your AWS workloads:

  • AWS CloudTrail provides event history for all of your AWS account activity, including API-level actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. You can use CloudTrail to identify who or what took which action, what resources were acted upon, when the event occurred, and other details. If your agency uses AWS Organizations, you can automate this process for all of the accounts in the organization.
  • CloudTrail logs can be delivered from all of your accounts into a centralized account. This places all logs in a tightly controlled, central location, making it easier to protect them as well as to store and analyze them. As with AWS CloudTrail, you can automate this process for all of the accounts in the organization using AWS Organizations. CloudTrail can also be configured to emit metric data into the CloudWatch monitoring service, giving near real-time insights into the usage of various services.
  • CloudTrail log file integrity validation produces and cryptographically signs a digest file that contains references and hashes for every CloudTrail file that was delivered in that hour. This makes it computationally infeasible to modify, delete, or forge CloudTrail log files without detection. Validated log files are invaluable in security and forensic investigations. For example, a validated log file enables you to assert positively that the log file itself has not changed, or that particular user credentials performed specific API activity.
  • AWS Config monitors and records your AWS resource configurations and allows you to automate the evaluation of recorded configurations against desired configurations. For example, you can use AWS Config to verify that resources are encrypted, multi-factor authentication (MFA) is enabled, and logging is turned on, and you can use AWS Config rules to identify noncompliant resources. Additionally, you can review changes in configurations and relationships between AWS resources and dive into detailed resource configuration histories, helping you to determine when compliance status changed and the reason for the change.
  • Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads. Amazon GuardDuty analyzes and processes the following data sources: VPC Flow Logs, AWS CloudTrail management event logs, CloudTrail Amazon Simple Storage Service (Amazon S3) data event logs, and DNS logs. It uses threat intelligence feeds, such as lists of malicious IP addresses and domains, and machine learning to identify potential threats within your AWS environment.
  • AWS Security Hub provides a single place that aggregates, organizes, and prioritizes your security alerts, or findings, from multiple AWS services and optional third-party products to give you a comprehensive view of security alerts and compliance status.

You should be aware that most AWS services do not charge you for enabling logging (for example, AWS WAF) but the storage of logs will incur ongoing costs. Always consult the AWS service’s pricing page to understand cost impacts. Related services such as Amazon Kinesis Data Firehose (used to stream data to storage services), and Amazon Simple Storage Service (Amazon S3), used to store log data, will incur charges.

Turn on service-specific logging as desired

After you have the foundational logging services enabled and configured, next turn your attention to service-specific logging. Many AWS services produce service-specific logs that include additional information. These services can be configured to record and send out information that is necessary to understand their internal state, including application, workload, user activity, dependency, and transaction telemetry. Here’s a sampling of key services with service-specific logging features:

  • Amazon CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. You can gain additional operational insights from your AWS compute instances (Amazon Elastic Compute Cloud, or EC2) as well as on-premises servers using the CloudWatch agent. Additionally, you can use CloudWatch to detect anomalous behavior in your environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
  • Amazon CloudWatch Logs is a component of Amazon CloudWatch which you can use to monitor, store, and access your log files from Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS CloudTrail, Route 53, and other sources. CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can then easily view them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. CloudWatch Logs enables you to see all of your logs, regardless of their source, as a single and consistent flow of events ordered by time, and you can query them and sort them based on other dimensions, group them by specific fields, create custom computations with a powerful query language, and visualize log data in dashboards.
  • Traffic Mirroring allows you to achieve full packet capture (as well as filtered subsets) of network traffic from an elastic network interface of EC2 instances inside your VPC. You can then send the captured traffic to out-of-band security and monitoring appliances for content inspection, threat monitoring, and troubleshooting.
  • The Elastic Load Balancing service provides access logs that capture detailed information about requests that are sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. The specific information logged varies by load balancer type:
  • Amazon S3 access logs record the S3 bucket and account that are being accessed, the API action, and requester information.
  • AWS Web Application Firewall (WAF) logs record web requests that are processed by AWS WAF, and indicate whether the requests matched AWS WAF rules and what actions, if any, were taken. These logs are delivered to Amazon S3 by using Amazon Kinesis Data Firehose.
  • Amazon Relational Database Service (Amazon RDS) log files can be downloaded or published to Amazon CloudWatch Logs. Log settings are specific to each database engine. Agencies use these settings to apply their desired logging configurations and choose which events are logged. Amazon Aurora and Amazon RDS for Oracle also support a real-time logging feature called “database activity streams” which provides even more detail, and cannot be accessed or modified by database administrators.
  • Amazon Route 53 provides options for logging for both public DNS query requests as well as Route53 Resolver DNS queries:
    • Route 53 Resolver DNS query logs record DNS queries and responses that originate from your VPC, that use an inbound Resolver endpoint, that use an outbound Resolver endpoint, or that use a Route 53 Resolver DNS Firewall.
    • Route 53 DNS public query logs record queries to Route 53 for domains you have hosted with AWS, including the domain or subdomain that was requested; the date and time of the request; the DNS record type; the Route 53 edge location that responded to the DNS query; and the DNS response code.
  • Amazon Elastic Compute Cloud (Amazon EC2) instances can use the unified CloudWatch agent to collect logs and metrics from Linux, macOS, and Windows EC2 instances and publish them to the Amazon CloudWatch service.
  • Elastic Beanstalk logs can be streamed to CloudWatch Logs. You can also use the AWS Management Console to request the last 100 log entries from the web and application servers, or request a bundle of all log files that is uploaded to Amazon S3 as a ZIP file.
  • Amazon CloudFront logs record user requests for content that is cached by CloudFront.
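
As an illustration of enabling one of the log sources above from the command line, S3 server access logging can be turned on with a single CLI call. This is only a sketch: the bucket names are placeholders, and the target bucket must already allow the S3 log delivery service to write to it.

aws s3api put-bucket-logging \
  --bucket my-data-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-central-log-bucket",
      "TargetPrefix": "s3-access-logs/my-data-bucket/"
    }
  }'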

Store and analyze log data

Now that you’ve enabled foundational and service-specific logging in your AWS accounts, that data needs to be persisted and managed throughout its lifecycle. AWS offers a variety of solutions and services to consolidate your log data and store it, secure access to it, and perform analytics.

Store log data

The primary service for storing all of this logging data is Amazon S3. Amazon S3 is ideal for this role, because it’s a highly scalable, highly resilient object storage service. AWS provides a rich set of multi-layered capabilities to secure log data that is stored in Amazon S3, including encrypting objects (log records), preventing deletion (the S3 Object Lock feature), and using lifecycle policies to transition data to lower-cost storage over time (for example, to S3 Glacier). Access to data in Amazon S3 can also be restricted through AWS Identity and Access Management (IAM) policies, AWS Organizations service control policies (SCPs), S3 bucket policies, Amazon S3 Access Points, and AWS PrivateLink interfaces. While S3 is particularly easy to use with other AWS services given its various integrations, many customers also centralize their storage and analysis of their on-premises log data, or log data from other cloud environments, on AWS using S3 and the analytic features described below.
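
As a brief sketch of the lifecycle capability mentioned above, the following call transitions log objects to S3 Glacier after 90 days and expires them after 400 days. The bucket name, prefix, and retention periods are placeholders; adjust them to your own requirements.

aws s3api put-bucket-lifecycle-configuration \
  --bucket my-central-log-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-then-expire-logs",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 400}
    }]
  }'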

If your AWS accounts are organized in a multi-account architecture, you can make use of the AWS Centralized Logging solution. This solution enables organizations to collect, analyze, and display CloudWatch Logs data in a single dashboard. AWS services generate log data, such as audit logs for access, configuration changes, and billing events. In addition, web servers, applications, and operating systems all generate log files in various formats. This solution uses the Amazon Elasticsearch Service (Amazon ES) and Kibana to deploy a centralized logging solution that provides a unified view of all the log events. In combination with other AWS-managed services, this solution provides you with a turnkey environment to begin logging and analyzing your AWS environment and applications.

You can also make use of services such as Amazon Kinesis Data Firehose, which you can use to transport log information to S3, Amazon ES, or any third-party service that is provided with an HTTP endpoint, such as Datadog, New Relic, or Splunk.

Finally, you can use Amazon EventBridge to route and integrate event data between AWS services and to third-party solutions such as software as a service (SaaS) providers or help desk ticketing systems. EventBridge is a serverless event bus service that allows you to connect your applications with data from a variety of sources. EventBridge delivers a stream of real-time data from your own applications, SaaS applications, and AWS services, and then routes that data to targets such as AWS Lambda.
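
For example, a hypothetical EventBridge rule that forwards Amazon GuardDuty findings to a Lambda function for triage might look like the following. The rule name and function ARN are placeholders, and the function must separately grant events.amazonaws.com permission to invoke it.

# Match GuardDuty findings on the default event bus.
aws events put-rule \
  --name guardduty-findings-to-lambda \
  --event-pattern '{"source":["aws.guardduty"],"detail-type":["GuardDuty Finding"]}'

# Send matching events to a Lambda function for triage.
aws events put-targets \
  --rule guardduty-findings-to-lambda \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:111122223333:function:triage-findings"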

Analyze log data and respond to incidents

As the final step in managing your log data, you can use AWS services such as Amazon Detective, Amazon ES, CloudWatch Logs Insights, and Amazon Athena to analyze your log data and gain operational insights.

  • Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of security findings or suspicious activities. Detective automatically collects log data from your AWS resources. It then uses machine learning, statistical analysis, and graph theory to help you visualize and conduct faster and more efficient security investigations.
  • Incident Manager is a component of AWS Systems Manager which enables you to automatically take action when a critical issue is detected by an Amazon CloudWatch alarm or Amazon EventBridge event. Incident Manager executes pre-configured response plans to engage responders via SMS and phone calls, enable chat commands and notifications using AWS Chatbot, and execute AWS Systems Manager Automation runbooks. The Incident Manager console integrates with AWS Systems Manager OpsCenter to help you track incidents and post-incident action items from a central place that also synchronizes with popular third-party incident management tools such as Jira Service Desk and ServiceNow.
  • Amazon Elasticsearch Service (Amazon ES) is a fully managed service that collects, indexes, and unifies logs and metrics across your environment to give you unprecedented visibility into your applications and infrastructure. With Amazon ES, you get the scalability, flexibility, and security you need for the most demanding log analytics workloads. You can configure a CloudWatch Logs log group to stream data it receives to your Amazon ES cluster in near real time through a CloudWatch Logs subscription.
  • CloudWatch Logs Insights enables you to interactively search and analyze your log data in CloudWatch Logs.
  • Amazon Athena is an interactive query service that you can use to analyze data in Amazon S3 by using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
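
As a small example of the interactive analysis described above, a CloudWatch Logs Insights query can also be run from the CLI. This is a sketch: the log group name is a placeholder, and the query simply looks for recent console sign-in events in a CloudTrail log group.

query_id=$(aws logs start-query \
  --log-group-name "CloudTrail/logs" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter eventName = "ConsoleLogin" | sort @timestamp desc | limit 20' \
  --query 'queryId' --output text)

# Insights queries run asynchronously; fetch the results once the query completes.
aws logs get-query-results --query-id "$query_id"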

Conclusion

As called out in the executive order, information from network and systems logs is invaluable for both investigation and remediation services. AWS provides a broad set of services to collect an unprecedented amount of data at very low cost, optionally store it for long periods of time in tiered storage, and analyze that telemetry information from your cloud-based workloads. These insights will help you improve your organization’s security posture and operational readiness and, as a result, improve your organization’s ability to deliver on its mission.

Next steps

To learn more about how AWS can help you meet the requirements of the executive order, see the other posts in this series.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Derek Doerr

Derek is a Senior Solutions Architect with the Public Sector team at AWS. He has been working with AWS technology for over four years. Specializing in enterprise management and governance, he is passionate about helping AWS customers navigate their journeys to the cloud. In his free time, he enjoys time with family and friends, as well as scuba diving.

Gaining operational insights with AIOps using Amazon DevOps Guru

Post Syndicated from Nikunj Vaidya original https://aws.amazon.com/blogs/devops/gaining-operational-insights-with-aiops-using-amazon-devops-guru/

Amazon DevOps Guru offers a fully managed AIOps platform service that enables developers and operators to improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of AWS’s expertise in operating highly available applications for the world’s largest e-commerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

This post walks you through how to enable DevOps Guru for your account in a typical serverless environment and observe the insights and recommendations generated for various activities. These insights are generated for operational events that could pose a risk to your application availability. DevOps Guru uses AWS CloudFormation stacks as the application boundary to detect resource relationships and correlate them with deployment events.

Solution overview

The goal of this post is to demonstrate the insights generated for anomalies detected by DevOps Guru from DevOps operations. If you don't have a test environment and want to build out infrastructure to test the generation of insights, then you can follow along through this post. However, if you have a test or production environment (preferably spawned from CloudFormation stacks), you can skip the first section and jump directly to the section Enabling DevOps Guru and injecting traffic.

The solution includes the following steps:

1. Deploy a serverless infrastructure.

2. Enable DevOps Guru and inject traffic.

3. Review the generated DevOps Guru insights.

4. Inject another failure to generate a new insight.

As depicted in the following diagram, we use a CloudFormation stack to create a serverless infrastructure comprising Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, and then inject HTTP requests at a high rate toward the API published to list records.

Serverless infrastructure monitored by DevOps Guru

When DevOps Guru is enabled to monitor your resources within the account, it uses a combination of vended Amazon CloudWatch metrics and specific patterns from its ML models to detect anomalies. When an anomaly is detected, it generates an insight with the recommendations.

The generation of insights is dependent upon several factors. Although this post provides a canned environment to reproduce insights, the results may vary depending upon the traffic pattern, the timing of traffic injection, and so on.

Prerequisites

To complete this tutorial, you should have access to an AWS Cloud9 environment and the AWS Command Line Interface (AWS CLI).

Deploying a serverless infrastructure

To deploy your serverless infrastructure, you complete the following high-level steps:

1.   Create your IDE environment.

2.   Launch the CloudFormation template to deploy the serverless infrastructure.

3.   Populate the DynamoDB table.

 

1. Creating your IDE environment

We recommend using AWS Cloud9 to create an environment to get access to the AWS CLI from a bash terminal. AWS Cloud9 is a browser-based IDE that provides a development environment in the cloud. While creating the new environment, ensure you choose Amazon Linux 2 as the operating system. Alternatively, you can use a bash terminal in your favorite IDE and configure your AWS credentials in your terminal.

When access is available, run the following command to confirm that you can see the Amazon Simple Storage Service (Amazon S3) buckets in your account:

aws s3 ls

Install the following prerequisite packages and ensure you have Python3 installed:

sudo yum install jq -y

export AWS_REGION=$(curl -s \
169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')

sudo pip3 install requests

Clone the git repository to download the required CloudFormation templates:

git clone https://github.com/aws-samples/amazon-devopsguru-samples
cd amazon-devopsguru-samples/generate-devopsguru-insights/

2. Launching the CloudFormation template to deploy your serverless infrastructure

To deploy your infrastructure, complete the following steps:

  • Run the CloudFormation template using the following command:
aws cloudformation create-stack --stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

The AWS CloudFormation deployment creates an API Gateway, a DynamoDB table, and a Lambda function with sample code.

  • When it’s complete, go to the Outputs tab of the stack on the AWS CloudFormation console.
  • Record the links for the two APIs: one of them lists the table contents and the other one populates the contents.

3. Populating the DynamoDB table

Run the following commands (simply copy and paste) to populate the DynamoDB table. The first two commands identify the name of the DynamoDB table from the CloudFormation stack and insert it into the populate-shops-dynamodb-table.json file; the third writes those items to the table.

dynamoDBTableName=$(aws cloudformation list-stack-resources \
--stack-name myServerless-Stack | \
jq '.StackResourceSummaries[]|select(.ResourceType == "AWS::DynamoDB::Table").PhysicalResourceId' | tr -d '"')
sudo sed -i s/"<YOUR-DYNAMODB-TABLE-NAME>"/$dynamoDBTableName/g \
populate-shops-dynamodb-table.json
aws dynamodb batch-write-item \
--request-items file://populate-shops-dynamodb-table.json

The last command gives the following output:

{
"UnprocessedItems": {}
}

This populates the DynamoDB table with a few sample records, which you can verify by accessing the ListRestApiEndpointMonitorOper API URL published on the Outputs tab of the CloudFormation stack. The following screenshot shows the output.
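
If you prefer to verify from the terminal instead of the browser, the following sketch fetches the URL from the stack outputs and calls the API. It assumes the output key is named ListRestApiEndpointMonitorOper; adjust it to the key shown on your stack's Outputs tab.

apiUrl=$(aws cloudformation describe-stacks --stack-name myServerless-Stack \
  --query "Stacks[0].Outputs[?OutputKey=='ListRestApiEndpointMonitorOper'].OutputValue" \
  --output text)
curl -s "$apiUrl" | jq .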

Screenshot showing the API output

Enabling DevOps Guru and injecting traffic

In this section, you complete the following high-level steps:

1.   Enable DevOps Guru for the CloudFormation stack.

2.   Wait for the serverless stack to complete.

3.   Update the stack.

4.   Inject HTTP requests into your API.

 

1. Enabling DevOps Guru for the CloudFormation stack

To enable DevOps Guru for CloudFormation, complete the following steps:

  • Run the CloudFormation template to enable DevOps Guru for this CloudFormation stack:
aws cloudformation create-stack \
--stack-name EnableDevOpsGuruForServerlessCfnStack \
--template-body file:///$PWD/EnableDevOpsGuruForServerlessCfnStack.yaml \
--parameters ParameterKey=CfnStackNames,ParameterValue=myServerless-Stack \
--region ${AWS_REGION}
  • When the stack is created, navigate to the Amazon DevOps Guru console.
  • Choose Settings.
  • Under CloudFormation stacks, locate myServerless-Stack.

If you don’t see it, your CloudFormation stack has not been successfully deployed. You may remove and redeploy the EnableDevOpsGuruForServerlessCfnStack stack.

Optionally, you can configure Amazon Simple Notification Service (Amazon SNS) topics or enable AWS Systems Manager integration to create OpsItem entries for every insight created by DevOps Guru. If you need to deploy as a stack set across multiple accounts and Regions, see Easily configure Amazon DevOps Guru across multiple accounts and Regions using AWS CloudFormation StackSets.
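
For instance, if you already have an SNS topic, one way to attach it from the CLI is shown below. The topic ARN is a placeholder, and the same option is available in the DevOps Guru console.

# Hypothetical example: notify an existing SNS topic whenever DevOps Guru creates an insight.
aws devops-guru add-notification-channel \
  --config '{"Sns": {"TopicArn": "arn:aws:sns:us-east-1:111122223333:devops-guru-alerts"}}' \
  --region ${AWS_REGION}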

2. Waiting for baselining of resources

This is a necessary step to allow DevOps Guru to complete baselining the resources and benchmark their normal behavior. For our serverless stack with three resources, we recommend waiting for two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours to complete baselining.

Note: Unlike many monitoring tools, DevOps Guru does not expect the dashboard to be monitored continuously. Under normal system health, the dashboard simply shows zeroed counters for ongoing insights. Only when an anomaly is detected does it raise an alert and display an insight on the dashboard.

3. Updating the CloudFormation stack

When enough time has elapsed, we will make a configuration change to simulate a typical operational event. As shown below, update the CloudFormation template to change the read capacity for the DynamoDB table from 5 to 1.

CloudFormation showing read capacity settings to modify

Run the following command to deploy the updated CloudFormation template:

aws cloudformation update-stack --stack-name myServerless-Stack \
--template-body file:///$PWD/cfn-shops-monitoroper-code.yaml \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

4. Injecting HTTP requests into your API

We now inject ingress traffic in the form of HTTP requests toward the ListRestApiEndpointMonitorOper API, either manually or using the Python script provided in the current directory (sendAPIRequest.py). Due to the reduced read capacity, the traffic will oversubscribe the DynamoDB table, thus inducing an anomaly. However, before triggering the script, populate the url parameter with the correct API link for the ListRestApiEndpointMonitorOper API, listed on the Outputs tab of the CloudFormation stack.

After the script is saved, trigger the script using the following command:

python sendAPIRequest.py
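
If you would rather not edit the Python script, a rough curl-based loop achieves the same effect. Replace the placeholder with the ListRestApiEndpointMonitorOper URL from the stack outputs; this is only a sketch, not the provided script.

# Continuously call the API and print the response plus its HTTP status code.
while true; do
  curl -s -w "\nHTTP %{http_code}\n" "<ListRestApiEndpointMonitorOper-URL>"
done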

Make sure you're getting status 200 responses containing the records that we fed into the DynamoDB table (see the following screenshot). You may have to run the script in multiple terminal tabs (preferably four) to inject a high rate of traffic.

Terminal showing script executing and injecting traffic to API

After approximately 10 minutes of the script running in a loop, an operational insight is generated in DevOps Guru.

 

Reviewing DevOps Guru insights

While these operations are being run, DevOps Guru monitors for anomalies, generates insights that provide details about the metrics indicating an anomaly, and offers actionable recommendations to mitigate the anomaly. In this section, we review some of these insights.

Under normal conditions, the DevOps Guru dashboard shows the ongoing insights counter at zero. DevOps Guru monitors a large number of metrics behind the scenes, freeing the operator from manually watching counters or graphs, and raises an alert in the form of an insight only when an anomaly occurs.

The following screenshot shows an ongoing reactive insight for the specific CloudFormation stack. When you choose the insight, you see further details. The number under the Total resources analyzed last hour may vary, so for this post, you can ignore this number.

DevOps Guru dashboard showing an ongoing reactive insight

The insight is divided into four sections: Insight overview, Aggregated metrics, Relevant events, and Recommendations. Let’s take a closer look into these sections.

The following screenshot shows the Aggregated metrics section, which lists metrics for all the resources in which anomalies were detected (DynamoDB, Lambda, and API Gateway). Note that the list of metrics may vary depending upon your traffic pattern, Lambda settings, and baseline traffic. In the example below, the timeline shows that the anomaly for DynamoDB started first and was followed by API Gateway and Lambda. This helps us separate cause from symptoms and prioritize the right anomaly to investigate.

The listing of metrics inside an Insight

Initially, you may see only two metrics listed; over time, more metrics that showed anomalies are populated. You can see that the anomaly for DynamoDB started earlier than the anomalies for API Gateway and Lambda, indicating that the latter are aftereffects. In addition to the information in the preceding screenshot, you may also see the Duration p90 and IntegrationLatency p90 metrics (for Lambda and API Gateway, respectively) listed, due to the increased backend latency.

Now we check the Relevant events section, which lists potential triggers for the issue. The events listed here depend on the set of operations carried out on this CloudFormation stack's resources in this Region. This makes it easy for the operator to be reminded of a change that may have caused this issue. The dots (representing events) near the start of the insight timeline are of particular interest.

Related Events shown inside the Insight

If you need to delve into any of these events, just click one of these points to see more details, as shown in the following screenshot.

Delving into details of the related event listed in Insight

You can choose the link for an event to view specific details about the operational event (configuration change via CloudFormation update stack operation).

Now we move to the Recommendations section, which provides prescriptive guidance for mitigating this anomaly. As seen in the following screenshot, it recommends rolling back the configuration change related to the read capacity for the DynamoDB table. It also lists the specific metrics and the event as part of the recommendation for reference.

Recommendations provided inside the Insight

Going back to the metrics section of the insight, when we select Graphed anomalies, it shows the graphs of all related resources. The following screenshot shows a snippet of the anomaly for the DynamoDB ReadThrottleEvents metric. As the graph pattern shows, read operations on the table are exceeding the provisioned read capacity, which clearly indicates an anomaly.

Graphed anomalies in DevOps Guru

Let's navigate to the DynamoDB table and check our table configuration. Checking the table properties, we notice that our read capacity is reduced to 1. This is the root cause of the anomaly.

Checking the DynamoDB table capacity settings

If we change it to 5, we fix this anomaly. Alternatively, if the traffic is stopped, the anomaly moves to a Resolved state.
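
If you want to restore the capacity directly from the CLI rather than by rolling back the stack update, something like the following works. Note that it causes drift from the CloudFormation template and assumes the table's write capacity is also 5.

# Restore the provisioned read capacity using the table name captured earlier.
aws dynamodb update-table --table-name "$dynamoDBTableName" \
  --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5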

The ongoing reactive insight takes a few minutes after resolution to move to a Closed state.

Note: When the insight is active, you may see more metrics get populated over time as we detect further anomalies. When carrying out the preceding tests, if you don’t see all the metrics as listed in the screenshots, you may have to wait longer.

 

Injecting another failure to generate a new DevOps Guru insight

Let’s create a new failure and generate an insight for that.

1.   After you close the insight from the previous section, trigger the HTTP traffic generation loop from the AWS Cloud9 terminal.

We modify the Lambda function's resource-based policy by removing the permissions for API Gateway to access this function.

2.   On the Lambda console, open the function ScanFunctionMonitorOper.

3.   On the Permissions tab, access the policy.

Accessing the permissions tab for the Lambda

 

4.   Save a copy of the policy offline as a backup before making any changes.

5.   Note down the "Sid" values of the statements whose "AWS:SourceArn" ends with prod/*/ and prod/*/*.

Checking the Resource-based policy for the Lambda

6.   Run the following command in your Cloud9 terminal to remove the first of these statements:

aws lambda remove-permission --function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/>

7.   Run the same command for the second Sid value:

aws lambda remove-permission --function-name ScanFunctionMonitorOper \
--statement-id <Sid-value-ending-with-prod/*/*>

You should see several 5XX errors, as in the following screenshot.

Terminal output now showing 500 errors for the script output

After less than 8 minutes, you should see a new ongoing reactive insight on the DevOps Guru dashboard.

Let’s take a closer look at the insight. The following screenshot shows the anomalous metric 5XXError Average of API Gateway and its duration. (This insight shows as closed because I had already restored permissions.)

Insight showing 5XX errors for API-Gateway and link to OpsItem

If you have configured Systems Manager integration to create OpsItems, you will see a link to the OpsItem ID created for the insight, as shown above. This optional configuration enables you to track insights in the form of open tickets (OpsItems) in Systems Manager OpsCenter.

The recommendations provide guidance based upon the related events and anomalous metrics.

After the insight has been generated, reviewed, and verified, restore the permissions by running the following command:

aws lambda add-permission --function-name ScanFunctionMonitorOper  \
--statement-id APIGatewayProdPerm --action lambda:InvokeFunction \
--principal apigateway.amazonaws.com

If needed, you can insert the condition to point to the API Gateway ARN to allow only specific API Gateways to access the Lambda function.

 

Cleaning up

After you walk through this post, you should clean up and un-provision the resources to avoid incurring any further charges.

1.   To un-provision the CloudFormation stacks, on the AWS CloudFormation console, choose Stacks.

2.   Select each stack (EnableDevOpsGuruForServerlessCfnStack and myServerless-Stack) and choose Delete.

3.   Check to confirm that the DynamoDB table created with the stacks is cleaned up. If not, delete the table manually.

4.   Un-provision the AWS Cloud9 environment.

 

Conclusion

This post reviewed how DevOps Guru can continuously monitor the resources in your AWS account in a typical production environment. When it detects an anomaly, it generates an insight, which includes the vended CloudWatch metrics that breached the threshold, the CloudFormation stack in which the resource existed, relevant events that could be potential triggers, and actionable recommendations to mitigate the condition.

DevOps Guru generates insights that are relevant to you based upon the pre-trained machine-learning models, removing the undifferentiated heavy lifting of manually monitoring several events, metrics, and trends.

I hope this post was useful to you and that you will consider DevOps Guru for your production needs.

 

Soar: Simulation for Observability, reliAbility, and secuRity

Post Syndicated from Yan Zhai original https://blog.cloudflare.com/soar-simulation-for-observability-reliability-and-security/

Soar: Simulation for Observability, reliAbility, and secuRity

Soar: Simulation for Observability, reliAbility, and secuRity

Serving more than 25 million Internet properties is not an easy thing, and neither is serving 20 million requests per second on average. At Cloudflare, we achieve this by running a homogeneous edge environment: almost every Cloudflare server runs all Cloudflare products.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 1. Typical Cloudflare service model: when an end-user (a browser/mobile/etc) visits an origin (a Cloudflare customer), traffic is routed via the Internet to the Cloudflare edge network, and Cloudflare communicates with the origin servers from that point.

As we offer more and more products and enjoy the benefit of horizontal scalability, our edge stack continues to grow in complexity. Originally, we only operated at the application layer with our CDN service and DoS protection. Then we launched transport layer products, such as Spectrum and Argo. Now we have further expanded our footprint into the IP layer and physical link with Magic Transit. They all run on every machine we have. The work of our engineers enables our products to evolve at a fast pace, and to serve our customers better.

However, such software complexity presents a sheer challenge to operation: the more changes you make, the more likely it is that something is going to break. And we don’t tolerate any of our mistakes slipping into the production environment and affecting our customers.

In this article, we will discuss one of the techniques we use to fight such software complexity: simulations. Simulations are basically system tests that run with synthesized customer traffic and applications. We would like to introduce our simulation system, SOAR, i.e. Simulation for Observability, reliAbility, and secuRity.

What is SOAR? Simply put, it’s a data center built specifically for simulations. It runs the same software stack as our production data centers, but without any production traffic. Within SOAR, there are end-user servers, product servers, and origin servers (Figure 2). The product servers behave exactly the same as servers in our production edge network, and they are the targets that we want to test. End-user servers and origin servers run applications that try to simulate customer behaviors. The simplest case is to run network benchmarks through product servers, in order to evaluate how effective the corresponding products are. Instead of sending test traffic over the Internet, everything happens in the LAN environment that Cloudflare tightly controls. This gives us great flexibility in testing network features such as bring-your-own-IP (BYOIP) products.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 2. SOAR architectural view: by simulating the end users and origin on Cloudflare servers in the same VLAN, we can focus on examining the problems occurring in our edge network.

To demonstrate how this works, let’s go through a running example using Magic Transit.

Magic Transit is a product that provides IP layer protection and acceleration. One of the main functions of Magic Transit is to shield customers from DDoS attacks.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 3. Magic Transit workflow in a nutshell

Customers bring their IP ranges to advertise from Cloudflare edge. When attackers initiate a DoS attack, Cloudflare absorbs all the customer’s traffic, drops the attack traffic, and encapsulates clean traffic to customers. For this product, operational concerns are multifold, and here are some examples:

  • Have we properly configured our data plane so that traffic can reach customers? Is BGP ready? Are ECMP routes programmed correctly? Are health probes working correctly?
  • Do we have any untested corner cases that only manifest with a large amount of traffic?
  • Is our DoS system dropping malicious traffic as intended? How effective is it?
  • Will any other team’s changes break Magic Transit as our edge keeps growing?

To ease these concerns, we run simulated customers with SOAR. Yes, simulated, not real. For example, assume a customer Alice onboarded an IP range 192.0.2.0/24 to Magic Transit. To simulate this customer, in SOAR we can configure a test application (e.g. iperf server) on one origin server to represent Alice's service. We also bring up a product server to run Magic Transit. This product server will filter traffic toward 192.0.2.0/24 and GRE-encapsulate cleansed traffic to Alice's specified GRE endpoint. To make it work, we also add routing rules to forward packets destined for 192.0.2.0/24 through the product server above. Similarly, we add routing rules to deliver GRE packets from the product server to the origin servers. Lastly, we start running test clients as eyeballs to evaluate functional correctness, performance metrics, and resource usage.
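
To make the example concrete, the following is a rough sketch of the kind of system tasks such a simulation could apply. It is purely illustrative rather than our actual automation, and the addresses are placeholders from documentation ranges.

# On the origin server: terminate the GRE tunnel and host the simulated service.
ip tunnel add mt-sim0 mode gre local 198.51.100.20 remote 198.51.100.10 ttl 255
ip addr add 192.0.2.1/24 dev mt-sim0
ip link set mt-sim0 up
iperf3 -s &

# On the eyeball server: steer traffic for the simulated prefix through the product server.
ip route add 192.0.2.0/24 via 198.51.100.10

# On the eyeball server: generate test traffic toward the simulated customer service.
iperf3 -c 192.0.2.1 -t 60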

For the rest of this article, we will talk about the design and implementation of this simulation system, as well as several real cases in which it helped us catch problems early or avoid problems altogether.

System Design

From performance simulation to config simulation

Before we created SOAR, we had already built a “performance simulation” for our layer 7 services. It is based on SaltStack, our configuration management software. All the simulation cases are system test cases against Cloudflare-owned HTTP sites. These cases are statically configured and run non-stop. Each simulation case produces multiple Prometheus metrics such as requests per second and latency. We monitor these metrics daily on our Grafana dashboard.

While this simulation system is very useful, it becomes less efficient as we have more and more simulation cases and products to run and analyze.

Isolation and Coordination

As more types of simulations are onboarded, it is critical to ensure each simulation runs in a clean environment, and all tasks of a simulation run together. This challenge is specific to providers like Cloudflare, whose products are not virtualized because we want to maximize our edge performance. As a result, we have to isolate simulations and clean up by ourselves; otherwise, different simulations may cross-affect each other.

For example, for Magic Transit simulations, we need to create a GRE tunnel on an origin server and set up several routes on all three servers, to make sure simulated traffic can flow as real Magic Transit customers' traffic would. We cannot leave these routes in place after the simulation finishes, or there might be a conflict. We once ran into a situation where different simulations required different source IP addresses to the same destination. In our original performance simulation environment, we would have had to modify the simulation applications to avoid these conflicts. This approach is less desirable, as different engineering teams would have to pay attention to other teams' code.

Moreover, the performance simulation addresses only the most basic system test cases: a client sends traffic to a server and measures performance characteristics like requests per second and latency quantiles. But the situations we want to simulate and validate in our production environment can be far more complex.

In our previous example of Magic Transit, customers can configure complicated network topology. Figure 4 is one simplified case. Let’s say Alice establishes four GRE tunnels with Cloudflare; two connect to her data center 1, and traffic will be ECMP hashed between these two tunnels. Similarly, she also establishes another two tunnels to her data center 2 with similar ECMP settings. She would like to see traffic hit data center 1 normally and fail over to data center 2 when tunnels 1 and 2 are both down.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 4. The customer configured Magic Transit to establish four tunnels to her two data centers. Traffic to data center 1 is hashed between tunnel 1 and 2 using ECMP, and traffic to data center 2 is hashed between tunnel 3 and 4. Data center 1 is the primary one, and traffic should failover to data center 2 if tunnels 1 and 2 are both down. Note the number “2” is purely symbolic, as real customers can have more than just 2 data centers, or 2 paths per ECMP route.

In order to examine the effectiveness of route failover, we would need to inject errors on the product servers only after the traffic on the eyeball server has started. But this type of coordination is not achievable with statically defined simulations.

Engineer friendliness and interactivity

Our performance simulation is not engineer-friendly, not just because it is all statically configured in SaltStack (most engineering teams do not possess Salt expertise), but also because it is not integrated with an engineer's daily routine. Can engineers trigger a simulation on every branch build? Can simulation results come back in time to signal that a performance problem has occurred? The answer is no; neither is possible with our static configuration. An engineer can submit a Salt PR to configure a new simulation, but that simulation may have to wait several hours because all other unfinished simulations need to complete first (recall it is just a static loop). Nor can an engineer add a test to the team's own repository to run on every build, because it needs to reside in the SRE-managed Salt repository, which becomes unmanageable as the number of simulations grows.

The Architecture

To address the above limitations, we designed SOAR.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 5. The Architecture of SOAR

The architecture is an extension of the performance simulation structure. We created an internal coordinator service to:

  1. Interface with engineers, so they can submit one-time simulations from their laptops or from the build pipeline, or view previous execution results.
  2. Dispatch and coordinate simulation tasks to each simulation server, where a simulation agent executes these tasks. The coordinator will isolate simulations properly so none of them contends for system resources. For example, the simplest policy we implemented is to never run two simulations on the same server at the same time.

The coordinator is secured by Cloudflare Access, so that only employees can visit this service. The coordinator serves two types of simulations. One-time simulations run in an ad hoc way, mainly on a per-pull-request basis, to ease development testing; they are also callable from our CI system. The other type is repetitive simulations, which are stored in the coordinator's persistent storage. These simulations serve daily monitoring purposes and are executed periodically.

Each simulation server runs a simulation agent. This agent executes two types of tasks received from the coordinator: system tasks and user tasks. System tasks change system-wide configurations and are reverted after each simulation terminates. These include, but are not limited to, route changes, link changes, address changes, ipset changes, and iptables changes.

User tasks, on the other hand, run the benchmarks that we are interested in evaluating, and are terminated if they exceed an allocated execution budget. Each user task is isolated in a cgroup, and the agent ensures that all user tasks are executed with dedicated resources. The generic runtime metrics of user tasks are monitored by cAdvisor and sent to Prometheus and Alertmanager. A user task can export its own metrics to Prometheus as well.
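
As a purely illustrative sketch of this kind of isolation (not our actual agent code), a task can be confined to its own cgroup with bounded CPU and memory, plus an execution budget, using systemd on a cgroup v2 host. The unit name and benchmark command are placeholders.

# Run a hypothetical user benchmark in its own transient cgroup scope,
# capped at two CPU cores and 2 GiB of memory, and kill it after 300 seconds.
systemd-run --scope --unit=soar-user-task-42 \
  -p CPUQuota=200% -p MemoryMax=2G \
  timeout 300 ./user-benchmark --target 192.0.2.1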

For SOAR to run reliably, we provisioned a dedicated environment that enforces the same settings as the production environment, and we operate it as a production system: hardened security, standard alerts on watch, and no engineer access except through approved tools. This largely allows us to run simulations as a stable source of anomaly detection.

Simulating with customer-specific configuration

An important ability of SOAR is to simulate for a specific customer. This will provide the customer with more guarantees that both their configurations and our services are battle-tested with traffic before they go live. It can also be used to bisect problems during a customer escalation, helping customer support to rule out unrelated factors more easily.

All of our edge servers know how to dispatch an incoming customer packet. This factor greatly reduces difficulties in simulating a specific customer. What we need to do in simulation is to mock routing and domain translation on simulated eyeballs and origins, so that they will correctly send traffic to designated product servers. And the problem is solved—magic!

The actual implementation is also straightforward: as simulations run in a LAN environment, we have tight control over how packets are routed (servers are on the same broadcast domain). Any eyeball, origin, or product server can just use a direct routing rule and a static DNS entry in /etc/hosts to redirect packets to the correct destination.

Running a simulation this way allows us to separate customer configuration management from the simulation service: our products manage it, so any time a customer configuration is changed, it is already reflected in simulations without special care.

Implementation and Integration

All SOAR components are built with Golang from scratch on Linux servers. It took three engineer-months to build the prototype and onboard the first engineering use case. While there are other mature platforms for job scheduling, task isolation, and monitoring, building our own allows us to better absorb new requirements from engineering teams, which is much easier and quicker than an external dependency.

In addition, we are currently integrating the simulation service into our release pipeline. Cloudflare built a release manager internally to schedule product version changes in controlled steps: a new product is first deployed into dogfooding data centers. After the product has been trialed by Cloudflare employees, it moves to several canary data centers with limited customer traffic. If nothing bad happens, in an hour or so, it starts to land in larger data centers spread across three tiers. Tier-3 will receive the changes an hour earlier than tier-2, and the same applies to tier-2 and tier-1. This ensures a product would be battle-tested enough before it can serve the majority of Cloudflare customers.

Now we move this further by adding a simulation step even before dogfooding. In this step, all changes are deployed into the simulation environment, and engineering teams will configure which simulations to run. Dogfooding starts only when there is no performance regression or functional breakage. Here performance regression is based on Prometheus metrics, where each engineering team can define their own Prometheus query to interpret the performance results. All configured simulations will run periodically to detect problems in releases that do not tie to a specific product, e.g. a Linux kernel upgrade. SREs receive notifications asynchronously if any issue is discovered.

Simulations at Cloudflare: Case Studies

Simulations are very useful inside Cloudflare. Let’s see some real experiences we had in the past.

Detecting an anomaly on data center specific releases

In our Magic Transit example, the engineering team was about to release physical network interconnect (PNI) support. With PNI, customer data centers physically peer with Cloudflare routers.

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 6. The Magic Transit service flow for a customer without PNI support. Any Cloudflare data center can receive eyeball traffic. After mitigating a DoS attack traffic, valid traffic is encapsulated to the customer data center from any of the handling Cloudflare data centers.
Soar: Simulation for Observability, reliAbility, and secuRity
Figure 7. Magic Transit with PNI support. Traffic received from any data center will be moved to the PNI data center that the customer connects to. The PNI data center becomes a choke point.‌‌

However, this PNI functionality introduces a problem in our normal release process: PNI data centers are typically different from our dogfooding and canary data centers. If we still release with the normal process, then two critical phases are skipped. What's worse, the PNI data center could be a choke point in front of that customer's traffic; if the PNI data center is taken down, no other data center can replace its role.

SOAR in this case is an important utility to help. The basic idea is to configure a server with PNI information. This server will act as if it runs in a PNI data center. Then we run a simulated eyeball and origin to examine whether there is any functional breakage:

Soar: Simulation for Observability, reliAbility, and secuRity
Figure 8. SOAR configures a server with PNI information and runs simulated eyeball and origin on this server. If a PNI related code release has a problem, then with proper simulation traffic it will be caught before rolling into production.

With such simulation capability, we were able to detect several problems early on and before releasing. For example, we caught a problem that impacts checksum offloading, which could encapsulate TCP packets with the wrong inner checksum and cause the packets to be dropped at the origin side. This problem does not exist in our virtualized testing environment and integration tests; it only happens when production hardware comes into play. We then use this simulation as a success indicator to test various fixes until we get the packet flow running normally again.

Continuously monitor performance on the edge stack

When a team configures a simulation, it runs on the same stack where all other teams run their products as well. This means when a simulation starts to show unhealthy results, it may or may not directly relate to the product associated with that simulation.

But with continuous simulations, we will have more chances to detect issues before things go south, or at least it will serve as a hint to quickly react to emerging problems. In an example early this year, we noticed one of our performance simulation dashboards showed that some HTTP request throughput was dropping by 20%. After digging into the case, we found our bot detection system had made a change that affected related requests. Luckily enough we moved fast thanks to the hint from the simulation (and some other useful tools like Opentracing).

Our recent enhancement from just HTTP performance simulation to SOAR makes it even more useful for customers. This is because we are now able to simulate with customer-specific configurations, so we might expose customer-specific problems. We are still dogfooding this, and hopefully, we can deploy it to our customers soon.

DoS Attacks as Simulations

When we started to develop Magic Transit, two questions worth answering were how effective our mitigation pipeline is and how to apply thresholds for different customers. For our new ACK flood mitigation system, flowtrackd, we onboarded its performance simulation cases together with a tunable ACK flood. Combined with customer-specific configuration, this allows us to compare throughput results under different attack volumes and systematically tune our mitigation thresholds.

Another important capability of our "attack simulation" system is replaying attacks we have seen in the past, making sure that future development of our mitigation pipelines never lets these known attacks through to our customers.

Conclusion

In this article, we introduced Cloudflare’s simulation system, SOAR. While simulation is not a new tool, we can use it to improve reliability, observability, and security. Our adoption of SOAR is still in its early stages, but we are pretty confident that, by fully leveraging simulations, we will push our quality of service to a new level.

How Netflix Scales its API with GraphQL Federation (Part 2)

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-2-bbe71aaec44a

In our previous post and QConPlus talk, we discussed GraphQL Federation as a solution for distributing our GraphQL schema and implementation. In this post, we shift our attention to what is needed to run a federated GraphQL platform successfully — from our journey implementing it to lessons learned.

Netflix GraphQL Federation

Our Journey so Far

Over the past year, we’ve implemented the core infrastructure pieces necessary for a federated GraphQL architecture as described in our previous post:

Studio Edge Architecture Diagram
Studio Edge Architecture

The first Domain Graph Service (DGS) on the platform was the former GraphQL monolith that we discussed in our first post (Studio API). Next, we worked with a few other application teams to make DGSs that would expose their APIs alongside the former monolith. We had our first Studio applications consuming the federated graph, without any performance degradation, by the end of 2019. Once we knew that the architecture was feasible, we focused on readying it for broader usage. Our goal was to open up the Studio Edge platform for self-service in April 2020.

April 2020 was a turbulent time with the pandemic and overnight transition to working remotely. Nevertheless, teams started to jump into the graph in droves. Soon we had hundreds of engineers contributing directly to the API on a daily basis. And what about that Studio API monolith that used to be a bottleneck? We migrated the fields exposed by Studio API to individually owned DGSs without breaking the API for consumers. The original monolith is slated to be completely deprecated by the end of 2020.

This journey hasn’t been without its challenges. The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps.

Once we achieved broad alignment on the idea, we needed to ensure that adoption was seamless. This required building robust core infrastructure, ensuring a great developer experience, and solving for key cross-cutting concerns.

Core Infrastructure

Our GraphQL Gateway is based on Apollo's reference implementation and is written in Kotlin. This gives us access to Netflix's Java ecosystem, while also giving us robust language features such as coroutines for efficient parallel fetches and an expressive type system with null safety.

The schema registry is developed in-house, also in Kotlin. For storing schema changes, we use an internal library that implements the event sourcing pattern on top of the Cassandra database. Using event sourcing allows us to implement new developer experience features such as the Schema History view. The schema registry also integrates with our CI/CD systems like Spinnaker to automatically set up cloud networking for DGSs.

Developer Education & Experience

In the previous architecture, only the monolith Studio API team needed to learn GraphQL. In Studio Edge, every DGS team needs to build expertise in GraphQL. GraphQL has its own learning curve and can get especially tricky for complex cases like batching & lookahead. Also, as discussed in the previous post, understanding GraphQL Federation and implementing entity resolvers is not trivial either.

We partnered with Netflix’s Developer Experience (DevEx) team to build out documentation, training materials, and tutorials for developers. For general GraphQL questions, we lean on the open source community plus cultivate an internal GraphQL community to discuss hot topics like pagination, error handling, nullability, and naming conventions.

DGS Framework & Developer Tools

To make it easy for backend engineers to build a GraphQL DGS, the DevEx team built a "DGS Framework" on top of GraphQL Java and Spring Boot. The framework takes care of all the cross-cutting concerns of running a GraphQL service in production while also making it easier for developers to write GraphQL resolvers. In addition, DevEx built robust tooling for pushing schemas to the Schema Registry and a Self Service UI for browsing the various DGSs' schemas. Check out their conference talk and expect a future blog post from our colleagues. The DGS framework is planned to be open-sourced in early 2021.

Schema Governance

Netflix’s studio data is extremely rich and complex. Early on, we anticipated that active schema management would be crucial for schema evolution and overall health. We had a Studio Data Architect already in the org who was focused on data modeling and alignment across Studio. We engaged with them to determine graph schema best practices to best suit the needs of Studio Engineering.

Our goal was to design a GraphQL schema that was reflective of the domain itself, not the database model. UI developers should not have to build Backends For Frontends (BFF) to massage the data for their needs, rather, they should help shape the schema so that it satisfies their needs. Embracing a collaborative schema design approach was essential to achieving this goal.

Schema Design Workflow Diagram
Schema Design Workflow

The collaborative design process involves feedback and reviews across team boundaries. To streamline schema design and review, we formed a schema working group and a managed technical program for on-boarding to the federated architecture. While reviews add overhead to the product development process, we believe that prioritizing the quality of the graph model will reduce the amount of future changes and reworking needed. The level of review varies based on the entities affected; for the core federated types, more rigor is required (though tooling helps streamline that flow).

We have a deprecation workflow in place for evolving the schema. We’ve leveraged GraphQL’s deprecation feature and also track usage stats for every field in the schema. Once the stats show that a deprecated field is no longer used, we can make a backward incompatible change to remove the field from the schema.

Clients with Deprecated Field Usage
Clients with Deprecated Field Usage

We embraced a schema-first approach instead of generating our schema from existing models such as the Protobuf objects in our gRPC APIs. While Protobufs and gRPC are excellent solutions for building service APIs, we prefer decoupling our GraphQL schema from those layers to enable cleaner graph design and independent evolvability. In some scenarios, we implement generic mapping code from GraphQL resolvers to gRPC calls, but the extra boilerplate is worth the long-term flexibility of the GraphQL API.

Underlying our approach is a foundation of “context over control”, which is a key tenet of Netflix’s culture. Instead of trying to hold tight control of the entire graph, we give guidance and context to product teams so that they can apply their domain knowledge to make a flexible API for their domain. As this architecture matures, we will continue to monitor schema health and develop new tooling, processes, and best practices where needed.

Observability

In our previous architecture, observability was achieved through manual analysis and routing via the API team, which scaled poorly. For our federated architecture, we prioritized solving observability needs in a more scalable manner. We prioritized three areas:

  • Alerting — report when something goes awry
  • Discovery — easily determine what isn’t working
  • Diagnosis — debug why something isn’t working

Our guiding metrics in this space are mean time to resolution (MTTR) and service level objectives and indicators (SLO/SLI).

We teamed up with experts from Netflix’s Telemetry team. We integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale. In GraphQL, almost every response is a 200 with custom errors in the error block. We introspect these custom error codes from the response and emit them to our metrics server, Atlas. These integrations created a great foundation of rich visibility and insights for the consumers and developers of the GraphQL API.

Trace for a Federated Request Lifecycle
Edgar Trace for a Federated Request Lifecycle
Timeline View for a Federated Request lifecycle
Timeline View for a Federated Request

Distributed Log Correlation helps with debugging more complex server issues. By surfacing the application level logging details for all systems involved in processing a request, we gain deeper insights into what happened across the stack. Developers can easily see what was happening around the same time as a given request, to inspect surrounding factors that might have impacted an interaction.

Log correlation across multiple services for a request lifecycle
Logs across multiple services for a Federated Request

To solve the “who do I ask about…” routing problem, we integrated deep linking from GraphQL types and fields to their owning team’s support channels. Finding support is now as simple as clicking a link from a trace, which helps shorten MTTR and reduce the number of times the gateway team needs to get involved.

Securing the Federated Graph

Our goal is to enable robust and consistent security practices across the federated architecture. To achieve this, we partnered with the security experts at Netflix to build security into the graph. Let’s look at two essential parts of our security solution: AuthN and AuthZ.

Authentication

All of our product experiences in the Studio space require an authenticated account, so we restrict the GraphQL Gateway access to only trusted authenticated callers. Additionally, Graph Introspection is restricted to Netflix internal developers.

Authorization

Before Studio Edge, authorization logic was fragmented across teams. Some teams implemented authorization in their BFFs, some in microservices, and others did both for good measure. The result was often a different authorization story for a given piece of data depending on which UI a user was accessing it through. UI teams also found themselves needing to implement (and re-implement) authorization checks with each new frontend.

In Studio Edge, we delegated the authorization responsibility to DGS owners. This resulted in consistent authorization for the same user across different applications. Plus, Product Managers, Engineers and the Security team can easily get a bird’s eye view of who has access to each data type and how.

We have multiple authorization offerings within Netflix: from a simple system that grants access based on user identity to a more granular system that brings in the concept of roles and capabilities. DGS developers can choose a solution based on their needs. Then they simply annotate their resolvers with @Secured annotation and configure that to use one of the available systems. If needed, more complex authorization can be implemented in the resolver or in downstream systems.

Future of Authorization

We are currently prototyping a GraphQL-aware authorization solution. The Schema Registry automatically generates Access Control Groups (ACGs) for each field and its corresponding type when its schema is registered. Product managers & DGS Engineers decide membership and rules for these generated ACGs. Since the ACGs map to a field in GraphQL, the DGS framework then automatically applies the rules associated with the ACG during execution.

Architecting for Failure

The GraphQL Gateway is the single entry point for all requests; a failure on the gateway can cause significant disruptions. Following Netflix engineering best practices, we assume failures will happen and design ways to mitigate the impact of those failures. These are our design principles for ensuring the gateway layer is resilient:

  1. Single purpose
  2. Stateless service
  3. Demand controlled
  4. Multi-region
  5. Sharded by functionality

First, we focus the responsibilities of the gateway layer on a single purpose: parse client queries, then build and execute query plans. By reducing the scope, we limit the range of problems that can occur. We aim to perform any additional resource-intensive operations off-box with the exception of logging and metrics. Taking on additional unrelated logic in the gateway layer could increase surface area for failures in this critical tier.

Second, we run multiple stateless instances of the gateway service. Any gateway instance is able to generate and execute a query plan for any request. When we make code changes to the gateway layer, we rigorously test them before rolling out to production.

Third, we seek to balance the resources each request consumes through applying demand control. We rate-limit callers to avoid overloading the underlying databases that are the source of most of our domain elements. We also run a static query cost calculation on all incoming queries and reject expensive queries to avoid gridlock in gateway and DGS resources. Our partners understand these tradeoffs and work with us to meet these requirements, reworking expensive queries and reducing high volume callers.
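The exact cost model is not described here, but graphql-java ships instrumentation that aborts queries above a complexity or depth budget, which is one way to implement this kind of static guard; the limits below are arbitrary examples, not production values.

```java
import graphql.GraphQL;
import graphql.analysis.MaxQueryComplexityInstrumentation;
import graphql.analysis.MaxQueryDepthInstrumentation;
import graphql.execution.instrumentation.ChainedInstrumentation;
import graphql.schema.GraphQLSchema;

import java.util.List;

public class GatewayExecutionFactory {

    // Budgets are illustrative; real limits would be tuned with partner teams.
    private static final int MAX_COMPLEXITY = 200;
    private static final int MAX_DEPTH = 12;

    public static GraphQL build(GraphQLSchema schema) {
        // Each field contributes to the complexity score; queries over budget
        // are rejected before any resolver or downstream DGS is called.
        return GraphQL.newGraphQL(schema)
            .instrumentation(new ChainedInstrumentation(List.of(
                new MaxQueryComplexityInstrumentation(MAX_COMPLEXITY),
                new MaxQueryDepthInstrumentation(MAX_DEPTH))))
            .build();
    }
}
```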

Fourth, we deploy our gateway layer to multiple AWS regions around the world. This allows us to limit the blast radius for problems that inevitably arise. When problems happen, we can fail over to another region to ensure our clients are minimally impacted.

Last, we deploy multiple functional shards of our gateway layer. The code is the same in each shard and incoming requests are routed based on category. For example, GraphQL subscriptions generally result in long-lived connections while Queries & Mutations are short-lived. We use a separate fleet of instances for Subscriptions so “running out of connections” does not affect the availability of Queries and Mutations.

There is more we can do to improve resilience. We have plans to do canary deployments and analysis for gateway deployments and, eventually, schema changes. Today, our gateway dynamically updates its schema by polling the schema registry. We are in the process of decoupling these by storing the federation config in a versioned S3 bucket, making the gateway resilient to schema registry failures.

Closing Thoughts

GraphQL and Federation have been a productivity multiplier for Studio applications. Motivated by this, we’ve recently prototyped using GraphQL Federation for the Netflix consumer app search page on iOS & Android. To do this, we created three DGSs to provide the data for a minimal portion of the consumer graph. We are sending a small subset of users to this alternative stack and measuring high-level metrics. We are excited to see the results and explore further applicability in the Netflix consumer space.

Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration require high commitment from partner teams and seamless cross-functional collaboration. If you’re considering going in this direction, we recommend checking out Apollo’s SaaS offering for Federation and the many online resources for learning GraphQL. For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability have made the transition worth it.

In closing, we want to hear from you! If you have already implemented federation or tried to solve this problem with another approach, we would love to learn more. Sharing knowledge is one of the ways our industry learns and improves rapidly. Finally, if you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare, Edited by Philip Fisher-Ogden

Additional Credits: Stephen Spalding, Jennifer Shin, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer


How Netflix Scales its API with GraphQL Federation (Part 2) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Netflix’s Distributed Tracing Infrastructure

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304

by Maulik Pandey

Our Team — Kevin Lew, Narayanan Arunachalam, Elizabeth Carretto, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Maulik Pandey

“@Netflixhelps Why doesn’t Tiger King play on my phone?” — a Netflix member via Twitter

This is an example of a question our on-call engineers need to answer to help resolve a member issue — which is difficult when troubleshooting distributed systems. Investigating a video streaming failure consists of inspecting all aspects of a member account. In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. Now let’s look at how we designed the tracing infrastructure that powers Edgar.

Distributed Tracing: the missing context in troubleshooting services at scale

Prior to Edgar, our engineers had to sift through a mountain of metadata and logs pulled from various Netflix microservices in order to understand a specific streaming failure experienced by any of our members. Reconstructing a streaming session was a tedious and time-consuming process that involved tracing all interactions (requests) between the Netflix app, our Content Delivery Network (CDN), and backend microservices. The process started with a manual pull of member account information that was part of the session. The next step was to put all the puzzle pieces together and hope the resulting picture would help resolve the member issue. We needed to increase engineering productivity via distributed request tracing.

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs. This insight led us to build Edgar: a distributed tracing infrastructure and user experience.

Figure 1. Troubleshooting a session in Edgar

When we started building Edgar four years ago, there were very few open-source distributed tracing systems that satisfied our needs. Our tactical approach was to use Netflix-specific libraries for collecting traces from Java-based streaming services until open source tracer libraries matured. By 2017, open source projects like Open-Tracing and Open-Zipkin were mature enough for use in polyglot runtime environments at Netflix. We chose Open-Zipkin because it had better integrations with our Spring Boot based Java runtime environment. We use Mantis for processing the stream of collected traces, and we use Cassandra for storing traces. Our distributed tracing infrastructure is grouped into three sections: tracer library instrumentation, stream processing, and storage. Traces collected from various microservices are ingested in a stream processing manner into the data store. The following sections describe our journey in building these components.

Trace Instrumentation: how will it impact our service?

That is the first question our engineering teams asked us when integrating the tracer library. It is an important question because tracer libraries intercept all requests flowing through mission-critical streaming services. Safe integration and deployment of tracer libraries in our polyglot runtime environments was our top priority. We earned the trust of our engineers by developing empathy for their operational burden and by focusing on providing efficient tracer library integrations in runtime environments.

Distributed tracing relies on propagating context for local interprocess calls (IPC) and client calls to remote microservices for any arbitrary request. Passing the request context captures causal relationships between microservices during runtime. We adopted Open-Zipkin’s B3 HTTP header based context propagation mechanism. We ensure that context propagation headers are correctly passed between microservices across a variety of our “paved road” Java and Node runtime environments, which include both older environments with legacy codebases and newer environments such as Spring Boot. We execute the Freedom & Responsibility principle of our culture in supporting tracer libraries for environments like Python, NodeJS, and Ruby on Rails that are not part of the “paved road” developer experience. Our loosely coupled but highly aligned engineering teams have the freedom to choose an appropriate tracer library for their runtime environment and have the responsibility to ensure correct context propagation and integration of network call interceptors.
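For illustration, here is a minimal sketch of what B3 propagation amounts to on the client side: copying the incoming B3 headers onto an outgoing call so the downstream service joins the same trace. Real tracer libraries also create child spans and handle sampling, which is omitted here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.Map;

public final class B3Propagation {

    // The B3 context propagation headers defined by Zipkin.
    private static final List<String> B3_HEADERS = List.of(
        "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled", "X-B3-Flags");

    private B3Propagation() {}

    // Copies any B3 headers present on the incoming request onto the outgoing
    // client call so the downstream service can attach its spans to the same trace.
    public static HttpRequest.Builder propagate(Map<String, String> incomingHeaders, URI downstreamUri) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(downstreamUri);
        for (String header : B3_HEADERS) {
            String value = incomingHeaders.get(header);
            if (value != null) {
                builder.header(header, value);
            }
        }
        return builder;
    }
}
```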

Our runtime environment integrations inject infrastructure tags like service name, auto-scaling group (ASG), and container instance identifiers. Edgar uses this infrastructure tagging schema to query and join traces with log data for troubleshooting streaming sessions. Additionally, it became easy to provide deep links to different monitoring and deployment systems in Edgar due to consistent tagging. With runtime environment integrations in place, we had to set an appropriate trace data sampling policy for building a troubleshooting experience.

Stream Processing: to sample or not to sample trace data?

This was the most important question we considered when building our infrastructure because data sampling policy dictates the amount of traces that are recorded, transported, and stored. A lenient trace data sampling policy generates a large number of traces in each service container and can lead to degraded performance of streaming services as more CPU, memory, and network resources are consumed by the tracer library. An additional implication of a lenient sampling policy is the need for scalable stream processing and storage infrastructure fleets to handle increased data volume.

We knew that a heavily sampled trace dataset is not reliable for troubleshooting because there is no guarantee that the request you want is in the gathered samples. We needed a thoughtful approach for collecting all traces in the streaming microservices while keeping low operational complexity of running our infrastructure.

Most distributed tracing systems enforce sampling policy at the request ingestion point in a microservice call graph. We took a hybrid head-based sampling approach that allows for recording 100% of traces for a specific and configurable set of requests, while continuing to randomly sample traffic per the policy set at ingestion point. This flexibility allows tracer libraries to record 100% traces in our mission-critical streaming microservices while collecting minimal traces from auxiliary systems like offline batch data processing. Our engineering teams tuned their services for performance after factoring in increased resource utilization due to tracing. The next challenge was to stream large amounts of traces via a scalable data processing platform.
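Before moving on to the streaming pipeline, here is a minimal sketch of what such a hybrid head-based decision could look like; the application names and the default probability are assumptions for illustration, not our actual policy.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class HybridHeadSampler {

    // Applications (or request types) that must always be traced; names are illustrative.
    private final Set<String> alwaysSample;
    // Random sampling probability applied to everything else.
    private final double defaultProbability;

    public HybridHeadSampler(Set<String> alwaysSample, double defaultProbability) {
        this.alwaysSample = alwaysSample;
        this.defaultProbability = defaultProbability;
    }

    // The decision is made once at the ingestion point and propagated (e.g. via
    // X-B3-Sampled) so every downstream service honors the same choice.
    public boolean shouldSample(String applicationName) {
        if (alwaysSample.contains(applicationName)) {
            return true; // mission-critical path: record 100% of traces
        }
        return ThreadLocalRandom.current().nextDouble() < defaultProbability;
    }
}
```

A hypothetical configuration such as new HybridHeadSampler(Set.of("playback-api"), 0.001) would record every trace for the critical path while randomly sampling 0.1% of everything else.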

Mantis is our go-to platform for processing operational data at Netflix. We chose Mantis as our backbone to transport and process large volumes of trace data because we needed a backpressure-aware, scalable stream processing system. Our trace data collection agent transports traces to Mantis job cluster via the Mantis Publish library. We buffer spans for a time period in order to collect all spans for a trace in the first job. A second job taps the data feed from the first job, does tail sampling of data and writes traces to the storage system. This setup of chained Mantis jobs allows us to scale each data processing component independently. An additional advantage of using Mantis is the ability to perform real-time ad-hoc data exploration in Raven using the Mantis Query Language (MQL). However, having a scalable stream processing platform doesn’t help much if you can’t store data in a cost efficient manner.

Storage: don’t break the bank!

We started with Elasticsearch as our data store due to its flexible data model and querying capabilities. As we onboarded more streaming services, the trace data volume started increasing exponentially. The increased operational burden of scaling Elasticsearch clusters due to the high data write rate became painful for us. Read queries took increasingly long to finish because Elasticsearch clusters were using heavy compute resources to index ingested traces. The high data ingestion rate eventually degraded both read and write operations. We solved this by migrating to Cassandra as our data store for handling high data ingestion rates. Using simple lookup indices in Cassandra gives us the ability to maintain acceptable read latencies while doing heavy writes.

In theory, scaling up horizontally would allow us to handle higher write rates and retain larger amounts of data in Cassandra clusters. This implies that the cost of storing traces grows linearly with the amount of data being stored. We needed to ensure storage cost growth was sub-linear to the amount of data being stored. In pursuit of this goal, we outlined the following storage optimization strategies:

  1. Use cheaper Elastic Block Store (EBS) volumes instead of SSD instance stores in EC2.
  2. Employ a better compression technique to reduce trace data size.
  3. Store only relevant and interesting traces by using simple rules-based filters.

We were adding new Cassandra nodes whenever the EC2 SSD instance stores of existing nodes reached maximum storage capacity. The use of a cheaper EBS Elastic volume instead of an SSD instance store was an attractive option because AWS allows dynamic increase in EBS volume size without re-provisioning the EC2 node. This allowed us to increase total storage capacity without adding a new Cassandra node to the existing cluster. In 2019 our stunning colleagues in the Cloud Database Engineering (CDE) team benchmarked EBS performance for our use case and migrated existing clusters to use EBS Elastic volumes. By optimizing the Time Window Compaction Strategy (TWCS) parameters, they reduced the disk write and merge operations of Cassandra SSTable files, thereby reducing the EBS I/O rate. This optimization helped us reduce the data replication network traffic amongst the cluster nodes because SSTable files were created less often than in our previous configuration. Additionally, by enabling Zstd block compression on Cassandra data files, the size of our trace data files was reduced by half. With these optimized Cassandra clusters in place, it now costs us 71% less to operate clusters and we can store 35x more data than in our previous configuration.

We observed that Edgar users explored less than 1% of collected traces. This insight led us to believe that we can reduce write pressure and retain more data in the storage system if we drop traces that users will not care about. We currently use a simple rule-based filter in our Storage Mantis job that retains interesting traces even for service call paths that are very rarely viewed in Edgar. The filter qualifies a trace as an interesting data point by inspecting all buffered spans of a trace for warnings, errors, and retry tags. This tail-based sampling approach reduced the trace data volume by 20% without impacting user experience. There is an opportunity to use machine learning based classification techniques to further reduce trace data volume.
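To make the rules-based approach concrete, here is a minimal sketch of such a filter over buffered spans; the tag names and the Span shape are assumptions for illustration, not the actual Storage Mantis job.

```java
import java.util.List;
import java.util.Map;

public class InterestingTraceFilter {

    // Hypothetical span shape: just an id and its key/value tags.
    public record Span(String spanId, Map<String, String> tags) {}

    // A trace is worth storing if any of its buffered spans carries an
    // error, warning, or retry tag; everything else can be dropped.
    public boolean isInteresting(List<Span> bufferedSpans) {
        for (Span span : bufferedSpans) {
            Map<String, String> tags = span.tags();
            if (tags.containsKey("error")
                    || tags.containsKey("warning")
                    || "true".equalsIgnoreCase(tags.get("retry"))) {
                return true;
            }
        }
        return false;
    }
}
```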

While we have made substantial progress, we are now at another inflection point in building our trace data storage system. Onboarding new user experiences on Edgar could require us to store 10x the amount of current data volume. As a result, we are currently experimenting with a tiered storage approach for a new data gateway. This data gateway provides a querying interface that abstracts the complexity of reading and writing data from tiered data stores. Additionally, the data gateway routes ingested data to the Cassandra cluster and transfers compacted data files from Cassandra cluster to S3. We plan to retain the last few hours worth of data in Cassandra clusters and keep the rest in S3 buckets for long term retention of traces.

Table 1. Timeline of Storage Optimizations

Secondary advantages

In addition to powering Edgar, trace data is used for the following use cases:

Application Health Monitoring

Trace data is a key signal used by Telltale in monitoring macro level application health at Netflix. Telltale uses the causal information from traces to infer microservice topology and correlate traces with time series data from Atlas. This approach paints a richer observability portrait of application health.

Resiliency Engineering

Our chaos engineering team uses traces to verify that failures are correctly injected while our engineers stress test their microservices via Failure Injection Testing (FIT) platform.

Regional Evacuation

The Demand Engineering team leverages tracing to improve the correctness of prescaling during regional evacuations. Traces provide visibility into the types of devices interacting with microservices such that changes in demand for these services can be better accounted for when an AWS region is evacuated.

Estimate infrastructure cost of running an A/B test

The Data Science and Product team factors in the costs of running A/B tests on microservices by analyzing traces that have relevant A/B test names as tags.

What’s next?

The scope and complexity of our software systems continue to increase as Netflix grows. We will focus on the following areas for extending Edgar:

  • Provide a great developer experience for collecting traces across all runtime environments. With an easy way to try out distributed tracing, we hope that more engineers instrument their services with traces and provide additional context for each request by tagging relevant metadata.
  • Enhance our analytics capability for querying trace data to enable power users at Netflix in building their own dashboards and systems for narrowly focused use cases.
  • Build abstractions that correlate data from metrics, logging, and tracing systems to provide additional contextual information for troubleshooting.

As we progress in building distributed tracing infrastructure, our engineers continue to rely on Edgar for troubleshooting streaming issues like “Why doesn’t Tiger King play on my phone?”. Our distributed tracing infrastructure helps in ensuring that Netflix members continue to enjoy a must-watch show like Tiger King!

We are looking for stunning colleagues to join us on this journey of building distributed tracing infrastructure. If you are passionate about Observability then come talk to us.


Building Netflix’s Distributed Tracing Infrastructure was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements, both functional and non-functional together, dictate the architectural pattern we choose to build the application. These patterns include multi-tier, event-driven architecture, microservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: Flow and agility to consistently deploy new features
  2. Observability: Feedback about the state of the application
  3. Disposability: Throwing away resources and provisioning new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with APIs, documentation, and other technologies that make the system easy to understand and use. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow: the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. The architecture also needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research prove that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

Using AWS, in order to achieve flow of deployability, we will use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.
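As a small illustration of publishing such CI/CD metrics, here is a hedged sketch using the AWS SDK for Java v2 and CloudWatch custom metrics; the namespace, metric name, and dimension are made up for the example.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class PipelineMetrics {

    // Publishes one data point per successful deployment so dashboards can
    // track deployment frequency per application over time.
    public static void recordDeployment(CloudWatchClient cloudWatch, String applicationName) {
        MetricDatum datum = MetricDatum.builder()
            .metricName("DeploymentCount")          // illustrative metric name
            .unit(StandardUnit.COUNT)
            .value(1.0)
            .dimensions(Dimension.builder()
                .name("Application")
                .value(applicationName)
                .build())
            .build();

        cloudWatch.putMetricData(PutMetricDataRequest.builder()
            .namespace("Example/CiCd")              // illustrative namespace
            .metricData(datum)
            .build());
    }
}
```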

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective of giving the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated.
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Edgar: Solving Mysteries Faster with Observability

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f

Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

by Elizabeth Carretto

Everyone loves Unsolved Mysteries. There’s always someone who seems like the surefire culprit. There’s a clear motive, the perfect opportunity, and an incriminating footprint left behind. Yet, this is Unsolved Mysteries! It’s never that simple. Whether it’s a cryptic note behind the TV or a mysterious phone call from an unknown number at a critical moment, the pieces rarely fit together perfectly. As mystery lovers, we want to answer the age-old question of whodunit; we want to understand what really happened.

For engineers, instead of whodunit, the question is often “what failed and why?” When a problem occurs, we put on our detective hats and start our mystery-solving process by gathering evidence. The more complex a system, the more places to look for clues. An engineer can find herself digging through logs, poring over traces, and staring at dozens of dashboards.

All of these sources make it challenging to know where to begin and add to the time spent figuring out what went wrong. While this abundance of dashboards and information is by no means unique to Netflix, it certainly holds true within our microservices architecture. Each microservice may be easy to understand and debug individually, but what about when combined into a request that hits tens or hundreds of microservices? Searching for key evidence becomes like digging for a needle in a group of haystacks.

Example call graph in Edgar

In some cases, the question we’re answering is, “What’s happening right now??” and every second without resolution can carry a heavy cost. We want to resolve the problem as quickly as possible so our members can resume enjoying their favorite movies and shows. For teams building observability tools, the question is: how do we make understanding a system’s behavior fast and digestible? Quick to parse, and easy to pinpoint where something went wrong even if you aren’t deeply familiar with the inner workings and intricacies of that system? At Netflix, we’ve answered that question with a suite of observability tools. In an earlier blog post, we discussed Telltale, our health monitoring system. Telltale tells us when an application is unhealthy, but sometimes we need more fine-grained insight. We need to know why a specific request is failing and where. We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

What is Edgar?

Edgar is a self-service tool for troubleshooting distributed systems, built on a foundation of request tracing, with additional context layered on top. With request tracing and additional data from logs, events, metadata, and analysis, Edgar is able to show the flow of a request through our distributed system — what services were hit by a call, what information was passed from one service to the next, what happened inside that service, how long did it take, and what status was emitted — and highlight where an issue may have occurred. If you’re familiar with platforms like Zipkin or OpenTelemetry, this likely sounds familiar. But, there are a few substantial differences in how Edgar approaches its data and its users.

  • While Edgar is built on top of request tracing, it also uses the traces as the thread to tie additional context together. Deriving meaningful value from trace data alone can be challenging, as Cindy Sridharan articulated in this blog post. In addition to trace data, Edgar pulls in additional context from logs, events, and metadata, sifting through them to determine valuable and relevant information, so that Edgar can visually highlight where an error occurred and provide detailed context.
  • Edgar captures 100% of interesting traces, as opposed to sampling a small fixed percentage of traffic. This difference has substantial technological implications, from the classification of what’s interesting to transport to cost-effective storage (keep an eye out for later Netflix Tech Blog posts addressing these topics).
  • Edgar provides a powerful and consumable user experience to both engineers and non-engineers alike. If you embrace the cost and complexity of storing vast amounts of traces, you want to get the most value out of that cost. With Edgar, we’ve found that we can leverage that value by curating an experience for additional teams such as customer service operations, and we have embraced the challenge of building a product that makes trace data easy to access, easy to grok, and easy for several user personas to gain insight from.

Tracing as a foundation

Logs, metrics, and traces are the three pillars of observability. Metrics communicate what’s happening on a macro scale, traces illustrate the ecosystem of an isolated request, and the logs provide a detail-rich snapshot into what happened within a service. These pillars have immense value and it is no surprise that the industry has invested heavily in building impressive dashboards and tooling around each. The downside is that we have so many dashboards. In one request hitting just ten services, there might be ten different analytics dashboards and ten different log stores. However, a request has its own unique trace identifier, which is a common thread tying all the pieces of this request together. The trace ID is typically generated at the first service that receives the request and then passed along from service to service as a header value. This makes the trace a great starting point to unify this data in a centralized location.

A trace is a set of segments representing each step of a single request throughout a system. Distributed tracing is the process of generating, transporting, storing, and retrieving traces in a distributed system. As a request flows between services, each distinct unit of work is documented as a span. A trace is made up of many spans, which are grouped together using a trace ID to form a single, end-to-end umbrella. A span:

  • Represents a unit of work, such as a network call from one service to another (a client/server relationship) or a purely internal action (e.g., starting and finishing a method).
  • Relates to other spans through a parent/child relationship.
  • Contains a set of key value pairs called tags, where service owners can attach helpful values such as urls, version numbers, regions, corresponding IDs, and errors. Tags can be associated with errors or warnings, which Edgar can display visually on a graph representation of the request.
  • Has a start time and an end time. Thanks to these timestamps, a user can quickly see how long the operation took.

The trace (along with its underlying spans) allows us to graphically represent the request chronologically.
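A minimal sketch of the span shape just described, following common Zipkin-style conventions; the field names are illustrative rather than Edgar’s actual storage model.

```java
import java.time.Instant;
import java.util.Map;

// One unit of work within a trace: a network call or a purely internal action.
public record Span(
        String traceId,        // shared by every span in the same request
        String spanId,
        String parentSpanId,   // null for the root span
        String serviceName,
        String operationName,
        Instant startTime,
        Instant endTime,
        Map<String, String> tags) {  // e.g. url, region, version, error details

    // Duration is derived from the two timestamps, so a timeline view can be drawn.
    public long durationMillis() {
        return endTime.toEpochMilli() - startTime.toEpochMilli();
    }
}
```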

Sample timeline view of a trace, based on Jaeger UI’s timeline view

Adding context to traces

With distributed tracing alone, Edgar is able to draw the path of a request as it flows through various systems. This centralized view is extremely helpful to determine which services were hit and when, but it lacks nuance. A tag might indicate there was an error but doesn’t fully answer the question of what happened. Adding logs to the picture can help a great deal. With logs, a user can see what the service itself had to say about what went wrong. If a data fetcher fails, the log can tell you what query it was running and what exact IDs or fields led to the failure. That alone might give an engineer the knowledge she needs to reproduce the issue. In Edgar, we parse the logs looking for error or warning values. We add these errors and warnings to our UI, highlighting them in our call graph and clearly associating them with a given service, to make it easy for users to view any errors we uncovered.

Example view of errors associated with a service, including an error parsed from a log

With the trace and additional context from logs illustrating the issue, one of the next questions is how this individual trace fits into the overall health and behavior of each service. Is this an anomaly or are we dealing with a pattern? To help answer this question, Edgar pulls in anomaly detection from a partner application, Telltale. Telltale provides Edgar with latency benchmarks that indicate if the individual trace’s latency is abnormal for this given service. A trace alone could tell you that a service took 500ms to respond, but it takes in-depth knowledge of a particular service’s typical behavior to determine whether this response time is an outlier. Telltale’s anomaly analysis looks at historic behavior and can evaluate whether the latency experienced by this trace is anomalous. With this knowledge, Edgar can then visually warn that something happened in a service that caused its latency to fall outside of normal bounds.
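Telltale’s actual analysis is more sophisticated, but the core idea can be sketched as flagging a latency that sits far outside a service’s historical baseline; the threshold here is arbitrary.

```java
public class LatencyAnomalyCheck {

    // Flags a latency as anomalous if it is more than `threshold` standard
    // deviations above the historical mean for that service.
    public static boolean isAnomalous(double observedMillis,
                                      double historicalMeanMillis,
                                      double historicalStdDevMillis,
                                      double threshold) {
        if (historicalStdDevMillis <= 0) {
            return observedMillis > historicalMeanMillis; // degenerate baseline
        }
        double zScore = (observedMillis - historicalMeanMillis) / historicalStdDevMillis;
        return zScore > threshold;
    }
}
```

For instance, a 500ms response against a historical mean of 120ms with a standard deviation of 40ms yields a z-score of 9.5 and would be flagged with a threshold of 3.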

Sample latency analysis

Edgar should reduce burden, not add to it

Presenting all of this data in one interface reduces the footwork of an engineer to uncover each source. However, discovery is only part of the path to resolution. With all the evidence presented and summarized by Edgar, an engineer may know what went wrong and where it went wrong. This is a huge step towards resolution, but not yet cause for celebration. The root cause may have been identified, but who owns the service in question? Many times, finding the right point of contact would require a jump into Slack or a company directory, which costs more time. In Edgar, we have integrated with our services to provide that information in-app alongside the details of a trace. For any service configured with an owner and support channel, Edgar provides a link to a service’s contact email and their Slack channel, smoothing the hand-off from one party to the next. If an engineer does need to pass an issue along to another team or person, Edgar’s request detail page contains all the context — the trace, logs, analysis — and is easily shareable, eliminating the need to write a detailed description or provide a cascade of links to communicate the issue.

Edgar’s request detail page

A key aspect of Edgar’s mission is to minimize the burden on both users and service owners. With all of its data sources, the sheer quantity of data could become overwhelming. It is essential for Edgar to maintain a prioritized interface, built to highlight errors and abnormalities to the user and assist users in taking the next step towards resolution. As our UI grows, it’s important to be discerning and judicious in how we handle new data sources, weaving them into our existing errors and warnings models to minimize disruption and to facilitate speedy understanding. We lean heavily on focus groups and user feedback to ensure a tight feedback loop so that Edgar can continue to meet our users’ needs as their services and use cases evolve.

As services evolve, they might change their log format or use new tags to indicate errors. We built an admin page to give our service owners that configurability and to decouple our product from in-depth service knowledge. Service owners can configure the essential details of their log stores, such as where their logs are located and what fields they use for trace IDs and span IDs. Knowing their trace and span IDs is what enables Edgar to correlate the traces and logs. Beyond that though, what are the idiosyncrasies of their logs? Some fields may be irrelevant or deprecated, and teams would like to hide them by default. Alternatively, some fields contain the most important information, and by promoting them in the Edgar UI, they are able to view these fields more quickly. This self-service configuration helps reduce the burden on service owners.

Initial log configuration in Edgar

Leveraging Edgar

In order for users to turn to Edgar in a situation when time is of the essence, users need to be able to trust Edgar. In particular, they need to be able to count on Edgar having data about their issue. Many approaches to distributed tracing involve setting a sample rate, such as 5%, and then only tracing that percentage of request traffic. Instead of sampling a fixed percentage, Edgar’s mission is to capture 100% of interesting requests. As a result, when an error happens, Edgar’s users can be confident they will be able to find it. That’s key to positioning Edgar as a reliable source. Edgar’s approach makes a commitment to have data about a given issue.

In addition to storing trace data for all requests, Edgar implemented a feature to collect additional details on-demand at a user’s discretion for a given criteria. With this fine-grained level of tracing turned on, Edgar captures request and response payloads as well as headers for requests matching the user’s criteria. This adds clarity to exactly what data is being passed from service to service through a request’s path. While this level of granularity is unsustainable for all request traffic, it is a robust tool in targeted use cases, especially for errors that prove challenging to reproduce.

As you can imagine, this comes with very real storage costs. While the Edgar team has done its best to manage these costs effectively and to optimize our storage, the cost is not insignificant. One way to strengthen our return on investment is by being a key tool throughout the software development lifecycle. Edgar is a crucial tool for operating and maintaining a production service, where reducing the time to recovery has direct customer impact. Engineers also rely on our tool throughout development and testing, and they use the Edgar request page to communicate issues across teams.

By providing our tool to multiple sets of users, we are able to leverage our cost more efficiently. Edgar has become not just a tool for engineers, but rather a tool for anyone who needs to troubleshoot a service at Netflix. In Edgar’s early days, as we strove to build valuable abstractions on top of trace data, the Edgar team first targeted streaming video use cases. We built a curated experience for streaming video, grouping requests into playback sessions, marked by starting and stopping playback for a given asset. We found this experience was powerful for customer service operations as well as engineering teams. Our team listened to customer service operations to understand which common issues caused an undue amount of support pain so that we could summarize these issues in our UI. This empowers customer service operations, as well as engineers, to quickly understand member issues with minimal digging. By logically grouping traces and summarizing the behavior at a higher level, trace data becomes extremely useful in answering questions like why a member didn’t receive 4k video for a certain title or why a member couldn’t watch certain content.

An example error viewing a playback session in Edgar

Extending Edgar for Studio

As the studio side of Netflix grew, we realized that our movie and show production support would benefit from a similar aggregation of user activity. Our movie and show production support might need to answer why someone from the production crew can’t log in or access their materials for a particular project. As we worked to serve this new user group, we sought to understand what issues our production support needed to answer most frequently and then tied together various data sources to answer those questions in Edgar.

The Edgar team built out an experience to meet this need, building another abstraction with trace data; this time, the focus was on troubleshooting production-related use cases and applications, rather than a streaming video session. Edgar provides our production support with the ability to search for a given contractor, vendor, or member of production staff by their name or email. After finding the individual, Edgar reaches into numerous log stores for their user ID, and then pulls together their login history, role access change log, and recent traces emitted from production-related applications. Edgar scans through this data for errors and warnings and then presents those errors right at the front. Perhaps a vendor tried to log in with the wrong password too many times, or they were assigned an incorrect role on a production. In this new domain, Edgar is solving the same multi-dashboarded problem by tying together information and pointing its users to the next step of resolution.

An example error for a production-related user

What Edgar is and is not

Edgar’s goal is not to be the be-all, end-all of tools or to be the One Tool to Rule Them All. Rather, our goal is to act as a concierge of troubleshooting: Edgar should quickly be able to guide users to an understanding of an issue, as well as usher them to the next location, where they can remedy the problem. Let’s say a production vendor is unable to access materials for their production due to an incorrect role/permissions assignment, and this production vendor reaches out to support for assistance troubleshooting. When a support user searches for this vendor, Edgar should be able to indicate that this vendor recently had a role change and summarize what this role change is. Instead of being assigned to Dead To Me Season 2, they were assigned to Season 1! In this case, Edgar’s goal is to help a support user come to this conclusion and direct them quickly to the role management tool where this can be rectified, not to own the full circle of resolution.

Usage at Netflix

While Edgar was created around Netflix’s core streaming video use-case, it has since evolved to cover a wide array of applications. While Netflix streaming video is used by millions of members, some applications using Edgar may measure their volume in requests per minute, rather than requests per second, and may only have tens or hundreds of users rather than millions. While we started with a curated approach to solve a pain point for engineers and support working on streaming video, we found that this pain point is scale agnostic. Getting to the bottom of a problem is costly for all engineers, whether they are building a budget forecasting application used heavily by 30 people or an SVOD application used by millions.

Today, many applications and services at Netflix, covering a wide array of type and scale, publish trace data that is accessible in Edgar, and teams ranging from service owners to customer service operations rely on Edgar’s insights. From streaming to studio, Edgar leverages its wealth of knowledge to speed up troubleshooting across applications with the same fundamental approach of summarizing request tracing, logs, analysis, and metadata.

As you settle into your couch to watch a new episode of Unsolved Mysteries, you may still find yourself with more questions than answers. Why did the victim leave his house so abruptly? How did the suspect disappear into thin air? Hang on, how many people saw that UFO?? Unfortunately, Edgar can’t help you there (trust me, we’re disappointed too). But, if your relaxing evening is interrupted by a production outage, Edgar will be behind the scenes, helping Netflix engineers solve the mystery at hand.

Keeping services up and running allows Netflix to share stories with our members around the globe. Underneath every outage and failure, there is a story to tell, and powerful observability tooling is needed to tell it. If you are passionate about observability then come talk to us.


Edgar: Solving Mysteries Faster with Observability was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Telltale: Netflix Application Monitoring Simplified

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba

By Andrei U., Seth Katz, Janak Ramachandran, Jeff Butsch, Peter Lau, Ram Vaithilingam, and Greg Burrell

Our Telltale Vision

An alert fires and you get paged in the middle of the night. A metric crossed a threshold. You’re half awake and wondering, “Is there really a problem or is this just an alert that needs tuning? When was the last time somebody adjusted our alert thresholds? Maybe it’s due to an upstream or downstream service?” This is a critical application so you drag yourself out of bed, open your laptop, and start poring through dashboards for more info. You’re not yet convinced there’s a real problem but you’re also aware that the clock is ticking as you dig through a mountain of data looking for clues.

Healthy Netflix services are essential to member joy. When you sit down to watch “Tiger King” you expect it to just play. Over the years we’ve learned from on-call engineers about the pain points of application monitoring: too many alerts, too many dashboards to scroll through, and too much configuration and maintenance. Our streaming teams need a monitoring system that enables them to quickly diagnose and remediate problems; seconds count! Our Node team needs a system that empowers a small group to operate a large fleet.

So we built Telltale.

The Telltale UI shows health over time and across multiple services.
The Telltale timeline.

Telltale combines a variety of data sources to create a holistic view of an application’s health. Telltale learns what constitutes typical health for an application, no alert tuning required. And because we know what’s healthy, we can let application owners know when their services are trending towards unhealthy.

Metrics are a key part of understanding application health. But sometimes you can have too many metrics, too many graphs, and too many dashboards. Telltale shows only the relevant data from the application plus that of upstream and downstream services. We use colors to indicate severity (users can opt to have Telltale display numbers in addition to colors) so users can tell, at a glance, the state of their application’s health. We also highlight interesting broader events such as regional traffic evacuations and nearby deployments, information that is vital to understanding health holistically. Especially during an incident.

That is our Telltale vision. It exists today and monitors the health of over 100 Netflix production-facing applications.

A call graph with four layers of services.
An application lives in an ecosystem

The Application Health Model

A microservice doesn’t live in isolation. It usually has dependencies, talks to other services, and lives in different AWS regions. The call graph above is a relatively simple one; call graphs can be much deeper, with dozens of services involved. An application is part of an ecosystem that can be subtly influenced by property changes or radically altered by region-wide events. The launch of a canary can affect an application, as can an upstream or downstream deployment.

Telltale uses a variety of signals from multiple sources to assemble a constantly evolving model of the application’s health.

Different signals have different levels of importance to an application’s health. For example, a latency increase is less critical than an error rate increase, and some error codes are less critical than others. A canary launch two layers downstream might not be as significant as a deployment immediately upstream. A regional traffic shift means one region ends up with zero traffic while another region has double. You can imagine the impact that has on metrics. A metric’s meaning determines how we should interpret it.

Telltale takes all those factors into consideration when constructing its view of application health.
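As a toy illustration of weighting signals differently, here is a sketch that folds normalized signals into a single degradation score; the signal names, weights, and scale are invented for the example and are not Telltale’s model.

```java
import java.util.Map;

public class HealthScore {

    // Illustrative weights: error rate dominates, latency matters less.
    private static final Map<String, Double> WEIGHTS = Map.of(
        "errorRate", 0.5,
        "latency", 0.2,
        "upstreamDeployment", 0.15,
        "regionalTrafficShift", 0.15);

    // Each signal is normalized to [0, 1], where 1 means fully unhealthy;
    // the result is a weighted degradation score for the application.
    public static double degradation(Map<String, Double> normalizedSignals) {
        double score = 0.0;
        for (Map.Entry<String, Double> weight : WEIGHTS.entrySet()) {
            double signal = normalizedSignals.getOrDefault(weight.getKey(), 0.0);
            score += weight.getValue() * Math.min(1.0, Math.max(0.0, signal));
        }
        return score; // 0 = healthy, 1 = maximally degraded
    }
}
```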

The application health model is the heart of Telltale.

Intelligent Monitoring

Every service operator knows the difficulty of alert tuning. Set thresholds too low and you get a deluge of spurious alerts. So you overcompensate and relax the tuning to the point of missing important health warnings. The end result is a lack of trust in alerts. Telltale is built on the premise that you shouldn’t have to constantly tune configuration.

We make setup and configuration easy for application owners by providing curated and managed signal packs. These packs are combined into application profiles to address most common service types. Telltale automatically tracks dependencies between services to build the topology used in the application health model. Signal packs and topology detection keep configuration up-to-date with minimal effort. Those who want a more hands-on approach can still do manual configuration and tuning.

No single algorithm can account for the wide variety of signals we use. So, instead, we employ a mix of algorithms including statistical, rule based, and machine learning. We’ll do a future Netflix Tech Blog article focused on our algorithms. Telltale also has analyzers to detect long-term trends or memory leaks. Intelligent monitoring means results our users can trust. It means a faster time to detection and a faster time to resolution during an incident.

Intelligent Alerting

Intelligent monitoring yields intelligent alerting. Telltale creates an issue when it detects a health problem in your application’s ecosystem. Teams can opt in to alerting via Slack, email, or PagerDuty (all powered by our internal alerting system). If the issue is caused by an upstream or downstream system then Telltale’s context-aware routing alerts that team instead. Intelligent alerting also means a team receives a single notification, alert storms are a thing of the past.

An example of a Telltale alert notification in Slack.

When a problem strikes, it’s essential to have the right information. Our Slack alerts also start a thread containing only the most relevant context about the incident. This includes the signals that Telltale identified as unhealthy and the reasons why. The right context provides a better understanding of the application’s current state so the on-call engineer can return it to health.

Incidents evolve and have their own lifecycle, so updates are essential. Are things getting better or worse? Are there new signals or events to consider? Telltale updates the Slack thread as the current incident unfolds. The thread is marked Resolved upon return to healthy state so users know, at a glance, which incidents are ongoing and which have been successfully remediated.

But these Slack threads aren’t just for Telltale. Teams use them to share additional data, observations, theories, and discussion about the incident. Incident data and discussion all in one thread makes for shared understanding, faster resolution, and easier post-incident analysis.

We strive to improve the quality of Telltale alerts. One way to do that is to learn from our users. So we provide feedback buttons right in the Slack message. Users can tell us to suppress future occurrences of an alert. Or provide a reason for why an alert isn’t actionable. Intelligent alerting means alerts our users can trust.

An example of the details found in a Telltale notification in Slack: which metrics and which triggers fired.

Why Is My Service Unhealthy?

A wide variety of signals, knowledge of the application’s ecosystem, and correlation of signals across multiple services helps Telltale to detect the possible causes of an application’s degraded health. Causes such as an outlier instance, a canary or deployment by a dependent service, an unhealthy database, or just a spike in traffic. Highlighting possible causes saves valuable time during an incident.

Incident Management

An incident summary gathers relevant information for the team.
An example of a Telltale incident summary.

When Telltale sends an alert it also creates a snapshot that has references to the unhealthy signals. As new information arrives, it’s added to this snapshot. This simplifies the post-incident review process for many teams. When it’s time to review past issues, the Application Incident Summary feature shows all aspects of recent issues in a single place including key metrics like total downtime and MTTR (Mean Time To Resolution). We want to help our teams see larger patterns of incidents so they can improve overall service availability.

The cluster view groups similar incidents so teams can understand the larger patterns.

Deployment Monitoring

Telltale’s application health model and intelligent monitoring have proven so powerful that we’re also using it for safer deployments. We start with Spinnaker, our open source delivery platform. As Spinnaker slowly rolls out a new build we use Telltale to continuously monitor the health of the instances running the new build. Continuous monitoring means a deployment stops and rolls back at the first sign of a problem. It means deployment problems have smaller blast radius and a shorter duration.

Continuous Improvement

Operating microservices in a complex ecosystem is challenging. We’re thrilled that Telltale’s intelligent monitoring and alerting helps our service operators improve availability, reduce toil, and sleep better at night. But we’re not done. We’re constantly exploring new algorithms to improve the accuracy of our alerts. We’ll write more about that in a future Netflix Tech Blog post. We’re also evaluating improvements to our application health model. We believe there’s useful information in service log and trace data, and benefits to employing higher-resolution metrics. We’re looking forward to collaborating with our platform team on building out those new features. Getting new applications onto Telltale has been a white-glove treatment, which doesn’t scale well; we can definitely improve our self-service UI. And we know there are better heuristics to help pinpoint what’s affecting your service health.

Telltale is application monitoring simplified.

A healthy Netflix service enables us to entertain the world. Correlating disparate signals to model health in realtime is challenging. Add in thousands of streaming device types, an ever-evolving architecture, and a growing content production ecosystem and the problem becomes fascinating. If you’re passionate about observability then come talk to us.


Telltale: Netflix Application Monitoring Simplified was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.