All posts by Blogs on Grafana Labs Blog

Meet the Grafana Labs Team: Hugo Häggmark

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/19/meet-the-grafana-labs-team-hugo-h%C3%A4ggmark/

As Grafana Labs continues to grow, we’d like you to get to know the team members who are building the cool stuff you’re using.
Check out the first of our Friday team profiles.

Meet Hugo!

Name: Hugo Häggmark

Grafana Labs Fullstack Developer Hugo Häggmark

Current location/time zone:
I’m in Vallentuna, 30 min. from Stockholm, CET.

What do you do at Grafana Labs?
I’m a fullstack developer on the Grafana team focusing on frontend.

What open source projects do you contribute to?
Grafana. I’m currently working on Explore and making that an even better and more performant part of Grafana.

What are your GitHub and Twitter handles?
I left Twitter last year (and so should everyone 🙂). My GitHub handle is hugohaggmark.

What do you like to do in your free time?
I’ve been married since 2003, but we’ve been together since 1998. We have three kids, ages 19, 17, and 3, so most of my spare time goes to the family and all the activities that come with that. I enjoy cooking, playing long RPGs like Fallout, snowboarding, and free diving/snorkeling.

What’s your favorite new gadget or tech toy?
The IoT kit from the latest GrafanaCon.

What’s your ideal environment for coding?
It depends. When I want 100% focus, I’ll listen to instrumental playlists like this. Otherwise I enjoy hearing people around me.

What do you do to get “in the zone” when you code?
I turn off all notifications and put my noise-canceling headphones on.

What Game of Thrones character are you?
Daenerys Targaryen.

Spaces or tabs?
Spaces of course. 😉

Everything You Need to Know About the OSS Licensing War, Part 3.

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/18/everything-you-need-to-know-about-the-oss-licensing-war-part-3./

In Parts One and Two of this blog, we looked back at the ongoing open source licensing wars, focusing on the evolving situation between Elastic N.V. and AWS. In this final installment, we’ll offer some opinions on the situation, as well as share our own views on how we’re reacting at Grafana Labs.

So Who’s Right?

As faithful readers of the previous two installments hopefully realize by now: It’s complicated. Both parties have blood on their hands.

The knock against commercial open source companies is that they are trying to have their cake and eat it too. They want to stop the public cloud vendors from offering their software as a service and making money off their work. But preventing someone from using your software for any purpose violates a core freedom of open source.

Without those freedoms, many successful open source companies would not have been able to garner the adoption, mindshare, and support they enjoy today. With these new restrictions, they’re turning their backs on open source and kicking the ladder out from under the very community that helped them climb.

The knock against the public cloud companies is that they have been “unfairly” and “unsustainably” monetizing open source projects and companies; they’re making billions largely off the code others are creating. Plus, due to scale and channel advantages, the public cloud can stifle companies trying to offer their own software as a service.

Reversion to a Monoculture

But without contributing back in any notable way (either through people or dollars), the public clouds definitely haven’t lived up to the open source ideal. What’s happening isn’t a sustainable cycle.

I’ve mentioned how Linux and Red Hat both drove massive value and innovation for the whole infrastructure ecosystem. We’re talking about trillions of dollars of value, across many companies. It was truly a rising tide that lifted all boats.

I don’t feel the same way currently about the rise of AWS or Amazon. To me it feels more like a tax on our infrastructure, and the reversion to the kind of monoculture that the open source community fought so hard against.

Or the Beginning of a Change?

In Part Two, I pondered whether the new breed of open source companies could monetize fast enough to command their eye-popping valuations, and whether they’d be able to capture enough of the value that their open source projects create.

Arguably being the “center of mass” for their communities is the most valuable asset that companies like Elastic N.V. and MongoDB Inc. have.

In releasing the Open Distro for Elasticsearch, AWS forced another, perhaps more interesting question to commercial open source companies: What happens if you get the balance of value creation and value capture wrong? Does it leave your community vulnerable to getting stolen right out from under you?

What Does This Mean for Elastic N.V.?

I’ve written about my admiration for Elastic N.V. before. I think they’ve built an incredible business.

Overall, AWS’s move could be a positive thing for the Elasticsearch community, but it could really hurt Elastic N.V. the company. All of a sudden, many of its commercial differentiators are freely available in a permissive Apache2-licensed fork. That’s a huge potential hit to their monetization plans.

But running an open source project isn’t as simple as throwing some code over the wall and putting up a web page, as AWS has done. It’s about things like shepherding and supporting the community, encouraging committers, and having momentum on a compelling vision. AWS does not have a good track record at any of this. Yet.

What Does This Mean for Grafana Labs?

We’ve watched these developments with great interest and had a lot of internal discussions and debates. We are still fine-tuning our strategy but are ready to double down on a few important details.

Firstly, we will continue to partner with cloud vendors. Grafana Labs is all about meeting our customers where they are; we have an inclusive “big tent” philosophy which is about more than connecting different data sources – it’s about connecting different communities and being a trusted advisor to our customers. That’s why we’ve entered into commercial agreements with Microsoft and Google – to provide a first-class experience for our common users who use Grafana with Azure Monitor or Stackdriver. We hope to replicate this arrangement with AWS.

In addition, while the overwhelming majority of the code we write is open source, we will continue to offer an Enterprise version that is commercially licensed and clearly not open source. But, we will not play games with licensing. Our open source software will be licensed under a standard OSI license, and our commercial software will not be open source. We will be very deliberate about what we monetize and whom we are targeting with our commercial products.

Finally, we will be watching the situation very carefully, and we will be very transparent with our plans if and when they change. We don’t take the community we’ve fostered for granted. Far from it, we consider it to be the foundation that everything we do depends on.

Maybe this is all wishful thinking, and I’m in denial because I run one of these new commercial open source companies. I hope not, because otherwise I’m just poking the 800-pound gorilla in the room.

How Bloomberg Tracks Hundreds of Billions of Data Points Daily with MetricTank and Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/17/how-bloomberg-tracks-hundreds-of-billions-of-data-points-daily-with-metrictank-and-grafana/

Bloomberg is best known as a media company with its news destination site, its award-winning magazine Bloomberg Businessweek, and its daily 24-7 social media program, Tic Toc, on Twitter.

But the main product for the 38-year-old company is actually Bloomberg Terminal, a software system that aggregates real-time market data and delivers financial news to more than 325,000 subscribers around the world. The enterprise premium service handles about 120 billion (that’s billion) pieces of data from financial markets daily, 2 million stories from its news division and affiliates, and a messaging network (think “Instant Bloomberg”) that fields 1 billion messages.

“With all this, people seem to notice when it doesn’t work,” said Stig Sorenson, Head of Production Visibility Group at Bloomberg. “We have had a few outages that were high profile, so about three years ago we decided to embark on a journey where we took telemetry a bit more seriously.”

Since 2015, Bloomberg’s central telemetry team has been growing steadily – and the same could be said for its influence within the company. Today, “we’re storing 5 million data points a second and running over 2,500 rules on our metrics stream,” said Software Developer Sean Hanson during his talk at GrafanaCon 2019 in Los Angeles. “We also do a bit more on the log side with about 100 terabytes of raw log data a day and a lot of legacy log rules.”

Their most impressive feat, however, was rallying 5,500 engineers around streamlining their monitoring systems. “It’s a hard problem when you have a lot of independent users and a lot of teams,” Hanson admitted. “For a lot of users, monitoring is not a priority … So we try to give them as much as we can without them actually having to do something.”

Here’s how the telemetry team embarked on their “safari of stability” by solving three key problems within their infrastructure.

1. Centralizing Data

Historically Bloomberg isolated software so that outages only affected a small subset of customers. Teams would get an alert about a sub-failure and evaluate their individual products, utilizing their own data sources and their own telemetry stack.

However, working in silos led to issues when there was either an unforeseen single point of failure or when the alerts would snowball into multiple failures.

With teams working in isolation, “the outage would linger until it got bad enough that someone from the outside, either on our environment support team or a high-level manager, would be like, ‘Hey, I think you might all be working on the same problem independently,’” said Hanson. “Then the teams would piece together all of the individual data to track down a root cause.”

The telemetry team’s first step to stem this problem was to deploy agents to as many machines as possible to collect system metrics – file system, operating system, etc. – as well as each machine’s process tables. The telemetry team also worked with key infrastructure teams to gain insight into system frameworks within individual services, queues, or databases.

The goal was to centralize the data and provide a broader picture of the operating infrastructure at any given time. The Head of Engineering now has high-level system health dashboards in place to monitor outages. “Once we provided all of these displays, we were able to narrow down the pieces of data that could help triage outages as they happened or prevent them if we could alert on them,” said Hanson.

The Grafana dashboards also became valued assets throughout the organization, from high-level execs such as Sorenson who want a monitoring overview, to developers who want drill-down links on all the panels, to programmers who extract insights through a query API for more complex analysis.

“We have one place that users go to, to configure everything for their metrics, logs, alerts, Grafana folders, and distributed traces,” said Hanson.

More importantly, the team automated processes to implement SRE best practices moving forward. Firm-wide rules around CPU, memory, file system storage, and service frameworks “take effect as soon as users create a new service or spin up a new machine,” said Hanson. Plus, because they are hooked into the machine-building process now, “even the machine creation process can publish its own metrics and report failures.”

2. Unifying Alerts

Prior to the formation of the telemetry team, Bloomberg had various systems that created different notifications.

Now alerts are not only centralized; a link is also served with every alert showing correlation – what happened around the same time on the same set of machines or for the same basic rule – so teams can quickly detect whether it is an isolated incident or a problem with their software.

“All of the alerts we generate have a similar look and feel and a base set of information that we require which include our remediation plan,” said Hanson. “Every alert that comes out should have an action or a call to action available to you right at the top.”

In the case of tags, the team enforces tag key registration, not tag values, to ensure that when users try to register PIDs or timestamps as tag keys, they are alerted that they are off-base.

“We really wanted it to be easy to do the right thing, hard to do the wrong thing, but still possible to do something non-standard that we decide is sane,” added Hanson. “We designed our APIs to try to facilitate this.”

Recently the team has taken the initiative to meet with cross-functional teams “to talk about use cases and guide people to pre-built solutions where they exist or teach them how to use existing tools to build on top of,” said Hanson. “If there’s a really good use case or if we see it a lot, we just build it into our system so that people don’t have to even think about it. They just get it.”

3. Simplifying Queries

Bloomberg’s dashboards live in one “massive” Grafana instance that provides templates and uses the same query language and API as Graphite.
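
In practice, that means anything that can talk to Graphite’s render API can talk to this setup too. A request looks roughly like this (the metric path and function here are made up for illustration):

GET /render?target=aliasByNode(servers.*.cpu.usage, 1)&from=-1h&format=json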

As users adapted to using the metrics, “we had growing dimensionality where users really want to drill down a lot into their data,” said Hanson. “So they want to keep adding more tags or labels, and some of the frameworks like Kubernetes cause a lot of transient time series to come around.”

In other words, more time series means more RAM. So this is where MetricTank came into play.

With MetricTank’s pattern-based pruning rules in the index and pattern-based retention rules, “we came up with a Goldilocks approach where users could pick their favorite flavor from three,” said Hanson. “If they want aggregate data for 10 years of trend analysis, they can pick longer lived. Or if they want advanced drilldowns, they can do that, but they don’t get the data for as long … We let the users pick, we apply it as a tag, and MetricTank does the rest.”
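
For a sense of what “MetricTank does the rest” can look like, MetricTank understands Graphite-style storage-schemas.conf retention rules, so a pick-your-flavor setup could be sketched roughly like this (section names, patterns, and the retention tag below are hypothetical, not Bloomberg’s actual configuration):

[long-lived-aggregates]
pattern = ;retention=long
retentions = 1m:7d,30m:10y

[high-resolution-drilldown]
pattern = ;retention=short
retentions = 10s:30d

[default]
pattern = .*
retentions = 1m:1y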

One snag the team hit with MetricTank came after releasing their query API to users for programmatic access. Problems came up for “[users] making a lot of queries sequentially or for users using the tag-based auto-complete in Grafana,” said Hanson. “When you get two-, three-, five-second lag on auto-complete, it’s pretty frustrating and noticeable. The more beta users that came onto the query API, the slower it got, until we had a daily report that took two days to run.”

Working closely with the Grafana team, the Bloomberg team implemented speculative querying which issues redundant queries to other replicas when slow peers were detected. This reduced the run time for daily reports down to four hours. “We also implemented native functions in MetricTank which prevented proxying through Graphite Web,” explained Hanson. “After that, we are now down to an hour and a half for our daily report. So there’s really not much more we can optimize there without actually trying to optimize the report itself.”

So What’s Next?

Improving dashboard discoverability is the next item on the Bloomberg Telemetry punch list.

Bloomberg currently has more than 3,500 dashboards in more than 500 folders, and there are many generic dashboards that prove to be popular and copied internally. But while imitation is the best form of flattery, it’s a poor form of organization.

Dashboards get copied repeatedly with names that are barely distinguishable from one another, to the point that it’s hard to organically surface the original dashboards within the system. “People were only finding them by being linked through tickets,” said Hanson.

While folders and permission settings help limit access and editing rights to key dashboards, they didn’t solve the issue of rampant dashboard copies appearing in the system. So Bloomberg again reached out to Grafana Labs for a solution.

Together, the teams are working to enhance the auto-complete function. “This would allow us to search for keywords, descriptions, or even maybe metric names or tag labels inside the queries, which would be great,” said Hanson.

The goal is for Bloomberg’s telemetry team to “score” dashboards based on popularity with a custom weighting system. They are hoping to develop functions such as tagging dashboards “official” vs. “experimental” so users know which ones are more reliable compared to others.

Another big project: Meta tags, a seamless and cost-effective way to add metadata, is also in the works.

In creating a sustainable monitoring infrastructure, “starting with some good open source technology gets you a big step up,” said Hanson. “But since you didn’t build it, it might not work for you right off the bat. So you can’t be afraid to jump in and improve the product for yourself and the wider community.”

After all, “investing in telemetry pays dividends,” said Hanson, “which is my obligatory financial joke I save for the very end.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Grafana Plugin Tutorial: Polystat Panel (Part 2)

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/16/grafana-plugin-tutorial-polystat-panel-part-2/

Polystat grafana-polystat-panel plugin

Continuing from Grafana Plugin Tutorial: Polystat Panel (Part 1)

At the end of Part 1, we had a set of polygons representing each Cassandra node in a Kubernetes statefulset, with cAdvisor-based CPU/memory/disk utilization metrics scraped by Prometheus.

This second tutorial will focus on a rollup of multiple Cassandra clusters running inside Kubernetes.

We will end up with three dashboards tied together to provide an overview of our Cassandra clusters.

The dashboard will start with the overview:
dashboard1 goal

Hovering over a cluster will show the metrics included in the Tooltip:
dashboard1 goal tooltip

Clicking on one of the clusters, in this case Ops, will take you to a per-node view:

dashboard2 goal

dashboard2 tooltip

Clicking on a node will take you to a detailed metric view:

dashboard3 goal

High-Level Rollups

The previous panel showed each node in a cluster and displayed the metrics associated with each node. To indicate a higher-level view, the composite just needs to be modified to match all nodes.

Here’s one of the queries being used:

irate(container_cpu_usage_seconds_total{namespace="$namespace", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster=~"$Cluster"}[1m])

A single composite is used like this:

composite example setting

panel goal

Hovering over one of the clusters will show the metrics:

panel tooltips

The polygon displayed now represents the “worst” state of the cluster, considering all metrics for each node.

Seeing one polygon is not as useful as seeing all Cassandra clusters.

We’ll use template variables next to make this easier to maintain.

Rollup for Multiple Clusters

Add the following template variable:

label_values(kube_pod_container_info{namespace=~"metrictank"}, cluster)

Under Selection Options enable Multi-Value.

Next adjust the queries so the cluster is used as a parameter:

irate(container_cpu_usage_seconds_total{namespace="$namespace", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster=~"$Cluster"}[1m])
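
With Multi-Value enabled, Grafana expands $Cluster into a pipe-separated regex group, so selecting two (hypothetical) clusters named ops and dev makes the matcher evaluate roughly as:

cluster=~"(ops|dev)"

which is why the query uses the =~ regex operator rather than =.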

Also set the Legend to:

{{cluster}} {{pod_name}} CPU

Repeat the same changes for each query.

Lastly, add a composite for each cluster. (This will be easier to do soon!)

Also update the default clickthrough to be:

dashboard/db/polystat-part-2-drilldown1?var-namespace=metrictank&var-Cluster=${Cluster}

composite 1

Now when you select multiple clusters at the top, you will get a rollup for each cluster:

selected clusters

panel tooltips

Bringing it All Together

The overall idea is to provide a top-level view of all Cassandra clusters with the ability to drill down to a dashboard with more details (in this case another polystat-based dashboard).

To get to the next dashboard, update the default clickthrough:

dashboard/db/polystat-part-2-drilldown1?var-namespace=metrictank&var-Cluster=${Cluster}

Create a new dashboard. Note: You can use a copy of Part 1’s final dashboard or download the example files.

Update the default clickthrough in this new dashboard to point to:

dashboard/db/polystat-part-2-drilldown2?var-namespace=metrictank&var-Cluster=${Cluster}

Modify the metrics to include the cluster name.

Modify the composites to include the cluster name.

Lastly, we’ll update the composites again in the drilldown to go to another dashboard with more detailed metrics.

dashboard3 goal

As a bonus, one of the nodes will go to a different detail dashboard by adding a clickthrough in COMPOSITE4:

dashboard/db/cassandra?var-environment=ops-us-east

dashboard3 alt drilldown

dashboard3 goal alt

Panels!

These panels have been published to grafana.com and can be downloaded here:

  1. Basic Rollup
  2. Templated Rollup
  3. Drilldown to Polystat (templated)
  4. Drilldown to Metrics (templated)

What’s Next?

Tooltip Width

The tooltip may be too narrow to show all metrics in a single line. The ability to customize the width would be very useful. Automatic sizing would also be a good addition.

Template Variables in Composites

The example above demonstrates the need for template variable interpolation in a few places.

If composites can use a template variable as part of the name (or the name itself), the multi-selector will function correctly and each rollup will be labeled appropriately.

Automatic Composites

A popular request is to implement automatic composites. While building a composite from multiple metrics is easy for a basic panel, being able to dynamically build composites will make using polystat even easier (and less tedious).

Multi-Line Labels

Labels can be very long once you add tags. The ability to wrap them inside the polygon would be a great feature.

Multi-Line Metrics

Similar to labels, metric names can be very long. Both wrapping the metric name and splitting the value would be useful.

Sorting

This PR with more sorting options is ready to merge and provides better sorting, similar to other core Grafana panels.

Shapes

D3 comes with other shapes for polygons. Polystat only exposes two of them due to layout calculations.

Wrapping Up

Polystat is a very flexible “multi-stat” type panel that can be used for overviews and drilldowns. More features are being implemented and any ideas to enhance it further are welcome.

Be sure to share your Polystat dashboards on grafana.com!

How We Designed Loki to Work Easily Both as Microservices and as Monoliths

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/15/how-we-designed-loki-to-work-easily-both-as-microservices-and-as-monoliths/

In recent years, monoliths have lost favor as microservices increased in popularity. Conventional wisdom says that microservices, and distributed systems in general, are hard to operate: There are often too many dependencies and tunables, multiple moving parts, and operational complexity.

So when Grafana Labs built Loki – a service that optimizes search, aggregation, and exploration of logs natively in Grafana – the team was determined to incorporate the best of both worlds, making it much simpler to use, whether you want to run a single process on bare metal or microservices at hyperscale.

“Loki is single binary, has no dependencies, and can be run on your laptop without being connected to the Internet,” says Grafana Labs VP Product Tom Wilkie. “So it’s easy to develop on, deploy, and try.”

And if you want to run it with microservices at scale, adds Loki Engineer Edward Welch, “Loki lets you go from 1 node to 100 and 1 service to 10, to scale in a pretty straightforward fashion.”

Here’s a breakdown of how the Grafana Labs team developed the architecture of Loki to allow users “to have your cake and eat it too,” says Wilkie.

1. Easy to Deploy

With Loki, easy deployment became a priority feature after the team looked at the other offerings.

On the microservices side, “Kubernetes is well-known to be hard to deploy,” says Wilkie. “It is made of multiple components, they all need to be configured separately, they all do different jobs, they all need to be deployed in different ways. Hadoop would be exactly the same. There’s a big, whole ecosystem developed around just deploying Hadoop.”

The same criticisms even hold true for Wilkie’s other project, Cortex, with its multiple services and dependencies on Consul, Bigtable/DynamoDB/Cassandra, Memcached, and S3/GCS – although this is something Wilkie is actively working to improve.

The single-process, scale-out models such as Cassandra and Nomad have been gaining more traction recently because users can get started much more easily. “It just runs a binary on each node, and you’re done,” says Software Engineer Goutham Veeramachaneni.

So in this way, the team built Loki as a monolith: “One thing to deploy, one thing to manage, one thing to configure,” says Wilkie.

“That low barrier to entry is a huge advantage because it gets people using the project,” says Welch. “When you’re trying out an open source project and not sure if it’s the right thing for you, you don’t want to put all this time, effort, and investment into configuring and deploying the service while learning the best practices up front. You just want something that you could get started with immediately and quickly.”

2. Simple Architecture

With microservice architectures such as Kubernetes, “you don’t get any value running a scheduler or an API server on its own. Kubernetes only has a benefit when you run all the components in combination,” says Wilkie.

On the other end of the spectrum, single binary systems like Cassandra have no dependencies, and every process is the same within the system.

A lot of the inspiration for Loki was actually derived from Thanos, the open source project for running Prometheus at scale. While Thanos operates with a microservices approach, in which users have to deploy all services to run the system, it aligns each service around a given value proposition. If you want to globally aggregate your queries, you deploy Thanos queriers. If you want to do long-term storage, you deploy the store and sidecars. If you want to start doing downsampling, you deploy the Compactor.

“Every service you add incrementally adds benefit, and Thanos doesn’t introduce too many dependencies, so you can still run it locally,” says Wilkie. “And it doesn’t do all the jobs in the one Cassandra-style homogeneous single process.”

With Loki, Welch explains, “every instance can run the same services and has the same code. We don’t deploy different components – it’s a single binary. We deploy more of the same component and then specify what each component does at runtime. You can ask each process to do a single job, or all the jobs in one.”
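
A rough sketch of what that looks like at startup (the flag values here are an assumption, meant only to illustrate the single-binary idea, not a definitive reference):

# run every job in one process, monolith-style
loki -config.file=loki.yaml -target=all

# or run one role per process, microservices-style
loki -config.file=loki.yaml -target=distributor
loki -config.file=loki.yaml -target=ingester
loki -config.file=loki.yaml -target=querier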

So in the end, Loki users have flexibility in both dimensions. “You can deploy more of the same component – that’s closer to a Cassandra-style architecture where every process in the system is the same – or run it as a set of microservices,” explains Wilkie. “You can split those out and have every single function done in a separate process and a separate service. You’ve got that flexibility which you don’t get with Cassandra.”

3. Easy to Scale

The final consideration: How does the service grow as the user’s system grows?

Microservices have become the most popular option because breaking up different functions into different services makes it possible to isolate each service, use custom load balancing, and apply more specialized scaling and configuration.

“That’s why people went down this microservices architecture – it makes it very easy to isolate concerns from the development process,” says Wilkie. “You might have a separate team working on one service and a different team working on another. So if one service crashes, runs out of memory, pegs the CPU, or experiences trouble, it’s isolated and won’t necessarily affect the other services.”

The challenge of this approach, however, is when multiple problems arise at once. “Deploying lots of microservices makes config management hard,” says Wilkie. “If you’ve got 10 different components, diagnosing outages becomes trickier – you might not know which component is causing the problem.”

Which is why some engineers prefer a simpler approach. “I really like single binary because the biggest problem with deploying a distributed system is not deploying the distributed system, but gaining that expertise as to what to fix, what to look at,” says Veeramachaneni. “Having it run locally, having it on a single node, and experiencing the issues help users gain familiarity with the system. It also gives you that confidence that you can deploy it into production.”

The compromise: Loki has a single-process, no dependencies scale-out “mode” for ease of deployment and getting started, while allowing users to scale into a microservices architecture when ready.

“I would run a single-node Loki first, look at what breaks, and then scale out what doesn’t work,” says Veeramachaneni. And then, “slowly add that expertise.”

The Best of Both Worlds

“The nice thing about Loki is you can independently scale the right parts by splitting out the microservices,” Wilkie says. “When you want to run Loki at massive scale with microservices, you can. You can just run them all as different services, and you can introduce dependencies on Bigtable, DynamoDB, S3, Memcached, or Consul when you want to.”

By design, Loki “gives you the ability to start and learn the system easily,” he adds, “but grows to the same kind of enterprise-ready architecture that microservices offer.”

For more about Grafana’s user interface for Loki, check out this blog post.

Sneak Preview of New Visualizations Coming to Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/11/sneak-preview-of-new-visualizations-coming-to-grafana/

We have been working on a new panel and component architecture for the last half year (or more), and it’s finally starting to bear fruit in terms of new visualizations and capabilities.

Meet the Bar Gauge

We don’t introduce totally new ways to visualize data very often, so we’re excited to share with you this new addition to the family of single-value visualizations (Singlestat, Gauge): the Bar Gauge.

With a traditional circular Gauge, it’s not always easy to see the levels from a distance. It works when you have a small square area, but when you want something that can stretch or stack efficiently, it doesn’t utilize that space very well. This led me to start thinking about a straight bar gauge.

Bar Gauge Basic

This visualization started simple: It looks very similar to a bar chart. There is a minimum and a maximum, and each bar color depends on the thresholds defined and the colors assigned to them.

bargauge_basic_v

And unlike the graph panel, you can stack these horizontally.

bargauge_basic_h

Bar Gauge Retro LED

But I wanted a different mode that was more visually interesting. I started thinking about old stereos that had these physical spectrum displays with discrete LED cells.

bargauge_basic_h

I started trying to mimic that in the visualization, creating these cells that light up if you reach a certain threshold. The nice thing about this way of visualizing a gauge is that you can see the threshold boundaries in the unlit cells. You can easily see if it’s close to warning or to being red. Also, it just looks really cool! For a few examples of this display mode, check out the image below.

bargauge_basic_h

Bar Gauge Gradient

Next, I turned the thresholds into a gradient. Instead of a single color, you can start with a pre-defined “OK” color and add different colors to different thresholds. You can go from green to yellow to orange to red, or the reverse, or any other combination. It’s super flexible, and you can have any number of thresholds. I think this turned out pretty well.

bargauge_basic_h

Threshold Editor

All the thresholds for these visualizations are defined using the new threshold editor we introduced for the new Gauge panel in version 6.0.

Animations

Adding CSS transition animations was a simple one-line change, but it made this dashboard look quite nice. So if you have an auto-updating dashboard, you can get animated transitions between different states.

Other Visualizations in the Pipeline

I shared a preview of the Bar Gauge on Twitter recently, and the response was… positive.

We’re excited to be working on new core visualizations again in Grafana. It feels like it has been years since there was a substantial update on this front. (Because it has been!)

  • We plan to update the Singlestat panel as well, to align its options UI with that of the new Bar Gauge and Gauge panel.
  • The new Gauge, Bar Gauge, and Singlestat panel will be able to repeat vertically or horizontally for every series, table, column, or row.
  • The Table panel will be rewritten with virtualized rendering (faster without paging) and new features.
  • Some form of multi graph visualization (many graphs stacked) is coming.
  • We’re working on improved support for non-timeseries data.

Try It Out and Give Us Feedback

This new Bar Gauge panel will ship in beta form in the next Grafana release (v6.2), but you can try it right now by downloading the latest nightly build. Then in your config/env settings, you can enable alpha panels.

[panels]
enable_alpha = true

Or set the environment variable GF_PANELS_ENABLE_ALPHA=true.
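
For example, if you’re trying the nightly Docker image, the same setting can be passed on the command line (the image tag for nightly builds is an assumption and may differ):

docker run -d -p 3000:3000 -e GF_PANELS_ENABLE_ALPHA=true grafana/grafana:master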

Please open bugs or feedback issues on GitHub.

Until next time, happy dashboarding!
Torkel Ödegaard

Automating Building the Grafana Image on DigitalOcean with Packer

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/10/automating-building-the-grafana-image-on-digitalocean-with-packer/

I recently gave a talk at DigitalOcean Tide in Bangalore on “Grafana and the DigitalOcean Marketplace.” The DO Marketplace lets you launch a range of open source software, including Grafana, with just a few clicks. This post is not about the marketplace – I’m going to talk about how we automated the building of the images.

Prologue

When I was making the demo for the talk, I noticed that the image was launching Grafana v5.4.3 – but the 6.x series launched last month. I then looked into how the images are made. It was quite simple: The 1-click apps are just snapshots of droplets.

Detailed instructions are given here. The application should be installed and configured to start on boot, and you should clean the droplet of logs and keys. For Grafana, that meant following the install guide and then running the cleaning script. But this is a manual process that will still take up nearly an hour of a person’s time. Luckily, the whole thing can be automated away.

Automating the Image Building

I noticed that DO itself documented how you can automate things using either Fabric.py or Packer. I used Packer several years ago to set up AMI building and deployment when I was an intern. I quite liked the tool back then and realized this could be a perfect opportunity to use it again!

The concepts behind Packer are super simple. There are two main pieces: builders and provisioners. Builders let you create images on various platforms like AWS EC2 and DigitalOcean, while provisioners let you run commands and manipulate the VM before creating the image. The config file for the Grafana image is super simple and can be seen here.
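
For readers who don’t want to click through, a template of this kind has roughly the following shape; the image slug, droplet size, and script names below are placeholders rather than the exact values used:

{
  "variables": {
    "do_api_token": "{{env `DIGITALOCEAN_TOKEN`}}"
  },
  "builders": [
    {
      "type": "digitalocean",
      "api_token": "{{user `do_api_token`}}",
      "image": "ubuntu-18-04-x64",
      "region": "nyc3",
      "size": "s-1vcpu-1gb",
      "ssh_username": "root",
      "snapshot_name": "grafana-{{timestamp}}"
    }
  ],
  "provisioners": [
    { "type": "shell", "scripts": ["scripts/install-grafana.sh"] },
    { "type": "file", "source": "files/motd", "destination": "/etc/update-motd.d/99-grafana" },
    { "type": "shell", "scripts": ["scripts/cleanup.sh"] }
  ]
}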

You can see we’re using the DigitalOcean builder and configuring its various parameters, like the snapshot name, the droplet size, location, etc. Next up are the provisioners. First we execute all the commands detailed here to install Grafana. Then we copy over the MOTD so that when users log in, they’ll know where to find the documentation.

Finally we run the commands to clean the VM of any history and security keys and also verify that the image will pass all of DO’s approval checks. I took the cleanup script from here and made some modifications to fix the issues below.

This turned out to be the hardest part of the job because the validation script continued to fail with:

digitalocean: Updating apt package database to check for security updates, this may take a minute...
digitalocean:
digitalocean: [FAIL] There are 8 security updates available for this image that have not been installed.
digitalocean: Here is a list of the security updates that are not installed:
digitalocean:
digitalocean: Checking for log files in /var/log
digitalocean:
digitalocean: [WARN] un-cleared log file, /var/log/auth.log found

------------------------------------------------------------------------------------------------
Scan Complete.
One or more tests failed.  Please review these items and re-test.
------------------------------------------------------------------------------------------------
7 Tests PASSED
1 WARNINGS
1 Tests FAILED
------------------------------------------------------------------------------------------------
Some critical tests failed.  These items must be resolved and this scan re-run before you submit your image to the marketplace.

It drove me nuts that running apt-get -y update; apt-get -y upgrade in the clean_image step didn’t fix it. Some Googling made me realize this was because apt-get wasn’t updating the kernel, which required adding the --with-new-pkgs flag: apt-get -y --with-new-pkgs upgrade. But when I added that, Packer got stuck, because updating grub throws up an interactive prompt. The flag combination of -yq didn’t help either. :/ Finally, the Google and Stack Overflow gods showed me some mercy, and it worked after I set the DEBIAN_FRONTEND=noninteractive environment variable.

The commands finally became:

apt -y update
DEBIAN_FRONTEND=noninteractive apt -y upgrade

And Packer could successfully build and store the snapshot!

Epilogue

Setting up and using Packer again was fun, and now having the latest version of Grafana on the marketplace is easier than ever. But it still requires me to run the commands on my laptop after every release. I’m hoping to automate that, make it part of Grafana’s official release process, and run it through CircleCI.

Metrictank meta tags

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/09/metrictank-meta-tags/

Coming Soon: Seamless and Cost-Effective Meta Tags for Metrictank

One of the major projects we’re working on for Metrictank – our large scale Graphite solution – is the meta tags feature, which we started last year and are targeting to release in a few months.

A lot of people don’t realize this, but Graphite has had tag support for more than a year.
Our mission with Metrictank is to provide a more scalable version of Graphite, so introducing meta tags was a logical next step.

The meta tags feature is sponsored by Bloomberg, which is one of the largest scale users of Metrictank, with tens of thousands of hosts being monitored and 4 million series in the index per Metrictank instance.

The Problem

The Bloomberg team wanted to be able to add a lot of tags to their metrics – say, data center, host operating system or unit – so they can query by them.

As Stig Sorensen, Bloomberg’s Head of Telemetry, put it: “The goal is to provide better filtering and group by capabilities in Grafana/Metrictank, by being able to augment the core tags with additional tags/metadata that should work like any first-class tags in Grafana.”

But there are many, many metrics that would all share the same tag value. If you have a thousand metrics per host, and a thousand hosts in one data center, that would mean that you would have a million metrics all originating out of the same data center.

And if you wanted to add these tags to all your individual metrics – say, adding one particular data center value to those million metrics – that would be a lot of overhead.

These tags are extremely redundant; if we had to store them for every individual metric, it would blow up the size of our index. We would use way too much RAM and disk space. And ultimately, it would slow things down.

The Solution

We came up with the idea to exploit the redundancy and store these tags separately, with some kind of smarter association: Essentially all of these hosts automatically pretend that they have this additional tag that says, for instance, what data center they’re in. That way, you can add a whole bunch of different tags to your thousands or millions of metrics without having to store them all individually.

It should also be a seamless experience: As an end user, you don’t even have to realize that all this background juggling is going on when you query for all your metrics by operating system, unit, or data center. And when your metrics get returned to you, you would see all the tags associated with it, whether those tags are stored with the actual metric itself or as a meta tag. The user experience is just the same.

Extrinsic vs. Intrinsic

Another way of thinking about meta tags is by using some ideas found in Metrics 2.0. There are some tags intrinsic to the identity of the metric: If you change them, you’re referring to a different metric. But other tags are not part of the metric’s identity. If you change them, you can still be looking at the same metric. They’re called extrinsic.

Metrictank’s regular tags are intrinsic, whereas meta tags are extrinsic, as they are not tightly coupled to the identity of the metric. You can change them, but you’re still working with the same metrics.

Compared to Prometheus Series Joining

The function of meta tags is comparable to what can be achieved in Prometheus with series joining. To add tags in Prometheus, you have to have separate series that declare these additional tags. So for example, you would have a series that has a host value, and then it would have an additional series tagged with a certain data center and operating system for that particular host.

As you query your data, you would have to write, “I want to have a join between the metrics that only have the host tag and this other metadata series that has the same host tag,” and then pull in these additional tags by doing a series join.
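
To make that concrete, such a join looks roughly like this in PromQL (metric and label names are invented for illustration): a separate “info” series carries the extra labels, and the query pulls them in over the shared host label.

node_load5 * on(host) group_left(datacenter, os) host_meta_info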

But our feature offers a more transparent solution, allowing for what feels like native tags for filtering, grouping, auto-complete, etc. Unlike Prometheus’s series joining, users don’t have to do anything explicit. Additionally, with Prometheus, you can’t backfill the external tags; with Metrictank meta tags, you will be able to add new tags to old series.

Implementation

Our current tag index is fairly standard, using postings lists to link tag key/values to metric metadata. Query evaluation happens by ordering the individual query components by cost (cardinality) and executing them as a pipeline with some degree of parallelization.

With the meta tags project, we’re adding postings lists to link the meta tag key/values to “metarecords,” where a meta record defines which tags (meta tags) should be added and for which query expression. This way, query patterns hitting meta tags can use the same query execution system, but with an additional step. (Execute the query on the meta tags postings lists, then execute its corresponding tag queries – along with remaining query components on regular tags – on the regular postings lists to resolve the metric metadata.)

As far as returning the data in a seamless way, we have an “enrichment” step wherein we add the meta tags as regular tags to the returned series, to make it transparent for the user. This part will most likely come with a cache to speed up enrichment of frequently queried series.

All meta tag operations are API-based, so you’ll be able to add, remove, manage, and update all associations via an API call to any of the cluster nodes (which in turn propagates to other nodes), and the rules will be safely persisted as well.

The design document covers various interesting edge cases and design constraints. One particular aspect that I find interesting is how to effect an update in meta tag rules across a Metrictank cluster. We weighed various consistency trade-offs but ultimately decided that at least for v1, a relaxed, eventually-consistent model will suffice. Here’s an example: If you have a thousand hosts with a thousand series each, and you add a rule that says “For all of these thousand hosts, it should be known that they are a part of this data center,” the implication is that, as that rule is being applied, when you query for that data center you’re not going to see all of these million series at once. You’ll gradually get more and more results as the tag association is deployed throughout the cluster. We can revisit this model later if needed.

For more details, you can check out the conversation on GitHub or the design document.

The Upshot

Meta tags will be a seamless, powerful feature for Metrictank. They’re the next step in providing a scalable way to enrich large masses of series with redundant tags, but at a fraction of the cost compared to traditional tags, and using a convenient API to manage the associations rather than having to update the source of the metrics. They’ll work seamlessly alongside regular tags. We’ve had several customers ask for a feature along these lines and are excited to be able to bring it to Metrictank, with the help of our friends at Bloomberg.

How eBay Moved from Custom UIs to Grafana Plugins

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/08/how-ebay-moved-from-custom-uis-to-grafana-plugins/

In the beginning, the mission of the logging and monitoring team at eBay was simple: “to give out APIs that the developers in the company could use to instrument their applications [in order] to send logs,” Vijay Samuel said during his talk at GrafanaCon about eBay’s journey to using Grafana plugins. “We had our own developers who built out UIs for being able to search view and debug their issues. And metrics were no different from logs. We gave out a bunch of APIs to instrument the code.”

The problem, Samuel said, was that “the quality of the UI was entirely dependent on the person who’s building the UI.” The job of building some of those UIs fell on Samuel’s shoulders, and about four years ago, he found adding new graphs so painful that he decided to do a proof of concept based on Grafana.

“The first attempt was a literal hack,” said Samuel, a member of the monitoring team. “I took the master branch of Grafana, and I modified the open TSDB data source to be able to understand our internal APIs. And we built out some dashboards, primarily scripted dashboards, but they didn’t have all the complex features like templating or annotations.”

Grafana was then still in v3.x, and “it was a dirty-dirty hack,” Samuel said. The PoC was used by some on-call teams, but languished until some people from the Database Ops team came and asked for Grafana support for eBay’s internal TSDB.

Building a Data Source Plugin

Samuel’s old PoC was revived, and the Database Ops team members, Steven West and Auston McReynolds, “took the dirty hack and converted it into a dedicated data source plugin, but it was still grunt-generated code,” said Samuel. “They also added Docker support to the plugin.”

Samuel took that plugin and ran with it, adding some Kubernetes deployment scripts. “Every time someone asked for Grafana support, I would point them to these Kube specs and tell them, ‘Go run it,’” he recounted. “And every time they asked for features, I would use my spare time and build out some features for that.”

The big breakthrough came when some eBay SREs, led by Satish Sambasivan, decided to scrap their work building their own custom UIs and use Grafana instead. “They took it to the next level,” said Samuel. “They started to overlay a lot of data on their graphs. For example, any change that was happening that impacted the site, they dropped them as annotations on the graph. So they were able to catch interesting issues like when a DNS flip caused errors to spike, and that was right there on the dashboard. They started providing hosted solutions.”

Later, the SRE team turned to the monitoring team to support all of this for them. “They have four golden signals, which they basically use to triage all the issues that were happening on the site, and there were many dashboards that they built,” Samuel said. “The monitoring team decided to take up Grafana as a first-class citizen in our offering. And this came with a whole new makeover.”

With seasoned UI developers working on the project, many changes were made: First off, grunt-generated files would be a thing of the past. Widgets were added to view logs and events. It would become a more robust hosted solution. A lot more features were added into Grafana, such as being able to authenticate with internal APIs, and annotations support for the data source plugin.

A Cloud Native Approach

And on the backend, custom APIs for shipping logs, metrics, and events into the platform were replaced by “more cloud native mechanisms,” to make logging and metrics simpler. For logging to log files, users could let the monitoring team know what the log files were, and they’d ship the logs. For metrics, Samuel said, “Instrument your code with Prometheus, and if you’re running on Kubernetes, provide a few annotations saying that this is the port that we’re exposing the metrics on. And we’ll be able to collect and ship it into the platform.”

Along the way, the eBay monitoring team began investing more in open source. “If you found a product to be worth investing in, and if you found gaps, we started contributing to them,” Samuel said. (One project they’ve contributed a lot to: Elastic Beats.)

At this point, Samuel said, “we’re at a place where we can say that we’re slowly changing the dynamic of monitoring inside of eBay, and Grafana is playing a big picture in all of that.”

The biggest lesson they’ve learned: “It’s always good to be a part of the community,” he said. “Anytime we saw that a feature was missing, we tried really hard to build it out in a generic way and tried to give it back to the community.”

Compared to his first painful experiences building graphs, he remarked, “now creating dashboards is easy.” In fact, eBay’s custom data source plugin was built in one day. “That’s a big testament to Grafana,” he added. “If a non-seasoned UI person like me could build it out in a day, then imagine how much power the product is giving to every developer…. Moving away from custom APIs and moving into more cloud native constructs has helped us to onboard more use cases than we could ever imagine.”

Want to watch more GrafanaCon talks? Check them out here.

A Look at the Latest Cloud Data Source Plugins in Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/04/a-look-at-the-latest-cloud-data-source-plugins-in-grafana/

The engineers at Grafana Labs have their heads in the clouds.

“This is a new world: We have hybrid clouds and multiclouds,” Daniel Lee told the crowd gathered at GrafanaCon 2019 in Los Angeles. And the advantage clients have when using Grafana’s hosted services is that “they can deploy them on any cloud,” said Lee.

“With new data sources and new forms of data that you can get into Grafana, you can visualize, analyze, and understand all your systems,” said Lee. “We keep moving with the times.”

In addition to AWS Cloudwatch, which was added to Grafana in 2015, Grafana Labs recently released Azure Monitor as well as Google Stackdriver plugins in Grafana v6.0.

“Grafana’s all about avoiding vendor lock-in,” said Lee. “It fits with our core vision of democratizing metrics.”

Here is a look at the latest cloud data source plugins:

Azure

Microsoft’s Azure Monitor is no stranger to the Grafana community, having had a plugin for nearly two years.

The plugin supports four different Azure services: infrastructure metrics, application metrics and insights, and log-based metrics, in addition to monitoring. “You have support for three different types of metrics in one data source,” said Lee. “As a bonus, you will get Grafana Alerting as well.”

But if it’s already a plugin, why move it into Grafana’s core offerings? “We’re basically committing to a higher level of support,” explained Lee. “With so many plugins, often support for them is sporadic. But now with Azure Monitor, we are committing to keeping it in tip-top shape, improving it, and fixing any bugs that come up.”

So if any of the growing number of clients who use Azure Monitor reports an issue, “it will be resolved in one day,” said Lee.

Like Grafana Labs, Microsoft wanted Azure to be flexible across different services. “One of the nice things is the ability to bring together a visualization that joins data from multiple sources, from something like CPU percentages to application insights,” said Brendan Burns, Distinguished Engineer at Microsoft.

“With metrics, people want consistency,” said Burns. While Azure supports government clouds as well as public clouds, “it’s great to be across all of these clouds, but when you come to something like monitoring, you want it to look and feel the same no matter where you are.”

In addition to being able to “mix and match” different data sources, “it’s great to be able to provide Grafana to people who are using a SaaS-based monitoring inside of Azure,” said Burns.

Other features of the Azure Monitor plugin that Burns highlighted included the ability to edit queries, even within the in-line Grafana experience.

Also, with its log analytics service, Azure can help manage a Kubernetes cluster, log straight to the standard output stream, get the data into log analytics, and visualize the information in Grafana. “That’s a really great ability for users,” said Burns.

And capabilities such as dimension filtering, templating and alerting are “a nice value-add for a lot of people,” said Burns.

In other words, said Lee, the Azure Monitor plugin “is a bit of a beast.”

Stackdriver

Google provides a broad suite of products through Stackdriver, which helps improve the development experience on GCP as well as other cloud environments.

Services included in Stackdriver are monitoring (for platform, system, and application metrics, as well as custom metrics), logging (for log-based metrics and alerting), APM (Trace and Profiler are provided; Debug is in production), and IRM (incident response management, for command and control during incidents). IRM was also introduced last year.

But there’s one area where GCP couldn’t get ahead of the game. “We’ve got lots of user feedback saying we, as a GCP user, like how Stackdriver improved over the years, but in terms of UI visualization, we still like the way that Grafana does things,” said Joy Wang, Product Manager on Google Stackdriver.

So in early 2018, the GCP team reached out to Grafana Labs to collaborate on building a plugin together. Or as Wang put it: “I made some requirements, Daniel did all the work.”

“It’s been great to work with Google, helping us with all our questions,” said Lee of the development process. “They’re sort of guinea pigs in a way, because it’s the first data source written in React in Grafana.”

By October 2018, the beta version of the Stackdriver plugin was announced at Google Next, and a few months later, the feature advanced to GA on Grafana v6.0.

“At Google, lots of projects are open source, and we believe that as a GCP user, you benefit from open source solutions like Grafana [because] you get to choose what monitoring solutions you want for all the infrastructures that you have,” said Wang.

Wang also gave GrafanaCon attendees a preview of the notable updates to Stackdriver coming this year. “Basically, we’re doing a lot of things on the UI to allow users to put SRE best practices in place and also share those best practices with your coworkers,” she said.

On the monitoring front end, Stackdriver will introduce more widgets as well as allowing users to apply group-bys and filters on a dashboard level and save views of one dashboard. “You can slice and dice your metrics in a different way,” said Wang. “That will help users do in-context, ad-hoc analysis much faster.”

Stackdriver is also introducing Kubernetes monitoring, and SLO monitoring is another key concept for Google SRE. “We’re working on getting the service level monitoring and SLO monitoring out-of-box for users,” said Wang.

Finally, there will be “APIs for everything,” said Wang. “Later this year, we’re also announcing account API dashboards that allow users to set up their monitoring from end to end automatically. We’re also bringing up metric granularity and retention.”

“That’s only a fraction of the features that we’re working on,” said Wang. “There’s a lot more to come.”

Oracle

The Oracle Cloud Infrastructure (OCI) allows users to launch any type of instance, from VMs to bare metal servers to GPU shapes, all through the same API and console. But recently “we’ve been focusing a lot on supporting container-native workloads,” said Mies Hernandez van Leuffen, VP of Solution Development at Oracle Cloud Native Labs.

So in December 2018, Oracle launched the Oracle Cloud Native Framework, which consists of Oracle’s managed Kubernetes solution (OKE) as well as a container registry that is completely Docker v2 compatible.

“The managed Kubernetes solution runs plain vanilla Kubernetes, nothing forked,” said Van Leuffen.

“At the same time, we’ve been working a lot with other companies rooted in open source to build up these ecosystem partnerships,” said Van Leuffen. “To that end, we’re pretty excited to launch our Grafana plugin.”

The Oracle plugin can be installed either via local environment, through Kubernetes, or on a VM. It allows users to surface and graph metrics from the OCI monitoring service. “The current metrics that we support are the Compute Agent, Block Store, Load Balancer, and Virtual Cloud Networks,” said Van Leuffen.

From the beginning, it hasn’t been a hard sell for companies to see the value in the Oracle plugin. “We have a pretty cool launching customer that is already using this in production,” said Van Leuffen. “It’s called Booster Fuels, and they effectively bring the gas station to your car, as opposed to the other way around.”

Added Van Leuffen: “Our mission is effectively to build customer deployable cloud native and container-centric solutions that bridge the gap between where OCI is today and what customers would like to use in terms of open source projects that we all know and love.”

Check out more sessions from GrafanaCon2019 on YouTube.

Grafana v6.1 Released

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/03/grafana-v6.1-released/

v6.1 Stable released!

A few weeks have passed since the excitement of the major Grafana 6.0 release during GrafanaCon, which means it’s time for a new Grafana release. Grafana 6.1 iterates on the permissions system to allow for teams to be more self-organizing. It also includes a feature for Prometheus that enables a more exploratory workflow for dashboards.

What’s New in Grafana v6.1

Download Grafana 6.1 Now

Ad hoc filtering for Prometheus

The ad hoc filter feature allows you to create new key/value filters on the fly with autocomplete for both keys and values. The filter condition is then automatically applied to all queries on the dashboard. This makes it easier to explore your data in a dashboard without changing queries and without having to add new template variables.

Other timeseries databases with label-based query languages have had this feature for a while. Prometheus recently added support for fetching label names from its API, and thanks to Mitsuhiro Tanda’s work implementing it in Grafana, the Prometheus datasource finally supports ad hoc filtering.

Support for fetching label names was released in Prometheus v2.6.0, so that is a requirement for this feature to work in Grafana.
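
Conceptually, an ad hoc filter just appends an extra label matcher to every query on the dashboard. As a hypothetical illustration (the metric name is made up), a panel query written as

sum(rate(http_requests_total{job="api"}[5m]))

behaves as if it were

sum(rate(http_requests_total{job="api", instance="web-1"}[5m]))

once an instance = web-1 ad hoc filter is added.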

Editors can own dashboards, folders, and teams they create

When the dashboard folders feature and permissions system were released in Grafana 5.0, users with the editor role were not allowed to administer dashboards, folders, or teams. In the 6.1 release, we have added a config option that can change the default permissions so that editors are admins for any dashboard, folder, or team they create.

This feature also adds a new team permission that can be assigned to any user with the editor or viewer role and enables that user to add other users to the team.

We believe that this is more in line with the Grafana philosophy, as it will allow teams to be more self-organizing. This option will be made permanent if it gets positive feedback from the community, so let us know what you think in the issue on GitHub.

To turn this feature on, add the following config option to your Grafana ini file in the users section, and then restart the Grafana server:

[users]
editors_can_admin = true

List and revoke user auth tokens in the API

As the first step toward a feature that would enable you to list a user’s signed-in devices/sessions and to log out those devices from the Grafana UI, support has been added to the API to list and revoke user authentication tokens.
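
If you want to try this out against the admin HTTP API, the calls look roughly like the following (the exact paths and payloads are documented in the HTTP API reference; the user id and token id below are placeholders):

GET  /api/admin/users/42/auth-tokens
POST /api/admin/users/42/revoke-auth-token
     {"authTokenId": 25}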

Minor Features and Fixes

This release contains a lot of small features and fixes:

  • A new keyboard shortcut d l toggles all graph legends in a dashboard.
  • A small bug fix for Elasticsearch – template variables in the alias field now work properly.
  • Some new capabilities have been added for datasource plugins that will be of interest to plugin authors:
    • There’s a new oauth pass-through option.
    • It’s now possible to add user details to requests sent to the dataproxy.
  • Heatmap and Explore fixes.
  • The Prometheus range query alignment was moved down by one interval. If you have added an offset to your queries to compensate for alignment issues, you can now safely remove it.

Changelog

Check out the CHANGELOG.md file for a complete list of new features, changes, and bug fixes.

Download

Head to the download page for download links & instructions.

Thanks

A big thanks to all the Grafana users who contribute by submitting PRs, bug reports, and feedback!

Grafana Plugin Tutorial: Polystat Panel (Part 1)

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/02/grafana-plugin-tutorial-polystat-panel-part-1/

Polystat

The grafana-polystat-panel plugin was created to provide a way to roll up multiple metrics and implement flexible drilldowns to other dashboards.

This example will focus on creating a panel for Cassandra using real data from Prometheus collected from our Kubernetes clusters. We’ll focus on the basic metrics for CPU/Memory/Disk coming from cAdvisor, but a well-instrumented service will have many metrics that indicate overall health, such as requests per second, error rates, and more.

This panel allows you to group these metrics together into an overall health status, which can be used to drill down to more detailed dashboards. For this Cassandra example, the end result will look like this:

panel goal

The Basics

Getting CPU, memory, and disk utilization will give enough metrics to demonstrate the idea behind compositing metrics and displaying them in Grafana. The PromQL queries below are simple and can be adapted with template variables to make the panel more “general purpose.” To get started, some simple queries will be used, then later modified.

CPU

container_cpu_usage_seconds_total{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra"}

The above query with polystat will show a large number of polygons (one per metric):

all cassandra pods

There are quite a number of pods displayed (we have multiple Cassandra clusters), so we will narrow this down to just a single cluster:

container_cpu_usage_seconds_total{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}

The names still don’t show up since they are very long (hint: tooltips will show them). Adding {{pod_name}} to the Legend field will result in a better display:

all cassandra pods with legend

Result:

all cassandra pods with legend result

The query needs a little more work – the metric is a counter – so we’ll use irate to get instantaneous per-second values.

irate(container_cpu_usage_seconds_total{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}[1m])

all cassandra pods cpu rate

Disk

While CPU is interesting, disk space is usually what runs out first in Cassandra, so we’ll add these queries to track disk usage:

container_fs_usage_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}
container_fs_limit_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}
container_fs_limit_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"} - container_fs_usage_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}
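
If you would rather work with a single utilization percentage (which maps nicely onto the percent-based thresholds used later), the usage and limit series can also be combined into one query, for example:

container_fs_usage_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"} / container_fs_limit_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"} * 100

This is just an alternative formulation; the rest of the tutorial keeps the separate queries shown above.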

Memory

To complete our stats, add this memory query:

container_memory_usage_bytes{namespace="metrictank", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="ops-tools1"}

We can now see the result:

all cassandra pods all stats

Formatting

The stats themselves have “short” as the value type in Grafana. Switching to the options of polystat, we can adjust them to something more meaningful:

cpu overrides

disk overrides

memory overrides

Thresholding

The next step is to create thresholds for each of the metrics. In the thresholds section, add a new threshold, set the name to match a metric, and configure it as needed. This example sets a 60% warning and an 80% critical threshold for CPU utilization.

cpu threshold setting

The panel will now look like this:

cpu threshold result

Composites

Now that we have basic metrics and thresholds, we can create composites. (NOTE: The composite being created here is for a single node to keep everything simple, but the final result will have all nodes displayed.)

Composites allow you to group multiple metrics together and display a single item with the threshold state reflected.
The polygon is given the color of the “worst” state. The tooltip will show individual states, sorted by worst to best.

To create a composite, click Add in the Composites section:

composites

This will create a new composite named “Cassandra” and will include all metrics that match CPU/Memory/Disk.

composite cassandra

The result of the composite will change the polystat to show a single polygon that represents three different metrics, and will animate to show the value for each metric.

animated

Clickthroughs

There are three levels of clickthroughs provided by this panel.

  1. Default clickthrough
  2. Override clickthrough
  3. Composite clickthrough

The order of precedence is most-specific to least-specific (3, 2, 1).

Default Clickthrough

You can set a clickthrough to be used globally when there are no override or composite clickthroughs defined for a polygon.

In this example, the clickthrough is set to:

dashboard/db/cassandra

clickthrough default

Clicking on the polygon will take you to the Cassandra dashboard, in the same Grafana server. The clickthrough can be any valid url.

The plugin also includes parameters that can be passed to other dashboards.

dashboard/db/cassandra?var-environment=$Cluster&var-instance=All

clickthrough default

Additional variables can be passed; see this for details: https://github.com/grafana/grafana-polystat-panel#single-metric-variables.

Override Clickthrough

In the overrides section, you can specify a clickthrough that applies for that specific override. This is mainly used when not using composites.

Setting the clickthrough for CPU to be…

dashboard/cpu?var-node=${__cell_name}

…will take you to a dashboard named “CPU” and pass the value of the clicked polygon.

Composite Clickthrough

The third type of clickthrough is used to specify where to go when a composited polygon is clicked. The implementation is the same as above.

Composites have another set of variables that can be passed to clickthroughs. See: https://github.com/grafana/grafana-polystat-panel#composite-metric-variables.

Templating

To keep the example above simple, the names are hardcoded. Leveraging Grafana template variables will make the dashboard more flexible.

The queries use “namespace” and “cluster,” so let’s create those.

Add a template variable to allow selection of different clusters:

templated variables
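
As a sketch (the variable names are only examples), the Cluster variable can be a Prometheus query variable that uses label_values to list the available clusters:

label_values(container_cpu_usage_seconds_total, cluster)

The panel queries then reference it (and a similar Namespace variable) instead of hardcoded values:

irate(container_cpu_usage_seconds_total{namespace="$namespace", pod_name=~"cassandra-sfs-.*", container_name="cassandra", cluster="$cluster"}[1m])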

Almost there

The dashboard will look like this, showing a single node with three different metrics displayed.

dashboard completed

dashboard completed with animation

Wrapping Up

To complete the panel, just modify the composites to match regex per-node.

composite1
composite2

After changing the composites, the end result will look like this:

panel goal

About Part 2

Part 2 will detail more composite options and advanced features to make them even easier to create.

If you have created some dashboards already with polystat, we’d love to see them!

How We’re Using Prometheus Subqueries at Grafana Labs.

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/04/01/how-were-using-prometheus-subqueries-at-grafana-labs./

In the Prometheus 2.7 release, Ganesh Vernekar added a feature called “Subqueries”. Ganesh published an explanation of how to use subqueries over on the Prometheus blog. In this post we’ll share a couple of real-life examples of how we use them at Grafana Labs.

Subqueries make it possible to do a certain class of queries against Prometheus in an ad hoc fashion. Previously Prometheus wouldn’t allow you to take a range vector of the output of a function; you could only take a range vector of a timeseries selector. We mainly justified this for two reasons: (1) We argued performance of this would be poor, and (2) it conveniently stopped you from taking the rate of a sum, something you should never do in Prometheus.

But a lot has changed in the past few years. Prometheus 2.0 introduced a new storage engine, and many improvements have been made to query performance. So justification (1) is no longer valid – and at the post-PromCon dev summit back in August, we decided the time was right to add subqueries.

Billing

The first example of how we use subqueries at Grafana Labs is for billing. We do usage-based billing for Grafana Cloud, partly based on your P95 datapoints per minute. Before Prometheus 2.7 and subqueries, we needed to use a recording rule to calculate the ingestion rate:

record: id_namespace:cortex_distributor_received_samples_total:sum_rate
expr: sum by (id, namespace) (rate(cortex_distributor_received_samples_total[5m]))

…and then a separate query to calculate the P95 of that using the quantile_over_time function:

quantile_over_time(0.95, id_namespace:cortex_distributor_received_samples_total:sum_rate[30d])

With subqueries, we can roll this all into one:

quantile_over_time(0.95, sum by (id, namespace) (rate(cortex_distributor_received_samples_total[5m]))[30d:])

The biggest benefit here is that we no longer need to use recording rules. They have to be declared up front, and changes made to them are not retroactively applied, making it very hard to experiment with ad hoc queries that require a range vector of a function’s result.

Capacity Planning

The second example of how we use subqueries at Grafana Labs is for capacity planning. We run Grafana Cloud on a set of Kubernetes clusters around the world. We need to ensure our jobs have the resources they need to serve traffic within our latency targets, but at the same time we do not want to be throwing money away on underutilized machines. As such we keep a close eye on our pods’ CPU utilization, and ensure we size the containers’ CPU requests appropriately.

To achieve good utilization, we size the containers’ CPU requests at P95 of their CPU usage. Per-container CPU usage is exported by cAdvisor as a counter. To calculate the per-second CPU usage, you have to apply the rate function:

record: container_name_pod_name_namespace:container_cpu_usage_seconds_total:sum_rate_5m
expr: sum by (container_name, pod_name, namespace) (rate(container_cpu_usage_seconds_total{namespace="cortex-ops", container_name!="POD"}[5m]))

And to see the P95 CPU usage over the last 7 days:

quantile_over_time(0.95, container_name_pod_name_namespace:container_cpu_usage_seconds_total:sum_rate_5m[7d])

With subqueries, we can now combine this into a single query:

quantile_over_time(0.95,
  sum by (container_name, pod_name, namespace) (
    rate(container_cpu_usage_seconds_total{namespace="cortex-ops", container_name!="POD"}[5m])
  )[7d:]
)

And as a bonus, you can use kube-state-metrics to export the configured CPU requests as a metric (some jiggery-pokery is needed to make the labels consistent with cAdvisor):

sum by (container_name, pod_name, namespace) (
  label_join(
    label_join(
      kube_pod_container_resource_requests_cpu_cores,
      "pod_name", "", "pod"
    ),
    "container_name", "", "container"
  )
)

And combine the two queries to find underprovisioned pods:

  quantile_over_time(0.95,
    sum by (container_name, pod_name, namespace) (
      rate(container_cpu_usage_seconds_total{namespace="cortex-ops", container_name!="POD"}[5m])
    )[7d:]
  )
>
  sum by (container_name, pod_name, namespace) (
    label_join(
      label_join(
        kube_pod_container_resource_requests_cpu_cores,
        "pod_name", "", "pod"
      ),
      "container_name", "", "container"
    )
  )

Gotchas

As with “normal” queries, with subqueries you must never do a sum before a rate. For example, instead of rate((success+failure)[1m]) you must do rate(success[1m]) + rate(failure[1m]). This is because the rate function requires special handling for counter resets, which it cannot do if you have aggregated the counter with another. Avoid the temptation to simplify!

timeShift(GrafanaBuzz, 1w) Issue 83

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/03/29/timeshiftgrafanabuzz-1w-issue-83/

Welcome to TimeShift

This week we have updates and articles from the Grafana Labs team, some initial impressions on our Prometheus-inspired log aggregation project Loki, and lots more. Plus learn how to make your own air quality monitor.

See an article we missed? Contact us.


Latest Beta Release: Grafana v6.1.0-beta1

Major New Features
  • Prometheus: Ad-hoc Filtering makes it easier to explore data in a Grafana dashboard.
  • Permissions: A new option so that Editors can own dashboards, folders and teams they create. This makes it easier for teams to self-organize when using Grafana.

Check out all the features and bug fixes in the latest beta release.

Download Grafana v6.1.0-beta1 Now


From the Blogosphere

Everything You Need to Know About the OSS Licensing War, Part 2.: Part 2 of our OSS licensing war series picks up where we left off: In 2015, AWS had taken the Elasticsearch software and launched their own cloud offering, and Elastic N.V. doubled down on an ‘open core strategy.’

Grafana Logging using Loki: Julien dives into Loki, its architecture, configuration, and some various use-cases.

Tinder & Grafana: A Love Story in Metrics and Monitoring: Two years ago, when it was time for the L.A.-based company to find and implement a perfect metrics monitoring partner, the process proved to be more of a slow-burn love affair than a whirlwind romance.

Scaling Graphite to Millions of Metrics: The folks at Klaviyo discuss the challenges they overcame to scale their Graphite stack to reliably handle over a million active metric keys at any given time across 17 million total metric keys.

Dynamic Configuration Discovery in Grafana: John provides an introduction to how dynamic configuration discovery works in Grafana for data sources and dashboards.

Build an air quality monitor with InfluxDB, Grafana, and Docker on a Raspberry Pi: Learn how to build your own air quality monitor. Collect metrics for temperature, humidity, barometric pressure, and air quality, and visualize all your data in Grafana.

Timeseries and Timeseries Again (pt. 2): Leon describes his initial impressions of using our Prometheus-inspired log aggregation project Loki in part 2 of his series focused on the metrics stack he uses at work. Check out part 1 to get up to speed.

Writing React Plugins for Grafana: In Grafana 6.0 we started the migration to using React in Grafana. This post will walk you through how to create plugins for Grafana using ReactJS.

How to deploy Telegraf, InfluxDB, and Grafana with Puppet Bolt: Check out a demo to configure and deploy monitoring stack via Puppet Modules.

What’s New in Prometheus 2.8: WAL-Based Remote Write: Learn about Prometheus’ new Write-Ahead Logging (WAL) for the remote_write API, which was included in the Prometheus 2.8 release. It’s a change intended to safeguard client metrics in the face of any network issues.


Grafana Plugin Update

Three plugin updates to share this week. Update or install any plugin on your on-prem Grafana via the grafana-cli tool, or update with one-click on Hosted Grafana.

UPDATED PLUGIN

WorldPing App – WorldPing v1.2.5 has been released which adds support for Grafana v6.0 and includes a few minor bug fixes.

Install

UPDATED PLUGIN

Polystat Panel – Polystat panel v1.0.16 has been released and includes a fix for variable encoding in clickthrough urls.

Install

UPDATED PLUGIN

Instana Data Source – The latest update of the Instana data source has lots of bug fixes as well as some new features:

  • Application metrics have been added
  • The datasource connection check has been improved
  • Support added for beacon.meta grouping

Install


Upcoming Events

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

DevOps Days Vancouver 2019 | Vancouver BC, Canada – 03.29.19-03.30.19:

Callum Styan: Grafana Loki – Log Aggregation for Incident Investigations – Get an inside look at Grafana Labs’ latest open source log aggregation project Loki, and learn how to better investigate issues using Grafana’s new Explore UI.

Register Now

KubeCon + CloudNativeCon Europe 2019 | Barcelona, Spain – 05.20.19-05.23.19:

May 21 – Tom Wilkie, Intro: Cortex
May 22 – Tom Wilkie, Deep Dive: Cortex

Cortex provides horizontally scalable, highly available, multi-tenant, long term storage for Prometheus metrics, and a horizontally scalable, Prometheus-compatible query API. Cortex allows users to deploy a centralized, globally aggregated view of all their Prometheus instances, storing data indefinitely. In this talk we will discuss the benefits of, and how to deploy, a fully disaggregated, microservice oriented Cortex architecture. We’ll also discuss some of the challenges operating Cortex at scale, and what the future holds for Cortex. Cortex is a CNCF sandbox project.

May 23 – Tom Wilkie, Grafana Loki: Like Prometheus, But for logs.
Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. It is designed to be cost effective and easy to operate, as it does not index the contents of the logs, but rather labels for each log stream.

Loki initially targets Kubernetes logging, using Prometheus service discovery to gather labels for log streams. As such, Loki enables you to easily switch between metrics and logs, streamlining the incident response process – a workflow we have built into the latest version of Grafana.

In this talk we will discuss the motivation behind Loki, its design and architecture, and what the future holds. It’s early days after the launch at KubeCon Seattle, but so far the response to the project has been overwhelming, with more than 4.5k GitHub stars and over 12 hours at the top spot on Hacker News.

May 23 – David Kaltschmidt, Fool-Proof Kubernetes Dashboards for Sleep-Deprived Oncalls
Software running on Kubernetes can fail in various, but surprisingly well-defined ways. In this intermediate-level talk David Kaltschmidt shows how structuring dashboards in a particular way can be a helpful guide when you get paged in the middle of the night. Reducing cognitive load makes oncall more effective. When dashboards are organized hierarchically on both the service and the resource level, troubleshooting becomes an exercise of divide and conquer. The oncall person can quickly eliminate whole areas of problems and zone in on the real issue. At that point a single service or instance should have been identified, for which more detailed debugging can take place.

Register Now

Percona Live 2019 | Austin, TX – 05.28.19-05.30.19:

Tom Wilkie: Grafana Loki – Grafana Loki: Like Prometheus, But for logs. – Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. It is designed to be cost effective and easy to operate, as it does not index the contents of the logs, but rather labels for each log stream.

Learn More

Monitorama PDX 2019 | Portland, OR – 06.03.19-06.05.19:

Tom Wilkie: Grafana Loki – Prometheus-inspired open source logging – Imagine if you had Prometheus for log files. In this talk we’ll discuss Grafana Loki, our attempt at creating just that.

Learn More

InfluxDays London 2019 | London, United Kingdom – 06.13.19-06.14.19:

David Kaltschmidt – Mixing metrics and logs with Grafana + Influx – Imagine if you had Prometheus for log files. In this talk we’ll discuss Grafana Loki, our attempt at creating just that.

Learn More


We’re Hiring

Have fun solving real world problems building the next generation of open source tools from anywhere in the world. Check out all of our current opportunities on our careers page.

View All our Open Positions


How are we doing?

We’re always looking to make TimeShift better. If you have feedback, please let us know! Email or send us a tweet, or post something at our community forum.

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Everything You Need to Know About the OSS Licensing War, Part 2.

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/03/28/everything-you-need-to-know-about-the-oss-licensing-war-part-2./

Where we left off: AWS had taken the Elasticsearch software and launched their own cloud offering in 2015, and Elastic N.V. had doubled down on an “open core strategy.”

Once AWS decides to offer a project like Elasticsearch, it immediately becomes a truly formidable competitor to anyone trying to do the same, even the company behind the software itself. AWS has huge scale, operational expertise, and various network effects that really compound.

Over time, however, AWS struggled to keep up with all the new versions of Elasticsearch, and all the innovation coming out of Elastic N.V. The version they had originally taken had to be heavily customized, and it fell behind the latest and greatest from Elastic N.V. and the open source community. It became “vintage” – not a good look for software.

To me, this is almost unconscionable on the part of AWS, given the huge revenues brought in by the service. The engineering effort required by AWS would have been minimal. It should have been an easy decision for them to invest in some upkeep and maintenance and at least rebase their code.

Despite these problems, it’s rumored that the revenue of the AWS Elasticsearch service has grown to eclipse the entire revenue of Elastic N.V. AWS had captured more value selling Elasticsearch than the company that had created it. It was a testament to the power of AWS.

An Escalation of Hostilities

Last year, a whole slew of open source companies – including Elastic N.V., MongoDB Inc., Confluent (the company behind Apache Kafka), and Redis Labs (the company behind Redis) – made pretty drastic and sudden changes to their licenses.

Elastic N.V. evolved their “open core” strategy, further blurring the lines between open source and commercial. It started to make some of its commercial software available for “free” to users, and even allowed them to see the source code of that software. But the company carefully added restrictions to its license so that public cloud providers couldn’t do the same.

Some of this code was “open” but not open source. Elasticsearch was walking a very fine line deliberately designed to protect itself from the likes of AWS. Some thought they weren’t being clear enough about what was open and what was not. Had they gone too far?

MongoDB Inc. took a more direct approach, releasing MongoDB in its entirety under a new license, the SSPL. Its main purpose was to prevent public clouds like AWS from using the software to offer a MongoDB service. Was MongoDB even open source anymore? Did the company care? Did the community? The Internet was abuzz.

MongoDB Inc. had previously disclosed to investors that it soon expected 50 percent of its revenue to come from delivering MongoDB as a service. The world was trending toward consuming software as a service, and the license change would prevent anyone else from competing with the company’s ability to offer that. It seemed like a winning strategy.

These commercial open source companies were fighting back – and fighting for their valuations.

They’re faced with an existential question, particularly when they’re VC-funded and burning money: Can they monetize fast enough to command their eye-popping valuations? Can they capture enough of the value that their open source projects create?

The Nuclear Option

The war took on a more sinister and existential tone in early March, when AWS announced its “open distribution for Elasticsearch” (ODE) project, which strives to provide a “truly open source” (Apache2 licensed) distribution of Elasticsearch.

AWS had gone from taking open source to forking it.

The new fork of Elasticsearch will not only power Amazon’s hosted offerings, but it also has the potential to split or shift the center of mass of the open source community away from Elastic N.V.’s own offering.

AWS downplays the fact that this is a fork in their blog:

“Our intention is not to fork Elasticsearch, and we will be making contributions back to the Apache 2.0-licensed Elasticsearch upstream project as we develop add-on enhancements to the base open source software.”

I think that it is a fork, and that AWS is being disingenuous. It’s a fork because of the intentions and the messaging, not because they say their “intention” isn’t to fork. It’s a fork because attempts to merge changes back will be half-hearted at best on both sides.

So the codebases will diverge, and the community has the potential to split. But that’s obviously the implicit threat being made by AWS with this shiny new “distribution.”

The ability to fork is what gives open source its power. It’s an indictment against the leadership and governance of the open source project, a call to arms to the community to choose sides. Indeed, after waxing on how awesome Elasticsearch is, and how it has “played a key role in democratizing analytics of machine-generated data,” AWS spoke out pretty aggressively against Elastic N.V. in the blog:

“Unfortunately, since June 2018, we have witnessed significant intermingling of proprietary code into the code base. While an Apache 2.0 licensed download is still available, there is an extreme lack of clarity as to what customers who care about open source are getting and what they can depend on.”

Elastic N.V. did play a bit fast and loose with how it reframed “open” and how it made the default distribution of Elasticsearch veer more into shareware territory than open source. But the company also went out of its way to make pure open source versions of its software available and to communicate what it was doing. Fundamentally, Elasticsearch is still very much a liberally licensed open source project, and Elastic N.V. has invested significantly to improve it over the years.

And AWS adds insult to injury:

“The maintainers of open source projects have the responsibility of keeping the source distribution open to everyone and not changing the rules midstream.”

So a company that makes most of its money selling a closed source cloud that has lock-in as a goal – and capturing a huge amount of the value created by open source – is preaching to open source companies that they have to stay purely open source? This makes me laugh.

AWS doesn’t “support OSS” as it claims. It just wants to commoditize popular open source software so it can rent out its high-margin computers. AWS simply doesn’t have the political capital in the open source world to make that kind of moral judgement.

But could this be a new dawn at AWS? Are things changing? The company currently has a poor reputation, even compared to other public clouds, when it comes to participating in open source communities. Can AWS get good at running open source projects and actually build a real community of users, not just customers? It’s certainly within the realm of possibility.

What about Elastic? Where do they go from here?

And what about Grafana Labs? How has watching this unfold changed our perspective? Will we start playing the same licensing games used by companies like Elastic and MongoDB? Are we worried about our business?

Stay tuned for the third and final part of the blog, for my own opinions on these topics.

Writing React Plugins

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/03/26/writing-react-plugins/

In this blog post we will go through how you can create plugins for Grafana using ReactJS. This
presumes you have some basic knowledge about writing components in React.

(complete code for the example used in this post can be found here).

In Grafana 6.0 we started the migration to using React in Grafana. This allows you to write plugins
using React instead of AngularJS. We are making it easier to write React plugins by releasing a Grafana component library – the new @grafana/ui npm package. The package is still in Alpha and we are making breaking changes to the React plugin framework, but we want to encourage people to test it and give us early feedback.

Let’s take a look at how you can build your own plugin, using React and TypeScript.

Setup

There are a few things to consider when writing a new plugin. With Grafana 6.0, we need to move our plugins directory
outside of the Grafana project directory. Feel free to put your plugins directory where you usually store code on your computer.
Next, we need to tell Grafana where it should look for plugins. Grafana comes with a defaults.ini file in grafana/conf/, and we can overwrite this by
creating and modifying a custom.ini. So put yourself in the grafana/conf directory and cp defaults.ini custom.ini.

Open custom.ini with your file editor of choice and search for this phrase:

Directory where grafana will automatically scan and look for plugins

Modify the line under that to:

plugins = <path to your plugins directory>
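
In custom.ini that setting lives in the [paths] section, so the end result looks something like this (the path itself is just an example):

[paths]
plugins = /home/you/dev/grafana-plugins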

Restart your grafana-server after this.

Now we’re ready to move on!

The Structure

Grafana needs some basic project structure in your plugin. Grafana will look for a plugin.json located in a src
directory. The plugin.json should contain some information about your plugin; you can read more about it
here.
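
As a rough sketch, a minimal plugin.json for a panel plugin looks something like this (the id and name are just examples; see the docs linked above for the full list of fields):

{
  "type": "panel",
  "name": "Rss Panel",
  "id": "rss-panel",
  "info": {
    "description": "RSS feed panel",
    "version": "1.0.0"
  }
}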

Also within the src directory we need a module.tsx file. In this file, we will introduce the first magic from our
newly-released @grafana/ui package.

import { ReactPanelPlugin } from '@grafana/ui';

import { RssPanel } from './components/RssPanel';
import { RssPanelEditor } from './components/RssPanelEditor';

import { defaults, RssOptions } from './types';

export const reactPanel = new ReactPanelPlugin<RssOptions>(RssPanel);

reactPanel.setEditor(RssPanelEditor);
reactPanel.setDefaults(defaults);

Let’s go through this and figure out what this file does:

  • First off, we’re creating a new instance of a ReactPanelPlugin, which is a class imported from @grafana/ui. We’re
    sending in our option type (in this case RssOptions, which we’ll get to later).

  • Next up we’re setting the editor component for our plugin with the setEditor() function.

  • Lastly we’re setting any default options that we might have.

That’s it!

The Panel

Now we’re at the fun part. This is where you can let your creativity flow. In this example we’re building an Rss-panel,
and what we’re going to need is some kind of table to display our result. We’re going to use an interface exported by
@grafana/ui called PanelProps. This will provide us with the props we need, such as height and width. I won’t go into
any specifics about writing React components, but I will highlight some things that we do to make our panels written in
React work.

Basic setup of a panel class:

interface Props extends PanelProps<RssOptions> {}
interface State {}

export class RssPanel extends PureComponent<Props, State> {}

It’s important to use React’s life cycle methods to make sure your component updates when the props change. We
do this by implementing componentDidUpdate in our Rss-panel example. So when our user updates the url to the rss feed, we will
update the panel to fetch an rss feed from the new url. In this example we’re using a library called rss-to-json to
fetch and transform the rss feed to javascript objects.
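
To make that concrete, here is a minimal sketch that fleshes out the skeleton above (it is not the exact code from the example repo). It assumes RssOptions has a feedUrl field, that RssFeed has an items array with title and link, and it uses a hypothetical fetchRssAsJson helper where the real example calls rss-to-json:

interface State {
  feed?: RssFeed;
}

export class RssPanel extends PureComponent<Props, State> {
  state: State = {};

  componentDidMount() {
    this.loadFeed();
  }

  componentDidUpdate(prevProps: Props) {
    // Reload only when the feed url option actually changed
    if (prevProps.options.feedUrl !== this.props.options.feedUrl) {
      this.loadFeed();
    }
  }

  loadFeed = async () => {
    // fetchRssAsJson is a placeholder for your rss-to-json call
    const feed = await fetchRssAsJson(this.props.options.feedUrl);
    this.setState({ feed });
  };

  render() {
    // PanelProps gives us the panel's width and height
    const { width, height } = this.props;
    const { feed } = this.state;
    return (
      <div style={{ width, height, overflow: 'auto' }}>
        {feed && feed.items.map(item => (
          <div key={item.link}>{item.title}</div>
        ))}
      </div>
    );
  }
}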

The Panel editor

For adding options to Plugins, we’re using a concept called Editors. In this example we’ll create a component called
<RssPanelEditor />. We have an interface for Editors in @grafana/ui as well, called PanelEditorProps. If we
provide our options type to this interface, we will have the onChange method available for updating our panel when
we change the options.

export class RssPanelEditor extends PureComponent<PanelEditorProps<RssOptions>> {
  onUpdatePanel = () => this.props.onChange({
    ...this.props.options,
    feedUrl: 'this new rss feed url'
  });
}

Types

We strongly encourage you to use types in your panel. This makes it easier for you and others to spot potential bugs. In
this example we’ve added some types for RssFeed, RssFeedItem, and RssOptions. These are located in src/types.ts.

Building

To be able to load the plugin, Grafana expects the code to be in plain JavaScript. We’re using webpack for the build step to transpile TypeScript to JavaScript in our RSS-plugin example.

Testing

Start your grafana-server again, and make sure that your plugin is registered.

Registering plugin logger=plugins name="Rss Panel"

Add a new panel to a dashboard, and locate your new panel in the visualization picker.

Tinder & Grafana: A Love Story in Metrics and Monitoring

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/03/26/tinder--grafana-a-love-story-in-metrics-and-monitoring/

Tinder is the world’s most popular dating app, with more than 26 million matches made each day. But two years ago, when it was time for the L.A.-based company to find and implement a perfect metrics monitoring partner, the process proved to be more of a slow-burn love affair than a whirlwind romance.

“They say Rome wasn’t built in a day. Same could be said for Tinder’s Grafana infrastructure,” Tinder Observability Software Engineer Wenting Gong told the audience at GrafanaCon 2019 in L.A.

Since it was founded in 2012, Tinder has grown to more than 320 employees and utilizes Grafana to monitor hundreds of microservices as well as the containers inside its Kubernetes environment.

Here’s how Tinder’s relationship with Grafana evolved.

First Impressions: Swiping Right on Grafana

In 2017, Tinder had about 50 microservices running on Amazon EC2 instances, supported with CloudWatch monitoring and Elastic load balancers. The Cloud Infrastructure Team came together with the Observability Team to start building an internal infrastructure that monitored the overall health of all of Tinder’s services.

“Because everything was running on AWS, AWS CloudWatch could provide those helpful metrics,” said Gong. “Grafana is a very popular open source tool for the data analytics and visualization for applications and infrastructure, so they decided to enable our CloudWatch on those instances and on the Elastic load balancers and the auto-scaling groups as well.”

By pulling metrics from the CloudWatch data source (which is native to Grafana), Grafana allowed engineers to view and check their service metrics directly and created a centralized location to access real-time metrics. In short, Tinder liked Grafana’s profile, swiped right, and enjoyed its early interactions with the monitoring system.

Meeting the Family: Introducing Prometheus

It didn’t take long for Tinder’s backend engineers to decide to get to know Grafana better too. They wanted to use it for monitoring the efficacy of their services, creating dashboards to track P95 latencies or request status updates.

When looking at the scope of time series database options, the Cloud Infrastructure Team investigated OpenTSDB (“It requires Hadoop, which didn’t fit our scenario,” said Gong), Graphite (“It’s hard to scale for an open-source solution”), InfluxDB (“Its clustering feature is available as a paid version”), and Nagios (“It’s one generation old”).

“Prometheus comes without any of those outstanding drawbacks, and it provides great client libraries so developers could use this to expose their application metrics from their code directly,” Gong said. “So Prometheus was the best option for us.”

A single Prometheus server was launched to pull metrics for all of Tinder’s services, and Grafana was used to monitor detailed service metrics as well as CloudWatch host metrics. But the honeymoon phase didn’t last long.

Within a few months, backend engineers were finding that data was being dropped from time to time as Tinder’s business continued to grow at a rapid rate. One Prometheus server was just not enough to carry the load of the dating app’s booming services.

The solution was to create a more scalable infrastructure that involved assigning individual Prometheus servers to each service within Tinder’s suite and creating a separate dashboard on Grafana for individual services. Not only did this solution make it easy to scale an individual Prometheus server based on a service load, but “with different Prometheus servers and recording rules enabled, this infrastructure also enhanced the current performance for those frequently used expressions and those computationally expensive expressions,” said Gong.

Getting Serious: Committing to Long-Term Data

As monitoring proved to be of value to Tinder’s business, the Backend Team wanted access to metrics for longer periods of time.

“We were only keeping metrics for several days for each individual service,” explained Gong. “But [engineers] still wanted to check some historical metrics for important data.”

There were two options to solve this problem. “One was to increase the retention period for all of Tinder’s Prometheus servers, which is super easy,” said Gong. But because not all data was needed for long-term retention, “the money and resources would be wasted on unnecessary metrics.”

The Observability Team then suggested launching a separate Prometheus server strictly for archiving key metrics, which engineers could readily access as needed.

The newly created server would pull metrics from all the individual Prometheus servers and expose itself as a separate data source on Grafana. “Also we have the data auto-discovery service running inside this archive server so that whenever a new service came up or one got removed, the data source, targets and endpoints will always be updated on the archive server side as well as the Grafana side,” said Gong.

“Even though this way requires the extra setup and the module owners need to understand and update the existing configuration,” Gong added, “this helps us use the resources and the budget in a more efficient way.”

Moving in Together: Migrating to Kubernetes

As Tinder’s microservices started to add up to the hundreds, the Cloud Infrastructure Team decided it was time to move to a Kubernetes environment, which would help engineers deploy and manage applications at scale. “This way, it will help give our developers more velocity, efficiency, and agility, and it will also help us save some money,” said Gong.

As a result, the Observability Team had to figure out how to execute and support the current monitoring infrastructure within Kubernetes, which involved different services running in different namespaces for different clusters.

Because developers are already familiar with using Prometheus to expose application metrics, it didn’t make sense to implement a new system. Instead, Tinder launched the Prometheus Operator inside its monitoring namespace which manages, creates and configures Prometheus instances within the Kubernetes environment. “The service monitor will help select the targets we’re trying to monitor, and the Prometheus service itself will pull the metrics from those service monitors,” said Gong. “This Prometheus service endpoint is also exposed as a separate Prometheus data source to Grafana.”

Within the monitoring namespace, there is also a separate sidecar service set up specifically for the archived metrics to collect the Prometheus data and put them into a separate MySQL database. “The archive server, which is in a traditional EC2 instance, will go and check those database results, pull those target metrics in, and expose itself as a separate Prometheus archive data source,” said Gong.

And what if there are any emergency issues in the Kubernetes environment? The Observability Team set up a monitoring system for the Kube cluster itself. “We launched some node exporters, a kube-state-metric exporter in that monitoring namespace and used a separate Prometheus cadence to pull those Kube-related metrics in,” said Gong. “This Kube-related cadence is also a separate data source for Grafana to plot the graphs related to the Kube states.”

Tinder also uses Prometheus exporters to monitor other infrastructure components such as Elasticsearch and Kafka.

With Tinder’s infrastructure settled into its Kubernetes environment, engineers can now easily monitor test environments. “But it’s a headache for our engineers to duplicate those dashboards for different environments for the similar modules,” Gong said.

To reduce the manual creation work, Tinder uses many of the APIs supported by Grafana. Said Gong: “We went ahead and developed customized dashboard automation utilities, with the help of GrafanaLib, in Python using an HTTP client-wrapper that will help when duplicating similar dashboards for different clusters and for different environments.”

In short, it’s a relationship built to last.

Check out more sessions from GrafanaCon2019 on YouTube.

What’s New in Prometheus 2.8: WAL-Based Remote Write

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/03/25/whats-new-in-prometheus-2.8-wal-based-remote-write/

Six months in the making, Write-Ahead Logging (WAL) for the remote_write API was one of the enhancements we included in the Prometheus 2.8 release on March 12. It’s a change intended to safeguard client metrics in the face of any network issues.

The remote_write API allows you to send data from Prometheus to other monitoring systems, including Grafana Cloud. The previous implementation hooked into Prometheus’s metric scraping and was given copies of all the samples Prometheus scraped, sending them out to configured remote write endpoints.
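
For reference, remote_write is configured in prometheus.yml as a simple list of endpoints; a minimal example looks like this (the URL is a placeholder, and optional settings such as basic_auth and relabelling are covered in the Prometheus docs):

remote_write:
  - url: https://remote-storage.example.com/api/prom/push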

If the remote endpoint was down or Prometheus was unable to reach it for any reason, there was only a small in-memory buffer in place – which proved problematic for two reasons: the data could back up, end up using too much memory, and causing Prometheus to OOM. Or if we got to the maximum size of the buffer defined by the existing configuration, we would start dropping data.

WAL-based remote_write

In the latest 2.8 Prometheus release, instead of buffering the data in memory, the remote_write now reads the write-ahead log. Before committing data to long-term storage, Prometheus writes all of the transactions that are occurring – samples that have been scraped and metadata about new time series – to a write-ahead log.

So if the endpoint is having an issue, we simply stop where we are in the write-ahead log and attempt to resend the failed batch of samples. It won’t drop data or cause memory issues because it won’t continue reading the write-ahead log until it successfully sends the data. The 2.8 update effectively uses a constant amount of memory, and the buffer is virtually indefinite, depending only on the size of your disk.

Another key enhancement we made involves cases with a single batch of data: If that send fails, the system no longer encodes that data every time we attempt to resend it. We simply encode it once and then keep sending it until it succeeds.

Corner cases

Initially, this was pitched as a two- to four-week project. But what happened along the way was that there were a lot of little edge cases that came up. The new approach broke assumptions in the existing code around locking, parallelization and concurrency that needed addressing.

For example, the existing code within Prometheus assumed that all write-ahead log files were fully written. So one of the major things we had to do was write code for the new remote write to read the write-ahead log files as they were being written.

Also, the existing reader silently accepted files with certain errors that the code would eventually repair. We developed the new reader on the assumption that there weren’t any corruptions in the files. In reality the WAL can become corrupted in many ways, so we had to add techniques to deal with this.

But the work was worth it. Thus far, the update has gotten a positive response – CPU usage is down and memory usage is much more predictable. This being a pretty big change, I expected there could be some issues with it. So far there have only been a few reported issues related to WAL corruptions, which we have ideas on how to address. If you have feedback, please feel free to open issues on GitHub.
