All posts by Blogs on Grafana Labs Blog

Meet the Grafana Labs Team: Peter Holmberg

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/14/meet-the-grafana-labs-team-peter-holmberg/

As Grafana Labs continues to grow, we’d like you to get to know the team members who are building the cool stuff you’re using. Check out the latest of our Friday team profiles.

Meet Peter!

Name: Peter Holmberg

Grafana Labs Frontend Developer Peter Holmberg

Current location/time zone:
I live in Stockholm, Sweden, GMT+2.

What do you do at Grafana Labs?
I mainly work as a frontend developer, but I’ve been lurking in some backend code lately.

What open source projects do you contribute to?
I’ve made small contributions to a few repos in the past, but not so much anymore. Right now I’m working on the grafana.com website. It hasn’t gotten much love lately, and we want to bring it up to speed so our users can find the information they’re looking for as well as find the best dashboards and plugins.

What are your GitHub and Twitter handles?
peterholmberg on GitHub and @peteholmberg on Twitter.

What do you like to do in your free time?
I’ve been active in sports ever since I was a small kid on an ice rink. I fell into cycling a few years ago, and that interest has spiraled out of control since then. I hadn’t really been into cardio sports before, but it’s a nice mix between lots and lots of cardio training and analyzing data from the power meter and heart rate monitor (hello, time series!). The almost mandatory coffee with cinnamon bun after a ride is a plus as well. During the winter I play ice hockey with some friends when the weather allows.

What’s your favorite new gadget or tech toy?
I bought parts and built a gaming PC this winter. After a few years away from gaming, I realized I missed it. I’m currently playing Fallout 76 when time allows.

What’s the last thing you binge-watched?
Kind of a guilty pleasure, but I’ve been stuck watching a Swedish ‘90s soap opera called “Rederiet” for the last few weeks. I’ve seen 6 of 20 seasons now!

An Open Technology Stack for Industrial IoT

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/12/an-open-technology-stack-for-industrial-iot/

AMMP Technologies runs monitoring for energy systems, usually off-grid mini-grids in Africa. The company uses Grafana to monitor and interface with physical objects that are not servers or containers. “It’s interesting how a toolkit for visualizing essentially internet/computer/server metrics is so well-suited to working with real-life streaming data,” AMMP Cofounder Svet Bajlekov said during his talk at GrafanaCon L.A.

In fact, AMMP built its whole stack, not just the visualization part, to be open source, and Bajlekov shared his vision of how other industrial IoT organizations can do the same.

AMMP works with electrical substations, and the SCADA systems that are used for running and monitoring them were built a couple of decades ago. “They were not initially built with the Internet in mind,” Bajlekov said. “Today, we live in a place where most things are getting connected, and with that, it’s no longer simply safe to assume that a sensitive piece of infrastructure is going to be safe anymore.”

Bajlekov gave an example from a couple of years ago, when Russian hackers allegedly got into the control center of the utility in Kiev, tripped the switches on 60 substations, and took the power out for 250,000 people. “Everything is getting connected, and it’s pretty clear that the attack vectors, the attack surface, and the vulnerabilities are growing with that,” he said. And according to one survey, 51% of companies don’t feel prepared to deal with those vulnerabilities.

A Vision for the IIoT Stack

“Now we’ve got the picture of critical equipment that’s being operated by pretty ancient software that was built behind closed doors and for operations that happen behind closed doors, which is neither secure nor particularly flexible,” Bajlekov said. Everything is essentially airgapped, by necessity.

His goal is a secure, open, and extensible technology stack – and it’s within reach.

“We do have more devices coming online, which is challenging, but they also nowadays have a lot more processing power than they once did, and there’s a lot more bandwidth than we once had,” he said. “We can be a little less kind of frugal with the resources that we use for this communication. So why don’t we embrace the internet and the standards and the protocols and best practices on which it has been successfully built over the past decades?”

Adapting those best practices means assuming that everything that you work with is online and figuring out how to encapsulate everything the right way. Bajlekov offered this diagram of a possible architecture for industrial IoT:

Open IIoT Architecture

The diagram shows:
1. A real-world system that you want to talk to – some industrial product.
2. An edge gateway device.
3. Following internet best practices, an MQTT or HTTPS API is used to connect the gateway to some endpoint.
4. A data store of time series for metrics.
5. Something for managing devices.
6. Analytics and visualization.

“That endpoint is going to be working with those data stores to provision the edge devices,” he said, “and then from there you’re going to probably want to do some analytics and some visualization on all of that.”
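
To make the gateway-to-endpoint hop (step 3) concrete, here is a minimal sketch of an edge gateway publishing readings over MQTT with TLS, using the paho-mqtt client. The broker address, topic layout, and payload fields are illustrative assumptions, not AMMP’s actual setup.

```python
# Sketch: edge gateway pushing readings to an MQTT endpoint over TLS.
# Broker, topic, and payload fields are hypothetical.
import json
import time

import paho.mqtt.client as mqtt  # pip install paho-mqtt

BROKER = "mqtt.example.com"      # hypothetical cloud endpoint
TOPIC = "sites/plant-01/metrics"

client = mqtt.Client()
client.tls_set()                 # encrypt in transit, per internet best practices
client.connect(BROKER, 8883)
client.loop_start()

while True:
    reading = {
        "timestamp": int(time.time()),
        "battery_voltage": 48.7,  # would come from the real-world system
        "pv_power_w": 1250,
    }
    client.publish(TOPIC, json.dumps(reading), qos=1)
    time.sleep(30)
```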

To accomplish all of this, he said, “it obviously makes sense to go to open source building blocks, just the way that the rest of the plumbing on the internet has been built up out of systems like that, with the open interfaces and fully interoperable components.”

That’s all possible, he adds, because over the past few years, enterprises have increasingly embraced open source. “It’s actually a way for large companies to de-risk their vendor dependencies and to ensure the continuity of their operations if something goes wrong with a particular vendor,” he said. “Mature open source projects actually have a pretty high level of suitability for security requirements because basically it’s all out in the open. People tend to look at the code base a lot more than they would for a closed-source system… And if you have a good project, you probably have interfaces that are well-designed, well-documented, and have flexibility and robustness in mind.”

The Open Source IIoT Stack

Bajlekov then offered his suggestions of what the stack could look like:

Open IIoT Architecture

“I think that Grafana really stands out in terms of its best practices for extensibility, interoperability,” he said. “You know, it can do everything via an API, and the way that Grafana approaches the ecosystem is what I feel that ought to be replicated across this chain.”

LoudML, which sits alongside Kapacitor in the analytics layer, applies machine learning to time series data for anomaly detection.

“You’ve got all this data streaming in, and you want to make sure that you’re making sense of it,” he said. “We’re getting tools that are fully open source that just literally plug into everything else that’s going on here and can get you these great insights off the shelf.”

EdgeX Foundry, a Linux Foundation project, is designed to be a fully-interoperable, vendor-neutral microservices framework for IoT edge computing. “The ethos here basically is API first,” he said. “Everything interconnects with everything else via very well-documented interfaces, and you can plug and play different microservices that get you the functionality that you need and allows you to interface with your actual physical objects.”

Bajlekov mentioned that EdgeX is looking into having an implementation that runs on PLCs so that data can be pushed to the cloud directly rather than going through the edge devices.

Where Do We Go From Here?

The barriers to adopting this model are real; utility companies tend to be fairly conservative. Plus, not all of the functionality has been fully worked out. “While a lot of this stack is great for pulling metrics out of a system, processing and visualizing them, there’s not really been an established industry standard for doing end-to-end configuration and control of these remote devices,” he said.

Still, he concluded, “The ecosystem shaping up gives me a lot of hope that things are going to move in a direction where we have a more open, more interoperable, kind of value chain of different components that work together.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

How Not to Fail at Visualization

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/10/how-not-to-fail-at-visualization/

As a longtime systems engineer, Blerim Sheqa knows all about using tools like Grafana to debug issues in infrastructures. Currently the CPO of Icinga, the open source monitoring software, he gave a talk at GrafanaCon LA about how not to fail at visualization.

There are common pitfalls that can lead you to spend hours “hunting ghosts,” said Sheqa. “Wrong data visualization leads us to misinterpretation. We think we found something in a graph that looks strange, but actually just the graph was strange.” Here are the things to look out for:

Conventions

“I searched the web for the worst graph I could find,” Sheqa said. He came up with this one from a 2005 newspaper about a gun law in Florida.

Gun Deaths

The graph seems to show that the rate of murders by guns dropped after the law was passed. “But if you look closely, you can see on the left side that actually they just flipped the graph,” he pointed out. “It’s technically correct, but it doesn’t follow conventions.”

Lesson #1: “Following conventions is one important part when visualizing data, especially when using this data to debug any issue in your infrastructure.”

Another example is this load graph:

Load Graph 1

“We can see that the load drops and sometimes increases and looks normal at first glance,” Sheqa said. “What we cannot see here is that on the left side, the Y axis starts at about 60. So the load is actually pretty high, but because we didn’t set the correct minimum, we cannot see it at first glance. We believe that the graph shows us a pretty normal load on the server, where it actually is pretty high, depending on the hardware obviously.”

The graph should look like this:

Load Graph 2

Lesson #2: “Use proper labeling, setting minimum and maximum where it is necessary and setting the correct values and describing the data that we see.” You may know the data, but you have to make sure that your coworkers can understand what the graph shows, too.
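
As an illustration of that lesson, here is roughly what the relevant axis settings look like in the classic Grafana graph panel JSON; the label and unit values are example choices, not taken from Sheqa’s dashboards.

```python
# Illustrative fragment of a graph panel's axis settings that follows Lesson #2:
# pin the Y-axis minimum, label the axis, and give it a unit.
load_panel_axes = {
    "yaxes": [
        {"label": "1m load average", "format": "short", "min": 0, "show": True},
        {"show": False},  # unused right-hand axis
    ],
    "xaxis": {"mode": "time", "show": True},
}
```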

Comparability

This graph of memory usage on a server is stacked, which causes problems. You can’t see how much memory is free on the server.

Memory Usage

“That is because the stacking like it is done here is completely useless,” said Sheqa. “You have to figure out what are the labels on the bottom, what is actually the free memory? Is it increasing or is it decreasing?”

The data is clear in this visualization:

Memory Usage 2

“It’s still stacked, but I just flipped the values, and you can see at first glance that the memory is actually increasing,” he explained. “I put the total free memory to the background so you can still see it, but it’s not the first thing that comes into your eye.”

Lesson #3: Highlighting the important parts of a graph is key to good data visualization.

The following graph is about requests.

Requests

“It goes up and down, and it looks fancy and shiny, good on our dashboard. We can put it on a TV screen,” he said. “But actually we don’t know anything about the data that is shown here. Is it many requests? Are they going up or down? What is happening here?”

Here’s a better version of it:

Requests 2

Lesson #4: “Using the grid is the most important thing I figured out in the past,” said Sheqa. “It changes so much if you actually use the most common things that are out there, like grids and the Y axis, because now we can see how much the difference between the most bottom and the most up value actually is.”

The following CPU graph “tells us there is a peak somewhere,” said Sheqa, “and for comparing things, this is actually not the best thing that you can do because in this graph there are many CPUs merged into one graph and averaged. So just by looking at averages, we cannot tell any details about the behavior.”

CPU 1

“Split the graph like you should do, because you have more than one CPU,” he said. “In this case it’s four, and you can see that each CPU behaves completely differently.”

CPU 2

Lesson #5: “Grafana has this feature where we can just repeat panels or repeat entire rows. You should make use of that when you have something like CPUs or network statistics.”

Readability

Graphs should always be readable for everyone in your organization, not just the person who created them.

“You can add more features from Grafana like adding annotations, which add even more context to your graph so you can better understand what the graph is actually showing to you,” said Sheqa.

The following load graph has annotations about when monitoring sent an alert because of a failing service.

Load Graph Alert

“You can not only see what is happening to the load, but you can also see, when did the monitoring alert us. At which point and at which time frame do I have to look at the graphs?” he said.

On the other hand, there is such a thing as too many annotations and too much context. For example, this single dashboard had to be split into three columns for the slide:

Load Graph Alert

“It’s pretty nice looking; you can see many colors and many shapes and many things, but for someone that is not exactly into the data, it’s pretty useless,” said Sheqa.

Lesson #6: “You always should care about how much information you add to one single dashboard or to one single graph.”

Know Your Data

Last but not least, Sheqa pointed to the most common pitfall in visualization: creating graphs when you don’t understand the data.

“In my opinion, it’s absolutely necessary that you know what data you are collecting and understand each and every metric that you are collecting, so you can actually use it for debugging,” he said. “The best dashboards cannot help you to find an issue if you do not understand the underlying data that you collected before. When we think we know the data, but we don’t actually know them, this leads us usually to building wrong graphs and wrong dashboards. And this again leads us to misinterpretation.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Meet the Grafana Labs Team: Dominik Prokop

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/07/meet-the-grafana-labs-team-dominik-prokop/

As Grafana Labs continues to grow, we’d like you to get to know the team members who are building the cool stuff you’re using.
Check out our latest Friday team profile.

Meet Dominik!

Name: Dominik Prokop

Grafana Labs Developer Dominik Prokop

Current location/time zone:
I’m based in the capital of Poland, Warsaw (CEST).

What do you do at Grafana Labs?
I work as a frontend engineer, focused primarily on React migration and @grafana/ui, but I’m also involved in Explore.

What open source projects do you contribute to?
Grafana is my first experience being deeply engaged in the open source community. And since it’s such a complex product, it’s hard to make yourself an active contributor in another one. 🙂 But I’m always happy to contribute by reporting issues and bug fixes.

What are your GitHub and Twitter handles?
It’s dprokop on GitHub and @prokopd on Twitter (but I’m far from being a legitimate Twitter user ;)).

What do you like to do in your free time?
I love photography, trying to keep my Fuji camera & 35mm lens always with me. I’m also interested in architecture and interior design. And, during weekends, I like to put an apron on and bake some cake!

What’s your favorite new gadget or tech toy?
Being a minimalist (or essentialist), I’m very far from chasing all the new tech toys that pop up every day. But, after spending some time arranging my new apartment recently, I fell in love with the Sonoff basic smart switch, which is an insanely cheap way of making your home, well, smarter. 🙂

What’s the last thing you binge-watched?
Street Food documentary series by David Gelb on Netflix. Every foodie’s must-see!

Do you like to code in silence, with music or ambient noise on your headphones, or with people talking around you?
It really depends on the day, but working remotely, mostly from home, silence is my friend. 🙂 And when it gets too silent I turn some techno, ambient, or piano music on (Nils Frahm, oh boy, he’s so good) and keep on crunching!

Monitorama Preview: Observability Talks

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/04/monitorama-preview-observability-talks/

The Monitorama Conference kicked off this week in Portland, Oregon, and Grafana Labs’ Tom Wilkie will be presenting his talk, “Grafana Loki: Like Prometheus, But for Logs,” tomorrow.

If you’re not at Monitorama, check out our recap blog post of Tom’s presentation on Loki at KubeCon, and watch the video of it here:

Yesterday at Monitorama, Dave Cadwallader, Site Reliability Architect at DNAnexus, reprised the popular talk on observability that he gave earlier this year at GrafanaCon, and added a new element: a beginner’s guide to Prometheus.

“This is about teaching Grafana to kids and what I learned in the process,” he said at GrafanaCon. “But it’s also about how we can be more empathetic in how we teach fellow grownups and how we can have more fun and hack our own brains to learn as fast as children can.”

In case you missed it, here’s a recap of that talk.

Cadwallader recounted that sometimes his young children would pass by when he was looking at a Grafana dashboard on his computer, and they’d be riveted by what was on his screen. He would usually be triaging an incident, and wouldn’t have time to answer their questions immediately. But he started wondering how he could explain monitoring to his 4- and 6-year-old kids.

He started by buying some inexpensive wireless temperature and humidity sensors, which communicate using a common radio frequency, 433 megahertz, that’s used by a lot of things in a typical home. “There’s this cool little $20 receiver that you can plug into your Raspberry Pi that will actually receive radio frequencies and turn them into digital signals that then you can pipe to a Linux process and do just about whatever you want with them,” he said. In this case, he used it to get the signals from the temperature sensors.

An Introduction to Grafana

Each sensor would broadcast its own unique sensor ID, which the family would record on the sensor or battery. “And every 30 seconds, these send out a little ping with their sensor ID, the temperature, and the humidity levels,” he said. “From there, all we had to do was send that data to InfluxDB and graph it with Grafana – and the kids had their first dashboard to play with.”
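
A minimal sketch of that pipeline, assuming readings arrive as JSON lines from a 433 MHz decoder such as rtl_433: each line is parsed and written to InfluxDB, where Grafana can graph it. The measurement and field names are illustrative assumptions, not the actual code from the talk.

```python
# Sketch: pipe decoder output into this script, e.g.
#   rtl_433 -F json | python ingest.py
import json
import sys

from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="home")

for line in sys.stdin:
    reading = json.loads(line)
    client.write_points([{
        "measurement": "climate",
        "tags": {"sensor_id": str(reading["id"])},
        "fields": {
            "temperature_c": float(reading["temperature_C"]),
            "humidity": float(reading["humidity"]),
        },
    }])
```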

The first experiment he did with the kids was monitoring which rooms in the house were warmer or more humid. But “we live in Colorado, so pretty much everything is zero humidity, dry as a bone,” Cadwallader said. “So that wasn’t too exciting. But we started to learn about concepts like zooming in and zooming out and understanding the X and Y axis and plotting lines from those points.”

One room that they assumed would have a pretty constant temperature ended up having a sawtooth pattern. “So we used our eyes and our ears and we noticed that whenever the furnace would kick on, we’d have this little increase, and the furnace would turn off, and we’d have this little decrease,” he said. “So that started to make sense. And this was another example for the kids about making a connection between the environment around them and the data that they observed on the graph.”

Sawtooth

Looking at Historical Data

The next exercise was monitoring the family’s two pet cats, Hobbes and Rooibos. “We were worried that, you know, what if they were cold at night, and did these cat beds really work to keep them warm?” Cadwallader said. “So we thought, ‘Hey, we could actually use Grafana to answer this question.‘”

They put sensors in the cat beds and graphed the two cats.

Cats

The next morning, Cadwallader went over the graph with his kids and asked them what they thought was happening that caused the temperature to go up or down. The verdict: “Rooibos, who’s the much fluffier cat, had a much steeper slope towards her peak bed temperature, and Hobbes, who’s the short hair, was a little bit slower to get there, but they eventually converged,” he said. “And we can see that Hobbes actually got up to have a late night snack and then got back in bed.”

And the lesson for the kids was how to look at historical data to piece together a story of what may have happened – “just like analyzing a production incident by looking at historical data,” he said. “The kids were pretty amazed to learn that this kind of thing is a big part of what I do at work all day. But now it was approachable. So after we had a week’s worth of data, we could zoom in and start to learn about trends, and we talked about how the two cat beds were correlated and followed the same shape day after day.”

Sampling and Thresholds

The last experiment involved measuring the movement of a small trampoline that the kids bounced on. Cadwallader bought an ultrasonic range finder, which “measures the round trip of how long it takes a signal to bounce back off of something, and then you can send that to your Raspberry Pi,” he said. “The Raspberry Pi has these nice little pins called GPIO where you can plug in just about any sort of electronics. And so the kids helped me wire this up, and it was really fun.”

The first graph they created looked like a sawtooth (which his son Boden had predicted based on their first experiment). Playing with sampling and thresholds, “we figured out that if we took a sample every 50 milliseconds that it would filter out all of those noisy, tiny little peaks and valleys,” he said. “And it wound up with something that looked like this where we could count just peaks and valleys that were in between jumps.”

Trampoline
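
A rough reconstruction of that sampling-and-threshold idea is sketched below. The 50-millisecond sample period comes from the talk; the threshold value and the code itself are illustrative assumptions, not the project’s actual code.

```python
# Count jumps from down-sampled range-finder readings (one sample every 50 ms):
# a jump is counted each time the distance crosses a threshold on the way up.
def count_jumps(samples, threshold_cm=10.0):
    jumps = 0
    above = False
    for distance in samples:
        if distance > threshold_cm and not above:
            jumps += 1      # rising edge: the trampoline surface moved away
            above = True
        elif distance <= threshold_cm:
            above = False
    return jumps

# A noisy bounce pattern still counts as three clean jumps.
print(count_jumps([2, 3, 12, 14, 3, 2, 15, 13, 2, 11, 3]))  # -> 3
```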

“I, as the teacher, got to have an experience of learning from my student,” he said, “and being like, ‘Oh yeah, that idea you had before, that’s totally something we could use now.’”

Lessons Learned

Cadwallader said that teaching his kids about monitoring gave him greater insight into how to work with his colleagues more effectively.

From watching his son play with Grafana, he said, he realized that “there’s not always a right or wrong way to do it. It’s more of an art than a science.” And sometimes it’s best to just let someone work through his or her own art of graphing.

Especially when they’re enjoying the process. “Sometimes in education or even at work, there’s extrinsic versus intrinsic motivation,” he explained. “Extrinsic is where you do something to get a reward. Intrinsic is where you do something just for the joy of doing it. Boden was so enamored of Grafana that he just wanted to use it for experiments of his own invention. So I thought, ‘Wouldn’t it be cool if we could bring this excitement back to work as well?’”

Cadwallader organized a game day at his office, during which “you get everybody in a room and you try to break stuff, and we elected a member of our team to be the chaos monkey and try to shut things down and kill processes,” he said.

For some people, it was their first real exposure to Grafana. “The joy on their faces when they made a prediction about what was going to happen when we messed something up, and then seeing it in Grafana was actually really, really cool,” he said. “And I got a lot of feedback from people like, ‘Hey, I learned so much more from doing this than I did just from reading the docs.’”

The takeaway? “When we have fun while we’re learning, we learn better.”

His overall advice: “Explain like they’re five, and learn like you’re five,” he said. “So you’re all the subject matter experts on Grafana and observability, and many of you are the go-to person in your org. But when you’re helping somebody do something with Grafana or observability, try to resist the temptation to do all the work for them just because it’s easier. Let them tinker, let them struggle. And if you can be there next to them, when they have that ‘a ha’ moment, when it clicks, you can share in that emotional learning moment.”

How PostgreSQL and Grafana Can Improve Monitoring Together

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/06/03/how-postgresql-and-grafana-can-improve-monitoring-together/

TimescaleDB is an open source database packaged as a Postgres extension that supports time series, but “it looks as if it were just Postgres,” said Timescale’s Head of Product, Diana Hsieh. “So you can actually use the entire ecosystem. You can use all of the functions that are enabled in Postgres – like JSON indexes, relational tables, post JSON – and they all work with Timescale.”

Last year, Timescale contributed the PostgreSQL query builder to Grafana v5.3. The new visual query editor for the PostgreSQL datasource makes it easier for users to explore time series data by improving the discoverability of data stored in PostgreSQL. Users can use drop-down menus to formulate their queries with valid selections and macros to express time series specific functionalities.

“The reason why we decided to build that was because it really makes it easier for people with varying levels of SQL ability – from beginners to advanced users – to actually write SQL queries,” Hsieh told the crowd at GrafanaCon.

The query builder allows users to auto-fill different columns so that “people who work more in devops and maybe only use PromQL can actually use a query editor that looks more familiar,” Hsieh explained.

The tool also reflects the permissions given to each Grafana instance. “Users will be able to know which table they can query within a database without having to actually know what the underlying schema is in the actual database,” said Hsieh. (For more about the query builder’s features, check out this blog post.)

But is SQL even a feasible solution for time series monitoring systems? At GrafanaCon, Hsieh debunked all the doubts around SQL during her talk.

Not a lot of people are familiar with SQL

One of the main reasons why Hsieh says SQL is “so awesome” is that it’s actually a really well-known querying language.

According to a recent Stack Overflow survey, SQL is ranked fourth among the most popular technologies for programming, scripting, and markup languages. “I also did a quick search on LinkedIn about how many people know SQL versus do not list SQL in their profiles, and it’s nine million versus a couple hundred thousand,” said Hsieh. “I think this is really telling.”

Hsieh also points out that business analysts know SQL. “I knew SQL before I started coding,” said Hsieh. “It’s actually a really well-known language that spans across an organization.”

Finally, two other distributed non-SQL systems are actually introducing SQL. “Elastic has an SQL-like language, and Kafka even has a little logo for KSQL,” said Hsieh. “This just tells you that even some of the non-SQL databases also really like SQL.”

There are new database languages to invest in

The value of implementing SQL goes beyond technical benefits.

“It’s a cost consideration,” said Hsieh. “At the end of the day, what we’ve come to realize is that data isn’t just about being able to store it. You also have to be able to access it, and this happens at all of the layers of a given architecture.”

On a micro level, for example, KDB is a niche database that financial companies tend to use, and because KDB developers are harder to find, they earn considerably more money. “If you think about the database that you’re choosing to be a foundation for your organization, consider how expensive it is to get an SQL developer compared to a KDB developer.”

And if you take a look at the bigger picture, “if you choose a language that people don’t know, you’re going to have to train your entire organization,” said Hsieh. Or else, you risk creating different silos where some people know how to access sources of data that others don’t.

“It’s not just the developers and business analysts who are touching SQL,” said Hsieh. “It’s also the people who know how to manage SQL, people who know how to build relational schemas, people who know how to operate Postgres. It’s a larger ecosystem. So it’s really an investment in choosing something that expands across an organization and lasts a little longer.”

SQL doesn’t scale

Hsieh admitted that this was “the elephant in the room.”

“I would say that I kind of agree with that statement,” said Hsieh.

Traditional SQL databases are optimized for transactional semantics; time series databases are not.

With time series data, there’s continuously new data being ingested over time, which requires more storage. But in a Postgres server built for a transactional use case, the ratio of CPU, RAM, and storage will be dramatically different. As a data set grows, the ingestion in Postgres will decrease dramatically because the index gets so big it can’t fit in RAM anymore.

Still, Hsieh remains dedicated to the 20-year-old database language because, as she put it, “we think boring is awesome with databases.”

So Timescale implements Postgres at the storage layer. “Postgres is so interestingly extensible that we’ve actually ripped out how schemas are managed, and we’ve provided our own abstraction layer,” explained Hsieh. “What the abstraction layer does is take a large table that has a lot of times series in it, and it will break everything up into chunks based on time intervals. So it automatically partitions data.”

The index is, therefore, not built on the entire table. “So now when you’re inserting data, all of those indexes built in a chunk can actually live within memory,” said Hsieh. “It speeds everything up, and you can write things a lot faster at the query planner level.”

With Timescale, when users write into a table, the partitioning happens automatically underneath. Time series queries are routed through a query planner optimized for time series, while standard SQL queries go through the standard query path.
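
A minimal sketch of that model, assuming a hypothetical cpu table: a plain Postgres table becomes a hypertable that Timescale partitions into time chunks automatically, and a Grafana-style time series query aggregates it with time_bucket().

```python
# Sketch: create a hypertable and run a time series query against it.
# Table and column names are illustrative.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=metrics user=grafana")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS cpu (
        time  TIMESTAMPTZ NOT NULL,
        host  TEXT        NOT NULL,
        usage DOUBLE PRECISION
    );
""")
cur.execute("SELECT create_hypertable('cpu', 'time', if_not_exists => TRUE);")
conn.commit()

# A typical Grafana-style query: average CPU per host, bucketed by minute.
cur.execute("""
    SELECT time_bucket('1 minute', time) AS bucket, host, avg(usage)
    FROM cpu
    WHERE time > now() - interval '1 hour'
    GROUP BY bucket, host
    ORDER BY bucket;
""")
rows = cur.fetchall()
```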

Ultimately, “you can actually have relational databases and time series databases all in the same database,” said Hsieh. “You can actually get a lot better performance using Timescale, and you don’t see that performance degradation in terms of inserts … With an 8-core machine, we’re able to get 1.11 million metrics per second.”

In the end, it’s not about whether or not SQL can scale, said Hsieh. “One of the biggest challenges people have is changing the time series framework into thinking about how they would actually model it in SQL because it is a different way of thinking about SQL.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Grafana Labs at KubeCon: All the Highlights

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/31/grafana-labs-at-kubecon-all-the-highlights/

At KubeCon + CloudNativeCon EU in Barcelona, the Grafana Labs team delivered four well-received talks as well as a keynote in front of thousands of attendees. In case you missed the conference (or you want to watch the sessions again!), check out the links to our recap blog posts or watch the talks below.

  • Intro to Cortex – Tom Wilkie and Weaveworks’ Bryan Boreham showed how easy it is to get started with your own Cortex cluster – the same technology behind Grafana Cloud’s hosted Prometheus.
  • Deep Dive: Cortex – Tom and Bryan delved deeper into Cortex, discussing how to run it as a set of microservices, how to choose between the various cloud storage options, and how to tune Cortex for the best performance.

Grafana Labs at KubeCon: Awesome Query Performance with Cortex

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/30/grafana-labs-at-kubecon-awesome-query-performance-with-cortex/

At KubeCon + CloudNativeCon in Barcelona last week, Weaveworks’ Bryan Boreham and I did a deep-dive session on Cortex, an OSS Apache-licensed CNCF Sandbox project. A horizontally scalable, highly available, long term storage for Prometheus, Cortex powers Grafana Cloud’s hosted Prometheus.

During our talk, we focused on the steps that we’ve taken to make Cortex’s query performance awesome.

Cortex embeds the Prometheus PromQL engine and mates it to Cortex’s own scale-out storage engine. This allows us to stay feature compatible with queries users can run against their Prometheus instance. Queries are handled by a set of stateless, scale-out query jobs; you could add more to “increase” performance, but this only improved the handling of concurrent queries – single queries were still handled by a single process.

It may sound obvious, but we have to be able to answer queries in Cortex using only the information in the user’s query – the label matchers and the time range. First we find the matching series in the query using the matchers. Then we find all the chunks for those series, for the time range we’re interested in, using a secondary index. We then fetch those chunks into memory, and merge and deduplicate them in a format that the Prometheus PromQL engine can understand. Finally, we pass this to the PromQL engine to execute the query and return the result to the user.

Query Path 2018

In the past year, we’ve made several optimizations.

Optimization 1: Batch Iterators for Merging Results

Cortex stores multiple copies of the data you send it in heavily-compressed chunks. To run a query, you have to fetch this data, merge it, and deduplicate it.

The initial technique used to do this was very naive. We would decompress the data in memory and merge the decompressed data. This was very fast, but it used a lot of memory and caused the query processes to OOM (out of memory) when large queries were sent to them.

So we started using iterators to dedupe the compressed chunks in a streaming fashion, without decompressing all the chunks. This was very efficient and used very little memory – but the performance was terrible. We used a heap to store the iterators, and operations on the heap (finding the next iterator) were terribly expensive.

We then moved to a technique that used batching-iterators. Instead of fetching a single sample on every iteration, we fetched a batch. We still used the heap, but we had to use it significantly less. This was almost as fast as the original method, and used almost the same amount of memory as the pure iterator-based approach.
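
The batching idea can be sketched roughly as follows; Cortex itself is written in Go, and the chunk format, batch size, and deduplication rule here are simplified illustrations rather than the actual implementation. Each chunk contributes a sorted stream of (timestamp, value) samples, a heap picks the stream with the earliest pending timestamp, and samples are pulled a batch at a time so the heap is touched far less often.

```python
import heapq

BATCH = 128  # samples pulled per heap operation; the real batch size may differ

def _take(it, n):
    """Yield up to n items from iterator it."""
    for _ in range(n):
        try:
            yield next(it)
        except StopIteration:
            return

def merge_chunks(chunks):
    """Merge sorted, possibly overlapping streams of (timestamp, value) samples,
    deduplicating repeated timestamps while touching the heap once per batch."""
    iters = [iter(c) for c in chunks]
    heap = []
    for i, it in enumerate(iters):
        batch = list(_take(it, BATCH))
        if batch:
            heapq.heappush(heap, (batch[0][0], i, batch, 0))

    last_ts = None
    while heap:
        _, i, batch, pos = heapq.heappop(heap)
        # Only emit samples up to the smallest timestamp still waiting in the
        # heap, so overlapping (replicated) chunks stay in order.
        limit = heap[0][0] if heap else float("inf")
        while pos < len(batch) and batch[pos][0] <= limit:
            ts, value = batch[pos]
            pos += 1
            if ts != last_ts:  # drop duplicate samples from replicas
                last_ts = ts
                yield ts, value
        if pos < len(batch):
            heapq.heappush(heap, (batch[pos][0], i, batch, pos))
        else:
            refill = list(_take(iters[i], BATCH))
            if refill:
                heapq.heappush(heap, (refill[0][0], i, refill, 0))
```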

Optimization 2: Caching… Everything

As explained, Cortex first consults the index to work out what chunks to fetch, then fetches the chunks, merges them, and executes the query on the result.

We added a series of memcached clusters everywhere possible – in front of the index, in front of the chunks, etc. These were very effective at reducing peak loads on the underlying data and massively improved the average query latency.

Caching

In Cortex, the index is always changing. We had to tweak the write path to ensure that the index could be cached. We had to make sure the ingesters held on to data that they had already written to the index for 15 minutes, so that entries in the chunk index could be considered valid for up to 15 minutes.

More Caching
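
The caching itself follows the classic cache-aside pattern with a bounded TTL. Below is a minimal sketch using pymemcache; the key scheme and backing-store call are hypothetical, while the 15-minute bound follows the ingester behavior described above.

```python
from pymemcache.client.base import Client  # pip install pymemcache

cache = Client(("localhost", 11211))
INDEX_TTL_SECONDS = 15 * 60  # entries may still be growing for up to 15 minutes

def lookup_index_row(row_key, index_store):
    """Cache-aside read: serve from memcached when possible, else hit the store."""
    cached = cache.get(row_key)
    if cached is not None:
        return cached
    value = index_store.read(row_key)  # hypothetical backing-store call, returns bytes
    cache.set(row_key, value, expire=INDEX_TTL_SECONDS)
    return value
```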

Optimization 3: Query Parallelization and Results Caching

We added a new job in the query pipeline: the query frontend.

Query Frontend

This job is responsible for aligning, splitting, caching, and queuing queries.

Aligning: We align the start and end time of the incoming queries with their step. This helps make the results more cacheable. Grafana 6.0 does this by default now.

Splitting: We split queries that have a large time range into multiple smaller queries, so we can execute them in parallel.

Caching: If the exact same query is asked twice, we can return the previous result. We can also detect partial overlaps between queries sent and results cached, stitching together cached results with results from the query service.

Queuing: We put queries into per-tenant queues to ensure a single big query from one customer doesn’t denial of service (DoS) smaller queries from other customers. We then dispatch queries in order, and in parallel.
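
A rough sketch of the aligning and splitting steps might look like the following; the real query frontend is part of Cortex and written in Go, and the day-sized split interval here is an assumption for illustration. Times and step are in seconds.

```python
def align(start, end, step):
    """Snap the query range to step boundaries so results are more cacheable."""
    return start - (start % step), end - (end % step)

def split(start, end, step, interval=24 * 3600):
    """Split a long range query into interval-sized sub-queries that run in parallel."""
    start, end = align(start, end, step)
    pieces = []
    while start < end:
        piece_end = min(end, (start // interval + 1) * interval)
        pieces.append((start, piece_end))
        start = piece_end
    return pieces

# A multi-day query is broken into day-aligned sub-queries that can execute in parallel.
print(split(1_559_000_000, 1_559_259_200, step=60))
```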

Other Optimizations

We’ve made numerous other tweaks to improve performance.

We optimized the JSON marshalling and unmarshalling, which has a big effect on queries that return a very high number of series.

We added HTTP response compression, so users on the ends of slower links can still get fast query responses.

We have hashed and sharded index rows to guarantee a better load distribution on our index. Plus that means we can look up smaller rows in parallel.

In short, every layer of the stack has been optimized!

And the results are clear: Cortex, when run in Grafana Cloud, achieves <50ms average response time and <500ms 99th percentile response time across all our production clusters. We’re pretty proud of these results, and we hope you’ll notice the improvements too.

But we’re not done yet. We have plans, in collaboration with the Thanos team, to further parallelize big queries at the PromQL layer. This should make Cortex even better for high cardinality workloads.

Most large-scale, clustered TSDBs talk about ingestion performance, and in fact, ingesting millions of samples per second is hard. We spent the first three years of Cortex’s life talking about ingestion challenges. But we have now moved on to talking about query performance. I think this is a good indication of the maturity of the Cortex project, and makes me feel good.

Read more about Cortex on our blog and get involved on GitHub.

Grafana Labs at KubeCon: Foolproof Kubernetes Dashboards for Sleep-Deprived On Calls

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/29/grafana-labs-at-kubecon-foolproof-kubernetes-dashboards-for-sleep-deprived-on-calls/

We’ve all been in the situation where suddenly you are the lone developer on call while everyone is out of pocket.

Or in the case of Grafana Labs Director of UX David Kaltschmidt, his then business partner, Grafana Labs VP of Product Tom Wilkie, was checking out for a weekend music fest.

“Tom and I founded a company a couple of years ago, and I’m more of a frontend person. Tom did all the backend and devops stuff,” explained Kaltschmidt. “Then one weekend Tom said, ‘David, I’m going to a heavy metal festival, and you need to watch the servers.’”

“I was freaking out,” he admitted. “An easy on call shift allows for debugging and follow-up such as working on features or reducing toil. But a difficult on call experience puts a developer on the defensive, mostly responding to incidents where every minute counts.”

“I was just hoping that never happened when Tom was in a tent somewhere sleeping,” said Kaltschmidt.

So that no one loses sleep over on calls ever again, Kaltschmidt shared his tips and tricks for creating foolproof Kubernetes dashboards at KubeCon 2019.

Being on call in the context of Kubernetes is, at first, “mind-boggling,” said Kaltschmidt. “The process is so difficult because you have to do the troubleshooting across a couple of dimensions and all the concepts in Kubernetes.”

And troubleshooting is as much about finding the issue as it is about ruling out the places where the issue is not, said Kaltschmidt.

To help, Kaltschmidt introduced the Dashboarding Maturity Model (DMM), a three-tiered approach that will help align teams and make dashboards more consistent within organizations. The DMM will also give individual engineers some guidance along their dashboarding journey – and show them how to take it to the next level.

maturity levels

Low Maturity

Here are three signs of low maturity dashboarding, which indicate there is no effective strategy in place.

Sprawl:
This is when duplicating dashboards goes unmanaged and gets unwieldy. “Grafana makes it really easy to modify dashboards,” said Kaltschmidt. But it’s easy to leave the “copy tags” function on.

“You probably had good intentions using it. But then someone else came along, cloned the dashboard, left copy tags on, but modified something that semantically represented one of your tags,” said Kaltschmidt. “Those dashboards diverge, and if you later use those tags to find your things, you end up with a long set of dashboards that are no longer representing what the tag originally meant.”

No Version Control:
What happens when you modify a dashboard and you hit save without version control? “If you have a standard Grafana instance and you don’t back up your data or you don’t have your dashboard JSON in version control and your Grafana goes away, then you’re going to have a bad time,” he said.

Browsing for Dashboards:
A similar behavior that’s symptomatic of low maturity is browsing for dashboards. “If you find yourself browsing a lot – going through folders and going back and forth to find the right thing – that’s the sort of behavior a mature user wants to get away from,” said Kaltschmidt.

Medium Maturity

Here are some Grafana-approved methods to managing your monitoring dashboards.

Templated Variables:
A dashboard for each node in Kubernetes isn’t necessary because factors that are tracked, such as CPU usage or usage per core, appear in the same panel layout for all the nodes.

Within Grafana, the nodes can be registered as template variables, and users can access a dropdown to look through all their instances. “If you’re really clever,” said Kaltschmidt, “you can do this for various data sources as a higher level template variable. Then you can basically access a lot of different clusters.”
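
As an illustration, a dashboard JSON fragment with a templated node variable might look like the sketch below; the Prometheus metric and label names are assumptions, not a particular mixin’s definitions.

```python
# One panel layout serves every node; the $node variable does the rest.
dashboard = {
    "templating": {
        "list": [{
            "name": "node",
            "type": "query",
            "datasource": "Prometheus",
            "query": "label_values(node_cpu_seconds_total, instance)",
            "multi": True,
        }]
    },
    "panels": [{
        "title": "CPU usage – $node",
        "type": "graph",
        "targets": [{
            "expr": '1 - avg by (instance) '
                    '(rate(node_cpu_seconds_total{mode="idle", instance=~"$node"}[5m]))',
        }],
    }],
}
```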

Methodical Dashboards:
There are a couple of dashboarding methods that help make sense of what could go wrong in Kubernetes along various dimensions.

For services, the RED method measures request and error rate as well as duration for each service. Check out Tom Wilkie’s overview of this method.

For resources, the USE method measures utilization, saturation, and errors. In the example below, these dashboards are part of the Kubernetes monitoring mixin.

kube mixin1

Above is a view of what a node represents and also what problems a node can have.

kube mixin2

Above is a set of dashboards in this repo about persistent volumes that also shows various other dashboards that are available.

Hierarchical Dashboards:
This type of dashboarding method does a great job at providing summary views with aggregate queries by using the power of trees or the logarithmic drilldown to pinpoint where a problem is.

“They really help in the elimination process. You can quickly see in the higher level tree dashboards that things are okay so you can move on to the next one,” explained Kaltschmidt.

hierarchical1

One hierarchy example would be cluster, namespace, and pod. All of these queries have to be structured so that each level aggregates the metrics of the level below it in a meaningful way. But then the question becomes: how do you navigate between them?

hierarchical2

At Grafana, there would be a cluster view broken down by namespaces. The breakdown in these queries is always the next level down, so you can move to that level by using one of the drilldown links in the table. This is also part of the Kubernetes monitoring mixin.

Service Hierarchies:
Using the same dashboard, you can see how data flows through an application.

service

Here is the RED method using one row per service, with the request rate and error rate on the left and the duration of the latency on the right.

“The really powerful thing here is that you can see at the top, which is the load balancer that tracks the responses to the user, that the user is not going to see any errors because there’s nothing red. But there’s obviously something wrong because the lower dashboards have red,” explained Kaltschmidt.

“If we’re looking at the app, there’s some red there. We know that the app relies on data from the database, and by virtue of the database also being red, we get a hunch that the error may be in the database,” he said. “So this vertical hierarchy inherently leads me towards a hunch about where the system is not working.”

Expressive Dashboards:
“One thing to keep in mind is that sometimes it’s worth splitting up a service or an app into two different dashboards, mainly because the magnitude can differ,” advised Kaltschmidt.

expressive split

For example, Grafana is on Cortex, which is a Prometheus-based service, and a lot of data is being consumed. “If it’s always at a magnitude like 1000x, any errors in the read path would be drowned visually if they were in the same dashboard and those metrics were aggregated,” said Kaltschmidt.

Expressive Charts:
One tip for making your dashboards “really expressive” is to use color to give you a quick hint about what’s going on, said Kaltschmidt.

expressive color

In the top diagram, with 200s in green and 500s in red, “this helps quickly draw a conclusion on the state of this app,” said Kaltschmidt.

Kaltschmidt also advised normalizing graphs on the Y axis “so you instantly get a feel for how busy something is.” This is especially useful for dashboards that track a situation where the resource is bound, like CPU.

normalized

Taking another example from the Kubernetes monitoring mixin, above is a cluster where each line represents a node. But it’s unclear how many CPUs the nodes have because nodes could be provisioned in different sizes.

“If we normalize this by the CPU count or across the cluster, we can definitely say a lot of resources across the cluster are being used leading up to 100 percent,” pointed out Kaltschmidt. “This is really powerful because you reduce the cognitive load of having to draw conclusions on how much space is left.”

Directed Browsing:
While template variables make it hard to navigate or “just browse” through the dashboards, that’s a good thing.

“You actually shouldn’t just navigate through them, especially if you have three-level hierarchies,” said Kaltschmidt. “That will actually encourage you to use alerts.”

Managing Dashboards:
Lastly, where should we store the dashboard code themselves? “There’s a couple of initiatives inside Grafana going on, and the most important revolves around improving the provisioning workflow,” said Kaltschmidt.

Any developer interested in this can check out the Grafana design doc or comment on Issue #13823 on GitHub.

High Maturity

Sometimes in devops organizations, people know good practices, but they still deviate from them. Here are ways to achieve consistency by design.

Scripting Libraries:
Scripting libraries such as grafonnet (Jsonnet) and grafanalib (Python) give you higher order functions to generate certain types of dashboards, said Kaltschmidt.

“The important thing is that those functions can encode, for example, a query panel,” he said. “If you use this function, you can ensure that all the rows and all the dashboards that have been created will share the same style. There’s no longer a fight of should we use line fills or not … and you guarantee across the organization that your dashboard panels are similar enough so that people can find their answers quickly.”

One of the biggest benefits of scripting libraries is smaller change sets. “If you use higher-order functions, you don’t have to deal with this massive JSON anymore because you only need to compare the query change.”
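
A small sketch of that idea, emitting plain dashboard JSON rather than using grafonnet or grafanalib directly: one function encodes the house style for a query panel, and a RED-method row is built on top of it, so a reviewer really does only see the query change. The metric names and styling choices are illustrative assumptions.

```python
def query_panel(title, expr, unit="short"):
    """The one agreed-upon panel style, encoded once and reused everywhere."""
    return {
        "title": title,
        "type": "graph",
        "datasource": "Prometheus",
        "fill": 1,
        "yaxes": [{"format": unit, "min": 0}, {"show": False}],
        "targets": [{"expr": expr, "legendFormat": "{{instance}}"}],
    }

def red_row(service):
    """RED-method row for a service: request rate, error rate, duration."""
    return [
        query_panel(f"{service} QPS",
                    f'sum(rate(request_total{{job="{service}"}}[1m]))'),
        query_panel(f"{service} errors",
                    f'sum(rate(request_errors_total{{job="{service}"}}[1m]))'),
        query_panel(f"{service} p99 latency",
                    f'histogram_quantile(0.99, sum(rate('
                    f'request_duration_seconds_bucket{{job="{service}"}}[1m])) by (le))',
                    unit="s"),
    ]
```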

Mixins or Other Peer-Reviewed Templates:
Mixins are a set of dashboards and alerts that are peer-reviewed and a great resource for any organization.

“The mixins I’ve been showing have been written in Jsonnet, but you can still extract the queries and use them in your own dashboarding journey. It’s a really good resource to look at how people monitor Kubernetes using Prometheus,” said Kaltschmidt.

For more information on Kubernetes mixins, check out this blog post.

Future Tools to Help

Grafana is looking to improve its workflow so that in the browser, there will be an editor to live edit JSON. “But that’s a bit in the future,” said Kaltschmidt.

Until then, “it’s good to have a strategy for dashboarding. Start with the goal of managing the use of methodical dashboards. Then the next step can be consistency by design,” said Kaltschmidt.

But always remember: “Your dashboarding practices should reduce cognitive load – not add to it.”

Foolproof Kubernetes Dashboards for Sleep-Deprived On Calls

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/29/foolproof-kubernetes-dashboards-for-sleep-deprived-on-calls/

We’ve all been in the situation where suddenly you are the lone developer on call while everyone is out of pocket.

Or in the case of Grafana Labs Director of UX David Kaltschmidt, his then business partner, Grafana Labs VP of Product Tom Wilkie, was checking out for a weekend music fest.

“Tom and I founded a company a couple of years ago, and I’m more of a frontend person. Tom did all the backend and devops stuff,” explained Kaltschimdt. “Then one weekend Tom said, ‘David, I’m going to a heavy metal festival, and you need to watch the servers.’”

“I was freaking out,” he admitted. “An easy on call shift allows for debugging and follow-up such as working on features or reducing toil. But a difficult on call experience puts a developer on the defensive, mostly responding to incidents where every minute counts.”

“I was just hoping that never happened when Tom was in a tent somewhere sleeping,” said Kaltschimdt.

So that no one loses sleep over on calls ever again, Kaltschmidt shared his tips and tricks for creating foolproof Kubernetes dashboards at KubeCon 2019.

Being on call in the context of Kubernetes is, at first, “mind-boggling,” said Kaltschmidt. “The process is so difficult because you have to do the troubleshooting across a couple of dimensions and all the concepts in Kubernetes.”

And troubleshooting is as much about finding the issue as it is about eliminating potential situations where the issue is not, said Kaltschimdt.

To help, Kaltschmidt introduced the Dashboarding Maturity Model (DMM), a three-tiered approach that will help align teams and make dashboards more consistent within organizations. Also DMM will give individual engineers some guidance along their dashboarding journey – and show them how to take it to the next level.

maturity levels

Low Maturity

Here are three signs of low maturity dashboarding, which indicate there is no effective strategy in place.

Sprawl:
This is when duplicating dashboards goes unmanaged and gets unwieldy. “Grafana makes it really easy to modify dashboards,” said Kaltschmidt. But it’s easy to leave the “copy tags” function on.

“You probably had good intentions using it. But then someone else came along, cloned the dashboard, left copy tags on, but modified something that semantically represented one of your tags,” said Kaltschmidt. “Those dashboards diverge, and if you later use those tags to find your things, you end up with a long set of dashboards that are no longer representing what the tag originally meant.”

No Version Control:
What happens when you modify a dashboard and you hit save without version control? “If you have a standard Grafana instance and you don’t back up your data or you don’t have your dashboard JSON in version control and your Grafana goes away, then you’re going to have a bad time,” he said.

Browsing for Dashboards:
A similar behavior that’s symptomatic of low maturity is browsing for dashboards. “If you find yourself browsing a lot – going through folders and going back and forth to find the right thing – that’s the sort of behavior a mature user wants to get away from,” said Kaltschimdt.

Medium Maturity

Here are some Grafana-approved methods to managing your monitoring dashboards.

Templated Variables:
A dashboard for each node in Kubernetes isn’t necessary because factors that are tracked, such as CPU usage or usage per core, appear in the same panel layout for all the nodes.

Within Grafana, the nodes can be registered as template variables, and users can access a dropdown to look through all their instances. “If you’re really clever,” said Kaltschimdt, “you can do this for various data sources as a higher level template variable. Then you can basically access a lot of different clusters.”

Methodical Dashboards:
There are a couple of dashboarding methods that help make sense of what could go wrong in Kubernetes along various dimensions.

For services, the RED method measures request and error rate as well as duration for each service. Check out Tom Wilkie’s overview of this method.

For resources, the USE method measures utilization, saturation, and errors. In the example below, these dashboards are part of the Kubernetes monitoring mixin.

kube mixin1

Above is a view of what a node represents and also what problems a node can have.

kube mixin2

Above is a set of dashboards in this repo about persistent volumes that also shows various other dashboards that are available.

Hierarchical Dashboards:
This type of dashboarding method does a great job at providing summary views with aggregate queries by using the power of trees or the logarithmic drilldown to pinpoint where a problem is.

“They really help in the elimination process. You can quickly see in the higher level tree dashboards that things are okay so you can move on to the next one,” explained Kaltschimdt.

hierarchical1

One hierarchy example would be cluster, namespace, and pod. All of these queries will have to be structured in a way that whatever is below or whatever is above, aggregates those metrics in a meaningful way. But then the question becomes how do you navigate between them?

hierarchical2

At Grafana, there would be a cluster view broken down by namespaces. The breakdown in these queries will always be the next level down so you can move into the next level by using one of these drilldown links in the table. This is also part of the Kubernetes monitoring mixin.

Service Hierarchies:
Using the same dashboard, you can see how data flows through an application.

service

Here is the RED method using one row per service, with the request rate and error rate on the left and the duration of the latency on the right.

“The really powerful thing here is that you can see at the top, which is the local answer that tracks the responses to the user, that the user is not going to see any errors because there’s nothing red. But there’s obviously something wrong because the lower dashboards have red,” explained Kaltschimdt.

“If we’re looking at the app, there’s some red there. We know that the app relies on data from the database, and by virtue of the database also being red, we get a hunch that the error may be in the database,” he said. “So this vertical hierarchy inherently leads me towards a hunch about where the system is not working.”

Expressive Dashboards:
“One thing to keep in mind is that sometimes it’s worth splitting up a service or an app into two different dashboards, mainly because the magnitude can differ,” advised Kaltschmidt.

expressive split

For example, Grafana is on Cortex, which is a Prometheus-based service, and a lot of data is being consumed. “If it’s always at a magnitude like 1000x, any errors in the read path would be drowned visually if they were in the same dashboard and those metrics were aggregated,” said Kaltschmidt.

Expressive Charts:
One tip for making your dashboards “really expressive” is to use color to give you a quick hint about what’s going on, said Kaltschmidt.

expressive color

In the top diagram, with only 200-s in green and 500-s in red, “this helps quickly draw a conclusion on the state of this app,” said Kaltschmidt.

Kaltschmidt also advised to normalize graphs by the Y axis “so you instantly get a feel for how busy something is.” This is especially useful for dashboards that track a situation where the resource is bound, like CPU.

normalized

Taking another example from the Kubernetes monitoring mixin, above is a cluster and each of these lines represents a node. But it’s unclear how many CPUs the nodes have because there could be provisions of different sizes.

“If we normalize this by the CPU count or across the cluster, we can definitely say a lot of resources across the cluster are being used leading up to 100 percent,” pointed out Kaltschmidt. “This is really powerful because you reduce the cognitive load of having to draw conclusions on how much space is left.”

Directed Browsing:
While template variables make it hard to navigate or “just browse” through the dashboards, that’s a good thing.

“You actually shouldn’t just navigate through them, especially if you have three-level hierarchies,” said Kaltschmidt. “That will actually encourage you to use alerts.”

Managing Dashboards:
Lastly, where should the dashboard code itself be stored? “There’s a couple of initiatives going on inside Grafana, and the most important revolves around improving the provisioning workflow,” said Kaltschmidt.

Any developer interested in this should check out the Grafana design doc or comment on Issue #13823 on GitHub.

High Maturity

Sometimes in DevOps organizations, people know good practices but still deviate from them. Here are ways to achieve consistency by design.

Scripting Libraries:
Scripting libraries such as grafonnet (Jsonnet) and grafanalib (Python) give you higher order functions to generate certain types of dashboards, said Kaltschmidt.

“The important thing is that those functions can encode, for example, a query panel,” he said. “If you use this function, you can ensure that all the rows and all the dashboards that have been created will share the same style. There’s no longer a fight of should we use line fills or not … and you guarantee across the organization that your dashboard panels are similar enough so that people can find their answers quickly.”

One of the biggest benefits of scripting libraries is smaller change sets. “If you use higher-order functions, you don’t have to deal with this massive JSON anymore because you only need to compare the query change.”
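
As an illustration, a grafanalib (Python) dashboard built around one house-style helper might look roughly like this. This is a minimal sketch, not a drop-in file; check the grafanalib docs for the current API:

```python
# Sketch of consistency-by-design with grafanalib: every panel goes through one
# helper, so styling decisions (fills, legends, line width) are made exactly once.
from grafanalib.core import Dashboard, Graph, Row, Target

def house_style_graph(title, expr, datasource="Prometheus"):
    """House-style graph panel; the one place where panel styling is decided."""
    return Graph(
        title=title,
        dataSource=datasource,
        targets=[Target(expr=expr, legendFormat="{{job}}", refId="A")],
        lineWidth=2,
        fill=0,  # the "line fills or not" argument, settled once for everyone
    )

dashboard = Dashboard(
    title="Service Overview",
    rows=[Row(panels=[
        house_style_graph("Request rate", 'sum by (job) (rate(http_requests_total[5m]))'),
        house_style_graph("Error rate", 'sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))'),
    ])],
).auto_panel_ids()
```

Because the dashboard is now ordinary code, adding a panel becomes a small diff of one function call rather than a wall of generated JSON.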

Mixins or Other Peer-Reviewed Templates:
Mixins are a set of dashboards and alerts that are peer-reviewed and a great resource for any organization.

“The mixins I’ve been showing have been written in Jsonnet, but you can still extract the queries and use them in your own dashboarding journey. It’s a really good resource to look at how people monitor Kubernetes using Prometheus,” said Kaltschmidt.

For more information on Kubernetes mixins, check out this blog post.

Future Tools to Help

Grafana is looking to improve its workflow so that in the browser, there will be an editor to live edit JSON. “But that’s a bit in the future,” said Kaltschmidt.

Until then, “it’s good to have a strategy for dashboarding. Start with the goal of managing the use of methodical dashboards. Then the next step can be consistency by design,” said Kaltschmidt.

But, first and foremost, remember: “Your dashboarding practices should reduce cognitive load – not add to it.”

Grafana Labs at KubeCon: What is the Future of Observability?

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/27/grafana-labs-at-kubecon-what-is-the-future-of-observability/

The three pillars of observability – monitoring, logging and tracing – are so 2018.

At KubeCon + CloudNativeCon EU last week, Grafana Labs VP Product Tom Wilkie and Red Hat Software Engineer Frederic Branczyk gave a keynote presentation about the future of observability and how this trifecta will evolve in 2019 and the years to come.

“The three pillars were really meant as a framework for people who have just gotten started on their journey in observability. In 2018 there’s been a conversation – and some critique – about this,” said Branczyk.

“We have all of this data, and we’re telling people if you have metrics, if you have logs, and if you have tracing you’ve solved observability. You have an observable system!” Branczyk explained. “But we don’t think that’s the case. There is so much more for observability to come.”

“It’s always a bit of a risky business doing predictions,” said Wilkie. “But we’re going to give it a go anyway.”

Below is a recap of their Kubecon keynote.

The Three Pillars

To start, here’s a quick overview of each pillar.

Pillars Slide

1. Metrics
Normally this is time series data used to track trends in things like memory usage and latency.

“The CNCF has some great projects in this space,” said Wilkie. “OpenMetrics is an exposition format for exporting metrics from your application, and Prometheus is probably now the de facto monitoring system for Kubernetes and apps on Kubernetes.”

2. Logs
Logs, or events, are what come out of your containers on “standard out” in Kubernetes. Think error messages, exceptions, and request logs. The CNCF has the Fluentd project, which is a log shipping agent.

3. Traces
“This is potentially the hardest one to sum up in a single sentence,” said Wilkie. “I think of distributed traces as a way of recording and visualizing a request as it traverses through the many services in your application.”

In this space, there is OpenTelemetry as well as Jaeger, a CNCF project which Grafana Labs utilizes, according to Wilkie.

Prediction #1: More Correlation Between Pillars

“The first prediction is that there will be more correlation between the different pillars,” said Wilkie. “We think this is the year when we’re going to start breaking down the walls, and we’re going to start seeing joined up workflows.”

Here are three examples of workflows and projects that provide automated correlation that you can do today:

Correlation Slide

1. Grafana Loki
“The first system is actually a project that I work on myself called Loki,” said Wilkie, who also delivered a separate KubeCon talk about the open source log aggregation system that Grafana Labs launched six months ago. Since then “we have had an absolutely great response. Loads of people have given us really good feedback.”

Loki uses Prometheus’ service discovery to automatically find jobs within your cluster. It then takes the labels that the service discovery gives you and associates them with the log stream, preserving context and saving you money.

“It’s this kind of systematic, consistent metadata that’s the same between your logs and your metrics that enables switching between the two seamlessly,” said Wilkie.
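
A hedged illustration of what that seamless switch looks like in practice – the same label selector addressing both signals (label names depend on your service discovery and relabeling config):

```python
# Hypothetical query pair: metrics in Prometheus, logs in Loki, selected by the
# same labels that service discovery attached to both.
METRICS_QUERY = 'sum(rate(http_requests_total{namespace="prod", job="api"}[5m]))'  # PromQL
LOGS_QUERY = '{namespace="prod", job="api"}'  # Loki log stream selector
```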

2. Elasticsearch & Zipkin
“Elasticsearch is probably the most popular log aggregation system, even I can admit that,” said Wilkie. “And Zipkin is the original open source distributed tracing system.”

Within Kibana, the Elasticsearch UI, there is a function called field formatters which allows someone to hyperlink between tracing and logs.

“Some chap on Twitter set up his Kibana to insert a link using a field formatter so that he could instantly link to his Zipkin traces,” said Wilkie. “I think this is really cool, and I’m really looking forward to adding this kind of feature to Grafana.”

3. OpenTelemetry
Recently Google’s OpenCensus and the CNCF’s OpenTracing merged into one open source project, OpenTelemetry, now operated by the CNCF.

Within OpenCensus, the use of exemplars illustrates another form of correlation. “Exemplars are trace IDs associated with every bucket in a histogram,” said Wilkie. “So you can see what’s causing high latency and link straight to the trace.”

“I like that OpenTelemetry is open source,” added Wilkie. “I’m not actually aware of an open source system on the server side that has implemented this workflow. If I’m wrong, come and find me.” (He’s @tom_wilkie on Twitter!)

Prediction #2: New Signals & New Analysis

Your observability toolkit doesn’t have to include just three pillars.

That matters because “it’s signals and analysis that are going to bring us forward,” said Branczyk, who shared a concrete example of what could be the fourth pillar of observability.

Signals Slide1

Above is a memory graph created using Prometheus. It shows memory usage over time, with a sudden drop, after which a new line appears in a different color. In Prometheus this means it’s a distinct time series, but we’re actually looking at the same workload.

“What we’re seeing in this graph is actually what we call an OOM kill, when our application has allocated so much memory that the kernel has said, ‘Stop here; I’m going to kill this process and go on,’” explained Branczyk. “Our existing systems can show us all of this usage over time, and our logs can tell us that the OOM kill has happened.”

But for app developers, they need more information to know how to fix the problem. “What we want is a memory profile of a point in time during which memory usage was at its highest so that we know which part of our code we need to fix,” said Branczyk.

Imagine if there was a Prometheus-like system that periodically took memory profiles of an application, essentially creating a time series of profiles. “Then if we had taken a memory profile of our OOM-ing application every 10 or 15 seconds, maybe we would actually be able to figure out what caused this particular incident,” said Branczyk.

Google has published a number of white papers on this topic, and there are some proprietary systems that do this work, but there hasn’t been a solution in the open source space – until now. Branczyk has started a new project on GitHub called Conprof. (“It stands for continuous profiling because I’m not a very imaginative person,” he joked.)

But how can time series profiling be more useful than just looking at the normal memory profile?

Signals Slide2

Above is a pprof profile, which is what the Go runtime provides developers to analyze running systems.

“As it turns out, as I was putting together these slides, I found a memory leak in Conprof,” admitted Branczyk. “But what if Conprof could have told me which part of Conprof actually has a memory leak? So if we have all of this data over time modeled as a time series, and if we look at two memory profiles in a consecutive way, Conprof could identify which systems have allocated more memory over time and haven’t freed it. That potentially could be what we have to fix.”
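
Conprof works with Go pprof profiles, but the underlying idea – snapshot on an interval, then diff consecutive snapshots to find what keeps growing – can be sketched in plain Python with the standard-library tracemalloc module. This is illustrative only, not how Conprof is implemented:

```python
# A minimal sketch of "continuous profiling": take a memory snapshot on an
# interval, keep the snapshots as a tiny time series, and diff consecutive
# snapshots to see which call sites keep allocating without freeing.
import time
import tracemalloc

tracemalloc.start()
snapshots = []  # (timestamp, snapshot) pairs

for _ in range(2):
    time.sleep(1)  # in the talk's example this would be every 10-15 seconds
    snapshots.append((time.time(), tracemalloc.take_snapshot()))

# Diff the two most recent profiles: call sites that keep growing are leak suspects.
(_, older), (_, newer) = snapshots[-2], snapshots[-1]
for stat in newer.compare_to(older, "lineno")[:10]:
    print(stat)
```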

“I think we’re going to be seeing a lot more signals and analysis,” concluded Branczyk. “I’ve only shown you one example but I think there’s going to be a lot more out there to explore.”

Prediction #3: Rise of Index-Free Log Aggregation

Over the past six months to a year, a common sentiment has emerged around log aggregation. “A lot of people have been saying things like, ‘Just give me log files and grep,’” said Wilkie.

“The systems like Splunk and Elasticsearch give us a tremendous amount of power to search and analyze our logs but all of this power comes with a lot of … I’m not going to say responsibility. It comes with a lot of complexity – and expense,” said Wilkie.

Before Splunk and Elasticsearch, logs were just stored as files on disk, possibly on a centralized server, and it was easy to just go and grep them. “I think we’re starting to see the desire for simpler index-free log aggregation systems,” said Wilkie. “Effectively everything old is new again.”

Here Wilkie gives three examples (“Again, three is a very aesthetically pleasing number,” he joked) of how this works:

Logs Slide

1. OK Log
Peter Bourgon started OK Log over a year ago but “unfortunately it’s been discontinued,” said Wilkie. “But it had some really great ideas about distributing grep over a cluster of machines and being able to basically just brute force your way through your logs. It made it a lot easier to operate and a lot cheaper to run.”

2. kubectl logs
“I think if we squint, we can think of this as a log aggregation system,” said Wilkie. In kubectl logs, there’s a central place to query logs, and it stores them in a distributed way.

“For me, the thing that was really missing from kube logs is being able to get logs for pods that were missing – i.e. pods that disappeared or pods that OOM-ed or failed, especially during rolling upgrades,” said Wilkie.

3. Grafana Loki
The above problem is what led Wilkie to develop Loki, the index-free log aggregation system designed to be easy to run and easy to scale by Grafana Labs.

“It doesn’t come with the power of something like Elasticsearch,” said Wilkie. “You wouldn’t use Loki for business analytics. But Loki is really there for your developer troubleshooting use case.”

“I’m really hoping that in 2019 and 2020, we see the rise of these index-free, developer-focused log aggregation systems,” concluded Wilkie. “And I’m hoping this means, as a developer, I’ll never be told to stop logging so much data again.”

How You Can Help

“The overarching theme of all of this is don’t leave it up to Tom and me. Don’t leave it up to the existing practitioners,” said Branczyk. “This is a community project. Observability was not created by a few people. It was created by people who lacked tooling in their troubleshooting.”

So the next time you find yourself troubleshooting, think “what data are you looking at, what are you doing to troubleshoot your problem, and can we do that in a systematic way?” said Branczyk. “Hopefully we’ll have more reliable systems as a result.”

With the help of the entire community, said Wilkie, “if we’re lucky, we can watch this talk in a year or two and maybe get one out of three.”

Grafana Labs at KubeCon: Loki’s March Toward GA

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/23/grafana-labs-at-kubecon-lokis-march-toward-ga/

At KubeCon + CloudNativeCon EU this week, Grafana Labs VP Product Tom Wilkie gave a talk about Loki, the Prometheus-inspired service that optimizes search, aggregation, and exploration of logs natively in Grafana. In case you missed it, here’s a recap.

Wilkie’s talk is an overview of how and why Grafana Labs built Loki and the features and architecture the team built in. Our policy is to develop projects in the open, so the design doc has been publicly accessible since development started. Loki was launched at KubeCon North America last December; the project now has more than 6,000 stars on GitHub.

“Loki is our horizontally scalable, highly available, multi-tenant log aggregation system,” Wilkie says. “We’ve tried to make Loki as close to the experience you get for metrics with Prometheus, but for logs.”

Here are the main features:

Simple and Cost-Effective to Operate

Existing systems that do log aggregation use inverted indexes. They take every line of logs, split them up into tokens, and then insert a tuple for every single token. One great benefit of indexing all of the fields within a message or a log line is that it makes searches really fast.

“This is how Google does searches on the internet, so it’s a really proven technology,” says Wilkie.

But it’s really hard to scale inverted indexes. The index can grow to be comparable in size to the data ingested. “The other problem with inverted indexes is you end up either sending all of your writes to all of your nodes, or you end up sending all of your reads to all of your nodes in your cluster, and it makes them hard to scale for both interactive reads and writes,” says Wilkie.

So the team adopted a different data model. “We have a series of tags, like a bag of key value pairs for each stream within the database,” says Wilkie. “The streams themselves are not indexed. This means our index, which we use for the tags, is much smaller. On our dev environment, we’re ingesting >10TB a week, but the index is like ~500MB. It’s more than four orders of magnitude smaller.”

Data is grouped together into chunks that are compressed up to 10x. This model makes it much easier to scale out the system and operate it. Plus, “by using these tags, and by getting these tags from the same place we do for our Prometheus metrics, it makes it really easy to switch between the two,” says Wilkie.
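
A toy model of that layout might look like the following. It is purely illustrative, since Loki's real index and chunk formats are far more involved:

```python
# Toy sketch of Loki's data model: only each stream's label set is indexed;
# the log lines themselves live in compressed, unindexed chunks.
import gzip

index = {}   # frozenset of labels -> list of chunk ids (the small, indexed part)
chunks = {}  # chunk id -> compressed log lines (the big, unindexed part)

def push(labels, lines):
    stream = frozenset(labels.items())
    chunk_id = f"chunk-{len(chunks)}"
    chunks[chunk_id] = gzip.compress("\n".join(lines).encode())
    index.setdefault(stream, []).append(chunk_id)

def query(selector):
    """Match streams by labels via the index, then brute-force through chunks."""
    results = []
    for stream, chunk_ids in index.items():
        if set(selector.items()) <= stream:
            for chunk_id in chunk_ids:
                results.extend(gzip.decompress(chunks[chunk_id]).decode().splitlines())
    return results

push({"namespace": "prod", "job": "api"}, ['level=error msg="boom"'])
print(query({"namespace": "prod"}))
```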

Integrated with the Existing Observability Stack

Imagine what happens when you’re on call and you get an alert. There are multiple steps, and multiple systems, to work through to pinpoint a problem.

Alert

“As you can see, we’ve had to use five different systems. If you include Slack or PagerDuty and every step, you’ve had to manually translate between them,” says Wilkie. “Preserving that context throughout this flow is really hard, and that was the problem we wanted to solve.”

Prometheus has a relatively simple data model. “If you want to do a query, you generally specify a set of matches against labels and that will select some set of time series that you’ll then rate, aggregate, apply percent or maths to, do whatever you need to do to turn that raw data into the information you display on your dashboard,” says Wilkie.

To get these labels, the Prometheus server talks to the Kubernetes API server and asks what services, pods, deployments, and other objects exist in the object model. Then a set of rules, called relabeling rules, is applied to the objects that come back.

“If you just include in the job name the name of the service, it’s very common to accidentally aggregate together your dev and your prod environment,” says Wilkie. “Part of the relabeling we do is to put the namespace name into the job name, and this stops us from making that mistake.”
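
The outcome of that relabeling can be summed up in one line: the job label becomes namespace-qualified rather than just the service name. The exact format below follows a common convention and is only illustrative:

```python
# Hypothetical sketch of the relabeling outcome: job="namespace/service" instead
# of job="service", so dev and prod can never be aggregated together by accident.
def job_label(namespace, service):
    return f"{namespace}/{service}"

print(job_label("dev", "myapp"))   # dev/myapp
print(job_label("prod", "myapp"))  # prod/myapp
```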

The same thing is done in Loki. “The data model for Loki is identical to the data model for Prometheus, except we’ve changed the values, which were float64s, to bytes,” says Wilkie. “And then, to gather these values, we use the Prometheus code base, embed it into a job we called Promtail. It will go and talk to the Kubernetes API server, gather a list of targets, relabel them so they’re consistent. You can really easily and quickly switch between metrics and logs for a given service.”

For more about how Loki correlates metrics and logs – saving you money – check out this blog post.

Cloud Native and Airplane Mode

For developers who travel a lot, working on a technology that doesn’t require an internet connection is a big bonus. “I really wanted Loki to be the kind of thing I could develop on my laptop when I’m disconnected,” says Wilkie.

So the team made Loki a single binary that can run as a single process.

“You can store your logs on disk,” he says. “Of course, my laptop’s got limited storage. Your single server and your data center’s got limited storage, so we need to make it scale out. So the same binary, with the same technologies, can scale out.”

The solution: Install it on multiple servers. Install Promtail on all the clients you want to scrape logs from, and “it’ll collect those logs and send them to your Loki cluster. It uses the technology you should all be familiar with – distributed hashing, dynamo style replication – to achieve this, and then it can use the local disks.”

Grafana Labs also offers a hosted version of Loki, which uses a microservices architecture. By design, Loki can run as either a set of microservices or a monolith.

“We have the microservices architecture where we’ve broken up every single one of the services into individual functions almost, where we’ve got a separate read path and a separate write path,” says Wilkie. “We’ve got caching at every single layer, and then we don’t like local storage when we run Grafana Cloud. We like to use cloud storage. So, in Grafana Cloud, we use Bigtable and GCS to store the chunks. The Bigtable stores the index, and the chunks go in GCS. But again, this is all pluggable, so you can run this in Amazon, Azure, MinIO. We support all of that.”

In building Loki, “we’ve adopted all of the best practices for cloud native,” says Wilkie. “We’ve made it containerized, we’ve made it Kubernetes native, we’ve used cloud storage, and we’ve made it so it can run at massive scale in the cloud.”

What’s Next

The team is working on LogQL, a query language like PromQL. (Check out the design doc.)

“We’ve heard the feedback from everyone who’s used the early versions of Loki, and we’re going to add alerting to Loki so you can alert off your raw log streams,” says Wilkie. “We’re probably going to do that using LogQL. We want to make LogQL so you can combine Prometheus queries and log queries into a single metric and extract metrics out of your logs. That’s a great way of building alerts off of logs.”

Grafana v6.2 Released

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/22/grafana-v6.2-released/

v6.2 Stable Release!

It’s finally time for a new Grafana release again. Grafana 6.2 includes improved security, enhanced provisioning workflow, a new Bar Gauge panel, Elasticsearch 7 support, and lazy loading of panels, among other things.

What’s New in Grafana v6.2

Download Grafana 6.2 Now

Check out the demo dashboard of some of the new features in v6.2.

Improved Security

Datasources now store passwords and basic auth passwords in secureJsonData, encrypted by default. Existing datasources with unencrypted passwords will keep working. Read the upgrade notes on how to migrate existing datasources to use encrypted storage.

To mitigate the risk of clickjacking, embedding Grafana is no longer allowed by default. Read the upgrade notes for further details of how this may affect you.

To mitigate the risk of sensitive information being cached in the browser after a user has logged out, browser caching is now disabled for full page requests.

Provisioning

  • Environment variables support: See Using environment variables for more information.
  • Reload provisioning configs: See Admin HTTP API for more information.
  • Deletion of provisioned dashboards is no longer allowed.
  • When trying to delete or save a provisioned dashboard, the relative file path to the file is shown in the dialog.

Official Support for Elasticsearch 7

Grafana v6.2 ships with official support for Elasticsearch v7. See Using Elasticsearch in Grafana for more information.

Bar Gauge Panel

Grafana v6.2 ships with a new exciting panel! This new panel, named Bar Gauge, is very similar to the current
Gauge panel and shares almost all its options. The main difference is that the Bar Gauge uses both horizontal and
vertical space much better and can be more efficiently stacked both vertically and horizontally. The Bar Gauge also
comes with three unique display modes: Basic, Gradient, and Retro LED. Read the
preview article to learn
more about the design and features of this new panel.

Retro LED Display Mode

Gradient Mode

Check out the Bar Gauge Demo Dashboard.

Improved Table Data Support

We have been working on improving table support in our new react panels (Gauge & Bar Gauge), and this is ongoing work
that will eventually come to the new Graph, Singlestat, and Table panels we are working on. But you can see it already in
the Gauge and Bar Gauge panels. Without any config, you can visualize any number of columns or choose to visualize each
row as its own gauge.

Lazy Loading of Panels Out of View

This has been one of the most requested features for many years and is now finally here! Lazy loading of panels means
Grafana will not issue any data queries for panels that are not visible. This will greatly reduce the load
on your data source backends when loading dashboards with many panels.

Have a look at the demo dashboard, and try the new lazy loading feature.

Panels Without Title

Sometimes your panels do not need a title, and having that panel header still takes up space, making singlestats and
other panels look strange and have bad vertical centering. In v6.2, Grafana will allow panel content (visualizations)
to use the full panel height in case there is no panel title.

Have a look at the demo dashboard to get a feel for how panels without a title work.

Minor Features and Fixes

This release contains a lot of small features and fixes:

  • Explore: Adds user time zone support, reconnect for failing datasources, and a fix that prevents killing Prometheus instances when Histogram metrics are loaded.
  • Alerting: Adds support for configuring timeout durations and retries. See configuration for more information.
  • Azure Monitor: Adds support for multiple subscriptions per datasource.
  • Elasticsearch: A small bug fix to properly display percentile metrics in table panel.
  • InfluxDB: Support for POST HTTP verb.
  • CloudWatch: Important fix for default alias disappearing in v6.1.
  • Search: Scoped to the dashboard’s folder by default when viewing a dashboard.

Removal of Old Deprecated Package Repository

Five months ago, we deprecated our old package cloud repository and replaced it with our own. We will remove the old deprecated repo on July 1. Make sure you have switched to the new repo by then. The new repository has all our old releases, so you are not required to upgrade just to switch package repositories.

Changelog

Check out the CHANGELOG.md file for a complete list of new features, changes, and bug fixes.

Download

Head to the download page for download links & instructions.

Upgrading

Read important upgrade notes.

Thanks

A big thanks to all the Grafana users who contribute by submitting PRs, bug reports, and feedback!

Grafana Labs at KubeCon: The Latest on Cortex

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/21/grafana-labs-at-kubecon-the-latest-on-cortex/

Grafana Labs has been running Cortex for more than a year to power Hosted Prometheus in Grafana Cloud. We’re super happy: It’s been incredibly stable and has recently gotten insanely fast. Here’s what you need to know about Cortex, what we’ve been doing to Cortex in the past year, and what we plan on doing in the coming months.

What is Cortex?

Cortex is a Prometheus-compatible time series database that has a different take on some of the tradeoffs Prometheus makes. Cortex is a CNCF Sandbox project.

Global Aggregation

Cortex is horizontally scalable; it is not limited to the performance of a single machine. It can be clustered to pool resources from multiple machines, theoretically scaling to infinity! Cortex is also highly available, replicating the data across multiple machines such that it can tolerate machine failures with no effect on users.

Cortex Horizontally Scalable

These two Cortex features enable you to run a central Cortex cluster and have multiple Prometheis send their data there. You can then query all your data in one place with one query and get a globally aggregated view of your metrics.

This is super useful when you run multiple, geographically-distributed Kubernetes clusters, each with their own dedicated Prometheus servers. Have them all send metrics to your Cortex cluster and run global queries, aggregating data from multiple clusters, in one place.

Capacity Planning and Long-Term Trends

Cortex provides durable, long-term storage for Prometheus; it stores data in many different cloud storage services (Google Bigtable, GCS, AWS DynamoDB, S3, Cassandra, etc). Cortex uses the cloud storage to offer fast queries for historical data.

Long-term storage allows capacity planning and long-term trend analysis. You can go back a year and see how much CPU you were using, so you can plan the next year’s growth. You can also look at things like long-term performance trends to help identify releases that made latency worse, for instance.

One Cluster, Many Teams

Cortex supports native multitenancy; there can be multiple, isolated instances within a single Cortex cluster.

Cortex Multitenant

Multitenancy allows many teams to securely share a single Cortex cluster in isolated, independent Prometheus “instances” – without the overhead of having to operate multiple separate clusters. Simply put, there’s less cognitive load.

Cortex Progress in the Past Year

With the acquisition of Kausal, Grafana Labs invested heavily in the Cortex project in the last year. Here are some changes we’ve driven:

“Easy-to-Use Cortex”: We’ve built a single process/single binary monolithic Cortex architecture that makes it easier to get started and kick the tires. The same binary can be used for a set of disaggregated microservices in production. This work was heavily inspired by the success of a similar approach in Loki.

Query Performance: Over the past year we have built a parallelizing, caching query engine for Cortex. We have optimized Cortex’s indexing and query processing. Some queries have gotten 100x faster. We can now achieve ~40ms average query latency and <400ms P99 latency for our heaviest workloads in production clusters.

HA Ruler: You can now horizontally scale your recording rules, pre-aggregating much more data and helping make queries faster.

Ingesting Data from HA Prometheus Pairs: Cortex has always been highly available, but it has relied on a single source of truth for its data – a single Prometheus node per cluster. We now support highly available Prometheus pairs (or more) to make the whole pipeline highly redundant and replicated. Data is deduplicated on ingestion.

Cortex Infinitely Scalable

Cortex Going Forward

Cortex is an inherently stateful app, making techniques like continuous deployment challenging. We at Grafana Labs have been doing a release every week, usually every Monday, in which we deploy the latest master into our dev and staging environments and run it for a couple of days, catch bugs, then promote it to prod.

The Cortex master branch is already incredibly stable, and others running Cortex are also deploying master. We also place high importance on backwards compatibility, adding new features off by default behind feature flags. But this means that unless operators keep a close eye on the changes, they miss out on improvements. While nobody has voiced concerns so far, we don’t think this is a viable solution long term.

We will cut the first release of Cortex imminently and plan to cut a new release every month with a detailed changelog, so that folks can follow what’s going on and how to update to the latest and greatest.

We at Grafana Labs will still be running master to make sure our users will get the best of Cortex. We will ensure that master is still very stable for those wanting to deploy the bleeding-edge.

Are we done? Not yet. Our path to 1.0 includes adding a WAL (write-ahead log) for increased durability and deprecating old flags and index schema versions. And our post-1.0 goals include using Prometheus TSDB blocks to make Cortex an order of magnitude cheaper to run.

Interested in learning more? Join the Cortex project on GitHub.

How Verizon Achieved Automation and Self-Service with Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/20/how-verizon-achieved-automation-and-self-service-with-grafana/

Can you monitor us now?

That was the question Verizon started asking as the Fortune 500 company expanded its portfolio beyond communications services to include brands such as Yahoo! and Huffpost.

“We’re not just grandma’s landline,” Sean Thomas, Verizon Systems Engineering Manager, told the audience at GrafanaCon in L.A. “We’re not just your mobile provider. We are a media company. We have 5G solutions. We’re building technology. We’re building the future.”

By the end of 2018, Verizon employed 144,500 people to do just that. “In terms of scale, Super Bowl LIII had 70,000 people at it. That means that we filled up two stadiums at Super Bowl LIII and still had a few thousand people partying in the parking lot tailgating,” said Thomas.

But the varsity team for monitoring was the Verizon Systems Engineering team, which oversees cloud engineering, analytics, ITSM automation, and tools.

Thomas, who helps lead the full stack development division, said as the company grew the team strived to get a full picture of its internal systems so they could “take things into the future from a large-scale enterprise perspective.”

At the time, there were 40 servers running analytics for all of Verizon’s systems such as change management, availability management, change tracking, and event management. Those servers ran in an SSRS environment with SQL on Windows so the licensing costs alone were not ideal.

“It wasn’t efficient. It wasn’t scalable, it wasn’t modern, and it was just a pain to get anything done,” Thomas said during the GrafanaCon session. As Verizon restructured internally, “one of the hardest parts that we ran into was if the business said, ‘Hey, we’re going to make this change. This department is now called this.’ When that happened, you had to do a whole development effort just to change the name on all these reports. It was crazy.”

Grafana to the Rescue

The goal for the Systems Engineering team was to bring all the different data sources into a single, easily accessible view for end users and the executive team.

After looking at the infrastructure in place, Derek Meyer, a Verizon Engineer on the Automation Tools team, started looking into open source options. “I’ve always been a person who enjoyed open source software,” he said, “and try to contribute where I can.”

Meyer started playing around with Grafana. “I tossed up my little play website and played with my own data,” he explained. After some initial experimentation with other engineers, they decided to pursue Grafana as a company directive.

While putting together a new monitoring model, the team had set up the infrastructure to run several MySQL databases with replication to replace the SQL servers that incurred licensing costs. They also had Linux boxes set up for some time as well.

“We compared our old model to our new model and said, ‘Gee, this is a no brainer. Why don’t we continue down this path using Grafana?’” said Meyer.

There were a few hurdles along this path, however.

First, the team had to figure out how to handle its legacy infrastructure on premises, said Meyer. “Every time you build up on-prem stuff it’s, ‘Here’s a request; build me a server.’ Or ‘Here’s another request to get the OS on it.‘”

“That’s six months to do … if you’re lucky,” added Thomas.

To improve the ease of scalability, the team came up with a hybrid solution that leverages containers. “A lot of our data remains on prem just because of the sensitivity of it,” explained Meyer.

“Security has always been the biggest issue,” said Thomas. “That’s the main reason that we’re looking at a hybrid over a full cloud approach … There is quite a bit of sensitive data that the security and governance teams are uncomfortable with having out there.”

But, Meyer said, “we can stick the front-end in a hybrid situation cloud and help reduce the time as well as increase our redundancy.”

When they shifted their attention to the old SSRS servers, engineers discovered there were more than 500,000 lines of static code in stored procedures for things like change management and incidents.

“The code had been around for a very long time and to make a change to it you were really hoping that what you were doing wasn’t going to break something else,” said Meyer.

Instead the Verizon team broke down the existing code and drastically decreased that number to 500 lines of dynamic code in only five stored procedures thanks to the functions within Grafana.

“Those 500,000 lines were in 200+ different stored procedures. Lots of them were multi-thousands of lines, where everything was the same but one variable. When you wanted to go try and change it, it was hard,” explained Meyer. “We do all of our change metrics, our incident response, and ticket tracking off of five stored procedures now by leveraging Grafana and MySQL.”

Did Grafana Really Make a Difference?

With all these large shifts in infrastructure, “the next big question is, ‘Was the change worth it?’” said Thomas.

The numbers speak for themselves: Since Grafana was implemented, there has been a 100,000% reduction in stored procedure code and 4,000% decrease in total stored procedures.

“I triple-checked these percentages,” said Thomas. “That’s actually correct.”

But here are three major improvements that Thomas and Meyer outlined to drive their point home:

1. Better Use of Time

One of the most positive outcomes of Grafana has been how much time is saved in managing and monitoring metrics at the company.

When a line of business changes names or a new VP joins the executive team or there’s a management reorg, “everything dynamically updates from the source data,” says Thomas. “The dashboards that previously showed the information for one person show it for the new person. I can automatically get everything I need.”

In the past, any org changes would involve multiple developers who would need to dedicate at least 30 days to complete the development effort.

“Every single one of those lines in the stored procedures had to be updated – and everybody knows what happens when that goes on,” said Thomas. “You miss one line and that, of course, is the line that one VP looks at. Another VP is looking at a completely different dashboard. The two numbers don’t gel, and your CIO gets two different stories from two different VPs. Then guess who gets the phone call at 2 AM?”

With the new system, “taking that [process] down to automated tasks and just updating 500 lines of code, that’s two [free] FTE right there,” said Thomas. “Those developers are not focusing on dashboards. Now they can focus on actual deliverables and everything that you actually have to get done through the year.”

2. Empowered End User

Prior to Grafana, reports were manually created for every request. “We had thousands of reports, a lot similar to each other,” said Meyer. “Over time they were going stale. You don’t always know if they’re all working without checking thousands of dashboards. The automation behind it was extremely difficult to do.”

Also, because there are various ways to view the same data, separate SSRS reports were required for each development effort.

“Now it’s a filter at the top of the page,” said Thomas. “Executives don’t have to fill out [requirements]. They get the data as they need it. It makes their ops reviews quicker to put together. It’s all at their fingertips.”

With this self-service model for metrics “you empower the end user,” said Meyer. “Anybody from a call center rep to a CIO can turn around and leverage that information and see it in a way that they want.”

Plus with some of the log-in abilities in Grafana, “if you tie it in to your LDAP ability, you can set it so that certain reports are only available to certain people,” said Meyer.

“It’s got a lot of flexibility,” Meyer added, “and just makes life so much easier.”

3. Fewer Fire Alarms

Thankfully, there have also been fewer unwanted disruptions for the engineering teams.

“One of the big things that we first noticed immediately was fewer fire alarms,” said Thomas. “When I say fire alarms, I mean late-night text messages, late-night phone calls with ‘This data’s wrong; this data’s inconsistent.’”

Thomas has also noticed his inbox is getting much less traffic. “There’s a significant reduction in emails,” he said. “If you have one dashboard wrong in a company the size of Verizon, you don’t hear about it from one person. You get 17 different emails, all from different executive directors or different management teams.”

All of these factors add up to a better quality of life for developers at the company. “How many of us in here pulled 24 hour days doing something? Or had to get up at 2 AM to try to fix something? Or you left work and you turned around and said, ‘Oh crap, I forgot I need to do this by the morning,’” said Meyer.

“I know over the last 10 years my stress level and my blood pressure have gone up,” said Meyer. “Now with the infrastructure that’s in place, I don’t necessarily have to worry as much about it.”

Focusing on the Future

With the days of the “drop everything” fire alarm in the past, engineering teams can now look towards the future.

In recruiting more teams to use Grafana, “we first showed off the capabilities by decommissioning 40 licensed systems, moving it onto this singular platform,” said Thomas. “The next piece is now we’re marketing. We’ve got cloud engineering teams, our network teams, our storage teams coming on board and seeing the power that’s available within Grafana.”

As Thomas’s team shifts away from development efforts involving dashboards and metrics, “we can actually get real work done.”

That involves contributing to making Grafana even better. “This truly is that single pane of glass solution. There’s more data sources being added on a regular basis. It’s an open source solution for those data sources,” said Thomas.

And if Grafana doesn’t offer the solution Verizon needs at the moment, “maybe it doesn’t exist today, but it could exist tomorrow,” said Thomas. “We have plenty of talent within the company that could certainly contribute and create those data sources if they’re needed.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Meet the Grafana Labs Team: Johannes Schill

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/17/meet-the-grafana-labs-team-johannes-schill/

As Grafana Labs continues to grow, we’d like you to get to know the team members who are building the cool stuff you’re using. Check out the latest of our Friday team profiles.

Meet Johannes!

Name: Johannes Schill

Johannes Schill

Grafana Labs Developer Johannes Schill

Current location/time zone:
CET/GMT+1/STO/ARL/+46-8. I live close to the Stockholm office.

What do you do at Grafana Labs?
I’m a frontend developer. I don’t have any special focus; you can find me in the git history all over the place.

What open source projects do you contribute to?
Grafana. If we find issues/room for improvement in other packages, we help out the best we can.

What are your GitHub and Twitter handles?
jschill at GitHub. No fan of Twitter.

What do you like to do in your free time?
Free time? I have a 2-year-old at home. But I remember those days. I liked that “free time” thing.
Once a week I try to see some football (soccer if you’re U.S.). I’ve had a chair at Djurgårdens games for 18 years now. In the summer I try to run (but I’ve just had surgery on my knee, so I’ll have to wait a month or two this year), and in the winter I like snowboarding and skiing, both downhill and cross country.

Do you like to code in silence, with music or ambient noise on your headphones, or with people talking around you?
Depends on the situation, but if I need to focus, then music or silence. Music is a hobby of mine, and I have a little vinyl collection at home with German ’60s/’70s, Swedish indie, and some modern post-krautrock.

What do you do to get “in the zone” when you code?
Coffee, music, and that animated gif of Nicolas Cage leaving the bus in Con Air.

via GIPHY

Spaces or tabs?
A no-brainer. Tabs.

Worth a Look: Public Grafana Dashboards

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/16/worth-a-look-public-grafana-dashboards/

There are countless Grafana dashboards that will only ever be seen internally. But there are also a number of large organizations that have made their dashboards public for a variety of uses. These dashboards can be interesting to browse, giving you an insider’s peek into how real Grafana users set up their visualizations, with actual live data to boot.

Perhaps some of them will inspire you to get to work on your own Grafana?

GitLab

GitLab is a famously transparent company. They’ve even live streamed internal outages in the past. So it’s not surprising that they’d make a bunch of their internal Grafana dashboards for their cloud infrastructure public. The GitLab Dashboard offers graphs on everything from disk stats to fleet overviews, to alert reporting and triage.

GitLab Dashboard

Wikimedia

As one of the most popular sites on the Internet, Wikipedia operates at a truly incredible scale. The foundation behind the site exposes its Wikimedia Metrics via Grafana dashboards. The dashboards range from a global datacenter overview to API request rates. Be sure to adjust your eyes for some of their mind-bogglingly high numbers.

Wikimedia Dashboard

Cloud Native Computing Foundation

CNCF’s DevStats tool provides analysis of GitHub activity for Kubernetes and the other CNCF projects. Dashboards track a multitude of metrics, including the number of contributions, the level of engagement of contributors, how long it takes to get a response after an issue is opened, and which special interest groups (SIGs) are the most responsive. Grafana Labs is a member of the CNCF, and while we provided some help in getting DevStats up and running, the CNCF has put a lot of effort into this open source tool. It’s impressive to see what they’ve accomplished.

CNCF Dashboard

Grid Computing Centre Karlsruhe

GridKa, a grid computing centre that provides computing and storage for experiments including those at the Large Hadron Collider, visualizes its data with a public GridKa Grafana that tracks everything from cluster utilization to system metrics for its experiments. Grafana powering science.

GridKa Dashboard

CERN

The European Organization for Nuclear Research operates the largest physics laboratory in the world. You can find more details about the experiments that members are doing at the Large Hadron Collider and other facilities on this public Grafana. Note those are tens of Gigabits per second they’re talking about.

CERN Dashboard

Zabbix Plugin

There’s a Grafana plugin for the Zabbix open source network monitoring system that’s maintained by one of our Grafana Labs team members, Alexander Zobnin, and his play site provides a good demo of how the plugin works.

Zabbix Plugin Dashboard

OGC SensorThings Plugin

This is a demo site for a plugin for the open source framework for interconnecting IoT. The example shown here is live tracking of a shuttle bus.

SensorThings Plugin Dashboard

Hiveeyes Project

The open source Hiveeyes Project is developing a flexible beehive monitoring infrastructure platform. This public Grafana visualizes weather in Germany.

Hiveeyes Dashboard

Percona

The Percona demo site offers examples of its Percona Monitoring and Management dashboards. It’s an open source platform that provides time-based analysis to help ensure your databases run as efficiently as possible.

Percona Dashboard

Grafana

And of course there’s the Grafana Play dashboard. This is one of the original public Grafana instances, hosted by Grafana Labs. It has served multiple purposes. First, it’s a demo site for people to get introduced to the various features and capabilities of Grafana. We also use it as a way to test issues or fixes, or demonstrate particular features.

Grafana Play Dashboard

How to Streamline Infrastructure Monitoring with Sensu, InfluxDB, and Grafana

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/14/how-to-streamline-infrastructure-monitoring-with-sensu-influxdb-and-grafana/

“To start, your monitoring stack should not cost you stacks,” Sensu Software Engineer Nikki Attea told the crowd at GrafanaCon L.A. “Avocado toast is really expensive. But the good news is your monitoring solution doesn’t have to be.”

To prove it, Attea presented an easy, developer-centric use case that leverages Sensu, a monitoring event pipeline that collects, processes, and routes different event types including discovery, availability, telemetry, and alerts.

“The pipeline makes Sensu extremely powerful and completely customizable. So just think Nagios on steroids,” Attea said.

The company offers multiple mechanisms to monitor performance metrics, whether for applications or infrastructure.

Sensu1

For StatsD – a metric aggregator used to collect values such as gauges, counters, timers, and sets – Sensu agents have an embedded StatsD daemon that listens for UDP traffic. Read more about this service on the Sensu blog.
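
Emitting a metric to that embedded daemon is just a UDP datagram in the StatsD text format. Below is a minimal sketch; 8125 is the conventional StatsD port, but the Sensu agent's listener may be configured differently:

```python
# Send a counter and a timer to a local StatsD listener over UDP.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def statsd_send(metric, host="127.0.0.1", port=8125):
    sock.sendto(metric.encode(), (host, port))

statsd_send("checkout.requests:1|c")    # counter: one checkout request
statsd_send("checkout.duration:27|ms")  # timer: request took 27 ms
```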

To monitor infrastructure, Sensu service checks collect data on monitored nodes and follow the same protocol as Nagios service checks. Each Sensu agent runs the collection of checks, and each check will output data, produce an exit code, and indicate a specific state. Sensu then parses the check output and produces metrics.

While Attea focused on service checks at GrafanaCon, she said, “Spoiler alert: The more complex your stack gets, you’ll probably want both [checks and metrics].”

Using a simple stack including Sensu, InfluxDB and Grafana – all open source tools with enterprise counterparts – Attea walked through how Sensu service checks work with Grafana to visualize data and improve monitoring.

Output Metric Extraction

Sensu2

Sensu currently supports four different metric formats: InfluxDB Line Protocol, OpenTSDB, Graphite, and Nagios Performance Data.

“The key below each type is the identifier you would use in a given Sensu check configuration,” said Attea. “It determines which format the check output should be parsed as and maps to the output metric format field.”

In addition, “Sensu actually supports a wide variety of built-in metric protocols and basically limitless plugin potential to store them,” said Attea.

Sensu Check Configuration

Sensu3

Here, Attea defined a check called Check CPU InfluxDB.

“It’ll be set to run every 10 seconds on any node it’s subscribed to,” she explained. The command at the bottom is a simple shell script that will print out CPU usage in InfluxDB Line Protocol. The last two fields on the left indicate that check output metric extraction will occur.
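
A check command along those lines could be as small as the following hypothetical script – not the one from the talk – which prints a metric in InfluxDB Line Protocol and signals state through its exit code:

```python
#!/usr/bin/env python3
# Hypothetical Sensu check: print CPU usage in InfluxDB Line Protocol
# (measurement,tags fields timestamp) on stdout and use the Nagios-style
# exit code (0 OK, 1 warning, 2 critical) to indicate state.
import sys
import time

def cpu_percent():
    # Stand-in for a real measurement (e.g. parsed from /proc/stat).
    return 42.0

usage = cpu_percent()
timestamp_ns = time.time_ns()
print(f"kube_apiserver,host=node-1 cpu={usage} {timestamp_ns}")

sys.exit(0 if usage < 80 else 2)
```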

“So the event that is produced contains not only execution context such as status, output, duration, etc.,” said Attea. “There will also be entity information about your monitor node, and lastly – and most importantly – the extracted metric, which is the Kube API server CPU value.”

The entire process is not only easy, said Attea. “It’s magic.”

Integrating a Time Series Database

Sensu4

Sensu has tight integrations with many time series databases, so users can simply pick the one they prefer.

In this example, Attea used an InfluxDB handler “because Influx has a super simple Golang client, and Go is my language of choice,” she said.

“The handler configuration on the right takes the event data and invokes the Go binary called Sensu InfluxDB Handler,” Attea explained. “This accepts configuration options as either command line flags or environment variables. And then additional metric tag enrichment can happen as part of the Sensu event pipeline.”

This setup will eventually accept the event data through standard in, and then the metrics will be sent off to the configured time series database.
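
The shape of any such handler is the same: read the event JSON from standard in, pull out the metric points, and ship them to the database. Here is a hedged Python sketch of just that first part – the real handler is written in Go, and the exact event field names should be checked against the Sensu docs:

```python
# Minimal sketch of a metrics handler: Sensu pipes the event JSON to stdin,
# and we pull out the metric points. Field names follow the general shape of
# a Sensu Go event but are illustrative.
import json
import sys

event = json.load(sys.stdin)
for point in event.get("metrics", {}).get("points", []):
    name = point.get("name")
    value = point.get("value")
    timestamp = point.get("timestamp")
    # A real handler would write these points to InfluxDB here.
    print(f"{name}={value} @ {timestamp}")
```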

“This was previously an enterprise feature, so take advantage of it,” advised Attea. Find the source code here.

Inside the Monitoring Event Pipeline

Sensu5

In the event pipeline, the Sensu backend sends service checks to monitored nodes with installed Sensu agents. The agents execute the checks and extract the metrics in any of the four supported formats, and then the backend receives that event data and passes it through the monitoring event pipeline.

“In this specific use case, you can filter this event only if it contains metrics, mutate that event to enrich any metric tags, and add additional context about the data and source of the metrics. Then you would handle the events by sending them off to a time series database,” said Attea.

This diagram also folds in StatsD metrics as well as another integration with a Prometheus metrics endpoint.

“Essentially any telemetry event that the agent receives will be processed by the backend, which is important because in order to have complete visibility of your app, system, services, infrastructure, you’ll likely have to receive data from multiple sources,” Attea explained. “It’s great that there’s a single entry point for all of this data, but as you start to add different event types like availability and alerts, you’ll be thankful that the pipeline is dynamic enough to support re-usability all under the same hood.”

Visualizing the Data

Sensu6

This Sensu dashboard prioritizes critical events over normal statuses. “It’s backend- and API-driven, so while the Sensu dashboard does provide excellent visibility into the overall health of your system and the state, it doesn’t directly visualize time series data,” Attea said.

Enter Grafana.

Sensu7

“In this dashboard there’s a single data source as far as Grafana’s concerned, because we let Sensu do all of the heavy lifting,” Attea said. “The Sensu checks shown here are displaying metrics from both Graphite and Influx while the StatsD daemon is tracking all of the API calls and requests rates. … I’d say this dashboard is pretty sleek, so thanks Grafana for making that easy!”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Using Grafana to Monitor EMS Ambulance Service Operations

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/13/using-grafana-to-monitor-ems-ambulance-service-operations/

The Emergency Services team at Trapeze Group provides 24/7/365 support for ambulances in Australia. Each fleet can contain as many as 1,000 vehicles, with more than 60 telemetry channels and 120 million messages going in and out to paramedics every day.

The information ranges from “the 911 call that comes into the control center to dispatching ambulances; to monitoring them on scene, looking at vehicle performance, making sure they’re on scene quick enough, getting them to the right location, getting them out of scene, getting them into hospital,” James Wetherall, Group Manager, Emergency Services, said during his talk at GrafanaCon L.A.

These operations can be, quite literally, a matter of life or death, so observability is critical. To make the information more accessible, Wetherall said, “we’re trying to ingrain Grafana in the customers that we’re working with, get it exposed, and get everything visual.”

Trapeze’s data environment is based around SQL Server and Postgres. Because Trapeze’s customers operate in large geographical areas where there isn’t always high-speed cellular coverage – “I’m talking like 9KB a second,” he said – Wetherall’s team has in the past year done some work with the spatial side of Postgres, building custom protocols.

The data covers location, connectivity, safety, communication, awareness, and integration. There’s information about where the equipment is and where the people are, and communication with the hospitals. There are many integrated data sources around vehicle telematics and the internet of lifesaving things, such as defibrillators and specialist equipment, as well as those around smart cities, like traffic lights. “We’re monitoring all this in real time, making sure they have connectivity as we do things like run on multiple networks,” said Wetherall. “If one service is out, we need to use others and keep retrying.”

EMS Slide 17

The Grafana Journey

When Trapeze started building out an observability solution for its customers three years ago, it was fully bespoke. “We’ve tried a whole bunch of stuff: different frameworks around Angular and React, some of the web bootstraps,” Wetherall said.

The team soon found that it was impossible to get everything they wanted quickly. “The software development to get something out in the hands of people was just slowing us down,” he said. “Adding multiple data sources was just impossible; it was like building something new every time. Most things weren’t operating in real time… The whole technicality of stitching these things hindered our progress. We really wanted to deliver in minutes. We wanted to be out of rapid prototype in front of users in real time. We wanted it to be synced to those data sources, remove dependency on developers.”

Wetherall’s goal was to focus development on building plugins and enhancing the solution, not architecting something from the ground up. “And we really want it to be an everyday tool,” he said. “Everyone in these organizations should have access, they should be able to see what they need to see quickly, and there shouldn’t be a steep learning curve.”

Grafana enabled all of this. “The key takeaway for me was to get it out of the back room and into our customers’ hands,” he said. “That’s just been a game-changer for us.”

During the talk, Wetherall showed a number of Trapeze’s key Grafana dashboards.

This dashboard monitors messages:

EMS Slide 25

Using Grafana’s built-in maps to track vehicles, this dashboard shows information like where they are, what status they are in, how fast they’re going, and what the current response time is:

EMS Vehicle Tracker

Other dashboards even drill down to things like the batteries in the vehicles, juxtaposed with what priority jobs they’re on and how long they’ve been responding.

Wetherall’s team keeps finding new and better ways to present all of the information that emergency services need. “The current world map plugin will allow us to plot points and scale points,” he said. In their first foray into customizing plugins, “we’ve been doing things like stitching together points to make lines, overlaying things like critical points within that response to an incident, when the crew got the job, when they arrived at scene. So it’s an easy way to drill back into jobs, go back in history, and replay how long the job took.”

EMS World Map
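
The plugin code itself wasn’t shown, but the stitching Wetherall describes essentially amounts to grouping raw GPS fixes by job and ordering them in time so they can be drawn as a route rather than as isolated points. A rough sketch, with assumed field names:

```python
# Illustrative only: group GPS points by job and order them by timestamp so the
# map layer can draw them as a line. Field names are assumptions, not the
# actual Trapeze plugin code.
from collections import defaultdict
from operator import itemgetter

def build_routes(points):
    """points: iterable of dicts like {"job_id": ..., "ts": ..., "lat": ..., "lon": ...}.
    Returns {job_id: [(lat, lon), ...]} with coordinates ordered by timestamp."""
    by_job = defaultdict(list)
    for p in points:
        by_job[p["job_id"]].append(p)
    return {
        job: [(p["lat"], p["lon"]) for p in sorted(pts, key=itemgetter("ts"))]
        for job, pts in by_job.items()
    }

sample = [
    {"job_id": "J42", "ts": 2, "lat": -33.87, "lon": 151.21},
    {"job_id": "J42", "ts": 1, "lat": -33.86, "lon": 151.20},
]
print(build_routes(sample))  # {'J42': [(-33.86, 151.2), (-33.87, 151.21)]}
```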

Other dashboards hooked into the vehicle telemetry – battery life, lamp status, oil pressure, temperature – are integrated into the fleet management system, so vehicles can get repaired when necessary. Field techs commissioning vehicles can get a quick view of whether everything is working properly. “We can do some sort of predictive modeling on this to see degradation of batteries and put in place things like optimal replacement programs,” said Wetherall.
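
The predictive modeling is only mentioned in passing; a back-of-envelope version could be as simple as fitting a linear trend to periodic battery-capacity readings and projecting when a vehicle crosses a replacement threshold. The data, field names, and threshold below are illustrative, not Trapeze’s actual model:

```python
# A back-of-envelope sketch of the degradation idea: fit a linear trend to
# capacity readings and project when the battery falls below a threshold.
import numpy as np

def estimate_replacement_day(days, capacity_pct, threshold=70.0):
    """days: measurement days since install; capacity_pct: measured capacity (%).
    Returns the projected day the capacity drops below the threshold, or None."""
    slope, intercept = np.polyfit(days, capacity_pct, 1)  # simple linear fit
    if slope >= 0:
        return None  # no measurable degradation trend yet
    return (threshold - intercept) / slope

days = np.array([0, 30, 60, 90, 120])
capacity = np.array([100.0, 98.5, 96.8, 95.4, 93.9])
print(round(estimate_replacement_day(days, capacity)))  # projected replacement day
```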

Another feature the team is working on is overlaying information and using the annotation tool to allow people to analyze the data, flag an annotation, pull it out, and put in a request to a workshop to service that vehicle.
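
Grafana does expose an annotations endpoint in its HTTP API (POST /api/annotations), so a minimal sketch of flagging a vehicle issue programmatically might look like the following. The workshop/service-request handoff Wetherall describes is Trapeze-specific and is only hinted at here; the URL, token, and IDs are placeholders.

```python
# A minimal sketch of creating a Grafana annotation via the HTTP API.
# The downstream workshop request is an assumption; URL, token, and IDs are placeholders.
import time
import requests

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "REPLACE_ME"  # an API key with Editor permissions

def flag_vehicle_issue(dashboard_id: int, panel_id: int, vehicle_id: str, note: str):
    payload = {
        "dashboardId": dashboard_id,
        "panelId": panel_id,
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["vehicle-issue", vehicle_id],
        "text": note,
    }
    resp = requests.post(
        f"{GRAFANA_URL}/api/annotations",
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()

flag_vehicle_issue(12, 3, "AMB-042", "Battery voltage dipping under load; send to workshop")
```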

They’re also looking at calendar control to overlay information about how many jobs a vehicle has responded to in a day, to help with scheduling.

Looking Ahead

“When we first started on the Grafana journey, we had a lot of pushback from our development team,” Wetherall admitted. “They kept coming up with reasons why we couldn’t do it that way, and we needed to keep building the Angular dashboards. But we’ve been able to deal with all those requirements from our development team and our users, and I think we’re sort of at the stage where we’ve got a couple use cases that would really enhance it.”

The world map is a key part of those enhancements, and so are horizontal charts (which are coming soon). “And we’re really interested in pulling in things like live data sources,” said Wetherall. “At the moment we are pulling from databases, and I think there’s some real improvements in that space.”

EMS Plugins

“We’ll look at a whole bunch of things to really let us ingest a whole lot more data and analyze it on the fly, aggregate it up, throw it in a data store,” said Wetherall. Apache Spark, Prometheus, and InfluxDB are technologies of interest as the team focuses on storing data for later analysis and on building out Trapeze’s message queues, anomaly detection, and big data stores.

“We sort of feel like Grafana gives us that visual ability,” he said, “so you can start conceptualizing it now, and you can start really building these use cases.”

For more from GrafanaCon 2019, check out all the talks on YouTube.

Meet the Grafana Labs Team: Andrej Ocenas

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2019/05/10/meet-the-grafana-labs-team-andrej-ocenas/

As Grafana Labs continues to grow, we’d like you to get to know the team members who are building the cool stuff you’re using. Check out our latest Friday team profile.

Meet Andrej!

Name: Andrej Ocenas

Andrej Ocenas

Grafana Labs Developer Andrej Ocenas

Current location/time zone:

Bratislava, Slovakia, so UTC+1 (+2 in summer).

What do you do at Grafana Labs?

I am a fullstack developer. Right now I work mainly on backend, but otherwise I do a lot of React and frontend stuff.

What open source projects do you contribute to?

Right now mainly Grafana, which probably isn’t very surprising. Before that, I didn’t work much on open source; I tried to contribute wherever I could and whenever I had time, which usually meant some fixes in things like the React animation library and some iOS Swift/Objective-C stuff.

What are your GitHub and Twitter handles?

On GitHub, aocenas. On Twitter, I’m @aocenas, but I never use it, as I’ve always found it a bit confusing.

What do you like to do in your free time?

My biggest hobby right now is snowboarding, and a close second would be buying snowboarding gear. I’ve probably watched more snowboarding gear reviews lately than I’m comfortable admitting. 😀 So if anybody needs help deciding what to buy, I could probably help. Otherwise I try to exercise as much as I can, and I’m trying to figure out how to eat healthy without putting too much time into cooking. I like to read, mainly sci-fi books, and I love Audible, as it lets me listen to books while doing other boring stuff.

What’s the last thing you binge-watched?

Love, Death & Robots on Netflix, and I can recommend it. I love this kind of sci-fi anthology series.

Which Avenger are you?

Probably Bruce Banner, but specifically as seen in Avengers: Infinity War, where he could not change into Hulk. 😀