Let’s debunk the myths that surround this topic!

Monitoring vs. Observability

Kristjan Hiis
Published in Monitoring Metric · 4 min read · Nov 8, 2020


The definitive guide to best practices of observability. Or maybe not so much …

Photo by Charles Deluvio on Unsplash

We live in a world where every job title is an acronym or abbreviation built from the same handful of words (I’m looking at you, SRE, DevOps, NetOps, NetSecOps, and every other combination of Site, Networking, Operations, Development, and Security).
New combinations seem to appear daily, and to be fair, it’s already getting hard to stay on top of them all.

However, we are gathered here to talk about another phenomenon: observability (or o11y, for those of you who like to nerd out with numeronyms). I’ll quickly plug a tweet about numeronyms that made me chuckle.

https://twitter.com/nathankpeck/status/1167469307233820672

Digging deeper into the difference between m8g and o11y (that’s monitoring and observability, folks), Yuri Shkuro, a software engineer at Uber, has put it this way: monitoring is about measuring what you decide in advance is important, while observability is the ability to ask questions about your system that you didn’t know to ask upfront.

Monitoring

When we think about monitoring in general, it boils down to preconfigured checks that are executed systematically: the values gathered over each polling period are compared against the reference values you have set for those checks and metrics. To put it shortly, monitoring is the activity of observing the state of a system over time.
While monitoring alone was perfectly acceptable for the monolithic environments of a decade ago, today’s containerized, fast-moving infrastructure no longer fits that approach. Hence observability came into the picture.
Most of the time when we talk about monitoring we think of Zabbix, Nagios, Cacti, and other fossils in the toolbox. That’s not to say they’re bad, no sir! They do what they were built for, and to be fair, they do it quite nicely.
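To make that definition concrete, here is a minimal sketch of what a classic monitoring check amounts to. The metric source (psutil), the 30-second polling interval, and the 90% threshold are stand-ins I picked for illustration, not settings from any particular tool.

```python
import time

import psutil  # third-party dependency, used here only as an example metric source

THRESHOLD_PERCENT = 90   # reference value chosen in advance
POLL_INTERVAL_SEC = 30   # polling period

def check_memory_usage() -> None:
    """One preconfigured check: compare a polled value against a fixed threshold."""
    used_percent = psutil.virtual_memory().percent
    if used_percent > THRESHOLD_PERCENT:
        print(f"ALERT: memory usage at {used_percent:.1f}% (threshold {THRESHOLD_PERCENT}%)")
    else:
        print(f"OK: memory usage at {used_percent:.1f}%")

if __name__ == "__main__":
    while True:
        check_memory_usage()
        time.sleep(POLL_INTERVAL_SEC)
```

Everything about this check is decided upfront: which metric to poll, how often, and what counts as bad.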

Observability

Observability can be seen as a way of combining data from monitoring solutions with other signals, such as logs and traces. By binding them all together we can start asking questions about our infrastructure and the services that run on it. Observability is all about answering questions about your system using the data you have collected.
These questions aren’t simple ones like “is the server up or down?” (those belong to the realm of monitoring). If we tweak the question to “How many users are affected by the malfunctioning server?”, we have a question for observability.
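As a toy illustration of that difference, assume we collect structured request events carrying a user ID and the host that served each request. Answering the observability question then becomes a query over data we already have, rather than a check we had to predict in advance. The event shape and field names here are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class RequestEvent:
    user_id: str   # who made the request
    host: str      # which server handled it
    status: int    # HTTP status code

def affected_users(events: list[RequestEvent], bad_host: str) -> int:
    """How many distinct users hit errors on the malfunctioning server?"""
    return len({e.user_id for e in events if e.host == bad_host and e.status >= 500})

# Usage with made-up data:
events = [
    RequestEvent("alice", "web-03", 500),
    RequestEvent("bob", "web-03", 200),
    RequestEvent("carol", "web-03", 503),
]
print(affected_users(events, "web-03"))  # -> 2
```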

Practices, best, better, bestest

Let's touch on the subject of best practices. We want our observability to be as good as humanly possible! Does that mean we just have to instrument everything, monitor everything, and log absolutely everything? Nope! What we actually need is a truckload of high-cardinality data: data that is useful when trying to answer the real questions.

As Charity Majors (CTO of Honeycomb.io) profoundly put it:

Don’t attempt to “monitor everything”. You can’t. Engineers often waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft.

https://twitter.com/mipsytipsy/status/1305398051842871297
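To give a flavour of that “truckload of high-cardinality data”, here is a sketch of emitting one wide, structured event per request instead of instrumenting everything blindly. Every field name and value below is made up for the illustration, not a prescribed schema.

```python
import json
import time
import uuid

def handle_request(user_id: str, endpoint: str) -> None:
    start = time.monotonic()
    # ... the actual request handling would happen here ...
    duration_ms = (time.monotonic() - start) * 1000

    # One wide event per request: high-cardinality fields (user_id, request_id)
    # let us slice the data by questions we didn't predict in advance.
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "endpoint": endpoint,
        "duration_ms": round(duration_ms, 2),
        "region": "eu-west-1",      # illustrative value
        "build_version": "1.4.2",   # illustrative value
    }
    print(json.dumps(event))  # ship this to your event store of choice

handle_request("alice", "/checkout")
```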

Reduce alerts

Have you ever felt that some of the alerts that land in your #super-critical-alerts Slack channel get overlooked and never addressed? Maybe they don’t need addressing, or maybe the alert level is wrong. Fine-tune your alerts and map them to organizational KPIs so you understand which alerts can actually affect an internal SLA, SLO, or KPI.
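One way to picture that mapping is a small inventory of alerts annotated with the SLO or KPI they protect; alerts that protect nothing are the first candidates for demotion or deletion. The names, severities, and SLOs below are invented for illustration.

```python
# Each alert declares which SLO (if any) it protects; the names and numbers
# are illustrative, not taken from any particular tool.
ALERTS = [
    {"name": "checkout_error_rate_high", "severity": "critical", "slo": "99.9% checkout availability"},
    {"name": "api_p99_latency_high",     "severity": "warning",  "slo": "p99 latency < 500 ms"},
    {"name": "disk_tmp_partition_80pct", "severity": "info",     "slo": None},  # protects nothing measurable
]

# Alerts that map to no SLO/KPI deserve a hard look at their level, or their existence.
for alert in ALERTS:
    if alert["slo"] is None:
        print(f"Review: '{alert['name']}' is not tied to any SLO/KPI")
```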

Segregate alerts

The worst practice is to send all the alerts into a single channel. I know omnichannel has a nice ring to it and has grown into quite the buzzword, but your alerts should never, ever be omnichannel. Segregate them by alert level, geolocation, or whatever dimension makes sense for you.
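A minimal sketch of that routing idea, assuming Slack-style channel names and a couple of made-up severity levels and regions:

```python
def route_alert(severity: str, region: str) -> str:
    """Pick a channel from the alert's level and geolocation (channel names are made up)."""
    if severity == "critical":
        return f"#alerts-critical-{region}"
    if severity == "warning":
        return f"#alerts-warning-{region}"
    return "#alerts-low-priority"

print(route_alert("critical", "eu"))  # -> #alerts-critical-eu
print(route_alert("info", "us"))      # -> #alerts-low-priority
```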

Dip into SRE mentality

If an alert constantly fires, fix the underlying issue. For instance, if a server keeps alerting you about running low on memory, consider giving it more resources and sleep better without the constant OpsGenie calls.

Rename your alerts

No one knows what you meant by “Something buggy on Server X”. Give your alerts meaning by touching up the alert title itself, and if possible give the alert a body that lists what may have gone sideways and how to fix it.

Playbooks

Tightly coupled with the last idea: if at all possible, use playbooks that spell out which commands to run, where to look in case of X, whom to contact, and so on. These will come in handy in times of despair.
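Pulling the last two ideas together, here is a sketch of an alert with a descriptive title, a body listing likely causes, and a pointer to a playbook. Every field, step, and URL is a placeholder, not a real alerting schema.

```python
# A descriptive alert with a body and a playbook link; all names, steps,
# and URLs here are placeholders.
alert = {
    "title": "Checkout API error rate above 5% on web-03",
    "body": (
        "Likely causes: recent deploy, database connection pool exhaustion, "
        "or an upstream payment provider outage."
    ),
    "playbook": {
        "url": "https://wiki.example.com/playbooks/checkout-error-rate",  # placeholder URL
        "first_steps": [
            "Check the last deploy time and roll back if it correlates",
            "Inspect DB connection pool saturation on the affected host",
            "Page the payments on-call if the provider status page is red",
        ],
        "contact": "#team-checkout on-call",
    },
}

for step in alert["playbook"]["first_steps"]:
    print("-", step)
```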

Logging

Log only errors and exceptions. If no logs are coming in, that can be read as good news, which beats a flood of logs arriving every minute while you try to find the needle in the haystack.
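A minimal sketch of that stance using Python’s standard logging module, with the level raised so that only errors and exceptions make it to the output:

```python
import logging

# Only ERROR and above reach the handler; routine chatter is dropped at the source.
logging.basicConfig(
    level=logging.ERROR,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("checkout")

log.info("user viewed cart")              # filtered out
log.error("payment provider timed out")   # this one gets through

try:
    1 / 0
except ZeroDivisionError:
    log.exception("unexpected failure while computing totals")  # logged with traceback
```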

Conclusion

Make your solutions work for you, not the other way around: use the power of synthetic testing to mimic the end user, log the essentials, and aggregate data in ways that carry meaning. Observability isn’t a do-it-once-and-forget-about-it affair; it is a continuous improvement process, as any monitoring solution should be!
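As a final sketch, here is a tiny synthetic check that mimics an end user hitting a page and timing the response. The URL and the 2-second budget are placeholders, and a real setup would run something like this on a schedule from several locations.

```python
import time
import urllib.error
import urllib.request

URL = "https://example.com/"          # placeholder endpoint to probe
LATENCY_BUDGET_SEC = 2.0              # placeholder latency budget

def synthetic_check(url: str) -> None:
    """Act like an end user: request the page, check the status, time the round trip."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            ok = response.status == 200
    except (urllib.error.URLError, OSError):
        ok = False
    elapsed = time.monotonic() - start
    if ok and elapsed <= LATENCY_BUDGET_SEC:
        print(f"PASS: responded in {elapsed:.2f}s")
    else:
        print(f"FAIL: status_ok={ok}, took {elapsed:.2f}s")

synthetic_check(URL)
```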
