A Menagerie of DevOps Habits, Part 2
Alerts and notifications are how we find out when something is out of the ordinary with our systems. Unfortunately, as we scale up and out, the volume of those messages can get unwieldy. Today, I’ll be talking about some of the bad habits that let our notifications get into that state and what can be done to dial them back down.
I Need to Know Everything, So I Alert on Everything
This habit is usually an accidental consequence of “I don’t know enough about my systems (or code).” When you are afraid of the conditions you cannot predict, referred to as unknown unknowns, you will likely spend a lot of energy trying to alert on every situation you can imagine. All of those alerts will increase how much digital noise your systems make, but that doesn’t mean you are successfully capturing the critical issues you hope to catch.
Anything known can, to an extent, be predicted. For the unknowns, you will probably find yourself building a series of “what-if” scenarios into your alerts that never yield anything useful. You may not even delete them when you find them unhelpful, just in case they turn out to be useful later. This is more harmful than helpful in both the short and the long term; you’re actually training people to ignore alerts because most of them are ignorable. It’s similar to the AC or heater in your home: if it’s kicking on all the time, your brain filters it out so you don’t “hear” it any more.
Solve this 🔦: Learn what expected and errant behavior look like, and build a practice for handling the unknown unknowns as they surface. You will never be able to predict everything, and trying to alert on everything only introduces more undesirable alerts. (For more clarity on known knowns, known unknowns, etc., take a look at this article.)
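To make the idea concrete, here is a minimal sketch of what it looks like to collapse a pile of speculative “what-if” alerts into a single symptom-based check. All of the names and thresholds below are hypothetical assumptions for illustration, not the configuration of any particular tool: the point is that one check on user-facing impact (error rate over a window) fires regardless of which underlying cause produced the problem.

```python
# Hypothetical symptom-based alert check: fire when the user-facing error
# rate over a window breaches a threshold, instead of maintaining dozens of
# speculative cause-based "what-if" alerts.

def should_alert(request_count: int, error_count: int,
                 error_rate_threshold: float = 0.05,
                 min_requests: int = 100) -> bool:
    """Return True if the error rate over the window warrants an alert."""
    if request_count < min_requests:
        # Too little traffic to draw a conclusion; avoid noisy alerts.
        return False
    error_rate = error_count / request_count
    return error_rate >= error_rate_threshold

# Example: 5,000 requests with 400 errors in the last window -> alert fires.
print(should_alert(request_count=5000, error_count=400))  # True
```

A check like this won’t tell you *why* things broke, but it reliably tells you *that* users are affected, which is the alert worth waking someone up for.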
I Don’t Know Who to Notify, So Everyone Needs to Know Everything
Another form of “too much of a good thing.” When there is uncertainty about who is knowledgeable about a certain application, service, etc., a notification is blasted to everyone instead of directly notifying a subject matter expert. This may be truly everyone or a subset like “all of the leads.” It might initially seem like sending notifications to everyone and letting them sort it out is better than not sending the notifications at all, but that’s not the case. Since each person will receive many more alerts than they will take action on, people will learn to ignore them. And in circumstances where more than one person can take action, there is no clear owner of the incident, so ownership has to be established while the incident is active, which is not ideal when the priority should be resolving that incident efficiently.
Solve this 🔦: Take the time to determine who can take action on which alerts, and configure the notifications to reach only that person (who that person is should be set by a rotating on-call schedule). We have information about how to do this in our Incident Response Ops Guide.
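As a rough sketch of the routing idea, the snippet below maps each service to its own on-call rotation and pages only the person currently on call instead of blasting everyone. The service names, engineer names, weekly rotation, and `notify()` stub are all assumptions for illustration; in practice this lookup lives in your paging or scheduling tool.

```python
from datetime import datetime, timezone

# Hypothetical mapping of services to their on-call rotations.
ROTATIONS = {
    "checkout-service": ["alice", "bob", "carol"],
    "search-service": ["dave", "erin"],
}

def current_on_call(service: str, now: datetime = None) -> str:
    """Pick the on-call engineer for a service on a simple weekly rotation."""
    now = now or datetime.now(timezone.utc)
    rotation = ROTATIONS[service]
    week = now.isocalendar()[1]  # rotate once per ISO week
    return rotation[week % len(rotation)]

def notify(service: str, message: str) -> None:
    # In a real setup this would call your paging tool's API; here we print.
    print(f"Paging {current_on_call(service)} for {service}: {message}")

notify("checkout-service", "Error rate above threshold")
```

The important part is not the scheduling math but the ownership model: every alert resolves to exactly one accountable responder at any given time.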
Alerts Are Only for IT / Engineering
Previously, I mentioned that the only person or people who should be notified of an incident are those who can take action. Now, I’m going to ask you to broaden the scope of what “action” means.
The action we normally associate with an IT incident is the triage, troubleshoot, and resolve work, but there are more actions than these. If multiple people or teams are involved in resolving the incident, then someone needs to coordinate them. Someone needs to send internal and/or external communications, and so forth, depending on scope and severity. These actions should be handled by non-engineer roles so that the engineers can be brought in as the subject matter experts who resolve the technical issue.
What about alerts that aren’t IT related at all? There are many unique use cases for exactly that. For example, Good Eggs, a grocery delivery service, sets alerts on refrigerator temperatures (watch the customer video here), and Rainforest Connection uses alerts to learn of illegal logging. There are also ways non-engineers can use alerts that aren’t restricted to niche use cases. For instance, depending on how your HR systems are set up, a notification could be sent when an HR violation needs to be addressed or, if you’re in marketing, you could set up an alert on ad spend.
Solve this 🔦: Expand the scope of what “action” means to include all the tasks that need to be done for an active incident before it is resolved and set alerts for the people who truly need to be notified. Also think of ways that non-engineers could benefit from receiving alerts, such as improving their own manual processes.
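To show how simple a non-engineering alert can be, here is a hypothetical sketch of the ad-spend example mentioned above. The budget figure, owner name, and the print-based notification are illustrative assumptions, not a real integration; the same shape works for any manual check a non-engineer currently does by hand.

```python
# Hypothetical non-engineering alert: notify the marketing owner when daily
# ad spend exceeds the budget, instead of someone checking a dashboard daily.

DAILY_AD_BUDGET = 2_000.00  # dollars (assumed budget for illustration)

def check_ad_spend(spend_today: float, owner: str = "marketing-oncall") -> None:
    if spend_today > DAILY_AD_BUDGET:
        overage = spend_today - DAILY_AD_BUDGET
        print(f"Alert {owner}: ad spend ${spend_today:,.2f} is "
              f"${overage:,.2f} over the ${DAILY_AD_BUDGET:,.2f} daily budget")

check_ad_spend(2_450.75)
```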
We’re Measuring Everything We Need to Measure
This assumption happens a lot when you’re unsure what to measure, so you search for ideas and implement whatever the top results are. If you’re looking to support a DevOps culture, you might look up DevOps KPIs, find a Top N or similar list, try to implement those metrics, and pat yourself on the back for a job well done. Unfortunately, the job isn’t done if you haven’t asked qualifying questions: you might be measuring things you don’t need, missing things you do need, or measuring at the wrong scope. In this situation you’ll frequently find yourself gathering a lot of data but being unable to do anything useful with it.
Solve this 🔦: Start by asking what information you need, why you need it, and how to gather it. Knowing the “what” and “why” will help you determine which metrics to use and their scope, and the “how” can help you determine which tools to use.
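One lightweight way to force those questions is to write the “what,” “why,” and “how” down next to each metric before you start collecting it. The sketch below is an assumed structure with made-up example values, not a prescribed schema; anything that can’t answer the “why” field probably isn’t worth measuring yet.

```python
from dataclasses import dataclass

# Hypothetical structure for capturing the what / why / how of each metric
# before any data collection begins.
@dataclass
class MetricDefinition:
    name: str      # what you measure
    question: str  # why: the decision or question it answers
    source: str    # how: where the data comes from
    scope: str     # which teams or services it applies to

metrics = [
    MetricDefinition(
        name="change failure rate",
        question="How often do our deploys cause incidents?",
        source="deploy pipeline + incident tracker",
        scope="checkout-service team",
    ),
]

for m in metrics:
    print(f"{m.name}: {m.question} ({m.source}; scope: {m.scope})")
```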
Where to go from here
If you’d like more information about how to overcome cultural transformation hurdles for alerting and incident management, I definitely recommend reading our Ops Guides. Specifically, you might want to start with our Incident Response Guide and then branch out into Retrospectives and Postmortems. If you’d like to discuss how you’re implementing any of these changes or how you’d like to start, please feel free to find me on our Community Forum!