The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More
Digital transformation accelerated for many companies during the last 18 months. While it may have been on the agenda prior to COVID-19, teams were pushed to extreme speeds to digitize and meet the rising online demand. During this time, organizations learned important lessons that they’ll carry with them into the future. Leaders can use these lessons to build better products, healthier and more efficient teams, and a happier customer base.
Our team aggregated some of these key findings in our State of Digital Operations Report. One important lesson we learned was that critical incidents increased by 19% YoY between 2019 and 2020, and it doesn’t look like incident volumes will be slowing down anytime soon.
Some organizations had more opportunities to learn and grow than others during this period. For instance, the highest lift in critical incident volume was seen in the Travel & Hospitality and Telecom industries, with 20% more critical incidents. In late March 2020, we saw that highly stressed cohorts, including online learning platforms, collaboration services, travel, non-essential retail, and entertainment services, were experiencing up to 11x the number of critical incidents.
In this installment of our State of Digital Operations blog series, we’ll chat through how 2020 affected metrics like MTTR (mean time to resolve) and MTTA (mean time to acknowledge), burnout and attrition rates, and what leaders can do to improve the lives of their teams and their customers looking towards a digital future.
How did MTTA and MTTR change?
MTTA is the average time it takes for a responder to acknowledge an alert. MTTR is the average time it takes to actually resolve the incident. These aren’t the sole metrics that determine operational excellence, yet many organizations use them as a proxy and derive important insights from them. These insights are useful when pinpointing strengths and weaknesses in incident response processes.
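To make the two definitions concrete, here is a minimal sketch of how they could be computed from incident timestamps. The records below are made up for illustration, and measuring MTTR from the moment the alert triggers (rather than from acknowledgement) is an assumption; your own tooling may define it differently.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (triggered, acknowledged, resolved).
incidents = [
    (datetime(2020, 3, 30, 14, 0), datetime(2020, 3, 30, 14, 4), datetime(2020, 3, 30, 14, 50)),
    (datetime(2020, 3, 30, 19, 30), datetime(2020, 3, 30, 19, 32), datetime(2020, 3, 30, 20, 0)),
    (datetime(2020, 3, 31, 2, 15), datetime(2020, 3, 31, 2, 24), datetime(2020, 3, 31, 3, 25)),
]

def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60

def mtta_minutes(records):
    """Mean time to acknowledge: trigger -> acknowledgement."""
    return mean_minutes([ack - trig for trig, ack, _ in records])

def mttr_minutes(records):
    """Mean time to resolve: trigger -> resolution (assumed definition)."""
    return mean_minutes([res - trig for trig, _, res in records])

print(f"MTTA: {mtta_minutes(incidents):.1f} min")
print(f"MTTR: {mttr_minutes(incidents):.1f} min")
```

Because both metrics are means, a single very slow incident can skew them, which is one reason they work better as a proxy than as the sole measure of operational health.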
Our platform data showed that, while MTTR is improving, total time spent resolving incidents is still increasing. This is likely due to the increased number of critical incidents: even as teams get better at resolving each one, rising incident volume pushes the total time spent on incidents up. This takes a toll on technical teams as they see their workloads shift from planned work to unplanned work.
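The arithmetic behind this is simple: total time spent is roughly the number of incidents multiplied by MTTR. As a rough sketch, the snippet below combines the 19% rise in critical incidents cited earlier with a hypothetical 10% improvement in MTTR. The 10% figure is an assumption chosen purely for illustration, not a number from the report.

```python
def total_time_change(volume_change, mttr_change):
    """Relative change in total resolution time, given relative changes
    in incident count and MTTR (0.19 means +19%, -0.10 means -10%)."""
    return (1 + volume_change) * (1 + mttr_change) - 1

def break_even_mttr_change(volume_change):
    """MTTR change needed to keep total resolution time flat."""
    return 1 / (1 + volume_change) - 1

# 19% more incidents (from the report), 10% faster resolution (hypothetical).
print(f"Total time change: {total_time_change(0.19, -0.10):+.1%}")
print(f"Break-even MTTR change: {break_even_mttr_change(0.19):+.1%}")
```

Under these assumptions, total time still rises by about 7%, and MTTR would need to fall by roughly 16% just to hold total workload steady.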
MTTA is decreasing alongside MTTR. As teams onboard PagerDuty, they’re able to achieve a higher level of digital operations maturity via the platform. Digital operations maturity is the level of proficiency, ranging from manual to preventative, that teams have in handling urgent work. Each step is characterized by key capabilities. As teams standardize incident response, their MTTR improves. As they create more efficient on-call and alerting rules, their MTTA improves.
Another aspect of MTTA is the ack%, or the percentage of critical alerts that are acknowledged after they fire. This is another way to demonstrate operational maturity. The higher the ack% is, the more responsive and accountable your teams are. PagerDuty users were able to increase ack% over an account’s lifetime: the longer an account had been using PagerDuty, the better its ack% and MTTA were. Even with performance cohorts split out, with the 10th percentile being nearly twice as fast at acknowledging incidents as the 25th percentile, all accounts are seeing improved MTTA over time.
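As a quick illustration, ack% can be sketched as the share of alerts that have an acknowledgement timestamp at all. The alert records and the `acknowledged_at` field name below are hypothetical and do not reflect any particular API schema.

```python
def ack_rate(alerts):
    """Percentage of alerts that were acknowledged (0.0 for an empty list)."""
    if not alerts:
        return 0.0
    acked = sum(1 for alert in alerts if alert["acknowledged_at"] is not None)
    return 100 * acked / len(alerts)

# Hypothetical alerts: one of the four was never acknowledged.
alerts = [
    {"id": "a1", "acknowledged_at": "2020-03-30T14:04:00"},
    {"id": "a2", "acknowledged_at": "2020-03-30T19:32:00"},
    {"id": "a3", "acknowledged_at": None},
    {"id": "a4", "acknowledged_at": "2020-03-31T02:24:00"},
]

print(f"ack%: {ack_rate(alerts):.0f}%")
```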
Mobile adoption of the PagerDuty application helps improve MTTA and ack%, as on-call team members are rarely more than an arm’s reach away from being able to respond to an alert. This means customer-impacting issues are being handled faster than ever. But it also means that engineers are never really away from work. As the lines between work and home blur, it’s important to understand the weight of these alerts on technical teams.
How were burnout and attrition affected?
An abrupt 2 AM wakeup call might be an inconvenience if it happens once every few months. But if it’s happening multiple times per week, the effect is more pronounced: teams begin burning out, their mental health suffers, and eventually they leave the organization in the hopes of achieving a better work/life balance elsewhere. During this period, dubbed The Great Resignation, it’s imperative that organizations are able to attract and retain talent.
Leaders looking to understand their teams’ pain points can examine on-call both qualitatively and quantitatively to determine who is at risk of burning out, and why. Our platform data has given us some insight into what these triggers are.
Compared to 2019, organizations saw 4% more interruptions in 2020. However, when digging into the spread across time categories, there was a 9% increase in off-hour interruptions and a 7% lift in holiday/weekend hour interruptions, compared to a 5% increase in business hour interruptions and a 3% decrease in sleep hour interruptions.
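If you want to run a similar breakdown on your own alert history, the sketch below buckets interruptions into time categories and tallies them. The hour boundaries are assumptions made for this example (business hours 8 AM to 6 PM on weekdays, off-hours 6 PM to 10 PM, sleep hours otherwise, and all of Saturday and Sunday as weekend), and holidays are left out for brevity. Timestamps are assumed to be in the responder's local time.

```python
from collections import Counter
from datetime import datetime

def time_category(ts):
    """Bucket a local timestamp into an (assumed) interruption category."""
    if ts.weekday() >= 5:          # Saturday or Sunday
        return "weekend"
    if 8 <= ts.hour < 18:
        return "business"
    if 18 <= ts.hour < 22:
        return "off_hours"
    return "sleep"

def tally(interruptions):
    """Count interruptions per time category."""
    return Counter(time_category(ts) for ts in interruptions)

# Hypothetical interruption timestamps for one responder.
interruptions = [
    datetime(2020, 3, 30, 10, 0),   # Monday morning
    datetime(2020, 3, 30, 19, 0),   # Monday evening
    datetime(2020, 3, 31, 2, 0),    # Tuesday, middle of the night
    datetime(2020, 3, 28, 11, 0),   # Saturday
]

for category, count in tally(interruptions).items():
    print(category, count)
```

Comparing these tallies year over year, per category, is what surfaces shifts like the ones described above, which a single overall total would hide.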
While it’s good that fewer engineers are being woken up during their sleep, the 9% increase in off-hours means that family time, dinners, evening workouts, and more are being put aside to respond to interruptions. Over time, this irregular schedule adds up to about 12 additional weeks worked per year from each on-call team member.
Our platform data also showed that the more frequently engineers were paged off-hours, the more burned out they became. The median user receives two non-working-hour interruptions a month. On the other end of the spectrum, burned-out users were experiencing 19 non-working-hour interruptions per month. It’s no surprise that these burned-out users were the most likely to leave the company.
We saw that responder profiles leaving the platform (our proxy for attrition) experienced a higher than average off-hour incident load. Using regression analysis, we looked at material off-hour incident work volume for both deleted users and remaining users and found a statistically significant positive correlation between off-hour volume and a user’s odds of deletion.
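For readers curious what this kind of analysis looks like, here is a minimal sketch of a one-variable logistic regression, fitted with plain gradient descent, that relates off-hour incident volume to the odds of a user being deleted. This is not the model or the data behind the report: the data points are synthetic, the fitting routine is a simplified stand-in, and it does not compute statistical significance.

```python
import math

def fit_logistic(xs, ys, lr=0.01, epochs=20000):
    """Fit p(deleted) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Synthetic example: off-hour incidents per month, and whether the
# user profile was later deleted (1) or remained (0).
off_hour_volume = [0, 1, 2, 2, 3, 5, 8, 12, 15, 19]
deleted = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(off_hour_volume, deleted)
odds_ratio = math.exp(w)  # multiplicative change in odds per extra incident
print(f"coefficient: {w:.3f}, odds ratio per incident: {odds_ratio:.2f}")
```

A positive coefficient (an odds ratio above 1) means each additional off-hour incident is associated with higher odds of deletion. A real analysis would use a statistics library to report confidence intervals and p-values as well.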
In other words, to retain employees, leaders need to understand how to decrease interruptions (especially non-working-hour interruptions) for their teams. One way to do this is with intelligent noise reduction.
Reducing the noise to keep responders healthy
These off-hour interruptions are sometimes unavoidable. After all, if your checkout cart stops working at 7 PM, you can’t just lose revenue until your team is back online the next morning. But sometimes on-call engineers are paged at 2 AM for things they can do nothing about. Noise reduction can help, as it allows teams to focus on what’s really important.
Production systems generate a lot of events; only some of these rise to the level of an alert, meaning a sign that something could be wrong. The rest can simply be logged in your monitoring system for further inspection. Even among alerts, some are irrelevant. They might be repeat alerts, or inactionable ones, or ones that could be resolved through auto-remediation with no human intervention.
Our platform data showed that through event compression and alert grouping techniques, we’re able to help customers reduce event-to-incident noise by 98%. Thus, alert storms are reduced to the minimum necessary number of actionable alerts. If you want to learn more about this, you can hear from Etsy on how we helped the team proactively identify noisy, non-actionable alerts and control what got to disrupt the team’s flow state or deep sleep.
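To give a feel for the general idea, here is a deliberately simple sketch of time-window grouping: events that share a key and arrive close together are folded into a single incident instead of paging someone each time. This is not PagerDuty's actual algorithm. The event fields, the `(service, check)` grouping key, and the five-minute window are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def group_events(events, window=timedelta(minutes=5)):
    """Fold events with the same (service, check) key into one incident
    while each new event arrives within `window` of the previous one."""
    incidents = []
    latest_by_key = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["service"], event["check"])
        incident = latest_by_key.get(key)
        if incident is not None and event["timestamp"] - incident["last_seen"] <= window:
            incident["count"] += 1
            incident["last_seen"] = event["timestamp"]
        else:
            incident = {"key": key, "first_seen": event["timestamp"],
                        "last_seen": event["timestamp"], "count": 1}
            incidents.append(incident)
            latest_by_key[key] = incident
    return incidents

def noise_reduction(events, incidents):
    """Percentage of events that did not become their own incident."""
    if not events:
        return 0.0
    return 100 * (1 - len(incidents) / len(events))

# Hypothetical alert storm: one failing check firing once a minute.
start = datetime(2020, 3, 31, 2, 0)
events = [{"service": "checkout", "check": "http_500", "timestamp": start + timedelta(minutes=i)}
          for i in range(5)]
events.append({"service": "search", "check": "latency", "timestamp": start})

incidents = group_events(events)
print(len(events), "events ->", len(incidents), "incidents")
print(f"noise reduction: {noise_reduction(events, incidents):.0f}%")
```

Production-grade grouping typically goes further, for example by matching on alert content or learning from past incidents, but the effect on responders is the same: fewer pages for the same underlying problem.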
When alerts are meaningful, your teams handle fewer of them, but with more care. This limits the amount of time they need to spend away from the things they love during non-working hours and can protect against burnout and attrition.
It also means they’re able to focus on the critical issues at hand and provide excellent service to your customers. As organizations continue to focus on providing excellent customer experience in a digital world, this becomes even more important.
What does the future look like?
2020 accelerated the pace of digital transformation for many companies, and that pace won’t slow down now. Companies need to be prepared for this level of digital reliance from here on out.
If you think your teams are ready for a digital operations management platform, try PagerDuty free for 14 days. If you’d like to learn more about our findings, check out the State of Digital Operations Report.