Blameless Culture Key to Addressing Outage Outrage in Australia
After the unfortunate Commonwealth Bank of Australia outage last week, the powerful Payments System Board, whose members include the chairs of the RBA and APRA, announced it would make all outage data public to prevent banks, payment schemes, and telecommunications carriers from “hiding behind” the performance statistics shared by each institution.
Leading B2B and B2C brands know that delighting their connected customers with exceptional experiences, frequent innovation, and improved services invites technical complexity and an increased risk of technical failure. As a result, outages like the one that happened last week are inevitable for every digital business trying to deliver exceptional customer experiences.
To the media, the government, and irate customers, being transparent about outages and who was responsible for them sounds like an effective way to improve accountability across service providers. However, I’m concerned that the new regulation, though well-intentioned, will be damaging in the long run. Let me explain why.
The Importance of Blamelessness
In addition to inquiries and potentially more stringent regulations, customers are demanding exceptional experiences around the clock, increasing the pressure on organizations to react and resolve incidents quickly.
My worry is that businesses may react to media and public pressure by focusing on identifying individuals to blame for outages and downtime, leading to a culture of blame and scapegoating, behaviors that will damage companies’ ability to maintain service reliability. In our experience, the companies with the cultural and organizational commitment to learn from mistakes and proactively prevent recurrence are the ones with the fewest major incidents and the highest customer satisfaction.
This makes sense: After all, people aren’t going to make fewer mistakes because they’re afraid of being blamed for them; instead, they just become better at hiding their mistakes. When individuals believe they will be blamed for human error, they’re less likely to speak up about issues that occur, no matter how small. At that point, when a major incident happens, organizations don’t have the insight they need about what has happened and is happening in their systems, further slowing down response and mitigation.
Additionally, blameless postmortems, a practice that allows teams to iteratively learn and improve from incidents, should follow every incident, not just the major ones. There are two reasons for this:
- First, minor incidents are often harbingers of more extensive failures, and by learning what went wrong during minor incidents, organizations may be able to minimize or even prevent a major incident in the future.
- Second, making postmortems a routine part of the process gives both teams and management more opportunities to practice running them effectively and instills the importance of learning from incidents.
Not only that, but a strong postmortem practice gives teams the confidence to release smaller changes more frequently rather than larger changes less frequently, which increases service reliability. This sounds counterintuitive, but DORA’s “2019 Accelerate State of DevOps Report” found that organizations making smaller, more frequent changes actually respond and restore service faster when incidents happen.
Best Practices to Reduce Downtime and Outages
So how can organizations better understand and learn from outages? The first step is to accept that preventing all incidents is impossible. With today’s complex systems, failures will happen; it’s an unfortunate fact of digital business. For instance, an organization can have a plan for what to do if a known failure occurs (e.g., restarting the Apache web server, as sketched below), but it cannot plan for everything that might happen in a cascading failure.
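To make that runbook example concrete, here is a minimal sketch of what such a planned remediation step might look like as automation. It assumes a Linux host where Apache runs as the systemd unit "apache2" and serves a local health endpoint; the URL and service name are illustrative, not prescriptive:

```python
#!/usr/bin/env python3
"""Minimal runbook sketch: restart Apache if its health check fails.

Assumptions (illustrative, adjust for your environment):
  - Apache runs as the systemd unit "apache2" (often "httpd" on other distros)
  - A health endpoint is served at http://localhost/healthz (hypothetical)
"""
import subprocess
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost/healthz"  # hypothetical health endpoint
SERVICE = "apache2"                      # hypothetical systemd unit name


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    if not is_healthy(HEALTH_URL):
        # The one failure mode we predicted: the web server is down, restart it.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
```

Note what this script illustrates: it handles the single failure you anticipated, and nothing else. It cannot help with the cascading failures you didn’t predict, which is exactly why prevention alone is not a strategy.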
The next step is to reframe how we think about outages. Instead of asking “How do we ensure that we have no incidents at all?” ask “How can we increase the ability of our systems to adapt to inevitable incidents?” The focus should be on discovering and enhancing the adaptive capacity of our systems, that is, their capacity to adapt when the environment they operate in changes.
Which brings me to my next point: people are a key part of any system, and they are where most of the capacity to respond resides.
People: Your First Line of Defense
Incident response requires the creativity and brainpower of humans, informed by the current state of the systems, not what we thought might happen two months ago when the runbook was written. During a major incident, it’s crucial to be able to marshal the right people quickly and give them the tools, information, and authority to act to restore service.
Additionally, I want to emphasize that the focus during an incident is to rapidly restore service—incidents are not the time to identify the cause or fix underlying problems. An effective incident response process, which enables practitioners to work together to best restore service, is essential. These processes do not need to be elaborate or cumbersome; in fact, the more straightforward the process, the easier it is to do it well every time. PagerDuty has published our own incident response process at https://response.pagerduty.com, and we encourage teams to adapt this process for their individual needs.
In summary, the key takeaway for organizations looking to improve their incident response process is to develop a three-step approach:
- Institute a practice for learning from incidents. Have a transparent, well-understood process for blameless postmortems, and work to build trust in your practitioners by ensuring you have a blameless culture. Management needs to provide assurance that incidents will not result in punishment or sanctions. Remove the phrase “human error” from your vocabulary. For example, a person may have accidentally deleted a critical file; the issue is not that the person performed that action, but that the system or process allowed it to happen (see the sketch after this list).
- Evaluate your incident response process. Do you engage the right responders as quickly as possible? Are they given the information and resources they need to restore service? Do you have a clear method for decision making and communication? Do you have a mechanism for keeping stakeholders informed?
- Continue to adapt and improve your incident response and learning process as you go. Part of the post-incident retrospective should focus on the incident response process itself: How can it be improved? Where did it work well?
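As an illustration of fixing the system rather than blaming the person, here is a minimal sketch of a guardrail that turns an irreversible hard delete into a recoverable soft delete. The trash directory and protected paths are hypothetical; the point is the pattern, not the specifics:

```python
#!/usr/bin/env python3
"""Guardrail sketch: make 'accidentally deleting a critical file' recoverable.

Instead of removing files outright, refuse deletes under protected paths and
move everything else into a timestamped trash directory so any deletion can
be undone. All paths below are hypothetical examples.
"""
import shutil
from datetime import datetime
from pathlib import Path

TRASH_DIR = Path("/var/trash")                            # hypothetical recovery location
PROTECTED = {Path("/etc"), Path("/var/lib/production")}   # hypothetical critical paths


def safe_delete(target: Path) -> Path:
    """Soft-delete a file: refuse protected paths, otherwise move it to trash."""
    target = target.resolve()
    if any(p in PROTECTED for p in (target, *target.parents)):
        raise PermissionError(f"{target} is protected; deletion requires review")
    TRASH_DIR.mkdir(parents=True, exist_ok=True)
    stamped = TRASH_DIR / f"{target.name}.{datetime.now():%Y%m%dT%H%M%S}"
    shutil.move(str(target), str(stamped))  # recoverable, unlike a hard delete
    return stamped
```

The person’s action is unchanged; the system simply no longer allows a single keystroke to become an unrecoverable incident.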
Following these three steps can help ensure your business is best positioned to manage outages and grow from them. And by continuing to learn and adapt, you can build more reliable, robust systems and increase customer satisfaction.
Interested in learning more about best practices to prevent outages and how PagerDuty can help? Sign up for a 14-day free trial.