10 Years of Failure Friday at PagerDuty: Fostering Resilience, Learning and Reliability
In today’s fast-paced and ever-evolving world of technology, failure is inevitable. Organizations should embrace failure as a learning opportunity for how to build and deliver more resilient services. At PagerDuty, we’ve practiced Failure Friday for 10 years now. Failure Friday–a practice inspired by the chaos engineering space–involves intentionally injecting failures into our systems to improve reliability and foster a proactive engineering culture.
We’ve interviewed Stevenson Jean-Pierre (SJP), a Senior Engineering Manager, and Mandi Walls, a DevOps Advocate, to help us understand the Failure Friday practice at PagerDuty and how the practice is adopted across the industry.
The origins and evolution of Failure Friday
On June 28, we marked ten years of Failure Friday at PagerDuty. What started out as a weekly practice quickly evolved beyond that. In the spirit of continuous learning, Failure Friday has kept changing over the decade as the teams involved have learned and grown.
SJP manages database reliability and infrastructure teams. For Failure Friday process leaders like him, this initiative serves multiple purposes. It deepens engineers’ understanding of systems, allows teams to get creative while testing failure scenarios, and enables controlled production changes with key stakeholders involved. As SJP explains, “Instead of waiting for things to fail in natural ways in our environments, we produce failure scenarios in our own infrastructure to better understand how they work.”
While the core concept of Failure Friday remains consistent, SJP and his teams have evolved their approach over time. Initially, automated failure testing–randomly disrupting parts of the infrastructure or stopping services–was common. Now, the focus has shifted towards intentional failure testing, targeting areas like load and performance. This deliberate shift allows teams to gain actionable insights into specific failure modes and bottlenecks, and to optimize their systems more effectively.
Additionally, Failure Friday is no longer limited to a specific day of the week and occurs based on the team’s needs. SJP emphasizes that failure can happen any day, and embraces this concept beyond Fridays, calling it “Failure Any Day.”
So how does it work?
The Failure Friday process at PagerDuty involves planning and executing failure scenarios, with specific objectives and hypotheses. SJP and his team identify the system to be tested, define the failure scenario, and select the stakeholders involved. An Incident Commander (IC) leads the process, ensuring it mirrors the response during a real incident. The process is documented and analyzed, allowing stakeholders to provide their perspectives. A postmortem is conducted to identify areas for improvement, with its scope depending on the scenario being tested. If there are enough surprises or lessons learned to justify more detail and rigor, the team gives the postmortem a more formal treatment or invites a wider audience. Otherwise, the review is conducted by the team that owned the Failure Friday.
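As a rough illustration of the planning steps above, an exercise could be captured as a small record holding the system, scenario, hypothesis, IC and stakeholders, with observations logged as the test runs. This is only a sketch: the class names, fields and the surprise threshold are hypothetical choices for illustration, not PagerDuty's actual tooling or policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Observation:
    note: str
    surprising: bool = False  # True when the result contradicted the hypothesis


@dataclass
class FailureExercise:
    """Hypothetical plan for one failure exercise (all field names are assumptions)."""
    system: str
    scenario: str
    hypothesis: str
    incident_commander: str
    stakeholders: List[str] = field(default_factory=list)
    observations: List[Observation] = field(default_factory=list)

    def record(self, note: str, surprising: bool = False) -> None:
        # Document what happened so stakeholders can review it afterwards.
        self.observations.append(Observation(note, surprising))

    def surprise_count(self) -> int:
        return sum(1 for o in self.observations if o.surprising)

    def review_scope(self, surprise_threshold: int = 2) -> str:
        # Assumed rule of thumb: enough surprises means a more formal review.
        if self.surprise_count() >= surprise_threshold:
            return "formal postmortem, wider audience"
        return "team-level review"


exercise = FailureExercise(
    system="example-db-cluster",
    scenario="stop the primary node during peak write load",
    hypothesis="a replica is promoted and writes resume without manual action",
    incident_commander="on-call IC",
    stakeholders=["database team", "support"],
)
exercise.record("replica promoted as expected")
exercise.record("a dependent service kept stale connections", surprising=True)
print(exercise.review_scope())
```

Writing the hypothesis down before the test makes it easier to tell afterwards which observations were genuine surprises, which is what drives the scope of the review.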
For SJP, one key takeaway from Failure Friday is the realization of how complex digital infrastructure has become with all of its interconnected data and systems. Emergent behaviors can surprise even experienced engineers. By inducing failures, his teams gain a more holistic understanding of the system and uncover hidden dependencies. This knowledge equips SJP’s teams to better handle real-world incidents and prevent customer impact. It empowers them to design more robust software and ultimately enhance system reliability.
SJP mentioned that Failure Friday also builds trust within the organization. The engineering teams have successfully built a culture of reliability, where a proactive approach to failure is embraced and valued.
Fostering a culture of innovation and learning in DevOps with Failure Friday
Failure Friday is a practice that has gained popularity in the DevOps community. It’s used as a means to foster innovation, enhance system reliability, and improve the customer experience. Mandi shed light on the significance of Failure Friday in the DevOps sphere and its impact on organizational culture.
For her, the clear focus of Failure Friday is the customer experience. Mandi emphasized the need for graceful error handling, clear communication to users, and fewer disruptive errors that result in poor customer experiences. In her view, simple actions can significantly improve customer experience: “Do you have more graceful handling of certain errors? Do you pop up a nice message to the user? Or do you just report a 503 error? That’s not a very good customer experience.”
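The contrast Mandi draws between a bare 503 and a friendly message can be sketched in a few lines. The example below is a framework-free illustration, not code from PagerDuty: the exception type, the handler function, the message text and the retry interval are all hypothetical.

```python
from typing import Callable, Dict


class UpstreamUnavailable(Exception):
    """Hypothetical error raised when a dependency cannot be reached."""


FRIENDLY_MESSAGE = (
    "We are having trouble loading this page right now. "
    "Please try again in a few minutes."
)


def handle_request(fetch_data: Callable[[], str]) -> Dict[str, object]:
    """Return a response dict, degrading gracefully if the dependency fails."""
    try:
        return {"status": 200, "body": fetch_data()}
    except UpstreamUnavailable:
        # The status code still tells machines the service is unavailable,
        # while the body gives the user a clear, human-readable explanation.
        return {
            "status": 503,
            "body": FRIENDLY_MESSAGE,
            "retry_after_seconds": 60,  # assumed value for illustration
        }


def broken_dependency() -> str:
    # Simulates the failure a test exercise might inject.
    raise UpstreamUnavailable("database unreachable")


print(handle_request(broken_dependency)["body"])
```

Injecting the failure in a controlled exercise is one way to check which of these two experiences users actually get.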
According to Mandi, implementing Failure Friday or similar practices may come with benefits and challenges for organizations. These practices foster collaboration among various stakeholders, including engineers, product managers and business owners. Integrating Failure Friday into DevOps processes promotes better alignment and understanding of failure impacts. Additionally, Failure Friday contributes to developing a positive organizational culture by creating a low-stakes environment for open discussions and learning. This encourages psychological safety and a blame-free atmosphere, facilitating honest conversations and a proactive approach to system resilience.
However, challenges may arise in introducing intentional errors into production systems due to stability concerns and limited testing capabilities. Tools and services can help mitigate these concerns, making testing in production safer and more accessible. Fostering collaboration between stakeholders requires effective communication and coordination, often necessitating adjustments to existing workflows and structures. Moreover, embracing Failure Friday and cultivating a blame-free environment may require a shift in organizational culture, which can be challenging but is essential for the success of such initiatives.
Ultimately, for Mandi, Failure Friday positively impacts team collaboration and communication within an organization. The practice encourages teams to engage in honest discussions, enhances trust, and fosters a proactive approach to system resilience and customer satisfaction. At the end of the day, investing in building resilience will pay off in better digital experiences for your customers.
How to start Failure Friday in your organization
For those interested in experimenting with Failure Friday, SJP suggests starting small and gradually scaling up. Mandi’s top suggestion is for organizations to prioritize building psychological safety and creating blame-free environments. As she says, “Failure Friday is not just a practice; it’s an opportunity to foster a culture of collaboration and resilience.”
SJP strongly believes that “other teams and organizations can benefit from adopting a similar approach.” Recognizing that failures are inevitable, he emphasizes the value of understanding systems comprehensively and adopting a failure-mode mindset. Whether running full Failure Friday exercises or starting with tabletop exercises, organizations can enhance their engineering practices and cultivate a culture of resilience.
If you’re eager to learn more and join the discussion…
Don’t miss out on a chance to learn from industry experts and discover how Failure Friday can revolutionize your organization and DevOps practices. Watch the recorded Twitch stream and be part of the dialogue in the comments. Get ready to embrace failure as a pathway to success in the world of technology!