Comments on: Failure Friday: How We Ensure PagerDuty is Always Reliable

By: vladfr

vladfr — Fri, 11 Apr 2014 07:23:00 +0000

So… all of this. In. Production?

In all seriousness, I do appreciate what you are doing. Your stack and client code must be incredibly resilient if you can do this every week without much functional impact to the app. My reaction probably means that we’re far from this.

Did you get a chance to use the Simians? Your tests seem simple to implement, yet effective, and also give a head start in debugging.

By: dougbarth

dougbarth — Wed, 05 Mar 2014 20:58:00 +0000

In reply to Sunny.

Hi Sunny,

Isn’t it obvious? Failure and Friday start with an F. 🙂

More seriously, Fridays tend to naturally have fewer deploys, so running our tests that day doesn’t block the rest of the company. Also, the failures we’re inducing are failures the system should handle, and once we remove a failure, we’re back to a working state.

By: Sunny

Sunny — Wed, 05 Mar 2014 20:54:00 +0000

Why a Friday? I wouldn’t want to potentially risk ruining people’s weekends….

By: Joshua Buss

Joshua Buss — Thu, 05 Dec 2013 19:43:00 +0000

In reply to dougbarth.

That’s really cool. In addition to starting to use PagerDuty ourselves, we also recently switched to Percona’s XtraDB Cluster at BrightTag. Great minds must still think alike! Nice writeup, Doug – you’ve got us thinking about incorporating Failure Friday here as well.

By: dougbarth

dougbarth — Mon, 25 Nov 2013 16:50:00 +0000

In reply to Emre Hasegeli.

Hi Emre,

I respectfully disagree. MySQL certainly has its share of silly shortcomings, but those are known issues that have common workarounds. Both Google and Facebook use MySQL in their infrastructure and both are highly reliable.

To improve the reliability of our own MySQL setup, we’ve recently switched to Percona’s XtraDB Cluster software. We now have a 3-node synchronous multi-master cluster for our application tier. In the near future, we will move to a multi-master model using that clustering software so cluster nodes can come and go with no impact on the application.

By: Emre Hasegeli

Emre Hasegeli — Sun, 24 Nov 2013 15:27:00 +0000

I saw that you are using MySQL. You have no chance to be reliable with it.

By: dougbarth

dougbarth — Thu, 21 Nov 2013 05:00:00 +0000

In reply to HM.

Yes. We have our alerting split into two groups. Most issues are delivered to us by PagerDuty. Critical issues (like the system being down or unable to deliver events) use an external monitoring system which phones everyone directly.

By: HM

HM — Thu, 21 Nov 2013 04:44:00 +0000

Does PagerDuty use PagerDuty?

By: Arup Chakrabarti

Arup Chakrabarti — Thu, 21 Nov 2013 01:11:00 +0000

In reply to Ryan Williams.

Hi Ryan,

The primary reasoning was that we did not want a scheduled job to interfere with the attacks, e.g. having a Chef run override our firewall rules when we are doing the network isolation or having a Cassandra repair run in the middle of taking the service down. For cron job failures, we have pretty limited monitoring in place right now, but we are actively looking to implement something like Proby for our purposes.

By: Ryan Williams

Ryan Williams — Wed, 20 Nov 2013 21:00:00 +0000

Could you explain the reasoning for disabling the cron jobs for the duration of the simulated failure? I’ve thought of a number of possibilities for how they might cause more problems, but I don’t know your setup so it’d be nice to hear more about them. Then my follow-on question is: how do you test failure modes of the cron jobs?