Comments on: Failure Friday: How We Ensure PagerDuty is Always Reliable
https://www.pagerduty.com/blog/failure-friday-at-pagerduty/
Build It | Ship It | Own It
Fri, 11 Apr 2014 07:23:00 +0000

By: vladfr https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-553 Fri, 11 Apr 2014 07:23:00 +0000 /?p=2955#comment-553 So… all of this. In. Production?

In all seriousness, I do appreciate what you are doing. Your stack and client code must be incredibly resilient if you can do this every week without much functional impact to the app. My reaction probably means that we’re far from this.

Did you get a chance to use the Simians? Your tests seem simple to implement, yet effective, and also give a head start in debugging.

By: dougbarth https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-552 Wed, 05 Mar 2014 20:58:00 +0000 /?p=2955#comment-552 In reply to Sunny.

Hi Sunny,

Isn’t it obvious? Failure and Friday start with an F. 🙂

More seriously, Fridays tend to naturally have fewer deploys, so we aren’t blocking the company doing testing on that day. Also, the failures we’re inducing are failures the system should handle, and once we remove that failure, we’re back to a working state.

By: Sunny https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-551 Wed, 05 Mar 2014 20:54:00 +0000 /?p=2955#comment-551 Why a Friday? I wouldn’t want to potentially risk ruining people’s weekends….

By: Joshua Buss https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-550 Thu, 05 Dec 2013 19:43:00 +0000 /?p=2955#comment-550 In reply to dougbarth.

That’s really cool. In addition to starting to use PagerDuty ourselves, we also recently switched to Percona’s XtraDB Cluster at BrightTag. Great minds must still think alike! Nice writeup, Doug – you’ve got us thinking about incorporating Failure Friday here as well.

By: dougbarth https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-549 Mon, 25 Nov 2013 16:50:00 +0000 /?p=2955#comment-549 In reply to Emre Hasegeli.

Hi Emre,

I respectfully disagree. MySQL certainly has its share of silly shortcomings but those are known issues that have common workarounds. Both Google and Facebook use MySQL in their infrastructure and both are highly reliable.

To improve the reliability of our MySQL setup, we’ve recently switched to Percona’s XtraDB Cluster software. We now have a 3-node synchronous replication cluster for our application tier. In the near future, we will move to a multi-master model using that clustering software so cluster nodes can come and go with no impact to the application.

By: Emre Hasegeli https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-548 Sun, 24 Nov 2013 15:27:00 +0000 /?p=2955#comment-548 I saw that you are using MySQL. You have no chance to be reliable with it.

By: dougbarth https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-547 Thu, 21 Nov 2013 05:00:00 +0000 /?p=2955#comment-547 In reply to HM.

Yes. We have our alerting split into two groups. Most issues are delivered to us by PagerDuty. Critical issues (like the system being down or unable to deliver events) use an external monitoring system which phones everyone directly.
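
The two-group split described above can be illustrated with a minimal sketch. This is not PagerDuty's actual routing code; the `Severity` levels, the `Alert` type, and the channel names are all hypothetical, chosen only to show the idea that alerts about the alerting system itself must travel through an independent path.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NORMAL = "normal"
    CRITICAL = "critical"  # e.g. the system is down or cannot deliver events


@dataclass
class Alert:
    summary: str
    severity: Severity


def route(alert: Alert) -> str:
    """Return the (hypothetical) channel name that should deliver this alert."""
    # Critical alerts cannot rely on the system they report on,
    # so they go to an independent external monitor that phones everyone.
    if alert.severity is Severity.CRITICAL:
        return "external-monitor"
    # Everything else is delivered through the normal path.
    return "pagerduty"
```

The design point is that the critical path shares no dependency with the service being monitored.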

By: HM https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-546 Thu, 21 Nov 2013 04:44:00 +0000 /?p=2955#comment-546 Does PagerDuty use PagerDuty?

By: Arup Chakrabarti https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-545 Thu, 21 Nov 2013 01:11:00 +0000 /?p=2955#comment-545 In reply to Ryan Williams.

Hi Ryan, the primary reasoning was that we did not want a scheduled job to interfere with the attacks, e.g. having a Chef run override our firewall rules when we are doing the network isolation or having a Cassandra repair run in the middle of taking the service down.
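
One way to picture this is a pause flag that scheduled jobs check before they run. The sketch below is an assumption for illustration only: the flag-file approach and the names `scheduled_jobs_paused` and `run_scheduled_job` are hypothetical, not how PagerDuty disables its cron jobs.

```python
import contextlib
from pathlib import Path
from typing import Callable, Iterator


@contextlib.contextmanager
def scheduled_jobs_paused(flag: Path) -> Iterator[None]:
    """Create a pause flag for the duration of a failure exercise."""
    flag.touch()
    try:
        yield
    finally:
        # Always remove the flag, even if the exercise raises,
        # so scheduled jobs resume afterwards.
        flag.unlink(missing_ok=True)


def run_scheduled_job(flag: Path, job: Callable[[], None]) -> bool:
    """Run the job unless the pause flag exists; report whether it ran."""
    if flag.exists():
        return False
    job()
    return True
```

Wrapping the exercise in `with scheduled_jobs_paused(flag):` keeps a config-management run or a repair job from starting in the middle of the induced failure.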

For cron job failures, we have pretty limited monitoring in place right now, but we are actively looking to implement something like Proby for our purposes.
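
A common approach to monitoring cron jobs is a heartbeat check: each job records when it last succeeded, and a separate checker flags any job whose last success is older than its expected interval. The sketch below shows that general technique only; it is not Proby's API, and the function name, parameters, and grace period are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def overdue_jobs(
    last_success: Dict[str, datetime],
    expected_interval: Dict[str, timedelta],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return the names of jobs that have not succeeded recently enough."""
    overdue = []
    for name, interval in expected_interval.items():
        last = last_success.get(name)
        # A job that has never reported, or whose last success is older
        # than its interval plus the grace period, counts as overdue.
        if last is None or now - last > interval + grace:
            overdue.append(name)
    return sorted(overdue)
```

A check like this catches jobs that silently stop running, which per-run error alerts alone would miss.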

By: Ryan Williams https://www.pagerduty.com/blog/failure-friday-at-pagerduty/#comment-544 Wed, 20 Nov 2013 21:00:00 +0000 /?p=2955#comment-544 Could you explain the reasoning for disabling the cron jobs for the duration of the simulated failure? I’ve thought of a number of possibilities for how they might cause more problems, but I don’t know your setup so it’d be nice to hear more about them. Then my follow-on question is: how do you test failure modes of the cron jobs?
