PagerDuty Engineering Blog | Technical Topics From Our Eng Team
https://www.pagerduty.com/eng/
Build It | Ship It | Own It

How to get buy-in for on-call and design for humans
by Quintessence Anx | September 8, 2021 | https://www.pagerduty.com/eng/buy-in-on-call-design-for-humans/

On-call is about more than just reducing mean time to acknowledge and mean time to resolve (MTTA and MTTR, respectively); it's about improving the human experience on your teams. That might seem odd; after all, doesn't a shift to on-call usually mean teams start working unfamiliar hours? Possibly even outside the work day and on weekends?

It’s true that being on-call can mean changing hours, but it also means shifting workflows from a difficult or “frictionful” state to something easier to use and understand. This will in turn make it easier for your teams to work together, solve cross-team incidents, etc.

What is on-call?

Before you get started, it is very important to decouple what you need from on-call from any preconceived notions of what on-call must be. Succinctly:

“On-call is a response model that codifies hours of coverage, ownership, and escalation.”

So what does on-call look like? It can vary from company to company, or from service to service, depending on criticality. Here are some examples of simple on-call models:

  • One person, with a backup, is on-call for 24 hours per day for a given number of days (usually one week).
  • “Follow the sun” shift model for geographically dispersed teams, where people in each timezone work their normal working hours but in combination they provide coverage for longer than an 8-hour day.
  • Shift schedules where there are two or more 8-hour shifts that in combination cover longer than an 8-hour day.
  • Business-hours only coverage for services that do not require extended coverage.

In the above examples, the first three are for services that require longer support than an 8-hour working day can provide. This is common for Tier 1 services that are critical business functions. Business-hours coverage is common for services that have same or next-business-day Service Level Agreements. Another benefit of a business-hours-only schedule is that you can create one for InterruptDuty, giving the physical or virtual "walk-ups" someone to direct their questions to while the rest of the team continues focus work.

Why you might not have on-call today

If you're reading this, it's very likely that you either don't have on-call established yet—but are looking to start—or that you've only very recently started your on-call journey. Commonly, this is due to a combination of age and size. Response expectations and patterns today are very different from those of a few years ago, let alone a decade ago. On top of that, software release cycles used to be slower and usage didn't peak in the same ways that it can now. As a result, the needs of even a tenured company would have been very different even less than a decade ago. Essentially, for new companies, a lack of on-call can mean that the business hasn't yet grown enough to necessitate it, and for established companies it can mean they grew up with a different structure.

Company age is not the only reason that a company might not have an on-call rotation. Two other important factors are company size and customer size. These usually go hand-in-hand. Smaller businesses and startups have significantly fewer employees, and their incidents don't have the same scale and impact as those at larger corporations. For their applications and services, this can mean that the promised agreement is same or next business day, so there is no out-of-hours coverage. In terms of "knowing who to contact for an incident," when a company is perhaps literally the size where everyone is working out of a garage, basement, or similar, then there simply aren't complicated logistics around contacting your teammates.

Everyone who knows everything there is to know about the company is within shouting distance, so you give a shout, or you @here in #general to all 5 people in there. But then the company grew out of the garage, out of the basement, into an office space, into several office spaces. The customer base grew. The demands grew. Suddenly, knowing who to contact and how to reach them, as well as how to meet increasing uptime demands, became problems to solve.

Design for people, with people

There are two top-level needs that must be met with on-call: the needs of the business and the needs of the people who make up the business. To put this another way, it is equally important to understand which business requirements on-call is meant to solve and what the people who will implement and staff these on-call schedules need.

Understand the what, why, and how of on-call

In a shareable brainstorming document, start to think about the "what and why" of on-call—what issues or gaps in the business you are looking to solve. Some common reasons include, but aren't limited to:

  • Improving team communication
  • Reducing response times
  • Improving response quality
  • Reducing response stress
  • Codifying ownership structure
  • Closing gaps and/or bottlenecks in workflows and processes

Next, start to dissect why what you discovered needs to be improved or changed. This will guide later discussions for what changes need to be made in order to genuinely improve the top-level concern.

A quick example to use as a guide: let's say that "reducing response and resolution time" is on the list of what to improve. Typically this is measured by MTTA and MTTR. The broader context of why to improve this is how the incidents are impacting others. Depending on the conditions of the incident, it could be inhibiting or completely blocking work that is built on top of that service by either internal or external users, i.e. colleagues and/or customers. So in this case, a complete statement with context would be "improve service reliability for internal and/or external use by reducing MTTA and MTTR."

Qualifying information for more subjective needs is equally important. When you are asking for "improved quality," the current and desired quality both need to be outlined, or else you and your teams won't know what specifically to improve. An example in this area might be looking at the postmortem process, where the process occurs but perhaps doesn't capture enough detail to learn from in the future, or the process lacks a single owner, so documentation might never be fully completed.

Once you have this level of detail, you can explore how on-call will solve what’s on the list. A necessary component of creating on-call rotations is having an ownership model, as this determines who is reached out to when there’s an incident or issue. So an example statement would be “as part of creating on-call, we will need to create and implement an ownership model, which will improve inter-team communication by documenting who to contact for a given issue.”

Another statement would be "incidents are significantly delayed by a lack of internal documentation and knowledge transfer; we will implement a knowledge-sharing source as part of our on-call implementation and update alerts to include links to the documentation in the body text." An aggregate of these solution statements will determine how you build your on-call culture to meet business needs.

Understand the what, why, and how of people

When implementing a massive change there will likely be resistance and on-call is no different. Some of this will be solved by having a clear understanding of what you hope to accomplish, how you hope to accomplish it, and working with your teams and leads to build this understanding. But even once you have all this in place, there will be concerns around the implementation and what it means for the people who are doing the work. To provide context for this next conversation, it’s important to understand what motivates people. A common framework for intrinsic motivation states that people are motivated by:

  • Autonomy
  • Mastery
  • Purpose

Breaking these down, let's look at how they impact on-call scheduling. Autonomy comes into play when discussing not only incident resolutions, but the ownership of the structure of the on-call rotation itself. Basically: if the on-call structure is unsustainable, who has the autonomy to fix it? The most authority resides in management, so management would need to gather feedback, provide tools, and empower their teams to correct issues with the on-call structure itself. What are some situations that create an unsustainable on-call?

  • Small team sizes resulting in people being on-call too often.
  • High frequency of incidents and/or long duration of those incidents.
  • Not addressing known peaks or troughs.
    • For example, a hotel chain during high travel season or retail in gift giving seasons both experience known surges in traffic at that time, as well as lulls in traffic during the “off season.” Are the same people on-call for all of the peaks? If the peak is prolonged, for weeks or more, is the schedule adjusted to accommodate?
  • Other planned work not being deferred / adjusted to account for on-call duties.
  • A small pool of responders at more senior positions, meaning a shorter rotation at higher levels of escalation.

Many of these issues are problems that teams can resolve with management approval, mostly centering around ensuring that the additional on-call duties do not result in burnout. Using the specific example of other planned work: teams can plan their overall workload around the known on-call rotation, and those on rotation could take on less time-critical work, be on InterruptDuty, or whatever else the team needs.

For mastery, IT specialists spend a lot of time learning, iterating, and improving their craft. One of the tangential benefits of on-call is that on-call teams will have greater visibility into the design decisions as well as shared ownership and workload. When development and operations specialists are on-call, they can make use of these to improve their mastery of their craft.

Specifically, that increased visibility means that they will know which decisions lead to more, or longer, incidents or create other downstream problems that might not have been visible when duties were separated. Responding and resolving incidents that occur also helps people develop more holistic views of the overall environment and plan future work that can improve design.

Purpose is a mixture of the direct issues that on-call is resolving, e.g. specific issues around response and service ownership, and how these tie into the overall purpose of the organization. This means explicitly connecting the changes in engineering to the organization as a whole. It can start with discussing how improved response time and quality build user trust and experience, as well as what reliability means for the organization itself.

Nothing is 100%: An acknowledgment

With everything in place, you will have a smoother transition and more buy-in for adoption, but there will likely still be people who will not fully want to start on-call and that is both expected and okay. There are going to be people who chose to work at your company because it didn’t have on-call, or didn’t have on-call past a certain level of seniority, and they still might want that. That doesn’t mean that the transition won’t still be able to move forward, or that things won’t get easier over time – they will.

Where to go from here?

There is a lot of reading that can help you and your teams prepare for on-call. For the rotations themselves, I recommend our Best Practices for On-Call Teams Ops Guide and the On-Call section of the Google SRE Handbook. When you start taking a look at the service level agreements, I also recommend taking a look at the Service Level Objectives section of the Google SRE Handbook to ensure that the objectives and agreements are in alignment, doable with the team and on-call structures, and that the indicators in place measure what is needed.

How We Use React Embedded Apps
by Margo Baxter | July 12, 2021 | https://www.pagerduty.com/eng/react-embedded-apps/

This is the first in a three-part series about how the PagerDuty Front-end team approaches their micro front-end architecture.

The front-end of PagerDuty's web app is client-side Javascript, using a combination of React, Ember, and Backbone frameworks. We're moving towards React, but getting out of legacy code is a slow process, and prioritization needs to be balanced carefully against new features. In the meantime, it is difficult for us to build new features on the older pages, as well as features that affect every page in the site. Embedded React apps solved this nicely for us.

A Solution for Our Global Navigation

One particularly difficult case had been our ten-year-old global navigation, which existed in multiple versions and on various pages. Bugs and inconsistencies popped up frequently since a content update required a developer to run multiple environments and change the code in multiple Javascript frameworks. Implementing new navigation across multiple frameworks would have taken longer than realistically feasible, and would have limited our ability to create a rich experience.

Before even starting to build the new navigation, we needed to come up with a way that all of our frontend codebases could consume a single React app. Web components were not an option for us due to lack of browser support. Server-side rendering would have been overly complex. Iframes are a bad idea.

We decided to go with a custom embedded React architecture. If you are working on an app with multiple frameworks that would benefit from a lightweight solution for sharing common components, I recommend you take a look at what we did and consider how you might adapt it to your use case.

Our new navigation in the PagerDuty web app. The navigation code exists as its own React repo and is embedded into each page in the site, regardless of framework. For example, the Incidents page shown here is part of an Ember.js monolith, but the navigation is embedded seamlessly into the page from an external React app.

Easy to Embed React Anywhere

A simple script tag is all that’s required to use one of our embedded React apps, which makes them easy to add to any page or framework.

We embed our navigation like this:
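
The markup itself isn't preserved in this copy of the post, but based on the description above and below it amounts to a single script tag pointing at the loader, with the target app identified by a data attribute. The host, file name, and attribute values here are illustrative, not the actual PagerDuty ones:

```html
<!-- Loads the embedded-app loader and tells it which app to mount.
     Optional data-height / data-width size the target DOM container. -->
<script
  src="https://assets.example.com/embedded/loader.js"
  data-attribute-id="global-navigation"
  data-height="48px"
  data-width="100%">
</script>
```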

We use the script tag raw just like this in the page templates in our Ember and Backbone monoliths, and we made a shared wrapper component for our full-page React apps to use. The script also takes optional height and width attributes to customize the size of the target DOM container.

The Embedded Architecture

There are two parts of the embedded architecture: a loader script and the embedded app itself.

First, the consuming page sends the embedded app ID to the loader script. The loader fetches the app manifest file, which provides it with a list of assets required for that app. Then, it injects script and link tags into the DOM for each of the manifest assets and tells the embedded app which DOM element it should render itself into. Finally, the consuming page loads the app resources and renders the React app.

The embedded app is a regular React app built with Webpack, with the addition of two global functions used for rendering and cleanup.

The Loader Script

This script is loaded into the consuming page with the script tag as described above.

First we create the div where the app will be mounted. We know the app id from the data-attribute-id attribute passed to the script tag. If width or height were specified, we set those.
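
A condensed sketch of that step, assuming the loader reads its own script tag via document.currentScript (the variable names and container id scheme are ours, not necessarily PagerDuty's):

```javascript
// Loader step 1: create the DOM element the embedded app will be mounted into.
const scriptTag = document.currentScript;          // the <script> tag that loaded this file
const appId = scriptTag.dataset.attributeId;       // from the data-attribute-id attribute

const container = document.createElement('div');
container.id = `${appId}-root`;                    // e.g. "global-navigation-root"
if (scriptTag.dataset.width)  container.style.width  = scriptTag.dataset.width;
if (scriptTag.dataset.height) container.style.height = scriptTag.dataset.height;

scriptTag.insertAdjacentElement('afterend', container);
```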

Each app's assets and manifest file are stored in AWS and keyed by the app id. The manifest file is created by Webpack at build time and contains a JSON object listing all the assets needed to run the app, including Javascript chunk files and CSS. We use a standard HTTP request to fetch the manifest file by app id. Here's what ours looks like for the navigation app.
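
The original manifest isn't reproduced here; a Webpack asset manifest for an app like this typically looks something like the following (paths and hashes are made up for illustration):

```json
{
  "runtime.js": "https://assets.example.com/global-navigation/runtime.4f1a2b.js",
  "vendors.js": "https://assets.example.com/global-navigation/vendors.9c03d7.chunk.js",
  "main.js": "https://assets.example.com/global-navigation/main.3f8a1c.chunk.js",
  "main.css": "https://assets.example.com/global-navigation/main.3f8a1c.chunk.css"
}
```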

We parse out the Javascript and CSS assets from the manifest, and load each one onto the page by injecting a <script> or <link> tag into the DOM respectively. We set an onload listener so we can track when each asset is loaded. These are all loaded asynchronously. Here's how we load each Javascript asset.
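
A minimal version of that Javascript loading step (the helper name and the Promise-based signaling are ours; the key parts from the description above are the injected tag and the onload listener):

```javascript
// Inject a <script> tag for one manifest asset and resolve once it has loaded.
function loadScript(src) {
  return new Promise((resolve, reject) => {
    const tag = document.createElement('script');
    tag.src = src;
    tag.async = true;                                // assets load asynchronously
    tag.onload = () => resolve(src);                 // track when each asset is ready
    tag.onerror = () => reject(new Error(`Failed to load ${src}`));
    document.head.appendChild(tag);
  });
}

// Once every asset has loaded, the loader can call the app's global runner function:
// Promise.all(jsAssets.map(loadScript)).then(() => runEmbeddedApp(appId));
```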

 

After the assets are loaded, we are ready to mount the embedded React app. The loader script calls a function named <app-id>_runner that is defined globally within the embedded app. If the function doesn’t exist, we throw an error and the app will fail to load. The user will see an empty placeholder on the page.

The loader script also defines an event listener to do tasks related to unloading. The listener calls a function named <app-id>_destroyer that is defined globally within the embedded app. The destroyer function should unmount the app and do any additional cleanup needed to prevent memory leaks.

At that point, the loader script has done its job and the embedded app takes it from there.

The Embedded App

Any React app can be adapted to work with our loader script by adding a few bits of code. In fact, any app in any framework can be adapted to work with our loader script as long as it creates a Webpack manifest file and defines the global runner and destroyer functions.

For React apps, these functions go in the index.js file. The runner calls ReactDOM.render and the destroyer calls ReactDOM.unmountComponentAtNode.
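
As a sketch, an index.js for an app with the id "global-navigation" might look like the following. The exact argument passed to the runner (here, the id of the container element) is an assumption on our part; the global function names follow the <app-id>_runner / <app-id>_destroyer convention described above, and ReactDOM.render / unmountComponentAtNode are the pre-React-18 APIs named in the post:

```javascript
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App';

// Called by the loader script once all assets are on the page.
window['global-navigation_runner'] = (containerId) => {
  ReactDOM.render(<App />, document.getElementById(containerId));
};

// Called on unload to unmount the app and clean up, preventing memory leaks.
window['global-navigation_destroyer'] = (containerId) => {
  ReactDOM.unmountComponentAtNode(document.getElementById(containerId));
};
```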

If you wish to allow multiple embedded apps on one page, you need to add a custom jsonpFunction prefix to the Webpack config to create a unique code namespace and avoid code collisions.
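
With Webpack 4 (which exposes output.jsonpFunction), that looks roughly like this; the function name itself is arbitrary as long as it is unique per embedded app:

```javascript
// webpack.config.js -- give each embedded app its own JSONP callback name so that
// two embedded apps on the same page don't overwrite each other's chunk registry.
module.exports = {
  // ...the rest of the app's Webpack config
  output: {
    jsonpFunction: 'webpackJsonp_globalNavigation',
  },
};
```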

Note: If you use create-react-app like we do, you can install rescripts to make changes to your Webpack config without ejecting.

The setup on the embedded apps is fairly minimal, but nonetheless we made an embedded app template for our developers so they don’t have to worry about how the plumbing works and can get straight to work building features.

After initial launch, we made our navigation responsive by adding a vertical option for small screens. Updates to the navigation app flow through to every page in the site immediately when the code is deployed, so we can focus our time on building new features rather than how we will deliver the changes.

A Few Minor Downsides

There are a few small downsides to be aware of with this approach. Since the loader script loads first, then loads the embedded manifest, and then loads the embedded assets, there is a bit of a lag in the time to render. The performance may or may not be a problem—depending on your use case. For the PagerDuty product in particular, the embedded apps still generally render within the load time of the rest of our pages, so it's acceptable as an interim solution.

We also periodically see CSS conflicts occur on pages that use embedded apps because the CSS on either the consuming page or in the embedded app is not sufficiently encapsulated. Modular CSS has helped us solve this problem for the most part, but we still occasionally have cases here and there where we need to resort to an override with !important.

Faster Development and Innovation

We were able to launch a new and modern responsive global navigation on every page in the site from a single React codebase. Developers are happy because it’s so much easier to make updates now, and they don’t have to touch legacy code. Our Product and UX folks are happy because we can make our newest features discoverable from day-one, and the content structure reflects the state of the product today. We were even able to run an A/B test before launching to confirm a positive user experience with our new information architecture. That wouldn’t have been possible with multiple copies of the code. Our navigation is now ready to grow and change right along with our product.

Broad Usage During Legacy Transition

We saw teams adopting the embedded architecture for uses other than the navigation before it was even fully ready for production use. Our product delivery teams are often asked to build new features on legacy pages. They don’t want to work in the legacy code and we don’t want them to create more legacy code, so giving developers the ability to work in React via embedded apps has solved this problem. It has also enabled significantly faster development, stopped the proliferation of tech debt, and gotten all of our product delivery teams working in React. Many of the features that have been released in the past year take advantage of embedded React.

While originally intended as a way to reduce our global navigation code to a single codebase, embedded apps are allowing us to modernize, innovate, and build front-end faster amidst a complicated transition from legacy monoliths to React microservices. Win-win.

To learn more about PagerDuty's Front-end architecture and infrastructure, stay tuned for the next blog in this series covering how we manage our fleet of micro front-ends.

How Lookups for Non-Existent Files Can Lead to Host Unresponsiveness
by Tamim Khan | May 27, 2021 | https://www.pagerduty.com/eng/lookups-non-existent-files-kafka/

Late last year, we had an interesting problem occur with the Kafka clusters in our staging environment. Random hosts across several clusters started experiencing events where they would become unresponsive for tens of seconds. These periods of unresponsiveness led to client connectivity and throughput issues with services using Kafka, in addition to under-replicated partitions and leader elections within Kafka.

Since the issue started with only a handful of machines, we initially suspected degraded hardware, so we provisioned new machines to replace the problematic ones. This fixed the problem for the next couple of days, but eventually the newly provisioned machines also started to experience lockups.

Eventually, we discovered that the culprit responsible for these issues was third-party software that we were trialing at the time. When we initially began our investigation, we suspected this software since its rollout was somewhat aligned with the start of problems. However, our suspicion completely hinged on this correlation and wasn’t supported by any further evidence. The process monitoring tasks that this software was responsible for didn’t seem like they could lead to the system-wide machine unresponsiveness that we were experiencing.

So how were we able to trace this problem back to this tool and what was it doing that was leading our machines to become unresponsive?

Consistently Reproducing the Problem

Since the problem was occurring randomly, our first challenge was being on the right machine at the right time when it was occurring so we could investigate in real time. This is where we caught a lucky break: while removing unnecessary dependencies from a long-running host, we noticed that systemctl --system daemon-reload could also reproduce the issue.

When we ran this command, our terminal locked up and our graphs looked identical to when the issue was occurring “randomly.” Running this command on a newly provisioned host had no such impact and returned instantly without any noticeable change in our graphs. A quick top showed us that systemctl was not really the culprit either but rather systemd itself, which was using most of the CPU at the time.

Using strace and perf to Find a Source

Since our problem was related to the network and all systemctl --system daemon-reload is supposed to do is reload all the unit files known to systemd, we figured systemd’s interaction with the kernel was probably the source of our problem. As a result, we ran a strace on the systemd process in order to get an idea of the system calls it was making and how long they were taking:
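
The original trace output isn't reproduced here, but an invocation along these lines captures the relevant information (on a systemd host, systemd runs as PID 1; -f follows child processes and -T prints the time spent in each syscall):

```bash
strace -f -T -p 1
```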

The strace output showed all the syscalls being made by systemd, the arguments to those syscalls, the return code of each syscall and, most importantly, how long each syscall took to execute. What immediately jumped out to us was that inotify_add_watch—specifically those calls on the / directory—seemed to take orders of magnitude longer than other inotify_add_watch calls. When we compared this to a machine unaffected by this issue, inotify_add_watch calls on the / directory were not special and took just as long as other directories.

While at this point we could have tried looking at the kernel source, there is a lot of code that can be called by inotify_add_watch. As a result, in order to narrow down our search further, we ran perf top to figure out where inotify_add_watch was spending most of its time.

The perf output showed the kernel spending a lot of its time (55.08%) in __fsnotify_update_child_dentry_flags, a function that updates flags on various dentry objects. Before we go any further, let's quickly talk about what these are and how they are used in the kernel.

dentry Objects and the dentry Cache

A directory entry or dentry is an object that makes it easier for the kernel to do path lookups. Effectively a dentry is created for every single component in a path. For example, the path /usr/bin/top would have a dentry for /, usr, bin and top. dentry objects are dynamically created by the kernel as files are accessed and don’t correspond to actual objects being stored on the filesystem. However, they do have a pointer to the structure that represents these file system objects (ie: an inode).

There are 3 possible states for a dentry: used, unused, and negative. A used dentry is one that points to a valid inode that is actively being used by something. A dentry object in the used state cannot be deleted regardless of memory pressure. An unused dentry also points to a valid inode but it is not actively being used by anything. These unused objects will be freed by the kernel if there is enough memory pressure but it will generally try to keep them around in case it ends up being used again. Finally, a negative dentry is one that doesn’t point to a valid inode. Negative dentry objects are created when lookups are made for non-existent paths or a file is deleted. They are kept around mainly to allow the lookups of non-existent paths to be faster but just like unused dentry objects these will be freed if there is enough memory pressure.

Since computing a dentry takes a non-trivial amount of work, the kernel caches dentry objects in a dentry cache. Among other things, the cache contains a hash table and hash functions that allow for the quick lookup of a dentry when given a full path. How this cache works is that when you access a file for the first time the kernel traverses the full path using the file system and puts the computed dentry objects for each component in the path in the dentry cache. When you access that file again there is no need to do that file system traversal again and you can just lookup the path using the dentry cache. This significantly speeds up the file lookup process.

Large dentry Cache

Now that we know what a dentry and the dentry cache are, let's take a look at the kernel's __fsnotify_update_child_dentry_flags function to try to understand why the manipulation of dentry objects is leading to lockup issues.

In this function, we are given an inode (ie: the / directory) and we go through the dentry objects for the given inode and all its subdirectories and change flags on them to indicate that we are interested in being notified when they change.

One important thing to call out here is that while we are going over all dentry objects associated with /, this function takes a lock on both the inode and the dentry associated with /. Various filesystem-related syscalls like stat (which needs to traverse dentry objects starting at /), read/write (which need to notify watchers on all parents of a given path, including /, when a file is accessed or modified) and many others also often need to take one of these locks in order to run to completion. Normally, this is not a problem since this function takes on the order of microseconds to run, but if it doesn't, these syscalls can hang for long periods of time.

Another important thing to note is that attempting to take a spin_lock disables preemption on the CPU core running the syscall. This means that core cannot run anything other than interrupt processing until the syscall completes (ie: the syscall cannot be “preempted”). This is done mainly for performance reasons since otherwise a lock acquiring syscall could be preempted by something that also requires the same lock and that syscall would just waste CPU waiting for a lock that cannot be released. Normally, this is not a problem since kernel locks are supposed to be held for very short periods of time. But if kernel locks are held for a long time this could lead to a situation where several syscalls, all of which have prevented other things from running on their CPU core (ie: disabled preemption), prevent anything else from running on a machine for long periods of time as they wait for a lock to be released.

So now we have some reasons why this function taking a long time to run is problematic, but why is it taking a long time to run in the first place? Since the number of files on the filesystem between an unaffected and an affected machine was effectively identical, it didn't really make much sense for this function to take so much longer unless there were significantly more entries being tracked by the problematic machines or there was some sort of contention issue. In order to rule out the first case, we decided to use the slabtop command to determine how many dentry objects were actually being stored by the kernel.

Now we are getting somewhere! The number of dentry objects on an affected machine was orders of magnitude larger than that of an unaffected machine. What's more, when we dropped the freeable dentry cache entries using echo 2 > /proc/sys/vm/drop_caches, our systemctl command returned instantly and the random machine lockups that were occurring on this machine disappeared. When we checked back in on this machine the next day, we also noticed that the number of dentry objects being tracked had grown substantially. So something was actively creating dentry objects, and this is why even newly provisioned machines would start to experience issues after some period of time.

Pinpointing the Source Application

With the information that the size of the dentry cache is the source of our problems and that some application on this machine is creating these dentry objects on a continuous basis, we can now start to instrument the kernel to try to find the source application. To achieve this we used bpftrace, a program that uses eBPF features present in more recent kernels (4.x) to trace and instrument kernel functions. We mostly chose it since it had the features we needed and was quick to get working; however, there are other alternative programs that can achieve similar results (ie: systemtap). We came up with the following script to try to find the source:
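
The script itself isn't included in this copy of the post; a simplified bpftrace sketch of the approach described below might look like the following (it omits the "/" path filter for brevity, and struct field access via kprobe arguments assumes kernel headers are available):

```
// dentry_trace.bt -- run with: bpftrace dentry_trace.bt
#include <linux/dcache.h>

// d_add(struct dentry *entry, struct inode *inode): a new dentry cache entry.
// A NULL inode means a negative dentry (a lookup for a path that doesn't exist).
kprobe:d_add
{
  $d = (struct dentry *)arg0;
  @added_by[comm] = count();
  printf("comm=%s name=%s negative=%d\n", comm, str($d->d_name.name), arg1 == 0);
}

// __dentry_kill(struct dentry *dentry): an entry being removed from the cache.
kprobe:__dentry_kill
{
  @removed_by[comm] = count();
}
```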

There are quite a few things going on here so let's break it down. Here we are instrumenting the d_add function to track additions to the dentry cache and the __dentry_kill function to track removals from the cache. In order to reduce the amount of "noise" we get from other applications, we added an if statement to only find dentry objects that would end up on the / path. Next, we count() up all the calls made by comm (ie: the process that executed the syscall that got us here) and store them in an associative array (@..._process[...]) so that they are printed at the end of the run. Finally, in order to get an idea of what sort of dentry objects are actually getting created, we print some information about each dentry as we encounter it, including the name (ie: the path in the file system this dentry represents), whether it's in a negative state, and the inode it is associated with.

After running bpftrace with our script for a couple of minutes, a few things quickly jumped out at us:

The vendor application we were trialing was creating an order of magnitude more dentry objects than the next application on the host, and it also seemed to be cleaning up only about half of them during the timeframe we were running our trace. In addition, the dentry objects it was creating had names that started with null followed by a series of random letters and numbers.

These were negative dentries, which means the files did not exist on the filesystem, and the filenames themselves were suspicious. At this point we suspected that the app had a bug that was causing it to look up effectively random paths on /. This was causing the number of negative dentry objects being cached by the kernel to balloon over time, eventually causing inotify_add_watch calls to take orders of magnitude longer while holding a lock that a wide variety of filesystem-related calls require in order to function. Since we were running several threads that were making these sorts of filesystem-related calls, they all ended up waiting for this lock while also disabling preemption on the CPU cores on which they were running, locking up the machine until inotify_add_watch had finished running.

Fixing the problem

At this point, we had enough data to go back to the vendor of the software and ask them to investigate. Sure enough, our suspicions were correct and they discovered a bug in their application where it was periodically doing lookups for invalid files on the / path. Once they fixed the bug and we deployed the new version of their software, our dentry cache growth issue disappeared and we stopped having random machine unresponsiveness events.

Getting here was no easy feat. However, we learned a lot about how to go from a generic problem (random machine unresponsiveness) to the root cause of an application doing a seemingly unrelated interaction with the kernel (creating negative dentry objects). Hopefully, you can use some of the tips here to trace your own problem down to an application's interaction with the kernel. Happy hunting!

The 4 Agile Scrum Ceremonies: Everything You Need to Know
by Derek Ralston | February 18, 2021 | https://www.pagerduty.com/eng/agile-scrum-ceremonies/

They're commonly referred to as agile ceremonies, though the Official Scrum Guide calls them events. But what are we really talking about? These are meetings, which are one important element of agile software development. "Why don't we just call them meetings?" you ask. Probably because the "m" word has a negative connotation in some organizations. Agile ceremonies don't solve all of a team's agility problems, but when done properly within the context of a healthy engineering culture and solid engineering practices, they can support effective communication within the team and organization.

Each agile ceremony has a unique objective tied to it and is organized and facilitated in a way that helps the team achieve that goal. As teams mature in their agility, they may notice that they are able to run their agile meetings more efficiently (and potentially spend less time in meetings altogether).

4 Agile Ceremonies to Know

Let’s take a look at 4 agile ceremonies that support teams in effective agile software development. Some of them come from scrum, but can be applied to lean or kanban agile teams as well.

1. Sprint Planning

Vocabulary tip: The term “sprint” comes from scrum and refers to a timeboxed period of development. Other agile frameworks call the timebox an “iteration” and the meeting “Iteration Planning.” Kanban refers to the meeting as “Replenishment.”

What is it?

A sprint planning meeting is held to decide on the work that the team will commit to for the next sprint. The team can review all work items before they commit to ensure alignment, as well as agree on the priority and sequence of the work.

What’s the objective?

The objective is for the team to commit to new work as a team for the upcoming sprint.

How often?

Scrum teams do Sprint Planning at the beginning of every sprint. Kanban teams may choose to replenish at the beginning of every iteration (if they have iterations), or when their number of tickets ready to be worked on gets low. Some kanban teams set a threshold. For example, when 3 tickets are left in their “ready to be worked on” bucket, the team knows to replenish the work by holding a meeting.

To prevent Monday fog and Friday holiday collisions, many teams choose to start and end their sprints mid-week.

How long?

For a 2-week sprint, a typical sprint planning ceremony would be 2-4 hours in length.

Who?

The Development Team, the Product Owner, and the Scrum Master

How?

During sprint planning, the team addresses the following topics:

  1. Why is this sprint valuable? Output: The team decides on a Sprint goal.
  2. What can be done this sprint? Output: The team selects items from the product backlog to include in the upcoming sprint.
  3. How will the chosen work get done? Output: The team comes up with a plan for delivering the backlog items selected for the upcoming sprint.

2. Daily Standup

Vocabulary Tip: Scrum refers to this ceremony as the “Daily Scrum.”

What is it?

A very short, 15-minute (or less) meeting with the entire team for daily coordination, self-organization, and planning. It focuses on how the team can best complete work items and unblock issues. This is not a status update. It is the most regular ceremony used to inspect and adapt.

Think of a team’s daily standup as the huddle before a team makes a play in football or rugby.

What is the game plan for today? Is everyone clear and ready? What do we need to do to make the most of the day? Ready? Ok! Go team!

What’s the objective?

The objective is for the team to check in with each other, recommit to each other, identify what is needed to move forward, and ask each other for help.

How often?

Daily. This doesn’t have to be in the morning, but the daily standup typically signifies the start of a new working day, so you will speak to the last 24 hours and the next 24 hours.

How long?

15 minutes or less.

Who?

The Development Team, the Product Owner, and the Scrum Master

How?

There are two approaches to stand-up formats:

1. The People-Centric Stand-up

This is the most popular stand-up format, especially for teams using scrum. You go around the room and ask every development team member to answer these 3 questions:

  1. What did I get done yesterday that helped the team meet the sprint goal(s)?
  2. What will I get done today to help the team meet the sprint goal(s)?
  3. Do I see any impediment/blockers that prevent me or the team from meeting the sprint goal(s)?

2. The Work-Centric Stand-up

This format is more commonly used by kanban teams. Teams will walk through their kanban board from furthest-right column down, then the next column to the left and so on.

For each card, the assignee will speak to:

  • Progress made
  • Plans for today
  • Any blockers/flag potential impediments
  • Any help that’s needed from their teammates

At the end, give other team members who don't have an item on the board a chance to share updates on what they're working on. This will differ between teams, but it is a great place to start.

Tip: Take discussion about items not displayed on the board offline, or “park” them for a later discussion if there’s time leftover.

3. Sprint Review

Vocabulary Tip: Outside of scrum, this ceremony is commonly referred to as an “Iteration Review.”

What is it?

A live demonstration (demo) where the team shows increments of work to internal and/or external customers. Here, the team gets early feedback to inform future iterations of their product/feature.

What’s the objective?

The objective is for the team to share what they’ve built and get feedback.

How often?

This is decided by the team, but it is typically held once per sprint, i.e. every 2 weeks or once per month.

How long?

This ceremony is typically 30-60 minutes in length.

Who?

The Development Team, the Product Owner, the Scrum Master, stakeholders, customers, and anyone who’d have valuable feedback on what’s been built to ensure it’s providing the intended value.

How?

  1. The Development Team demonstrates what was done during the last iteration or what is new.
  2. The team shares what goal/epic/feature the demo relates to.
  3. The team can share what went well and what challenges were encountered.
  4. Stakeholders share feedback and the team can discuss/share where to go from here and then collaborate on any changes/next steps.

4. Sprint Retrospective

Vocabulary Tip: Outside of scrum, this is simply referred to as a “Retrospective,” and in kanban, this is referred to as a “Kaizen.”

What is it?

A meeting where the team can honestly and openly reflect on the team progress, team process, and team health. It is a safe place for the team to regularly reflect and come up with specific steps to improve blockers and pain points—and also celebrate their successes.

What’s the objective?

The objective is for the team to focus on how they can improve their processes and value delivery as a team.

How often?

Typically at the end of each sprint.

How long?

For a 2-week iteration, typically 60-90 minutes.

Who?

The Development Team, the Product Owner, and the Scrum Master

How?

Tip: We have a detailed and open-sourced Retrospective Guide available here!

The meeting has a facilitator run the retro (e.g. a Scrum Master or a rotating team member who has received retrospective facilitator training) and the whole team participates.

  1. Set the stage. The team reflects on the timeline that has passed.
  2. Gather data. What do individuals feel was positive or negative, and how can we improve?
  3. Generate insights. Identify trends and decide what to focus the discussion on.
  4. Decide what to do. Brainstorm concrete actions the team can take to improve.
  5. Close. End on a positive note by sharing team appreciations and self-assigning action items.

Agile Scrum Ceremonies Aren’t Everything

Agile ceremonies facilitate communication within the team and organization, but they aren’t a panacea for a team’s agility (don’t believe the agile BS). Successful agile software development requires organizational buy-in from all levels, engagement from engineering, focus on the customer, and potentially bringing in dedicated Agile Coaches.

Share Your Learnings

Do you have other tips and learnings to share on facilitating effective agile scrum ceremonies? We’d love to hear from you in the PagerDuty Community.

Efficiently Structure Data Science Teams to Achieve Company Goals
by Mitra Goswami | February 11, 2021 | https://www.pagerduty.com/eng/data-science-team-structure/

Mitra Goswami (PagerDuty Senior Director Data Science) is a machine learning professional with experience working in Astrophysics, Media, Martech, and the Financial Services Industry. Her expertise and interests lie in building scalable data-centric teams, creating cost-effective data engineering solutions to empower data science and machine learning in organizations, and leading machine learning innovation driving business value.

Data science continues to be an evolving field that encourages diversity in educational backgrounds, team structures, and the process of software development. Given this, there are multiple industry approaches for how to efficiently structure data science teams within organizations. This is not an easy task: it requires introspection about company goals, an honest assessment of the data infrastructure's maturity, and an understanding of the nuances of data science feature development. In this blog, I will explain how to efficiently structure data science teams to deliver data science features in the product based on these key factors.

Differences in Feature Development

So what's the difference between typical feature development and data science feature development? The question is natural, and the two differ in important ways. For example, while regular feature development can be driven entirely by the engineering team with a product collaborator, data science feature development requires engineering to closely collaborate with the data science team. In most cases, a huge share of the core work is driven by the data scientists.

Another differentiator is the absence of a proper infrastructure that favors data unification, cleaning, and mining. In this scenario, the data science feature development life cycle can take almost two or three times as long in comparison to regular feature development cycles. Finally, a data science feature is often layered with complexities and has an experimental component to it. This type of feature typically requires customer feedback, A/B testing, etc., which bridges the gap between accuracy and ease-of-use for a successful adoption by customers.

In light of all these collaborations between teams, the commitments on infrastructure, and the roadmap for launching data science features into the product, my questions for leaders are:

  • What do you want to do with data science in your organization? Do you want to disrupt and innovate to make data science your core engine, or do you want sequential data science feature launches aligned to your product growth? Check out the centralized and embed models below.
  • Where are you in the data science journey? Do you have the infrastructure (end-to-end pipeline from data storage -> cleaning -> training -> deployment) to perform machine learning, or do you have an expedient solution to run models, which can be improved by an incremental implementation of a long-term vision?

Let’s dive into the centralized & embed models below.

Embed Model

In this model, data scientists are embedded into the product teams, although data scientists from different engineering teams can also report to a central data science head. In collaboration with the data science team, each engineering manager is in charge of scoping out the requirements necessary for a data science project that the engineering team has the bandwidth to support (e.g., release in production).

If you are still in the infrastructure development phase and value sequential data science feature launches aligned to the product, the embed model is the way to go; however, single-person embed models are generally not very efficient. In most organizations, there is a mix of senior and junior data scientists and engineers. I define a senior data scientist as someone who can understand business requirements, code algorithms, and describe results that create a story. Similarly, a senior data engineer is someone who can understand the algorithm and can “productionize,” keeping in mind real-time vs. weekly/monthly training, cost efficiency, and performance.

As a rule of thumb, one senior data scientist and one senior data engineer is a good team that can work together within the embed model. What we want in the embed model is a small team with all the aforementioned skill sets who are aligned to the engineering team and the product counterpart. With this approach, the embed team can develop the model in production, creating a “handshake point” to be integrated with the engineering team simultaneously, and the engineering team can work on their end towards a successful integration.

If you are keen on increasing the velocity of data science feature development and making data science an innovation hub in the future, make sure you invest in building infrastructure with data engineers simultaneously as the embed teams develop product features.

Centralized Model

In a centralized model, the data science team takes the lead, while product teams perform market research, identify big bets, and prototype the solution. Once the prototype or proof of concept is ready, the engineering team implements the prototype into the product.

If your company has the proper infrastructure, you can develop an innovation hub internally. In this scenario, a centralized model would provide you and your teams the most value. Data scientists working together as a unit—with a product counterpart collaborating to build competitive research—is a dream job for almost any data scientist.

Incremental, product-driven feature development also becomes super easy at this point. However, I have also seen the centralized model with a robust architecture struggle at some organizations. Product-driven feature development requires product managers and data scientists to work in tandem to explore the possibilities and nuances of data science, and the exercise demands clarity in the roadmap for engineering and data science to collaborate on the implementation side. This is also a critical piece to ensure the success of various data science investments.

In summary, if you would like to increase your ROI on data science investments, go “EASY:”

  • Establish infrastructure
  • Access multiple service datasets efficiently and in a unified manner
  • Socialize data science among the product managers and engineers, and incentivize them to collaborate on market research for data science features
  • Yield clarity in the organization-wide roadmap to deliver data science jointly with the engineering organization

Are you facing similar challenges in your organization? Are you able to map your location in the data science journey? Stay tuned for our next blog on data science.

Recipe for Meaningful Change
by Parsa Alipour | January 20, 2021 | https://www.pagerduty.com/eng/recipe-meaningful-change-2021/

I’ve had the privilege to be with PagerDuty since 2016, and in that time, I’ve seen a lot of change. I’ve seen the company evolve from a private startup to an enterprise that undertook a successful IPO. Being part of a team that consistently delivered projects and a company that grew quickly yet sustainably allowed me to observe and understand how meaningful change happens.

Although I’ve approached this question from an engineering perspective, I tried to keep these ideas generic; they can be applied to many parts of a business: engineering, product, management, and many others. To be able to lead in any arena, one must be capable of driving meaningful change. I believe this is only possible with a fundamental understanding of the business.

What is Meaningful Change?

It’s important to distinguish between simply driving change versus driving meaningful change. Simply driving change may not be good enough. Change can result in accidental complexity or yield short term gains at the expense of dark debt that costs the business more than the value it produced later down the line. Change can also destabilize the company or put the business at risk. I define meaningful change as change that results in sustained or net positive value added for the business over its lifetime.

To drive meaningful change, I believe it’s vital to have a solid understanding of:

  • Cost and impact
  • Importance of influence
  • Challenges of coordination

I believe that individuals who effect meaningful change consistently have a very deep or intuitive understanding of these concepts and use them to make the appropriate tradeoffs.

“Change can result in accidental complexity or yield short term gains at the expense of dark debt that costs the business more than the value it produced later down the line.”

Cost and Impact

When discussing the topic of driving change, we must talk about cost and impact. This is not a new concept, and I’m not going to bore folks with the details of Risk Impact, which is well understood by most functioning organizations, but I believe it’s important to understand value birthed from change.

The potential net value (upside) created by change is a function of its cost and impact:

  • What are the resources required to drive said change?
  • How many internal/external stakeholders and customers will be affected by this change?
  • How will this change affect confidence in the company (internal/external)?
  • What are the expected, short-term returns?
  • How does the decision play into the business’s strategy (expected long-term returns)?

Most often, impact behaves similarly to leverage in financial instruments; both the upside and downside are amplified as exposure increases. Decisions that lead to more visible changes can either pay off immensely or put the company at significant risk.

This is why it’s imperative to understand the risk appetite of the business when evaluating an initiative or plan of action. In general, larger companies are risk averse due to the downside because there is more at stake while tiny startups cannot survive without undertaking significant risk. Even in larger companies, there may be teams or departments whose mandates both afford and necessitate greater risk taking. Because effective risk taking and risk management are tied directly to growth, developing an intuition for and exhibiting these abilities is extremely valuable to the business and crucial to one’s ability to drive meaningful change.

In general, as impact increases, so does risk (and often, but not always, cost). As the monetary stake involved increases, the number of stakeholders increases, and responsibility is afforded to those that are trusted and have the most influence. Because of this, driving high-impact decisions requires significant influence across the company.

Importance of Influence

“Often, they need access to certain people or resources—and that access requires influence.”

Influence is a form of social capital that’s profoundly important in the healthy function of any human system, whether it’s government, schools, nonprofits, or for-profit organizations and is at the very core of the human experience. It’s an idea that is very intuitive and embedded subconsciously in the human psyche—yet not talked about enough.

By and large, almost any meaningful change that happens in an organization is driven by some very motivated individuals. To be able to operate in the capacity that they need to, these individuals need to wield some amount of influence to be able to execute their vision. Often, they need access to certain people or resources—and that access requires influence.

When talking about developing influence, I think the best way to understand it is by discussing incentives. The business and one’s peers operate on different incentives — influence derived from either will depend on the value that they each receive.

Providing Value to the Business

When a business’s interests and incentives are tied directly to delivering value, it follows that employees who provide the most value will be rewarded (whether with compensation or social capital). I believe there are two major ways to provide value to the business:

Tangible Business Value

Short term

  • An individual or team increases revenue or decreases cost in a meaningful way
  • Many product-facing engineering teams work on features that roll up into a product SKU that customers are actively paying for
  • As a personal example, in 2016 I led an effort to scale down over-provisioned machines across PagerDuty, saving roughly $270,000 per year

Long term

  • Strategic decisions that position the business for massive expansion of customers and potential revenue—which are almost always driven by executives or very senior individuals. A great example at PagerDuty is the evolution of the product from a simple paging tool into a platform consisting of multiple products, such as PagerDuty Analytics, Event Intelligence, etc.
  • Contributes to a positive company brand and image (customer satisfaction, evangelism, etc.)

Organizational Support

  • Any work that provides intangible value and empowers employees either directly or indirectly by increasing efficiency or reducing waste across the org
  • As an example, consider an onboarding process that allows employees to become productive in half the time after joining. Is there a dollar value attached to that in an organization with thousands of employees?
  • Another good example of this would be development of tooling and elimination of tech debt. Both enable developers to deploy code much more efficiently and cut down on time spent as a result of inadequate tooling or dark debt

Providing Value to Peers

Peers also have a financial stake in the company and are vested in its success, but less so than executives—their incentives are tied more to job satisfaction and career development. In my opinion (and speaking from experience), as seniority increases, employees need more competent and capable peers to learn from and to help them through their journey. An individual can provide value and develop influence among peers in two main ways:

  • Demonstrating expertise
    • Valuable employees take pride in their skills and want to surround themselves with other experts in their professions (to learn from, collaborate with, etc.)
  • Providing material support
    • Employees value and elevate peers who support them in achieving goals

Visibility & Developing a Brand

One of the most important parts of developing influence is establishing one’s brand in the organization and being seen. There’s work that needs to be done to develop expertise and domain knowledge in a certain area, but there’s also work that needs to be done to market oneself and make the organization aware of one’s interests and abilities. From experience, most engineers focus on the former, while the latter falls by the wayside. Advocating for yourself is a very important part of developing influence!

Let’s say an individual wants to be able to drive highly visible system design decisions in engineering. Before engineering leadership entrusts the individual with the responsibility to drive said decision, they will ask themselves: “What do I know about this person? What is this person’s track record on similar work in the past? How have their previous decisions regarding tradeoffs on different designs and systems panned out?” For the individual to win the trust of leadership, they need to have worked toward that reputation by focusing on work that allows them to exercise and demonstrate that judgment on a smaller scale.

We’ve talked a lot about influence and how we may go about achieving it. But we need to understand the function it serves. Is it really necessary? I believe it is. Our need for hierarchy and influence serves to address the problem of coordination.

Challenges of Coordination

Flocks of birds exhibit swarm intelligence to self-organize and coordinate their movements.

At PagerCon 2020, I had the privilege of giving a talk on the topic of Coordination, Effective Consensus, and Avoiding Gridlock. I thought this blog post could be another good opportunity to allude to some of these ideas. When is it acceptable to incur the costs of coordination and when is it essential to minimize them? To understand the tradeoffs, we have to analyze and understand hierarchies.

Every individual has their own perception of reality and the world we live in. Our diverse backgrounds and experiences mean that we will often have different opinions on the solutions to problems. This is where coordination and communication becomes necessary and why it makes sense to organize ourselves in hierarchies.

The vast majority of hierarchies for businesses at scale, including governance at PagerDuty, are centralized. Select organizations have been known to buck the trend — in an attempt to increase transparency and distribute decision-making power, some organizations adopted “Holacracy,” a form of flat hierarchy, as their form of governance. This blog post does a phenomenal job explaining the drawbacks of flat hierarchies and why many high-profile companies that adopted them experienced significant challenges and had to, for the most part, re-organize their governance. In short, it can be difficult to make decisions that are well informed and efficient due to coordination challenges, especially as a business grows.

This picture from the aforementioned blog post illustrates how quickly the cost of coordination (denoted by edges) can increase as more actors (denoted by nodes) are added. In a fully connected group of n people there are n(n-1)/2 possible pairwise connections, so the number of edges (signifying the cost of coordination) increases quadratically.

Without a centralized hierarchy, the speed and quality of decision-making suffer drastically at scale due to this non-linear increase in coordination cost. This is formally referred to as combinatorial explosion.
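To make that growth concrete, here is a tiny sketch of my own (not from the referenced post) that counts the pairwise communication paths as a group grows:

```python
def coordination_paths(people: int) -> int:
    """Pairwise communication paths in a fully connected group: n * (n - 1) / 2."""
    return people * (people - 1) // 2

for size in (3, 5, 10, 25, 100):
    print(size, "people ->", coordination_paths(size), "paths")

# 3 people -> 3 paths
# 5 people -> 10 paths
# 10 people -> 45 paths
# 25 people -> 300 paths
# 100 people -> 4950 paths
```

Adding the 101st person to a group creates 100 new paths to maintain, while adding a 6th person to a team of five creates only 5. That asymmetry is the whole case for limiting how many people must coordinate directly on any one decision.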

I have two general takeaways from this discussion.

First, it’s important to take the above with a huge grain of salt, as there is no one-size-fits-all solution. I’ve experienced significant improvement in the morale and performance of my team since we moved to a model where we distributed leadership to some extent and made room for less senior members of the team to lead as well. In a system with low coordination costs, such as a small team, distributed approaches to leadership can be very empowering for individuals and for the team. At scale, however, the drawbacks of a distributed or decentralized leadership model start to become apparent and inhibiting.

Secondly, because of the rapidly increasing coordination costs that result from combinatorial explosion, it can be cost prohibitive to achieve absolute consensus among all stakeholders. Not only that, a single dissenting party can cause gridlock, and decision making can grind to a halt. This is why it’s often preferable to aim for and achieve rough consensus, particularly when there are many stakeholders involved. As long as all concerns have been addressed, it’s important that everyone commits despite disagreements. Once again, this is not a one-size-fits-all approach, but it can be a good strategy to strike a balance between alignment and speed of decision making.

Closing Remarks

As leaders, we should strive for meaningful change; however, it’s important for us to recognize that not all meaningful changes are immediately visible or generate lots of buzz. Small incremental improvements are often overlooked despite being necessary, moving the organization in the right direction, and setting the foundation for dramatic improvements.

To drive meaningful change, we have to understand our domain, master our craft, make the right tradeoffs, listen, and influence those around us. We must try to minimize our coordination costs as much as possible without taking away control and autonomy from the people who deliver our successes. At the same time, hierarchies exist and they are a good thing—let’s use them to our advantage. Let’s empower those that love to execute and are good at it. Let’s foster our future leaders, instill these values, and provide them the opportunity—should they seek it.

As a final remark: one of the things that I’ve always appreciated about PagerDuty as a company has been its humanity. Let’s be kind. Let’s be mindful of how our actions may take away influence from others, and let’s share stake, success, and recognition as much as reasonably possible. This is how all of us can grow and bask in our successes, unified as a team.

The post Recipe for Meaningful Change appeared first on PagerDuty.

]]>
What is Software Understandability? by Liran Haimovitch https://www.pagerduty.com/eng/what-is-software-understandability/ Tue, 19 Jan 2021 17:05:58 +0000 https://www.pagerduty.com/?post_type=eng-blog&p=67227 Liran Haimovitch is the co-founder and CTO of Rookout, a modern software debugging platform. Back in my early days at Rookout, I had the privilege...

The post What is Software Understandability? appeared first on PagerDuty.

]]>
Liran Haimovitch is the co-founder and CTO of Rookout, a modern software debugging platform.

Back in my early days at Rookout, I had the privilege of working with a large and well-known enterprise. They were heavily data-driven and had even developed a custom NOC solution. As it turned out, one of their engineers could not log on to that application. No matter what he did, he ended up getting an obscure error message from Google. The team responsible for that application had chased that bug for over six months. They scanned through the authentication flow dozens of times and still had no idea how such a thing could happen.

That’s the bane of modern software engineering. Software applications in general, and cloud-native applications in particular, are becoming sprawling and complicated affairs. Services, both first- and third-party, are becoming more interdependent. Customization options are growing out of control. The internet is hell-bent on throwing the most bizarre and unexpected inputs our way. And so, with engineering turnover, we find more and more teams failing to understand the software they are responsible for developing and maintaining.

Defining Understandability

To tackle that issue, we need to start by giving it a name: understandability. Drawing inspiration from the financial industry, we define understandability as:

“Understandability is the concept that a system should be presented so that an engineer can easily comprehend it.”

When a system is understandable, engineering operations become a straightforward process. Every time you and your team face a task, you just know what you need to do. It doesn’t matter if it’s developing new features, tackling customer issues, or updating the system configuration. When you comprehend the software, you know how to execute in a consistent, reliable manner, without unnecessary back and forth.

What Happens During Incident Response?

When your phone is going off in the middle of the night because something is wrong, your understanding of the application is vital. First, you have to use the information you have to verify that this is an actual service disruption. Second, you need to triage the incident, understand the interruption’s impact, and identify its general area. Last, you either work around it yourself or escalate it appropriately.

If you have a clear mental image of the application in your mind, and high-quality data is available to you, you can expect to perform reasonably well in all of those tasks. You’ll be able to make faster and better decisions in every situation and are much more likely to be able to resolve it yourself. If you have a poor understanding of the application, you won’t likely perform as well.

When it comes to on-call rotations, poor understandability will have a significant impact on teamwork. In some services, new engineers join the on-call rotation quickly, while in others, it takes them a long time to acquire the necessary level of knowledge and confidence. But even more critical (as we all love our bedtime), how many engineers does it take to figure out and fix a single service disruption?

How Does it Impact Bug Resolution?

Understandability also plays a massive role in resolving bugs. We’ve all had that eureka moment where a colleague approached us to report a bug and, even before they finished the sentence, we knew what was wrong and which line of code would fix it. Unfortunately, in complex software environments, more often than not, that’s not the case.

Today, when facing a bug, you have to take into account:

  • Source code. More specifically, the version(s) deployed at the time of the bug occurring.
  • Configuration and state. Includes everything from the number of servers we are running through custom settings defined for the user to previous application actions.
  • Runtime environment. This includes the operating system, container runtime, language runtime, multi-threading model, databases in use, and so much more.
  • Service dependencies. Everything our service relies on, whether first- or third-party, may be impacting the behavior.
  • Inputs. Potentially the most crucial part: what exactly were the inputs that resulted in the issue?

The more you understand the application you are responsible for, the easier it is to put the pieces together, the easier it is to collect the information you need, and the less information you need to see the bigger picture. The less you understand, the more you will find yourself relying on other people and struggling to collect more information. If all else fails, you just may find yourself pushing log lines to production to try and figure out what is going on.

What Can You Do About It?

If you think through the past few months, you will likely notice both positive and negative examples of understandability. In some cases, you were knowledgeable about the software and quickly and efficiently carried out tasks. Things might have been more challenging in other places, with jobs requiring help from other engineers and sometimes even requiring some back and forth to get it right.

Fortunately, when it comes to software understandability, there’s always room for improvement. There are several tried and true approaches to get you there:

  • Minimizing complexity will create a higher-quality, more comfortable environment to comprehend software.
  • Carefully curating knowledge will make for a gentle learning curve and resources to patch over information gaps.
  • Building high-quality development environments will enable engineers to experiment with the application with real-world input and configuration examples.
  • Deploying observability tooling will provide engineers with some feedback about the application’s behavior.
  • Adopting traditional and next-generation debuggers will empower engineers to dive into their code as it’s running.

Summary

Software understandability is a crucial factor in software engineering, measuring how easy it is to comprehend applications. Understandability empowers your everyday efforts, and poor understandability is painfully apparent, especially when it comes to incident response and bug resolution. We have the essential techniques to improve understandability, and you can read more about it here.

What about the state of your software (and systems)? Is it mostly easy to understand, or have you been struggling? We’d love to talk to you and help where needed, whether you’d like to commiserate or ask questions. Please feel free to engage either directly with Liran on Twitter (@Liran_Last) or with us on PagerDuty’s community forum! You can also listen to the Page It to the Limit podcast that accompanies this blog post, here.

Wait! What Was the Bug?

That NOC application used Google’s social login for authentication. They had inadvertently passed a flag requesting that Google include the full user profile within the login token. Later in the code, they had a safety check that trimmed the token above a certain length, rendering it invalid. When the application validated the token, it responded to validation failures by sending the user’s browser to re-authenticate.

That engineer had a large user profile within their GSuite account. That led to an enormous token, resulting in too many authentication attempts, and finally, that infamous error message. Within that safety check, there was a TODO comment to add a logline for this abnormal situation. Nothing is more painful than the logline that should have been there but wasn’t.
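Purely as a hypothetical reconstruction (the function, names, and limit below are invented for illustration; the team’s real code is unknown), the shape of that safety check, and the missing logline, might have looked something like this:

```python
import logging

logger = logging.getLogger("auth")

MAX_TOKEN_LENGTH = 4096  # hypothetical limit; the real value is unknown

def sanitize_token(token: str) -> str:
    """Trim oversized login tokens before validation."""
    if len(token) > MAX_TOKEN_LENGTH:
        # TODO: add a logline for this abnormal situation
        # ...which is exactly the line that should have been there but wasn't:
        logger.warning(
            "Login token is %d chars, over the %d limit; trimming will invalidate it",
            len(token), MAX_TOKEN_LENGTH,
        )
        return token[:MAX_TOKEN_LENGTH]  # the trimmed token no longer validates
    return token
```

A single warning at the point where the token is mutated would have turned a six-month mystery into a quick log search.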

The post What is Software Understandability? appeared first on PagerDuty.

]]>
A Menagerie of DevOps Habits, Part 2 by Quintessence Anx https://www.pagerduty.com/eng/menagerie-devops-habits-part-2/ Wed, 02 Dec 2020 14:00:25 +0000 https://www.pagerduty.com/?post_type=eng-blog&p=66300 Alerts and notifications are what allow us to know if there’s something out of the ordinary with our systems. Unfortunately, as we scale up and...

The post A Menagerie of DevOps Habits, Part 2 appeared first on PagerDuty.

]]>
Alerts and notifications are what allow us to know if there’s something out of the ordinary with our systems. Unfortunately, as we scale up and out, the volume of those messages can get unwieldy. Today, I’ll be talking about some of the bad habits we have that allow our notifications to get into that state and what can be done to dial them back down.

Photo by Rahul Chakraborty on Unsplash.

I Need to Know Everything, So I Alert on Everything

This usually follows as an accidental result of “I don’t know enough about my systems (or code).” When you are afraid of the conditions you cannot predict, referred to as unknown unknowns, you will likely spend a lot of energy trying to alert on every situation you can imagine. Although all of those alerts will increase how much digital noise your systems make, more noise doesn’t mean you are capturing potential critical issues in the way that you hope.

Anything known can, to an extent, be predicted. For the unknowns, you will probably find yourself building a series of “what-if” scenarios into your alerts that may not yield anything useful. You may not even delete them when you find them unhelpful, just in case they turn out to be useful later. This is more harmful than helpful in both the short and long term; you’re actually training people to ignore the alerts because most of them are ignorable. It’s similar to listening to the AC or heater in your home—if it’s kicking on all the time, your brain filters it out so you don’t “hear” it any more.

Solve this 🔦: Learn what expected and errant behavior look like, and build a practice for handling all those unknown-unknowns. You will never know everything, and trying to alert on everything will only introduce undesirable alerts. (For more clarity on known knowns, known unknowns, etc., take a look at this article.)
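As one illustrative sketch (the numbers and names are mine, not a prescription), compare alerting on every individual error with alerting only when the error rate stays above a threshold across a recent window:

```python
import time
from collections import deque

WINDOW_SECONDS = 300          # illustrative values; tune to your own service
ERROR_RATE_THRESHOLD = 0.05   # alert when more than 5% of recent requests fail

_events = deque()  # (timestamp, succeeded) pairs observed within the window

def record_request(succeeded, now=None):
    """Record one request outcome and drop anything older than the window."""
    now = time.time() if now is None else now
    _events.append((now, succeeded))
    while _events and _events[0][0] < now - WINDOW_SECONDS:
        _events.popleft()

def should_alert():
    """Alert on a sustained symptom, not on every individual error."""
    if not _events:
        return False
    errors = sum(1 for _, ok in _events if not ok)
    return errors / len(_events) > ERROR_RATE_THRESHOLD
```

The point is not the specific threshold; it is that the alert fires on behavior a human should act on, rather than on every event the system can emit.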

I Don’t Know Who to Notify, So Everyone Needs to Know Everything

Another form of “too much of a good thing.” When there is uncertainty surrounding who is knowledgeable about a certain application, service, etc., a notification is blasted to everyone instead of directly notifying a subject matter expert. This may be truly everyone or a subset like “all of the leads.” While it might initially seem like sending notifications to everyone and leaving it to them to sort it out would be better than not sending out the notifications at all, that’s actually not the case. Since each person will receive many more alerts than they will take action on, people will learn to ignore them. Also, in circumstances where more than one person can take action, there is no clear path of ownership over a current incident. When this happens, ownership must be established during an active incident, which is not ideal since the priority should be resolving that incident efficiently.

Solve this 🔦: Take the time to determine who can take action on which alerts, and configure the notifications to reach only that person (who that person is should be determined by a rotating schedule). We have information about how to do this in our Incident Response Ops Guide.
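For example, instead of blasting a distribution list, an alert can be routed to the integration key of the one service whose on-call responder owns it. Here is a minimal sketch against PagerDuty’s Events API v2 (the routing key and summary are placeholders, and it assumes the third-party requests package):

```python
import requests  # assumes the third-party "requests" package is installed

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_SERVICE_INTEGRATION_KEY"  # placeholder for the owning service's key

def page_owning_team(summary, source, severity="critical"):
    """Trigger an incident on the one service whose on-call rotation owns this alert."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    response = requests.post(EVENTS_API, json=event, timeout=10)
    response.raise_for_status()

# page_owning_team("Checkout error rate above 5% for 5 minutes", "checkout-api")
```

Who actually gets paged is then decided by that service’s escalation policy and rotating schedule, not by whoever happens to be on the email thread.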

Alerts Are Only for IT / Engineering

Previously, I mentioned that the only person or people who should be notified of an incident are those who can take action. Now, I’m going to ask you to broaden the scope of what “action” means.

The action that we normally associate with an IT incident is the triage, troubleshoot, and resolve aspect of the incident, but there are more actions than these. If there are multiple people or teams involved in resolving the incident, then someone needs to coordinate them. Someone needs to send internal and/or external communications, and so forth, depending on scope and severity. All of these actions should be handled by non-engineer roles, because the engineers should be brought in as the subject matter experts who resolve the incident.

What about alerts that aren’t even IT related at all? There are many unique use cases that do just that. For example, Good Eggs, a grocery delivery service, sets alerts on refrigerator temperatures (watch the customer video here), and Rainforest Connection uses alerts to learn of illegal logging. There are also ways that non-engineers can use alerts that aren’t restricted to niche use cases. For instance, depending on how your HR is set up, there could be a notification sent when an HR violation occurs that needs to be addressed or, if you’re in marketing, you could set up an alert on ad spend.

Solve this 🔦: Expand the scope of what “action” means to include all the tasks that need to be done for an active incident before it is resolved and set alerts for the people who truly need to be notified. Also think of ways that non-engineers could benefit from receiving alerts, such as improving their own manual processes.

We’re Measuring Everything We Need to Measure

This assumption happens a lot if you’re unsure what to measure so you search for ideas and implement whatever the top results are. If you’re looking to support a DevOps culture, you might look up DevOps KPIs, find a Top N or similar list, try to implement those, and pat yourself on the back for a job well done. Unfortunately, the job isn’t done if you haven’t asked qualifying questions—you might be measuring things you don’t need, missing things you do, or have the wrong scope. In this situation you’ll frequently find yourself gathering a lot of data but not being able to do anything useful with it.

Solve this 🔦: Start by asking what information you need, why you need it, and how to gather it. Knowing the “what” and “why” will help you determine which metrics to use and their scope, and the “how” can help you determine which tools to use.

Where to go from here

If you’d like more information about how to overcome cultural transformation hurdles for alerting and incident management, I definitely recommend reading our Ops Guides. Specifically, you might want to start with our Incident Response Guide and then branch out into Retrospectives and Postmortems. If you’d like to discuss how you’re implementing any of these changes or how you’d like to start, please feel free to find me on our Community Forum!

The post A Menagerie of DevOps Habits, Part 2 appeared first on PagerDuty.

]]>
A Menagerie of DevOps Habits, Part 1 by Quintessence Anx https://www.pagerduty.com/eng/habits-devops-best-practices/ Fri, 30 Oct 2020 13:00:50 +0000 https://www.pagerduty.com/?post_type=eng-blog&p=65673 As many of us settle into our careers, we fall into habits—some are conscious and we know we’re doing them, but we’re just not actively...

The post A Menagerie of DevOps Habits, Part 1 appeared first on PagerDuty.

]]>
As many of us settle into our careers, we fall into habits—some are conscious and we know we’re doing them, but we’re just not actively thinking about them. Other habits are more insidious—we’re unaware of them, but are practicing them nonetheless. In fact, in order to build the list in this blog, I reflected on situations I’ve encountered and how I resolved them—or tried to.

Photo by Jaredd Craig on Unsplash

Documentation Is Write Once, Read Many

We have a saying in IT that “naming is hard,” but writing is harder. (I say as I write this post.) The labor that goes into creating a project is massive. Equally as important, but much less considered, is the accompanying documentation. While groups and initiatives like Write the Docs have helped prevent documentation from being completely forgotten, it is common to not only write sparse docs, but also to neglect to update them with code updates. When this starts to happen, the information you are providing people to work with your project becomes less and less accurate. It might go without saying, but that not only reduces people’s ability to effectively use what you’ve written—it also decreases trust in the deliverables you are promising and the information you have documented.

Solve this 🔦: Just like teams need stand-ups and sprints, projects need documentation. Normalize iterative documentation updates every time there are project/code updates.

I Only Need to Write for One Group/Audience

This assumption sneaks up on us as engineers and IT professionals. When we start writing our application or service, we usually write the immediate documentation for who we think will be consuming it—typically other developers, perhaps our coworkers, or maybe open source contributors. And after that, we normally stop.

Even within the developer community, there are multiple audiences to consider. Ask yourself:

  • Does this documentation account for differences in discipline? Consider including information that would be needed for infrastructure requirements, quality assurance testing, data and business intelligence specialists, and so on.
  • Does this application or service have non-engineers using it, and is there documentation available that is in a language that they understand?
  • What is the minimum expected skill level of the user or consumer of your application or service?

The list can go on, but essentially, there are going to be many groups in your audience and you will need to accommodate them.

Solve this 🔦: Use some of the questions above to expand information for your product use cases, as well as outline the expected skill types and experience levels for everyone who will be interacting with what you build.

Documentation Needs to Live in One Place

There are strong benefits to having documentation in a single place—it doesn’t need to be updated in multiple places every time a change is made, and people will know where they need to go to learn something. However, where documentation lives is not one-size-fits-all.

Consider what information you’re trying to store, who the audience is, and how long it needs to “live.” Some tools, like wikis, are good for centralized information that needs to persist; however, they will not work for all project types. For example, if you’re looking to collaborate on a blog post before it goes live, a tool like Google Docs might be a better fit. If you’re working with sales to document customer meetings, a more sales-oriented tool might be better for capturing that information. To complicate the matter, most of these tools aren’t free, so it’s entirely plausible that not all employees of a company will have access to all tools.

Solve this 🔦: Establish what information needs to live where, based on whatever criteria makes sense for your situation. It might be based on content, intended audience, longevity, and/or something unique to your needs. Once you’ve established what information needs to “live,” along with where and why, you can habituate those workflows on your teams.

${FAANG} Does This; We Should, Too

Photo by Martina Misar-Tummeltshammer on Unsplash

FAANG is the acronym for Facebook, Amazon, Apple, Netflix, and Google. These companies are the sources of a lot of innovation in our industry, in part due to the fact that they have massive engineering departments, user counts, and a scale that most companies could only dream of (or have nightmares about). This size and scale allows them to experiment in a way that more typically sized companies cannot. That said, it also means that some of the best practices that come out of these experiments cannot directly apply to most companies as-is, so the “let’s try these metrics that Apple uses” or “Netflix does this practice, let’s mirror that” approach likely won’t work.

Solve this 🔦: Make sure to understand not only your business needs but also your organization’s size and scale. Definitely do make use of the amazing research that comes out of these companies, but be well informed enough to be able to adapt their practices and procedures to a model that fits your organization’s needs.

Some Next Steps

There are a lot of ways that you can start to break out of the bad habits shared in this post. The short solutions I provide for each are by no means all encompassing, but are a way for you to start learning more and branching out into better practices. If you have encountered these or other habits and got rid of them in your own company, we’d like to hear about it in our community.

While these corporate superstitions focused on documentation, in the next post of this two-part series, I’ll cover notification habits and behaviors, and ways that you can disprove and improve them when you take a closer look at your processes.

The post A Menagerie of DevOps Habits, Part 1 appeared first on PagerDuty.

]]>
Collaborating to Build Secure, Maintainable Systems by Sarai Rosenberg https://www.pagerduty.com/eng/collaborating-build-secure-maintainable-systems/ Wed, 14 Oct 2020 13:00:43 +0000 https://www.pagerduty.com/?post_type=eng-blog&p=64826 I’ve built and taught others about building systems of many kinds—as a mathematician and teacher, and more recently as a security engineer in the last...

The post Collaborating to Build Secure, Maintainable Systems appeared first on PagerDuty.

]]>
I’ve built and taught others about building systems of many kinds—as a mathematician and teacher, and more recently as a security engineer in the last 6 years. Over time, I’ve found a few consistent patterns stand out as being effective lessons across domains and products and cultures. The approaches I will share often hold value for security and reliability—but they are fundamentally about people and about the processes that people use to collaborate with each other.

I first read some of these ideas from my mother’s 1971 copy of The Psychology of Computer Programming. These patterns aren’t without exception; they aren’t comprehensive, and these ideas have been written about by hundreds of people for decades. In this blog, I’ll share a few examples that helped me develop an empathetic intuition for concepts, processes, and models that help people collaborate effectively in system design, including:

  • Preventing single points of failure
  • Building rituals
  • Building templates: Repeating known good patterns
  • Experimenting with controlled variables: Varying from those patterns
  • Drawing boundaries and crossing them with compassion
  • Treating your colleagues fairly

Prevent Single Points of Failure

How many coworkers can leave before your team can’t maintain your service?

It’s vital to share knowledge, document your systems, and have multiple people of various backgrounds and levels test your documentation. On the PagerDuty Security team, we balance which tasks should be “quick wins” that can be quickly completed by someone with experience in that knowledge area, and which tasks should be spread across the team to share knowledge. Shared tasks also enable us to identify where we can improve our documentation, our code readability, and tools we use, such as dashboards and audit logs.

Every concept in building maintainable systems is eventually about managing risks. Single points of failure are considerable risks to your system that should be assessed regularly in your risk management program.

Imagine if a failure in any single service resulted in a complete outage for all customers at PagerDuty. That’s bad, right? We want our services to be robustly resilient to failures—to degrade gracefully, maintaining our customer commitment to providing the best of our product. The same reliability commitment that PagerDuty executives market to our customers also applies to our teams: We reduce our single points of failure and increase our resilience by sharing knowledge, by continuously improving our documentation through practical application, and by making our services interchangeable among teams.

Security leverages the concept of single points of failure in our threat models, too! Defense-in-Depth is a layered security strategy in which no single compromise is exploitable on its own. By using Defense-in-Depth, we avoid the risk of a single point of failure, where the failure of a single layer could expose a vulnerability in our infrastructure.

For example, the infrastructure security pattern of using hardened “bastion hosts” is a form of Defense-in-Depth that prevents a single point of failure. “Bastion hosts” are hardened, monitored servers and the only hosts in our infrastructure with an open SSH port 22. These hosts serve as an SSH proxy for our destination hosts; all SSH connections must pass through the hardened bastion hosts. SSH access typically includes additional security layers, such as Access Control Lists (which require an auditable 2-person code review to change) and Multi-Factor Authentication such as hardware security keys, SSH keys, and SSH passphrases.

Furthermore, all infrastructure activity is logged, audited, and monitored by an Intrusion Detection System. Every system is protected by multiple layers of security—and these layers aren’t redundant but complementary components of building secure systems that are resilient to misconfiguration, to outage, and to exploitation.
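From a client’s point of view, “all SSH connections must pass through the bastion” can be as simple as forcing every hop through it. The sketch below uses OpenSSH’s -J (ProxyJump) flag; the hostnames are placeholders, and this is only an illustration of the pattern, not PagerDuty’s actual tooling:

```python
import subprocess

BASTION = "bastion.example.internal"          # placeholder hostnames
DESTINATION = "app-host-01.example.internal"

def run_via_bastion(command):
    """Run a command on the destination host, hopping through the bastion.

    With -J (ProxyJump), the destination host never needs to expose port 22
    to anything other than the hardened, monitored bastion.
    """
    return subprocess.run(
        ["ssh", "-J", BASTION, DESTINATION, command],
        check=True, capture_output=True, text=True,
    )
```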

Data flow diagram of a bastion host:

Build Rituals

Every mathematics course I’ve taught or taken starts from a set of principles that construct a foundation of what students are expected to know. While more abstract courses expect greater mathematical maturity from students, starting each mathematics course from axioms is a pedagogical ritual to build common ground.

Programming has long established a “Hello World” ritual: the classic introduction to a new tool or programming language is an exercise that produces the text “Hello World!” This ritual reduces the cognitive load of the first learning exercise using a familiar pattern, allowing the learner to focus on the syntax and semantics of the new language.

At PagerDuty, we introduced a “Hello Pagey” modular onboarding exercise that walks our new software engineers through the steps of deploying a miniature, personal “Hello Pagey” microservice using PagerDuty’s infrastructure tools. Each module offers a gentle, step-by-step introduction to our coding practices, style guide, CI/CD tools, container orchestration, Terraform, and handling secrets such as API keys. The final module produces a satisfying web API endpoint that prints the return text “beep” after sending “boop” to another “Hello Pagey” microservice.
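The exercise itself is internal to PagerDuty, but as a hypothetical miniature of the receiving half of that exchange (Flask, the route, and the payload shape here are my own illustrative choices), a “Hello Pagey” style endpoint could be as small as this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/boop", methods=["POST"])
def boop():
    """Answer a 'boop' from another Hello Pagey service with a 'beep'."""
    sender = request.get_json(silent=True) or {}
    return jsonify({"reply": "beep", "to": sender.get("from", "unknown")})

if __name__ == "__main__":
    app.run(port=8080)
```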

Some rituals span teams, and some are unique to your team. Find them, and leverage those rituals that are working for you.

Build Templates: Repeat Known Good Patterns

Rituals are one form of procedural templates. Templating applies to everything from procedures to our tech stack and our service “skeletons” for generating new microservices. Following Known Good Patterns means that when someone else needs to learn your service, they know how to navigate the code and documentation, they know where to manage monitors, and they have the gist of maintaining the service without ever having seen it.

Sometimes using Known Good Patterns means departing from what may appear to be “the best solution” for the sake of a familiar pattern with known weaknesses. The engineering tradeoffs are debatable for any given issue—but let’s have a chat about how Known Good Patterns impact security and risk management.

Using the same old Known Good tools and patterns generally means that our Security team has examined the pattern and helped to secure it—such as the hardened base Docker images produced by our SOC 2 System Hardening Project. When an engineering team builds a service from service skeletons and hardened base Docker images, their service inherits vulnerability scanning, monitoring, audit logs, and many more of the secure development practices required by PagerDuty Security for a production service.

Some aspects of using Known Good Patterns are more critical to security than others. For instance, it doesn’t matter which local software that our developers use to write code—but our security would be compromised by authorizing a non-approved cloud service to access our source code.

Experiment With Controlled Variables: Vary From Those Patterns

Good scientific experiment design relies on holding some controlled variables (“control group”) constant while testing the impact of changing selected experimental variables (“experiment group”). Experiments start with a hypothesis about what we expect to see, and how we measure differences from the control group.

PagerDuty’s Failure Friday template for chaos engineering experiments has some similar elements, as do our experiments when we deploy changes to staging, either by dogfooding, by feature flags, or by diverting percentage points of load balancer traffic to experiments. The Failure Friday template is an established ritual pattern based on the Incident Response structure familiar to every PagerDuty engineer. The team that owns that week’s Failure Friday event customizes the template with a plan of the experiment they will run. One crucial element of our Failure Friday template is a rollback plan: How do we handle ending, aborting, or rolling back our experiment? Of course, not all experiments go according to plan. 😏
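One way to picture the rollback-plan element is as an experiment wrapper that guarantees the rollback runs whether the experiment ends cleanly, is aborted, or goes off-plan. This is only an illustrative sketch, not PagerDuty’s actual Failure Friday tooling:

```python
from contextlib import contextmanager

@contextmanager
def failure_experiment(name, inject, rollback):
    """Run a chaos experiment and always execute its rollback plan."""
    print(f"Starting experiment: {name}")
    inject()
    try:
        yield
    finally:
        # Runs whether the block completes, is aborted, or raises an exception.
        rollback()
        print(f"Rolled back experiment: {name}")

# Usage sketch (start_packet_loss / stop_packet_loss are hypothetical helpers):
# with failure_experiment("drop 10% of staging traffic", start_packet_loss, stop_packet_loss):
#     watch_dashboards_and_alerts()
```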

The PagerDuty Security team encountered an experiment opportunity recently while we were discussing how we handle security requirements for a service scheduled to sunset in a few months. The short upcoming lifespan meant that we would have a few months to see the results of our experiment, but we wouldn’t bear the cost of rolling it back. However, legacy services can be dangerous for experimentation because they are often too dissimilar from our newer monostack services—in other words, the experiment would have few control variables, which increases risk and reduces the value of results. Our risks outweighed the value of the experiment, so we chose not to move ahead with it.

In contrast, PagerDuty’s Site Reliability Engineering (SRE) team completed experiments with Terraform that proved to be a good risk. They produced a valuable new pattern for service owners to use Terraform for Amazon Web Services (AWS) resources. In parallel, the PagerDuty Security team developed a threat model for automated access to AWS Identity and Access Management (IAM) using the STRIDE threat model and the DREAD risk assessment model. Our team used that threat model to plan and complete an AWS IAM Permission Automation project which mitigated risks from SRE’s Terraform project for automated access.

Image showing a DREAD template.
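As a sketch of how a DREAD assessment can be made concrete (the five categories are standard, while the 1-to-10 scale, simple averaging, and example numbers below are just one common convention, invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class DreadScore:
    """DREAD risk factors, each rated 1 (low) to 10 (high)."""
    damage: int
    reproducibility: int
    exploitability: int
    affected_users: int
    discoverability: int

    def risk(self) -> float:
        """Overall risk, commonly taken as the average of the five factors."""
        factors = (self.damage, self.reproducibility, self.exploitability,
                   self.affected_users, self.discoverability)
        return sum(factors) / len(factors)

# Illustrative numbers for a threat like overly broad automated IAM access.
threat = DreadScore(damage=8, reproducibility=6, exploitability=5,
                    affected_users=9, discoverability=4)
print(f"DREAD risk: {threat.risk():.1f} / 10")  # -> DREAD risk: 6.4 / 10
```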

Draw Boundaries, and Cross Them With Compassion

Our engineering teams use the PagerDuty product’s “Service Directory” feature, which defines service ownership for our teams and services: the Service Directory delineates boundaries of ownership and responsibility for PagerDuty tools and services.

Boundaries allow us to operate freely with a sense of shared trust within and between our teams. Boundaries allow a team to iterate on experiments within their boundaries, while defining their accountability to follow shared expectations, rituals, and patterns that span across teams.

Crossing a boundary is a violation of shared expectations, but can be done with compassion in support of another person or team. In both personal and organizational collaboration, crossing boundaries can be a powerful tool of compassionate collaboration.

In digital operations, crossing boundaries is necessary for almost every operational incident because the contributing factors span multiple teams. Almost every security incident is a shared responsibility, where our Security team collaborates with one or multiple teams to resolve the incident, following the ritual of our Security Incident Response process.

We talk about crossing boundaries when we refer to reassigning an incident as “tossing it over the wall”—and that phrase communicates that we feel a boundary was violated. Even a “warm handoff” is an opportunity for compassionate boundary crossing! Approach those moments with empathy for your colleagues by entering the conversation with a question such as, “Can I help you with this?” If your expertise is relevant to someone, your time is best spent collaborating with them to share that knowledge and discover how to more efficiently prevent future incidents and spread your expertise.

When you cross a boundary, take care to recognize the expertise and ownership of others. Be explicit about what you are offering, and check in with your colleagues about their needs—essentially, you’re offering to renegotiate the boundaries to provide your support. Bring compassion to that negotiation by explicitly acknowledging your respect for their expertise, and validate any concerns that they raise with empathy.

Treat Your Colleagues Fairly

Software engineering is a team sport. We all have to collaborate—and we collaborate more effectively within an atmosphere of mutual trust and respect. Part of cultivating a compassionate culture is consistently communicating your trust and respect.

Think of expressing the respect you have for the skills and expertise of your colleagues as the syntactic sugar of effective collaboration: It’s not strictly necessary, but it sure does make working with you a lot easier! While a plain text log contains the same raw data as a structured log, the structured log separates each piece of information and associates it with a named attribute. We can usually infer the meaning of a timestamp or a log severity from a plain log, but structured logging provides explicit context that helps your audience parse data efficiently, especially during tense moments like incident response.
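To make that analogy concrete, here is a minimal sketch (the field names are illustrative) of the same event emitted as a plain-text line and as a structured entry:

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("payments")

# Plain text: the reader has to guess which token is the user, the latency, the upstream.
log.error("checkout failed user=4821 latency=3200ms upstream=card-gateway")

# Structured: every piece of information is separated and named, so both humans
# and tooling can parse it under pressure.
log.error(json.dumps({
    "event": "checkout_failed",
    "severity": "error",
    "user_id": 4821,
    "latency_ms": 3200,
    "upstream": "card-gateway",
}))
```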

We cannot do our best work to build secure, maintainable, reliable systems if we do not trust each other or if we do not feel safe. Just like a structured log, stating that explicitly can make all the difference for building and maintaining strong collaborative relationships. Working together doesn’t mean we always agree with each other, that we don’t have conflicts, or that we want to drink tea together every afternoon. Our success is mutual: By going beyond the minimum to build compassionate collaboration, we learn and grow more together, and we accomplish more.


These ideas are only a starting point. What patterns are you noticing in your teams? What resources have helped your teams develop a culture of collaboration in system design? What resources have helped you learn about patterns in building secure and maintainable systems? Tell us about it by tweeting @pagerduty or commenting on this blog and in our community forums!

The post Collaborating to Build Secure, Maintainable Systems appeared first on PagerDuty.

]]>