On-Call Life | Categories | PagerDuty https://www.pagerduty.com/blog/category/on-call-life/ Build It | Ship It | Own It Thu, 17 Aug 2023 21:39:56 +0000

3 New Updates to the PagerDuty Scheduling Experience by Débora Cambé https://www.pagerduty.com/blog/3-new-updates-to-the-pagerduty-scheduling-experience/ Fri, 18 Aug 2023 12:00:31 +0000 https://www.pagerduty.com/?p=83652

The post 3 New Updates to the PagerDuty Scheduling Experience appeared first on PagerDuty.

With the acceleration of cloud and digital transformation initiatives, enterprises are under pressure to adopt more agile DevOps practices to be responsive to the business. But the increased complexity of digital systems and the growing reliance on digital business only make incidents more costly.

When incidents happen, protecting customer experience and minimizing downtime starts with bringing in the right subject matter experts to fix the problem. For many organizations adopting agile methodologies, building service ownership (you build it, you own it) into the incident response process is key to success. However, the cultural shift of putting developers on call for their services in production is no small feat. Having the right platform, one that assigns responders to dedicated on-call schedules and mobilizes the right people at the right time, makes all the difference when seconds matter.

As the best-in-class solution for incident response, the PagerDuty scheduling experience continues to be a focus area for ongoing iteration to ensure that the customer workflow is as seamless as possible.

That’s why we’re proud to launch a number of highly requested updates to scheduling. Highlights include the ability to rename layers and manage the teams associated with a schedule. Keep reading to learn all the details.

Consolidated Schedule View

The new schedule details page brings the most relevant information about each schedule front and center. This way, both on-call users and admins / team managers can quickly get an overview of current and upcoming rotations. Here are the details you can easily see on this page: 

  • Who’s on call
  • Who will be next on call
  • Calendar feeds
  • Collapsible menus to check which users, teams and escalation policies are associated with the schedule
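To make "who's on call" and "who will be next" concrete, here is a minimal sketch of a single round-robin rotation with a fixed handoff interval. It is a simplified model for illustration only, not PagerDuty's scheduling logic: the function name, the user names, and the weekly turn length are assumptions, and real schedules also account for multiple layers, restrictions, and overrides.

```python
from datetime import datetime, timedelta, timezone


def on_call_now_and_next(users, rotation_start, turn_length, now):
    """Return (current, next) on-call users for a simple round-robin rotation."""
    if not users:
        raise ValueError("rotation needs at least one user")
    if now < rotation_start:
        raise ValueError("rotation has not started yet")
    # Number of complete turns since the rotation began (timedelta // timedelta -> int).
    turns_elapsed = (now - rotation_start) // turn_length
    current = users[turns_elapsed % len(users)]
    upcoming = users[(turns_elapsed + 1) % len(users)]
    return current, upcoming


# Hypothetical rotation: weekly handoff every Monday at 09:00 UTC.
users = ["alice", "bob", "carol"]
rotation_start = datetime(2023, 8, 14, 9, 0, tzinfo=timezone.utc)
now = datetime(2023, 8, 23, 12, 0, tzinfo=timezone.utc)

current, upcoming = on_call_now_and_next(users, rotation_start, timedelta(weeks=1), now)
print(f"On call: {current}; next: {upcoming}")
```

Using timezone-aware datetimes keeps the handoff arithmetic unambiguous when responders sit in different time zones.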

Screenshot of consolidated schedule view

Dynamic Schedule Creation

By listening to our customers’ feedback on usability, we were able to design a more fluid and dynamic schedule creation experience. New capabilities include:

  • Mandatory schedule names: every new schedule now requires a name, which gives users, team managers, and admins more clarity about existing schedules.
  • Dropdown time picker: instead of manually typing a handoff time, you can now use a time picker, which is also available when adding restrictions to a schedule.
  • Relocated buttons: the cancel and save buttons are now on the right-hand side of the page and follow the user’s page scroll, making them easier to find.

Screenshot of dynamic schedule creation

Flexible Editing

More than ever, change is the only constant for organizations and their teams. So it’s important that their schedules can quickly be adapted to reflect their current structure and process. Check out these three new capabilities that improve schedule editing:

  • Manage teams associated with a schedule: admins can now add or remove teams directly from a schedule.
  • Change a schedule layer’s name: users and admins can edit a schedule layer’s name to give it a more descriptive title that makes sense to the organization. Layer names even support emojis. For example, you can name one layer East Coast and another West Coast.
  • Reorder and collapse schedule layers: the new drag and drop functionality allows users to reorder layers easily in both the schedule layer creation and editing pages.

Screenshot of flexible editing

Deep Dive on the New Scheduling Experience

Want to see these new capabilities in action? Watch the video below, where Senior Product Manager Kara Smith joins Developer Advocate Mandi Walls to show off all the scheduling UI enhancements.

We’ve launched these updates to make the scheduling experience easier, but we’re not stopping there. Stay tuned to see how we’re continuing to build out the PagerDuty Operations Cloud to help scale teams with the power of AI and automation to transform the entire incident management process. Try it for yourself with our free 14-day trial.

Knightscope Relies on PagerDuty to Keep Their Robots Rolling by Rachel Schmitz https://www.pagerduty.com/blog/knightscope/ Wed, 08 Feb 2023 14:00:52 +0000 https://www.pagerduty.com/?p=81160

The post Knightscope Relies on PagerDuty to Keep Their Robots Rolling appeared first on PagerDuty.

As security becomes more advanced and available, companies must look for ways to be more efficient with their resources in order to stay competitive. Faced with constraints such as limited staff and low customer tolerance for service delays, they need reliable and affordable solutions. In the security industry, that means disrupting the traditional model. Organizations are achieving their goals by relying on automation and technology.

Knightscope Leads the Way

For the past nine years, Knightscope, an American advanced security technology company, has been leading the charge with its Autonomous Security Robots. The publicly traded company designs, builds and deploys Autonomous Security Robots for use in monitoring people and vehicles in public areas such as malls, parking lots, neighborhoods, and casinos. They have been innovating public safety using robots and proactive monitoring to deter and report security risks. Their mission is ambitious: to make America the safest country in the world. 

When Knightscope shipped their first robot in 2015, the security robot industry was still in its infancy. With little competition and a blank canvas on the landscape of the industry, Knightscope was a pioneer. They dove headfirst into the autonomous security robot space and now have contracts for more than 100 autonomous security robots.

It Takes a Village

All of the Knightscope robots are, essentially, vessels for the company’s API software. The data collected by the robots is their real product. Several departments work together to keep the systems rolling, and they need to ensure that they are working as efficiently as possible. Jane Miller, Senior Operations Manager, manages the Network Operations Center (NOC) at Knightscope, where the team is responsible for maintaining clients’ robots and keeping them up and running 24/7. “They’re our first line of defense for everything technical going on with our fleet of robots,” said Miller.

The NOC needs to be able to handle potential software and hardware issues at any time, including after business hours. For example, the hard drive of a computer inside a robot might fill up. Also, because Knightscope software runs in autonomous security robots, the robot’s physical vessel can develop problems that are independent of any software glitch. For example, it could be vandalized or hit by a car. It may also need human intervention in situations where it’s not sure what decision to make. In these situations, the on-call team member must resolve the issue or complete the diagnostics in preparation for handing it over to the appropriate team.

Miller is also responsible for the technical client support of Knightscope’s Security Operations Center (KSOC), where Knightscope’s software interface allows clients to interact with their robots. Here, clients can view the video coming from the robots, and have an emergency phone call system where the robots can be programmed to call the Security Team. Clients can also prompt audio announcements if someone is seen as a threat or enters a restricted access zone. 

With all of the different systems and departments, resolving an issue could get complicated. Because they are a technology company at their core, they know they need a dependable and proven way to align all of the systems and teams.  

PagerDuty is Their Solution

Knightscope has been using PagerDuty since 2015 to manage its real-time operations. It integrates the PagerDuty API with its proprietary robotics software to monitor the robot hardware, software, and autonomous decision making. The teams receive proactive notices via PagerDuty if a piece of software crashes unexpectedly, if the robot physically needs help, or if the connection to the robot has been lost. PagerDuty also provides data about the health of the fleet of robots, helping confirm that new software releases or bug fixes are operating as expected.
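The article does not describe Knightscope's integration in detail, so the following is only a rough sketch of how monitoring software could raise an incident through PagerDuty's public Events API v2. The robot identifier, the check name, the dedup scheme, and the placeholder routing key are all hypothetical.

```python
import json
import urllib.request

# Public PagerDuty Events API v2 endpoint for trigger/acknowledge/resolve events.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, robot_id, check, summary, severity="error"):
    """Build an Events API v2 "trigger" event for one failing check on one robot."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        # Hypothetical dedup scheme: repeated reports of the same check on the
        # same robot share a key, so they group into one incident.
        "dedup_key": f"{robot_id}:{check}",
        "payload": {
            "summary": summary,
            "source": robot_id,
            "severity": severity,
        },
    }


def send_event(event):
    """POST the event to PagerDuty and return the decoded JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    robot_id="robot-042",                # hypothetical robot identifier
    check="disk-usage",
    summary="Onboard disk is 95% full",
    severity="warning",
)
# send_event(event) would submit it; it is left uncalled so the sketch runs offline.
```

Keeping payload construction separate from the network call makes the event shape easy to unit test without contacting PagerDuty.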

“We take care of making sure that the robots are doing everything they need to be doing from a security perspective, if it’s having any hardware or software issues. If anything comes up, PagerDuty lets us know and we take care of it. If any new issues come up, we investigate, and escalate to our R&D teams,” said Miller.

This is where PagerDuty Incident Response shows its value. When a robot isn’t responding the way it should, “We’re using PagerDuty to make sure that the robots are doing everything they need to be doing to support the clients,” said Miller. In fact, it’s a huge relief for new employees when they start working for Knightscope and see that they can rely on PagerDuty to tell them when something is wrong, instead of wading through mounds of data. 

PagerDuty has proven to be adaptable and reliable as Knightscope continues to develop new features and platforms. When a technology company, especially an industry disruptor, is at the forefront of tremendous progress, they need to ensure their systems are nimble and can change quickly. “We’re still a startup culture, so we’re always changing things. It’s been really easy for the team to integrate PagerDuty into new technologies—to adapt and grow as we go. It keeps us production-ready; ready to work with our clients.” 

Knightscope has benefited from PagerDuty because it is:

  • Easy to use – Easily integrated with their software using PagerDuty’s custom APIs.
  • Adaptable and agile – Easy to implement new features, platforms, and technologies.
  • Flexible – Easy to create and modify rotation schedules and to receive notifications the way you want them.
  • Cost effective – Onsite robots, available 24/7, reduce the need for human staff.

The Future is Changing

The future for Knightscope seems limitless. In the world of technology, change is certain. With ever changing priorities, crises, budgets and access to reliable employees, companies like Knightscope are constantly looking for ways to create effective and consistent solutions. They are pioneering the new age of security, and they rely on PagerDuty to achieve their goals. 

To learn how PagerDuty can help your organization, contact your account manager or try a 14-day free trial today.

Create and Manage Maintenance Windows Through PagerDuty Mobile App by Laura Chu https://www.pagerduty.com/blog/new-mobile-maintenance-windows/ Wed, 28 Sep 2022 13:00:07 +0000 https://www.pagerduty.com/?p=78801

The post Create and Manage Maintenance Windows Through PagerDuty Mobile App appeared first on PagerDuty.

In order to respond in real-time to urgent, critical digital incidents, on-call responders must be able to take action from anywhere. 

But when on-call responders become overwhelmed with alerts, they often just “ignore them” because they cannot tell the difference between a real alert and a false one. For example, when a service is down for maintenance or upgrades, the outage could trigger multiple incidents, so the responder receives false alerts that do not pertain to a real incident. Other times, however, a service triggers critical incidents and requires the responder to dive into the problem and solve the matter fast.

On-call teams need a better solution that is more intuitive and flexible–one that allows them to disable a service as well as pause incident alerts on their mobile device, so they can focus on what matters: solving an incident without interruptions. 

We believe effective incident management empowers teams to do their jobs more efficiently while minimizing interruptions. That’s why we are excited to announce the general availability of Maintenance Windows through the PagerDuty Mobile App.

Maintenance Windows help responders temporarily disable a service, including all of its integrations, while it is in maintenance mode. When a service is in the maintenance window, all of the service’s integrations are effectively “switched off” so that no new incidents will trigger. 

Easy to create, update and delete maintenance windows from anywhere:

Creating Maintenance Windows within the mobile app takes just a few simple steps: 

  1. Choose the Service Directory from the hamburger menu and select your preferred service.
  2. Tap on settings and tap “create maintenance menu.”
  3. Enter a description to explain why this maintenance is happening.
  4. Schedule the beginning and end date (and time) for the maintenance. 
  5. Once the maintenance window expires, the service exits the maintenance mode, and new incidents can be triggered again.
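The fields the mobile flow asks for (service, description, start, and end) match what a maintenance window needs in general. As a rough sketch, assuming the request body shape of the PagerDuty REST API's `POST /maintenance_windows` endpoint, the same window could be described programmatically as below. The service ID is a placeholder, and the helper names are hypothetical; check the current API reference before relying on the exact field names.

```python
from datetime import datetime, timedelta, timezone


def build_maintenance_window(service_id, description, start, end):
    """Build a request body describing a maintenance window on one service."""
    if end <= start:
        raise ValueError("maintenance window must end after it starts")
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "description": description,
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "services": [{"id": service_id, "type": "service_reference"}],
        }
    }


def in_maintenance(window, now):
    """Return True while the window is active, i.e. while new incidents are suppressed."""
    details = window["maintenance_window"]
    start = datetime.fromisoformat(details["start_time"])
    end = datetime.fromisoformat(details["end_time"])
    return start <= now < end


start = datetime(2022, 9, 28, 22, 0, tzinfo=timezone.utc)
window = build_maintenance_window(
    service_id="PXXXXXX",  # placeholder service ID
    description="Database upgrade",
    start=start,
    end=start + timedelta(hours=2),
)
print(in_maintenance(window, start + timedelta(hours=1)))
```

The half-open interval (`start <= now < end`) mirrors step 5: once the end time is reached, the service leaves maintenance mode and incidents can trigger again.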

Creating a maintenance window

You can delete an existing maintenance window by going to settings and tapping on “end maintenance window.”

List of active maintenance windows

A maintenance window for multiple services:

  • PagerDuty’s mobile experience allows for the creation of a maintenance window on one service at a time. Users who want to create a maintenance window covering multiple services can do so through the web application.
  • An existing maintenance window that covers multiple services can still be updated or deleted through the mobile app.

This latest addition to PagerDuty Mobile empowers on-call teams to manage and respond to incidents without sacrificing time and work-life balance. We’re continuing to improve the PagerDuty mobile experience by giving teams the trusted information to continue serving their customers better. 

You can learn more about PagerDuty Mobile and the Maintenance Windows through our Knowledge Base Articles. Or try it out using the following QR code to download:

QR code for downloading PagerDuty Mobile on iOS

iOS

QR code for downloading PagerDuty Mobile on Android

Android

Want to learn more about PagerDuty Incident Response and how it works with our mobile app? Sign up for the free 14-day trial and experience how PagerDuty can help your teams respond faster and more efficiently, and drive innovation across your Operations Cloud.

Get to the Root (Cause Analysis) in 5 Easy Steps by PagerDuty University https://www.pagerduty.com/blog/get-to-the-root-cause-analysis-in-5-easy-steps/ Wed, 10 Aug 2022 13:00:45 +0000 https://www.pagerduty.com/?p=77861

The post Get to the Root (Cause Analysis) in 5 Easy Steps appeared first on PagerDuty.

What is one of the first things you should do when you are assigned an incident via PagerDuty? If you immediately thought “Acknowledge!” you are not wrong, but after that, it’s all about resolving the issue as quickly and painlessly as possible. The first step to resolution is to investigate what caused the incident in the first place so you can easily get a fix in place.

In the PagerDuty platform, Root Cause Analysis* refers to a set of features that aims to provide you, the responder, with as much context and actionable intelligence as possible. By surfacing past and related incidents, as well as insights into incident frequency, responders will have tools to quickly gain the situational awareness they need to determine probable root cause and speed up triage, and ultimately resolve faster. Likely origin points based on historical data will also be highlighted to help add context. 

Here are the five places on the incident details page to help you investigate the potential root causes:

  1. Outlier Incident
    When first opening an incident, look for the Outlier Incident classification label. This label is located directly under the incident name and reads “Frequent,” “Rare,” or “Anomaly.” Based on this classification, you can quickly gauge whether this incident has occurred before and how you might respond to it based on past experiences. Hover over the label to read its definition.
    Outlier Incident classification label of "Frequent," "Rare," or "Anomaly."
  2. Past Incidents
    Once you have determined how frequently the incident has occurred on the service, navigate to the Past Incidents tab further down the page. A heat map shows when previous incidents like this open incident have occurred over the last six months. Look for patterns in the colors (darker colors mean a higher concentration of incidents) or hover over the heat map colors to see further details about the relevant incidents. Below that are details about the Top 5 past incidents like the open incident (if there are any!) along with information about when they occurred and who last changed the incident. Note: that person would be a great resource if you want to ask what they did or see their notes on the incident. To open the incident details page for any past incident, click on the hyperlinked title.
    Past Incidents heat map
  3. Related Incidents
    Another quick source of information is the Related Incidents tab. Here you can see whether there are any ongoing incidents across all services that might be related to your issue, unlike Past Incidents, which only shows similar incidents on the same service. Understanding the scope of an incident across the business (is this isolated or part of a larger problem?) can help you understand the impact and quickly identify who you need to collaborate with to fix the problem.
    View of Related Incidents tab
  4. Probable Origins
    Jump-start your triaging efforts with the Probable Origins widget located on the incident details page. This widget calculates the likely origin percentage based on historical data, such as whether the incident occurred directly before or after an event similar to the current open incident.
    Screenshot of Probable Origins widget
  5. Change Correlation
    Lastly, it can greatly accelerate resolution when you are aware of any changes to your infrastructure or code that might have caused the incident. Change Correlation, displayed under Recent Changes on the incident details page, shows the three recent change events that are most relevant to an incident based on time, related services, or PagerDuty’s machine learning. Each recent change event indicates why the platform surfaced it, helping you to easily narrow down potential causes.
    Screenshot of Change Correlation display

Knowledge check! True or false: The Past Incidents tab displays Resolved Incidents from the same service, while Related Incidents will display only Open Incidents on other services. (see answer at the bottom of the page)

How’d you do? Remember, these are five places you can look to quickly gain context and jump-start your triaging efforts.

To solve incidents faster and help reduce downtime further, combine this set of Root Cause Analysis features with Noise Reduction and Event Orchestration capabilities. If you need a refresher, take PagerDuty University’s Event Intelligence courses and then show off your ability to work smarter, not harder, by completing the Event Intelligence Certification!

Resources for Next Steps:

Event Intelligence Courses can be found on the PagerDuty University eLearning Portal.

  • Noise Reduction
  • Event Orchestration
  • Root Cause Analysis

Event Intelligence Certification Exam information can be found on this page under “Specialty Product Certification.” As a celebration of this new series launching, we are offering complimentary registration for the exam for 30 days, so register now!

*Footnote: While we refer to this category of features as Root Cause Analysis, PagerDuty is not predicting or identifying root cause. Rather, our features help to create context around incidents to drive faster resolution. It’s also worth noting that there has been an industry shift to adopt the term probable or proximate cause rather than suggesting that there is any one true “root cause.”

Knowledge Check Answer: False. While the statement is correct that Past Incidents only displays resolved incidents from the past that were on the same service, Related Incidents will look at other active incidents – open and recently resolved – across ALL services (including the service your current incident is on) to find if any incidents are related to your current incident.
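The distinction in this answer can be sketched as two filters over a list of incidents. This is a toy model for illustration, not PagerDuty's implementation: the real features also judge how similar or related incidents are, whereas this sketch only narrows the candidate pool by service and status. The dictionary fields and the one-hour "recently resolved" window are assumptions.

```python
from datetime import datetime, timedelta, timezone

OPEN_STATUSES = ("triggered", "acknowledged")


def past_incident_candidates(current, incidents):
    """Resolved incidents on the SAME service as the current incident."""
    return [
        incident for incident in incidents
        if incident["id"] != current["id"]
        and incident["service"] == current["service"]
        and incident["status"] == "resolved"
    ]


def related_incident_candidates(current, incidents, now, recent=timedelta(hours=1)):
    """Open or recently resolved incidents on ANY service, including the current one."""
    candidates = []
    for incident in incidents:
        if incident["id"] == current["id"]:
            continue
        is_open = incident["status"] in OPEN_STATUSES
        recently_resolved = (
            incident["status"] == "resolved"
            and now - incident["resolved_at"] <= recent
        )
        if is_open or recently_resolved:
            candidates.append(incident)
    return candidates


now = datetime(2022, 8, 10, 13, 0, tzinfo=timezone.utc)
incidents = [
    {"id": "A", "service": "checkout", "status": "triggered", "resolved_at": None},
    {"id": "B", "service": "checkout", "status": "resolved", "resolved_at": now - timedelta(days=30)},
    {"id": "C", "service": "payments", "status": "acknowledged", "resolved_at": None},
    {"id": "D", "service": "checkout", "status": "resolved", "resolved_at": now - timedelta(minutes=20)},
    {"id": "E", "service": "search", "status": "resolved", "resolved_at": now - timedelta(days=2)},
]
current = incidents[0]

past_ids = [i["id"] for i in past_incident_candidates(current, incidents)]
related_ids = [i["id"] for i in related_incident_candidates(current, incidents, now)]
print(past_ids, related_ids)
```

Incident D appears in both lists: it is resolved and on the same service (a past incident), and it was resolved recently enough to count as still active (a related incident).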

More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response by Hannah Culver https://www.pagerduty.com/blog/revamped-mobile-app-for-incident-response-2022/ Tue, 12 Jul 2022 13:00:43 +0000 https://www.pagerduty.com/?p=77078

The post More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response appeared first on PagerDuty.

2020 revolutionized how we work. Many went from full-time office work to 100% remote overnight. And now that in-office is once again on the horizon, companies are thinking of ways to continue to work flexibly. However, this comes with increased challenges, and a need for tools that match this working style.

The PagerDuty mobile application is well recognized, with a 4.8-star rating on the App Store and Google Play. We understand how important it is to reach the right people immediately – that’s why we’ve made significant investments in iOS and Android to help responders resolve critical work from anywhere, anytime.

This blog post covers some of the most exciting improvements: a new navigation interface to find the information you need most, improved incident intelligence through past and present incidents, and automation that lets you trigger diagnostics and take remediation actions.

Easily navigate to find the information you need most

As a responder, you need to know when you’re on call and what services you’re on call for. If and when you do get called, it’s crucial that you can identify how the technical services you’re responsible for are performing. And most importantly, you want to be able to see all this information at a glance from your mobile app, without navigating through multiple screens and digging for information buried deep within an app.

With the new PagerDuty mobile home screen experience, the most important information that responders need is readily available. This redesign puts the top open incidents, on-call shifts and impacted technical services front and center, reducing the number of taps needed to navigate through the app. 

The redesigned home screen is now available for early access. If you’d like to try it out, you can fill out this early access form and choose the New Mobile Home Experience selection, and we’ll send you instructions.

Part of working flexibly means having critical incident context at your disposal during the moments that matter most. When an incident begins, you need to get up to speed as quickly as possible to begin making decisions on how to best mitigate impact.

One way to do this is with the new mobile incident details screen. This screen gives you a simpler visual experience and access to your most important features to help you respond to incidents faster. The most important information about an incident is available to you immediately, such as notes from other responders working on the incident, change events, past incidents, and the latest alerts.

A new carousel on the updated version of the mobile incident details screen also allows you to run a play, add a priority or note, post a status update, and more.

Gain critical incident context through the past or present

When you experience an incident, one of the biggest hurdles to jump is answering the question, “Have we seen this before?” If you have, resolution might be as straightforward as running a play or executing an automation sequence that worked before. But it often can be difficult to find that historical context, and that’s time wasted that nobody can afford.  

  • The Past Incidents feature on the PagerDuty mobile app displays incidents with similar metadata that were generated on the same service as the current, active incident. This additional context facilitates accurate triage and reduces resolution time. For example, you can see whether you, or someone on your team, has been involved in a similar previous incident, and dive into details to discover what remediation steps were taken. 

  • Change Events – Changes within the system are often the culprit behind incidents. They are often overlooked because it can be hard to pinpoint exactly which change caused the incident, especially when many organizations are shipping new code dozens or even hundreds of times per day. However, “Gartner estimates that approximately 85% of all performance incidents can be traced.” Change Events enable you to look at changes impacting your environment and help you establish the potential root cause. Change Events can be viewed in two areas of the mobile app: the new mobile incident details screen and service details. Either tap on a desired incident and scroll to Change Events, or navigate to the Service Directory and select a service to view a maximum of two Change Events. Event details displayed include the date and time, summary, service, type, links, and source.

  • Another important piece of information during incident response is understanding the impact radius of an issue. One way to glean this information is by understanding Service Dependencies. If a large, customer-facing business service is experiencing problems due to the technical service incident, you’ll need to respond faster and with more contextual intelligence to better understand the scope of the problem. With Service Dependencies in the mobile app, you can view what services are affected to better understand scope. Service Dependencies are listed within each particular service’s profile in the Service Directory.
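The event details listed under Change Events above (date and time, summary, links, and source) correspond to what a deploy pipeline supplies when it reports a change. As a rough sketch, assuming the public Change Events endpoint of the PagerDuty Events API v2, a change event payload might be assembled like this. The routing key, repository URL, and pipeline name are placeholders, and the helper name is hypothetical.

```python
from datetime import datetime, timezone

# Public PagerDuty Events API v2 endpoint for change events (not called in this sketch).
CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(routing_key, summary, source, timestamp, links=(), details=None):
    """Build a change event body; `links` is an iterable of (text, href) pairs."""
    event = {
        "routing_key": routing_key,  # integration key of the affected service
        "payload": {
            "summary": summary,
            "source": source,
            "timestamp": timestamp.isoformat(),
        },
        "links": [{"href": href, "text": text} for text, href in links],
    }
    if details:
        event["payload"]["custom_details"] = details
    return event


change = build_change_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Deploy build 1234 to production",
    source="ci-pipeline",                # hypothetical source name
    timestamp=datetime(2022, 7, 12, 13, 0, tzinfo=timezone.utc),
    links=[("View commit", "https://example.com/repo/commit/abc123")],
    details={"build": 1234},
)
print(change["payload"]["summary"])
```

Change events are informational: they do not open incidents or notify anyone, but they add context that responders can review when an incident does occur.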

Leverage automation for faster response

As technology environments become more complex, it’s more important than ever to conserve people’s time and cognitive resources. This means ensuring that machines, not humans, serve as the first line of defense. 

Automating repetitive manual tasks and well-understood incidents can divert unnecessary toil away from responders so that they can focus on their day jobs, and are only called for the incidents where they’re needed most. One way to do this is with automation that you can run with the tap of your finger.

PagerDuty Automation Actions is now generally available within PagerDuty mobile, empowering you to trigger automated diagnostics and take remediation actions from anywhere, anytime. It improves productivity by automating repeated diagnostic and remediation steps, replacing the toil of manual tasks. In addition to running the scripts, you can view previously run scripts and output reports directly from the mobile app. These update in real time, meaning you never miss a thing.

These latest additions to the PagerDuty mobile application help you work the way you want without sacrificing time, quality, or customer experience. Flexible work is here to stay, and PagerDuty’s powerful mobile application is committed to helping you make the most of it. If you haven’t tried our mobile application in a while, it’s time to take a second look. Use the QR code to download it for iOS or Android.

Important: Ensure your mobile experience is secure

With so many great new features added to PagerDuty mobile, we are introducing new minimum OS requirements to ensure the mobile app continues to be innovative and secure and to improve the user experience. Starting June 27, 2022, future versions of the PagerDuty mobile app will require Android 9.0 or iOS 14.0 or higher. Please ensure your device is upgraded to continue receiving mobile app updates.

The Future of Incident Response is Automated, Flexible, and Proactive by Vivian Chan https://www.pagerduty.com/blog/summit-incident-response-updates/ Tue, 07 Jun 2022 13:00:31 +0000 https://www.pagerduty.com/?p=76691

The post The Future of Incident Response is Automated, Flexible, and Proactive appeared first on PagerDuty.

We know our customers rely on PagerDuty as the backbone of critical real-time operations, so we want to make sure each and every enhancement helps streamline incident response. How can we help our customers spend less time firefighting and more time innovating? 

One of PagerDuty’s values is Champion the Customer – and we take this very seriously. When building and improving features, we aim to keep a pulse on what’s going on with our customers: what’s keeping them up at night? What do they need today? How have their circumstances changed recently? And how can we help them scale their goals for tomorrow?

I sat down with Dan McCall, VP of Product for Incident Response, to learn more about his philosophy for building on the legacy of PagerDuty’s best-in-class incident response solution. To hear about all the features that Dan’s team is building at PagerDuty, check out his session, “Incident Response Keynote: Automated, Flexible, Proactive”. Registration is easy: just click here.

Q: So Dan, are there any patterns that have emerged from speaking with customers? What’s top of mind?

I’m hearing customers talking a lot about maximizing efficiency, minimizing toil, and generally becoming more data-driven so that they can build resilience at scale. What’s interesting is that this is the case whether they’re just getting started on their DevOps journey or have been at it for years. This makes sense – complexity is increasing and incidents are happening more often across the board, but it impacts customers differently. For some, just getting the right person at the right time is the goal, while others prioritize fine-tuning response to streamline ongoing processes and contain impact to responder health. 

But there’s one thing that I hear most, and it’s that while building resilience and scaling efficiency are challenging to solve in the best of times, everything has become a whole lot harder because of the “Great Resignation.” In fact, in our most recent customer survey, 64% of our respondents said that they’re experiencing increased turnover this year. It goes without saying that attrition puts added strain on teams – it takes resources to hire and onboard new people, and running understaffed can lead to a vicious cycle of even more manual toil and burnout. And this situation drives even more urgency for getting operations into a healthier, more mature state. 

Q: What do you mean when you say operational maturity? 

Operational maturity is about providing a better, more predictable experience for your teams so you can address and get ahead of the underlying issues behind attrition and burnout, with process and behavior to turn the corner on some of that potential turnover. 

We created this digital operations maturity model after looking at teams and organizations across our platform, codifying the behaviors that we observed. 

For those of you who might be newer to operational maturity: customers often ask us what ‘good’ looks like. To help organizations measure their operational maturity, we developed the Digital Operations Maturity Model. The model gives organizations a way to define operational maturity, learn how to identify where they fall on the spectrum, and understand where to focus their efforts to improve.

To take this a step further and make it even more tangible, our product analytics team modeled the operational maturity model with data on our platform. We see that reactive teams consistently experience higher turnover than preventative teams – just last quarter, the delta was over 2x! When you think about that against the backdrop of the Great Resignation, it’s clearer than ever before that our products can make a big difference in helping our customers with their most pressing operational challenges. I’d highly recommend you check out this talk, “Getting from Reactive to Proactive (and Beyond!)”, from Scott Bastek and Tejere Oteri, which you can access by registering here

Q: How does what you’ve been hearing from customers shape your vision for the future of our incident response solution? 

When thinking about where we can steer our product to best help our customers achieve this transformation and level up their operational maturity, my team’s vision is to make incident response more:

  • Automated to eliminate waste and inefficiency
  • Flexible to address a multitude of unique business needs at scale
  • Proactive to anticipate and prevent business disruption

And we’re going to do this while staying true to the core of what our customers know and love about PagerDuty.

Q: Automation can mean a lot of things to a lot of people – when you think about Automated Incident Response, what does it mean to you? 

Automated Incident Response to me is humans and machines working better together. To help illustrate this, I often think about the concept of Centaur Chess. The TLDR version is: AI can beat a human at the game of chess, but a human paired with AI can beat pure AI. 

Automation as the first line of defense empowers teams to balance critical workloads between humans and their machines, helping humans work smarter when they’re needed, and removing the burden when they’re not. There’s plenty in the incident response process that involves manual toil or well-understood tasks – our goal is to remove that unnecessary burden from your humans, so that the humans you have can stay focused and do better at their jobs. 

One example of how we’re enabling this is by making it possible to call Automated Diagnostics right from your mobile app, so that your responder doesn’t have to manually run through a rote set of tasks associated with standard diagnostics when they get to their desk. With automation, it’s already run and ready to go by the time your responder gets to the incident. 

At its best, automation and AI can take care of things that your teams shouldn’t be doing in the first place. Helping people do less repetitive, manual work helps them stay more engaged, which reduces burnout and helps with attrition. More time to think and focus on how to innovate also means having the extra cycles you need to learn from incidents and improve processes to build the resilience that you want. 

Q: PagerDuty has been actively investing in several acquisitions – how has this tied into your roadmap? 

We’re thrilled to harness really strong partnerships with our most recent acquisitions, Rundeck in 2020 and Catalytic earlier this year, to spin out better experiences for our customers. 

For Incident Response, we’ve been working with our colleagues from the Rundeck acquisition to take their product (now known as Process Automation) and embed Automation Actions deeply within our Incident Response experience, starting from ingest and Event Orchestration, to Mobile, and even our web experience.

First-line responders often find themselves actioning the same, recurring diagnostic steps when it comes to incident triage and remediation, which takes time away from high-value work, keeps specialists firefighting instead of innovating, and prolongs MTTR. So making it as simple and light as possible for teams to start leveraging automation in their incident response lifecycle was really important to us. With the ability to call Automated Diagnostics in any number of ways, teams can save time that they would’ve had to spend on rote, manual tasks. Instead, they can have the results ready by the time the responder gets to their desk. 

With Catalytic, we’re taking a different approach. When an incident strikes, organizations typically have a checklist of important steps to run through, which are often manual and hard to remember, especially in the heat of the moment at 2 a.m.! Finding and remembering these steps can distract the response team from its main focus: resolving the incident. We’ve had lightweight response plays for a few years now and have been asked by customers for more ways to automate steps of their incident response processes with more flexibility, which is why we’re excited to introduce Incident Workflows. 

Coming later this year, we will be upgrading our lightweight response plays into powerful Incident Workflows based on the new workflow engine from our Catalytic acquisition. These workflows will allow you to define an orchestrated response using “if-this-then-that” logic, making it effortless to configure a sequence of common incident actions (such as adding a responder, subscribing stakeholders, or starting a conference bridge).

You can customize your Incident Workflows to reflect your organization’s unique processes for any number of use cases, such as by incident priority, status, or urgency. And as you learn from an incident, you then can encode that learning back into your workflows to automate those repetitive and mundane tasks for the next time an incident occurs.
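To make the “if-this-then-that” idea concrete, here is a minimal Python sketch of a condition-plus-actions workflow. It is not the Incident Workflows engine or any PagerDuty API: the `Incident` and `Workflow` classes, the action helpers, and the example names are all hypothetical and exist only to illustrate how a condition on priority or urgency could select a sequence of actions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    title: str
    priority: str  # e.g. "P1" through "P5"
    urgency: str   # "high" or "low"
    log: List[str] = field(default_factory=list)


@dataclass
class Workflow:
    name: str
    condition: Callable[[Incident], bool]       # the "if this" part
    actions: List[Callable[[Incident], None]]   # the "then that" part


# Hypothetical actions: each one only records what it would have done.
def add_responder(name: str) -> Callable[[Incident], None]:
    def action(incident: Incident) -> None:
        incident.log.append(f"added responder {name}")
    return action


def subscribe_stakeholders(group: str) -> Callable[[Incident], None]:
    def action(incident: Incident) -> None:
        incident.log.append(f"subscribed stakeholders {group}")
    return action


def start_conference_bridge(incident: Incident) -> None:
    incident.log.append("started conference bridge")


def run_workflows(incident: Incident, workflows: List[Workflow]) -> List[str]:
    """Run, in order, the actions of every workflow whose condition matches."""
    for workflow in workflows:
        if workflow.condition(incident):
            for action in workflow.actions:
                action(incident)
    return incident.log


WORKFLOWS = [
    Workflow(
        name="major incident",
        condition=lambda i: i.priority == "P1",
        actions=[
            add_responder("incident-commander"),
            subscribe_stakeholders("executives"),
            start_conference_bridge,
        ],
    ),
    Workflow(
        name="low priority",
        condition=lambda i: i.priority == "P5",
        actions=[add_responder("service-owner")],
    ),
]

p1 = Incident(title="Checkout is down", priority="P1", urgency="high")
print(run_workflows(p1, WORKFLOWS))
```

Encoding a lesson from a past incident then amounts to appending one more action or workflow to the list, which is the kind of feedback loop described above.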

Q: Which of these announcements do you think our customers will be most excited about?

It’s hard to pick just one, so I’ll tease two and you’ll have to check out my session to hear about all the goodness we’ve got in store for you. First, I think customers are going to be really excited about where we’re taking the future of Response Plays. We’ve already been getting some amazing feedback on how Incident Workflows will deliver a step-function-level improvement on Response Plays with the powerful UI and modular flexibility based on things like priority. I’m personally really excited to see what customers will do with Incident Workflows and how they make them their own. One of the beautiful parts of building this “in a platform way” is that, although we’re showcasing how it can be useful in major incidents, it can be used in a multitude of other ways. You can hear more about this in my session at Summit where Stephanie Gridley, a Resilience Manager from Wayfair, details how their team might use the functionality for both P1 and P5 incidents. 

Customers will also be very happy about seeing some updates on some core features that they’ve wanted for a long time, such as Status Update Notification Templates. What’ll get even more interesting is when these features eventually start feeding into each other to do even cooler things. It’s the nexus of these features working in context with one another that provides a multiplier impact greater than the sum of the parts.

If you want to learn more about what else is on the Incident Response roadmap for this year, check out Dan’s virtual keynote session: “Incident Response Keynote: Automated, Flexible, Proactive.” It’s not too late to register for PagerDuty Summit – register here.

The post The Future of Incident Response is Automated, Flexible, and Proactive appeared first on PagerDuty.

]]>
What’s New: Updates to Incident Response, AIOps, PagerDuty® Process Automation, and More! by Vera Chan https://www.pagerduty.com/blog/whats-new-product-update-2022-05-31/ Tue, 31 May 2022 13:00:45 +0000 https://www.pagerduty.com/?p=76375 Summit’s right around the corner (have you registered yet?) but the shipping doesn’t stop! We’re excited to announce a new set of updates and enhancements...

The post What’s New: Updates to Incident Response, AIOps, PagerDuty® Process Automation, and More! appeared first on PagerDuty.

]]>
Summit’s right around the corner (have you registered yet?) but the shipping doesn’t stop! We’re excited to announce a new set of updates and enhancements to PagerDuty’s Digital Operations Platform. Recent updates from the product team span On-Call Management, Incident Response, Process Automation, and Integrations, as well as PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, do the following, and more:

  • Quickly and easily find what you need with the new on-call scheduling UI. Keep stakeholders informed with the necessary level of content and context within status update notification templates during Incident Response
  • Leverage new PagerDuty® Process Automation and PagerDuty® Runbook Automation plugins to empower first-responders to run automated diagnostics on AWS infrastructure and applications, reducing escalations to and engagement of others
  • For AIOps, you can now Copy & Paste Rules in Event Orchestration (Available to all plans)
  • Take action on webhook Integration requirements and try out our most recently updated PagerDuty App for Splunk from our Partner Ecosystem
  • Stay in the know with Webhook V1/V2 Product Deprecations and User Administration calls to action
  • Sign up for upcoming Events, view our recent Podcasts and Twitch Streams, and join members of our Community Team and Product Teams as they walk you through new products and leading practices in the software industry

Incident Response

Schedules List UI Refresh

The new visual experience on our Schedules page is now generally available on all accounts. This redesign delivers the following to our customers:

  • Enhanced search functionality
  • Toggling between all schedules and team schedules
  • Collapsing or expanding information based on the level of detail needed
  • Easier schedule comparison with time windows displayed
  • Streamlined view of on-call responders with shift times

Early access rollout to accounts is ongoing.

Learn more about Schedules

Status Update Notification Templates

Our new notification templates give you standardized and reusable formatting for status updates during incidents. You can now:

  • Easily include common incident details via a drop-down menu that populates key information
  • Add logos, hosted third-party images, and other contextual information, as needed
  • Ensure updates consistently adhere to your company’s internal communication standards

This feature has already started ramping to customers and should reach all customers by mid-June.

See it in action below!

Learn more in the knowledge base or watch it in action later

New SMS Formatting Standard for Incident Notifications (including Status Updates)

We’re standardizing the way that incident notifications are formatted for SMS notifications. As of May 23rd, all SMS incident notifications (including status updates) now start with the prefix [PagerDuty].

Learn more about notifications

PagerDuty® Process Automation Software and PagerDuty® Runbook Automation

New AWS Plugins for Automated Diagnostics

Want to reduce the number of people pulled in to troubleshoot incidents? PagerDuty® Process Automation already helps first responders fetch diagnostic data typically only retrievable by domain experts. Now users can get up and running with auto-diagnostics for AWS faster via new AWS plugins. The Amazon CloudWatch Logs plugin retrieves diagnostic data from AWS infrastructure and applications, while the AWS Systems Manager plugin allows users to manage their global VM footprint with security best practices.

AIOps

Event Orchestration: Copy/Paste Event Rules

Ever wanted to reuse service rule configurations to avoid re-configuring rule conditions and actions from scratch? You can now simply “copy” any Service Orchestration rule and then “paste” that rule into any Service Orchestration to reuse rule conditions and actions. The best part? You still have the ability to tweak those conditions/actions for each specific service/orchestration.

Learn more about Event Orchestration or Copying Service Rules

User Administration

Email Domain Restriction Verification – Action Required!

PagerDuty has changed the default setting for the “Email Domain Restriction” feature in Account Settings from “OFF” to “ON”. Since April 15th, the feature has been enabled automatically for customers; with this update, email domains are restricted to domains that are owned by the account. This applies only to login email addresses, and not to contact method email addresses.

The following exceptions apply to the automatic enablement of the Email Domain Restriction feature for existing customers:

  1. Email addresses using common carrier domains (yahoo, gmail, aol etc.) will not be restricted
  2. The restriction will not be enabled for accounts with more than 10 email domains.
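The rule and its two exceptions can be sketched as a small check. This is only our reading of the announcement, not PagerDuty’s implementation: the function names are hypothetical, the carrier list is an illustrative subset (the announcement says “yahoo, gmail, aol etc.”), and the sketch matches domains exactly without any subdomain handling.

```python
# Illustrative subset only; the real list of common carrier domains is not given.
COMMON_CARRIER_DOMAINS = {"yahoo.com", "gmail.com", "aol.com"}

# Exception 2: accounts with more than 10 email domains are not auto-enabled.
MAX_DOMAINS_FOR_AUTO_ENABLE = 10


def restriction_auto_enabled(account_domains):
    """Return True if the restriction would be switched on automatically."""
    return len(set(account_domains)) <= MAX_DOMAINS_FOR_AUTO_ENABLE


def login_email_allowed(email, account_domains):
    """Return True if a login email passes the (sketched) domain restriction."""
    domain = email.rsplit("@", 1)[-1].lower()
    if not restriction_auto_enabled(account_domains):
        return True  # restriction is off, so nothing is blocked
    if domain in COMMON_CARRIER_DOMAINS:
        return True  # exception 1: common carrier domains are not restricted
    return domain in {d.lower() for d in account_domains}


print(login_email_allowed("dev@example.com", ["example.com"]))
print(login_email_allowed("dev@other.org", ["example.com"]))
```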

ACTION: Account owners, please check that the settings and email (sub)domains are set appropriately for your needs. For further details on this topic, please refer to the knowledge base.

Integrations and Partner Ecosystem

PagerDuty App for Splunk v4.0.1

Ever feel like you’re spending too much time managing alert configurations and settings or feel overwhelmed by alert storms? Now generally available, the latest PagerDuty App for Splunk (v4.0.1) enables PagerDuty and Splunk customers to:

  • Allow PagerDuty to leverage the Splunk Alerts framework to save time and energy
  • Streamline and group incident data within PagerDuty in a way that makes sense to you to reduce alert noise and information overload

Learn more on Splunkbase

Webhooks IP Addresses

Do you have any webhooks configured in PagerDuty? If so, please be aware that since May 5, 2022, the official list of IP addresses that PagerDuty uses to send webhook calls has been provided on this page.

Please safelist those IP addresses to ensure service continuity.

Learn more about webhooks IPs

Product Deprecations

Please take note and keep your teams informed of our upcoming product deprecations:

V1/V2 Webhooks

If you are currently using V1/V2 webhook extensions in your PagerDuty environment, you need to migrate them to V3 webhook subscriptions to maintain functionality. Please follow our migration guide.

Important Dates:

  • V1 Webhooks – V1 webhook extensions have been unsupported (no new features or bug fixes) since November 13, 2021 and will stop working in October 2022.
  • V2 Webhooks – V2 webhook extensions will be unsupported in October 2022 and will stop working in March 2023.

Required Permissions:

  • Admins or Account Owners can migrate an entire account.
  • Team Managers can only migrate webhooks for their assigned Teams.

What are Webhooks? Webhooks allow you to receive HTTP callbacks when significant events happen in your PagerDuty account, for example, when an incident triggers, escalates, or resolves. Details about the event are sent to your specified URL, such as Slack or your own custom PagerDuty webhook processor.
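To make the callback idea concrete, here is a minimal Python sketch of what a custom webhook processor might do with a received request body. The payload shape (an `event` object with `event_type` and `data` fields), the event type strings, and the `summarize_event` helper are assumptions for illustration; check the webhook documentation for the actual schema before relying on any field.

```python
import json

# Hypothetical mapping from event type to a human-readable verb.
# The event type strings are assumptions, not a confirmed list.
VERBS = {
    "incident.triggered": "triggered",
    "incident.escalated": "escalated",
    "incident.resolved": "resolved",
}


def summarize_event(body: bytes) -> str:
    """Turn a webhook request body into a one-line summary."""
    payload = json.loads(body)
    event = payload.get("event", {})
    event_type = event.get("event_type", "unknown")
    title = event.get("data", {}).get("title", "(no title)")
    verb = VERBS.get(event_type)
    if verb is None:
        return f"Ignored event of type {event_type}"
    return f"Incident {verb}: {title}"


# A made-up request body in the assumed shape.
sample = json.dumps({
    "event": {
        "event_type": "incident.triggered",
        "data": {"title": "Server is on fire"},
    }
}).encode()

print(summarize_event(sample))
```

A real processor would run a function like this behind an HTTP endpoint at the URL you register, and should confirm that a request genuinely comes from PagerDuty before acting on it.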

If you have any questions, please reach out to your PagerDuty contact or our support team at support@pagerduty.com.

Learn more about webhooks

Webinars & Events

Join us for the following webinars and events to learn more about PagerDuty’s recent product updates and how they benefit customers.

PagerDuty Summit 2022 (In-Person and Online)

Join us to learn new ways to orchestrate, accelerate, and elevate your critical work. Enjoy inspiring keynotes, exclusive product demos, hands-on training courses, and new perspectives on developer operations, site reliability engineering, artificial intelligence for IT Operations, process automation, and more! We look forward to empowering you with new ideas and skills, so you can be ready for anything in your digital operations. Join us in-person in San Francisco (June 7th), Sydney (June 15th), and London (June 21st), or online!
Register Today!

Supercharge Your AWS Cloud Platform with Self-Service Cloud Ops

Join Mandi Walls (PagerDuty), Mark Kriaf (AWS), and Cody Brown (Techstrong Group), as they share how you can:

  • Delegate self-service automation to run faster and eliminate toil
  • Automate AWS infrastructure tasks through one interface
  • Connect AWS operations to IT process workflows
  • Turn AWS requests into self-service operations for end users

Learn More

Register for upcoming events in June here!

PagerDuty Community Twitch Stream

Join us on our Twitch channels, PagerDuty Twitch Stream and PagerDuty Community Twitch Stream, to catch up on one of our latest streams led by our Developer Advocates! Catch our past streams via the YouTube Twitch Streams Channel.

If your team could benefit from any of these enhancements, be sure to contact your account manager and sign up for a 14-day free trial.

The post What’s New: Updates to Incident Response, AIOps, PagerDuty® Process Automation, and More! appeared first on PagerDuty.

]]>
What’s New: Updates to Event Intelligence, On-Call Management, Automation, Mobile, and More! by Vera Chan https://www.pagerduty.com/blog/whats-new-product-update-2021-01-31/ Mon, 31 Jan 2022 14:00:39 +0000 https://www.pagerduty.com/?p=73668 We’re excited to announce a new set of updates and enhancements to the PagerDuty platform. Recent updates from the product team include On-Call Management, Event...

The post What’s New: Updates to Event Intelligence, On-Call Management, Automation, Mobile, and More! appeared first on PagerDuty.

]]>
We’re excited to announce a new set of updates and enhancements to the PagerDuty platform. Recent updates from the product team span On-Call Management, Event Intelligence, and Mobile Products, as well as PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, do the following, and more:

  • Better control the noise and reduce the amount of manual event processing across your systems with powerful Event Intelligence
  • Improve Automation security with several recent Rundeck releases
  • Reduce responder burnout and ensure that responders are always available to address critical incidents quickly and more effectively with improved On-Call Management capabilities
  • View critical context and access important incident response features via the latest updates to PagerDuty’s Mobile App
  • View product demos of our PagerDuty App for ServiceNow and a recap of our Event Intelligence capabilities from 2021 from our Webinars & Events
  • Learn from peers in the industry about how to integrate PagerDuty with Rundeck and Lacework over Twitch Streams

We also welcome you to join us as Design Partners for new features and hope you’ll discover more about how different teams across your organization can embrace and benefit from the PagerDuty Operations Cloud through our collection of Solution Guides.

Solution Guides

Did you know that PagerDuty can be used by departments outside of IT? That’s right! Check out the Solution Guides we’ve shared that help teams like marketing, sales, finance, and human resources use PagerDuty to handle critical business functions.

Event Intelligence

Event Intelligence (EI) is an ML-powered solution helping organizations reduce downtime and deliver on customer experiences. The latest EI features provide teams with actionable insights to help reduce noise, drive to root cause, and automate manual processes for fewer incidents and faster resolution.

Event Orchestration

Event Orchestration helps teams reduce noise and cut down on manual event processing to both improve operational efficiency and reduce toil. The event orchestration decision engine enables teams to create custom logic to enrich, modify, and control the routing of events to the right teams based on event conditions at scale!

Combine nested event rules with machine learning and targeted automation to trigger diagnostic and remediation actions, such as retrieving system health stats, implementing self-healing, or automatically rolling back a deployment and restarting a server.
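As a rough illustration of nested, condition-based rules, the Python sketch below enriches and routes an event by walking a small rule tree. It is not the Event Orchestration engine or its configuration format: the `Rule` class, the `apply_rules` function, the first-match-wins behavior, and the event fields (`source`, `summary`, `route`, `severity`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    label: str
    when: Callable[[Event], bool]   # condition evaluated against the event
    then: Callable[[Event], None]   # enrichment or routing applied in place
    children: List["Rule"] = field(default_factory=list)  # nested rules


def apply_rules(event: Event, rules: List[Rule]) -> Event:
    """Apply the first matching rule at this level, then descend into its children."""
    for rule in rules:
        if rule.when(event):
            rule.then(event)
            apply_rules(event, rule.children)
            break
    return event


RULES = [
    Rule(
        label="database events go to the database team",
        when=lambda e: e.get("source", "").startswith("db-"),
        then=lambda e: e.update(route="database-team"),
        children=[
            Rule(
                label="disk alerts are downgraded and annotated",
                when=lambda e: "disk" in e.get("summary", "").lower(),
                then=lambda e: e.update(severity="warning", note="see disk usage runbook"),
            ),
        ],
    ),
    Rule(
        label="catch-all",
        when=lambda e: True,
        then=lambda e: e.update(route="ops-triage"),
    ),
]

routed = apply_rules({"source": "db-primary-1", "summary": "Disk usage at 91%"}, RULES)
print(routed["route"], routed["severity"])
```

Nesting keeps the broad routing decision (which team) separate from the finer adjustments that only make sense once that decision has been made.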

View the demo to learn about Event Orchestration.

On-Call Management

The health, happiness, and overall well being of teams is key to every organization’s success. Our on-call management capabilities strive to help protect an organization’s most valuable asset—the people.

Round Robin Scheduling

Now generally available on Business and Digital Operations plans, Round Robin Scheduling allows teams to equitably distribute on-call shift responsibilities amongst multiple team members. It works by automatically assigning new incidents across different users on a team to ensure that teams can resolve incidents as efficiently as possible with less risk of burnout.
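The general round-robin idea can be sketched in a few lines of Python. This is a generic illustration, not PagerDuty’s scheduling logic; the function name, responder names, and incident IDs are made up for the example.

```python
from collections import Counter
from itertools import cycle


def assign_round_robin(incidents, responders):
    """Assign each incident to the next responder in a repeating rotation."""
    if not responders:
        raise ValueError("at least one responder is required")
    rotation = cycle(responders)
    return {incident: next(rotation) for incident in incidents}


assignments = assign_round_robin(
    ["INC-1", "INC-2", "INC-3", "INC-4", "INC-5"],
    ["alice", "bob", "carol"],
)
print(assignments)

# Load per responder never differs by more than one incident.
print(Counter(assignments.values()))
```

Because the rotation simply wraps around, no responder ever carries more than one incident above anyone else, which is the fairness property the feature description emphasizes.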

Automation

Rundeck 3.4.7, 3.4.8, 3.4.9, and 3.4.10 Releases

Check out the latest Rundeck features and enhancements for Rundeck Enterprise and Rundeck Community:

  • HashiCorp Vault Plugin includes more logging and supports authentication to a different namespace from where passwords are stored for enhanced security
  • Bump Spring Security version from 5.1.11 to 5.2 for enhanced security
  • Log4j updated to 2.17.0 for both Rundeck Enterprise and Rundeck Community to address recent Log4j vulnerabilities
  • Password Reset feature on Rundeck Enterprise lets you reset Local User passwords by sending an email with a reset link rather than resetting a password directly on an account

Learn more about these updates and other core product updates and bugfixes from the release notes for 3.4.7, 3.4.8, 3.4.9, 3.4.10.

Incident Response

Mobile Incident Details Refresh

The new and improved mobile Incident Details screen now provides easier access to all of your favorite features during the incident response process from one place. Run a play, add a priority or note, post a status update, and more with the new carousel.

Learn more in the knowledge base through the latest mobile release notes

Product Deprecations

Please take note and keep your teams informed of our upcoming product deprecations, listed by the nearest date:

  • March 31, 2022 – Webhooks V2 will be replaced by Webhooks V3, which is now generally available.

Webinars & Events

Join us for the following webinars and events to learn more about PagerDuty’s recent product updates and how they benefit customers.

  • How to Reduce Noise and Manage Event Routing with PagerDuty – Hear from Frank Emery, Senior Product Manager at PagerDuty, as he guides you through how you can deal with unmanageable levels of noise and complexity with PagerDuty Event Orchestration. View the webinar on demand
  • What’s New in Event Intelligence – 2021 Roundup – Catch the latest updates in Event Intelligence available on demand in a product demo and Q&A led by Vivian Chan, Frank Emery, and Vera Chan. View the webinar on demand
  • Event Intelligence 101 with PagerDuty University – Hannah Lodise goes in-depth to demonstrate how Event Intelligence features can help teams reduce system noise, quickly troubleshoot issues, and remove manual toil for faster resolution. View the webinar on demand
  • Integrate ServiceNow with PagerDuty to Improve Major Incident Management Response Times – Enjoy a product demo of the ServiceNow integration and Q&A by Laura Chassagne and Vera Chan to learn more about leveraging CMDB data, gaining more visibility and context, incident response, as well as automated diagnostics and self-healing to help drive down incident resolution times. View the webinar on demand
  • PagerDuty Pulse Q3 – Catch up on all of our latest product demos combined on demand within a playlist via the Q3 PagerDuty Pulse. View the Q3 PagerDuty Pulse Today!

PagerDuty Community Twitch Stream

Join us on our Twitch channels, PagerDuty Twitch Stream and PagerDuty Community Twitch Stream to catch up on one of our latest streams led by our Developer Advocates!

  • Subscribe and get notified when we’re live and view previous recordings
  • Missed one of our broadcasts? Watch any of these recent Twitch streams or YouTube videos:

If your team could benefit from any of these enhancements, be sure to contact your account manager and sign up for a 14-day free trial.

The post What’s New: Updates to Event Intelligence, On-Call Management, Automation, Mobile, and More! appeared first on PagerDuty.

]]>
New Tech Leader Survey Reveals Why the Time for Real-Time Operations is Now by Vivian Chan https://www.pagerduty.com/blog/new-tech-leader-survey-reveals-why-the-time-for-real-time-operations-is-now/ Wed, 10 Nov 2021 14:00:36 +0000 https://www.pagerduty.com/?p=72419 “Customer obsessed.” “Customer-centric.” “Customer-first.” For CEO’s everywhere, setting and maintaining a coordinated focus on the customer has become a top priority when driving innovation. After...

The post New Tech Leader Survey Reveals Why the Time for Real-Time Operations is Now appeared first on PagerDuty.

]]>
“Customer obsessed.” “Customer-centric.” “Customer-first.” For CEOs everywhere, setting and maintaining a coordinated focus on the customer has become a top priority when driving innovation. After all, for many organizations regardless of industry, digital customer experiences are what can make or break the bottom line. The adoption of DevOps and cloud technology has revolved around unlocking agility and scale as a competitive advantage to attract and keep these digital customers.

With this focus on digital, all eyes are on technical leadership to deliver. Whether B2B or B2C, modern customer expectations for digital, hybrid, and omnichannel experiences are higher than ever: everything must be seamless and available 24/7, 365 days a year. Reliability and innovation are business imperatives.

But digital is hard. As with any complex system, technology will break down, and trends show that digital incidents are on the rise across the board. Left unchecked, these incidents threaten to damage not only customer experiences and business revenue, but also employee morale and retention.

As the operations cloud for the modern enterprise, we know that uptime is money, and we want to keep a pulse on how technical leaders globally are thinking about the needs of the business and how they’re planning to balance innovation with incident response in a way that can unlock better digital operations, enable faster remediation, and keep the business always on. To do this, we commissioned a global survey of 700 senior IT and development decision makers in large enterprises across North America, EMEA and APJ. Here’s what we learned.

The pressure to deliver on digital has never been greater, but traditional ITOps models aren’t keeping up

Today’s technical leaders are not only expected to deliver innovation alongside cost savings, they also have to ensure services are reliable around the clock. This increased pressure comes at a time when team health and happiness are top of mind, as the Great Resignation continues to have an impact across industries and leaders are keen to find strategies for preventing burnout and attrition.

Against this backdrop, most tech leaders (68%) say that they are doubling down on their digital transformation strategy, and 66% say that those strategies are becoming more aligned to the business. This means more cloud migration and increasing adoption of microservice architecture to give development and IT teams the flexibility and scale to create and deliver at speed.

At the same time, the survey shows that traditional operating models are holding leaders and their organizations back. The overwhelming majority (91%) say that traditional IT Operations, which were not built for the dynamic and complex technology stacks of today, are no longer fit for purpose in the digital era.

Incidents impact both the business and its humans

The increasing volume of digital traffic, coupled with organizational inability to manage incidents effectively, hits both the bottom line and team morale. Innovation is sacrificed, with 3 in 5 leaders believing that innovation has “taken a backseat” to firefighting digital incidents.

Not having an effective way of managing incidents can also hurt revenues. Forty percent of leaders say their organizations have lost revenue as a result of rising digital incidents.

This increased burden also has a real human impact and poses a risk to teams’ wellbeing. Of the tech leaders we asked, 38% have seen an increase in employee burnout in technical teams. To contain this, workload management and team health must be a top priority to keep individuals happy and productive. To make that possible, leaders need a more dynamic, sustainable way to handle digital incidents.

When Seconds Matter, Real-Time Digital Operations is the Answer

The good news is, leaders are aware of these challenges and they want to do better. The majority (70%) say that if they are to innovate at pace, they need a new way to deal with digital incidents. Doing this successfully requires a shift from traditional, ticket-based ITOps approaches towards real-time operations that mobilize response teams in seconds, drive collaboration, and give deep context on digital incidents.

And they believe that real-time operations can help. In fact, 65% of respondents say that adopting real-time digital operations will allow them to reduce the cost of ITOps and accelerate innovation.

What’s more, there are some clear steps that leaders have identified that can help them realize this modern vision of real-time operations. One key investment? Automation. Almost three quarters (73%) have already invested, or are planning to invest, in AIOps tools and automation to increase productivity and reduce the toil of manual, repetitive work on technical teams.

Here’s an infographic that summarizes some of the key findings from the report – to read the whole study, visit this link.

Download the full Digital Dependency in 2021: The Urgency of Real-Time Operations report, here.

Methodology
This report is based on a global survey of 700 senior IT and development decision makers in large enterprises with over 1,000 employees, conducted by Coleman Parkes and commissioned by PagerDuty. The sample included 200 respondents in the U.S., 100 in each of the UK, France and Australia, 50 in each of Japan and New Zealand, 34 in Germany, and 33 in each of Austria and Switzerland.
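As a quick sanity check, the country quotas above can be summed to confirm that they match the stated sample of 700. The numbers come straight from the paragraph above; the variable names and the percentage breakdown are ours.

```python
# Respondents per country, as listed in the methodology paragraph.
SAMPLE = {
    "U.S.": 200,
    "UK": 100,
    "France": 100,
    "Australia": 100,
    "Japan": 50,
    "New Zealand": 50,
    "Germany": 34,
    "Austria": 33,
    "Switzerland": 33,
}

total = sum(SAMPLE.values())
print(total)

# Share of the total sample per country, in percent, rounded to one decimal.
shares = {country: round(100 * n / total, 1) for country, n in SAMPLE.items()}
print(shares["U.S."])
```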

The post New Tech Leader Survey Reveals Why the Time for Real-Time Operations is Now appeared first on PagerDuty.

]]>
It Came From Below by Kelsey Shannahan https://www.pagerduty.com/blog/it-came-from-below/ Thu, 31 Oct 2019 13:00:30 +0000 https://www.pagerduty.com/?p=55882 Kelsey Shannahan is a Senior Developer at a leading healthcare IT company in Columbus, Ohio. She is passionate about Elixir, her three cats, and Eldritch...

The post It Came From Below appeared first on PagerDuty.

]]>
Kelsey Shannahan is a Senior Developer at a leading healthcare IT company in Columbus, Ohio. She is passionate about Elixir, her three cats, and Eldritch Horror.


I’m going to assume most people who read this blog are familiar with PagerDuty. But just in case anyone isn’t, PagerDuty is a tool we use in IT to notify us if some predefined check has failed. Maybe a key process has died or maybe we’re not seeing our expected traffic volume or maybe our server has stopped responding to ping. Whatever it is, PagerDuty will relentlessly, remorselessly, and loudly notify whoever is on call that something needs attention.

For a while, my phone would serenade me (at full volume) with a barbershop quartet singing about how the server was on fire. However, after I got over 100 alerts in less than 15 minutes when our entire stack went down, I was a little tired of that and now I get a classic “red alert” siren. There’s nothing like sitting in an office filled with a cacophony of barbershop quartets, sirens, and sad trombones to make you regret your choice of career.

It’s even worse when it goes off when you’re sleeping. It’s like a shot of adrenaline to the heart at three in the morning. I suspect it takes a year off my life every time I’m startled awake—or startled by it at any time, really.

About a month ago, it woke me close to midnight. I hadn’t been in bed very long; just long enough to settle into a deep sleep that left me confused when I woke. I fumbled for my bedside light by instinct, squinted hard at my phone to find the “Acknowledge” button, and then put my glasses on. I sat there a moment, dazed, trying to remember where I’d left my laptop in the house.

Then I remembered: I’m not on call this week. I’m not even the backup on-call person. I shouldn’t be getting paged at all.

Maybe someone escalated to me because they specifically needed my help, I reasoned. I tried to think of what recent changes I’d deployed to production that could have caused an outage so late at night. Nothing came to mind. Regardless, I stumbled out of my bedroom and down the dark hallway to my office. The floor creaked under me—and then my phone went off again with that damn siren. I nearly dropped it before cursing and fumbling to acknowledge the new page.

I remember standing there after the phone was silenced and thinking, Huh, that’s odd. The floor continued to creak even though I was no longer moving. I reached over and turned the lights on and looked back behind me.

Nothing. The house was silent.

I continued to my office, trying to ignore the sense of unease. The first thing I checked on my laptop was my Slack messages. Nothing from my coworkers. There were still a couple of people online (I work with a lot of night owls) but there wasn’t any chatter in our maintenance or outage rooms. Then I checked the PagerDuty website to review exactly which pages were causing all these alerts.

I didn’t recognize either of them. The first one read: Build failed: compilation error in watchman.rb.

First off, a file ending in .rb is a Ruby file. Ruby isn’t a compiled language. It can’t throw a compilation error. Second, we don’t alert anyone for failed CI builds. That’s just being mean.

The other error was nymaster1n2: Process STOPped. Which also makes no sense because even though we’re based in New York, we don’t have servers here. And what process is it referring to, anyway?

All our alerts are configured to link either to the monitor that’s alerting or to instructions on how to resolve the page for more complicated scenarios. I tried clicking on the link that should take me to the monitor so I could check the status, but it just directed me to a Google Maps page set to the location of my house.

At that point, I assumed this was a hoax. You can manually create an incident in PagerDuty after all, so maybe someone got a little creative and was playing a prank on me. They know I live alone. I’d be an easy target, especially since I don’t have a spouse that would get pissed off.

I resolved the fake incidents, shot a message off in Slack to my team that this wasn’t funny late at night (it is clever, but not while I’m trying to sleep), and went back to bed. A couple hours later, my phone went off again. I repeated the process of turning on the light, acknowledging the alert, and putting on my glasses. I’d also brought my laptop into the bedroom with me earlier so I could just pick it up without leaving my bed. At this point, it was 2 a.m. and none of my coworkers were online. I checked the error.

Unrecognized entity from whoami

I guess that could be legitimate, but I’m not ops. I’m a developer, and I should only be getting alerts for when our apps misbehave, not for server-level stuff. And unless I’m mistaken, whoami is a server thing.

I escalated it to our ops on-call person, slammed my laptop shut, threw myself back down on the bed, and turned the lights off. I hadn’t even fallen asleep yet when my phone went off again. But it wasn’t my chosen siren ringtone.

It was the “Engineer Laughing” one, which, let me tell you, sounds really damn creepy when you’re in a house by yourself at two in the morning.

The notification read raise StandardTerror and included a link to what looked like a log file, but it was just Google Maps again—Street View this time, zoomed in on one of my basement half-windows.

Okay, hahah, I see what you did there, nameless coworker. Standard Error. Standard Terror. Real funny. This time, I dropped the screenshot—and my ire—into our developer Slack channel.

I turned the lights off, but did not lie back down. I waited, suspicious as to whether I would be allowed to sleep. In the silence, I heard something like scratching; distant, perhaps some small animal outside, near the house.

My phone shrilled again.

Heartbeat check failed: ‘nymasterbedroom1n2’

This was getting out of hand. I considered just turning my phone off, but I’ll be honest: I actually wanted to see where this was going. Sure, maybe it’d keep me up all night, but I’d have a hell of a story to tell and I could always sleep in late tomorrow and blame PagerDuty for the late start.

In the end, I kept my lights off to keep the glare off my phone screen. Then I waited, listening to the faint, steady scratching noises and wondering what it was. A raccoon, digging at the foundation below my bedroom window, perhaps? I was on the second floor.

Another alert.

Static exceeding established threshold. Terminating supervision of child processes.

That first part? None of that makes sense. That’s not a real error. That’s nonsense. Since none of my coworkers were fessing up to the prank (they were still all offline), I opened my laptop, pulled up the PagerDuty website, and filed a support ticket. I included the text of the alerts and explained that these errors did not exist in our ecosystem. If it wasn’t some mischievous coworker responsible, then I wanted the root cause resolved as quickly as possible. I wondered if I was getting errors from someone else’s system. That’s the thing about being paged late at night—it makes you highly motivated to make sure it never happens again.

Then I sat there and waited in my dark bedroom, eyes fixed on the glowing laptop screen. The scratching from outside my house seemed to be getting… longer. More a scraping sound now. I wondered if perhaps I should go shine a flashlight out the window and scare off whatever it was. I wasn’t necessarily afraid yet. I don’t scare easily and I have an alarm system, which meant my house was a less attractive target than the neighbors’.

Another alert with another ringtone. The one with a small child singing, “Something’s broken, something’s broken, it’s your fault!”

“It damn well isn’t my fault,” I muttered as I refreshed the screen.

“keeplightson.sh: Syntax error: “a̗͌b̠̔n̩͕̲͂̑̈̕͟o̡̧̘̾͋́r̢͙̣͍̗͌̀̐̿͝m̧̨̛̼̘̗̓͌̅̚ã̺͎͕̑̿̈́͢ḻ̥̇̈́i͈͎͚̮͗͐̅̚ẗ͉̮̬̗́́̏͡y̢̲̆̑” unexpected”

It was at that point that I took the nuclear route.

I @channel’d the General room on Slack, saying that I was seriously freaked out, this prank was going way too far, and I’d really appreciate it if someone would fess up and stop sending me alerts.

Did that push a notification to literally every person in my 1,000+ employee company? Yes. Did I care? No.

No one responded. The little icons of all my coworkers’ availability status remained stubbornly blank and meanwhile, my phone was blowing up with alerts, interrupting its own siren sound before it had finished with the next incoming klaxon.

The same notification, over and over and over.

Heartbeat check failed: ‘nymasterbedroom1n2’. Run keeplightson.sh

Then. A lull. Just a minute or so. But in that silence, I realized I no longer heard the scratching sound; instead, I heard something else: a creaking. The same creaking I’d heard in the hallway when I went to retrieve my laptop. It came from somewhere inside the house.

I froze, straining to listen to the sounds around me. Then I heard it: a rhythmic thud that repeated over and over, heavy and deliberate. Something was hitting the wall—no, a door. I heard the rattle of the hinges.

The basement door.

A PagerDuty notification lit up my screen and the speaker began blasting the harmonized tones of the barbershop quartet. It was a song I’d never heard before, one that wasn’t included in the app’s options.

“It is cooomiiiiiiiiing,” they crooned.

I reached over and, with shaking hands, jerked the cord on the bedside lamp, flooding my room with a warm glow. Then I acknowledged the alert to shut up the barbershop quartet and dialed 911. I told the operator there was something in my house.

A shuddering crash, the sound of a door being thrown open and slamming into the wall. I tumbled out of bed, trying to think of an escape route as the operator on the other end calmly instructed me to find a way out of the house.

“There is none,” I gibbered. “I’m on the second floor.”

She told me to shut the bedroom door and drag something heavy in front of it. Police were on the way, she said. I did as she instructed, shoving my wardrobe over to barricade myself inside the room before curling up in the far corner, next to the bedside table and the glow of the lamp. And waited.

From the hallway, a noise like nails on a chalkboard grew steadily closer. My mouth was dry and my eyes were fixed on the doorway. The 911 operator stayed on the phone with me, telling me to stay calm, that the police were almost there.

“I don’t think they’re going to make it in time,” I whispered.

Something slammed against the bedroom door. My wall shook with the impact. Something large, something viciously strong. The wardrobe moved an inch. Another impact. Another inch. And again. I couldn’t breathe. I could barely hear the operator on the phone telling me to stay calm, that the police would get there, telling me to hide if I could. I fumbled in the drawer of the nightstand and found a heavy flashlight that I kept nearby in case of a power outage in the night. I recognized the futility of using it for a weapon against something so massive, but I felt I had to try something.

The doorknob rattled and the door swung open, bumping up against the wardrobe and going no further than a handful of inches. The hallway was pitch dark beyond the doorframe and as I watched, a hand squeezed through the gap. Black, an empty darkness like the void, with irregular edges that fizzled in erratic lines like static. A hand as large as my head with fingers like spikes. It slithered inside my room, along the doorframe, along the drywall, trying to reach the panel of three switches that controlled all the outlets in my room.

Heartbeat check failed in the New York master bedroom. I was going to die.

PagerDuty had said to keep the lights on.

I flicked the flashlight on and shone it at the hand. The fingers vanished where the light hit them. I moved the beam to the left and the rest of the hand vanished as well, like erasing a shadow.

I heard police sirens in the distance, quickly growing louder. My breath came in wild gasps and I wanted to break down and cry, but I kept the flashlight fixed on the doorway. Whatever was out there did not try to enter the room again. I only relaxed when I heard the police pounding on the front door and then breaking it down, and I yelled at them to turn the lights on. Turn the damned lights on.

Only then did I see, over on my laptop monitor, that all of my active incidents got marked as auto-resolved in PagerDuty.

The Aftermath

The police searched my house and found that something had certainly been inside, but whatever it was, was now gone. There were three long gouges in the hallway, cutting clear through the drywall, at about the height of my shoulders. One of the officers told me it was probably a wild animal that got in through the basement and suggested I come see. I went with him to look at the basement.

The walls were covered in gouges, half an inch deep, in sets of three. Long, trailing marks, each line spaced about two inches apart.

“What kind of animal can claw up cement?” I asked, incredulous, running my finger along the grooves.

“The kind that you really don’t want in the house,” he replied.

He advised me to take a careful survey of the exterior in the morning. Maybe call someone out to inspect the house. See if I could find where it had gotten in, and out. Despite his attempt at calm professionalism, I could tell he was unnerved. He didn’t want to say what I was thinking. He and the other officers had scoured the house and found nothing that would explain how something big enough to leave claw marks like these could have gotten in.

That thing had started out in my house and was still here.

As they left, I told them to leave the lights on. I wanted to take some photos for insurance purposes, I claimed. I left the lights on all night and had never been so happy to see the dawn.

The next day I made an apology to my coworkers in the Slack General room. “Weird stuff happened,” I typed. “I got some bizarre pages and there was a wild animal in my basement making creepy noises and I got a bit freaked out.” People were understanding. I got trolled a bit. My coworkers sent me memes. They all had a good laugh. I pretended to laugh along as I did some more digging.

I found out our ops on-call person never got my escalated page. When I checked PagerDuty, all my resolved pages were gone like they had never existed. A few days later, PagerDuty support resolved my ticket with a stock “working as expected” message.

I haven’t dismissed what happened, though. Those claw marks are real. This wasn’t a prank or my imagination. Whatever it was, it could come back.

I’ve set up my own PagerDuty account. I’ve got a series of hooks so that whenever it alerts, it sends a message to my smart lightbulbs to turn on every light in the house. There are floodlights in the basement because I’m not screwing around with whatever is down there. Sure, it was expensive, but considering PagerDuty is now alerting at least once a week … I think it’s worth the investment.

I’m not sure who all they have working over there, but kudos to the PagerDuty team. It’s nice to see a company take the well-being of its client base so seriously. I’ll be honest—if my current gig doesn’t work out long term, I think I’m going to apply to work there as well. They don’t list any open positions for a secret department that handles supernatural events… but I bet they’re hiring.

I gotta say though: They really need more descriptive error messages.
