Digital Operations | Categories | PagerDuty

Three Teams That Can Use AIOps to Work Smarter, Not Harder by Hannah Culver

Hannah Culver — Mon, 28 Aug 2023 12:00:29 +0000

There isn’t a boardroom today that isn’t asking how AI and generative AI applications can help drive efficiency and accelerate their business. For organizations looking to capitalize on ML and automation to improve their efficiency during incidents, AIOps is a tangible, proven application that presents an exciting opportunity for ITOps teams.

As we’ve seen across market landscape evaluations, there are a number of ways that solutions can be implemented. Despite this, the problems AIOps solutions aim to address remain fairly consistent: fewer incidents and faster resolution. But which teams can stand to benefit from this powerful technology and how will AIOps help them achieve their desired business outcomes?

Understanding how different teams can implement best practices to see a reduction in MTTR, total incidents, and time to adopt automation will help ensure that each team is getting value from your investment. Here are three teams that stand out as having much to gain from leveraging AIOps: Network Operations Center (NOC) teams, Major Incident Management (MIM) teams, and distributed service owning teams. Let’s cover each.

NOC teams

If you have a NOC, it acts as your central nervous system. You may also be in the middle of undertaking modernization efforts to reduce both cost and risk.

Many of our NOC customers tell us about challenges such as:

Eyes-on-glass operational style causes incidents to go undetected
Catch and dispatch means too many escalations to SMEs or routing incidents to the wrong team
Manual work drives up MTTR
L1/L2 teams experience high turnover and blame culture is common

To move beyond this, organizations can create L0 automation. This is automation that serves as the first responder, only bringing in humans when necessary. For well-understood, well-documented issues, L0 automation can auto-remediate incidents without a responder intervening. But for other more complex issues that require a hands-on approach, NOC teams can create L0 automation that immediately pulls in diagnostic information before the responder looks at an incident, routes incidents intelligently according to event data, and populates the incident notes with pertinent documentation and runbooks.
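The L0 pattern described above can be sketched as a small triage function. This is a minimal illustration, not PagerDuty's implementation: the `Event` fields, the `KNOWN_REMEDIATIONS`, `ROUTING`, and `RUNBOOKS` tables, the team names, and the runbook paths are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables; a real deployment would load these from config.
KNOWN_REMEDIATIONS = {"disk_full": "rotate_logs", "service_hung": "restart_service"}
ROUTING = {"database": "dba-oncall", "network": "netops-oncall"}
RUNBOOKS = {"database": "runbooks/database.md", "network": "runbooks/network.md"}


@dataclass
class Event:
    kind: str        # e.g. "disk_full"
    component: str   # e.g. "database"


@dataclass
class Outcome:
    action: str      # "auto_remediate" or "page_human"
    target: str      # remediation job name or on-call team
    notes: list = field(default_factory=list)


def l0_triage(event: Event) -> Outcome:
    """First-responder automation: fix known issues, enrich and route the rest."""
    job = KNOWN_REMEDIATIONS.get(event.kind)
    if job is not None:
        # Well-understood issue: run the documented fix without paging anyone.
        return Outcome("auto_remediate", job)
    # Otherwise hand off to a human, with context already attached.
    team = ROUTING.get(event.component, "noc-l1")
    notes = [f"diagnostics requested for {event.component}"]
    runbook = RUNBOOKS.get(event.component)
    if runbook:
        notes.append(f"runbook: {runbook}")
    return Outcome("page_human", team, notes)
```

The point of the design is that a human only sees the incident after the cheap, deterministic steps (known fixes, routing, runbook lookup) have already run.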

PagerDuty AIOps helps NOCs modernize and move away from eyes-on-glass methods. These NOCs are a center of excellence within their organizations, spearheading data-driven optimization, enabling best practices, and ensuring incident readiness.

MIM teams

When critical, customer-impacting incidents happen, you don’t have time to waste. But with complexity and noise on the rise, how do Major Incident Management teams improve to meet growing customer expectations?

We see MIM teams with common challenges such as:

Finding out about major incidents from an overwhelming number of customers/users calling in or from delayed team escalations
Lack of context as initial triage takes too long to assess severity and business impact
Long MTTR waiting for the right people, the right diagnostics, the right runbooks, etc.
Disjointed tooling leading to communication barriers for responders and corresponding teams

MIM teams can overcome these challenges with a variety of automation and ML tactics. First, organizations can create automation that immediately routes high-priority or high-severity incidents to a MIM team and tags in the appropriate teams via incident workflows. Additionally, ML can gather key context such as how rare an incident like this is, whether it has happened before and how it was resolved, and which change events might be correlated with the failure.

PagerDuty AIOps helps MIM teams detect major incidents faster, improve MTTR and customer experience, and save SMEs time. This reduces the cost of each incident and mitigates risk.

Distributed service owning teams

DevOps and distributed service owning teams are under more pressure than ever to deliver exceptional customer experiences. But with competing priorities and fewer resources, this is easier said than done.

Many of our customers share challenges they are facing such as:

Disparate monitoring tools with no central pane of glass
Too much noise leading to incorrect escalations and false incidents
Lack of context and information silos
Toil and time taken away from value-add initiatives

For service owning teams looking to overcome these challenges, an AIOps tool that can aggregate data from all the monitoring sources in the technical ecosystem can help bring clarity to incident response. Additionally, with ML, teams can reduce noise by automatically grouping together alerts based on context, time, and previous event data that the model has trained on. With this and the ML-surfaced triage information, incident response is streamlined so teams can get back to innovating faster.
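To make the grouping idea concrete, here is a minimal sketch of grouping by context and time. It is a simple rule-based stand-in, not the ML model described above: alerts on the same service that arrive within a fixed window of that service's previous alert join the same group. The `Alert` shape and the 300-second default are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since epoch
    summary: str


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts per service; a gap longer than the window starts a new group."""
    groups = []       # list of lists of Alert
    open_group = {}   # service -> index in `groups` of its most recent group
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        idx = open_group.get(alert.service)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            open_group[alert.service] = len(groups) - 1
    return groups
```

How much noise this removes depends heavily on the window: too short and related alerts split into separate incidents, too long and unrelated alerts are merged.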

PagerDuty AIOps helps service owning teams spend less time firefighting, reduce MTTR, and create exceptional customer experiences. This improves culture and team retention while increasing revenue for the entire organization.

Ready to get started?

With PagerDuty AIOps, teams like the ones we looked at see 87% fewer incidents, 14% faster MTTR, and 9x faster automation adoption. This helps organizations move faster, focus on the work that matters most to customers, and reduce risk and team burnout. Best of all, teams from dev to IT can see value from PagerDuty AIOps.

PagerDuty AIOps works in conjunction with the rest of the PagerDuty Operations Cloud to help organizations manage their operations by leveraging AI and automation to supercharge their digital transformation. With over 700 integrations, GenAI capabilities, and end-to-end event-driven automation, PagerDuty gives customers a 400% ROI and the right tools to leapfrog the competition.

To try PagerDuty AIOps out yourself, you can take an interactive product tour or try us for free for 14 days.

The post Three Teams That Can Use AIOps to Work Smarter, Not Harder appeared first on PagerDuty.

Doing More with Less: Building Greater Operational Efficiency with PagerDuty by Nancy Lee

Nancy Lee — Wed, 14 Dec 2022 14:00:13 +0000

How many of us can say with confidence that we know a tool inside and out? If you’re like most, you probably use just a small fraction of a product’s features. When it comes to feature-rich software like Microsoft Word or Excel, it’s a safe bet that most users are aware of less than half of the features, and use even fewer on a regular basis. And the longer we’ve been using a piece of software, the more likely we are to fall into this trap of feature underutilization.

I started noticing this in my own life a year and a half ago when a coworker who had recently joined the team told me she found a more efficient way to generate closed captions for our instructional videos. I asked if it was a tool in her Adobe Creative Suite.

“Nope, it’s actually YouTube!” she replied.

“What? That’s amazing!” I said. “How did we not know about this?” I was shocked. For the past 6 months, we had been paying for a separate tool for its closed captioning capabilities when, all along, we could’ve used YouTube’s free captioning feature in our Google accounts.

More recently, I had my proverbial mind blown yet again when I learned of Slack’s reminder feature. Making my to-do list for the next day, I was scheduling reminders in my Google calendar to follow up with a teammate, call my doctor, and pay the gas bill. My husband looked on in amusement as I added one event after another in my calendar.

“What are you doing?” he asked.

“Setting reminders for the things I have to do tomorrow,” I replied, mildly annoyed at this interruption to my sacred routine.

“Why don’t you use the Slack reminder feature?” he said. “That way, you’re not filling up half your calendar with reminders and making it hard for people to book meetings.”

“I had no idea you could do that!” Like the YouTube incident, I was incredulous that I was only learning of this feature now.

As I started scheduling Slack reminders for the following day, I wondered how often we hear our customers use that phrase — “I had no idea you could do that!” It’s not surprising when you think about it. We often purchase a tool for a specific use case. In our haste to implement a solution, we approach the task with blinders on, paying attention to only those features that will help us achieve our goal. “Problem solved!” we declare. Never mind that we only learned a tenth of the software’s capabilities. Years later, we’re still clicking the same buttons and following the same scripts, oblivious to the slew of new features that promise to enhance our user experience.

It’s human nature to take the path of least resistance. But at a time when many tech companies are being asked to manage costs and do more with less, perhaps a good place to start looking for efficiencies is in our existing investments.

One business area that shines a light on this is Customer Education. At PagerDuty, customer training and enablement sits with PagerDuty University. A comment we often see in our course evaluations is “I had no idea PagerDuty could do [fill in the blank with a feature that’s existed for months or even years]!” Some customers may have started using PagerDuty for on-call management and alerting, and never ventured beyond those basic capabilities. They’ve become so accustomed to using PagerDuty for a single use case that they don’t realize its product portfolio actually encompasses multiple solutions for use cases across their digital operations.

For organizations facing pressure from the current macroeconomic environment, PagerDuty’s end-to-end digital operations capabilities can help consolidate tool spend and boost productivity by reducing context switching. PagerDuty University helps customers by driving awareness of this end-to-end experience, from pre-incident creation (enriching and routing events) to post-incident mobilization (response automation) to business-wide orchestration (automated stakeholder communication) and beyond. Rather than investing in point solutions that address a single problem, our customers can leverage the solutions they need, when they need them, adopting additional capabilities and products as they continue to evolve their Digital Operations with PagerDuty.

Those of us who work in Customer Education understand that it’s our job to not only improve a customer’s time to value, but to ensure that they continue to see the return on their investment post-onboarding and beyond. For PagerDuty University, that means making sure that our customers receive proper enablement on PagerDuty’s advanced capabilities such as Event Intelligence and Incident Workflows (in Early Access!), as well as other products and use cases such as Customer Service Operations and Process Automation. Tool consolidation, cost savings, automating away toil, and a better customer experience: these are some of the biggest returns our customers walk away with post-training.

Our instructor-led training courses are centered around achieving customer goals. Rather than training customers on every PagerDuty feature, we first try to understand what business challenges they’re trying to solve, and build training that guides them efficiently to reaching those goals. Often in SaaS, we talk about time to value — we like to think of our technical training team as “guides to value.”

PagerDuty University’s free, on-demand training complements our instructor-led training by digging into each product feature, situated in real-life scenarios so users always understand the larger context in which these features are used and the problems they solve. Our self-paced eLearning modules are suitable for customers who are trialing a free account, those who want to check out new features, or those who simply prefer the self-serve aspect of on-demand training.

It should come as no surprise that those of us who work in Education Services love learning. We use that love of learning to drive customer success, which sits at the heart of everything. Whether it’s driving adoption, improving onboarding, or imparting industry best practices, we strive to make sure that we never hear one of our customers say “I had no idea PagerDuty could do that!”

4 New Product Announcements to Help Teams Do More with Less by Vivian Chan

Vivian Chan — Tue, 01 Nov 2022 13:00:03 +0000

Incidents are costly. It’s not just revenue that takes a hit every time you have an outage–brand reputation and client satisfaction are also on the line. To protect current and future revenue, companies have to deliver on customer expectations. Innovation alone is no longer enough: digital experiences must also be fast, flawless, and highly available. This means teams have to get more proactive with real-time, unplanned work. Only then can they account for the 1 in 3 customers who would stop doing business with a brand they loved after one bad experience (PWC).

If that wasn’t hard enough, you now need to do all of this with less. 98% of CEOs are preparing for a recession in the next 12 to 18 months (The Conference Board). Teams have to get more efficient, because costs are high, resources are low, and skills are scarce. Teams can’t afford to work harder–they have to work smarter. Instead of five tools to manage event correlation, diagnostics capture, incident workflows, incident communications, and customer status pages, you need one solution for end-to-end incident response.

Today, PagerDuty is announcing new ways to drive efficiency across your digital operations. With expanded noise reduction, workflow automation, and stakeholder communication capabilities, the PagerDuty Operations Cloud is the only solution on the market that delivers end-to-end automated incident response from ingest to resolution. If you’ve only been using the platform for on-call and alerting, it’s time to consider how you could achieve your cost-optimization goals with PagerDuty.

Earlier this year, you heard about how the PagerDuty Operations Cloud is revolutionizing operations for a world of digital everything. Over 21,000 organizations rely on the PagerDuty platform to react, respond, and resolve issues quickly in real-time by:

– Using APIs and webhooks to understand what’s happening in your environment with over 700 integrations

– Applying machine learning to help reduce the noise to signal ratio and keep teams focused on what’s important

– Automating repetitive, manual processes to deflect manual work and avoid costly escalations

– Mobilizing business-wide response for faster resolution and better customer experiences

– Learning continuously to better anticipate and prevent repeat issues

In this blog, we’ll highlight four of the key innovations coming to the PagerDuty Operations Cloud, before summarizing the full list of new capabilities. Register now for the November 3rd product launch webinar, “Evolve to Resolve: Fewer Incidents, Faster Response” to learn more and engage directly with our product team.

Design and automate Incident Workflows to reduce manual toil and save time spent per incident

The more that automation can remove toil and take care of rote tasks in incident response, the faster teams can focus on problem identification and resolution. And the faster they can resolve an incident, the sooner they can get back to building new products and services.

Today we’re thrilled to announce that Incident Workflows is entering Early Access for PagerDuty Business and Digital Operations customers soon. Incident Workflows gives teams more flexibility and control over what types of responses they want to configure for different types of incidents. This takes Response Plays to a whole different level by offering full end-to-end workflow automation, tightly integrated into ChatOps and the rest of the incident response platform. This is just the first step in our journey towards offering more extensibility for our customers to configure responses for their specific use cases.

Users can rapidly design incident process automation via an easy-to-use, no-code interface that configures workflows that can automatically trigger based on changes in urgency, priority, or severity. For example, you can customize a major incident workflow that automatically opens a conference bridge, adds responders, and starts an incident-specific Slack channel to keep everyone in sync.

You can also design a separate workflow for handling smaller bugs and routing them into a backlog for a team to look at later. This takes the cognitive load off the response team and makes incident response processes smoother and more consistent.
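As a rough mental model of the trigger-and-steps behavior described above (not the product's actual configuration format, which is built in a no-code interface), a workflow can be thought of as a condition on an incident field plus an ordered list of actions. The `Workflow` class, the `P1` to `P5` priority labels, and the step names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Workflow:
    name: str
    trigger: Callable[[dict], bool]  # decides whether the workflow applies
    steps: list                      # ordered action names (illustrative only)


# Hypothetical definitions mirroring the two examples in the text.
WORKFLOWS = [
    Workflow(
        name="major-incident",
        trigger=lambda inc: inc.get("priority") == "P1",
        steps=["open_conference_bridge", "add_responders", "create_slack_channel"],
    ),
    Workflow(
        name="minor-bug",
        trigger=lambda inc: inc.get("priority") in {"P4", "P5"},
        steps=["add_to_backlog"],
    ),
]


def steps_for(incident: dict) -> list:
    """Return the steps of every workflow whose trigger matches the incident."""
    planned = []
    for wf in WORKFLOWS:
        if wf.trigger(incident):
            planned.extend(wf.steps)
    return planned
```

Keeping triggers and steps as data rather than hard-coded branches is what lets responders adjust the process without touching the code that runs it.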

Join the early access program now.

Proactively manage customer relationships with real-time updates via PagerDuty Status Pages

The most important stakeholder in incident response is your customer. Mishandling external stakeholder communication around operational updates erodes trust and harms customer relationships. Without timely updates via a status page, support teams are flooded, which drives up costs.

Today, we’re happy to announce the Early Access of PagerDuty Status Page. One of our top-requested features, PagerDuty Status Page will keep your customers informed about critical operational updates happening in your system. It also saves time, reduces support volumes, and avoids context switching with a single source of truth without requiring additional third-party infrastructure. Providing this level of visibility helps to maintain trust and transparency with valued accounts and external stakeholders.

Sign up for early access now.

More focused engineering time and fewer incidents with more flexible Intelligent Alert Grouping

Teams can’t focus on triage if they’re bombarded by an alert storm. PagerDuty has seen event data grow at nearly 3x the rate of users, outpacing technical teams’ ability to manually process and correlate it all across silos.

We’re happy to announce a new enhancement to our AIOps-powered noise reduction suite: Flexible time windows for Intelligent Alert Grouping. A configurable time interval provides teams with more flexibility and granularity when tuning system noise on services in their specific environment. The feature is currently in Early Access for Event Intelligence and Digital Operations customers. Customers in the Early Access program saw up to a 45% increase in the average compression rate on their noisiest services in a matter of weeks.

The machine learning-driven engine will also surface recommendations for optimal time windows based on historical analysis of time between alerts. This helps teams not only maximize noise compression and minimize manual alert grouping, but also reduce the guesswork needed by the operator and improve efficiency.
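One plausible way to derive a window recommendation from the historical time between alerts is to pick a percentile of the observed gaps. The sketch below is an assumption about how such a heuristic could work, not a description of PagerDuty's engine; the function name, the 90th-percentile default, and the clamping bounds are all illustrative.

```python
import math


def recommend_window(timestamps, percentile=0.9, min_s=60.0, max_s=3600.0):
    """Suggest a grouping window (seconds) from gaps between consecutive alerts."""
    ts = sorted(timestamps)
    gaps = sorted(b - a for a, b in zip(ts, ts[1:]))
    if not gaps:
        return min_s  # not enough history; fall back to the smallest window
    # Nearest-rank percentile: the smallest gap that covers `percentile` of gaps.
    rank = max(1, math.ceil(percentile * len(gaps)))
    window = gaps[rank - 1]
    # Clamp so one outlier gap cannot produce an unreasonably large window.
    return min(max(window, min_s), max_s)
```

A percentile is used instead of the mean so that a few very long quiet periods between alert storms do not inflate the suggestion.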

Try Event Intelligence today: If you’re an existing PagerDuty Incident Response customer, you can start a 30-day free trial of Event Intelligence from the PagerDuty product under Account Settings. If you’re new to PagerDuty, simply start a free 14-day trial to get started today.

Higher uptime and fewer communication snafus with best practices for streamlined communication and stakeholder management

True to our DevOps roots, PagerDuty aims to pay it forward to our customers and the industry by sharing best practices. That’s why we’ve invested in Ops Guides for anything from Service Ownership to Business Response.

We’re thrilled to announce that we have updated the Incident Response Ops Guide with new learnings that we’ve codified from our own incident response experiences. We’ve now expanded the section about communications and how to streamline stakeholder management during incident response.

Read the updated operations guide at response.pagerduty.com.

When the wrong people are tasked with an urgent, unplanned issue, escalations will drive up the resolution time and cost. When critical stakeholders and customer service are not informed, customer experience is put at risk. New updates across Incident Response and key partner integrations help to streamline incident response and make sure all teams have the information they need, when and where they need it, to drive to the next best action.

Custom Fields on Incidents: Populate custom fields on incidents to provide responders with more contextual data for faster triage and resolution. This feature is planned for Early Access in early 2023. Sign up for early access now.
Microsoft Teams: Coming soon for general availability, the integration will feature additional incident management actions, including: changing priority, reassigning, adding responders, escalating, and performing PagerDuty Automation Actions. Learn more.
Updates to Status Update Notification Templates: Teams can now use flexible templates to format the content and context they need for better standardized internal communications during incident response. The Rich Text Editing functionality is generally available (Learn more), while the templates functionality recently entered Early Access. Sign up for early access now.
PagerDuty for ServiceNow CSM: Agents using PagerDuty Customer Service Operations now have a direct line of escalation and communication with engineering on incidents in PagerDuty from the application where they work. General availability is coming soon. Learn more.
Mobile Updates: A new home screen experience, now in Early Access, streamlines navigation to help responders access their most important incidents. Also, mobile service maintenance windows are now generally available, letting you temporarily disable a service, including its integrations, for a set period of time while it is in maintenance mode, right from the mobile app. Learn more.

But wait, there’s more:

More Automation: Automation has the added benefit of distributing expertise and scaling processes to avoid overloading contributors with unplanned work and constant disruptions. Codifying and distributing subject matter expertise in the form of runbooks reduces the number of escalations required, minimizing the total number of people needed to triage, diagnose, and resolve incidents. That’s why we’ve been introducing more integrations both across our automation products and other products in the portfolio. Learn more about PagerDuty’s Process Automation portfolio here.

Automation Actions to Runbook Automation Integration: Now generally available, this integration lets customers seamlessly connect to their production environments and implement diagnostic and remediation jobs in the cloud. Learn more.
Automation Actions in CSOps for Zendesk: Now generally available, this integration empowers agents to validate customer-impacting issues by running automated actions directly from the PagerDuty app in Zendesk. This will reduce resolution time, as well as the number of incidents escalated to the back-end teams. You can read more about this integration in this blog.

More Enhancements to Access Controls and Analytics:

More than 700 integration partners: With 80+ new integration partners introduced this year, PagerDuty snaps seamlessly into any stack. Learn more about our integration partners.
Webhooks v3: Managing webhook integrations is easier than ever. Webhooks v3-based integrations, now generally available, will be visible and editable in the Service Directory UI under the Integrations tab. Learn more.
API Scopes: Securing access to PagerDuty resources is becoming more granular, so you can grant the right level of access for every use case. Planned for Early Access, scoped API access provides more flexibility with application credentials that control access and available actions on PagerDuty resources, such as account-level access credentials with differentiated read-only versus read and write permissions on PagerDuty Incidents. Sign up for early access at https://www.pagerduty.com/early-access/
Updates to Login experience: Planned for Early Access, this feature provides a single, streamlined login experience for PagerDuty no matter where you’re using it: web, mobile, Slack integration. Keep an eye out for this new experience on mobile and web later this year.
Updates to Service Performance Report & Incident Activity Report: Surface actionable insights and opportunities for continuous improvement with new reports featuring new visualizations, more interactivity, intuitive drill-down capabilities, and additional filtering options. These reports are currently in Early Access for Digital Operations and Analytics customers. Learn more. If you are a Professional or Business Incident Response customer, reach out to your account team to try this out for your team.

Register now for the product launch webinar “Evolve to Resolve: Fewer Incidents, Faster Response” to learn more about any of these announcements. Ready to get started? Contact sales or sign up for a 14-day free trial to get started.

What’s New: Updates to Incident Response, PagerDuty® Process Automation Software & PagerDuty® Runbook Automation, Integrations, and More! by Vera Chan

Vera Chan — Thu, 27 Oct 2022 13:00:50 +0000

We’re excited to announce a new set of updates and enhancements to the PagerDuty Operations Cloud. Recent development and app updates from the product team cover Incident Response and PagerDuty® Process Automation, as well as Community & Advocacy Events. We continue to help customers further automate to optimize cloud operations and reduce the number of issues escalated to other teams. Get started now and learn about:

Incident Response Status Update Notification Templates customization, standardization, and reusability
PagerDuty® Process Automation and Rundeck Community updates in the latest 4.7.0 Release and new Automated Diagnostics for AWS
Updates to CollabOps and Customer Service Ops Integrations
- How to Show Maintenance Windows in Slack
- PagerDuty App for Salesforce v3.7 and PagerDuty App for Zendesk Migration to v3
Product Deprecations including the v1 Webhooks EOL (End-Of-Life), Event Rules EOL & Migration to Event Orchestration, and v2 Zendesk Integration EOL
Sign up for upcoming Events, view our recent Podcasts and Twitch Streams, and join members of our Community Team and Product Teams as they walk you through new products and leading practices in the software industry

Incident Response

Status Update Notification Templates EA

We’re excited to announce updates to Status Update Notification Templates. You can now customize and standardize reusable communication templates based on impact, service areas, and more. Teams can also leverage this feature to fit their needs in any tool or context via our API.

(Featured above: Setting Up Status Update Notification Templates With Variables)

(Featured above: Setting Up Status Update Notification Templates Preview)
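To illustrate what a variable-driven status update template looks like in principle, the sketch below uses Python's standard `string.Template`. The placeholder names (`service`, `impact`, `next_update`) and the wording of the message are invented for the example and are not PagerDuty's actual template variables.

```python
from string import Template

# Hypothetical reusable template; placeholder names are illustrative only.
STATUS_UPDATE = Template(
    "[$impact] $service is experiencing issues. "
    "Next update in $next_update minutes."
)


def render_status_update(service: str, impact: str, next_update: int) -> str:
    """Fill the shared template so every update follows the same structure."""
    return STATUS_UPDATE.substitute(
        service=service, impact=impact, next_update=next_update
    )
```

Because `substitute` raises `KeyError` when a placeholder is left unfilled, an incomplete update fails loudly instead of being sent to stakeholders with a raw `$variable` in it.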

Please fill out the Early Access Program Form to be added to our early access list for the latest updates
Learn more about communicating with stakeholders in the Knowledge Base

PagerDuty® Process Automation

PagerDuty® Process Automation Software and PagerDuty® Runbook Automation Version 4.7.0

Check out the new features and enhancements for PagerDuty® Process Automation, PagerDuty® Runbook Automation and Rundeck Community in this release, including:

CloudWatch Logs Saved Query Plugin. This plugin simplifies the management of diagnostics queries. The release also includes an incubating feature that helps users understand the ROI (return on investment) of jobs, as well as a number of security and compliance updates and bug fixes.

(Featured above: CloudWatch Logs Queries Output)

ROI Metrics Data (Incubating). Process Automation users now have a way to track time and money saved, as well as begin to see insights into the effectiveness of teams and projects. The ROI Metrics integration tracks the user-defined value of each job execution and stores key-value pairs against jobs to help you understand the ROI per job execution.

(Featured above: ROI Metrics Output)
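The idea of attaching a user-defined value to each job execution can be illustrated with a small calculation. This is a hypothetical sketch: the `hours_saved` key, the job names, and the hourly rate are assumptions for the example, not the ROI Metrics plugin's actual schema.

```python
from collections import defaultdict


def roi_by_job(executions, hourly_rate=100.0):
    """Sum user-defined value per job and convert hours saved into money.

    `executions` is an iterable of (job_name, {"hours_saved": float}) pairs,
    one pair per job execution.
    """
    hours = defaultdict(float)
    for job_name, values in executions:
        hours[job_name] += values.get("hours_saved", 0.0)
    return {
        job: {"hours_saved": h, "money_saved": h * hourly_rate}
        for job, h in hours.items()
    }
```

Calling `roi_by_job` with a list of executions yields one summary entry per job, which is the kind of per-job rollup the paragraph above describes.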

Enhanced Progress Badge Plugin. The Progress Badge plugin can create graphic badges with the option to include emoticon status symbols that render on the Log Output tab. For users implementing Automated Diagnostics, this gives domain experts the ability to present diagnostics in an easier-to-consume way.

(Featured above: Progress Badge in Failed State)

(Featured above: Progress Badge in Successful State)

(Featured above: Progress Badge State in Incident Activity Timeline)

Additional Updates. Catch up on additional bug fixes and open source product updates.

Learn more:

View our Twitch stream review of this release
Read the full release notes about the Orc Yellowgreen Gift Release (4.7.0), October 6, 2022

Automated Diagnostics for AWS

We recently launched Automated Diagnostics for AWS in PagerDuty to help our customers quickly triage problems in AWS environments. This solution consists of a seamless integration of Automation Actions and Runbook Automation connected to PagerDuty Incident Response and Event Orchestration. It provides prebuilt common diagnostics for frequently used AWS services, and an easy way to add and build your own diagnostics.

(Featured above: Automate Diagnostics Run Actions Menu)

(Featured above: Process Automation AWS CloudWatch Logs Plugin)

Learn more about automated diagnostics for AWS use cases

Integrations

Maintenance Windows in Slack

Have you ever wanted to extend the PagerDuty Slack integration to show maintenance windows directly in Slack? Mandi Walls recently wrote up a walkthrough for exactly that situation.

(Featured above: Multiple Scheduled Maintenance Windows in Slack Result)

Newest Version of the PagerDuty App for Salesforce Is Now Available!

The latest version of the PagerDuty App for Salesforce, v3.7, is now available. Some benefits of this latest version are the following:

Webhook Extension upgrade to PagerDuty webhooks v3: extensions can now be added at the account level, instead of at the service level as with webhooks v2
New Salesforce Extension page and the ability to see what Salesforce accounts are connected to PagerDuty
Default Object mapping for the Salesforce incident object with the standard integration
Ability to select if ruleset actions should be limited to PagerDuty or Salesforce objects created as part of the ruleset flow

(Featured above: PagerDuty App for Salesforce V3.7 Main Page)

For more information:

Visit the PagerDuty App for Salesforce integration guide for more details
Please reach out to support@pagerduty.com with any additional questions

Migrate to v3 PagerDuty App for Zendesk

Migrate now to v3 PagerDuty App for Zendesk to continue to send Zendesk support ticket events to PagerDuty. (The v2 integration will sunset in March 2023; see the Product Deprecations section below.)

Some benefits of this upgrade are the following:

Bidirectional communication between PagerDuty and Zendesk using webhooks
Additional PagerDuty actions and an Incident console widget on the ticket page
Ability to view and interact with the PagerDuty status dashboard from Zendesk

(Featured above: Zendesk V3 Status Dashboard)

For more information, please:

View the PagerDuty App for Zendesk integration guide for more details
Reach out to support@pagerduty.com with any additional questions

Product Deprecations

Please take note and keep your teams informed of our upcoming product deprecations.

V1 Webhooks EOL

The End of Life date for v1 Webhooks is 10/31/2022. This means:

You will no longer be able to create new v1 Webhooks or use existing connections to v1 Webhook extensions
Apps or integrations that are using v1 Webhooks will stop working

For more information, please:

Refer to this migration guide for details and steps to migrate to v3 Webhooks
Reach out to support@pagerduty.com with additional questions

Important Dates:

V2 Zendesk Integration EOL

PagerDuty’s v2 App for Zendesk will End-Of-Life in March 2023. Migrate now to continue to send Zendesk Support Ticket events to PagerDuty. You can read about the benefits of migrating to v3 in the Integrations section above.

Learn more in the PagerDuty App for Zendesk integration guide
If you have any questions or concerns, please reach out to support@pagerduty.com

Event Rules EOL & Migration to Event Orchestration

PagerDuty Event Rules End-Of-Life is April 30, 2023. You can:

Learn more about the migration in the knowledge base
Learn more about Event Orchestration
Contact your account manager

We have plenty of migration paths to support this EOL. Additionally, on the EOL date, we will auto-migrate any remaining event rules you are using to Event Orchestrations, one-to-one. From then on, you’ll be able to do everything in Event Orchestration that you can in Event Rules today. Event Orchestration has the same features as Event Rules and it uses the same backend architecture, ensuring that event processing has billions-of-events-worth of testing already baked in.

Webinars & Events

Join us for the following webinars and events to learn more about PagerDuty’s recent product updates and how they benefit customers. These are just a few of many:

Webinars

Page It to the Limit: Security Careers

October is Cybersecurity Awareness Month! Our security team at PagerDuty helps our engineers keep our platforms safe, provides our employees with security training, and much more. Join Megg Sage and Patrick Roserie as you listen to the latest episode of Page It to the Limit (the PagerDuty podcast) to learn more about how they approach security at PagerDuty and beyond.

Evolve to resolve: fewer incidents, faster response (November Product Launch!)

Join our SVP & GM of Emerging Products Jonathan Rende and senior product team members Kat Gaines, Julia Nasser, Sam Ferguson, and Hadijah Creary for a deep-dive into some of our newest capabilities, including:

Incident Workflows
PagerDuty Status Page and Status Update Notification Templates
Flexible time windows for Intelligent Alert Grouping
Updated for 2022! Incident Response Ops Guide

Live Call Routing: The Fastest Way to Reach an On-Call Staff

Join Tim Chinchen and Ben Wiegelmann from PagerDuty as they discuss:

What is Live Call Routing? Explore the Live Call routing workflow and why it works
How to set up Live Call Routing with no code required
Customer use cases from small to large companies who are using Live Call Routing to drive down response time and improve the overall customer experience

PagerDuty Community Twitch Stream

Join us on our Twitch channels, PagerDuty Twitch Stream and PagerDuty Community Twitch Stream, to catch up on one of our latest streams led by our Developer Advocates! Catch our past streams via the YouTube Twitch Streams Channel.

Subscribe and get notified when we’re live and view previous recordings
Missed one of our broadcasts? Watch any of these upcoming and recent Twitch streams (PagerDuty Garage and Terraform Time) or YouTube videos:
- HowTo Happy Hour: Intelligent Alert Grouping With Mandi Walls, as well as Max Li and Everaldo Eguiar (Data Scientists from PagerDuty) (September 30, 2022)
- Process Automation and Rundeck OSS Release Notes v4.70 with Mandi Walls and Jake Cohen from PagerDuty (Oct 12, 2022)
View our Twitch Stream schedule to join one of our weekly ongoing streams in November!

If your team could benefit from any of these enhancements, be sure to contact your account manager and sign up for a 14-day free trial.

The post What’s New: Updates to Incident Response, PagerDuty® Process Automation Software & PagerDuty® Runbook Automation, Integrations, and More! appeared first on PagerDuty.

What’s New: Updates to Incident Response, AIOps, PagerDuty® Process Automation, and More! by Vera Chan

Vera Chan — Tue, 31 May 2022 13:00:45 +0000

Summit’s right around the corner (have you registered yet?) but the shipping doesn’t stop! We’re excited to announce a new set of updates and enhancements to PagerDuty’s Digital Operations Platform. Recent updates from the product team range from On-Call Management, Incident Response, Process Automation, and Integrations to PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, do the following, and more:

Quickly and easily find what you need with the new on-call scheduling UI
Keep stakeholders informed with the necessary level of content and context within status update notification templates during Incident Response
Leverage new PagerDuty® Process Automation and PagerDuty® Runbook Automation plugins to empower first responders to run automated diagnostics on AWS infrastructure and applications, reducing escalations to and engagement of others
For AIOps, you can now Copy & Paste Rules in Event Orchestration (Available to all plans)
Take action on webhook Integration requirements and try out our most recently updated PagerDuty App for Splunk from our Partner Ecosystem
Stay in the know with Webhook V1/V2 Product Deprecations and User Administration call to actions
Sign up for upcoming Events, view our recent Podcasts and Twitch Streams, and join members of our Community Team and Product Teams as they walk you through new products and leading practices in the software industry

Incident Response

Schedules List UI Refresh

The new visual experience on our Schedules page is now generally available on all accounts. This redesign delivers the following to our customers:

Enhanced search functionality
Toggling between all schedules and team schedules
Collapsing or expanding information based on the level of detail needed
Easier schedule comparison with time windows displayed
Streamlined view of on-call responders with shift times

Early access rollout to accounts is ongoing.

Learn more about Schedules

Status Update Notification Templates

Our new notification templates give you standardized and reusable formatting for status updates during incidents. You can now:

Easily include common incident details via a drop-down menu that populates key information
Add logos, hosted third-party images, and other contextual information, as needed
Ensure updates consistently adhere to your company’s internal communication standards

This feature has already started rolling out and should reach all customers by mid-June.

See it in action below!

Learn more in the knowledge base or watch it in action later

New SMS Formatting Standard for Incident Notifications (including Status Updates)

We’re standardizing the way that incident notifications are formatted for SMS. As of May 23rd, all SMS incident notifications (including status updates) start with the prefix [PagerDuty].

Learn more about notifications

PagerDuty® Process Automation Software and PagerDuty® Runbook Automation

New AWS Plugins for Automated Diagnostics

Want to reduce the number of people pulled in to troubleshoot incidents? PagerDuty® Process Automation already helps first responders fetch diagnostic data typically only retrievable by domain experts. Now users can get up and running with auto-diagnostics for AWS faster via new AWS plugins. The Amazon CloudWatch Logs plugin retrieves diagnostic data from AWS infrastructure and applications, while the AWS Systems Manager plugin allows users to manage their global VM footprint with security best practices.

AIOps

Event Orchestration: Copy/Paste Event Rules

Ever wanted to reuse service rule configurations to avoid re-configuring rule conditions and actions from scratch? You can now simply “copy” any Service Orchestration rule and then “paste” that rule into any Service Orchestration to reuse rule conditions and actions. The best part? You still have the ability to tweak those conditions and actions for each specific service or orchestration.

Learn more about Event Orchestration or Copying Service Rules

User Administration

Email Domain Restriction Verification – Action Required!

PagerDuty has changed the default setting for the “Email Domain Restriction” feature in Account Settings from “OFF” to “ON”. Since April 15th, the feature has been enabled automatically for customers. With this update, login email addresses are restricted to domains that are owned by the account. The restriction applies only to login email addresses, not to contact method email addresses.

The following exceptions apply to the automatic enablement of the Email Domain Restriction feature for existing customers:

Email addresses using common carrier domains (yahoo, gmail, aol etc.) will not be restricted
The restriction will not be enabled for accounts with more than 10 email domains.
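The rules above can be sketched in a few lines of Python. This is only an illustration of the behavior as described in this post, not PagerDuty's implementation: the function names, the list of carrier domains, and the exact-match comparison (no subdomain handling) are all assumptions.

```python
# Illustrative model of the Email Domain Restriction rules described above.
# All names here are hypothetical; this is not PagerDuty's implementation.
COMMON_CARRIER_DOMAINS = {"yahoo.com", "gmail.com", "aol.com"}  # assumed examples
MAX_DOMAINS_FOR_AUTO_ENABLE = 10


def domain_of(email: str) -> str:
    """Return the lowercased text after the last '@' in an email address."""
    return email.rpartition("@")[2].lower()


def auto_enable_restriction(account_domains: set[str]) -> bool:
    """The restriction is not auto-enabled for accounts with more than 10 domains."""
    return len(account_domains) <= MAX_DOMAINS_FOR_AUTO_ENABLE


def login_email_allowed(email: str, account_domains: set[str]) -> bool:
    """Check a login email address against the domains owned by the account."""
    domain = domain_of(email)
    if domain in COMMON_CARRIER_DOMAINS:
        return True  # common carrier domains are not restricted
    return domain in account_domains  # exact match only; subdomains are not modeled
```

For example, with `{"example.com"}` as the account's domains, `login_email_allowed("dev@example.com", ...)` passes and `login_email_allowed("dev@other.org", ...)` does not.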

ACTION: Account owners should check that the settings and email (sub)domains are set appropriately for their needs. For further details on this topic, please refer to the knowledge base.

Integrations and Partner Ecosystem

PagerDuty App for Splunk v4.0.1

Ever feel like you’re spending too much time managing alert configurations and settings or feel overwhelmed by alert storms? Now generally available, the latest PagerDuty App for Splunk (v4.0.1) enables PagerDuty and Splunk customers to:

Allow PagerDuty to leverage the Splunk Alerts framework to save time and energy
Streamline and group incident data within PagerDuty in a way that makes sense to you to reduce alert noise and information overload

Learn more on Splunkbase

Webhooks IP Addresses

Do you have any webhooks configured in PagerDuty? If so, please be aware that since May 5, 2022, the official list of IP addresses that PagerDuty uses to send webhook calls has been provided on this page.

Please safelist those IP addresses to ensure service continuity.
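If you enforce the safelist in your own receiver rather than at the firewall, the check might look like the sketch below. The networks shown are placeholders from the reserved documentation ranges, not PagerDuty's addresses; substitute the published list. `SAFELISTED_NETWORKS` and `is_safelisted` are hypothetical names.

```python
import ipaddress

# Placeholder networks from reserved documentation ranges (RFC 5737).
# Replace these with the officially published webhook IP addresses.
SAFELISTED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_safelisted(source_ip: str) -> bool:
    """Return True if the caller's IP address falls inside a safelisted network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in SAFELISTED_NETWORKS)
```

A receiver would call `is_safelisted()` with the connection's source address and reject the request when it returns `False`.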

Learn more about webhooks IPs

Product Deprecations

Please take note and keep your teams informed of our upcoming product deprecations:

V1/V2 Webhooks

If you are currently using V1/V2 webhook extensions in your PagerDuty environment, you need to migrate them to V3 webhook subscriptions to maintain functionality. Please follow our migration guide.

Important Dates:

V1 Webhooks – V1 webhook extensions have been unsupported (no new features or bug fixes) since November 13, 2021 and will stop working in October 2022.

V2 Webhooks – V2 webhook extensions will be unsupported starting in October 2022 and will stop working in March 2023.

Required Permissions:

Admins or Account Owners can migrate an entire account.
Team Managers can only migrate webhooks for their assigned Teams.

What are Webhooks? Webhooks allow you to receive HTTP callbacks when significant events happen in your PagerDuty account, for example, when an incident triggers, escalates, or resolves. Details about the event are sent to your specified URL, such as Slack or your own custom PagerDuty webhook processor.
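To make the callback idea concrete, here is a minimal sketch of a custom webhook processor using only the Python standard library. The payload shape (an `event` object with an `event_type` and a `data.id`) is an assumption for illustration; check the webhook documentation for the real schema. `describe_event`, `WebhookHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def describe_event(body: bytes) -> str:
    """Summarize a webhook payload; the field names used here are assumed."""
    payload = json.loads(body)
    event = payload.get("event", {})
    event_type = event.get("event_type", "unknown")
    incident_id = event.get("data", {}).get("id", "unknown")
    return f"{event_type} for incident {incident_id}"


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POSTed callbacks and print a one-line summary of each."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            summary = describe_event(body)
        except (ValueError, AttributeError):
            self.send_response(400)  # body was not the JSON object we expected
            self.end_headers()
            return
        print(summary)
        self.send_response(202)  # acknowledge quickly; do heavy work elsewhere
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (not called automatically)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver; the URL you register for the webhook would point at that host and port.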

If you have any questions, please reach out to your PagerDuty contact or our support team at support@pagerduty.com.

Learn more about webhooks

Webinars & Events

Join us for the following webinars and events to learn more about PagerDuty’s recent product updates and how they benefit customers.

PagerDuty Summit 2022 (In-Person and Online)

Join us to learn new ways to orchestrate, accelerate, and elevate your critical work. Enjoy inspiring keynotes, exclusive product demos, hands-on training courses, and new perspectives on developer operations, site reliability engineering, artificial intelligence for IT Operations, process automation, and more! We look forward to empowering you with new ideas and skills, so you can be ready for anything in your digital operations. Join us in-person in San Francisco (June 7th), Sydney (June 15th), and London (June 21st), or online!
Register Today!

Supercharge Your AWS Cloud Platform with Self-Service Cloud Ops

Join Mandi Walls (PagerDuty), Mark Kriaf (AWS), and Cody Brown (Techstrong Group), as they share how you can:

Delegate self-service automation to run faster and eliminate toil
Automate AWS infrastructure tasks through one interface
Connect AWS operations to IT process workflows
Turn AWS requests into self-service operations for end users

Learn More

PagerDuty Community Twitch Stream

Subscribe and get notified when we’re live and view previous recordings
Missed one of our broadcasts? Watch any of these recent Twitch streams (PagerDuty Garage and Terraform Time) or YouTube videos (Women’s History Month Discussions):

If your team could benefit from any of these enhancements, be sure to contact your account manager and sign up for a 14-day free trial.

The post What’s New: Updates to Incident Response, AIOps, PagerDuty® Process Automation, and More! appeared first on PagerDuty.

What’s New: Updates to On-Call Management, Incident Response, Event Intelligence, Process Automation, and More! by Vera Chan

Vera Chan — Wed, 30 Mar 2022 13:00:16 +0000

We’re excited to announce a new set of updates and enhancements to PagerDuty’s Digital Operations Platform. Recent updates from the product team range from On-Call Management, Incident Response, and Process Automation to PagerDuty Community & Advocacy Events. New capabilities enable users and customers to resolve incidents faster, do the following, and more:

Enjoy an improved visual experience for existing On-Call Management scheduling capabilities
Join our early access program to try out our latest Automated Incident Response capabilities
Learn more about the PagerDuty® Process Automation rebrand and catch up on the GA release of PagerDuty® Runbook Automation, and release of PagerDuty® Process Automation On-Prem 4.0
Run PagerDuty® Automation Actions directly within CollabOps Integrations
Reduce unnecessary noise with Event Intelligence and machine learning
Read about our latest Partner Ecosystem announcements
Sign up for upcoming Events, view our recent Podcasts and Twitch Streams, and join members of our Community Team and Product Teams as they walk you through new products and leading practices in the software industry

On-Call Management

Schedules List UI Refresh

The Schedules List page will be getting a UI refresh. This new visual experience helps you find the information you need more easily and quickly, while giving you more flexibility. Additional changes include:

Enhanced search functionality
Toggling between all schedules and team schedules
Collapsing or expanding information based on the level of detail needed
Easier schedule comparison with time windows displayed
Streamlined view of on-call responders with shift times

Early access rollout to accounts is ongoing.

Incident Response

Status Update Notification Templates

Status Update Notification Templates are now in early access. This feature enables you to add the content and context needed for internal communications during incident response. These new flexible templates enable teams to add logos to their communications as well as format the text according to their company standards. If you would like to join us and take part in this early access program, please fill out this form.

Service Performance Report

The new Service Performance Report is now available via PagerDuty Labs Insights. This report helps your teams dive into the macro trends and specific details of how different services are impacted by incidents over time. Enjoy the stacked line graph that displays incidents over time per service, in addition to the Services List table that shows metrics based on specified date ranges. This is the first of four updated reports to be rolled out in the upcoming months that feature UI improvements such as new interactive visualizations, intuitive drill-down capabilities, and more filter options. Stay tuned for more!

Learn more here

Process Automation

PagerDuty® Process Automation helps customers automate everything in their business, drive up consistency, and drive down resolution times. Customers are empowered to automate IT and business procedures across their business systems. Automated processes can execute according to designated schedules, trigger automatically in response to events, or be run safely by stakeholders with the click of a button.

Before we touch on each of the three latest PagerDuty® Process Automation offerings, you can learn more about the official rebranding of PagerDuty’s automation product line from Rundeck® to PagerDuty® Process Automation, and about each product offering in more depth, here.

PagerDuty® Automation Actions

PagerDuty® Automation Actions is an add-on that enables PagerDuty customers to execute automated diagnostics and remediation for services impacted by incidents at the click of a button.

PagerDuty® Runbook Automation

Our runbook automation cloud service, previously announced as “Rundeck Cloud”, is now known as PagerDuty® Runbook Automation and is now generally available. This is our SaaS-based offering that helps businesses accelerate cloud operations by safely delegating automated IT workflows to a broader range of people in an organization.

Watch the PagerDuty® Runbook Automation Overview
Watch how a team uses PagerDuty® Runbook Automation to automate recurring tasks
Learn more about use cases and benefits of PagerDuty® Runbook Automation in the blog

PagerDuty® Process Automation On-Prem Version 4.0

Our self-managed software that supports a wide range of process automation use cases, previously known as Rundeck Enterprise, is now named “PagerDuty® Process Automation On-Prem”.

The latest 4.0 release includes:

Process Runner – The Process Runner (announced as Enterprise Runner) calls back to the cluster endpoint via HTTPS to fetch a task list, eliminating the need for SSH tunnels between zones. It meets the latest zero-trust security models as it is deployed behind the firewall, securely connects to nodes, and executes automation tasks within the network zone.
Enhanced security features – Features such as User Class Management, Enhanced Webhook Security, User Manager Password Complexity and Password Reset by Email, and Failed Login Rate Limiting deliver enhanced security to your automation solution.
Plugin enhancements – New plugin enhancements include the following: AWS Systems Manager, Azure Active Directory Single SignOn, PagerDuty User Management Job Steps, and Thycotic Key Storage Plugin.
Learn more about Enterprise Updates and Core Product Updates

View the release notes for more details

Watch how a company manages single tenants for many customers using the Process Runner and PagerDuty® Process Automation
Learn more in the release notes

Automation + CollabOps

Automation Actions + PagerDuty App for Slack

Automation Actions is now integrated with PagerDuty CollabOps, allowing responders to run automated diagnostics and remediation directly from Slack. In the context of an incident, responders can view automated actions they can take, invoke them, and see execution status and output from the automated jobs. This way, responders can continue working in the same UI they typically use to work–without having to change context.

Visit the knowledge base
Read the blog

Event Intelligence

Auto-Pause Incident Notifications

Available since February 28, Auto-Pause Incident Notifications applies machine learning to detect and pause transient alerts that historically auto-resolve themselves. Now you can remove unnecessary noise with just the click of a button. Additionally, you can leverage APIs to understand basic statistics on how many transient alerts occur for a given service.
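To give a feel for the statistics involved, the sketch below computes how often a service's past alerts resolved themselves within a short window. It is a simple frequency heuristic standing in for the machine learning described above, not PagerDuty's model; `PastAlert`, `transient_rate`, `should_pause`, the 5-minute window, and the 80% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PastAlert:
    """One historical alert for a service (hypothetical record shape)."""
    seconds_to_resolve: float
    auto_resolved: bool


def transient_rate(history: list[PastAlert], window_seconds: float = 300) -> float:
    """Fraction of past alerts that auto-resolved within the window."""
    if not history:
        return 0.0
    transient = sum(
        1 for alert in history
        if alert.auto_resolved and alert.seconds_to_resolve <= window_seconds
    )
    return transient / len(history)


def should_pause(history: list[PastAlert], threshold: float = 0.8) -> bool:
    """Heuristic stand-in: pause notifications if most past alerts were transient."""
    return transient_rate(history) >= threshold
```

A service whose alerts almost always clear on their own within a few minutes would be a candidate for pausing under this heuristic.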

Learn more here

Partner Ecosystem

AWS Financial Services Competency Partner

PagerDuty is approved for the AWS Financial Services Competency Partner Program. This competency lists PagerDuty on the AWS website as a trusted cloud provider for financial services organizations. This competency ensures that PagerDuty together with AWS can help more financial service organizations take full advantage of cloud-scale while up-leveling their digital operations management through automation, DevOps, Service Ownership processes, and streamlined communication.

Learn more here

Product Deprecations

Please take note and keep your teams informed of our upcoming product deprecations:

Impact Metrics – The sunset of Impact Metrics as a feature is complete as of February 2022

V1 Webhooks

End-Of-Support Date: November 13, 2021 – Support for Webhooks V1 ended. We have not been accepting any feature requests or bug fixes for V1 Webhooks since this date.
End-Of-Life Date: October 2022 – The end-of-life date for Webhooks V1 has been postponed from March 2022 to October 2022, when it will be replaced by Webhooks V3, which is now generally available. After this date, customers will not be able to use or create new V1 webhooks. Furthermore, the App integrations that are using V1 Webhooks will stop working, so we encourage developers to update their apps to use V3 Webhooks as soon as possible. Note that Slack V1 (which is built on V1 Webhooks) is already deprecated and no longer available to customers. For more details, please refer to this support article.

V2 Webhooks

End-Of-Support Date: October 2022 – Support for V2 Webhooks will end and we will no longer accept any feature requests or bug fixes for V2 Webhooks after October 2022.
End-Of-Life Date: March 2023 – After March 2023, customers will not be able to create or use new V2 Webhooks. Furthermore, the App integrations that are using V2 Webhooks will stop working, so we encourage developers to update their apps to use V3 Webhooks as soon as possible. Note that the Slack V2 integration (which is built on V2 Webhooks) will be available only until January 2023. This does not impact the availability of Slack V2 Next Generation (built on V3 Webhooks) to customers. For more details on steps to migrate to V3 Webhooks, please refer to this support article.

Webinars & Events

Join us for the following webinars and events to learn more about PagerDuty’s recent product updates and how they benefit customers.

New Product Releases for Automating IT Processes in the Cloud and at Higher Scale
Learn about PagerDuty’s newest runbook automation cloud offering, new features for automating IT processes across disparate environments, and the future of Rundeck by PagerDuty. View the webinar on demand
Automated Incident Response: Journey from Manual to Automated
Learn what AIR means and the problems it solves, how machines are taking the cognitive burden off humans, what the journey from manual to preventative looks like, and enjoy a 10-minute demo showcasing the power of PagerDuty AIR. View the webinar on demand
Putting Customers and Customer Service teams at the center of Incident Response with PagerDuty and Salesforce
Learn from Kat Gaines, Developer Advocate from PagerDuty, and Noopur Bakshi, Sr. Director of Product Management from Salesforce as they introduce you to how critical customer service teams communicate with customers, the benefits of using customer feedback to improve your response time, and how PagerDuty and Salesforce together break down the walls of communication/collaboration between customer feedback, Customer Service organizations, and backend Engineering teams to resolve a customer-impacting disruption. Register for the webinar today (Wednesday, March 30, 2022 11:00 AM PDT), or view it on demand

PagerDuty Community Twitch Stream

Subscribe and get notified when we’re live and view previous recordings
Missed one of our broadcasts? Watch any of these recent Twitch streams or YouTube videos:

- Feb 10, 2022 with Mandi Walls and Frank Emery – The Math & Fun Behind Nesting Event Rules with Event Orchestration

- Dec 15, 2021 with Mandi Walls and Jason Flint – PagerDuty for Facilities and Crisis Response

If your team could benefit from any of these enhancements, be sure to contact your account manager and sign up for a 14-day free trial.

The post What’s New: Updates to On-Call Management, Incident Response, Event Intelligence, Process Automation, and More! appeared first on PagerDuty.

The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call by Derek Ralston

Derek Ralston — Wed, 05 Jan 2022 14:00:35 +0000

Within DevOps, we talk a lot about the on-call process—but what about the human side of being on-call? For example, what are effective ways of managing stress and anxiety during a shift? How can one manage life situations that make being on-call difficult—such as being responsible for watching the kids during an on-call rotation? And how can an empathic team culture help prevent burnout and turnover?

In November and December 2021, on-call engineers from nine teams at PagerDuty met to have a discussion on the human side of being on-call. Here are the five key takeaways from those sessions:

Team empathy is critical
Don’t watch graphs all day
Postmortems can be stressful and require a lot of work
Low-urgency alerts reduce overnight noise
Week-long on-call can lead to burnout

Before diving into each key takeaway, let’s look at some metrics tied to the teams we talked with.

By the numbers

Here are key data points for the teams that joined the “human side of on-call” sessions:

What’s your on-call rotation size? The average on-call rotation size was 5 engineers.
Do you have a secondary on-call? 60% of teams said “yes.”
How often are you on-call? The average on-call frequency was every 3.5 weeks.
How long is your on-call shift? The average shift length was one week—several teams split this by weekdays/weekends.
How much time do you spend on-call per week (median)? The median time spent on-call per week was 4 hours. For two of the nine teams, their on-call engineer spent most of their business hours tackling on-call issues.

We plotted the hours spent on-call in this histogram. As you can see, 55% of the teams surveyed spent 0-5 hours per week on-call, 22% of teams spent 5-10 hours on-call, 11% spent 30-35 hours on-call, and 11% spent 40 hours on-call:

Histogram: Hours spent on-call by team
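The percentages quoted above can be reproduced with a little arithmetic. The per-bucket team counts below are an assumption: they are inferred from the nine teams and the quoted percentages, not taken from the raw survey data. The names `BUCKET_COUNTS`, `percent_share`, and `median_bucket` are hypothetical.

```python
# Assumed team counts per bucket, inferred from nine teams and the quoted
# percentages (hours per week on-call -> number of teams).
TEAM_COUNT = 9
BUCKET_COUNTS = {"0-5": 5, "5-10": 2, "30-35": 1, "40": 1}


def percent_share(count: int, total: int) -> int:
    """Whole-number percentage, truncated (5 of 9 gives 55, as quoted above)."""
    return count * 100 // total


def median_bucket(bucket_counts: dict[str, int]) -> str:
    """Return the bucket holding the middle team; buckets must be in ascending order."""
    total = sum(bucket_counts.values())
    middle_rank = (total + 1) // 2
    seen = 0
    for bucket, count in bucket_counts.items():
        seen += count
        if seen >= middle_rank:
            return bucket
    raise ValueError("no teams in any bucket")


shares = {bucket: percent_share(count, TEAM_COUNT) for bucket, count in BUCKET_COUNTS.items()}
```

Under these assumed counts, the middle (fifth) team falls in the 0-5 hour bucket, which is consistent with the reported median of 4 hours per week.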

Now that we’ve shared some of the details around our focus group, let’s dive into each lesson in more detail.

Lesson 1: Team empathy is critical

Team culture is everything: it sets the foundation for creating a safe space. Putting norms in place that reinforce (with words and actions) that it’s okay to ask for overrides during your on-call week is a crucial part of setting the tone for your team’s on-call experience. Cultural change isn’t something that happens overnight, but it can be developed and molded over time. While this cultural shift is taking place on your team, it’s important to actively encourage it as part of the team culture in whatever way makes the most sense for the team. For example, after requesting an override, you can thank your colleague during a team retrospective to drive positive reinforcement. If your team has their norms documented, you can also suggest that “it’s okay to ask for overrides” is added there. Additionally, as a peer or a manager, it’s important to check in on how the on-call engineer is doing, especially after major incidents. This is especially true if it’s a person’s first major incident.

Perhaps most importantly, there needs to be empathy from the team and manager towards each on-call engineer’s unique life situation. For example, having pets or kids or elderly parents can make managing on-call trickier. Additionally, going through a stressful life event, such as the death of a loved one, can compound the stress that is felt on-call. In these situations, it’s important to be proactive about suggesting that maybe an engineer shouldn’t be on-call for a particularly rough period of time.

Lesson 2: Don’t watch graphs all day

It’s important to remember that being on-call doesn’t mean that it’s your job to watch everything all day. There needs to be trust in the system that you will get paged if something goes wrong. You need to let go of what you can’t control, and be vigilant over what you can control. Rely on a team ops review meeting to do a hand-off between rotations so you are prepared for your shift. And remember that low urgency incidents don’t need push notifications—you don’t need to increase your stress levels over those.

When time permits during your on-call rotation, focus your efforts on improving the on-call situation for the next on-call engineer. For example, if there’s a particular issue that keeps happening (e.g. disks that fill up, logs that need rotating, noisy alerts), tackle a task that fixes it for the long term.

Lesson 3: Postmortems can be stressful and require a lot of work

Major incidents—which require a coordinated response between multiple teams—can be very stressful, and the additional workload of postmortems can cause even more stress. It’s one thing to handle the incident itself, but quite another when you have another week of stress after it. If resourcing allows, it can be helpful to create a working agreement for having the postmortem be completed by someone other than the primary responder on the incident. Additionally, providing recognition of the stress involved and allowing for decompression time after the incident is resolved can help. This might mean giving the on-call engineer a “cool down” period where they have more flexibility over their work schedule and ability to catch up in other areas of their life.

Lesson 4: Low-urgency alerts reduce overnight noise

When there is no immediate danger, an alert can be configured as low-urgency to ensure the on-call engineer doesn’t get paged while sleeping. To make this work effectively, the team needs to pair low-urgency configuration of alerts with onboarding of on-call engineers, so that their alert settings ensure they aren’t disrupted by low-urgency alerts. Effective on-call engineer onboarding should cover how to set up user notification settings, serving as a checkpoint to make sure a new hire’s settings are correct before they are put on rotation in PagerDuty.
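An onboarding checkpoint like this could be automated with a small check over a new hire's notification settings. The sketch below is hypothetical throughout: the channel names, the rule format, and `onboarding_problems` are illustrative and do not reflect PagerDuty's data model or API.

```python
# Hypothetical notification settings check for on-call onboarding.
# Channels that would wake someone up (illustrative names).
DISRUPTIVE_CHANNELS = {"phone", "sms", "push"}


def onboarding_problems(rules: dict[str, list[str]]) -> list[str]:
    """Return a list of issues with a user's urgency -> channels settings."""
    problems = []
    # High-urgency alerts should reach the engineer through a disruptive channel.
    if not DISRUPTIVE_CHANNELS & set(rules.get("high", [])):
        problems.append("high-urgency alerts would not page the engineer")
    # Low-urgency alerts should not use any disruptive channel.
    noisy = DISRUPTIVE_CHANNELS & set(rules.get("low", []))
    if noisy:
        problems.append("low-urgency alerts would page via: " + ", ".join(sorted(noisy)))
    return problems
```

An empty result means the settings pass the checkpoint; anything else is a prompt to fix the settings before the engineer joins the rotation.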

Lesson 5: Week-long on-call can lead to burnout

Going on-call for an entire week can be a mental grind, since you are not fully off work for the entire week. This is true even if you aren’t paged during your shift, as you are still anticipating getting paged. Finding the on-call rotation length sweet spot is tricky—it’s dependent on multiple factors, including:

The preferences of the on-call engineers on the team. This can be gauged through a survey sent to the team to collect their thoughts around on-call scheduling.
How on-call engineers feel after they finish their shift. This can be tracked over time using an end of shift on-call “Yelp review rating” of 1 (worst) to 5 (best).
How noisy the team’s services are. More noise means more stress, in which case, a shorter on-call rotation would be preferable.

Instead of being on-call for a week, other options to consider include weekday/weekend rotations, business hours/after hours rotations, or shorter shifts of 2 days, 2 days, and 3 days each week.
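As one way to picture the 2/2/3 option, the sketch below splits each week into three shifts and assigns engineers round-robin. It is a planning illustration only, with hypothetical function names, and is not how schedules are configured in PagerDuty.

```python
from datetime import date, timedelta
from itertools import cycle


def split_week(week_start: date, shift_lengths=(2, 2, 3)):
    """Yield (start, end) date pairs for each shift; end is exclusive."""
    start = week_start
    for days in shift_lengths:
        end = start + timedelta(days=days)
        yield start, end
        start = end


def assign_shifts(week_start: date, engineers: list[str], weeks: int = 1,
                  shift_lengths=(2, 2, 3)):
    """Assign engineers to consecutive shifts in round-robin order."""
    rotation = cycle(engineers)
    schedule = []
    for week in range(weeks):
        this_week = week_start + timedelta(weeks=week)
        for start, end in split_week(this_week, shift_lengths):
            schedule.append((next(rotation), start, end))
    return schedule
```

With five engineers and three shifts per week, the round-robin order shifts each week, so nobody is stuck with the same slot (such as the 3-day weekend shift) every time.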

Best practices for on-call teams

Being on-call can be stressful, but having an empathic team culture and on-call rotation schedule that works best for the team’s preferences goes a long way in reducing burnout. Interested in learning more about on-call best practices, and how to develop an empathic team culture? Check out our Best Practices for On Call Teams guide.

Credits

Thanks so much to Amy Wood, Ashwin Jiwane, Charlotte Sarfati, Chelsea Vandermeer, Hunter Watson, Japa Swadia, Katherine ChengLi, KP Singh, Liam Stewart, Marcos Wright-Kuhns, Mandi Walls, Possum Nuada, Quintessence Anx, Roma Shah, Russ Smith, Todd Whitney, Tom Graft, and Vivian Chan for your contributions to these discussions and blog post.

The post The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call appeared first on PagerDuty.

How service ownership can help you grow your operational maturity by Hannah Culver

Hannah Culver — Mon, 01 Nov 2021 13:00:18 +0000

Digital operations management is about harnessing the power of data to act when it matters the most. It’s also about having the right processes and procedures to support teams when every second is critical. Maturing your digital operations takes time, iteration, and commitment. The change won’t happen overnight. But, if you put in the effort, you’ll reap outsized benefits. You’ll be able to learn from incidents and proactively improve your services over time.

One way to improve your digital operations maturity is to adopt service ownership. In this blog post, we’ll share what service ownership is, how to make the transition once your organization announces the pivot, and how your teams will grow in maturity along the way.

So, what is service ownership?

Service ownership means that people take responsibility for supporting the software they deliver at every stage of the software/service lifecycle. That level of ownership brings development teams much closer to their customers, the business, and the value being delivered.

Benefits of service ownership are varied, but here are some of the most important:

Your teams will know who is on call and when. This helps them feel more confident in on-call, and builds accountability for the services they build.
Service reliability improves. When a team focuses on a particular service, trends are easier to notice. Issues with reliability bubble up faster, and improvements can be prioritized.
Customers experience less service degradation and downtime. Happier customers mean a more successful business. With service ownership, you can respond to incidents faster and can even resolve them before any significant customer impact.

Many organizations make this move to service ownership to innovate faster and gain a competitive advantage. The flexibility of service ownership allows you to pivot in new directions and adapt to change at a rapid pace. But this isn’t something that can be completed in isolation. Service ownership is part of a new cultural and operating model that must be adopted organization-wide to be successful. Let’s look at how to get started.

How can I adopt service ownership?

Like any worthwhile culture change, service ownership will not be an initiative you can complete within a single sprint. And you’ll need the whole organization to move in this direction for this initiative to succeed. For the purposes of this blog post, we’ll assume that your organization is ready to adopt service ownership, and your team is looking for the best way to make the change. To get started, there are a few things you can do.

Create a list of services. If you haven’t created a list of all the services in your system, work cross-functionally with other teams to understand all the moving pieces. While eventually you’ll want to include business services, you should take it step-by-step and focus on those owned by technology teams first. Once you have a list of services, it’s time to start on the “ownership” part.
Define the team that will own the service. Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team that will be supporting it via an on-call rotation. If multiple teams share responsibility for a service, it’s better to split up that service into separate services (if possible). Some organizations call this “service mitosis”: splitting one cell into two separate cells, each looking very similar to the former whole. There are several ways to decide how to separate services, such as splitting them up based on team size or the volume of code each team manages. You can read more about how we did that at PagerDuty.
Set up the on-call rotation for this service. Ensure that the people on the team share responsibility for ensuring availability of the service in production. Create on-call schedules that rotate individuals and back-up responders on a regular cadence, as well as policies that include escalation contacts.
Ensure the team is sized correctly. Services should be scoped granularly enough that the members of the owning team can quickly help identify the source of problems. A scope that is too large means the knowledge necessary to support the service is beyond what’s contained within the team. But a scope can also be too small. For example, if two microservices effectively behave as one, and fixing a problem on one means also fixing it on the other, then it might make sense to combine them.
Start small. It’s important to roll this change out incrementally. That way, you can show success early and inspire other teams to adopt this mindset. This also gives teams time to learn from others before implementing service ownership themselves. Ideally, the change should roll out smoother with each team.
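To make the on-call rotation step above more concrete, here is a minimal sketch of a round-robin schedule with a back-up responder. The team names, the start date, and the build_rotation helper are all hypothetical, and the one-week shift length is only a default. This is not how any particular scheduling tool works, just one simple way to reason about the cadence.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, shifts, shift_days=7):
    """Return a list of (shift_start, primary, backup) tuples.

    Hypothetical helper: the primary rotates through the team in order,
    and the next person in line acts as the back-up responder.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and back-up")
    schedule = []
    for i in range(shifts):
        primary = engineers[i % len(engineers)]
        backup = engineers[(i + 1) % len(engineers)]
        shift_start = start + timedelta(days=i * shift_days)
        schedule.append((shift_start, primary, backup))
    return schedule


# Hypothetical team and start date, for illustration only.
team = ["ana", "bo", "chen"]
rotation = build_rotation(team, date(2021, 11, 1), shifts=4)

for shift_start, primary, backup in rotation:
    print(shift_start.isoformat(), "primary:", primary, "backup:", backup)
```

Passing a smaller shift_days value is one way to experiment with shorter shifts if week-long rotations prove too tiring for the team.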

As your system grows and changes, make sure to adjust services, teams, and on-call rotations accordingly. This isn’t a set-it-and-forget-it motion. Instead, you should expect to change as your business does. Bake time into quarterly planning to understand how your team is faring. If you’re feeling overwhelmed, bubble up the need for more support. Teams need to make sure this feedback is given to managers, and managers are responsible for escalating accordingly.

Don’t we need some documentation for this?

Each service needs documentation, no matter how small it is. Documentation helps everyone better understand what the service is and does, how it interacts with other services, and what to do when problems arise. With this in mind, these are the most important points to touch on when creating documentation.

Naming and describing: The best-named services aren’t the ones cleverly named. When naming a service, try to think of the most simple and descriptive way to say what it does. This helps eliminate confusion down the line as you grow and scale. Make sure your description is equally informative. The description should answer questions like:

What is the intent of this service, component, or slice of functionality?
How does this thing deliver value?
What does it contribute to?
If this is part of a customer-facing feature, explain how this will impact customers and how it rolls up to the larger business component.

Determining dependencies: Services don’t operate in a vacuum. Our jobs would be much easier if an issue in one service was isolated and didn’t affect any other services. Yet, this is not the case as we move more towards microservices. You need to know which services yours depends on and what services depend on yours.

At this point, it’s extremely valuable to create a service graph that shows both the technical and business services and how they map to each other. Ideally, this would be a dynamic tool that would allow you to understand how failure in one part of the system affects the rest of the system as a whole.
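As a rough illustration of what a service graph lets you answer, the sketch below walks a small dependency map to find every service affected when one fails. The service names and the impacted_by helper are hypothetical, and a real graph would likely be generated from your service catalog rather than written by hand.

```python
from collections import deque

# Hypothetical graph: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments-api", "cart-api"],
    "payments-api": ["postgres"],
    "cart-api": ["redis"],
    "search": ["elasticsearch"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    seen.discard(failed)  # only relevant if the graph contains a cycle
    return seen


print(sorted(impacted_by("postgres", DEPENDS_ON)))  # ['checkout', 'payments-api']
```

The same walk can feed a communication plan: the returned set is the list of owning teams you would want to notify.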

Beyond mapping these dependencies, you should have communication plans for them. How will you alert dependent services when you experience an incident? How will you communicate technical problems to other line-of-business stakeholders? Laying out these plans ahead of time can help you think of incidents in terms of business response.

Runbooks: Runbooks are an important tool for teams. They’re like a cheat sheet for each service. Make sure you document how to complete common tasks and resolve common incidents. As you become more familiar with your service, you can even include automation into your runbooks. This automation can range from advanced auto-remediation sequences that can eliminate the need for human involvement for some incidents, to lightweight context gathering and script running.

Whatever stage your runbooks are at, it’s key to update these regularly. If you notice something is incorrect in a runbook during response, flag it and go back to it later. Runbooks only work if they’re reflective of the current state. Create time and space to keep these assets up to date.

And remember that runbooks aren’t a cure-all. You can’t plan for and map out resolution instructions for every incident. As your system grows, you’ll encounter novel incidents. A runbook is a tool, not a silver bullet.

How do I know what success looks like?

True success comes from the entire organization adopting service ownership. You’re never done with this initiative, as services and their needs and dependencies are constantly changing. However, you can use metrics to understand how your service is performing. And you can talk to your team and understand qualitatively how they feel about this change.

To understand service performance, you can look at a variety of tools. First, you can use analytics to understand how noisy it is, how often your team is paged, and when those interruptions occur. This can give you an understanding of how healthy your service is in the eyes of the team supporting it.
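If you want a feel for this kind of analysis, here is a minimal sketch that buckets page timestamps by when the interruption occurred. The timestamps are made up, and the 9:00 to 18:00 weekday definition of business hours is an assumption you would adjust for your own team.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page timestamps, for illustration only.
pages = [
    "2021-10-04T10:15:00",
    "2021-10-04T23:40:00",
    "2021-10-05T03:05:00",
    "2021-10-06T15:20:00",
    "2021-10-09T11:00:00",
]


def classify(timestamp):
    """Label a page as weekend, business_hours, or off_hours."""
    moment = datetime.fromisoformat(timestamp)
    if moment.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    if 9 <= moment.hour < 18:  # assumed business hours
        return "business_hours"
    return "off_hours"


summary = Counter(classify(ts) for ts in pages)
print(dict(summary))
```

A service whose pages cluster in off hours or on weekends is a strong candidate for noise reduction work, even if the total count looks modest.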

If you want to know how your service is performing in the eyes of your customers, there’s a tool for that as well. SLOs, or service level objectives, are an internal metric used to measure the reliability of a service. SLOs determine the amount of failure a service can experience before a customer is unhappy, and are created from SLIs (service level indicators).

If you’re within the acceptable level of failure (also known as the error budget), your service will be perceived by customers as reliable. If you are not meeting your SLO, it’s likely your customers are unhappy with your performance.
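The arithmetic behind an error budget is simple enough to sketch. The example below assumes a request-based SLI (good requests divided by total requests) and a 99.9% target. Both the numbers and the error_budget_report helper are hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare a request-based SLI against an SLO target.

    slo_target is a fraction, for example 0.999 for 99.9%.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests
    allowed_failures = (1 - slo_target) * total_requests  # the error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
        "slo_met": sli >= slo_target,
    }


# Hypothetical month: one million requests, 600 of which failed.
report = error_budget_report(0.999, 1_000_000, 600)
print("SLI:", report["sli"])
print("Budget remaining (requests):", round(report["budget_remaining"]))
print("SLO met:", report["slo_met"])
```

With a 99.9% target over one million requests, the budget is about 1,000 failed requests, so 600 failures leaves roughly 400 to spare. Once failures exceed the budget, the SLO is missed.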

SLOs are great tools for putting metrics to reliability and demonstrating the value of service ownership. But they’re not the only way to measure success. You also need to speak to your teams to understand their feelings.

Open discussion with teams can help bolster confidence and increase psychological safety. This is extremely important as you will encounter failure along the way. You may not size your teams correctly at the beginning of your journey, and some services might be strapped for support. You may not have the right SLOs, and need to recalibrate. Whatever the challenge you encounter, you need to stay blameless.

These hurdles mean you’re learning and improving. If you can approach them with a positive attitude and listen to the service owners, you’ll improve the reliability of your services, your system as a whole, and the happiness of your teams.

What’s my next step?

Increasing your digital operations maturity is a long road, but one worth traveling. It’s beneficial for your team, the services you run, and your customers. Adopting a service ownership mindset isn’t the only way to make these improvements, but it is a key component.

If you’re looking to learn more about service ownership, you can read our Ops Guide or watch this on-demand webinar. If you want to learn more about planning for digital operations maturity, check out our eBook. And, if you’d like to see how PagerDuty can help you move the needle on initiatives like full-service ownership (FSO) and operational maturity, try us for free for 14 days.

The post How service ownership can help you grow your operational maturity appeared first on PagerDuty.

ChatOps and Mobile Adoption: The Power of Teams Working Where They Are by Hannah Culver

Hannah Culver — Thu, 28 Oct 2021 13:00:10 +0000

The way we socialize, learn, shop, and receive care has changed drastically over the last 18 months. For many of us, perhaps one of the most drastic changes was the way we work. While work from home (WFH) was an option before the pandemic, NCCI states, “only 6% of the employed worked primarily from home and about three-quarters of workers had never worked from home.” Fast forward to 2021, and according to NorthOne, here’s how much things have changed:

Even when remote work is no longer a necessity for public health, it is here to stay and that flexibility for fully remote or hybrid work will increasingly be an option that current and future employees will look for. As Bloomberg stated, “A May survey of 1,000 U.S. adults showed that 39% would consider quitting if their employers weren’t flexible about remote work.” With the Great Resignation costing organizations top talent and heavy recruitment costs, it’s important to keep current employees happy, and the option to have options is a new standard.

But many employers worry about the productivity of remote work. Technology leaders are tasked with developer velocity and time to market. Balancing the need for innovation with maintaining high availability and reliability for the services they’re responsible for is a big task.

In the event of an incident, detecting problems and driving to resolution quickly and without customer impact is crucial. This has been a challenge for many distributed teams. While documentation, training, and knowledge sharing are all important, technology that helps teams feel closer and work where they like is an exceptional advantage.

We analyzed our own platform data and compared how teams handled urgent work in 2019 and 2020. The results showed that there were important differences between the two years, and that certain tools and practices helped teams adapt to working within distributed teams.

ChatOps tools and mobile application adoption have helped teams throughout the last 18 months work collaboratively while remote to resolve incidents faster, and these trends are only becoming more important looking forward. Here’s what the new future looks like with ChatOps and mobile applications for incident response.

Mobile adoption brings incident response to you

As we become more comfortable with working remotely, sometimes it means taking a walk to clear our heads. Or running out for coffee. Or perhaps spending the day working from the park. This flexibility allows employees to be their best selves and stay passionate about their work. Additionally, as remote work can tilt work-life balance, opportunities like this can right the scales.

But teams still need to be ready for anything, and mobile applications help them respond faster when failure happens. For instance, with mobile adoption, an engineer can acknowledge an incident while walking their dog. In a report created from our own platform data, we compared how teams fared in 2020 and 2019. Organizations with higher mobile adoption rates had 40-50% faster MTTA (mean time to acknowledge) than those with lower mobile adoption. This benefit continues to increase as an organization or account grows in size.

An improvement in MTTA can benefit the organization in a few ways. First, when an incident is quickly acknowledged, it avoids being lost, forgotten, or overlooked. Second, when teams can acknowledge an alert faster, fewer escalations are triggered and your teams have more time to work uninterrupted. Last but not least, the faster a team can jump on an alert and trigger an incident, the faster a potentially customer-impacting issue is resolved.
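For readers less familiar with the metric, MTTA is just the average gap between an incident being triggered and someone acknowledging it. Here is a minimal sketch using made-up incident records. The field names and the mtta_seconds helper are hypothetical, not the shape of any real export.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records, for illustration only.
incidents = [
    {"triggered": "2021-10-01T09:00:00", "acknowledged": "2021-10-01T09:02:00"},
    {"triggered": "2021-10-02T14:30:00", "acknowledged": "2021-10-02T14:34:00"},
    {"triggered": "2021-10-03T23:15:00", "acknowledged": None},  # never acknowledged
]


def mtta_seconds(records):
    """Return the mean time to acknowledge in seconds, or None if no acks exist."""
    gaps = []
    for record in records:
        if record["acknowledged"] is None:
            continue  # unacknowledged incidents are skipped in this sketch
        triggered = datetime.fromisoformat(record["triggered"])
        acknowledged = datetime.fromisoformat(record["acknowledged"])
        gaps.append((acknowledged - triggered).total_seconds())
    return mean(gaps) if gaps else None


print(mtta_seconds(incidents))  # 180.0
```

Skipping unacknowledged incidents is a simplification. Depending on what you want the metric to tell you, you might instead count them up to the escalation timeout.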

Teams are downloading mobile applications to assist with incident response. PagerDuty gives teams an extra leg up when responding to alerts. With a mobile app, teams can acknowledge alerts and even kick off incident response from their phone.

And once the alert is acknowledged and incident response kicks off, teams are able to respond to incidents in the tools they know best with ChatOps.

ChatOps isn’t just a buzzword

ChatOps is all about conversation-driven development and incident response. While in a chat room, team members type commands that the chatbot is configured to execute through custom scripts and plugins. These can range from code deployments, to security event responses, to team member notifications. While this method of collaboration has been around for a while, it’s grown in popularity over the last few years.
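To show the basic idea, here is a minimal sketch of how a chatbot might map typed commands to scripts. It is not the API of Slack, Microsoft Teams, or any bot framework. The command decorator, the /ack and /page commands, and the handle_message function are all hypothetical, and the handlers only return strings instead of calling real systems.

```python
import shlex

HANDLERS = {}


def command(name):
    """Register a function as the handler for a chat command (hypothetical)."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@command("ack")
def ack(incident_id):
    # A real handler would call your incident tool here.
    return f"Incident {incident_id} acknowledged"


@command("page")
def page(team, *message):
    return f"Paging {team}: {' '.join(message)}"


def handle_message(text):
    """Run the handler for a slash-style command, or ignore ordinary chat."""
    if not text.startswith("/"):
        return None  # regular conversation, nothing to do
    parts = shlex.split(text[1:])
    if not parts:
        return "Unknown command"
    name, *args = parts
    handler = HANDLERS.get(name)
    if handler is None:
        return f"Unknown command: {name}"
    try:
        return handler(*args)
    except TypeError:
        # Simplification: treats any TypeError as wrong arguments.
        return f"Usage error for /{name}"


print(handle_message("/ack P123"))
print(handle_message('/page db-team "disk full on primary"'))
```

Keeping handlers in a registry like this means a new automation is just one more decorated function, which is part of why ChatOps setups tend to grow over time.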

Especially with remote work as the norm for many teams, the ability to collaborate efficiently during incidents within the tools your team already uses has become invaluable. Microsoft Teams and Slack, two of the most popular communication tools, are now host to a variety of slash commands and custom configurations. These actions can help teams automate context gathering and incident creation, and even execute runbook automation sequences to speed up response.

Our platform data showed an increase in ChatOps adoption by 22% over the last year. While this is a significant change, it makes sense. Without the ability to swivel in your chair and ask a teammate something, ChatOps became the next best option for problem solving.

And, as the number of integrations grows, more teams are able to use the tools they love to drive faster response. PagerDuty, for example, has ChatOps integrations with tools like Atlassian, atSpoke, and many more. As integrations continue to grow in both number and quality, the work you can do via collaboration tools expands.

While this growth doesn’t happen overnight, it’s clear that with time and the right processes, teams can be remote and still excel at responding to incidents. This faster response translates into happier customers and less downtime for your services. As remote work is here to stay, more teams will need to pivot to ChatOps in order to streamline incident response and limit context switching between tools.

A connected, mobile, remote future

Chat and collaboration applications sit at the center of any efficient DevOps or ITOps team, especially if it’s distributed. Applications like Slack and Microsoft Teams enable responders to quickly and easily collaborate during incidents, reducing their time to resolution. We enable responders by offering more flexibility with chat as a PagerDuty Incident Contact method. Teams can truly work where they are by acknowledging and resolving incidents all within the tools they already use.

If you want to learn more about our State of Digital Operations Report, you can download the full version here. Or, if your teams are ready for a solution that incorporates both ChatOps and mobile use into an incident response process, try PagerDuty for free for 14 days.

The post ChatOps and Mobile Adoption: The Power of Teams Working Where They Are appeared first on PagerDuty.

What’s New: Extending our Datadog Capabilities With New PagerDuty Widgets by Hadijah Creary

Hadijah Creary — Tue, 26 Oct 2021 13:00:55 +0000

In the last two years, we have seen the rise of remote and hybrid work, and with that, a proliferation of tools and apps needed to support critical communication and collaboration. Finding that app-life balance has become increasingly complex, so simplifying “how” we work is key for every organization.

This streamlining of tooling and simplifying our work process is especially true for the IT professionals and developers charged with keeping our digital systems up and running. When things break and interruptions happen, how — and how quickly — IT responds is increasingly critical. Nevertheless, jumping across screens and tabs — first your monitoring tool, then your chat app, then PagerDuty, into your homegrown application, etc. — doesn’t really help with the whole “streamline and simplify” thing.

Our goal at PagerDuty is to meet our customers where they work, whatever tool that may be, and we’re proud to have over 600 integrations with incredible technology partners. These integrations empower our customers to get more out of the PagerDuty platform and help them define their own unique incident response processes with PagerDuty seamlessly woven into their custom technology stack.

Today we are excited to announce our latest functionality with Datadog. Our integration with Datadog serves over 3500 customers and is one of our most popular integrations. As such, the need to build upon our existing fast and effective incident response led to the development of two new widgets in Datadog. These widgets will leverage PagerDuty’s capabilities as native applications — all directly within the Datadog platform. That means you will get PagerDuty insights without the need to switch tools, screens, or tabs.

Here’s a quick overview of our new PagerDuty for Datadog widgets, just announced at Datadog Dash. If you can’t wait to try them out, you’re in luck: the widgets are already available to install. Just look for them in the Datadog UI extension marketplace to get started.

Status Dashboard by PagerDuty

The new Status Dashboard by PagerDuty Widget provides critical visibility to improve communications between response teams and stakeholders throughout an incident. It gives both technical and business responders and leaders a real-time view of their system’s health, improving overall awareness of operational issues. The Status Dashboard will display the current service status and provide the right level of context to determine the potential impact to the business when an incident is ongoing. Responders get a single pane of glass: a side-by-side view of the impacted business services and ongoing incidents, all within the Datadog dashboard. You’ll be able to:

Manually trigger an incident from the PagerDuty Datadog Marketplace to mobilize developers right away.
Display the current status of key business services along with the impacted business services while working on incidents in Datadog.
Improve communication between response teams and stakeholders during incidents.

Incidents By PagerDuty

The Incidents by PagerDuty Widget will give users the ability to take incident response actions directly from the Datadog interface. The Incident display arms responders with knowledge of ongoing incidents within PagerDuty. It also provides the ability to acknowledge and resolve incidents from the widget, removing the need to context switch between tools, while also offering the option to seamlessly navigate back into PagerDuty for more incident details or to take further action.

Key features include:

Display up to 20 high urgency and active incidents from PagerDuty right in the Datadog console.
Provide flexibility for acknowledging and resolving incidents based on where responders are troubleshooting.
Use one-click navigation to view PagerDuty incident lists, individual incident details, and service dependencies.

With these new widgets, the disruptive nature of switching from application to application will be greatly reduced for our joint customers, and users can work with the ease of knowing that all the information they need is in one place.

To install the widgets, get all the details in our integration guide.

If you’re at Dash, check out our panel, Handling Incident Response, and stop by our booth to meet our team, see a demo, and get your questions answered.

Learn more about our Datadog integration here.

The post What’s New: Extending our Datadog Capabilities With New PagerDuty Widgets appeared first on PagerDuty.