Best Practices | Tags | PagerDuty
Build It | Ship It | Own It
Fri, 05 May 2023 16:53:29 +0000

Creating End-To-End Event Driven Automation For Faster Response by Nisha Prajapati https://www.pagerduty.com/resources/webinar/creating-end-to-end-event-driven-automation-for-faster-response/ Mon, 10 Apr 2023 11:00:41 +0000 https://www.pagerduty.com/?post_type=resource&p=81675 The post Creating End-To-End Event Driven Automation For Faster Response appeared first on PagerDuty.

Jobs 101: Workflow Best Practices by Nisha Prajapati https://www.pagerduty.com/resources/webinar/jobs-101-workflow-best-practices/ Thu, 30 Mar 2023 21:09:25 +0000 https://www.pagerduty.com/?post_type=resource&p=81668 The post Jobs 101: Workflow Best Practices appeared first on PagerDuty.

Getting Started Workshop: Rundeck By PagerDuty by Nisha Prajapati https://www.pagerduty.com/resources/webinar/getting-started-with-rundeck-workshop/ Mon, 12 Dec 2022 20:49:04 +0000 https://www.pagerduty.com/?post_type=resource&p=80503 The post Getting Started Workshop: Rundeck By PagerDuty appeared first on PagerDuty.

CSOPs Certification Primer by Nisha Prajapati https://www.pagerduty.com/resources/webinar/csops-certification-primer/ Tue, 23 Aug 2022 16:54:15 +0000 https://www.pagerduty.com/?post_type=resource&p=78024 The post CSOPs Certification Primer appeared first on PagerDuty.

Harness the power of automation-first AIOps to improve MTTR by Nisha Prajapati https://www.pagerduty.com/resources/webinar/harness-the-power-of-aiops/ Tue, 23 Aug 2022 16:48:17 +0000 https://www.pagerduty.com/?post_type=resource&p=75305 The post Harness the power of automation-first AIOps to improve MTTR appeared first on PagerDuty.

Accelerate Your Modernization Journey in the Cloud by Catherine Craglow https://www.pagerduty.com/resources/ebook/accelerate-your-modernization-journey-in-the-cloud/ Mon, 08 Aug 2022 16:20:47 +0000 https://www.pagerduty.com/?post_type=resource&p=77834 The post Accelerate Your Modernization Journey in the Cloud appeared first on PagerDuty.

Automate your PagerDuty Incident Response with Slack by Nisha Prajapati https://www.pagerduty.com/resources/webinar/automate-your-pagerduty-incident-response-with-slack/ Wed, 27 Jul 2022 17:25:46 +0000 https://www.pagerduty.com/?post_type=resource&p=77420 The post Automate your PagerDuty Incident Response with Slack appeared first on PagerDuty.

Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend by Laura Chu https://www.pagerduty.com/blog/why-operational-maturity-helps-businesses-reduce-the-great-resignation-trend/ Thu, 30 Jun 2022 13:00:15 +0000 https://www.pagerduty.com/?p=76920 Why Operational Maturity helps businesses reduce the great resignation trend The past few years have led to fundamental business and cultural shifts for both companies...

The post Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend appeared first on PagerDuty.

Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend

The past few years have led to fundamental business and cultural shifts for both companies and employees. 

COVID-19 brought opportunities for companies that invested early in digital operations, while others struggled to maintain the status quo. That struggle contributed to record employee burnout and to what is now commonly called the Great Resignation.

According to a CNBC survey, 4.5 million American workers quit their jobs in March 2022, primarily because they were burned out, unhappy with their jobs, or reevaluating their lives. A 2021 PagerDuty user survey found that 64% of respondents said their organization had seen a rise in turnover, while only 34% said there had been no increase in turnover in the past year.

As the Great Resignation continues, businesses are looking for new ways to keep employees productive and reduce turnover. We know from working with customers that investing in best practices makes a big difference. 

Our Digital Operations Maturity Model steps through common behaviors and codifies cohorts by their readiness to handle real-time operational challenges. Teams can uplevel their maturity and become more proactive by investing in best practices and processes that can help ease their journey with digital transformation. 

What is the Digital Operations Maturity Model?

The Digital Operations Maturity Model helps response teams assess their current maturity stage, and architect a new digital operation that prepares the teams to detect, triage, mobilize, respond, and resolve outages and system failures in less time. 

The model encompasses five levels of operations maturity:

  • Manual – Developers are slogging through issues at hand; requests require ad hoc responses, sometimes throughout the night.
  • Reactive – System failures or customer complaints always lead to firefighting mode, without much coordination or planning.
  • Responsive – Issues are resolved as they occur, with streamlined coordination and planning.
  • Proactive – Issue management is seamless and coordinated, and issues are fixed before customers notice.
  • Preventative – Teams stay ahead of issues before they start and continuously learn from past and current incidents.

The Digital Operations Maturity Model’s goal is to help DevOps teams move toward the Proactive and Preventative stages so they can maintain the consistency, reliability, and resilience of their IT infrastructure. Teams at these stages should experience fewer service failures, faster resolution of customer-facing issues, and less employee burnout because they have a more predictable workflow and better work-life balance.

The benefits of improved operational maturity

In 2021, Tejere Oteri, a Product Marketing Analyst at PagerDuty, conducted a survey to understand the effect of good operations practices on business impact, operational health, and human factors. When analyzing the survey results, Tejere noticed differences in the way participants from Reactive organizations responded versus those from Preventative organizations.

When asked about the team’s workload, 33% of respondents from Reactive organizations felt their workload was spread evenly, but the majority wanted to see some improvement. When the same question was asked of respondents from Preventative organizations, 83% felt their workload was spread evenly and, notably, no respondents disagreed.

Tejere also found that Reactive organizations were two times more likely to experience increased employee turnover than Preventative organizations. Specifically, the data analysis showed that Preventative organizations spend fewer personal hours and sleep hours addressing incidents for work than Responsive organizations, leading to less burnout and reducing the likelihood of high turnover.

What if we mapped the maturity model to product usage?

To introduce the operations maturity model for our customers and identify trends seen across our customer base, we mapped the model to product usage behaviors. Scott Bastek, Sr Product Analytics Manager at PagerDuty, looked at each maturity stage through PagerDuty product adoption, and made some interesting findings:  

  • Reactive teams rely on monitoring tools to identify incidents, and have not gone as far as configuring a robust response to reduce MTTR.
  • Responsive teams can resolve issues as they occur by putting effort into on-call schedules, and by staging multiple levels of defense to maintain the status quo.
  • Proactive teams deploy advanced incident response functionality like service dependencies or change events to understand and diagnose issues once they occur.
  • Preventative teams leverage noise reduction, event orchestration, or analytics reports to prevent incidents from happening.
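
The mapping above can be sketched as a small lookup. This is only an illustration of the idea, not how PagerDuty scores accounts: the feature identifiers, the `maturity_stage` function, and the rule that a single signal is enough to place a team in a stage are all assumptions made for this example. The Manual stage is left out because the findings above do not map it to product usage.

```python
# Hypothetical feature identifiers, checked from the most mature stage down.
STAGE_SIGNALS = [
    ("Preventative", {"noise_reduction", "event_orchestration", "analytics_reports"}),
    ("Proactive", {"service_dependencies", "change_events"}),
    ("Responsive", {"on_call_schedules", "multi_level_escalation"}),
]


def maturity_stage(features_in_use):
    """Return the highest stage for which the team uses at least one signal feature."""
    in_use = set(features_in_use)
    for stage, signals in STAGE_SIGNALS:
        if signals & in_use:
            return stage
    # Teams that only rely on monitoring tools fall through to Reactive.
    return "Reactive"


print(maturity_stage({"monitoring_integrations"}))
print(maturity_stage({"on_call_schedules", "change_events"}))
```

A real assessment would weigh how consistently each capability is used, not just whether it is switched on.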

Where are you on the digital operations maturity model?

Leaders concerned with employee burnout and attrition should assess their current stage of maturity and implement new operational processes, using the examples in the operational maturity model as a guide. The best way to reach the highest level of operational maturity is to invest in your DevOps team with on-call scheduling, coordinated issue management, event intelligence, and process automation tools. These investments help DevOps teams become more proactive and preventative with all incidents, and reduce MTTA and MTTR across the organization.

Watch “Getting from Reactive to Proactive and Beyond” to learn more about the Digital Operations Maturity Model. In the video, Scott Bastek and Tejere Oteri dive into the levels of operations maturity, and provide more statistical analysis and insights on their findings.

You can also download PagerDuty’s latest eBook and the latest State of the Digital Operations Report for examples of how organizations can achieve the highest operational maturity level.

What is Live Call Routing? by Laura Chu https://www.pagerduty.com/blog/what-is-live-call-routing/ Tue, 28 Jun 2022 13:00:48 +0000 https://www.pagerduty.com/?p=76838 If there’s one essential thing we’ve learned from being in the business of digital operations for more than 13 years, it’s that every business has...

The post What is Live Call Routing? appeared first on PagerDuty.

If there’s one essential thing we’ve learned from being in the business of digital operations for more than 13 years, it’s that every business has a unique approach to building resilience with its bespoke tech stacks and processes. 

Many PagerDuty customers around the world are starting to provide direct access to their on-call teams with Live Call Routing (LCR). Simply put, LCR is a PagerDuty add-on feature that allows organizations to extend their customer support for incident response by dynamically routing calls or voicemails to on-call responders.  

Think of LCR as a hotline to on-call teams, making it possible for customers or staff to report incidents faster. LCR removes the operational complexity of phone-reported incidents by automating the process and ensuring that on-call teams receive and resolve incidents quickly and effectively. All inbound calls are routed based on the on-call team’s schedules, so anyone can reach the right responder immediately or leave a voicemail that becomes an incident.

What are the business challenges with live calls today?

Whether an employee is seeking help with a system failure or a customer is reporting one, human-reported incidents are documented when a first responder creates an IT ticket. The responder, however, may not be the subject matter expert (SME) and may have to call multiple teams across the business to find the right help. In some cases, the true expert is not on call, or did not arrange a replacement while on vacation. Without a system that helps the responder find an alternative expert, the incident remains stuck with no solution.

The real challenge for businesses is to reduce the mean time to acknowledge (MTTA) and mean time to resolve (MTTR). A traditional IT ticketing system records an incident, but it does not resolve it any faster.
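
To make the two metrics concrete, here is a minimal sketch of how they could be computed from incident timestamps. The field names (`created`, `acknowledged`, `resolved`), the sample data, and the `mean_minutes` helper are assumptions for illustration and do not reflect PagerDuty’s data model.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps, skipping incidents missing either one."""
    spans = [
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
        if incident.get(start_key) and incident.get(end_key)
    ]
    return mean(spans) if spans else None


# Made-up sample incidents, all created at the same moment for simplicity.
t0 = datetime(2022, 6, 28, 13, 0)
incidents = [
    {"created": t0, "acknowledged": t0 + timedelta(minutes=4), "resolved": t0 + timedelta(minutes=50)},
    {"created": t0, "acknowledged": t0 + timedelta(minutes=10), "resolved": t0 + timedelta(minutes=70)},
]

mtta = mean_minutes(incidents, "created", "acknowledged")
mttr = mean_minutes(incidents, "created", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Here MTTR is measured from creation to resolution. Some teams measure it from acknowledgement instead, so it is worth agreeing on one definition before comparing numbers.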

What are some LCR benefits?

LCR ensures that your customers get the best experience with your services. For example, your customer can have a real-time conversation with on-call staff by calling a direct line, bypassing the need to look up an on-call schedule and significantly reducing the MTTA and MTTR. In addition, LCR can forward customer calls via the same global on-call schedules and escalation rules to ensure a responder from the right team takes action on the issue.   

Suppose your responder is in the middle of a call and cannot answer the incoming request. In this case, the customer can leave a voicemail, and LCR will automatically trigger an incident for the next available responder. LCR can also provision local and international phone numbers via PagerDuty, so your business can set up your on-call teams to support customers worldwide.

Some common use cases for Live Call Routing
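
The routing behavior described above can be sketched roughly as follows. This is a simplified illustration rather than PagerDuty’s implementation: the `Responder` class, the `route_call` function, and the shape of the returned dictionaries are all hypothetical, and availability is reduced to a single flag.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    name: str
    available: bool = True  # False if off shift or already on a call


def route_call(caller, escalation_levels, voicemail=""):
    """Connect the caller to the first available responder, level by level.

    If nobody can answer, record the voicemail as a triggered incident.
    """
    for level, responders in enumerate(escalation_levels, start=1):
        for responder in responders:
            if responder.available:
                return {"outcome": "connected", "responder": responder.name, "level": level}
    return {
        "outcome": "incident",
        "incident": {
            "source": "voicemail",
            "caller": caller,
            "message": voicemail,
            "status": "triggered",
        },
    }


levels = [
    [Responder("primary", available=False)],  # busy on another call
    [Responder("secondary")],
]
print(route_call("+1-555-0100", levels))
```

If every responder at every level is unavailable, the same function returns an incident built from the voicemail, which mirrors the fallback described above.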

Based on customer interviews, we’ve identified a few common use cases for Live Call Routing:

  • An exclusive line for a key partner: A payment service has an important partner that generates 80 percent of its revenue in a large region. The service wants to give this partner “VIP” treatment by providing an exclusive phone number, enabled with Live Call Routing, for calling the support team in an emergency. Though the partner has called the number only once in the last several years, that call prevented a major outage for the entire region, saved millions of dollars, and spared several million customers from disruption.
  • Tag assets with a hotline number: A business rents out over 100 different watercraft, from boats to jet skis, along the California coast. Each type of vessel had a unique hotline number for renters to call for help. This forced the rental business and its responders to manage multiple hotline numbers, making it difficult to tell which phone number belonged to which vessel and which incidents to address. The business used Live Call Routing to consolidate these into one hotline number that serves all rented vessels and lets callers choose a service from a list. Incidents are now properly identified and routed to the right expert, who can address them in less time.
  • A direct line for the internal teams: A technology company of over 1,000 employees has multiple teams delivering many different customer services worldwide. When a service was down, tracking the responsible team and its schedule was impossible because the information was neither shared nor accessible across teams. The company enabled Live Call Routing to create one direct line that handles incidents for multiple services and connects each call to the team in charge. Each team can now resolve incidents faster and streamline operations to provide better customer service worldwide.

Why PagerDuty’s Live Call Routing?

PagerDuty’s Live Call Routing transforms how businesses manage phone-reported incidents by connecting each incident with the right experts. It drives down response time by eliminating the administrative complexity typically involved in reaching the right on-call staff, with flexible capabilities such as call routing, phone trees, global number provisioning, and more.

PagerDuty’s LCR provides 24x7 coverage and ensures that customer-reported incidents are immediately routed and escalated to the right individual or team(s). Most importantly, you can manage incidents your way by getting notified and taking action via your preferred communication methods. Even if a call goes to voicemail, it automatically becomes an incident.

We want to hear from you!

We are always looking to speak with customers and learn about ideas and use cases that could leverage Live Call Routing. We have created an “Office Hour” where you can ask questions, exchange ideas, and validate whether Live Call Routing is a good fit for your organization. Please sign up for a free 30-minute session using Calendly; we look forward to meeting with you and your team.

Learn more about Live Call Routing and its capabilities by watching “Always Reach On-Call Responders Immediately with Live Call Routing,” presented by Ben Wiegelmann, Senior Product Manager of Live Call Routing. He shares some common use cases and demonstrates key capabilities that improve your MTTA and MTTR.

How to Standardize Service Ownership at Scale for Improved Incident Response by Hannah Culver https://www.pagerduty.com/blog/standardize-service-ownership-at-scale/ Wed, 22 Jun 2022 13:00:00 +0000 https://www.pagerduty.com/?p=76833 Service ownership is a DevOps best practice where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle....

The post How to Standardize Service Ownership at Scale for Improved Incident Response appeared first on PagerDuty.

Service ownership is a DevOps best practice where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle. This level of ownership brings development teams much closer to their customers, the business, and the value being delivered.

Service owners are the subject matter experts (SMEs) for their services – and in a service ownership model, they are also responsible for responding to any production issues. For teams moving to this model, going on call can seem daunting. Maybe you’ve heard horror stories about weekends and evenings spent holding your laptop and responding to incidents? 

There’s no way to sugarcoat it: being on-call is tough. But best practices like service ownership can introduce more structure and predictability to an on-call shift so that, ideally, there is a net quality of life improvement for everyone.

Why is service ownership important?

Imagine this scenario: you’re called into a meeting because something is wrong somewhere in the system, but since you don’t have service owners determined, nobody knows who the SME is. Fifteen minutes turns to 20, and then 30, and so on. Meanwhile, more people are jumping on the call, yet making no progress.

This type of chaotic incident response wastes precious time – it’s the epitome of inefficiency. And the worst part is that it still happens all the time. 

It doesn’t have to be this way. But first, let’s examine why so many teams are burdened by manual incident response that drags out forever. When you look at the reasons for the slowdown, it boils down to teams not being able to answer a few very important questions:

  • What services are impacted? 
  • Who owns those services?
  • What are these services’ dependencies – and who owns those services?

Meetings like the example above attempt to answer these questions, but in a reactive manner. Until teams can answer them, they are at a standstill and cannot make progress on resolving the incident.

This is becoming more and more common as the technology ecosystem continues to change and grow more complex at companies of all sizes. Hundreds of services, microservices, and distributed ownership make it hard to know how to take action when something goes wrong.

Service ownership can help organizations become more proactive about incident response. Nevertheless, this is no walk in the park. Cultural change is hard. Even the most successful organizations that have made the shift to DevOps and service ownership would agree that following best practices, and having a defined process for adopting service ownership, helps the change stick and scale across the entire organization.

When organizations are able to adopt service ownership, everyone—from service owners, to executive stakeholders, to customers—benefits. Service owners are only called in when necessary. Stakeholders know what’s affected by an incident, and can work with the technical team to mitigate impact. And customers will encounter a shorter service disruption with clearer communication throughout. 

In a world where customer expectations have never been higher, and customer experience is key, this can put your organization above the competition – all while making life better for the people who respond to the incident.

But what actually is a service?

Defining a service can be trickier than it may seem at first glance. We’ve seen organizations split services many ways, and it’s not always as simple as matching services to what’s deployed in the cloud. For some organizations, there’s a monolith that needs to be taken into account as well. So how can you determine how to break things up into manageable pieces for which a team can be responsible?

At PagerDuty, we define a service as “a discrete piece of functionality that provides value and that is wholly owned by a team.” Another way to think of it is that a service represents an entity you monitor, and serves as a container for related incidents that associates the incidents with the right escalation policies. 

In short, it breaks down like this: if you monitor it, and you want incidents to be associated with it, and you want certain people to be on call for it, then it’s a service. This is a broader definition that allows more flexibility in how teams might define unconventional services. 

However, responders need to know more than just these boundaries to be fully prepared to deal with issues. This is where service configuration can make a big difference.

What makes a service well-configured?

At PagerDuty, we’ve established a set of standards that we feel are valuable to organizations looking to further their service ownership journey. These act as guidelines for how we create our services, and determine what “good” looks like. 

They’re flexible as well. Not every service is built the same, and some of our standards may not apply in each circumstance. Think of them as a jumping-off point that our customers can use to make on-call more efficient and less painful for their first-line responders.

It’s important to note that each organization will ramp differently, and that service ownership is a process, not a single box to be checked off a to-do list. Depending on your operational maturity, you may need to set and adopt standards at a different pace.

If you’re relatively small and new to service ownership, with only a handful of mostly cloud-based services, you may be able to set standards and configure your services accordingly in a few days. If you’re starting from scratch, it’s even easier: you can apply these standards when you create your very first services, setting you up for long-term success without needing to go back and make changes to previously configured services.

But if you’re a larger organization with hundreds or even thousands of services, this might be a tougher shift. For these organizations, here are a few questions that can help you think about how to move forward:

  1. What subsets of existing services could you set standards for today, and what are those standards? You may find that some standards are easy to apply to all your services. For example, each service should have a name that accurately describes what it does. If there are standards like this that you know the majority of services should follow, then that’s a good place to start implementing. Think about how you could ask pilot teams to make these changes.
  2. What does the process for creating net new services look like? You may have your standards determined, but changing all your current services to meet these standards is a difficult undertaking. If you’re a larger organization, it’s not usually feasible to reconfigure all services at once – and reconfiguring services can be more frustrating than following a process to set them up correctly in the first place.
  3. What is your long-term goal, and what does a timeline look like for that? Some services may not need these standards, and that’s okay. Make a plan for the rest of the services with a deadline, then start onboarding additional teams to the process, making small, incremental changes over time.
  4. How do you know your dependencies? Beyond creating and applying standards, it’s also important to know how your services map to each other and affect one another. While establishing standards, think about how you can encourage teams to codify this information during the configuration process.

Individually, the answers to these questions may not seem like big differentiators, but at scale they make a big difference in how well you respond to incidents.
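
One way to pilot the first question is a small audit script that flags services missing a standard. The sketch below assumes services are described as plain dictionaries. Only the descriptive-name standard comes from the text above; the other checks, the field names, the `GENERIC_NAMES` list, and the `check_service` function are hypothetical examples of standards a team might choose.

```python
# Placeholder names that say nothing about what a service does (assumed list).
GENERIC_NAMES = {"service", "test", "new service", "untitled"}


def check_service(service):
    """Return the list of (hypothetical) standards this service does not meet."""
    failures = []
    name = service.get("name", "").strip()
    if len(name) < 4 or name.lower() in GENERIC_NAMES:
        failures.append("descriptive name")
    if not service.get("owner_team"):
        failures.append("owning team")
    if not service.get("escalation_policy"):
        failures.append("escalation policy")
    # An empty list is fine: it states explicitly that there are no dependencies.
    if "dependencies" not in service:
        failures.append("declared dependencies")
    return failures


services = [
    {"name": "Checkout API", "owner_team": "payments", "escalation_policy": "payments-oncall", "dependencies": ["Card Gateway"]},
    {"name": "test", "owner_team": "", "escalation_policy": "default"},
]

for service in services:
    failures = check_service(service)
    print(service["name"], "OK" if not failures else f"missing: {', '.join(failures)}")
```

Running a check like this against a pilot team’s services first gives a quick sense of how much reconfiguration a given standard would require before rolling it out more widely.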

How does this help incident response?

During incident response, it’s important that you don’t waste time or energy on work that doesn’t matter. Everything must be pared down to what the team needs to focus on to resolve the incident. 

Service ownership helps you gain that clarity throughout the response process:

For instance, if you’ve configured your service well, you’ll be alerted with the correct urgency and minimal alert noise, allowing you to respond to only the most important signals and prioritize accordingly. You’ll also be able to get the right people on the scene quickly, since you’ll know who the service owners are. As you grow in maturity, you’ll also be able to create automation sequences for your services that help you reduce the work required to return service to normal.

Diagnosing what went wrong is also easier, as you’ll see what changed on the service. And with service mapping, you can understand the overall impact to the system.
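
As a rough illustration of service mapping, the sketch below walks a dependency map to answer the three questions from earlier: which services are impacted, who owns them, and what depends on what. The service names, owners, and the `impacted_by` function are invented for this example and are not part of any PagerDuty API.

```python
from collections import deque

# Each service mapped to the services it depends on (made-up example data).
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}
OWNERS = {
    "checkout": "storefront-team",
    "cart": "storefront-team",
    "payments": "payments-team",
    "inventory": "supply-team",
    "database": "platform-team",
}


def impacted_by(failed, depends_on):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the map so each service points at the services that rely on it.
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen - {failed}


for service in sorted(impacted_by("inventory", DEPENDS_ON)):
    print(f"{service} (owner: {OWNERS[service]})")
```

With this example data, a failure in `inventory` reaches `cart` and `checkout` but leaves `payments` untouched, which is the kind of scoping that keeps responders and stakeholder updates focused.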

During resolution, you can work faster with the integrations that your service needs, as well as keep stakeholders informed. You can streamline communication to only those people who you know will be affected by your incident, keeping the impact to a minimum even within the organization.

Lastly, you’ll learn from incidents better. As the SMEs for your service, you’ll gain historical context, and feed those learnings back into your response process, making you more resilient over time.

As you scale service ownership across the organization, these improvements make a drastic difference to both customers and teammates. If you’re looking to adopt service ownership or improve your operational maturity, and want a partner that can guide you through the process, try PagerDuty for free for 14 days. If you’d like to learn more about standardizing service ownership at scale, check out this webinar.
