PagerDuty
Build It | Ship It | Own It
Wed, 16 Aug 2023 17:32:58 +0000

Three Teams That Can Use AIOps to Work Smarter, Not Harder
by Hannah Culver
https://www.pagerduty.com/blog/3-use-cases-for-aiops/
Mon, 28 Aug 2023 12:00:29 +0000

The post Three Teams That Can Use AIOps to Work Smarter, Not Harder appeared first on PagerDuty.
There isn’t a boardroom today that isn’t asking which applications of AI and generative AI can help drive efficiency and accelerate their business. For organizations looking to capitalize on ML and automation to improve their efficiency during incidents, AIOps is a tangible, proven application that presents an exciting opportunity for ITOps teams.

As we’ve seen across market landscape evaluations, there are a number of ways that solutions can be implemented. Despite this, the problems AIOps solutions aim to address remain fairly consistent: fewer incidents and faster resolution. But which teams stand to benefit from this powerful technology, and how will AIOps help them achieve their desired business outcomes?

Understanding how different teams can implement best practices to see a reduction in MTTR, total incidents, and time to adopt automation will help ensure that each team is getting value from your investment. Here are three teams that stand out as having much to gain from leveraging AIOps: Network Operations Center (NOC) teams, Major Incident Management (MIM) teams, and distributed service owning teams. Let’s cover each.

NOC teams

If you have a NOC, it acts as your central nervous system. You may also be in the middle of undertaking modernization efforts to reduce both cost and risk.

Many of our NOC customers tell us about challenges such as:

  • Eyes-on-glass operational style causes incidents to go undetected
  • Catch and dispatch means too many escalations to SMEs or routing incidents to the wrong team
  • Manual work drives up MTTR
  • L1/L2 teams experience high turnover and blame culture is common

To move beyond this, organizations can create L0 automation. This is automation that serves as the first responder, only bringing in humans when necessary. For well-understood, well-documented issues, L0 automation can auto-remediate incidents without a responder intervening. But for other more complex issues that require a hands-on approach, NOC teams can create L0 automation that immediately pulls in diagnostic information before the responder looks at an incident, routes incidents intelligently according to event data, and populates the incident notes with pertinent documentation and runbooks.
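To make the idea of L0 automation concrete, here is a minimal sketch of a first-responder function. It is not PagerDuty's implementation: the Incident fields, the lookup tables, and the helper names are all hypothetical, and the remediation is a stand-in for a real job. It only shows the decision flow described above, which is to auto-remediate known issues and otherwise gather diagnostics, attach a runbook, and route.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    kind: str                      # hypothetical issue type, e.g. "disk_full"
    service: str
    notes: List[str] = field(default_factory=list)
    status: str = "triggered"
    assigned_team: str = ""


# Hypothetical lookup tables; a real NOC would load these from its own tooling.
REMEDIATIONS: Dict[str, Callable[[Incident], bool]] = {
    "disk_full": lambda incident: True,   # stand-in for a real cleanup job
}
RUNBOOKS = {"db_latency": "runbooks/db-latency.md"}
OWNERS = {"checkout": "payments-team"}


def collect_diagnostics(incident: Incident) -> str:
    # Placeholder for real diagnostic collection (logs, metrics, and so on).
    return f"diagnostics collected for {incident.service}"


def l0_first_responder(incident: Incident) -> Incident:
    # Well-understood issue: try to fix it without paging anyone.
    remediate = REMEDIATIONS.get(incident.kind)
    if remediate is not None and remediate(incident):
        incident.status = "resolved"
        incident.notes.append("auto-remediated by L0 automation")
        return incident

    # Otherwise, prepare the incident before a human looks at it.
    incident.notes.append(collect_diagnostics(incident))
    runbook = RUNBOOKS.get(incident.kind)
    if runbook:
        incident.notes.append(f"runbook: {runbook}")
    incident.assigned_team = OWNERS.get(incident.service, "noc")
    incident.status = "escalated"
    return incident
```

With these sample tables, a "disk_full" incident resolves on its own, while a "db_latency" incident on the checkout service reaches the payments team with diagnostics and a runbook already in its notes.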

PagerDuty AIOps helps NOCs modernize and move away from eyes-on-glass methods. These NOCs are a center of excellence within their organizations, spearheading data-driven optimization, enabling best practices, and ensuring incident readiness.

MIM teams

When critical, customer impacting incidents happen, you don’t have time to waste. But, with complexity and noise on the rise, how do Major Incident Management teams improve to meet growing customer expectations?

We see MIM teams with common challenges such as:

  • Finding out about major incidents from an overwhelming number of customers/users calling in or from delayed team escalations
  • Lack of context as initial triage takes too long to assess severity and business impact
  • Long MTTR waiting for the right people, the right diagnostics, the right runbooks, etc
  • Disjointed tooling leading to communication barriers for responders and corresponding teams

MIM teams can overcome these challenges with a variety of automation and ML tactics. First, organizations can create automation that immediately routes high-priority or high-severity incidents to a MIM team and tags in the appropriate teams via incident workflows. Additionally, ML can gather key context such as how rare an incident like this is, whether it has happened before and how it was resolved, and which change events might be correlated with the failure.
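The routing step can be sketched in a few lines. This is an illustration only: the priority labels, the "major-incident-team" name, and the service-to-owner mapping are assumptions made up for the example, not part of any PagerDuty API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    title: str
    priority: str                  # assumed scale: "P1" (highest) to "P5"
    affected_services: List[str]


# Assumptions for illustration: which priorities count as "major",
# and which team owns which service.
MAJOR_PRIORITIES = {"P1", "P2"}
SERVICE_OWNERS = {"checkout": "payments", "database": "data-platform"}


def plan_response(incident: Incident) -> List[str]:
    """Return the teams to pull in, with the MIM team first for major incidents."""
    owners = sorted(
        {SERVICE_OWNERS[s] for s in incident.affected_services if s in SERVICE_OWNERS}
    )
    if incident.priority in MAJOR_PRIORITIES:
        return ["major-incident-team"] + owners
    return owners
```

A P1 touching checkout and the database would bring in the MIM team plus both owning teams right away, while a P4 goes only to the service owner.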

PagerDuty AIOps helps MIM teams detect major incidents faster, improve MTTR and customer experience, and save SMEs time. This reduces the cost of each incident and mitigates risk.

Distributed service owning teams

DevOps and distributed service owning teams are under more pressure than ever to deliver exceptional customer experiences. But with competing priorities and fewer resources, this is easier said than done.

Many of our customers share challenges they are facing such as:

  • Disparate monitoring tools with no central pane of glass
  • Too much noise leading to incorrect escalations and false incidents
  • Lack of context and information silos
  • Toil and time taken away from value-add initiatives

For service owning teams looking to overcome these challenges, an AIOps tool that can aggregate data from all the monitoring sources in the technical ecosystem can help bring clarity to incident response. Additionally, with ML, teams can reduce noise by automatically grouping together alerts based on context, time, and previous event data that the model has trained on. With this and the ML-surfaced triage information, incident response is streamlined so teams can get back to innovating faster.
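As a rough illustration of grouping by context and time, the sketch below folds alerts for the same service into one group when they arrive within a time window of each other. This is a simple rule-based stand-in, not the ML model described above, and the Alert fields and the five-minute default are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    service: str
    summary: str
    timestamp: float               # seconds since epoch


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts per service when each arrives within the window of the previous one."""
    groups: List[List[Alert]] = []
    open_group: Dict[str, List[Alert]] = {}   # most recent group for each service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.service)
        if current and alert.timestamp - current[-1].timestamp <= window_seconds:
            current.append(alert)              # same service, close in time
        else:
            current = [alert]                  # start a new group
            open_group[alert.service] = current
            groups.append(current)
    return groups
```

Each group would then surface as a single incident rather than one notification per alert.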

PagerDuty AIOps helps service owning teams spend less time firefighting, reduce MTTR, and create exceptional customer experiences. This improves culture and team retention while increasing revenue for the entire organization. 

Ready to get started?

With PagerDuty AIOps, teams like the ones we looked at see 87% fewer incidents, 14% faster MTTR, and 9x faster automation adoption. This helps organizations move faster, focus on the work that matters most to customers, and reduce risk and team burnout. Best of all, teams from dev to IT can see value from PagerDuty AIOps.

PagerDuty AIOps works in conjunction with the rest of the PagerDuty Operations Cloud to help organizations manage their operations by leveraging AI and automation to supercharge their digital transformation. With over 700 integrations, GenAI capabilities, and end-to-end event-driven automation, PagerDuty gives customers a 400% ROI and the right tools to leapfrog the competition.

To try PagerDuty AIOps out yourself, you can take an interactive product tour or try us for free for 14 days.

The 4 Types of Incidents as Zombies from ‘The Last of Us’
by Hannah Culver
https://www.pagerduty.com/blog/4-incident-types-as-zombies/
Thu, 18 May 2023 12:00:06 +0000

The post The 4 Types of Incidents as Zombies from ‘The Last of Us’ appeared first on PagerDuty.
Seems like everyone has watched or is watching “The Last of Us.” This show is based on a video game of the same name. It features Pedro Pascal (from “The Mandalorian”) and his latest surrogate child, Bella Ramsey (from “Game of Thrones”). But this adventure is challenging for a plethora of reasons. Most notably, zombies. In 2003, a fungus, Cordyceps, brought on a global zombie pandemic. Twenty years later, a few humans are trying to endure and survive in what’s left. Spoiler alert for anyone who hasn’t watched season one yet: it’s hard. And zombies are scary.

While incident response is rarely life or death, it can be an adrenaline spike akin to watching the show. And some of the incidents you may face have similarities to the zombies we’ve seen so far in “The Last of Us.” These incidents have a “headshot” that can help you survive against all odds.

Runners

The first zombies we see in “The Last of Us” are runners. These are fresh and may still look human compared to ones that have been infected longer. While easy to kill, there’s one factor that makes them dangerous: you never expect them. They’re novel. For those who were around for the 2003 end of the world, zombies were only a work of fiction. Nobody prepared for the end of the world (except for Bill, *sob*). For those people hanging on in 2023, runners are still jarring. They’re usually someone you know. A friend. And they change fast, as we saw at the end of episode five.

If we had to compare this one to an incident, it’d be the one that happens out of nowhere and is rare or an anomaly. The system is fine! Then it’s not and you’re thinking, “How did I miss that?” So what do you do about it? Look for the signal in the noise that tells you something is going wrong. For those infected, this could be twitching, coughing or unexpected mood swings.

There are similar warning signs for an incident. Latency a bit high? Could be nothing. But combine it with customer support noting an increase in complaints about slowness? You may have a runner. Monitoring only gets you so far. You need to make sense of the data, both from machines and humans. Correlating that data with changes in your ecosystem can help you attack your runner before it bites you.
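Here is a tiny sketch of what "combining weak signals" might look like. The function name and both thresholds are assumptions picked for the example. The point is only that neither signal raises the flag alone, but the two together do.

```python
def looks_like_runner(
    latency_ms: float,
    baseline_latency_ms: float,
    complaints_last_hour: int,
    typical_complaints_per_hour: int,
    latency_ratio: float = 1.2,      # assumed threshold: 20% above baseline
    complaint_ratio: float = 2.0,    # assumed threshold: double the usual volume
) -> bool:
    """Flag a possible incident only when machine and human signals agree."""
    latency_elevated = latency_ms >= baseline_latency_ms * latency_ratio
    complaints_elevated = (
        complaints_last_hour >= typical_complaints_per_hour * complaint_ratio
    )
    return latency_elevated and complaints_elevated
```

Slightly high latency with a normal complaint volume stays quiet. Slightly high latency plus a jump in "it's slow" tickets is worth a look.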

Stalkers

This is your garden variety of zombies. They’re not too difficult to kill. They don’t have any special abilities to speak of. And you can almost always expect them. Going down into the basement of an abandoned gas station? Of course you’ll find a stalker. Empty mall? Yep, you should have known, Ellie and Riley! Stalkers aren’t fun by any means, and can be deadly. More often than not, though, the average survivor can take care of a stalker. But what happens when there are a few stalkers all at once? Or you find 12 stalkers back to back, all in the same day? What if you’re fighting two to three stalkers every day for a year?

You can see where I’m going with this. Stalkers are like death by a thousand cuts. The more you have to tangle with them, the more dangerous they are. Like your most common incidents. They’re not fire drills, they’re annoying. And one isn’t so bad, but one every single day hurts. It takes time away from value-add work to fix something that you’ll need to fix again soon.

Automation isn’t something that Joel and Ellie can do in their world. But in our zombie-free existence, we can apply it to make incident response more efficient. For well-understood issues and incidents that happen frequently, crafting auto-remediation to resolve the problem without human intervention can immediately add time back to your day. And, it’s a great way to drive automation initiatives within the organization. Solving this small but frequent problem has a direct ROI associated with it. Leverage that to further automation initiatives for other types of incidents.
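One way to find your stalkers is to add up how much responder time each incident type has consumed. The sketch below assumes you can export incident history as (type, minutes spent) pairs. That format is made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_automation_candidates(
    history: List[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Rank incident types by total responder minutes, highest first."""
    totals: Dict[str, float] = defaultdict(float)
    for incident_type, minutes_spent in history:
        totals[incident_type] += minutes_spent
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A 15-minute fix that happens four times costs more than a single 40-minute incident, so it floats to the top of the list as the first thing to automate.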

Clickers

Clickers are ominous, obsessive hunters that are harder to kill. As they’re blind, they use echolocation to hunt their prey. Headshots don’t work as their heads are armored with tough fungus. They’re one of the most feared and hated types of zombies in “The Last of Us,” and it’s easy to see why. Can you imagine coming up against this thing and realizing your typical solution doesn’t work the way it should? And against a more dangerous enemy?

This one may be the hardest to correlate to an incident, because clickers seem to be almost impossible to kill in the show. Everyone’s advice? Run. Before they hear you. But with incidents, you can’t do that. So, if this zombie were an incident, it would be the one that only two or three people have seen before. You’ve heard about this issue, and it’s from deep in the tech stack. But not enough people who knew about this incident shared with the class. When it happens, it feels like a bigger issue than it is.

Like a knife to the neck of a clicker, there’s a solution to this type of incident. And success comes down to the same thing: knowledge and a plan. If you know that a clicker’s head has armor, you go for the neck. It’s close combat, but effective. And since enough people have survived clickers, the knowledge spread across the surviving population.

For an incident, the best way to fix your clickers is documentation, runbooks, and historical context. Someone knows how to resolve the problem. If they share this knowledge, teams can document the process and create a runbook for the next time this scary (but repairable) problem happens. Additionally, teams can rely on AI to surface past incident data. Look-alike incidents have lots we can learn from. This past incident data helps teams understand what worked for an incident and what didn’t. If you don’t have AI to assist, you can always scan through old retrospectives as well for this historical context. Centralizing all this information is also important so that everyone can find it. That way, you may not know how to solve every problem that happens, but you know how to find that knowledge. There’s power in that, even if there’s no perfect “headshot.”
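If you don’t have AI to surface look-alike incidents, even a crude text match over old incident titles can help. The sketch below uses plain word overlap (Jaccard similarity), which is a deliberately simple stand-in and not how any particular product works. The incident IDs and titles are invented.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_incidents(
    new_title: str, past: Dict[str, str], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return up to top_n past incident IDs ranked by word overlap with the new title."""
    new_tokens = tokens(new_title)
    scored = [(incident_id, jaccard(new_tokens, tokens(title)))
              for incident_id, title in past.items()]
    scored = [pair for pair in scored if pair[1] > 0]   # drop incidents with no overlap
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

The best match points you at the retrospective or runbook to read first.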

Bloaters

Bloaters look more like the demogorgons in “Stranger Things” than something that was, at one point, a human. They kill most people in the vicinity either by brute force or toxic clumps of fungus that they toss in the air like grenades. We only saw one of these in “The Last of Us” so far and it made quite the impression, annihilating most of the fighting population of Kansas City. Bloaters should be avoided at ALL costs. And any signs of them should be dealt with early before the issue compounds. Remember how the zombies were filling up the tunnels and the rebels had other initiatives to take care of? Yeah, that was technical debt and someone should have fixed it.

But that’s the way it goes. You know there’s a problem, even if you don’t know exactly how it’ll manifest. Then you’ve got a major incident on your hands–a bloater. And the best and only real way to deal with these is with a coordinated, end-to-end incident response. Make sure that you understand key components of incident response such as:

  • Escalation policies
  • Roles and responsibilities during the incident
  • Communication standards, both internal and external
  • Workflows that you can trigger automatically to take the heavy lifting off responders

With these plans in place, you will be able to resolve the incident more smoothly, faster, and with less customer impact.
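An escalation policy, for example, boils down to a list of who gets notified and how long an incident can sit unacknowledged before the next person is pulled in. The sketch below models that with made-up targets and timings. It is a generic illustration rather than PagerDuty's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationLevel:
    target: str
    after_minutes: int             # minutes unacknowledged before this level is notified


def who_is_notified(policy: List[EscalationLevel], minutes_unacknowledged: int) -> List[str]:
    """Return every target that should have been notified by now."""
    return [
        level.target
        for level in policy
        if minutes_unacknowledged >= level.after_minutes
    ]


# Hypothetical policy for illustration.
POLICY = [
    EscalationLevel("primary on-call", 0),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("engineering manager", 30),
]
```

Writing the policy down this plainly makes it easy to review before the bloater shows up, not during.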

What zombie are you worried most about?

What’s keeping you up at night? Fear of an impending bloater, or notifications about yet another stalker? While we may not find a cure for the zombies in “The Last of Us,” we can work on technology incidents and make those easier and less catastrophic for us and our customers.

PagerDuty is here to help you improve your digital operations. Whatever challenges you’re facing right now, our team can help you endure and thrive, not just survive. Check out our weekly demos to learn more.

Top 3 Incident Response Problems AIOps Can Help Your Teams Solve
by Hannah Culver
https://www.pagerduty.com/blog/top-3-incident-response-problems-aiops-can-solve/
Thu, 20 Apr 2023 12:00:58 +0000

The post Top 3 Incident Response Problems AIOps Can Help Your Teams Solve appeared first on PagerDuty.
More data for data’s sake doesn’t help anyone. What organizations need is more information–actionable insight. With data coming from incoming streams of events and alerts, teams don’t have enough time to look at each one. And they struggle to parse and consolidate this data in order to figure out what they need to do next to resolve an incident. Processing this data to make it more usable and helpful during incident response often results in a rote series of manual, repetitive tasks each time an incident occurs, wasting time. It’s no wonder teams are increasingly turning to AIOps and automation for help. AIOps helps teams turn data into information and reduce that manual work. Let’s break down three ways AIOps allows teams to overcome challenges and reduce customer disruption.

Reducing noise for fewer incidents

Not every alert should become an incident. Yet for many organizations, this is what happens. Even if you’re only experiencing one problem, you may receive dozens or hundreds of pings for the same issue. This is distracting and bogs responders down. Noise should be the first thing you focus on because eliminating it:

  • Gives responders back time when they don’t need to filter out what’s important from what’s irrelevant.
  • Decreases the cognitive load that responders carry. Responders don’t need to think about 63 different alerts. They can focus on the one that matters. This reduces on-call anxiety.
  • Reduces the distractions that get in a responder’s way during an incident. This helps responders focus on getting a fix in place faster.

To reduce noise, you can analyze the noisiest incidents you’re facing. Which ones are the same incident? Take a look at the alerts you’re receiving and see if there’s a way to group them based on event data that you gather from your monitoring tools. What’s loudest? This is an opportunity to fine tune your monitoring tools so they’re only sending you what’s most valuable. Keep in mind that this often requires routine maintenance. Monitoring tools become messy, especially when data is scattered across vendors. You’ll want to gut check this whenever you notice noise levels are increasing.
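A quick way to answer “what’s loudest?” is to count alerts by where they came from. The sketch below assumes each alert is a dictionary with "source" and "check" keys. Those field names are an assumption, so adjust them to whatever your monitoring export actually contains.

```python
from collections import Counter
from typing import Dict, List, Tuple


def noisiest_sources(
    alerts: List[Dict[str, str]], top_n: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Count alerts per (source, check) pair and return the loudest ones."""
    counts = Counter((alert["source"], alert["check"]) for alert in alerts)
    return counts.most_common(top_n)
```

The pairs at the top of the list are the first candidates for threshold tuning or grouping rules.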

PagerDuty AIOps makes it easier to reduce alert noise within a single tool. Users can set PagerDuty to ingest and deduplicate events from those disparate signals. Then PagerDuty AIOps groups the events into an existing incident, which prevents a new incident from being created. Teams have access to event data in the form of alerts without extra notifications. The result is that teams can better weather alert storms by bringing focus to what’s needed.
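The general idea behind deduplication can be sketched with a small in-memory store. This illustrates the concept only and is not PagerDuty's implementation: the class, its attributes, and the "dedup_key" field on the event dictionary are assumptions for the example.

```python
from typing import Dict, List


class IncidentStore:
    """Toy store that folds repeat events into an already open incident."""

    def __init__(self) -> None:
        self.open_incidents: Dict[str, List[dict]] = {}
        self.notifications_sent = 0

    def ingest(self, event: dict) -> bool:
        """Return True if a new incident was opened, False if the event was folded in."""
        key = event["dedup_key"]
        if key in self.open_incidents:
            # Same underlying problem: keep the event data, skip the notification.
            self.open_incidents[key].append(event)
            return False
        self.open_incidents[key] = [event]
        self.notifications_sent += 1
        return True
```

Ten events with the same key produce one notification, and all ten events stay attached to the incident for whoever investigates.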

Gaining context for better triage

Technically, all the information a responder needs to resolve an incident exists. But, it’s buried within multiple disparate streams of data. Humans alone cannot condense all this data into succinct actionable insights. This means teams spend a long time looking for answers to questions that they can leverage machine learning (ML) to find instead. ML can look at both historical event data and human interaction. Then ML translates the analyzed data into actionable insights. With ML, teams can answer key questions such as:

  • Where should my team look first?
  • Are other teams working on the same problem?
  • Is this a common incident or completely new?
  • Have we seen this before; how was it resolved?
  • Did any relevant changes occur before this incident?

But developing your own ML can be a daunting task. It requires time and resources such as headcount. Many organizations choose to partner with a vendor instead.

PagerDuty AIOps ML algorithms help surface critical information such as:

  • Probable Origin: determines probable cause based on previous incidents affecting your service.
  • Related Incidents: shows whether a current incident is affecting your service.
  • Outlier Incidents: indicates whether this incident happens frequently, rarely, or is a total anomaly.
  • Past Incidents: lets you look at the incident details and see how responders resolved it in the past.
  • Change Correlation: connects with your change integrations to show changes to your service, then leverages ML to correlate patterns between change events and incidents.

Each time this information is surfaced for your team without having to manually dig, you get to resolve the incident faster. That decreased MTTR provides you with more time to focus on value-add initiatives.
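To give a feel for the change-related question, here is a minimal sketch that lists changes made shortly before an incident started. It is a plain time-window filter, not the ML correlation described above, and the one-hour default and the (timestamp, description) format are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Change = Tuple[datetime, str]      # (when it happened, what changed)


def recent_changes(
    incident_time: datetime,
    changes: List[Change],
    window: timedelta = timedelta(hours=1),
) -> List[Change]:
    """Return changes made within the window before the incident, newest first."""
    start = incident_time - window
    hits = [change for change in changes if start <= change[0] <= incident_time]
    return sorted(hits, key=lambda change: change[0], reverse=True)
```

A deploy that landed 15 minutes before the incident is listed first. A certificate rotation from three hours earlier is left out.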

Self-healing by crafting auto-remediation

One initiative you can focus on to spend less time firefighting is automation. This is where you can orchestrate a fix and self-heal before the problem even becomes an incident. It’s resolved before it hits a responder. Now someone gets to sleep through the night instead of responding to a notification. But this initiative can seem very intimidating. The reality is that starting small and tackling low-hanging fruit can make self-healing easier than you may expect.

You can identify well-understood resolution scenarios where you can automate the response. These may be scenarios that your team would classify as frequent, or ones where the resolution is straightforward. Teams can then create automation to resolve these without human intervention. Then, as that automation starts to take effect, your teams will start to free up time to work on new automation initiatives.

PagerDuty’s Event Orchestration helps teams create automation that spans the entire technical ecosystem. Event Orchestration enriches and routes events, then kicks off automation to self-heal. This feature allows users to trigger remediations for well-understood incidents via webhook. For more complex issues where auto-remediation might not be a possibility, teams can also leverage automation to kick off diagnostics. This builds upon the triage information responders have when they first view their incident.

Looking to get started with AIOps?

AIOps can help teams see fewer incidents and faster resolution. PagerDuty can help you achieve this, and more, with PagerDuty AIOps. See PagerDuty AIOps in action by requesting a trial or taking our product tour. In the market for AIOps? Read our buyer’s guide.

Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration
by Hannah Culver
https://www.pagerduty.com/blog/global-event-orchestration-generally-available/
Tue, 18 Apr 2023 12:00:58 +0000

The post Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration appeared first on PagerDuty.
PagerDuty’s Global Event Orchestration is now generally available. Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities.

Customers in our early access program are already seeing value in Global Event Orchestration, touting reduced MTTR and better standardization of incident response at scale. As Kiril Yurovnik, Technical Lead at Riskified, said, “With a growing number of events, minimizing noise and toil is imperative, especially as organizations aim to optimize their IT processes amid the current economic environment. We’ve been using PagerDuty’s Global Event Orchestration as part of the early availability program, and the results have been strong. Riskified has been able to scale noise reduction, especially from non-production environments, saving our team valuable time to spend time innovating on what’s next.” 

What are Global Event Orchestrations?

Global Event Orchestration is like Service Event Orchestration in that it allows users to define complex rules that determine what happens to an event as it is processed. The difference is that Global Event Orchestration enriches events at ingest. Then, once the data is normalized, the event is routed to a service based on various criteria. This ensures that responders have the best event data possible to begin the response process.
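The enrich-then-route flow can be pictured with a short sketch. This is a conceptual illustration rather than the Event Orchestration rule syntax or API: the host-name convention, the rule list, and the destination names are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]


def enrich(event: Event) -> Event:
    """Add an 'environment' field at ingest (assumes prod hosts start with 'prod-')."""
    enriched = dict(event)
    host = event.get("host", "")
    enriched["environment"] = "production" if host.startswith("prod-") else "non-production"
    return enriched


# Hypothetical ordered rules: the first matching condition decides the destination.
RULES: List[Tuple[Callable[[Event], bool], str]] = [
    (lambda e: e["environment"] != "production", "suppressed"),
    (lambda e: e.get("component") == "database", "database-service"),
]
DEFAULT_DESTINATION = "noc-triage"


def route(event: Event) -> str:
    """Enrich the event first, then route it using the enriched data."""
    enriched = enrich(event)
    for matches, destination in RULES:
        if matches(enriched):
            return destination
    return DEFAULT_DESTINATION
```

Because enrichment runs before routing, every rule can rely on the "environment" field being present, whichever tool sent the event.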

Global Event Orchestration has three key components that make it successful for scaling incident response. 

Global Orchestration Rules allow users to apply actions to events across services. Teams can create rules which process event data across services and use the processed data to improve event routing. This empowers organizations to establish and improve on auto-remediation. This means that a human doesn’t need to be involved in an incident to resolve it. This also reduces the blast radius of an incident via more intelligent routing.

Enhanced integration key management reduces the workload of managing integration keys for different monitoring tools. This allows users to combine integration keys into one event orchestration. Even better, enhanced integration key management is now available for all PagerDuty plans.

Additional APIs allow for management at scale. Teams can use REST APIs for event source and Global Orchestration Rule management. Both of these APIs have Terraform support. These APIs are in addition to the REST APIs for Event Orchestration/Service Orchestration management.

“Leveraging PagerDuty’s Global Event Orchestration has been critical to ensure that our event routing processes are efficient and scalable to optimize IT operations and spend,” said Brian Long, Cloud Infrastructure Engineer at Hyland. “With Global Event Orchestration, our organization is able to detect the “resolved” condition from our notifications to execute as a resolve and reduce the number of places these conditions need to be configured by at least a factor of three. This frees up our time to focus on innovation, not configuration.”

How can Global Event Orchestration help my team?

With Global Event Orchestration, teams will see:

  • Codified incident response processes: democratize and distribute well-understood incident responses across distributed teams
  • Fewer incidents: use contextual event data from all services within your ecosystem to improve suppression accuracy
  • Faster resolution: apply automation across teams and enable automated diagnostics at scale with standardized enrichment and data normalization

How teams use Global Event Orchestration may vary based on organizational structure. Capabilities align with two groups: operations teams (ITOps, SRE, and NOC) and developer teams.

ITOps teams will be able to capitalize on the event normalization capabilities, ensuring that all events look the same as they come in.
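As a rough picture of what event normalization means, the sketch below maps payloads from different monitoring tools onto one common shape. The input field names and the severity vocabulary are invented for the example and do not reflect any specific tool or PagerDuty's schema.

```python
from typing import Dict

# Assumed mapping from tool-specific severity labels to a shared vocabulary.
SEVERITY_MAP = {
    "crit": "critical", "critical": "critical", "sev1": "critical",
    "warn": "warning", "warning": "warning", "sev3": "warning",
    "info": "info",
}


def normalize(raw: Dict[str, str]) -> Dict[str, str]:
    """Map a tool-specific payload onto common summary/source/severity fields."""
    label = str(raw.get("severity") or raw.get("level") or raw.get("priority") or "info")
    return {
        "summary": raw.get("summary") or raw.get("message") or raw.get("title") or "",
        "source": raw.get("source") or raw.get("host") or "unknown",
        "severity": SEVERITY_MAP.get(label.lower(), "info"),   # unknown labels fall back to info
    }
```

Once every event looks the same, routing rules and automation only need to be written once.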

SRE teams can create and extend automation across any or all services within a technical ecosystem. This makes scaling and standardizing automation across an organization easier than ever.

For L1 response teams such as NOCs, Global Event Orchestration helps them handle the massive incoming wave of events. Events can be routed to the NOC if they meet certain criteria. And, as the event passes through levels of rules and nested rules, automation can deliver diagnostics to the L1 responder. If the fix for an incident is well-known, organizations can create auto-remediation.

Developer teams will see fewer incidents and faster resolution. With auto-remediation, incidents can be resolved before they even hit the services that the developer teams are on call for. And, with in-depth routing criteria, incidents don’t bounce from team to team. If automation or the NOC or L1 responders can’t resolve it, the incident will go to the subject matter expert (SME). And, by the time the SME begins to work on the incident, diagnostic information is already available, reducing resolution time.

How can I get started today?

Global Event Orchestration is generally available for all PagerDuty AIOps customers. To see it in action, join us on Twitch Friday, April 14. 

PagerDuty AIOps helps teams experience fewer incidents, faster resolution, and greater productivity without long implementations or heavy ongoing maintenance. To try PagerDuty AIOps, you can request a trial here or take our product tour. If you want to talk to sales, contact us through this form.

To learn more about Global Event Orchestration, register for this webinar. If you’re a PagerDuty AIOps customer looking to create your first Global Event Orchestration, this knowledge base article can show you how to get started.

Introducing PagerDuty AIOps: Harnessing the Power of AI to Transform Modern Operations for the Enterprise
by Hannah Culver
https://www.pagerduty.com/blog/introducing-pagerduty-aiops/
Tue, 11 Apr 2023 12:00:40 +0000

The post Introducing PagerDuty AIOps: Harnessing the Power of AI to Transform Modern Operations for the Enterprise appeared first on PagerDuty.
Today, PagerDuty launched a new AIOps solution to leverage the power of AI, provide built-in automation and build on the company’s foundation data model to transform modern operations for the enterprise. PagerDuty has long suppressed noise to help distributed development teams focus. Now, PagerDuty AIOps addresses the large-scale event correlation, compression, and automation needs of ITOps, Command Centers, NOCs, and SRE teams with Global Event Orchestration (now generally available), and Global Alert Grouping (EA in H2 2023). If you’re interested in being a part of the early access program for Global Alert Grouping, sign up here. Going beyond event management, PagerDuty AIOps helps organizations work more efficiently, including giving them the ability to execute end-to-end, event-driven automation.

Our early access customers are already seeing results with PagerDuty AIOps, including 87% average noise reduction, automated incident response deployed 9x faster than with existing solutions, and 14% faster MTTR.

As Kiril Yurovnik, Technical Lead at Riskified, said, “With a growing number of events, minimizing noise and toil is imperative, especially as organizations aim to optimize their IT processes amid the current economic environment. We’ve been using PagerDuty’s Global Event Orchestration as part of the early availability program, and the results have been strong. Riskified has been able to scale noise reduction, especially from non-production environments, saving our team valuable time to spend time innovating on what’s next.”

You can see PagerDuty AIOps in action by taking our product tour.

What is PagerDuty AIOps?

According to PagerDuty platform data, event volumes have grown by 70% YoY. As a result, businesses suffer from too much noise and too much toil while their response teams slog through chaotic, manual response processes.  

And when ITOps and SRE teams who act as first responders for incidents lack access to crucial context and visibility system-wide, they can’t take the next best action. This operational inefficiency has a compounding effect. It increases the cost of operations, reduces productivity across the technical organization, and takes away from value-add work.

In a resource-constrained environment, teams can’t wait for year-long implementations; they need help now. Organizations are looking for a solution that has fast time to value, integrates with their existing systems, and provides fast ROI.

PagerDuty AIOps helps teams reduce noise, triage efficiently to drive the right actions towards resolution, and remove manual, repetitive work from the incident response process. PagerDuty AIOps works out of the box without requiring long implementations or heavy, ongoing maintenance. Organizations continue to see best-in-class results. Noise reduction baked in with ML models that learn and adapt based on user behavior means teams see fewer incidents overall. And end-to-end, event-driven automation ensures that resolution is faster and requires less input from humans who are needed for value-add work.

“Leveraging PagerDuty’s Global Event Orchestration has been critical to ensure that our event routing processes are efficient and scalable to optimize IT operations and spend,” said Brian Long, Cloud Infrastructure Engineer at Hyland. “With Global Event Orchestration, our organization is able to detect the “resolved” condition from our notifications to execute as a resolve and reduce the number of places these conditions need to be configured by at least a factor of three. This frees up our time to focus on innovation, not configuration.”

Here’s what PagerDuty AIOps includes: 

  • Event correlation, noise compression, and triage context functionality, freeing site reliability engineers and information technology teams from juggling multiple vendors and manual processes by consolidating them into a single powerful solution that drives to resolution quickly.
  • End-to-end automation, from event ingestion through auto-remediation, to help teams shift from reactive to proactive by capturing and actioning critical events before they become value-destroying incidents.  
  • Advanced noise reduction features (available in our early access program) that group alerts across services and allow customers to leverage both defined rules and machine learning to only surface the incidents that matter.
  • A visibility console that gives operations teams a single source of truth to monitor and quickly manage all incidents before major incidents occur with far-ranging business, IT, and financial impacts. 
  • Global Event Orchestration, a powerful decision engine to enrich and control routing or trigger self-healing actions.
  • With more than 700 integrations on the PagerDuty Operations Cloud platform, teams can trust our automation-led, people-centric AIOps solution to help save time and money.

How does PagerDuty AIOps work?

PagerDuty AIOps has sets of capabilities that help organizations standardize and scale incident best practices across all teams and services. And, it comes with new features custom-built to serve ITOps, Command Centers, NOCs, and SRE teams.

Reduce noisy incidents: reduce incident noise with the click of a button, either within a service or across services with Global Alert Grouping. Use built-in ML models, or create your own logic. And combine intelligent ML and rule-based alert grouping methods for customizable grouping capabilities. Group alerts by content, time, or other criteria for noise reduction that fits your organization’s needs.
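To make the idea of grouping by content and time concrete, here is a minimal sketch in Python. It is not PagerDuty's implementation: the `Alert` fields, the `group_alerts` function, and the five-minute window are assumptions chosen for illustration, and it uses a simple rule rather than an ML model.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str      # hypothetical field: the service that raised the alert
    summary: str      # hypothetical field: the alert's text content
    timestamp: float  # seconds since some reference point


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts that share a service and summary and arrive close together.

    An alert joins an existing group when its (service, summary) key matches
    and it arrived no more than `window_seconds` after the last alert in
    that group. Otherwise it starts a new group.
    """
    groups = []      # list of alert lists, in creation order
    open_group = {}  # (service, summary) -> index of the latest group for that key
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.summary)
        idx = open_group.get(key)
        if idx is not None and alert.timestamp - groups[idx][-1].timestamp <= window_seconds:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            open_group[key] = len(groups) - 1
    return groups
```

With this rule, a burst of identical alerts from one service collapses into a single group, while the same alert arriving much later opens a new one.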

Screen recording of PagerDuty noise reduction via alert grouping.

Accelerate triage time and drive action: Leverage ML to surface the most important information for responders immediately. When an incident occurs, responders can quickly discover the probable origin of the incident, if the incident has previously occurred, and if a change was the likely cause.

Screen recording of PagerDuty triage features including past incidents and probable origin.

Automate the redundant: Leverage event orchestration’s powerful decision engine to enrich and control routing or trigger self-healing actions based on event conditions across any or all services within PagerDuty with Global Event Orchestration.
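A decision engine of this kind can be pictured as an ordered list of rules, each pairing a condition with the fields to set and the actions to trigger. The sketch below is only an illustration under stated assumptions: the event fields, the rule format, the first-match behavior, and names such as `apply_rules` and `run_diagnostics` are hypothetical and do not reflect PagerDuty's actual rule syntax.

```python
def apply_rules(event, rules):
    """Return a copy of `event` after applying the first rule whose condition matches.

    Assumption: rules are evaluated in order and evaluation stops at the
    first match. Unmatched events fall through to a default route.
    """
    result = dict(event)
    for rule in rules:
        if rule["condition"](result):
            result.update(rule.get("set", {}))
            result["actions"] = list(rule.get("actions", []))
            return result
    result.setdefault("route_to", "default")
    result["actions"] = []
    return result


# Hypothetical rules, for illustration only.
RULES = [
    {   # treat events whose summary signals recovery as resolves
        "condition": lambda e: "resolved" in e.get("summary", "").lower(),
        "set": {"event_action": "resolve"},
    },
    {   # suppress noise from non-production environments
        "condition": lambda e: e.get("environment") != "production",
        "set": {"suppress": True},
    },
    {   # enrich and route database events, then trigger a diagnostic action
        "condition": lambda e: e.get("component") == "database",
        "set": {"route_to": "database-oncall", "priority": "P1"},
        "actions": ["run_diagnostics"],
    },
]
```

Keeping the rules in one place, rather than repeating them per service, is the point of a global rule set: a condition such as "resolved" only needs to be configured once.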

Screenshot of PagerDuty Global Event Orchestration rule builder.

Visualize what matters: Create a custom dashboard that provides a comprehensive view of your operations posture across services. Additionally, you’ll get full visibility into your event data so that you can prioritize what gets ingested and processed and have total transparency into your event usage.

Screen recording of PagerDuty Visibility Console where users can visualize all their event data.

How can I get started with PagerDuty AIOps today?

Current PagerDuty customers on Professional or Business plans can purchase PagerDuty AIOps themselves from the account subscriptions menu.

For Event Intelligence customers, contact your account team about migration options to get access to new features available in PagerDuty AIOps. For more details, please see our knowledge base article.

Whether you’re a current PagerDuty customer or looking to get started, you can see PagerDuty AIOps in action by requesting a trial or taking our product tour. If you have questions and want to speak with our sales team, you can reach out here.

The post Introducing PagerDuty AIOps: Harnessing the Power of AI to Transform Modern Operations for the Enterprise appeared first on PagerDuty.

]]>
Say Goodbye to the ‘Executive Swoop and Poop’ with Status Update Notification Templates by Hannah Culver https://www.pagerduty.com/blog/status-update-notification-templates-now-generally-available/ Wed, 25 Jan 2023 14:00:52 +0000 https://www.pagerduty.com/?p=80995 Incidents are unpredictable, but how you share updates with stakeholders doesn’t have to be. Status Update Notifications Templates help teams streamline communication with internal stakeholders...

The post Say Goodbye to the ‘Executive Swoop and Poop’ with Status Update Notification Templates appeared first on PagerDuty.

]]>
Incidents are unpredictable, but how you share updates with stakeholders doesn’t have to be. Status Update Notification Templates help teams streamline communication with internal stakeholders during a major incident. We are excited to announce that this feature has added new capabilities. Now teams can not only customize their communications; they can also create and standardize reusable templates using dynamic variable insertion representing criteria such as impact, service areas, and more.

What are Status Update Notifications?

You’re in the middle of a high-priority incident. You’re working to bring the problem to a close, but you can’t concentrate because you have a dozen (or two) stakeholders pinging you for updates. You’re copying and pasting your response across multiple internal communication channels, but that doesn’t stop the direct messages from popping up. Sound familiar? At PagerDuty, we call this the ‘executive swoop and poop.’ Basically, a responder is so inundated with update requests that they can’t do their main job: resolve the incident.

During an incident, it’s key to keep stakeholders in the loop. This helps the business respond as one to an incident, reducing resolution times and preserving customers’ trust. But stakeholders expect communications to look a certain way and contain the context that’s important to them. Formatting and writing these communications might require a responder’s full attention if the communication is done ad-hoc. Status Update Notifications allow teams to standardize communication expectations and reduce the toil of sharing updates.

This feature includes a rich text editor so teams can format the text to company communication standards, including adding company logos. With our drag-and-drop variables, responders can easily include incident details and populate key information as needed.

Status update notifications setting up template screenshot: configuring template with variables

How do the templates help my team?

During an incident, a responder has to remember so much about the system and services that they’re responsible for. It takes concentration and critical thinking. Status update notifications are a quick way to communicate with stakeholders, reducing the time and energy responders spend on sharing updates. But sometimes you need to send the same notifications frequently as similar incidents occur. Or you want your P3 and P0 communications to have different information and you don’t want to build the notification from scratch each time.

You could write all this in a playbook and store it in a wiki. But those wikis are hard to find and rarely updated. It’s not ready when and where responders need it. That’s why we built templates. Now, teams can customize and standardize reusable communication templates based on impact, business areas, and more. This functionality will also be available via API, so teams are able to customize and leverage status update notification templates to fit their needs in any context.

Status update notifications setting up template screenshot: creating new template details

With templates, your incident communications are as easy as:

  1. Ensure stakeholders are subscribed to the incident.
  2. Click “status update” and choose your template.
  3. Edit (if necessary) and preview your template.
  4. Send your status update notification.
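To show how dynamic variable insertion works in principle, here is a small sketch using Python's standard `string.Template`. The variable names (`priority`, `title`, `service`, `status`, `next_update_minutes`) and the `render_status_update` helper are assumptions made for this example; they are not PagerDuty's actual template variables or API.

```python
from string import Template

# Hypothetical reusable template; the $variables are illustrative names.
STATUS_TEMPLATE = Template(
    "[$priority] $title\n"
    "Impacted service: $service\n"
    "Current status: $status\n"
    "Next update in $next_update_minutes minutes."
)


def render_status_update(template, incident):
    """Fill a template from incident fields.

    safe_substitute leaves any variable without a value as a visible
    $placeholder, so a responder can spot and fill gaps before sending.
    """
    return template.safe_substitute(incident)


incident = {
    "priority": "P1",
    "title": "Checkout errors",
    "service": "Shopping Cart",
    "status": "Investigating",
}
print(render_status_update(STATUS_TEMPLATE, incident))
```

Because the incident above has no `next_update_minutes` value, that placeholder survives in the preview, which mirrors the "edit (if necessary) and preview" step.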

How can I get started today?

Status Update Notification Templates are now generally available for Business and Digital Operations customers. With Status Update Notification Templates, your teams can communicate better and with fewer ‘swoop and poops.’ Your communications will match company branding and standards, and the reusable template format means any communication is ready for you at a moment’s notice.

If you want to learn more, read our knowledge base article here or watch our demo:

 

If you’re ready to see Status Update Notification Templates in action, try PagerDuty for free for 14 days.


]]>
3 Ways You Might Have a NOC Process Hangover by Hannah Culver https://www.pagerduty.com/blog/3-ways-you-might-have-a-noc-process-hangover/ Mon, 24 Oct 2022 13:00:33 +0000 https://www.pagerduty.com/?p=79024 NOC, or network operation center, processes have been set in stone for decades. But it’s time for some of these processes to evolve. Digital transformation...

The post 3 Ways You Might Have a NOC Process Hangover appeared first on PagerDuty.

]]>
NOC, or network operation center, processes have been set in stone for decades. But it’s time for some of these processes to evolve. Digital transformation and the cloud era have led to the rise of DevOps, and with it, service ownership. Service ownership means that developers take responsibility for supporting the software they deliver at every stage of the life cycle. This brings development teams closer to their customers, the business, and the value they deliver.

It also requires a departure from the traditional NOC incident handling methods. Yet, as organizations transition towards service ownership, some old NOC processes remain. Here are three common NOC process hangovers and how to replace or update them.

Process hangover: L1 responders aren’t able to resolve issues

NOCs used to be the command center for technology issues. They functioned like a brain, sending out signals to relevant appendages. Issue with networking? Route to networking. Issue with security? Route to security. The NOC’s central function was to involve the correct SME to resolve an issue. This meant digging through spreadsheets (and sometimes physical contact books!) to figure out who was responsible for what.

When everything was on premise and in person, this made sense. There were fewer services, and incidents could be neatly separated by departments. If the database was having an issue, you could call up the database on-call responder. The responder (who would likely be in office or close enough to respond in person) could then go to the datacenter and take a look.

Now, in the remote work, cloud era, where organizations have hundreds or thousands of services maintained by dozens or even hundreds of teams spread across the globe, the rolodex method has outlived its purpose. It’s next to impossible to maintain accurate spreadsheets to know which teams are responsible for which services. And, as the organization changes, records grow stale quickly. Services can move between teams. Teams change as people move between them, or leave/join the company. Now, an L1 responder has to work too hard to identify the right person in an efficient and timely manner.

Organizations need a way to remove these manual steps to find the right person and route incidents directly to SMEs who can jump in to respond to any issues. This can happen in a variety of ways. For some organizations, a DevOps service ownership model is the right path forward. Those who write the code are assigned to respond and fix the service during an incident. The alert is routed directly to the on-call person on the development team that supports the service, and the SME takes it from there.
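Routing an alert straight to the owning team replaces the spreadsheet lookup with two small mappings: service to team, and team to on-call rotation. The following sketch assumes hypothetical service names, team names, and responder handles, and a hypothetical `route_alert` helper. A real deployment would read this data from its incident management platform rather than hard-coded dictionaries.

```python
# Hypothetical ownership and on-call data, for illustration only.
SERVICE_OWNERS = {"checkout-api": "payments", "search-api": "discovery"}
ON_CALL = {"payments": ["dana", "lee"], "discovery": ["sam"]}


def route_alert(service, owners=SERVICE_OWNERS, on_call=ON_CALL, fallback="l1-responders"):
    """Return who should be notified for an alert on `service`.

    The first person in the owning team's rotation is treated as the
    current on-call responder. Services with no known owner, or teams with
    an empty rotation, fall back to a shared first-line queue.
    """
    team = owners.get(service)
    rotation = on_call.get(team, [])
    if rotation:
        return rotation[0]
    return fallback
```

The fallback matters as much as the happy path: an unowned service is exactly the case where a responder would otherwise be digging through stale records.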

For other organizations, it might make sense to have a hybrid approach where L1 responders serve as the first line of defense before escalating to distributed, on-call teams for their services. L1 responders shouldn’t be a routing center that connects the issue with another team. Instead, they should be empowered to resolve an incident themselves. You can set up your L1 responders to be more effective by giving them the ability to both troubleshoot and selectively resolve incidents. Access to automation and resources like runbooks can empower L1 responders to help accelerate the diagnosis and remediation process, oftentimes without needing to disrupt the subject matter experts in charge of a given service via an escalation. By putting automation in the hands of L1 responders, organizations can avoid unnecessary escalations and empower L1s to resolve issues faster.

Process hangover: Major incidents aren’t called or are called too late

We’ve heard it before: time is money. And when NOCs were the primary method of ensuring incidents were responded to, they had an additional responsibility. A NOC needed to ensure that resources were well managed. This meant no unnecessary personnel responding to problems. NOCs often took the blame if they called a major incident too soon and interrupted people for a minute problem. These disruptions took SMEs away from their work innovating. So it was crucial for NOC responders to only call major incidents when it was clear there was a much bigger issue at play.

But now, time isn’t money; uptime is money. The cost of a major incident that’s flown under the radar is larger than the cost of tagging in some extra help. Imagine you’re an online retailer and your shopping cart function is down. Every minute your customers can’t add items to their cart, you’re losing hundreds of thousands of dollars. Plus, customer expectations have increased over the last few years. Customers expect that their app, tool, platform, streaming service, etc. works without interruption. And it erodes customer trust when it doesn’t. In fact, according to PwC, 1 in 3 customers would stop doing business with a brand they loved after one bad experience.

Organizations need to call major incidents sooner to mitigate customer impact. Yes, this may mean waking someone unnecessarily once in a while. But, that’s far less likely with service ownership. SMEs responsible for a service have a better understanding of when to call a major incident than an L1 responder would. So there are fewer false alarms.

Process hangover: Come-and-go war rooms

NOCs often serve as the communication hub for a major incident. This helps responders working to resolve an issue keep on task. Back when many companies had everything (and everyone) on-premise, there was a war room. People came there and the NOC coordinator kept everyone up to date. Now, with distributed teams and systems, physical war rooms are a thing of the past. Many companies instead have virtual war rooms with a video conferencing bridge or chat channel that remains open during an incident.

Other stakeholders may want to treat this war room like a physical one, dropping in as they please. But, in this virtual world, this means that these stakeholders are asking the incident responders questions. This delays the resolution. Companies with come-and-go virtual war rooms may experience more miscommunications and frustration. Responders feel frustrated by interruptions and stakeholders feel frustrated with the lack of communication.

One way to mitigate this is to close the war room to non-participants. If someone isn’t a part of the incident response team, they don’t need access to the response team’s virtual war room. Instead, what they need is an internal liaison. This is a designated communicator from the incident response team.

The internal communication liaison consolidates incident information and relays it to relevant stakeholders. To make this easier, communication liaisons can use status update notification templates. These templates dictate how to craft communications for a specific audience. They ensure that stakeholders receive any information necessary to make decisions. And no responders have to stop working on the incident at hand to share updates.

Hangovers aren’t fun, but they always end

NOCs are a tried and true way of managing incidents for many organizations. But NOC methods become out of date when moving into this era of digital transformation. Seamless communication and rapid response are key to preserving customer trust. Looking forward, teams will involve SMEs immediately and call major incidents sooner rather than later. They’ll also communicate with key stakeholders throughout an incident while setting boundaries.

And often teams need a digital operations platform to help support this transition. PagerDuty allows teams to bring major incident best practices to their organization, resolving critical incidents faster and preventing future occurrences. Try us for free for 14 days.


]]>
PagerDuty Service Standards helps organizations better configure services at scale by Hannah Culver https://www.pagerduty.com/blog/introducing-service-standards/ Tue, 23 Aug 2022 13:00:19 +0000 https://www.pagerduty.com/?p=77871 Service ownership, a DevOps best practice, is a method that many companies are pivoting towards. The benefits of service ownership are varied and include boons...

The post PagerDuty Service Standards helps organizations better configure services at scale appeared first on PagerDuty.

]]>
Service ownership, a DevOps best practice, is a method that many companies are pivoting towards. The benefits of service ownership are varied and include boons such as bringing development teams much closer to their customers, the business, and the value being delivered. The “build it, own it” model has tangible effects on customer experience, as developers are incentivized to innovate and drive customer-facing features that delight.

But the pivot to service ownership is difficult, especially for large companies with hundreds or even thousands of services. Everything from defining a service and its boundaries to deciding who owns it can be a behemoth undertaking. And ensuring that services are configured in a way that allows the organization to scale quickly is next to impossible across the entire technology ecosystem. However, gaining this level of visibility is crucial for better business outcomes.

Screenshot of service directory

 

To address this problem, PagerDuty is excited to announce the general availability of Service Standards for all plans. PagerDuty’s emphasis on service ownership through our service-based architecture has traditionally allowed individual teams to determine how to configure services. Now, with Service Standards, teams across an organization have both the visibility to understand what best practice looks like as well as the flexibility to standardize that knowledge across teams new to service ownership in a way that’s beneficial for both the team and organization. 

Service Standards help all teams ensure that their service configurations are adhering to service ownership best practices. This means that services are informative, integrated with the right tools, and supported by the right people. Service Standards provide both the visibility and means to institute standards across teams to not only embrace service ownership, but also to scale it across the organization.

Introducing Service Standards

When configuring services, teams throughout an organization will have different methods. Some services may have the information that all teams need to act quickly during an incident and others may not. This lack of uniformity can cause problems across the ecosystem, with information that’s lost or locked up as purely tribal knowledge. And, it’s next to impossible for managers and administrators to know whether the services they are responsible for are in good shape or not.

Service Standards can help individual engineers understand how to configure better services, while providing a guide for managers and administrators to scale these standards across an organization.

Set up better services with guidelines for success

With the shift to cloud, the number of services for any organization has grown exponentially, and a central governing and creation team isn’t often able to handle the load. To make things more complicated, service owners configure services in markedly different ways. From naming conventions, to descriptions, to whether they have the right people on call, services vary in the depth of information they provide.

Too often this results in a lot of rework. Imagine this scenario: A team spins up new services only to be blocked before they can enter production. They’re told to make a variety of changes and fixes before they can ship. And, since these requirements are often not codified or widely known, this is a mistake that the team might make multiple times, adding pain and toil to the service creation process.

We hear this from customers all the time. In fact, one of the top questions we get asked is “what does ‘good’ look like?” The truth is, it often depends, but it’s always the case that ‘good’ is unique to each team’s particular way of working. 

With Service Standards, teams can standardize on what good looks like according to company policy. PagerDuty has provided nine standards that each service should fulfill to have the depth and context required for the service to be considered well-configured, all of which are able to be toggled on and off.
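Conceptually, a standard is a named check that a service configuration either passes or fails, and toggling a standard off simply leaves it out of the audit. The sketch below illustrates that idea only. The three standards shown, the service fields, and the `audit_service` function are hypothetical stand-ins; they are not the nine standards PagerDuty ships.

```python
# Hypothetical standards, for illustration only. Each maps a name to a
# check that takes a service configuration and returns True or False.
STANDARDS = {
    "has_description": lambda s: bool((s.get("description") or "").strip()),
    "has_escalation_policy": lambda s: bool(s.get("escalation_policy")),
    "has_integration": lambda s: len(s.get("integrations") or []) > 0,
}


def audit_service(service, enabled):
    """Return {standard_name: passed} for each enabled standard.

    Standards that are toggled off are not evaluated and do not appear
    in the result.
    """
    return {name: STANDARDS[name](service) for name in enabled}


service = {
    "name": "checkout-api",
    "description": "Handles cart checkout",
    "escalation_policy": None,
    "integrations": ["monitoring"],
}
report = audit_service(service, enabled=["has_description", "has_escalation_policy"])
```

Here `report` records a pass for the description and a fail for the missing escalation policy, which is the kind of per-service result an administrator could export to track progress.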

Screenshot of Service Standards pass or fail

 

Audit services for accountability

Service Standards also give managers and administrators the level of control they need to ensure that configuration requirements are met at scale. Administrators can determine visibility and decide whether to make these standards publicly available for the rest of the organization to view. They can also toggle on or off all nine standards depending on what the company needs. On a more granular level, administrators can apply these standards to only a subset of services for more flexibility. And, the service performance data can be exported out of PagerDuty and shared as needed to drive accountability and show progress.

Screenshot of Service Standards settings

 

Ready to try yourself?

Service Standards are here to help all organizations scale service ownership best practices. This feature gives engineers an understanding of what is ready for production and reduces the toil required to ship new services. For administrators and managers, Service Standards help drive accountability throughout the technology ecosystem and provide a way to assess progress. Over time, this improves incident response for first responders looking for quick context, and helps drive operational maturity at the organization level.

If you want to learn more, check out our recent webinar, “How to Standardize Service Ownership at Scale for Improved Incident Response” or read our knowledge base article here.

If you’re ready to see Service Standards in action, try PagerDuty for free for 14 days.


]]>
More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response by Hannah Culver https://www.pagerduty.com/blog/revamped-mobile-app-for-incident-response-2022/ Tue, 12 Jul 2022 13:00:43 +0000 https://www.pagerduty.com/?p=77078 2020 revolutionized how we work. Many went from full-time office work to 100% remote overnight. And now that in-office is once again on the horizon,...

The post More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response appeared first on PagerDuty.

]]>
2020 revolutionized how we work. Many went from full-time office work to 100% remote overnight. And now that in-office is once again on the horizon, companies are thinking of ways to continue to work flexibly. However, this comes with increased challenges, and a need for tools that match this working style.

The PagerDuty mobile application is well recognized, with a 4.8-star rating on the App Store and Google Play. We understand how important it is to reach the right people immediately – that’s why we’ve made significant investments in iOS and Android to help responders resolve critical work from anywhere, anytime.

This blog post covers some of the most exciting improvements, such as a new navigation interface to find the information you need most; improved incident intelligence through past and present incidents; and automation that lets you trigger diagnostics and take remediation actions.

Easily navigate to find the information you need most

As a responder, you need to know when you’re on call and what services you’re on call for. If and when you do get called, it’s crucial that you can identify how the technical services you’re responsible for are performing. And most importantly, you want to be able to see all this information at a glance from your mobile app – not navigate through multiple screens and dig for information buried deep within an app.

With the new PagerDuty mobile home screen experience, the most important information that responders need is readily available. This redesign puts the top open incidents, on-call shifts and impacted technical services front and center, reducing the number of taps needed to navigate through the app. 

The redesigned home screen is now available for early access. If you’d like to try it out, you can fill out this early access form and choose the New Mobile Home Experience selection, and we’ll send you instructions.

Part of working flexibly means having critical incident context at your disposal during the moments that matter most. When an incident begins, you need to get up to speed as quickly as possible to begin making decisions on how to best mitigate impact.

One way to do this is with the new mobile incident details screen. This screen gives you an easier visual experience and access to your most important features to help you respond to incidents faster. The most important information about an incident is available to you immediately, such as notes from other responders working on the incident, change events, past incidents, and the latest alerts.

A new carousel on the updated version of the mobile incident details screen also allows you to run a play, add a priority or note, post a status update, and more.

Gain critical incident context through the past or present

When you experience an incident, one of the biggest hurdles to jump is answering the question, “Have we seen this before?” If you have, resolution might be as straightforward as running a play or executing an automation sequence that worked before. But it often can be difficult to find that historical context, and that’s time wasted that nobody can afford.  

  • The Past Incidents feature on the PagerDuty mobile app displays incidents with similar metadata that were generated on the same service as the current, active incident. This additional context facilitates accurate triage and reduces resolution time. For example, you can see whether you, or someone on your team, has been involved in a similar previous incident, and dive into details to discover what remediation steps were taken. 

  • Change Events – Changes within the system are often the culprit behind incidents. They are often overlooked because it can be hard to pinpoint exactly what change caused the incident, especially when many organizations are shipping new code dozens or even hundreds of times per day. However, “Gartner estimates that approximately 85% of all performance incidents can be traced.” Change Events will enable you to look at changes impacting your environment and help you establish the potential root cause. Change Events can be easily viewed in two areas of the mobile app: the new mobile incident details screen and service details. Either tap on a desired incident and scroll to Change Events, or navigate to the Service Directory to select a service to view a maximum of two Change Events. Event details displayed include the date and time, summary, service, type, links, and source.

  • Another important piece of information during incident response is understanding the impact radius of an issue. One way to glean this information is by understanding Service Dependencies. If a large, customer-facing business service is experiencing problems due to the technical service incident, you’ll need to respond faster and with more contextual intelligence to better understand the scope of the problem. With Service Dependencies in the mobile app, you can view what services are affected to better understand scope. Service Dependencies are listed within each particular service’s profile in the Service Directory.
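The question “Have we seen this before?” can be approximated by comparing the current incident with earlier ones on the same service. The sketch below is one simple way to do that, using word overlap (Jaccard similarity) between titles. It is an assumption-laden illustration, not how the Past Incidents feature is built: the incident fields, the 0.5 threshold, and the function names are all hypothetical.

```python
def tokens(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words the two sets have in common, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_past_incidents(current, history, threshold=0.5):
    """Return past incidents on the same service, most similar title first."""
    current_words = tokens(current["title"])
    scored = []
    for incident in history:
        if incident["service"] != current["service"]:
            continue  # only compare incidents generated on the same service
        score = jaccard(current_words, tokens(incident["title"]))
        if score >= threshold:
            scored.append((score, incident))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in scored]
```

From each match, a responder could then open the earlier incident and review which remediation steps were taken.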

Leverage automation for faster response

As technology environments become more complex, it’s more important than ever to conserve people’s time and cognitive resources. This means ensuring that machines, not humans, serve as the first line of defense. 

Automating repetitive manual tasks and well-understood incidents can divert unnecessary toil away from responders so that they can focus on their day jobs, and are only called for the incidents where they’re needed most. One way to do this is with automation that you can run with the tap of your finger.

PagerDuty Automation Actions is now generally available within PagerDuty mobile, empowering you to trigger automated diagnostics and take remediation actions from anywhere, anytime. It improves productivity by automating repeated diagnostic and remediation steps, replacing the toil of manual tasks. In addition to running the scripts, you can view previously run scripts and output reports directly from the mobile app. These update in real time, meaning you never miss a thing.

These latest additions to the PagerDuty mobile application help you work the way you want without sacrificing time, quality, or customer experience. Flexible work is here to stay, and PagerDuty’s powerful mobile application is committed to helping you make the most of it. If you haven’t tried our mobile application in a while, it’s time to take a second look. Use the QR code to download the app for either iOS or Android.

Important: Ensure your mobile experience is secure

With so many great new features added to PagerDuty mobile, we are introducing new minimum OS requirements to ensure the mobile app continues to be innovative and secure and to improve the user experience. Starting June 27, 2022, future versions of the PagerDuty mobile app will require Android 9.0 and iOS 14.0 or higher. Please ensure your device is upgraded to continue receiving mobile app updates.


]]>
How to Standardize Service Ownership at Scale for Improved Incident Response by Hannah Culver https://www.pagerduty.com/blog/standardize-service-ownership-at-scale/ Wed, 22 Jun 2022 13:00:00 +0000 https://www.pagerduty.com/?p=76833 Service ownership is a DevOps best practice where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle....

The post How to Standardize Service Ownership at Scale for Improved Incident Response appeared first on PagerDuty.

]]>
Service ownership is a DevOps best practice where team members take responsibility for supporting the software they deliver at every stage of the development lifecycle. This level of ownership brings development teams much closer to their customers, the business, and the value being delivered.

Service owners are the subject matter experts (SMEs) for their services – and in a service ownership model, they are also responsible for responding to any production issues. For teams moving to this model, going on call can seem daunting. Maybe you’ve heard horror stories about weekends and evenings spent holding your laptop and responding to incidents? 

There’s no way to sugarcoat it: being on call is tough. But best practices like service ownership can introduce more structure and predictability to an on-call shift so that, ideally, there is a net quality of life improvement for everyone.

Why is service ownership important?

Imagine this scenario: you’re called into a meeting because something is wrong somewhere in the system, but since no service owners have been designated, nobody knows who the SME is. Fifteen minutes turn into 20, then 30, and so on. Meanwhile, more people are jumping on the call, yet no one is making progress.

This type of chaotic incident response wastes precious time – it’s the epitome of inefficiency. And the worst part is that it still happens all the time. 

It doesn’t have to be this way. But first, let’s examine why so many teams are burdened by manual incident response that drags out forever. The slowdown boils down to teams not being able to answer a few very important questions:

  • What services are impacted? 
  • Who owns those services?
  • What are these services’ dependencies – and who owns those services?
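As a rough illustration, the three questions above can be treated as lookups against a service catalog. This is a minimal sketch, not a PagerDuty API: the catalog contents, the service and team names, and the `triage` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One catalog entry; every field name here is an illustrative assumption."""
    name: str
    owner: str                          # team that is on call for the service
    depends_on: tuple[str, ...] = ()    # names of services this one relies on


# Hypothetical catalog keyed by service name.
CATALOG = {
    s.name: s
    for s in (
        Service("checkout-api", "payments-team", ("payments-db", "auth-service")),
        Service("payments-db", "data-platform-team"),
        Service("auth-service", "identity-team"),
    )
}


def triage(impacted: list[str]) -> dict[str, dict]:
    """For each impacted service, report its owner and the owners of its dependencies."""
    report = {}
    for name in impacted:
        service = CATALOG[name]
        report[name] = {
            "owner": service.owner,
            "dependencies": {dep: CATALOG[dep].owner for dep in service.depends_on},
        }
    return report


if __name__ == "__main__":
    print(triage(["checkout-api"]))
```

If this information is recorded before an incident, answering all three questions is a lookup rather than a 30-minute meeting.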

Meetings like the example above attempt to answer these questions, but in a reactive manner. Until teams can answer these questions, they are at a standstill and cannot make progress on resolving the incident. 

This is becoming more and more common as the technology ecosystem continues to change and grow more complex at companies of all sizes. Hundreds of services, microservices, and distributed ownership make it hard to know how to take action when something goes wrong.

Service ownership can help organizations become more proactive about incident response. That said, this is no walk in the park. Cultural change is hard. Even the most successful organizations that have made the shift to DevOps and service ownership would agree that following best practices and having a process for adopting service ownership help the change stick and scale across the entire organization.

When organizations are able to adopt service ownership, everyone—from service owners, to executive stakeholders, to customers—benefits. Service owners are only called in when necessary. Stakeholders know what’s affected by an incident, and can work with the technical team to mitigate impact. And customers will encounter a shorter service disruption with clearer communication throughout. 

In a world where customer expectations have never been higher, and customer experience is key, this can put your organization above the competition – all while making life better for the people who respond to the incident.

But what actually is a service?

Defining a service can be trickier than it may seem at first glance. We’ve seen organizations split services many ways, and it’s not always as simple as matching services to what’s deployed in the cloud. For some organizations, there’s a monolith that needs to be taken into account as well. So how can you determine how to break things up into manageable pieces for which a team can be responsible?

At PagerDuty, we define a service as “a discrete piece of functionality that provides value and that is wholly owned by a team.” Another way to think of it is that a service represents an entity you monitor, and serves as a container for related incidents that associates the incidents with the right escalation policies. 

In short, it breaks down like this: if you monitor it, and you want incidents to be associated with it, and you want certain people to be on call for it, then it’s a service. This is a broader definition that allows more flexibility in how teams might define unconventional services. 

However, responders need to know more than just these boundaries to be fully prepared to deal with issues. This is where service configuration can make a big difference.

What makes a service well-configured?

At PagerDuty, we’ve established a set of standards that we feel are valuable to organizations looking to further their service ownership journey. These act as guidelines for how we create our services, and determine what “good” looks like. 

They’re flexible as well. Not every service is built the same, and some of our standards may not apply in every circumstance. Think of them as a jumping-off point that our customers can use to make on-call more efficient and less painful for their first-line responders.

It’s important to note that each organization will ramp differently, and that service ownership is a process, not a single box to be checked off a to-do list. Depending on your operational maturity, you may need to set and adopt standards at a different pace.

If you’re relatively small and new to service ownership, with only a handful of mostly cloud-based services, you may be able to set standards and configure your services accordingly in a few days. If you’re starting from scratch, it’s even easier: you can apply these standards when you create your very first services, setting you up for long-term success without needing to go back and make changes to previously configured services.

But if you’re a larger organization with hundreds or even thousands of services, this might be a tougher shift. For these organizations, here are a few questions that can help you think about how to move forward:

  1. What subsets of existing services could you set standards for today, and what are those standards? You may find that some standards are easy to apply to all your services. For example, each service should have a name that accurately describes what it does. If there are standards like this that you know the majority of services should follow, then that’s a good place to start implementing. Think about how you could ask pilot teams to make these changes.
  2. What does the process for creating net new services look like? You may have your standards determined, but changing all your current services to meet these standards is a difficult undertaking. If you’re a larger organization, it’s not usually feasible to reconfigure all services at once – and reconfiguring services can be more frustrating than following a process to set them up correctly in the first place.
  3. What is your long-term goal, and what does a timeline look like for that? Some services may not need these standards, and that’s okay. Make a plan for the rest of the services with a deadline, then start onboarding additional teams to the process, making small, incremental changes over time.
  4. How do you know your dependencies? Beyond creating and applying standards, it’s also important to know how your services map to each other and affect one another. While establishing standards, think about how you can encourage codifying this information during the configuration process.
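One way to make standards like these concrete is to express them as automated checks that pilot teams can run against their own services. The sketch below is illustrative only: the `ServiceConfig` fields, the list of vague names, and the specific rules are assumptions, not PagerDuty’s actual standards or data model.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceConfig:
    """Hypothetical record of how a service is configured."""
    name: str
    owner: str = ""                     # owning team
    escalation_policy: str = ""         # who gets paged, and in what order
    dependencies: list[str] = field(default_factory=list)
    dependencies_reviewed: bool = False  # has the team documented its dependencies?


# Placeholder names that say nothing about what a service does (assumed list).
VAGUE_NAMES = {"service", "test", "new service", "untitled"}


def check_standards(svc: ServiceConfig) -> list[str]:
    """Return the standards a service fails; an empty list means it passes."""
    failures = []
    if svc.name.strip().lower() in VAGUE_NAMES or len(svc.name.strip()) < 4:
        failures.append("name does not describe what the service does")
    if not svc.owner:
        failures.append("no owning team")
    if not svc.escalation_policy:
        failures.append("no escalation policy")
    if not svc.dependencies_reviewed:
        failures.append("dependencies not documented")
    return failures


if __name__ == "__main__":
    legacy = ServiceConfig(name="test")
    for failure in check_standards(legacy):
        print(f"{legacy.name}: {failure}")
```

Running a check like this across existing services gives a list of gaps to close incrementally, and applying it when a service is created keeps new services from adding to that list.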

Individually, the answers to these questions may not seem like big differentiators, but at scale they make a big difference in how well you respond to incidents.

How does this help incident response?

During incident response, it’s important that you don’t waste time or energy on work that doesn’t matter. Everything must be pared down to what the team needs to focus on to resolve the incident. 

Service ownership helps you gain that clarity throughout the response process.

For instance, if you’ve configured your service well, you’ll be alerted with the correct urgency and minimal alert noise, allowing you to respond to only the most important signals and prioritize accordingly. You’ll also be able to get the right people on the scene quickly, since you’ll know who the service owners are. As you grow in maturity, you’ll also be able to create automation sequences for your services that help you reduce the work required to return service to normal.

Diagnosing what went wrong is also easier, as you’ll see what changed on the service. And with service mapping, you can understand the overall impact to the system.
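To illustrate how a service map can reveal overall impact, here is a minimal sketch that walks a dependency map to find every service that directly or transitively relies on a failing one. The map contents and the `impacted_by` helper are hypothetical, and a breadth-first search is just one reasonable way to do the traversal.

```python
from collections import deque

# Hypothetical service map: each service lists the services it depends on.
DEPENDS_ON = {
    "web-frontend": ["checkout-api"],
    "checkout-api": ["payments-db", "auth-service"],
    "mobile-gateway": ["auth-service"],
    "payments-db": [],
    "auth-service": [],
}


def impacted_by(failing: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failing service.
    seen: set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("auth-service", DEPENDS_ON)))
```

In this example, a failure in `auth-service` reaches `web-frontend` even though the two are not directly connected, which is the kind of indirect impact that is hard to spot without a map.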

During resolution, you can work faster with the integrations that your service needs, as well as keep stakeholders informed. You can streamline communication to only those people who you know will be affected by your incident, keeping the impact to a minimum even within the organization.

Lastly, you’ll learn more from incidents. As the SMEs for your service, you’ll gain historical context and feed those learnings back into your response process, making you more resilient over time.

As you scale service ownership across the organization, these improvements make a drastic difference to both customers and teammates. If you’re looking to adopt service ownership or improve your operational maturity, and want a partner that can guide you through the process, try PagerDuty for free for 14 days. If you’d like to learn more about standardizing service ownership at scale, check out this webinar.
