AIOps | Tags | PagerDuty

Three Teams That Can Use AIOps to Work Smarter, Not Harder by Hannah Culver

Hannah Culver — Mon, 28 Aug 2023 12:00:29 +0000

There isn’t a boardroom today that isn’t asking how AI and generative AI applications can help drive efficiency and accelerate the business. For organizations looking to capitalize on ML and automation to improve their efficiency during incidents, AIOps is a tangible, proven application and an exciting opportunity for ITOps teams.

As we’ve seen across market landscape evaluations, there are a number of ways that solutions can be implemented. Despite this, the outcomes AIOps solutions aim to deliver remain fairly consistent: fewer incidents and faster resolution. But which teams stand to benefit from this powerful technology, and how will AIOps help them achieve their desired business outcomes?

Understanding how different teams can implement best practices to see a reduction in MTTR, total incidents, and time to adopt automation will help ensure that each team is getting value from your investment. Here are three teams that stand out as having much to gain from leveraging AIOps: Network Operations Center (NOC) teams, Major Incident Management (MIM) teams, and distributed service owning teams. Let’s cover each.

NOC teams

If you have a NOC, it acts as your central nervous system. You may also be in the middle of undertaking modernization efforts to reduce both cost and risk.

Many of our NOC customers tell us about challenges such as:

Eyes-on-glass operational style causes incidents to go undetected
Catch and dispatch means too many escalations to SMEs or routing incidents to the wrong team
Manual work drives up MTTR
L1/L2 teams experience high turnover and blame culture is common

To move beyond this, organizations can create L0 automation. This is automation that serves as the first responder, only bringing in humans when necessary. For well-understood, well-documented issues, L0 automation can auto-remediate incidents without a responder intervening. But for other more complex issues that require a hands-on approach, NOC teams can create L0 automation that immediately pulls in diagnostic information before the responder looks at an incident, routes incidents intelligently according to event data, and populates the incident notes with pertinent documentation and runbooks.
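
The L0 flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PagerDuty's implementation: the `Incident` fields, the lookup tables, and the team names are all hypothetical stand-ins for whatever a NOC's own tooling provides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    kind: str                       # hypothetical event field, e.g. "disk_full"
    service: str
    notes: List[str] = field(default_factory=list)
    resolved: bool = False
    assigned_team: str = ""


# Hypothetical lookup tables; a real NOC would load these from its own tooling.
AUTO_REMEDIATIONS: Dict[str, Callable[[Incident], bool]] = {
    # Stand-in for a cleanup job that reports success.
    "disk_full": lambda incident: True,
}
RUNBOOKS: Dict[str, str] = {"db_latency": "runbooks/db-latency.md"}
TEAM_BY_SERVICE: Dict[str, str] = {"checkout": "payments-oncall"}


def l0_first_responder(incident: Incident) -> Incident:
    """Try automation first; involve a human only when it cannot resolve."""
    fix = AUTO_REMEDIATIONS.get(incident.kind)
    if fix is not None and fix(incident):
        incident.resolved = True
        incident.notes.append(f"auto-remediated: {incident.kind}")
        return incident

    # No known fix: prepare the incident before a responder looks at it.
    incident.notes.append(f"diagnostics: collected for {incident.service}")
    runbook = RUNBOOKS.get(incident.kind)
    if runbook:
        incident.notes.append(f"runbook: {runbook}")
    incident.assigned_team = TEAM_BY_SERVICE.get(incident.service, "noc-l1")
    return incident
```

A well-understood issue such as the hypothetical `disk_full` never reaches a person, while anything else arrives with diagnostics, a runbook link where one exists, and a team already assigned.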

PagerDuty AIOps helps NOCs modernize and move away from eyes-on-glass methods. These NOCs are a center of excellence within their organizations, spearheading data-driven optimization, enabling best practices, and ensuring incident readiness.

MIM teams

When critical, customer-impacting incidents happen, you don’t have time to waste. But with complexity and noise on the rise, how do Major Incident Management teams improve to meet growing customer expectations?

We see MIM teams with common challenges such as:

Finding out about major incidents from an overwhelming number of customers/users calling in or from delayed team escalations
Lack of context as initial triage takes too long to assess severity and business impact
Long MTTR waiting for the right people, the right diagnostics, the right runbooks, etc.
Disjointed tooling leading to communication barriers for responders and corresponding teams

MIM teams can overcome these challenges with a variety of automation and ML tactics. First, organizations can create automation that immediately routes high-priority or high-severity incidents to a MIM team and tags in the appropriate teams via incident workflows. Additionally, ML can gather key context such as how rare an incident like this is, whether it has happened before and how it was resolved, and which change events might be correlated to the failure.

PagerDuty AIOps helps MIM teams detect major incidents faster, improve MTTR and customer experience, and save SMEs time. This reduces the cost of each incident and mitigates risk.

Distributed service owning teams

DevOps and distributed service owning teams are under more pressure than ever to deliver exceptional customer experiences. But with competing priorities and fewer resources, this is easier said than done.

Many of our customers share challenges they are facing such as:

Disparate monitoring tools with no central pane of glass
Too much noise leading to incorrect escalations and false incidents
Lack of context and information silos
Toil and time taken away from value-add initiatives

For service owning teams looking to overcome these challenges, an AIOps tool that can aggregate data from all the monitoring sources in the technical ecosystem can help bring clarity to incident response. Additionally, with ML, teams can reduce noise by automatically grouping together alerts based on context, time, and previous event data that the model has trained on. With this and the ML-surfaced triage information, incident response is streamlined so teams can get back to innovating faster.
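
To make the grouping idea concrete, here is a small rule-based sketch that groups alerts by context (the service) and time (a sliding window). It is a deliberately simple stand-in, not the trained ML model described above, and the `Alert` fields and five-minute default window are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    service: str
    timestamp: float  # seconds since some reference point
    summary: str


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts for the same service that arrive close together in time."""
    groups: List[List[Alert]] = []
    open_group_by_service: Dict[str, List[Alert]] = {}

    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = open_group_by_service.get(alert.service)
        # Join the service's latest group if the gap since its last alert is small.
        if group is not None and alert.timestamp - group[-1].timestamp <= window_seconds:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            open_group_by_service[alert.service] = group
    return groups
```

Each resulting group would map to one incident, so a burst of alerts from one service produces a single notification rather than one per alert.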

PagerDuty AIOps helps service owning teams spend less time firefighting, reduce MTTR, and create exceptional customer experiences. This improves culture and team retention while increasing revenue for the entire organization.

Ready to get started?

With PagerDuty AIOps, teams like the ones we looked at see 87% fewer incidents, 14% faster MTTR, and 9x faster automation adoption. This helps organizations move faster, focus on the work that matters most to customers, and reduce risk and team burnout. Best of all, teams from dev to IT can see value from PagerDuty AIOps.

PagerDuty AIOps works in conjunction with the rest of the PagerDuty Operations Cloud to help organizations manage their operations by leveraging AI and automation to supercharge their digital transformation. With over 700 integrations, GenAI capabilities, and end-to-end event-driven automation, PagerDuty gives customers a 400% ROI and the right tools to leapfrog the competition.

To try PagerDuty AIOps out yourself, you can take an interactive product tour or try us for free for 14 days.

The post Three Teams That Can Use AIOps to Work Smarter, Not Harder appeared first on PagerDuty.

Take Data and Turn It Into Actionable Info with AI and ML by Nisha Prajapati

Nisha Prajapati — Wed, 16 Aug 2023 18:07:07 +0000


PagerDuty Extends Operations Cloud Leadership into AIOps and Automation by Jonathan Rende

Jonathan Rende — Tue, 11 Jul 2023 22:51:12 +0000

Forrester Names PagerDuty a Leader in first-ever Process-Centric AIOps Wave

From helping pioneer the DevOps movement to establishing best practices around service ownership to being the standard in incident response, PagerDuty has a long history of leadership. PagerDuty is honored to add to this list and now be recognized as a leader in the AIOps and Automation space by Forrester. To explain why PagerDuty was listed as a leader, it’s important to look at our current economic climate and compare it to the past.

It’s been more than a decade since the Three C’s–Cost Control, Consolidation (of vendors) and Compliance–last received so much oversight and scrutiny. Just like in 2008, centralized decision making and cost controls are driving organizations to consolidate onto entire vendor suites rather than buying only best-of-breed products to do the job.

It’s no surprise that additional financial oversights are now a part of every purchase, every budget item, and every activity for IT and development. Everyone expects more–and they expect it right now, not next quarter or even next year.

AIOps, however, has always promised to do more. For us, more is about exceeding SLAs and improving availability and reliability. More is about cost savings because fewer humans should be needed in a major outage or incident. More should not only be about being more responsive, but also about preventing issues in the first place.

What’s Changed Since the Last Financial Crisis

In 2008, important financial institutions failed because of their credit and lending practices. This started a market downturn in which the global economy contracted. The result was cost controls and a need to improve business efficiencies, which in turn drove more central decision making. It was a stark contrast to the strategy of top-line growth at all costs, which typically results in distributed decision making and vendor/tool sprawl.

So, what’s different now and why is PagerDuty’s AIOps a leading solution to look into?

Now (vs 2008) machine learning is and should be an operational part of every data-centric digital business. PagerDuty’s AIOps solution has progressed fast over the last four years to help both reduce the time to resolve issues by 25% and reduce unnecessary interruptions (noise) by over 90%.
- By combining our event correlation (Intelligent event grouping) and event orchestration (event rules engine) with existing observability processes, we better target which experts are needed for which problems. We make escalation policies more effective and powerful.
- PagerDuty’s AIOps product can also make those experts more productive when they do get called in to a major incident, by automating the diagnostics process and pinpointing the offending or culprit services responsible for the problem.
- Equally important, by combining event rules with automation jobs (event-driven automation), an entire class of lower-priority problems can be remediated without human intervention, eliminating the need for responders or experts to engage at all.
- Lastly, with generative AI, we have potentially democratized the tools, which should broaden use even faster.
Now (vs 2008) there is no need to hire expensive professional services or bring in tons of white lab coat experts to configure systems. You can and should demand to see the value in days and weeks, not months or longer. PagerDuty’s AIOps offers a 5-10x reduction in time to value over alternative solutions.
Now (vs 2008) we have proven, high-value products that offer AI as an integrated part of your existing practices vs separate bespoke solutions in event management. Whether you have centralized IT Operations (e.g., a network operations center with SREs), a decentralized operating model with service ownership by developers, or a combination of both (hybrid model), there is no need to add new vendors or build new approaches. PagerDuty’s AIOps solution works in all models.

The promise of off-the-shelf AIOps products is a reality. There are real products from established leaders like PagerDuty to apply against your needs and requirements.

And now, we’re proud to share PagerDuty’s AIOps leadership as part of the recent Forrester Wave. Consider this your personal guide to help in your AIOps journey. Enjoy.


Report: The Forrester Wave™: Process-Centric AIOps, 2023 by Nisha Prajapati

Nisha Prajapati — Mon, 26 Jun 2023 13:00:01 +0000


Report: Innovation Brief: PagerDuty AIOps is a Powerful New ‘System of Action’ by Vivian Chan

Vivian Chan — Mon, 26 Jun 2023 12:49:19 +0000


AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst by Heath Newburn

Heath Newburn — Fri, 09 Jun 2023 12:00:36 +0000

Earlier in 2023, I had a great conversation with Carlos Casanova, a Forrester Principal Analyst, in a webinar about how AIOps can help drive successful organizational change. In our conversation, Carlos divided the AIOps market into two camps: technology-centric (primarily APM/Observability players) and process-centric. PagerDuty is a process-centric solution leveraging multiple technologies.

With process-centric AIOps solutions, organizations gain additional context and insights into their data. This reduces the time to act, helps improve data quality, enhances decision-making, improves routing and notification efficiency, and ultimately increases the value of services delivered by IT.

This ability to increase speed with greater context shrinks the time for critical incidents. An important thing to note is that the initial routing can be to a virtual operator, meaning that automation could drive additional triage/debug information or potentially complete a fix before engaging a human responder.

Throughout our conversation, Carlos and I kept returning to the theme of creating better context for responders. When I asked him about what capabilities he sees as most important for solving core AIOps use cases, he said, “Quickly identifying the correlation across disparate alerts drastically reduces the noise that individuals are dealing with. Providing all impacted individuals with this clean data signal is vital to improving operations. With this data, individuals can more easily and quickly garner insight into what is truly going on in the environment. They can then quickly determine the right actions to take, decide who needs to be involved for faster remediation, and reduce the amount of effort necessary, which frees up time for other events and alerts.”

But teams often struggle with getting started. We agreed that the cost of waiting and planning probably outweighs the cost of starting and iterating. He added, “The overall initiative may look daunting, but there are achievable quick wins. Waiting is not recommended. Start with small tactical efforts that roll up to your larger and longer-term strategic goals to show progress, demonstrate value, and build momentum.”

So speed is also a continuous theme: quickly getting context, rapidly responding with automation, and starting the process immediately to see these wins. But we also know that the pressure has continued to grow.

Teams have been affected by the economic downturn and slowdown. When I asked him about how teams can increase efficiency and measure success, we spoke about automation being key to success.

Carlos responded, “Simple scenarios that occur often are great candidates for automating all or part of their remediation. Fully or even partially automating five or 10 simple scenarios instantly frees up large amounts of time for individuals to focus on the more complex scenarios that organizations might not feel comfortable automating.”

But we also have to recognize the forming, storming, and norming before we get to performing in projects. There will be changes to how we measure and think about success that we have to embrace.

“AIOps can also empower IT to alleviate workloads to help their delivery teams ‘do more with less.’ It’s important to remember that these changes invalidate existing metrics. You must establish new baselines, since individuals will no longer be performing the simple and low-level actions. For example, a technician manually resolves 300 incidents per week. Thirty are simple and have easily automated remediations. The MTTR on these might drop by 90%. Elimination of the simple incidents, however, only allows the technician to take on 10 medium-complexity incidents in their place. That means the technician will handle 20 fewer incidents per week. The average MTTR for the technician will go up, and incidents will stay in their queue longer, with a higher ratio of medium- and high-complexity incidents,” Carlos said.
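
Carlos’s arithmetic can be checked with a short worked example. The counts (300, 30, and 10) and the 90% reduction come from the quote; the per-incident resolution times are hypothetical, chosen only so that the time freed by automation equals the time spent on the new medium-complexity incidents.

```python
# Counts taken from the quote above.
SIMPLE_COUNT = 30        # simple incidents that get automated
OTHER_COUNT = 270        # the rest of the technician's original 300
NEW_MEDIUM_COUNT = 10    # medium-complexity incidents taken on instead

# Hypothetical resolution times in minutes (not from the source).
SIMPLE_MTTR_MIN = 20.0
OTHER_MTTR_MIN = 40.0
MEDIUM_MTTR_MIN = 60.0
AUTOMATION_MTTR_REDUCTION = 0.90  # "might drop by 90%"

# Before automation: the technician resolves everything by hand.
before_count = SIMPLE_COUNT + OTHER_COUNT
before_avg = (SIMPLE_COUNT * SIMPLE_MTTR_MIN + OTHER_COUNT * OTHER_MTTR_MIN) / before_count

# After automation: simple incidents leave the queue, medium ones replace them.
after_count = OTHER_COUNT + NEW_MEDIUM_COUNT
after_avg = (OTHER_COUNT * OTHER_MTTR_MIN + NEW_MEDIUM_COUNT * MEDIUM_MTTR_MIN) / after_count

# Across all incidents, automated ones now resolve much faster.
automated_mttr = SIMPLE_MTTR_MIN * (1 - AUTOMATION_MTTR_REDUCTION)
org_count = SIMPLE_COUNT + OTHER_COUNT + NEW_MEDIUM_COUNT
org_avg = (
    SIMPLE_COUNT * automated_mttr
    + OTHER_COUNT * OTHER_MTTR_MIN
    + NEW_MEDIUM_COUNT * MEDIUM_MTTR_MIN
) / org_count

print(f"technician: {before_count} -> {after_count} incidents per week")
print(f"technician average MTTR: {before_avg:.1f} -> {after_avg:.1f} minutes")
print(f"average MTTR across all {org_count} incidents: {org_avg:.1f} minutes")
```

Under these assumed times, the technician handles 20 fewer incidents and their personal average MTTR rises, even though the average across all incidents falls. That is why the old per-technician baseline stops being a fair measure.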

One of the most common questions I run into is how to get started. Traditionally, AIOps is viewed as a potentially years-long initiative. It can be daunting to begin the journey with so much uncertainty and change. PagerDuty has greatly simplified the process by crafting a one-click process for event correlation so teams can see value immediately, but this isn’t the end of the journey to AIOps.

Carlos shared his insights on getting started, as well as facing the reduction in available OpEx. “Budgets are always a challenge, but to a large extent, you can overcome that hurdle by demonstrating and clearly articulating the value of AIOps. Develop a narrative for your business case that speaks to the value of improved experiences with the organization. Demonstrate how improved routing and notifications with enhanced contextually relevant data enables the same workforce to handle more workloads with less effort. Explain how patterns and trends empower lower-level resources to execute more advanced actions because they are provided suggestive actions that are based on the more experienced and senior staff members. All of this helps organizations deal with the economic challenges they’re currently facing while also improving the quality of products and services they deliver. It’s important for organizations to demonstrate their chosen solution has a fast time to value. For example, to improve user experiences, how quickly can the solution provide complete visualizations of transactions to support personnel to resolve an outage? To provide a faster response time, how quickly can the solution analyze the environment and correlate new alerts into singular incidents that can be handled immediately or in an automated fashion? Time to value is vital in difficult economic times.”

Time to value can be even more important than ROI for many of our customers. Speed is what will delineate winners and losers in digital battlegrounds. How quickly we can deal with inevitable issues and iterate improvements is what sets teams apart from competitors and provides an excellent customer experience.

As I&O leaders work through economic uncertainty that’s forcing them to cut costs and do more with less, they require new tools and approaches that help them scale and optimize their existing resources. AIOps provides teams with a reliable way to process high volumes of data and events, manage routing and response in real-time, and help teams resolve incidents faster. If you’re interested in learning how to tackle those challenges for your business, watch this webinar to hear the rest of my conversation with Carlos.


Noise Reduction and Auto-remediation with AWS and PagerDuty AIOps by Nisha Prajapati

Nisha Prajapati — Mon, 29 May 2023 18:30:02 +0000


Gartner® Report: Deliver Value to Succeed in Implementing AIOps Platforms by Nisha Prajapati

Nisha Prajapati — Fri, 12 May 2023 12:00:45 +0000


Top 3 Incident Response Problems AIOps Can Help Your Teams Solve by Hannah Culver

Hannah Culver — Thu, 20 Apr 2023 12:00:58 +0000

More data for data’s sake doesn’t help anyone. What organizations need is more information–actionable insight. With data coming from incoming streams of events and alerts, teams don’t have enough time to look at each one. And they struggle to parse and consolidate this data in order to figure out what they need to do next to resolve an incident. Processing this data to make it more usable and helpful during incident response often results in a rote series of manual, repetitive tasks each time an incident occurs, wasting time. It’s no wonder teams are increasingly turning to AIOps and automation for help. AIOps helps teams turn data into information and reduce that manual work. Let’s break down three ways AIOps allows teams to overcome challenges and reduce customer disruption.

Reducing noise for fewer incidents

Not every alert should become an incident. Yet for many organizations, this is what happens. Even if you’re only experiencing one problem, you may receive dozens or hundreds of pings for the same issue. This is distracting and bogs responders down. Noise should be the first thing you focus on because eliminating it:

Gives responders back time when they don’t need to filter out what’s important from what’s irrelevant.
Decreases the cognitive load that responders carry. Responders don’t need to think about 63 different alerts. They can focus on the one that matters. This reduces on-call anxiety.
Reduces the distractions that get in a responder’s way during an incident. This helps responders focus on getting a fix in place faster.

To reduce noise, you can analyze the noisiest incidents you’re facing. Which ones are the same incident? Take a look at the alerts you’re receiving and see if there’s a way to group them based on event data that you gather from your monitoring tools. What’s loudest? This is an opportunity to fine tune your monitoring tools so they’re only sending you what’s most valuable. Keep in mind that this often requires routine maintenance. Monitoring tools become messy, especially when data is scattered across vendors. You’ll want to gut check this whenever you notice noise levels are increasing.

PagerDuty AIOps makes it easier to reduce alert noise within a single tool. Users can set PagerDuty to ingest and deduplicate events from those disparate signals. Then PagerDuty AIOps groups the events into an existing incident. This prevents a new incident from being created. Teams still have access to the event data in the form of alerts, without the extra notifications. The result is that teams can better weather alert storms by bringing focus to what’s needed.
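
The deduplication behavior can be illustrated with a small in-memory sketch. The `dedup_key` name mirrors the field used by PagerDuty's Events API v2, but everything else here, including the dictionary-based incident store and the `notifications` counter, is a hypothetical model and not how PagerDuty works internally.

```python
from typing import Dict, List


def ingest(events: List[dict]) -> Dict[str, dict]:
    """Fold events that share a dedup key into one incident."""
    incidents: Dict[str, dict] = {}
    for event in events:
        key = event["dedup_key"]
        incident = incidents.get(key)
        if incident is None:
            # First event for this key: open an incident and notify once.
            incidents[key] = {
                "summary": event["summary"],
                "alerts": [event],
                "notifications": 1,
            }
        else:
            # Repeat event: keep the data as an alert, send no new notification.
            incident["alerts"].append(event)
    return incidents
```

Responders still see every alert attached to the incident, yet they are notified once per problem instead of once per event.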

Gaining context for better triage

Technically, all the information a responder needs to resolve an incident exists. But it’s buried within multiple disparate streams of data. Humans alone cannot condense all this data into succinct actionable insights. This means teams spend a long time looking for answers to questions that machine learning (ML) could answer instead. ML can look at both historical event data and human interaction. Then ML translates the analyzed data into actionable insights. With ML, teams can answer key questions such as:

Where should my team look first?
Are other teams working on the same problem?
Is this a common incident or completely new?
Have we seen this before; how was it resolved?
Did any relevant changes occur before this incident?

But developing your own ML can be a daunting task. It requires time and resources such as headcount. Many organizations choose to partner with a vendor instead.

PagerDuty AIOps ML algorithms help surface critical information such as:

Probable Origin: determines probable cause based on previous incidents affecting your service.
Related Incidents: shows whether a current incident is affecting your service.
Outlier Incidents: shows whether this incident happens frequently, happens rarely, or is a total anomaly.
Past Incidents: lets you look at the incident details and see how responders resolved it in the past.
Change Correlation: connects with your change integrations to show changes to your service, then leverages ML to correlate patterns between change events and incidents.
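
As a rough illustration of the change correlation idea, the sketch below lists changes to the affected service shortly before an incident began, most recent first. It is a plain time-window heuristic rather than the ML correlation described above, and the `ChangeEvent` fields and one-hour lookback are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeEvent:
    service: str
    summary: str
    at: datetime


def recent_changes(
    changes: List[ChangeEvent],
    service: str,
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=1),
) -> List[ChangeEvent]:
    """Return changes to the service within the lookback window, newest first."""
    candidates = [
        change
        for change in changes
        if change.service == service
        and incident_start - lookback <= change.at <= incident_start
    ]
    return sorted(candidates, key=lambda change: change.at, reverse=True)
```

Even this simple filter gives a responder a short list of suspects, such as a deploy ten minutes before the incident, without searching several tools by hand.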

Each time this information is surfaced for your team without having to manually dig, you get to resolve the incident faster. That decreased MTTR provides you with more time to focus on value-add initiatives.

Self-healing by crafting auto-remediation

One initiative you can focus on to spend less time firefighting is automation. This is where you can orchestrate a fix and self-heal before the problem even becomes an incident. It’s resolved before it hits a responder. Now someone gets to sleep through the night instead of responding to a notification. But this initiative can seem very intimidating. The reality is that starting small and tackling low-hanging fruit can make self-healing easier than you may expect.

You can identify well-understood resolution scenarios where you can automate the response. These may be scenarios that your team would classify as frequent, or ones where the resolution is straightforward. Teams can then create automation to resolve these without human intervention. Then, as that automation starts to take effect, your teams will start to free up time to work on new automation initiatives.

PagerDuty’s Event Orchestration helps teams create automation that spans the entire technical ecosystem. Event Orchestration enriches and routes events, then kicks off automation to self-heal. This feature allows users to trigger remediations for well-understood incidents via webhook. For more complex issues where auto-remediation might not be a possibility, teams can also leverage automation to kick off diagnostics. This builds upon the triage information responders have when they first view their incident.

Looking to get started with AIOps?

AIOps can help teams see fewer incidents and faster resolution. PagerDuty can help you achieve this, and more, with PagerDuty AIOps. See PagerDuty AIOps in action by requesting a trial or taking our product tour. In the market for AIOps? Read our buyer’s guide.


Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration by Hannah Culver

Hannah Culver — Tue, 18 Apr 2023 12:00:58 +0000

PagerDuty’s Global Event Orchestration is now generally available. Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities.

Customers in our early access program are already seeing value in Global Event Orchestration, touting reduced MTTR and better standardization of incident response at scale. As Kiril Yurovnik, Technical Lead at Riskified, said, “With a growing number of events, minimizing noise and toil is imperative, especially as organizations aim to optimize their IT processes amid the current economic environment. We’ve been using PagerDuty’s Global Event Orchestration as part of the early availability program, and the results have been strong. Riskified has been able to scale noise reduction, especially from non-production environments, saving our team valuable time to spend time innovating on what’s next.”

What are Global Event Orchestrations?

Global Event Orchestration is like Service Event Orchestration in that it allows users to define complex rules that determine what happens to an event as it is processed. The difference is that Global Event Orchestration enriches events at ingest. Then, once the data is normalized, the event is routed to a service based on various criteria. This ensures that responders have the best event data possible to begin the response process.
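
The enrich, normalize, then route sequence can be sketched as a small rule engine. This is an illustrative model only: real orchestrations are configured in PagerDuty rather than written as code like this, and the severity aliases, rule conditions, and service names below are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical aliases used by different monitoring tools.
SEVERITY_ALIASES = {"crit": "critical", "fatal": "critical", "err": "error", "warn": "warning"}


def normalize(event: dict) -> dict:
    """Give every event the same shape before any routing decision is made."""
    normalized = dict(event)
    raw_severity = str(event.get("severity", "info")).lower()
    normalized["severity"] = SEVERITY_ALIASES.get(raw_severity, raw_severity)
    normalized.setdefault("environment", "production")
    return normalized


Rule = Tuple[Callable[[dict], bool], str]

# Rules are checked in order; the first match wins.
ROUTING_RULES: List[Rule] = [
    (lambda e: e["environment"] != "production", "non-prod-low-urgency"),
    (lambda e: e["severity"] == "critical", "major-incident-management"),
    (lambda e: e.get("component") == "database", "database-team"),
]


def route(event: dict, rules: List[Rule] = ROUTING_RULES, default: str = "noc-catch-all") -> str:
    """Normalize the event at ingest, then pick a destination service."""
    normalized = normalize(event)
    for condition, service in rules:
        if condition(normalized):
            return service
    return default
```

Because normalization happens before routing, one rule such as `severity == "critical"` covers every tool's spelling of "critical" instead of each service repeating the mapping.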

Global Event Orchestration has three key components that make it successful for scaling incident response.

Global Orchestration Rules allow users to apply actions to events across services. Teams can create rules which process event data across services and use the processed data to improve event routing. This empowers organizations to establish and improve on auto-remediation. This means that a human doesn’t need to be involved in an incident to resolve it. This also reduces the blast radius of an incident via more intelligent routing.

Enhanced integration key management reduces the workload of managing integration keys for different monitoring tools. This allows users to combine integration keys into one event orchestration. Even better, enhanced integration key management is now available for all PagerDuty plans.

Additional APIs allow for management at scale. Teams can use REST APIs for event source and Global Orchestration Rule management. Both of these APIs have Terraform support. These APIs are in addition to the REST APIs for Event Orchestration/Service Orchestration management.

“Leveraging PagerDuty’s Global Event Orchestration has been critical to ensure that our event routing processes are efficient and scalable to optimize IT operations and spend,” said Brian Long, Cloud Infrastructure Engineer at Hyland. “With Global Event Orchestration, our organization is able to detect the “resolved” condition from our notifications to execute as a resolve and reduce the number of places these conditions need to be configured by at least a factor of three. This frees up our time to focus on innovation, not configuration.”
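
The "resolved" detection Brian describes can be pictured with the sketch below, which turns a monitoring notification into either a trigger or a resolve event. The output field names (`routing_key`, `event_action`, `dedup_key`, `payload`) follow the Events API v2 shape as best understood here and should be verified against current documentation. The input fields (`status`, `check_id`, `title`, `host`) and the regular expression are hypothetical, and in Global Event Orchestration this logic would be a configured rule rather than code.

```python
import re

# Hypothetical pattern for statuses that mean "the problem has cleared".
RESOLVED_PATTERN = re.compile(r"\b(resolved|recovered)\b", re.IGNORECASE)


def to_event(notification: dict, routing_key: str) -> dict:
    """Map a monitoring notification to a trigger or resolve event."""
    status = notification.get("status", "")
    action = "resolve" if RESOLVED_PATTERN.search(status) else "trigger"

    event = {
        "routing_key": routing_key,
        "event_action": action,
        # Reusing the same key lets the resolve match the earlier trigger.
        "dedup_key": notification["check_id"],
    }
    if action == "trigger":
        event["payload"] = {
            "summary": notification["title"],
            "source": notification["host"],
            "severity": "error",
        }
    return event
```

Defining the condition once at the global level, instead of once per service, is what reduces the number of places it has to be configured.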

How can Global Event Orchestration help my team?

With Global Event Orchestration, teams will see:

Codified incident response processes: democratize and distribute well-understood incident responses across distributed teams
Fewer incidents: use contextual event data from all services within your ecosystem to improve suppression accuracy
Faster resolution: apply automation across teams and enable automated diagnostics at scale with standardized enrichment and data normalization

How teams use Global Event Orchestration may vary based on organizational structure. Capabilities align with two different groups: ITOps, SRE, and NOC teams on one side, and developer teams on the other.

ITOps teams will be able to capitalize on the event normalization capabilities, ensuring that all events look the same as they come in.

SRE teams can create and extend automation across any or all services within a technical ecosystem. This makes scaling and standardizing automation across an organization easier than ever.

For L1 response teams such as NOCs, Global Event Orchestration helps them handle the massive incoming wave of events. Events can be routed to the NOC if they meet certain criteria. And, as the event passes through levels of rules and nested rules, automation can deliver diagnostics to the L1 responder. If the fix for an incident is well-known, organizations can create auto-remediation.

Developer teams will see fewer incidents and faster resolution. With auto-remediation, incidents can be resolved before they even hit the services that the developer teams are on call for. And, with in-depth routing criteria, incidents don’t bounce from team to team. If automation or the NOC or L1 responders can’t resolve it, the incident will go to the subject matter expert (SME). And, by the time the SME begins to work on the incident, diagnostic information is already available, reducing resolution time.

How can I get started today?

Global Event Orchestration is generally available for all PagerDuty AIOps customers. To see it in action, join us on Twitch Friday, April 14.

PagerDuty AIOps helps teams experience fewer incidents, faster resolution, and greater productivity without long implementations or heavy ongoing maintenance. To try PagerDuty AIOps, you can request a trial here or take our product tour. If you want to talk to sales, contact us through this form.

To learn more about Global Event Orchestration, register for this webinar. If you’re a PagerDuty AIOps customer looking to create your first Global Event Orchestration, this knowledge base article can show you how to get started.
