artificial intelligence | Tags | PagerDuty

Generative AI for the PagerDuty Operations Cloud by Sean Scott

Sean Scott — Thu, 01 Jun 2023 17:00:21 +0000

When it comes to keeping your business’s lights on, you need to manage and orchestrate your operational activities, prioritize high-impact and urgent work, and maintain day-to-day precision. Trust is paramount during mission-critical, time-sensitive crisis response, and the narrow margin for error means there is little tolerance for generative AI hallucinations or false positives.

This is why our roadmap has always focused on innovation designed to make your job easier: innovation with a purpose. At PagerDuty, we have been working with AI and machine learning for years, and have become the industry leader in AIOps. And it’s through that lens that we’ve evaluated GenAI–not for its own sake, but by asking ourselves how it could unlock more value across the PagerDuty Operations Cloud.

From code co-pilots to incident response assistants, generative AI represents a tremendous opportunity. The ease and elegance of engaging with generative AI–its fundamental intuitiveness through a natural language interface–creates a step function opportunity to unlock the full potential of automation. There’s no question that automation has the potential to save time and money while increasing productivity and capacity, but automation initiatives can die under the weight of their own abstraction.

GenAI brings a consumer-style simplicity to enterprise-grade automation and makes the realization of automation’s potential much more real. The pace of software development will only accelerate, and more software means more complexity–which makes DevOps more important than ever.

Today, I’m excited to share the first three generative AI-supported capabilities PagerDuty is bringing to the PagerDuty Operations Cloud:

AI Generated Status Updates

When unplanned, interrupt work strikes, communication and coordination are essential to resolution. Industry best practices recommend regular status updates to stakeholders and leadership every 30 minutes (at least) to be sure the business responds with one voice. But crafting those updates takes time and bears its own cognitive load at a time when your teams are already at surge capacity. We have customers who tell us that during major incidents they have three people dedicated to just status updates.

This was a perfect place for us to kick off a generative AI deployment. With generative AI integrated into our Status Update feature, teams can save cycles on what to say and to whom–they can generate persona-based status update drafts with just a few clicks. The new capability leverages AI to process all data related to the current incident and auto-generate a summary, offering key insights on events, progress and challenges. This feature enhances incident management workflows and streamlines communication in addition to saving time, allowing your team to focus on the real work of resolution.

AI Generated Incident Postmortems

Postmortems are a staple of operational excellence and a best practice often driven by site reliability engineering (SRE)–it’s how you learn what went wrong, where you could improve, and most importantly, how to avoid making the same mistakes again and again.

Taking the time to document postmortems, however, can be challenging. It’s a drawn-out, manual (and occasionally emotional) process to collect all the relevant data points for review as a group.

But imagine you had a virtual set of team members shadowing the incident from start to finish, a team whose only job is to create a timely and unbiased draft of your postmortem report. That’s exactly what we can give you by applying generative AI to automate the generation of comprehensive post-incident draft reports.

As you’ll see in the video, once an incident is resolved, the user can elect to generate a postmortem review. This triggers the real-time collection of all available data around the incident at hand (including logs, metrics, and relevant Slack or Microsoft Teams conversations), work that is time-consuming when done by hand. PagerDuty then produces a detailed draft report that highlights key findings, root causes and areas of improvement, along with a list of recommended action items tailored to prevent similar issues from occurring in the future.

Not only will this feature save time, but it will also provide a starting point for capturing crucial learnings, fostering a culture of continuous improvement and enabling the team to spend more time on future proofing–which brings us back to the criticality of the human-in-the-loop approach to unlocking the power of generative AI when you’re talking about mission-critical work.

Like the Status Update example above, automated incident postmortems require a person to provide expertise, judgment and oversight, validating and refining the report before releasing it for broader consumption.

AI Generated Process Automation

We’ve been using automation across the PagerDuty Operations Cloud platform since its inception, partnering with many of you to provide scripts and plugins to automate workflows that help you manage and resolve unplanned work more quickly. Our customers use us every day–whether in the cloud or on prem–for infrastructure automation as well as driving Ansible, Terraform and Power Automate. But if the scripts and tools don’t already exist, you have to do the heavy lifting yourself to actually code the script.

No longer. With generative AI, we’ve built a co-author for your automation needs. It’s like having an extra developer on your team whom you can task with researching how to do what you want and then creating the automation for you. Best of all, it will write in your favorite scripting language or translate from one language to another, so you ultimately keep full control. We’re bringing low-code capabilities to what used to be a high-code experience, without losing any power or flexibility. For example, you can simply state, “Write an automated workflow that adds a specific user to a group within Okta. I should be able to specify the user by email and group at runtime,” hit the generate button, and watch the magic happen.
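To make that prompt concrete, here is a hand-written sketch of the kind of script it might produce. It is not output from the PagerDuty feature. It assumes Okta's REST endpoints for fetching a user by login and for adding a user to a group, an API token sent with the SSWS scheme, and that the user's Okta login is their email address. The function names and the `send` parameter are hypothetical, chosen only for this illustration.

```python
import json
import urllib.parse
import urllib.request


def okta_request(method, url, token):
    """Send one authenticated request to Okta; return parsed JSON, or None if the body is empty."""
    req = urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"SSWS {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def add_user_to_group(org_url, token, email, group_id, send=okta_request):
    """Look up a user by email (assumed to be the Okta login) and add them to a group.

    `send` is injectable so the flow can be exercised without calling Okta.
    """
    base = org_url.rstrip("/")
    login = urllib.parse.quote(email, safe="")
    user = send("GET", f"{base}/api/v1/users/{login}", token)
    send("PUT", f"{base}/api/v1/groups/{group_id}/users/{user['id']}", token)
    return user["id"]
```

The email and group ID are plain arguments, so both can be supplied at runtime, for example from command-line flags or workflow inputs. A person should still review a script like this before it runs against a production directory.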

We are early in a journey where generative AI will accelerate learning, eliminate toil and increase our productivity, while freeing us up for more creativity.

With any new technology, there is risk. Managing that risk successfully is in our DNA, as our customers have seen as we introduced AI, machine learning and automation capabilities across the platform over the years. It’s why “human in the loop” is a central tenet in our AI work. And it’s why we’re moving quickly but keeping the tenets of fidelity, security and accuracy in mind as we build.

We want your feedback and input–as the possibilities are endless. What matters to you the most? Join the waitlist for these features, and sign up to get development updates. We expect to begin releasing these capabilities over the coming months.

The post Generative AI for the PagerDuty Operations Cloud appeared first on PagerDuty.

Gartner® Report: Market Guide for AIOps Platforms 2022 by Vivian Chan

Vivian Chan — Tue, 14 Jun 2022 00:34:49 +0000

The post Gartner® Report: Market Guide for AIOps Platforms 2022 appeared first on PagerDuty.

Using AI / ML to Supercharge Continuous Delivery With Harness and PagerDuty by Steve Burton

Steve Burton — Fri, 07 Sep 2018 13:00:56 +0000

At first glance, applying machine learning to Continuous Delivery might sound a bit like cracking a peanut with a sledgehammer. I mean, how hard can deployment automation actually be?

As it turns out, it’s way more complex than we think.

Pushing a new deployment into production typically has two outcomes:

1. The service stays up and we think everything is OK.
2. The service doesn’t stay up and all hell breaks loose.

In reality, these two outcomes represent how 95 percent of organizations measure deployment success (up=good, down=bad). Those of you who are happy PagerDuty customers will be most familiar with outcome No. 2 (from the storm of alerts/incidents that hit your cell phone). However, outcome No. 1 is also misleading, because a service staying up doesn’t automatically imply health, performance, or quality.

Disadvantages of Manual Deployment Health Checks

One thing we learned from our first 25 customers at Harness is that most organizations typically have 3-5 engineers who each spend at least an hour manually verifying production deployments. For example, one of our customers, Build.com, used to have 5-6 team leads spending an hour each manually analyzing data from New Relic and Sumo Logic, which usually means having multiple console/browser windows open and switching context between bash scripts, application performance monitoring, and log analytics tools.

Given that human short-term memory can hold only a handful of items at once (the classic estimate is seven, plus or minus two), and given all the incoming data from various systems, it’s pretty easy for humans in 2018 to miss things. Manual analysis and health checks are challenging when you have several hundred thousand time-series metrics and a few million log entries to look at post-deployment.

Let AI/Machine Learning Assist Health Checks

At Harness, we don’t just automate the deployment of software artifacts to production; we also automate health checks using AI and ML. We call this Continuous Verification.

We primarily use unsupervised machine learning algorithms such as Hidden Markov Models, Symbolic Aggregate approXimation (SAX), k-means clustering, and some neural nets to automate the detection of anomalies and regressions from APM and log data.

Within seconds of deploying a new software artifact, Harness can connect to any APM or Log tool and automatically generate a model of application behavior from a performance (response time/throughput) and quality perspective (error/exception/events).

Harness then compares these models with previous deployments and flags any new anomalies or regressions instantly. What takes humans hours to process and analyze takes merely seconds with machine learning algorithms.

For example, the below screenshots are from Harness verification of AppDynamics APM data:

In the above image, you can see that Harness flagged two business transaction performance regressions post-deployment. Tied to that, the below image shows that one transaction—“Request Login”—actually increased from 31ms to 165ms in response time. All of this analysis is automated with AI/ML.
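The core idea of comparing a new deployment against a baseline can be sketched in a few lines. This is a deliberately simple stand-in, not Harness's models: it uses a fixed ratio threshold instead of machine learning. The "Request Login" figures come from the example above, while the "Search" transaction, its timings, and the `max_ratio` default are hypothetical.

```python
def find_regressions(baseline_ms, current_ms, max_ratio=1.5):
    """Return transactions whose response time grew beyond max_ratio times the baseline."""
    regressions = {}
    for name, new_time in current_ms.items():
        old_time = baseline_ms.get(name)
        if old_time is None or old_time <= 0:
            continue  # no usable baseline to compare against
        ratio = new_time / old_time
        if ratio > max_ratio:
            regressions[name] = round(ratio, 2)
    return regressions


baseline = {"Request Login": 31, "Search": 120}   # previous deployment (ms)
current = {"Request Login": 165, "Search": 118}   # new deployment (ms)
print(find_regressions(baseline, current))  # {'Request Login': 5.32}
```

A fixed threshold like this is easy to reason about but noisy in practice, which is part of why a learned model of normal behavior is attractive at scale.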

Here’s another example of Harness detecting error/exception anomalies in application logs from Splunk:

Red dots signify new errors that have been introduced to the application logs from the deployment. Gray and blue dots represent baseline events or error/exceptions that are normally observed with every deployment.

Harness uses k-means clustering with some Jaccard and cosine distance calculations to generate these visuals. Clicking on any dot also shows the stack trace and root cause of the event.
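To give a feel for the two distance measures, here is a minimal sketch that scores log messages by token overlap. It is not the Harness implementation: it replaces the clustering step with a simple nearest-neighbor check against baseline messages. The sample messages, the whitespace tokenizer, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(message):
    """Split a log message into lowercase whitespace-delimited tokens."""
    return message.lower().split()


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the two token sets."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def cosine_distance(a, b):
    """1 - cosine similarity of the two token-count vectors."""
    count_a, count_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def new_events(baseline, current, threshold=0.5, distance=jaccard_distance):
    """Return messages in `current` that are far from every baseline message."""
    return [
        msg for msg in current
        if all(distance(msg, known) > threshold for known in baseline)
    ]


baseline_logs = ["connection timeout to cache node 3", "user session expired"]
current_logs = ["connection timeout to cache node 7",
                "NullPointerException in checkout handler"]
print(new_events(baseline_logs, current_logs))
# ['NullPointerException in checkout handler']
```

The timeout on a different cache node sits close to a known baseline message, so it is treated as normal, while the exception shares no tokens with anything seen before and is flagged as new.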

Automate Rollback with AI/ML Intelligence

Harness can also automate the rollback of deployments using the intelligence from its Continuous Verification. Think of Harness as a safety net that lets Dev/DevOps teams deploy faster but then roll back whenever new anomalies or regressions are encountered.
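The safety-net pattern itself is simple control flow, sketched below. This is not Harness code. The `deploy`, `verify`, and `rollback` callables are hypothetical hooks that stand in for whatever your pipeline provides, and `verify` is assumed to return a list of detected problems (empty when the deployment looks healthy).

```python
def deploy_with_safety_net(deploy, verify, rollback, new_version, previous_version):
    """Deploy new_version, then roll back to previous_version if verification finds problems."""
    deploy(new_version)
    problems = verify(new_version)
    if problems:
        rollback(previous_version)
        return {"version": previous_version, "rolled_back": True, "problems": problems}
    return {"version": new_version, "rolled_back": False, "problems": []}
```

Keeping verification as a pluggable function means the same flow works whether the check is a manual threshold or an ML-based comparison against earlier deployments.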

With upcoming Harness support for PagerDuty, organizations will be able to use PagerDuty as a notification channel as well as a verification source. For example, Harness can query PagerDuty pre-deployment to see if there are any active incidents being experienced in production. The last thing Dev/DevOps teams want to do is deploy to a hot environment.
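A pre-deployment check of this kind might look like the sketch below. It is not the Harness integration. It assumes the PagerDuty REST API's incidents endpoint, filtered to triggered and acknowledged incidents and authenticated with an API token. The function names and the idea of gating on a set of service IDs are hypothetical, and pagination is ignored for brevity.

```python
import json
import urllib.parse
import urllib.request

PAGERDUTY_INCIDENTS_URL = "https://api.pagerduty.com/incidents"


def fetch_open_incidents(api_key):
    """Fetch triggered and acknowledged incidents (first page only) from PagerDuty."""
    query = urllib.parse.urlencode(
        [("statuses[]", "triggered"), ("statuses[]", "acknowledged")]
    )
    req = urllib.request.Request(
        f"{PAGERDUTY_INCIDENTS_URL}?{query}",
        headers={
            "Authorization": f"Token token={api_key}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["incidents"]


def safe_to_deploy(incidents, service_ids):
    """Return True when no open incident belongs to one of the target services."""
    blocking = [
        inc for inc in incidents
        if inc.get("status") in ("triggered", "acknowledged")
        and inc.get("service", {}).get("id") in service_ids
    ]
    return not blocking
```

A pipeline step could call `safe_to_deploy(fetch_open_incidents(key), {"PSVC1"})`, where `PSVC1` is a placeholder service ID, and hold the release while the environment is hot.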

In summary, Harness offers Continuous Delivery as-a-Service that helps organizations automate the deployment and delivery of software to end users in production. We help customers move fast without breaking things.

Steve Burton is a CI/CD and DevOps Evangelist at Harness.io. Prior to Harness, Steve did Geek stuff at AppDynamics, Moogsoft, and Glassdoor. He started his career as a Java developer back in 2004 at Sapient. When he’s not playing around with tech, he’s normally watching F1 or researching cars on the Internet.

The post Using AI / ML to Supercharge Continuous Delivery With Harness and PagerDuty appeared first on PagerDuty.