automation actions | Tags | PagerDuty

PagerDuty Debuts as a Leader in 2022 GigaOm Radar for AIOps Solutions by Heath Newburn

Heath Newburn — Tue, 09 Aug 2022 13:00:32 +0000

Every year there is a surprise in a Radar report. While it won’t be a surprise to our thousands of customers who are seeing tremendous benefits with us, PagerDuty is excited to be named a Leader in the 2022 GigaOm Radar for AIOps Solutions.

GigaOm uses extensive criteria to evaluate vendors in their Radar. From the report: “This year we’re distinguishing AIOps solutions that require displacing existing tools from those that can be added to the IT tool box without major disruption.” This was one of the keys to PagerDuty being positioned as a Leader and rated Outstanding on Tool Displacement.

Time to value and total cost of ownership have long been hallmarks of PagerDuty’s business value. Customers can trust PagerDuty to help them rapidly maximize the value of their existing systems without having to rip and replace.

Our SaaS platform delivers simplified setup, snap-on integrations, and easy-to-use event routing and enrichment that were all important to our Outstanding score for Ease of Implementation.

With more than 650 integrations, PagerDuty was rated as Exceptional on Data Consumption, System Integration, and Cloud Monitoring. This breadth of capabilities ensures that practitioners get quick time to value with virtually no customization required. The UI, Terraform, and API providers allow subject matter experts to leverage all their data sources (monitoring, CI/CD, DevOps, Security, BizOps, etc.) to create the context needed for even new team members to rapidly solve problems.

Our Event Orchestration allows for simple yet powerful routing, enrichment, and automated responses to problems. By moving away from massive, complex rule bases to simplified node-based graph routing, SRE and DevOps teams can control exactly how they want to use events to create context, provide diagnostics, and automatically resolve problems where appropriate. The simple graphical interface provides for easy experimentation, while the underlying Terraform provider enables self-service capabilities, removing the burden from a centralized team. This holistic self-service capability was highlighted in our Outstanding rating for Manageability.
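To make the routing-and-enrichment idea a little more concrete, here is a minimal Python sketch of first-match rule handling for incoming events. It is not PagerDuty's Event Orchestration API or its Terraform schema: the `Rule` class, the event fields (`source`, `severity`, `summary`), and the service and job names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Event = dict  # hypothetical event shape: a flat dict of fields


@dataclass
class Rule:
    """One node in a toy routing graph: a condition plus what to do on a match."""
    condition: Callable[[Event], bool]
    route_to: str                                  # hypothetical service name
    enrich: dict = field(default_factory=dict)     # extra context added to the event
    automation: Optional[str] = None               # hypothetical diagnostic job to trigger


def orchestrate(event: Event, rules: list, default_route: str = "catch-all") -> dict:
    """Return a copy of the event, enriched and routed by the first matching rule."""
    for rule in rules:
        if rule.condition(event):
            return {**event, **rule.enrich, "route": rule.route_to, "automation": rule.automation}
    return {**event, "route": default_route, "automation": None}


rules = [
    Rule(
        condition=lambda e: e.get("source") == "cloudwatch" and "CPU" in e.get("summary", ""),
        route_to="platform-team",
        enrich={"priority": "P2"},
        automation="top-cpu-processes",
    ),
    Rule(
        condition=lambda e: e.get("severity") == "critical",
        route_to="on-call-sre",
        enrich={"priority": "P1"},
    ),
]

result = orchestrate(
    {"source": "cloudwatch", "severity": "warning", "summary": "High CPU on web-1"}, rules
)
print(result["route"], result["automation"])
```

Keeping each rule as a small, self-describing node is what makes this style easy to experiment with: a team can add or reorder rules for its own services without touching anyone else's.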

GigaOm recognized the advantage of an automation-first approach to AIOps. Our Rundeck and Catalytic acquisitions have enabled us to offer comprehensive automation integration across the platform in the form of built-in Automation Actions and flexible workflows. Balancing workloads between your humans and your machines is critical to maintaining productivity and preventing burnout. Leveraging automation as the first responder in incident resolution can remove toil and accelerate time to resolution. In cases where a responder is not required, common problem signatures can be identified and handled at machine speed with automated remediation. But no, the machines are not coming for our jobs: while auto-remediation can handle a small percentage of well-understood fixes, more often than not automation can serve as a second pair of hands to augment responders at the center of incident response and investigation.

Although this is only our first year in the Radar, we have built on the past several years’ success with Event Intelligence and are committed to growing capabilities for our customers to deliver new business outcomes. We are on track to process 20 billion+ events for clients this year. By leveraging our many years of data as a SaaS platform to understand how clients reduce noise and resolve problems, we have been able to grow machine learning, automation, and analytics allowing teams to focus on keeping production running and delivering better solutions.

Read the report for yourself here and learn more about PagerDuty’s solution for AIOps here.

The post PagerDuty Debuts as a Leader in 2022 GigaOm Radar for AIOps Solutions appeared first on PagerDuty.

New! Common Automated Diagnostics for AWS Users by Jake Cohen

Jake Cohen — Wed, 03 Aug 2022 13:00:02 +0000

Modern cloud architectures centered on AWS are typically a composite of ~250 AWS services and workflows implemented by over 25,000 SaaS services, in-house-developed services, and legacy systems. When incidents fire off in these environments, whether or not a company has built out a centralized cloud platform, distinct expertise is often a necessity. Because of this scaled complexity, first responders find themselves having to escalate to several different service owners or expert engineers to gather diagnostics before it’s possible to determine who the ultimate resolver of an issue should be.

When it comes to incident response, it’s critical that these new cloud environments seamlessly integrate with an organization’s existing critical applications and services, both old and new. To improve service quality and make it easier for responders to cross that bridge of expertise, we are excited to announce the immediate availability of new AWS plugin integrations for automated diagnostics.

New AWS Plugins for Automated Diagnostics

Our new AWS plugins for Automated Diagnostics help provide deeper coverage for customers that are also users of AWS, making it easier to get up and running with automated diagnostics in their AWS environment quickly.

The new AWS plugins for Automated Diagnostics include:

CloudWatch Logs plugin. This plugin retrieves diagnostic data from AWS infrastructure and applications. Now users can more easily run automated diagnostics for AWS across multiple accounts and products.
Systems Manager plugin. This plugin lets users automate tasks such as configuration management, patching, and deploying monitoring and security tooling agents, so those tasks run faster and more accurately.
ECS Remote Command plugin. This plugin provides a mechanism to execute commands on containers. This enables developers and operators to retrieve diagnostic data from their running applications in real-time before redeploying their services.
Lambda Custom Code Workflow plugin. Create, execute, and optionally delete a new Lambda function with the custom code provided in a Job step as its input. Execute custom scripts as steps in jobs without having to install any software.

Sound complex? Don’t worry, we thought of everything :).

New Auto-Diagnostic Job Templates for AWS Users

We also released new pre-built templates for AWS, so you can start enhancing incident details for your specific environments immediately. These are purpose-built to be used with minimal configuration. Instead of starting from scratch, users now have a library of curated, ready-to-use job definitions that retrieve data for investigating, debugging, and triaging incidents during a response.

New users can start automating diagnostics for AWS faster and existing users can easily add AWS diagnostics to their existing PagerDuty Process Automation project.

Some example job templates include:

AWS – EC2	Instance Status & Associated IAM Roles	Retrieve EC2 Instance Status and Associated IAM Roles	Remote command (or SSM)
AWS – ECS	Stopped ECS Task Errors	Checks stopped ECS Tasks for errors and provides detailed information on the reason for the errors.	Stopped ECS Tasks
AWS – ELB	Retrieve ELB Targets Health Status	Retrieve the list of unhealthy Targets in the Load Balancer’s associated Target Groups.	ELB Instance Statuses
AWS – RDS	Check Database Storage Status	Checks RDS database for the instance status.	RDS Instance Status
AWS – VPC	IP addresses using UDP transfer protocol	Query CloudWatch logs to identify any hosts using the UDP transfer protocol.	CloudWatch Logs
AWS – VPC	Top 10 Hosts by Throughput on Subnet	Query CloudWatch logs to identify the top 10 hosts by throughput on a given subnet.	CloudWatch Logs
AWS – VPC	Top 10 Source IP Addresses with Highest Rejected Requests	Query CloudWatch logs to identify the top 10 source-IP addresses with the highest rejected-requests.	CloudWatch Logs
AWS – VPC	Top 10 Web-Server Requestors by Public IP	Query CloudWatch logs to identify the top 10 public-IP requestors to our web-server (e.g. Nginx).	CloudWatch Logs
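To give a feel for what a template such as "Top 10 Source IP Addresses with Highest Rejected Requests" computes, here is a small offline Python sketch. The real job templates query CloudWatch Logs; this version instead assumes you already have VPC flow log records in the default space-separated format as plain text lines, and the function name and sample records are made up for illustration.

```python
from collections import Counter


def top_rejected_sources(flow_log_lines, limit=10):
    """Count REJECT records per source IP in default-format VPC flow log lines."""
    counts = Counter()
    for line in flow_log_lines:
        fields = line.split()
        if len(fields) < 14:      # skip malformed or truncated records
            continue
        # Default format: version account-id interface-id srcaddr dstaddr srcport
        # dstport protocol packets bytes start end action log-status
        srcaddr, action = fields[3], fields[12]
        if action == "REJECT":
            counts[srcaddr] += 1
    return counts.most_common(limit)


# Made-up sample records using documentation IP ranges.
sample = [
    "2 123456789012 eni-0abc 203.0.113.5 10.0.0.12 44321 22 6 3 180 1660000000 1660000060 REJECT OK",
    "2 123456789012 eni-0abc 203.0.113.5 10.0.0.12 44322 22 6 2 120 1660000000 1660000060 REJECT OK",
    "2 123456789012 eni-0abc 198.51.100.7 10.0.0.12 51515 443 6 10 8400 1660000000 1660000060 ACCEPT OK",
    "2 123456789012 eni-0abc 198.51.100.9 10.0.0.12 40000 3389 6 1 60 1660000000 1660000060 REJECT OK",
]
print(top_rejected_sources(sample))
```

If your flow logs use a custom field order, the two index positions are the only thing that would need to change.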

And this is just the tip of the iceberg! We will continue to develop and build upon our existing plugins to ensure our customers who use AWS are well-equipped to invoke automation wherever it is needed, including providing some interactive guides.

Want to learn more about common diagnostics? Register for our webinar event, “Common Diagnostics for Common Components,” on September 14th. Request a demo to see automated diagnostics with PagerDuty Process Automation in action.

Already using PagerDuty Process Automation? Check out the Automated Diagnostics solution guide to see the end-to-end process of achieving the full solution.

What’s New: Automation Actions in the PagerDuty Application for Zendesk by Carrie Lacina

Carrie Lacina — Mon, 01 Aug 2022 13:00:23 +0000

A Shift in Customer Expectations

The past few years have led to a significant increase in customer demands, and customer service agents are feeling the pressure. According to a recent Zendesk CX Trends report, 68% of agents report feeling overwhelmed. Here at PagerDuty, we believe that happier customer service agents lead to more positive customer interactions and stronger relationships with your brand.

To help customer service teams address this, PagerDuty continues to deepen our integration with Zendesk to help customer service teams resolve incidents as fast and efficiently as possible. With our latest release, we are excited to announce that PagerDuty Automation Actions is now available within the PagerDuty application for Zendesk.

Empowering Customer Service Agents with Automation

PagerDuty Automation Actions quickly diagnose and remediate incidents by connecting responders to corrective automation within PagerDuty. With the latest release of the PagerDuty application for Zendesk, agents can automatically validate problems and capture critical information instantly for the team to diagnose and resolve. Agents are empowered to validate customer-impacting issues and run automation actions directly from the PagerDuty app for Zendesk. This reduces resolution times and eases the load on backend teams by instantly adding critical customer information for the teams resolving the problems. It also helps reduce the number of issues escalated to engineering teams, for example when an issue is non-urgent or is not preventing customers from using the services.

Make customer service agents’ lives easier with Automation Actions

Customer service agents must be empowered to address critical incidents without adding to their burgeoning workload. Automation can help lighten that load and enable teams to do more. Here are a few ways automation can help make agents’ lives easier:

Automation helps establish consistent reactions for repeat and similar issues; this helps with reporting after incidents for process improvement.
It improves customer service agents’ efficiency when responding to cases.
It helps to reduce the time to resolution.
It allows agents to spend more time building and managing customer relationships.
Automation reduces the opportunities to make mistakes when an agent has to perform manual tasks when responding to cases under customer pressure.

To learn more about how PagerDuty Automation Actions and Zendesk can work for you, visit PagerDuty’s knowledge base. Additionally, you can contact your account manager or request a demo today.

Automating Common Diagnostics for Kubernetes, Linux, and other Common Components by Joseph Mandros

Joseph Mandros — Wed, 27 Jul 2022 13:00:45 +0000

Watch our Automated Diagnostics webinar on demand to learn about common diagnostics for common components and how we provide out-of-the-box job templates for you to get started right away.

This is the second piece in a series about automated diagnostics, a common use case for the PagerDuty Process Automation portfolio.

In the last piece, we talked about the basics around automated diagnostics and how teams can use the solution to reduce escalations to specialists and empower responders to take action faster. In this blog, we’re going to talk about some basic diagnostics examples for components that are most relevant to our users.

But before we jump in, let’s make clear what automated diagnostics isn’t, based on some audience feedback on Twitter from the last article:

Automated diagnostics is different from alert correlation. Alert correlation depends on a specified depth of signals, as well as an engine that can properly identify said correlated signals. Automated diagnostics is meant to help the first responder triangulate the source of the issue to either fix the issue faster themselves, or escalate more accurately.
Automated diagnostics is different from monitoring. Monitoring is purpose-built to identify undesired states in performance or activity. This means that most monitoring is not purpose-built to emulate a first-responder’s activities to validate a true positive, or identify the first actions to take. Monitoring is focused on raising the alert. Automated diagnostics is focused on determining how to fix an issue once the alert is already created.

That said, automated diagnostics can certainly make use of data collected by monitoring tools; most people don’t apply thresholds to every datapoint they collect. In fact, one of our more commonly used diagnostics integrations queries CloudWatch logs. While we might consider a log aggregator a monitoring tool, sometimes the first steps of investigation are to look at the data in the monitoring tool that exists purely for diagnosing issues.

Providing responders with on-demand or pre-run diagnostic capabilities for their own environments can help a first responder quickly determine probable cause. Giving first responders “diagnostic” data that is typically only retrievable by domain experts significantly reduces the need to pull in more people to troubleshoot incidents. This in turn drives down the cost of incidents and reduces mean time to repair (MTTR) by automating the investigative steps that are typically manual in nature.

The status quo: Automation in incident response

Operations managers are often excited about the idea of enabling self-healing or auto-remediation. It’s a natural inclination to assume that speeding up resolution through automation means “applying a cure.” But more often than not, the industry theory of “no two incidents are truly identical” rears its head. When you have a high degree of variability, this reduces the value of such potential automation since it’s less likely to be run. For example, restarting a core service may be the right way to fix today’s issue, but it could lead to a cascading failure—and an even bigger incident—tomorrow.

*The reader now switches cognitive gears to the initial stages of a response.*

But you know what tends to be highly repetitive? The same investigative steps a responder takes to begin to diagnose what went wrong and determine what happened. More repetitive action means more value to gain from applying automation. For example, let’s say an incident kicks off within your Kubernetes distribution. No matter the nature of the incident, whether it be something within your image repository or your load balancer, you’re likely still going to take the same diagnostic step of pulling your Kubernetes logs.

These diagnostic steps remain largely static for a given component, no matter the priority of the incident that occurs. Automated diagnostics can be applied to heterogeneous incidents: it doesn’t have to be purpose-built for the same recurring incident, and it can be customized around all sorts of common incident types and severities, specific to your environment, for almost any common component. Think of it like going to the doctor’s office. Whether you are going to urgent care for a specific complaint or just an annual checkup, they still take your temperature, blood pressure, and weight when you walk in.

Common Examples

Every developer environment is different, but many environments are also quite similar once you really pop the hood. In the beginning stages of a response, most diagnostics will come from three main data sources:

Application data
System data
Environment data

Several common diagnostics can be pulled automatically at the beginning of a response. This not only helps the responder better understand the severity of the incident, but also helps ensure the responder doesn’t pull in too many specialists and interrupt their normal day of work. For example, let’s look at Kubernetes (k8s) as a component for a responder during an incident. When an incident happens within a k8s environment, the infrastructure engineer who maintains the technology would typically perform actions such as:

Tail logs from k8s pod
Retrieve logs from k8s by selector label
Check image repo
Describe deployment
Execute command in pod
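For readers who have not run these by hand, the steps above map roughly onto standard `kubectl` subcommands. The Python sketch below only builds the argument lists and prints them; nothing is executed. The namespace, pod, deployment, and selector values are placeholders, the `df -h` command is an arbitrary example, and treating "check image repo" as "list the images the deployment references" is our own interpretation rather than what the out-of-the-box job does.

```python
import shlex


def k8s_diagnostic_commands(namespace, pod, deployment, selector, tail=100):
    """Build kubectl argument lists for the diagnostic steps listed above (nothing is run)."""
    ns = ["-n", namespace]
    return {
        "tail_pod_logs": ["kubectl", "logs", pod, f"--tail={tail}", *ns],
        "logs_by_selector": ["kubectl", "logs", "-l", selector, f"--tail={tail}", *ns],
        # One interpretation of "check image repo": list the images the deployment references.
        "deployment_images": [
            "kubectl", "get", "deployment", deployment, *ns,
            "-o", "jsonpath={.spec.template.spec.containers[*].image}",
        ],
        "describe_deployment": ["kubectl", "describe", "deployment", deployment, *ns],
        # Arbitrary example command; replace with whatever you would run inside the pod.
        "exec_in_pod": ["kubectl", "exec", pod, *ns, "--", "df", "-h"],
    }


commands = k8s_diagnostic_commands("prod", "web-7d9f", "web", "app=web")
for name, argv in commands.items():
    print(f"{name}: {shlex.join(argv)}")
```

On a machine with cluster access, each list could be passed to `subprocess.run(argv, capture_output=True, text=True)` to collect the output for the incident.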

One thing all of these actions have in common? A typical L1 responder ACK’ing an incident doesn’t know how to orchestrate them; it’s just not their area of expertise. But with the out-of-the-box jobs from PagerDuty’s Automated Diagnostics, the L1 responder can run these diagnostics automatically, which speeds up the response and reduces escalations to the infrastructure engineer responsible for the k8s environment.

Some common diagnostics and alert examples include:

CPU/Memory Consuming Services
- Common alert: High CPU/Memory
- Common question: Which service(s) are consuming CPU/Memory?
File size / Disk Consumption
- Common alert: High disk usage
- Common question: Which files/directories are consuming the most space?
System Logs: Linux/Windows Commands
- Common alert: Server/service issues
- Common question: Is it an OS issue or app issue?
SQL Database Commands
- Common alert: Database blocks/deadlocks
- Common question: Is there a long-running query blocking other database requests?
Host Availability
- Common alert: Host down
- Common question: Is it actually down or is it a false-positive reachability issue?
Application Error: Application Logs or traces
- Common alert: 400/500 error codes
- Common question: What is the stack-trace?
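As one illustration of the list above, the "File size / Disk Consumption" question ("Which files/directories are consuming the most space?") can be answered with a few lines of standard-library Python. This is a minimal sketch, not the job template itself; the function name is our own, and the demo runs against a throwaway temporary directory so it is safe to try anywhere.

```python
import os
import tempfile


def largest_subdirectories(root, limit=5):
    """Return (path, total_bytes) for the immediate subdirectories of root, largest first."""
    totals = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        size = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    size += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        totals.append((entry.path, size))
    totals.sort(key=lambda item: item[1], reverse=True)
    return totals[:limit]


# Demonstration on a temporary directory tree with two subdirectories of known size.
with tempfile.TemporaryDirectory() as root:
    for name, size in (("logs", 4096), ("cache", 1024)):
        os.mkdir(os.path.join(root, name))
        with open(os.path.join(root, name, "data.bin"), "wb") as fh:
            fh.write(b"\0" * size)
    report = largest_subdirectories(root)
    for path, size in report:
        print(os.path.basename(path), size)
```

Pointed at a real mount point such as a log directory, the same function gives the first responder the "where did the space go" answer without needing shell access.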

A few examples of some common diagnostics for known components:

CloudWatch Logs: Surface specific application and VPC logs.
ECS: View stopped ECS task errors.
ELB: Debug unavailable target-group instances.
Kubernetes: Retrieve logs from Pods by selector label.
Linux: Retrieve service status.
Nginx: Retrieve error logs.
Redis: Retrieve slow log entries.

And these are just some of the over 30 out-of-the-box job templates we have built for our users, which you can find in the Automated Diagnostics solution guide. To use the Automated Diagnostics Solution, you must have either a PagerDuty Runbook Automation license or a Process Automation (previously Rundeck Enterprise) license. See the FAQ for details on how to use it. If you do not have a license for either of these products, contact us to learn more.

Automating diagnostics within PagerDuty

Incidents that notify responders are filled with information provided by monitoring tools that have a “myopic” view of the alert(s). A common example is that high CPU usage triggers an alert, and this notifies a responder. But the information contained in the alert is surface-level in that it does not specify what might be the cause of the spiked CPU.

Diagnostic data is the deeper-level information that helps answer the “why” and “where” questions of incidents. Even though some monitoring and correlation tools provide some help in providing root-cause analysis for users, most fall short in their ability to emulate a responder’s investigative/troubleshooting steps of collating disparate data-sources into a unified view. By providing responders with on-demand or pre-run diagnostic capabilities, the odds of the first responder resolving the issue on their own increase, as well as the probability of pulling in fewer individuals to assist with the incident. Enter Automated Diagnostics.

Want to learn more about common diagnostics for the components you use? Register for our September 14th webinar event of the same name, hosted by Justyn Roberts, Senior Solutions Consultant, PagerDuty. New to Process Automation? Request a demo. Already using PagerDuty Process Automation? Check out the automated diagnostics solution guide to see the end-to-end process of achieving the full solution. Questions? Reach out to me directly on Twitter @sordnam and let’s chat!

Summit Recap: How to Adapt to a “Digital Everything” World by Sean Scott

Sean Scott — Wed, 08 Jun 2022 13:00:52 +0000

Every interaction with our customers, partners, and employees is special – but this year’s PagerDuty Summit went far beyond my wildest dreams. Together we committed to helping you learn and grow in how you manage business critical operations – in other words, getting you ready for anything in a world of Digital Everything.

New solutions for the new world of work

During our Summit, we covered a lot of ground. We discussed how PagerDuty solves the mismatch between modern work realities and the systems we have in place to do that work. We talked about how PagerDuty is more than on-call management or even incident response; it is the Operations Cloud for improving operations with data and machine learning, so humans can focus on more beneficial, meaningful tasks.

We also announced a slate of new tools and automation capabilities, including:

Incident Workflows: later this year, you will be able to define an orchestrated response using “if-this-then-that” logic. For the responder, this means creating a sequence of common incident actions, or customizing workflows based on your organization’s bespoke processes. And as you learn from each incident, you can feed those learnings back into your workflows, or even add custom fields to give responders more contextual data for faster triage and incident resolution. You can learn more here.
Event Orchestration: this powerful decision engine, launched earlier this year, allows teams to create custom logic to enrich, modify, and control routing based on event conditions. Now we’ve added Terraform Support to make event orchestration even more flexible for configuring, maintaining, and updating events at scale.
Automation Actions: these are now integrated with Event Orchestration, so if a CPU warning event occurs, deep diagnostics will kick in at time 0. We’ve also integrated Automation Actions into our Customer Service Ops offering, as well as into our Mobile and Slack experiences. Learn more about Automation Actions here.
Custom Status Updates: we’ve added Status Update Notification Templates, including rich text editing and visuals, to format the content and context of incident responses. After all, you want your internal communications to answer questions, not raise new ones.

Our journey continues

There’s much more in store from PagerDuty. We’re not slowing down, ensuring you can spend more time on fulfilling and interesting work. It’s time to revolutionize operations – to offload the routine tasks to machines, and direct more energy toward your customer. Data should be in service of us, not the other way around.

Being ready for anything in a world of Digital Everything doesn’t mean sacrificing human attention – it’s about putting that attention where it matters most. Technology should be intuitive and seamless, so that humans can deliver personalized support and foster peace of mind.

We’d love to hear about your journey and how you are preparing for a world of Digital Everything. If you attended Summit, what did you learn and how will you use that knowledge in your work? What else would you like to see from us in the coming months? Whether you attended this year’s event or are thinking about participating next year, your opinions and perspectives will help us do our best, so you can do yours.

Follow us on our blog, Twitter, LinkedIn, Facebook or Instagram channels for the latest PagerDuty news and insights.

Extending Automation Actions Across the PagerDuty Platform by Joseph Mandros

Joseph Mandros — Tue, 07 Jun 2022 13:00:03 +0000

It’s day one of PagerDuty Summit, and we are looking forward to a full day of expert presenters, actionable content, and educational sessions to boost your PagerDuty IQ and show you new ways to improve your team’s operational excellence.

One point you will continue to hear throughout the duration of the conference echoes our greater mission: To revolutionize operations so teams spend less time on reactive, break-fix work and more time delivering new innovation. At PagerDuty, we see this future of operations extending beyond the digital teams that build and run software to all teams in the organization. While many PagerDuty products and features exist to make this mission a reality, we are going to focus on the latest and greatest with PagerDuty Automation Actions®, part of the PagerDuty Process Automation® portfolio.

New Updates and Integrations With Automation Actions

Automation Actions connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams can reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

We launched Automation Actions last year to help organizations get started quickly with simple first steps towards automation. Now, Automation Actions is integrated across the entire PagerDuty platform so that all users can remove manual, time-consuming, repetitive tasks like diagnosing issues when brought into bridge calls.

Let’s look at some of the latest and greatest automation capabilities with Automation Actions:

Automation Actions in Incident Response. Teams can now run automated diagnostics and remediate incidents directly within PagerDuty. This integration will improve productivity and remove toil by automating repetitive, manual tasks, and give time back to your engineers to focus on innovation.

Automation Actions for Customer Service Ops. This integration gives customer service agents the ability to validate customer problems and capture critical information via automation to diagnose and resolve cases faster. Agents are now empowered to validate customer-impacting issues and run automated actions directly from the PagerDuty app in Service Cloud.

Automation Actions for Event Orchestration. By combining nested event rules with machine learning and precise, targeted automation triggers, it’s now possible to action an incident before responders even get paged. This integration with Event Orchestration helps teams automate common diagnostics and enable self-healing for recurring and well-understood types of incidents, resulting in reduced MTTR and escalations to specialists.

Automation Actions in PagerDuty’s Mobile App. Everything you love about Automation Actions is now mobile! Invoke the same automation from Automation Actions and resolve common incidents directly from the PagerDuty mobile app.

Automation Actions in Slack. With this integration, incident responders can deploy scriptable diagnostics and remediation actions directly from Slack.

Automation Everywhere

To be ready for anything with increasing digital complexity and dependencies, operations must transform from being manual, rigid, and ticket queue-based, to a continuously improving system that focuses on outcomes and customer experience, delivers operational speed and resilience, and is heavily automated by machine learning and AI. Only then can teams move toward a more proactive posture, and reduce the burden of manual work to avoid burnout and preserve focus. With Automation Actions, teams not only have the ability to reach this operational milestone, but to excel and continue to mature their automation capabilities.

Be sure to check out related Automation Actions sessions at PagerDuty Summit:

Normalize Automation, Sean Noble, Principal Product Manager, PagerDuty
Is it the Cloud? App? Database? Reduce Escalations by giving first responders automated diagnostics, Jake Cohen, Senior Product Manager, PagerDuty

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

What is Automated Diagnostics and Why Should You Care? by Joseph Mandros

Joseph Mandros — Fri, 03 Jun 2022 13:00:35 +0000

How do you measure the cost of an incident?

A lot of people in technology talk about the cost of an incident solely from the perspective of downtime, or the number of customers and employees impacted. And from the surface, oftentimes that is a fair angle to take. It makes the headlines, and customer reputation and trust are critical to the success of any business—obviously.

But another direct cost of incidents that is infrequently acknowledged is the number of people that need to get involved during an incident; whether that’s to help investigate the root-cause, troubleshoot and resolve the incident, or absolve their team of responsibility—regardless of whether the incident is severe enough to impact your customers.

According to PagerDuty data, 50% of a responder’s time is spent determining who is best to pull in for additional support (and trying to figure out if there’s actually a problem) in x environment, or with y service. Given this statistic, this means that 50% of an incident’s lifespan is spent on the beginning stages of an incident (the diagnostic and triage phases), rather than on actual remediative actions.

The bottom line? The cost of people-hours and the number of manual actions taken per incident can get steep—fast.

Automating Your Incident Response

Applying automation to the early, recurring stages of the incident, including diagnosing the severity of the incident and understanding the genetic makeup of what went awry (and how), is critical to the success of the eventual remediation of the incident.

Automation is also important from a people perspective, ensuring your teams aren’t getting burnt out by the same, repetitive actions every time an incident kicks off. Ensuring the diagnostic data is available to first responders is paramount to the routing efficiency and overall workflow of the incident response.

Before we go any further, let’s first define diagnostic data. Diagnostic data is data retrieved by incident responders that is typically more specific than the information provided by monitoring tools. For example, whereas monitoring tools will alert you when there is a spike in CPU or memory, incident responders investigate by looking at the processes consuming the most CPU and memory. In this case, the process names or IDs and their associated compute consumption are the “diagnostic data.”
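The CPU example is easy to picture in code. Below is a minimal Python sketch that takes text in the shape produced by `ps -eo pid,pcpu,pmem,comm` on Linux and ranks processes by CPU or memory share. The function name and the sample output are invented for illustration; a real diagnostic job would capture the `ps` output from the affected host.

```python
def top_processes(ps_output, key="cpu", limit=3):
    """Rank processes from `ps -eo pid,pcpu,pmem,comm` style text by CPU or memory share."""
    column = {"cpu": 1, "mem": 2}[key]
    rows = []
    for line in ps_output.strip().splitlines()[1:]:   # skip the header row
        parts = line.split(None, 3)                   # command names may contain spaces
        if len(parts) < 4:
            continue
        pid, cpu, mem, command = parts
        rows.append((int(pid), float(cpu), float(mem), command.strip()))
    rows.sort(key=lambda row: row[column], reverse=True)
    return rows[:limit]


# Invented sample in the same column layout as the ps command above.
sample = """\
  PID %CPU %MEM COMMAND
    1  0.0  0.1 systemd
  812 91.5  3.2 java
  977  4.0 12.8 postgres
 1203  0.3  0.4 nginx
"""
for pid, cpu, mem, command in top_processes(sample, key="cpu", limit=2):
    print(pid, cpu, mem, command)
```

The alert only says "CPU is high"; the two rows this returns are the diagnostic data that tell the responder where to look next.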

So now that we have defined diagnostic data, why should you care about Automated Diagnostics? Because implementing an Automated Diagnostics practice can drive down the cost of incidents through both reduced incident duration and fewer responders paged.

The Problem with MTTR

Perhaps “problem” is the wrong word here, but hear me out: MTTR as a metric is too broad to return granular, actionable insights. Mean time to repair (MTTR) has been a staple maintainability metric in the IT universe for decades. And while it has many applications and does a great job of explicating the rate of general recovery, its Achilles’ heel is just that: generality. And now that we can safely infer that 50% of a responder’s time is spent determining who is best to pull in for additional support, we’ve started looking at other metrics within the MTTR timeline, such as MTTT (mean time to triage) or MTTI (mean time to investigate).

MTTI/MTTT: The average time between the detection of an IT incident and when the organization begins to investigate its cause and solution. This denotes the time between MTTD (mean time to detect) and the start of MTTR (mean time to repair).

At PagerDuty, we measure this as the time span between when your first responder “acks” and when your resolver “acks.” This metric helps us click into what’s actually happening under the hood during an incident. After observing our own data, we’ve been able to infer that MTTI is one of the most time-consuming factors of MTTR. And in modern business, when a task requires time and attention from engineers, that task is an expensive one for the business. Really expensive.
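As a rough illustration of that definition, the sketch below computes MTTI as the mean gap between the first responder's ack and the resolver's ack across a set of incidents. The timestamps are invented for the example, and how you would pull ack times out of your own incident data is deliberately left out; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_to_investigate(first_ack: datetime, resolver_ack: datetime) -> timedelta:
    """Gap between the first responder's ack and the resolver's ack."""
    if resolver_ack < first_ack:
        raise ValueError("resolver ack precedes first responder ack")
    return resolver_ack - first_ack


def mean_time_to_investigate(incidents) -> timedelta:
    """Average the per-incident gaps; `incidents` holds (first_ack, resolver_ack) pairs."""
    seconds = [time_to_investigate(a, b).total_seconds() for a, b in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up ack times for three incidents (gaps of 20, 40, and 30 minutes).
INCIDENTS = [
    (datetime(2022, 4, 1, 9, 0), datetime(2022, 4, 1, 9, 20)),
    (datetime(2022, 4, 2, 14, 5), datetime(2022, 4, 2, 14, 45)),
    (datetime(2022, 4, 3, 23, 30), datetime(2022, 4, 4, 0, 0)),
]

if __name__ == "__main__":
    print(mean_time_to_investigate(INCIDENTS))
```

Tracking this number separately from MTTR is what lets you see whether automated diagnostics are actually shortening the triage phase. Note that statistics.mean raises an error on an empty list, so a real report would need to handle periods with no incidents.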

Using Automated Diagnostics

Now let’s bring this back around to MTTI and automated diagnostics. MTTI is lengthened not only by the technical tasks of manually pulling diagnostic data and deciphering which team to escalate to for a given service and incident, but also by people and their limitations, depending on the specific expertise that is required to begin resolution. For example, in many cases, the first responder doesn’t know how to investigate the issue from the database or network “perspectives.” That may be due to a lack of skills (no background in databases or networks), a lack of access, or missing tribal knowledge (e.g., that a specific app component depends on a complex integration with a third-party service).

By automating these investigative and debugging tasks, in addition to having the ability to delegate these actions across teams and responders, you will experience a positively cascading effect on MTTI, and eventually, MTTR.

So why should you care about automated diagnostics?

With automated diagnostics, you can:

Reduce escalations to scarce experts by designing paths to provide the first-responders with information that would typically be manually gathered
Distribute subject matter expertise across response teams
Invoke secure automation behind firewalls and VPCs
Troubleshoot and resolve faster without a human-assisted action required
Improve the speed of enablement to new engineers and ensure optimal efficiency at all levels of the incident response organization

Getting Started

You made your decision. Now it’s time to blaze the trail, but where do you start?

To use some marketing slang: don’t try to boil the ocean. Trial some actions that are low in both complexity and risk. This could be taking a deeper look at some of your noisiest services, or running some simple data pulls from various monitoring applications, disk usage, etc. But it’s important to have a strategy for the long-term rollout and vision of this functionality. Sure, you can write a script that pulls data from numerous sources and appends that to an incident. But that is far from scalable.

It’s important to think about the various infrastructure pieces and tools you will want to pull diagnostic data from. You will want a standardized approach for interfacing with your heterogeneous and dynamic environments.
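One way to picture a standardized approach, as opposed to a one-off script, is a small registry where every diagnostic source is registered behind the same call signature. The sketch below is only an assumption about what such a layer could look like; DiagnosticRegistry, the source names, and the fake sources are hypothetical and not part of any PagerDuty product.

```python
from typing import Callable, Dict

# Every source takes a service name and returns a short text finding.
DiagnosticSource = Callable[[str], str]


class DiagnosticRegistry:
    """Holds diagnostic sources behind one common interface."""

    def __init__(self) -> None:
        self._sources: Dict[str, DiagnosticSource] = {}

    def register(self, name: str, source: DiagnosticSource) -> None:
        self._sources[name] = source

    def collect(self, service: str) -> Dict[str, str]:
        """Run every registered source; one failing source does not block the rest."""
        results: Dict[str, str] = {}
        for name, source in self._sources.items():
            try:
                results[name] = source(service)
            except Exception as exc:
                results[name] = f"unavailable: {exc}"
        return results


# Stand-in sources for illustration; real ones would query your own tools.
def fake_disk_usage(service: str) -> str:
    return f"{service}: /var at 81%"


def fake_broken_source(service: str) -> str:
    raise TimeoutError("monitoring API timed out")


if __name__ == "__main__":
    registry = DiagnosticRegistry()
    registry.register("disk", fake_disk_usage)
    registry.register("apm", fake_broken_source)
    print(registry.collect("checkout"))
```

Adding a new tool then means registering one more source rather than editing a growing script, which is the scalability point made above.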

To learn more about automated diagnostics, check out some of our how-to articles, which we will be continuing to publish throughout the year. Additionally, look out for a session on all things Automated Diagnostics from Jake Cohen during PagerDuty Summit next week!

For more resources about PagerDuty’s Process Automation portfolio, visit this page and get in touch with your account manager today.

Any questions? Feel free to ask away on Twitter @sordnam

The post What is Automated Diagnostics and Why Should You Care? appeared first on PagerDuty.

Four Use Cases for Optimizing Your Cloud Operations With PagerDuty® Runbook Automation by Madeline Stack

Madeline Stack — Wed, 06 Apr 2022 13:00:26 +0000

The cloud is easy and powerful—until it’s not. Once companies have customers, commitments, and compliance concerns, they often have to create cloud operations teams to manage the cloud on behalf of their fellow employees. Often, organizations that migrate to the cloud find themselves hampered by inefficient cloud operations if they haven’t standardized their IT procedures for operability. Less mature operations manifest as highly manual procedures, slow ticket queues, and less-than-impressive resolution times. These companies typically see longer incidents, lower productivity, and high staff costs. System availability goals are achieved at the expense of innovation velocity as too much staff time is spent on toil versus engineering.

How are you running your cloud operations? Are you operating in ticket time and losing the agility of the cloud? Or are you extending cloud automation to the operations task level? The trick to maintaining your cloud agility is to create operational APIs so that your teams can run at their own speed instead of languishing in ticket time. The good news is that PagerDuty® Runbook Automation is here to help. PagerDuty® Runbook Automation is a SaaS offering that helps businesses accelerate cloud operations by standardizing automated IT workflows and safely delegating them to stakeholders. These workflows automate runbook procedures for resolving incidents and fulfilling IT requests.

PagerDuty® Runbook Automation is new to the market, but we’ve already seen some interesting use cases from our early access users. Here are some promising use cases:

1. Self-service Automation for Cloud Ops

PagerDuty® Runbook Automation lets DevOps engineers and SREs get started on building automation immediately. Engineers can create workflow automation and deliver reliable automated runbooks to end-users. This is especially useful for automating requests from developers to the Cloud Ops team for things like provisioning and deprovisioning instances, creating data sandboxes, copying data from production systems to development and test systems, and querying production data for things like user logs.

In addition to ensuring that system requirements are met and costs can be controlled, developers can start coding right away, instead of having to waste time on repetitive tasks or waiting for the platform team to do the same.

2. Drive Better Efficiency and Better-integrated Workflows

Many IT Ops and Tech Support Ops tools are now available as SaaS, but the production infrastructure they need to connect to is still behind a firewall or VPC. These SaaS tools may not be able to invoke “on-premises” automation because company policy doesn’t allow inbound webhooks.

Automation is most valuable when more people can use it. However, because there are a limited number of people that have access to environments behind the firewall, there are only a few who can utilize the automation that touches these environments. Delegating automation requires new approaches for connecting traditional automation to these cloud-based tools.

PagerDuty Runbook Automation’s approach for standardizing workflow automation can help you develop a congruent strategy for connecting front-office systems to back-office systems. Easily hook your SaaS tools into PagerDuty Runbook Automation to execute automation without requiring an inbound webhook through the firewall. Now, end-users can work more efficiently by invoking delegated automation where they work. This includes CRMs like Salesforce, ITSMs like ServiceNow, GitHub Cloud, Jira Cloud, PagerDuty, and more.

Maybe a financial planning and analysis analyst needs to grab inventory data about IT infrastructure for capacity planning. Instead of filling out a ticket for the Ops team, they can pull this data themselves using a prebuilt workflow in Runbook Automation, possibly triggered from Jira. Customer service agents can also benefit. For example, say a customer wants to change their SaaS portal subdomain, which is only possible by making API calls to “internal” API endpoints. You could empower your customer service agents to make these changes on behalf of a customer by invoking an automated workflow right from Zendesk that changes the customer’s portal name. See a demo of this use case here.

3. Self-service Cloud Operation Across Multiple Accounts or Products

PagerDuty® Runbook Automation can safely delegate access to automated tasks that work across cloud accounts, making it easier to consistently manage infrastructure in multiple cloud accounts. With PagerDuty® Runbook Automation’s AWS Systems Manager integration, operators can view and manage their global EC2 footprint in a single interface. This means faster execution and accuracy for tasks such as configuration management, patching, and deploying monitoring and security tooling agents.

This can be very useful for service providers who manage cloud infrastructure for customers. For example, customer success engineers can easily support customer environments without needing detailed knowledge or direct access to infrastructure resources. Because PagerDuty® Runbook Automation provides sharing of credentials in the context of operational processes, support reps and support teams have outcome-based jobs at their fingertips.

4. Improve Cloud Ops Compliance

PagerDuty® Process Automation is great for optimizing security and compliance since secure setup and configuration are ensured by automation, and all access and executions are logged. The same is true for Cloud Operations.

It can be time-consuming for someone to have to gather all the logs needed for audits across a sprawling number of cloud accounts and services. Runbook Automation logs every job execution inside the platform, which provides auditable history more directly aligned with business activity. This makes it easier for cloud engineering teams to meet compliance requirements.

Runbook Automation’s SaaS-based architecture makes it easy to trial! Start using powerful task automation in minutes. Contact us if you want to try Runbook Automation today. Learn more about Runbook Automation here.

Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions by Joseph Mandros

Joseph Mandros — Mon, 04 Apr 2022 13:00:55 +0000

Let’s face it. Incidents can be expensive—really expensive. But the high cost of incidents within a production environment isn’t always due to a compromised service or negative customer experience. According to PagerDuty response data, over 50% of an incident’s lifespan is spent with first responders in the investigation and mobilization phases (what we call ‘triage’)— in other words, determining what might have gone wrong and calling the right person to fix it.

With the above statistic in mind, it’s clear that the shadow expense of your incident lifecycle is that of your people’s time—the engineer who discovered the incident, the on-call engineer who responded to the issue and determined root cause, and every other subject matter expert that gets looped into the incident lifecycle. And when you sprinkle in manual processes to the entire response timeline, things can get pricey. Very pricey.

The fact of the matter is, your developer organization’s time is just as valuable and important as the business’s bottom line. And as service and application development continues to grow in complexity, “time saved” becomes an even more important metric to track, quantify, and continuously improve. Finding a way to automate different aspects of the incident response process can help save your team’s time and bolster efficiency across the board. How can you do this, you ask? Enter PagerDuty® Automation Actions (formerly PagerDuty Rundeck Actions).

PagerDuty® Automation Actions

The PagerDuty® Automation Actions add-on connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

PagerDuty® Automation Actions connects automated diagnostics and remediation to the incident response workflow. Automated Diagnostics are a set of actions for production services that your responders can automatically invoke when an incident occurs. Rather than having to escalate to expert specialists who manually run common tests, responders can safely and securely invoke this automation themselves from within PagerDuty and see responses delivered in real time back to your incident timeline.

Run designated actions such as service restarts, diagnostics, and more

With these diagnostic tests, responders can more efficiently escalate the incident to the right specialist for resolution, rather than involving a large group or escalating up the typical responder ladder. The specialists will be able to see the results of those common diagnostics and can get started right away.

Additionally, teams can invoke these actions and collaborate on the incident directly from their Slack instance. This eliminates the need to access a service through a terminal and context switch between windows, creating a faster and more efficient way to resolve incidents while also reducing escalations to specialists. As you mature your use of automated diagnostics, you can start using it for things like automated remediation and triggering automation through Event Intelligence.

PagerDuty® Automation Actions helps solve four main problem areas within an organization’s response process:

Siloed expertise. First-line responders don’t know the genetic makeup of every single application or service within an organization’s environment.
Consistent interruptions to specialists. Responders escalate to the engineer they think is the specialist of that application or service, taking time away from innovation and slowing time-to-resolution.
Repetitive and manual diagnostic steps. The first steps when an incident kicks off are often the same. These same manual steps have to be carried out before you can begin resolving the incident.
Complex and sprawling production environments. Knowing which systems to access and what actions to take can take time. Additionally, not every responder has the authority to access specific production systems, often making the escalation process difficult and time-consuming.

PagerDuty® Automation Actions solves the above issues by:

Delegating automation across teams. Deploy automated procedures to first-line responders that are typically invoked by specialists.
Resolving incidents faster with fewer escalations. By creating automation for common requests and operations, teams can spend less time figuring out who to escalate to and more time on a fix.
Triggering human-assisted/self-healing automation. Invoke diagnostic actions before responders are even paged using PagerDuty’s Event Orchestration.
Safely invoking automation with security in mind. Responders only see actions they have the authorization to invoke for impacted systems in an incident, and all actions are logged to maintain a strong security posture.
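The last point, showing responders only the actions they are authorized to invoke and logging every invocation, can be sketched in a few lines of Python. This is a simplified illustration under assumed names (Action, ActionCatalog, role strings); it does not describe how PagerDuty® Automation Actions implements authorization or logging.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Action:
    name: str
    service: str
    required_role: str


@dataclass
class ActionCatalog:
    actions: List[Action]
    audit_log: List[str] = field(default_factory=list)

    def visible_to(self, roles: Set[str], impacted_service: str) -> List[Action]:
        """Only actions for the impacted service that the responder may invoke."""
        return [
            a for a in self.actions
            if a.service == impacted_service and a.required_role in roles
        ]

    def invoke(self, user: str, roles: Set[str], action: Action) -> str:
        """Record every attempt, allowed or not, before acting on it."""
        allowed = action in self.visible_to(roles, action.service)
        verdict = "invoked" if allowed else "was denied"
        self.audit_log.append(f"{user} {verdict} {action.name}")
        if not allowed:
            raise PermissionError(f"{user} may not run {action.name}")
        return f"ran {action.name}"  # placeholder for the real automation


if __name__ == "__main__":
    catalog = ActionCatalog([
        Action("restart-web", "web", "sre"),
        Action("tail-logs", "web", "responder"),
        Action("db-locks", "db", "responder"),
    ])
    print([a.name for a in catalog.visible_to({"responder"}, "web")])
```

Logging denied attempts as well as successful ones is a design choice in this sketch; it keeps the audit trail complete even when a responder tries something outside their authorization.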

To summarize the above with some quick bullets, PagerDuty® Automation Actions helps teams:

Decrease response times by up to 30 minutes and MTTR by up to 25%
Reduce the volume of incidents that get escalated up the ladder
Distribute subject matter expertise across response teams
Trigger human-assisting and self-healing automation before responders even get paged
Invoke secure automation behind firewalls and VPCs
Deploy automated actions in place of manual procedures
Enrich incident documentation for smoother postmortems and reduced operator work

PagerDuty® Runbook Automation Joins the PagerDuty Process Automation Portfolio by Madeline Stack

Madeline Stack — Tue, 22 Mar 2022 13:00:46 +0000

Spring is blooming here at PagerDuty, and so is our automation product line. We’re thrilled to share some exciting product announcements.

First, we’ve officially rebranded our automation product line, Rundeck®, as PagerDuty® Process Automation. Fundamentally, everyone who buys Rundeck becomes a PagerDuty customer, so we decided to make it less confusing.

Second, our runbook automation cloud service (announced last fall as “Rundeck Cloud”) is now generally available. This was included in the rebrand, so we chose to name it according to what it does: “PagerDuty Runbook Automation.” PagerDuty Runbook Automation is a SaaS-based offering within our PagerDuty Process Automation portfolio that’s designed to automate operations procedures so that a wider range of people in an organization can utilize them.

Finally, we are announcing version 4.0 of Rundeck Enterprise, which will now be known as “PagerDuty Process Automation On-Prem.” PagerDuty Process Automation On-Prem is self-managed software that supports a wide range of process automation use-cases.

Let’s get into the details:

Rundeck Is Now “PagerDuty Process Automation”

As mentioned above, the Rundeck product line has been rebranded as PagerDuty Process Automation. Since the acquisition of Rundeck, we’ve seen rapidly growing demand for the product from both existing PagerDuty customers and the Rundeck community. PagerDuty has also released multiple new products related to Rundeck, providing seamless integration with PagerDuty Incident Response (now called PagerDuty Automation Actions) and now a SaaS version of Rundeck focused on the runbook automation use case (now called PagerDuty Runbook Automation).

The vision behind PagerDuty Process Automation is to help our customers automate everything in their businesses, drive up consistency, and drive down resolution times. PagerDuty Process Automation empowers customers to automate IT and business procedures across their business systems. These processes can then be safely delegated to stakeholders, run on a scheduled basis, or triggered in response to events.

These automated processes can be used to diagnose and remediate issues, fulfill service requests, and conduct regular maintenance and administrative tasks. These actions can be human-assisted, or executed without any manual intervention.

The PagerDuty® Process Automation portfolio consists of the following offerings:

PagerDuty® Automation Actions. A PagerDuty add-on that curates and connects responders in PagerDuty to automated diagnostics and remediation for services involved in incidents.
PagerDuty® Runbook Automation. A SaaS offering that enables engineers to standardize and automate runbook procedures, and delegate them as self-service operations.
PagerDuty® Process Automation On-Prem. A self-hosted software cluster that gives engineers the ability to standardize and automate end-to-end operational workflows, and safely delegate them as self-service operations to stakeholders.

Runbook Automation is Generally Available

PagerDuty Runbook Automation gives you powerful runbook automation without needing to install and manage the automation software yourself. Why is this useful? Let’s look at how cloud computing services themselves get provisioned and managed. Cloud computing is powerfully easy. With the press of a button, users can spin up nearly any kind of IT service. Despite the fact that the cloud provides a lot of powerful technical automation, this automation is not considered safe to make broadly available to all potential users. The reasons for limiting access include concerns about cost, speed of outcome, quality of operations, and security.

To mitigate this limited access, companies resort to having their engineers and stakeholders submit tickets to centralized cloud infrastructure teams for their needs. These infrastructure teams can keep track of all cloud usage, have the skills to properly set up and operate cloud services, and are able to enforce policies and ensure secure operations. However, a task that could take minutes to spin up in the cloud automatically ends up taking days to be manually fulfilled by the central cloud team.

What companies really need to do is build automation for fulfilling those tasks, and make them available to the humans that need them. In other words, automate their runbook procedures, and then delegate this automation to stakeholders.

Using PagerDuty Runbook Automation (previously announced as Rundeck Cloud), users turn manual runbooks into delegated, self-service requests and automate key IT tasks across environments to optimize security and compliance. This can include automating runbooks for resolving incidents, closing tickets, and fulfilling requests, now with the agility of operating in the cloud.

As a SaaS-based offering, PagerDuty Runbook Automation lets DevOps engineers and SREs get started immediately. They can deliver reliable, automated processes that are highly secure and available without having to manage automation infrastructure.

PagerDuty Runbook Automation

Engineers standardize task automation for any public cloud. They do so by defining jobs that incorporate common infrastructure components as nodes and executing steps that utilize existing scripts and commands. PagerDuty Runbook Automation facilitates the delegation of these jobs by ensuring safety and compliance with authentication, access control, and privileged access management services—and by logging every activity.

PagerDuty Runbook Automation connects automation in customers’ production environments via an agent called a Runbook Runner that is deployed behind the firewall. It securely connects nodes within their environments back to the PagerDuty Runbook Automation endpoint. These nodes can then be used to execute steps in automated process jobs. Job definitions can incorporate nodes connected by different Runbook Runners.

The PagerDuty Runbook Runner is built to meet the latest zero-trust security models. It calls back to the PagerDuty Runbook Automation endpoint via HTTPS. There’s no need to open additional ports in firewalls.
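To see why an agent that calls out needs no inbound firewall ports, consider the general outbound-polling pattern sketched below. This is an assumption about how such designs commonly work, not a description of the Runbook Runner's actual protocol: the endpoint here is an in-memory stand-in, and names such as FakeAutomationEndpoint, run_once, and drain are hypothetical. A real agent would make these calls over HTTPS.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

Job = Dict[str, str]


class FakeAutomationEndpoint:
    """In-memory stand-in for the remote endpoint the agent calls out to."""

    def __init__(self, jobs: List[Job]) -> None:
        self._pending: Deque[Job] = deque(jobs)
        self.results: List[Dict[str, str]] = []

    def next_job(self) -> Optional[Job]:
        return self._pending.popleft() if self._pending else None

    def report(self, job_id: str, output: str) -> None:
        self.results.append({"id": job_id, "output": output})


def run_once(endpoint: FakeAutomationEndpoint, execute: Callable[[Job], str]) -> bool:
    """One poll cycle. Both calls originate from inside the firewall."""
    job = endpoint.next_job()  # outbound: "is there work for me?"
    if job is None:
        return False
    endpoint.report(job["id"], execute(job))  # outbound: "here is the result"
    return True


def drain(endpoint: FakeAutomationEndpoint, execute: Callable[[Job], str],
          max_polls: int = 10) -> int:
    """Poll until no work is left or the poll budget is used up."""
    handled = 0
    for _ in range(max_polls):
        if not run_once(endpoint, execute):
            break
        handled += 1
    return handled


if __name__ == "__main__":
    endpoint = FakeAutomationEndpoint([{"id": "1", "step": "uptime"}])
    print(drain(endpoint, lambda job: f"ran {job['step']}"), endpoint.results)
```

Because the agent initiates every connection, the endpoint never has to reach into the private network, which is the property the paragraph above describes.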

Some of the use cases we’ve seen early users utilize PagerDuty Runbook Automation for include:

Providing self-service cloud operations for end-users like developers and customer service agents. Developers no longer need to wait for their environments, while platform teams see fewer tickets and interruptions.
Simplified management of multiple cloud accounts. It can be difficult to manage many accounts across multiple public clouds. PagerDuty Runbook Automation simplifies the sharing of access credentials in the context of operational processes.
Tracking of automation for auditing and compliance. When users are managing hybrid environments, it can be difficult to track down all the automation logs across cloud accounts and other tools. All jobs are logged and easily viewable in PagerDuty Runbook Automation, making audit tracking seamless.

Process Automation On-Prem 4.0 Now Available

We’re also excited to announce Version 4.0 of PagerDuty Process Automation On-Prem (previously Rundeck Enterprise). Version 4.0 rolls up security, stability, and usability enhancements, and includes a version of the Runner technology used by PagerDuty Runbook Automation—in this case, a Process Runner.

The Process Runner allows customers to run automated jobs that span between trusted environments. Customers can complete multi-environment changes, such as patching software instances in different data centers or availability zones, through automation in minutes instead of manual procedures that can take days.

The Process Runner deploys behind firewalls in disparate environments. It securely connects nodes within these network segments back to the central PagerDuty Process Automation On-Prem cluster. These disparate nodes can then be used to execute steps in automated process jobs that span across multiple environments.

The Process Runner is built to meet the latest zero-trust security models. It calls back to the PagerDuty Process Automation On-Prem cluster endpoint via HTTPS. Now there’s no need to open additional ports in firewalls.

PagerDuty Process Automation On-Prem’s security model allows companies to eliminate the need to support SSH between their trusted zones. This update is especially useful for businesses that need to better support automation of IT processes at subsidiaries and acquired companies, automate across global geography, and optimize the separation of customer data.

We love our Rundeck customers and look forward to continuing the journey together. Sign up for the PagerDuty Process Automation Launch Webinar to learn more!
