automated operations | Tags | PagerDuty

Bridging Silos: Automation Enhancements That Simplify IT Operations Across Secure Environments by Nisha Prajapati

Nisha Prajapati — Fri, 03 Mar 2023 16:36:08 +0000

The post Bridging Silos: Automation Enhancements That Simplify IT Operations Across Secure Environments appeared first on PagerDuty.

New! Common Automated Diagnostics for AWS Users by Jake Cohen

Jake Cohen — Wed, 03 Aug 2022 13:00:02 +0000

Today’s modern cloud architectures centered on AWS are typically a composite of ~250 AWS services and workflows implemented by over 25,000 SaaS services, in-house services, and legacy systems. When incidents fire off in these environments, whether or not a company has built out a centralized cloud platform, distinct expertise is often a necessity. Because of this scaled complexity, first responders find themselves having to escalate to several different service owners or expert engineers to gather diagnostics before it’s possible to determine who the ultimate resolver of an issue should be.

When it comes to incident response, it’s critical that these new cloud environments seamlessly integrate with an organization’s existing critical applications and services, both old and new. To enhance service quality and make it easier for responders to cross that bridge of expertise, we are excited to announce the immediate availability of new AWS plugin integrations for automated diagnostics.

New AWS Plugins for Automated Diagnostics

Our new AWS plugins for Automated Diagnostics help provide deeper coverage for customers that are also users of AWS, making it easier to get up and running with automated diagnostics in their AWS environment quickly.

The new AWS plugins for Automated Diagnostics include:

CloudWatch Logs plugin. This plugin retrieves diagnostic data from AWS infrastructure and applications. Now users can more easily run automated diagnostics for AWS across multiple accounts and products.
Systems Manager plugin. This plugin lets users automate tasks such as configuration management, patching, and deploying monitoring and security tooling agents, so those tasks run faster and more accurately.
ECS Remote Command plugin. This plugin provides a mechanism to execute commands on containers. This enables developers and operators to retrieve diagnostic data from their running applications in real-time before redeploying their services.
Lambda Custom Code Workflow plugin. Create, execute, and optionally delete a new Lambda function with the custom code provided in a Job step as its input. Execute custom scripts as steps in jobs without having to install any software.

Sound complex? Don’t worry, we thought of everything :).

New Auto-Diagnostic Job Templates for AWS Users

We also released new pre-built templates for AWS, so you can start enhancing incident details for your specific environments immediately. These are purpose-built to be used with minimal configuration. Instead of starting from scratch, users now have a library of curated, ready-to-use job definitions that retrieve data for investigating, debugging, and triaging incidents during a response.

New users can start automating diagnostics for AWS faster and existing users can easily add AWS diagnostics to their existing PagerDuty Process Automation project.

Some example job templates include:

AWS – EC2	Instance Status & Associated IAM Roles	Retrieve EC2 Instance Status and Associated IAM Roles	Remote command (or SSM)
AWS – ECS	Stopped ECS Task Errors	Checks stopped ECS Tasks for errors and provides detailed information on the reason for the errors.	Stopped ECS Tasks
AWS – ELB	Retrieve ELB Targets Health Status	Retrieve the list of unhealthy Targets in the Load Balancer’s associated Target Groups.	ELB Instance Statuses
AWS – RDS	Check Database Storage Status	Checks RDS database for the instance status.	RDS Instance Status
AWS – VPC	IP addresses using UDP transfer protocol	Query CloudWatch logs to identify any hosts using the UDP transfer protocol.	CloudWatch Logs
AWS – VPC	Top 10 Hosts by Throughput on Subnet	Query CloudWatch logs to identify the top 10 hosts by throughput on a given subnet.	CloudWatch Logs
AWS – VPC	Top 10 Source IP Addresses with Highest Rejected Requests	Query CloudWatch logs to identify the top 10 source-IP addresses with the highest rejected-requests.	CloudWatch Logs
AWS – VPC	Top 10 Web-Server Requestors by Public IP	Query CloudWatch logs to identify the top 10 public-IP requestors to our web-server (e.g. Nginx).	CloudWatch Logs
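
To give a feel for what a CloudWatch Logs template like “Top 10 Source IP Addresses with Highest Rejected Requests” might run, here is a minimal sketch. It is not the shipped job definition: it assumes VPC flow logs in the default format (so the `action` and `srcAddr` fields are available to CloudWatch Logs Insights), and the function name, the one-hour lookback, and the log group name in the usage note are all hypothetical.

```python
import time
from typing import Any, Dict, Optional

# Logs Insights query: count rejected flows per source address, keep the top 10.
# Assumes default-format VPC flow logs, where `action` and `srcAddr` are discovered fields.
TOP_REJECTED_SOURCES_QUERY = (
    'filter action = "REJECT" '
    "| stats count(*) as rejected by srcAddr "
    "| sort rejected desc "
    "| limit 10"
)


def build_start_query_params(
    log_group: str,
    lookback_seconds: int = 3600,
    now: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the keyword arguments for a CloudWatch Logs StartQuery call.

    `now` is an epoch timestamp in seconds; it defaults to the current time.
    """
    end = int(time.time()) if now is None else now
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,
        "endTime": end,
        "queryString": TOP_REJECTED_SOURCES_QUERY,
    }
```

With boto3 installed and credentials configured, the result could be passed along as `boto3.client("logs").start_query(**build_start_query_params("/my/vpc-flow-logs"))`, followed by polling `get_query_results` with the returned query ID.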

And this is just the tip of the iceberg! We will continue to develop and build upon our existing plugins to ensure our customers that use AWS are well-equipped to invoke automation wherever it is needed, including providing some interactive guides.

Want to learn more about common diagnostics? Register for our webinar event, “Common Diagnostics for Common Components,” on September 14th. Request a demo to see automated diagnostics with PagerDuty Process Automation in action.

Already using PagerDuty Process Automation? Check out the Automated Diagnostics solution guide to see the end-to-end process of achieving the full solution.

Automating Common Diagnostics for Kubernetes, Linux, and other Common Components by Joseph Mandros

Joseph Mandros — Wed, 27 Jul 2022 13:00:45 +0000

Watch our Automated Diagnostics webinar on demand to learn about common diagnostics for common components and how we provide out-of-the-box job templates for you to get started right away.

This is the second piece in a series about automated diagnostics, a common use case for the PagerDuty Process Automation portfolio.

In the last piece, we talked about the basics around automated diagnostics and how teams can use the solution to reduce escalations to specialists and empower responders to take action faster. In this blog, we’re going to talk about some basic diagnostics examples for components that are most relevant to our users.

But before we jump in, let’s make clear what automated diagnostics isn’t, based on some audience feedback on Twitter from the last article:

Automated diagnostics is different from alert correlation. Alert correlation depends on a specified depth of signals, as well as an engine that can properly identify said correlated signals. Automated diagnostics is meant to help the first responder triangulate the source of the issue to either fix the issue faster themselves, or escalate more accurately.
Automated diagnostics is different from monitoring. Monitoring is purpose-built to identify undesired states in performance or activity. This means that most monitoring is not purpose-built to emulate a first-responder’s activities to validate a true positive, or identify the first actions to take. Monitoring is focused on raising the alert. Automated diagnostics is focused on determining how to fix an issue once the alert is already created.

That said, automated diagnostics can certainly make use of data collected by monitoring tools, since most people don’t apply thresholds to every datapoint they collect. In fact, one of our more commonly used diagnostics integrations queries CloudWatch logs. While we might consider a log aggregator a monitoring tool, sometimes the first steps of investigation are to look at the data in the monitoring tool that exists purely for diagnosing issues.

Providing responders with on-demand or pre-run diagnostic capabilities for their own environments can help a first responder quickly determine probable cause, thereby pulling in fewer individuals to assist with the incident. By providing first-responders with “diagnostic” data that is typically only retrievable by domain experts, the need to pull in more people for troubleshooting incidents is reduced significantly. This in turn drives down the cost of incidents and reduces mean time to response (MTTR) by automating the investigative steps that are typically manual in nature.

The status quo: Automation in incident response

Operations managers are often excited about the idea of enabling self-healing or auto-remediation. It’s a natural inclination to assume that speeding up resolution through automation means “applying a cure.” But more often than not, the industry theory of “no two incidents are truly identical” rears its head. When you have a high degree of variability, this reduces the value of such potential automation since it’s less likely to be run. For example, restarting a core service may be the right way to fix today’s issue, but it could lead to a cascading failure—and an even bigger incident—tomorrow.

*The reader now switches cognitive gears to the initial stages of a response.*

But you know what tends to be highly repetitive? The same investigative steps a responder takes to begin to diagnose what went wrong and determine what happened. More repetitive action means more value to gain from applying automation. For example, let’s say an incident kicks off within your Kubernetes distribution. No matter the nature of the incident, whether it involves your image repository or your load balancer, you’re likely still going to take the same diagnostic step of pulling your Kubernetes logs.

These diagnostic steps often remain largely static for a given component, no matter the priority of the incident that occurs. Automated diagnostics can be applied to heterogeneous incidents; it doesn’t have to be purpose-built for the same recurring incident. It can be applied to and customized around all sorts of common incident types and severities, specific to your environment, for almost any common component. Think of it like going to the doctor’s office. Whether you are going to urgent care for a specific complaint or just an annual checkup, they still take your temperature, blood pressure, and weight when you walk in.

Common Examples

Every developer environment is different, but many environments are also quite similar once you really pop the hood. In the beginning stages of a response, most diagnostics will come from three main data sources:

Application data
System data
Environment data

There are several examples of common diagnostics that can be pulled automatically at the beginning of a response. This not only helps the responder better understand the severity of the incident, but also helps ensure the responder doesn’t pull in too many specialists and interrupt their normal day of work. For example, let’s look at Kubernetes (k8s) as a component for a responder during an incident. When an incident happens within a k8s environment, the infrastructure engineer who maintains the technology would typically perform actions such as:

Tail logs from k8s pod
Retrieve logs from k8s by selector label
Check image repo
Describe deployment
Execute command in pod
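
As a rough sketch, most of the actions above map onto standard kubectl invocations. The helper below only builds the command lines and does not run them. The function name, the default namespace, the `df -h` command used for the exec example, and the pod and deployment names in the usage note are hypothetical. “Check image repo” is left out because it depends on the registry in use.

```python
from typing import Dict, List


def k8s_diagnostic_commands(
    pod: str,
    deployment: str,
    selector: str,
    namespace: str = "default",
    tail_lines: int = 100,
) -> Dict[str, List[str]]:
    """Build kubectl argument lists for common first-response diagnostics."""
    ns = ["--namespace", namespace]
    tail = f"--tail={tail_lines}"
    return {
        # Last N log lines from a single pod.
        "tail_pod_logs": ["kubectl", "logs", pod, tail, *ns],
        # Logs from every pod matching a label selector, e.g. "app=web".
        "logs_by_selector": ["kubectl", "logs", "--selector", selector, tail, *ns],
        # Rollout status, events, and replica counts for a deployment.
        "describe_deployment": ["kubectl", "describe", "deployment", deployment, *ns],
        # Run a read-only command inside the pod (disk usage here, as an example).
        "exec_in_pod": ["kubectl", "exec", pod, *ns, "--", "df", "-h"],
    }
```

A job step could then pass one of these lists to `subprocess.run`, for example `k8s_diagnostic_commands("web-1", "web", "app=web")["describe_deployment"]`, and attach the output to the incident.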

One thing all of these actions have in common? A typical L1 responder ACK’ing an incident doesn’t know how to orchestrate these actions; it’s just not their area of expertise. But with the out-of-the-box jobs from PagerDuty’s Automated Diagnostics, the L1 responder can run these diagnostics automatically, which speeds up the response and reduces escalations to the infrastructure engineer responsible for the k8s environment.

Some common diagnostics and alert examples include:

CPU/Memory Consuming Services
- Common alert: High CPU/Memory
- Common question: Which service(s) are consuming CPU/Memory?
File size / Disk Consumption
- Common alert: High disk usage
- Common question: Which files/directories are consuming the most space?
System Logs: Linux/Windows Commands
- Common alert: Server/service issues
- Common question: Is it an OS issue or app issue?
SQL Database Commands
- Common alert: Database blocks/deadlocks
- Common question: Is there a long-running query blocking other database requests?
Host Availability
- Common alert: Host down
- Common question: Is it actually down or is it a false-positive reachability issue?
Application Error: Application Logs or traces
- Common alert: 400/500 error codes
- Common question: What is the stack-trace?
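
To make one of these concrete, the disk-consumption question (“Which files/directories are consuming the most space?”) can be answered with a few lines of standard-library Python. This is only an illustrative sketch, not one of the shipped job templates; the function name and the choice to rank immediate subdirectories are assumptions.

```python
import os
from typing import List, Tuple


def largest_directories(root: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n immediate subdirectories of root, ranked by total file size in bytes."""
    totals = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            size = 0
            # Sum the size of every file below this subdirectory.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        size += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        # The file vanished or is unreadable; skip it.
                        continue
            totals.append((entry.name, size))
    totals.sort(key=lambda item: item[1], reverse=True)
    return totals[:top_n]
```

Calling `largest_directories("/var")` on a host with a disk alert would return the heaviest subdirectories first, which is usually enough for a first responder to decide whether logs, caches, or data files are the culprit.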

A few examples of some common diagnostics for known components:

CloudWatch Logs. Surface specific application and VPC logs.
ECS. View stopped ECS task errors.
ELB. Debug unavailable target-group instances.
Kubernetes. Retrieve logs from Pods by selector label.
Linux. Retrieve service status.
Nginx. Retrieve error logs.
Redis. Retrieve slow log entries.

And these are just some of the over 30 out-of-the-box job templates we have built for our users that you can find in the Automated Diagnostics solution guide. To use the Automated Diagnostics Solution, you must have either a PagerDuty Runbook Automation license or a Process Automation (previously Rundeck Enterprise) license. See the FAQ for details on how to use it. If you do not have a license for either of these products, contact us to learn more.

Automating diagnostics within PagerDuty

Incidents that notify responders are filled with information provided by monitoring tools that have a “myopic” view of the alert(s). A common example is that high CPU usage triggers an alert, and this notifies a responder. But the information contained in the alert is surface-level in that it does not specify what might be the cause of the spiked CPU.

Diagnostic data is the deeper-level information that helps answer the “why” and “where” questions of incidents. Even though some monitoring and correlation tools provide some help in providing root-cause analysis for users, most fall short in their ability to emulate a responder’s investigative/troubleshooting steps of collating disparate data-sources into a unified view. By providing responders with on-demand or pre-run diagnostic capabilities, the odds of the first responder resolving the issue on their own increase, as well as the probability of pulling in fewer individuals to assist with the incident. Enter Automated Diagnostics.

Want to learn more about common diagnostics for the components you use? Register for our September 14th webinar event of the same name, hosted by Justyn Roberts, Senior Solutions Consultant, PagerDuty. New to Process Automation? Request a demo. Already using PagerDuty Process Automation? Check out the automated diagnostics solution guide to see the end-to-end process of achieving the full solution. Questions? Reach out to me directly on Twitter @sordnam and let’s chat!

Extending Automation Actions Across the PagerDuty Platform by Joseph Mandros

Joseph Mandros — Tue, 07 Jun 2022 13:00:03 +0000

It’s day one of PagerDuty Summit, and we are looking forward to a full day of expert presenters, actionable content, and educational sessions to boost your PagerDuty IQ and show you new ways to improve your team’s operational excellence.

One point you will continue to hear throughout the duration of the conference echoes our greater mission: To revolutionize operations so teams spend less time on reactive, break-fix work and more time delivering new innovation. At PagerDuty, we see this future of operations extending beyond the digital teams that build and run software to all teams in the organization. While many PagerDuty products and features exist to make this mission a reality, we are going to focus on the latest and greatest with PagerDuty Automation Actions®, part of the PagerDuty Process Automation® portfolio.

New Updates and Integrations With Automation Actions

Automation Actions connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams can reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

We launched Automation Actions last year to help organizations get started quickly with simple first steps towards automation. Now, Automation Actions is integrated across the entire PagerDuty platform so all users can remove manual, time-consuming, repetitive tasks like diagnosing issues when brought into bridge calls.

Let’s look at some of the latest and greatest automation capabilities with Automation Actions:

Automation Actions in Incident Response. Teams can now run automated diagnostics and remediate incidents directly within PagerDuty. This integration will improve productivity and remove toil by automating repetitive, manual tasks, and give time back to your engineers to focus on innovation.

Automation Actions for Customer Service Ops. This integration gives customer service agents the ability to validate customer problems and capture critical information via automation to diagnose and resolve cases faster. Agents are now empowered to validate customer-impacting issues and run automated actions directly from the PagerDuty app in Service Cloud.

Automation Actions for Event Orchestration. By combining nested event rules with machine learning and precise, targeted automation triggers, it’s now possible to action an incident before responders even get paged. This integration with Event Orchestration helps teams automate common diagnostics and enable self-healing for recurring and well-understood types of incidents, resulting in reduced MTTR and escalations to specialists.

Automation Actions in PagerDuty’s Mobile App. Everything you love about Automation Actions is now mobile! Invoke the same automation from Automation Actions and resolve common incidents directly from the PagerDuty mobile app.

Automation Actions in Slack. With this integration, incident responders can deploy scriptable diagnostics and remediation actions directly from Slack.

Automation Everywhere

To be ready for anything with increasing digital complexity and dependencies, operations must transform from being manual, rigid, and ticket queue-based, to a continuously improving system that focuses on outcomes and customer experience, delivers operational speed and resilience, and is heavily automated by machine learning and AI. Only then can teams move toward a more proactive posture, and reduce the burden of manual work to avoid burnout and preserve focus. With Automation Actions, teams not only have the ability to reach this operational milestone, but to excel and continue to mature their automation capabilities.

Be sure to check out related Automation Actions sessions at PagerDuty Summit:

Normalize Automation, Sean Noble, Principal Product Manager, PagerDuty
Is it the Cloud? App? Database? Reduce Escalations by giving first responders automated diagnostics, Jake Cohen, Senior Product Manager, PagerDuty

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

What is Automated Diagnostics and Why Should You Care? by Joseph Mandros

Joseph Mandros — Fri, 03 Jun 2022 13:00:35 +0000

How do you measure the cost of an incident?

A lot of people in technology talk about the cost of an incident solely from the perspective of downtime, or the number of customers and employees impacted. And from the surface, oftentimes that is a fair angle to take. It makes the headlines, and customer reputation and trust are critical to the success of any business—obviously.

But another direct cost of incidents that is infrequently acknowledged is the number of people that need to get involved during an incident; whether that’s to help investigate the root-cause, troubleshoot and resolve the incident, or absolve their team of responsibility—regardless of whether the incident is severe enough to impact your customers.

According to PagerDuty data, 50% of a responder’s time is spent determining who is best to pull in for additional support (and trying to figure out if there’s actually a problem) in x environment, or with y service. This means that 50% of an incident’s lifespan is spent on the beginning stages of an incident (the diagnostic and triage phases), rather than on actual remediative actions.

The bottom line? The cost of people-hours and the number of manual actions taken per incident can get steep—fast.

Automating Your Incident Response

Applying automation to the early, recurring stages of the incident, including diagnosing the severity of the incident and understanding the genetic makeup of what went awry (and how), is critical to the success of the eventual remediation of the incident.

Automation is also important from a people perspective, ensuring your teams aren’t getting burnt out by the same, repetitive actions every time an incident kicks off. Ensuring the diagnostic data is available to first responders is paramount to the routing efficiency and overall workflow of the incident response.

Before we go any further, let’s first define diagnostic data. Diagnostic data is data retrieved by incident responders that is typically more specific than the information provided by monitoring tools. For example, whereas monitoring tools will alert you when there is a spike in CPU or memory, the incident responders investigate by looking at the processes consuming the most CPU and memory. In this case, the process names or IDs and their associated compute consumption are the “diagnostic data.”
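
As a minimal sketch of that example, the snippet below turns `ps` output into a ranked list of the processes consuming the most CPU or memory. It assumes a `ps` that accepts `-A -o pid,pcpu,pmem,comm` (as the common Linux and macOS versions do); the class and function names are hypothetical, and parsing is kept separate from collection so the logic can be checked without touching a live host.

```python
import subprocess
from typing import List, NamedTuple


class ProcessUsage(NamedTuple):
    pid: int
    cpu_percent: float
    mem_percent: float
    command: str


def collect_ps_output() -> str:
    """Run ps for all processes; assumes pcpu/pmem/comm are supported output columns."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,pcpu,pmem,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_ps_output(text: str) -> List[ProcessUsage]:
    """Parse 'PID %CPU %MEM COMMAND' rows, skipping the header line."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        # Split into at most 4 fields so command names containing spaces stay intact.
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue
        pid, cpu, mem, command = parts
        rows.append(ProcessUsage(int(pid), float(cpu), float(mem), command))
    return rows


def top_consumers(
    rows: List[ProcessUsage], key: str = "cpu_percent", limit: int = 5
) -> List[ProcessUsage]:
    """Return the `limit` rows with the highest value of `key`."""
    return sorted(rows, key=lambda row: getattr(row, key), reverse=True)[:limit]
```

On a host with a high-CPU alert, `top_consumers(parse_ps_output(collect_ps_output()))` yields the diagnostic data described above; passing `key="mem_percent"` ranks by memory instead.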

So now that we have defined diagnostic data, why should you care about Automated Diagnostics? Because implementing an Automated Diagnostics practice can drive down the cost of incidents through both reduced incident duration and fewer responders paged.

The Problem with MTTR

Perhaps “problem” is the wrong word here, but hear me out: MTTR as a metric is too broad to return granular, actionable insights. Mean time to repair (MTTR) has been a staple maintainability metric in the IT universe for decades. And while it has many applications and does a great job of explicating the rate of general recovery, its Achilles’ heel is just that: generality. And now that we can safely infer that 50% of a responder’s time is spent determining who is best to pull in for additional support, we’ve started looking at other metrics within the MTTR timeline, such as MTTT (mean time to triage) or MTTI (mean time to investigate).

MTTI/MTTT: The average time between the detection of an IT incident and when the organization begins to investigate its cause and solution. This denotes the time between MTTD (mean time to detect) and the start of MTTR (mean time to repair).

At PagerDuty, we measure this as the time span between when your first responder “acks” and when your resolver “acks.” This metric helps us click into what’s actually happening under the hood during an incident. After observing our own data, we’ve been able to infer that MTTI is one of the most time-consuming factors of MTTR. And in modern business, when a task requires time and attention from engineers, then that task is an expensive one for the business. Really expensive.
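
The measurement above is simple arithmetic over two timestamps per incident. Here is a small sketch of how it could be computed; the function name and the input shape (a pair of first-responder and resolver acknowledgement times per incident) are assumptions for illustration, not a PagerDuty API.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def mean_time_to_investigate(
    acks: Iterable[Tuple[datetime, datetime]],
) -> timedelta:
    """Average the span from first-responder ack to resolver ack across incidents."""
    spans = [resolver_ack - first_ack for first_ack, resolver_ack in acks]
    if not spans:
        raise ValueError("need at least one incident to compute a mean")
    return sum(spans, timedelta()) / len(spans)
```

For two incidents where the resolver acknowledged 10 and 30 minutes after the first responder, the function returns a 20-minute MTTI. Tracking that number before and after introducing automated diagnostics is one way to see whether the investigation phase is actually shrinking.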

Using Automated Diagnostics

Now let’s bring this back around to MTTI and automated diagnostics. MTTI is not only lengthened by the technical tasks of responders manually pulling diagnostic data and having to decipher which team to escalate to based on x service and y incident. It’s also about the people and their limitations, depending on the specific expertise that is required to begin resolution. For example, in many cases, the first responder doesn’t know how to investigate the issue from the database or network ‘perspectives.’ That may be due to a lack of skills (a background in databases or networks), access, or tribal knowledge (e.g. that a specific app-component depends on a complex integration with a third-party service).

By automating these investigative and debugging tasks, in addition to having the ability to delegate these actions across teams and responders, you will experience a positively cascading effect on MTTI, and eventually, MTTR.

So why should you care about automated diagnostics?

With automated diagnostics, you can:

Reduce escalations to scarce experts by designing paths to provide the first-responders with information that would typically be manually gathered
Distribute subject matter expertise across response teams
Invoke secure automation behind firewalls and VPCs
Troubleshoot and resolve faster without a human-assisted action required
Improve the speed of enablement to new engineers and ensure optimal efficiency at all levels of the incident response organization

Getting Started

You made your decision. Now it’s time to blaze the trail, but where do you start?

To use some marketing slang: don’t try to boil the ocean. Trial some actions that are low in both complexity and risk. This could be taking a deeper look at some of your noisiest services, or you could run some simple data pulls from various monitoring applications, disk usage, etc. But it’s important to have a strategy for the long-term rollout and vision of this functionality. Sure, you can write a script that pulls data from numerous sources and appends that to an incident. But that is far from scalable.

It’s important to think about the various infrastructure pieces and tools you will want to pull diagnostic data from. You will want a standardized approach for interfacing with your heterogeneous and dynamic environments.

To learn more about automated diagnostics, check out some of our how-to articles, which we will be continuing to publish throughout the year. Additionally, look out for a session on all things Automated Diagnostics from Jake Cohen during PagerDuty Summit next week!

For more resources about PagerDuty’s Process Automation portfolio, visit this page and get in touch with your account manager today.

Any questions? Feel free to ask away on Twitter @sordnam

Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions by Joseph Mandros

Joseph Mandros — Mon, 04 Apr 2022 13:00:55 +0000

Let’s face it. Incidents can be expensive, really expensive. But the high cost of incidents within a production environment isn’t always due to a compromised service or negative customer experience. According to PagerDuty response data, over 50% of an incident’s lifespan is spent with first responders in the investigation and mobilization phases (what we call ‘triage’); in other words, determining what might have gone wrong and calling the right person to fix it.

With the above statistic in mind, it’s clear that the shadow expense of your incident lifecycle is that of your people’s time—the engineer who discovered the incident, the on-call engineer who responded to the issue and determined root cause, and every other subject matter expert that gets looped into the incident lifecycle. And when you sprinkle in manual processes to the entire response timeline, things can get pricey. Very pricey.

The fact of the matter is, your developer organization’s time is just as valuable and important as the business’s bottom line. And as service and application development continues to grow in complexity, “time saved” becomes an even more important metric to track, quantify, and continuously improve. Finding a way to automate different aspects of the incident response process can help save your team’s time and bolster efficiency across the board. How can you do this, you ask? Enter PagerDuty® Automation Actions (formerly PagerDuty Rundeck Actions).

PagerDuty® Automation Actions

The PagerDuty® Automation Actions add-on connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

PagerDuty® Automation Actions connects automated diagnostics and remediation to the incident response workflow. Automated Diagnostics are a set of actions for production services that your responders can automatically invoke when an incident occurs. Rather than having to escalate to expert specialists who manually run common tests, responders can safely and securely invoke this automation themselves from within PagerDuty and see responses delivered in real time back to your incident timeline.

Run designated actions such as service restarts, diagnostics, and more

With these diagnostic tests, responders can more efficiently escalate the incident to the right specialist for resolution, rather than involving a large group or escalating up the typical responder ladder. The specialists will be able to see the results of those common diagnostics and can get started right away.

Additionally, teams can also invoke these actions and collaborate on the incident directly from their Slack instance. This eliminates the need to access a service through a terminal and context switch between windows, creating a faster and more efficient way to resolve incidents—while also reducing escalations to specialists. As you mature your use of automated diagnostics, you can start using it for things like automated remediation and triggering using Event Intelligence.

PagerDuty® Automation Actions helps solve four main problem areas within an organization’s response process:

Siloed expertise. First-line responders don’t know the genetic makeup of every single application or service within an organization’s environment.
Consistent interruptions to specialists. Responders escalate to the engineer they think is the specialist of that application or service, taking time away from innovation and slowing time-to-resolution.
Repetitive and manual diagnostic steps. The first steps when an incident kicks off are often the same. These same, manual steps have to be actioned on before you can begin resolving the incident.
Complex and sprawling production environments. Knowing which systems to access and what actions to take can take time. Additionally, not every responder has the authority to access specific production systems, often making the escalation process difficult and time-consuming.

PagerDuty® Automation Actions solves the above issues by:

Delegating automation across teams. Deploy automated procedures to first-line responders that are typically invoked by specialists.
Resolving incidents faster with fewer escalations. By creating automation for common requests and operations, teams can spend less time figuring out who to escalate to and more time on a fix.
Triggering human-assisted/self-healing automation. Invoke diagnostic actions before responders are even paged using PagerDuty’s Event Orchestration.
Safely invoking automation with security in mind. Responders only see actions they have the authorization to invoke for impacted systems in an incident, and all actions are logged to maintain a strong security posture.
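
To illustrate the last point, here is a conceptual sketch of scoping actions to a responder and keeping an audit trail. It is not how PagerDuty implements authorization; the `Action` class, the team-based permission model, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Action:
    name: str
    service: str
    allowed_teams: FrozenSet[str]


def visible_actions(
    actions: List[Action],
    responder_teams: Set[str],
    impacted_services: Set[str],
) -> List[Action]:
    """Keep only actions for impacted services that one of the responder's teams may invoke."""
    return [
        action
        for action in actions
        if action.service in impacted_services
        and action.allowed_teams & responder_teams
    ]


def invoke(action: Action, responder: str, audit_log: List[str]) -> None:
    """Record who invoked which action; a real system would also run the automation here."""
    audit_log.append(f"{responder} invoked {action.name} on {action.service}")
```

The design choice worth noting is that filtering happens before the responder ever sees the list, so an unauthorized action is never offered rather than being rejected after the fact.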

To summarize the above with some quick bullets, PagerDuty® Automation Actions helps teams:

Decrease response times by up to 30 minutes and MTTR by up to 25%
Reduce the volume of incidents that get escalated up the ladder
Distribute subject matter expertise across response teams
Trigger human-assisting and self-healing automation before responders even get paged
Invoke secure automation behind firewalls and VPCs
Deploy automated actions in place of manual procedures
Enrich incident documentation for smoother postmortems and reduced operator work

The post Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions appeared first on PagerDuty.

Closing the Gap: Deploying Automation the Right Way by Joseph Mandros

Joseph Mandros — Thu, 17 Mar 2022 13:00:25 +0000

Automation in the enterprise is nothing new. Engineers have been working with automation tools and frameworks for decades. From configuration management tools to continuous integration and delivery pipelines to cloud formation, automation is part of the fabric of nearly any technology use case in the business landscape. If that is true, then why does automation still seem to go hand in hand with so much manual work?

The answer? There is still a palpable lag in the widespread adoption of automation across IT and the rest of an organization. For example, the majority of companies today have some form of digital offering—whether that’s a product, service, helpdesk, or other customer-facing application. And most companies leverage some degree of automation within their IT organization to deploy or maintain that service offering. However, even though automation is utilized, the full value is often unrealized in production. Its use is typically pigeonholed in small pockets of the business, and only the employees who implemented and/or built it are able to use and apply it.

We call this the automation gap.

The Automation Gap

The automation gap is a scenario where the use of automation within an organization exists in islands, with each specialized operational unit leveraging pockets of automation in silos. Additionally, the existence of an automation gap means most employees (besides the subject matter experts) lack the tribal knowledge, skill sets, or access privileges needed to actually use that automation across the organization.

Through the lens of IT, an automation gap slows down operations, and can lead to bigger business problems such as:

Limited innovation capacity because of constant escalations to specialists
Increased SLA penalties and error rates from slowed incident resolutions
Specialist burnout due to inundation of unplanned work and requests
Ultimately unhappy customers and lost revenue

Considering the pace of innovation and rise in consumer expectations in the digital world, the negative externalities of an automation gap will only worsen—or widen—as the days go by. But before you can address the issues and bridge the chasm, you need to be able to identify and understand the underlying factors that contribute to the existence of the gap at its foundation.

The characteristics of an automation gap can be categorized into three main subcomponents:

The Knowledge Gap
The Skills Gap
The Access Gap

The Knowledge Gap

Digital-first organizations of all sizes have to constantly transform to meet customer needs, keep pace with innovation, and stay ahead of the competition. In order to see success in this evolution, the digital infrastructure needs to adapt and evolve in parallel. But evolutions can’t just happen overnight. They occur through years of digital transformation, employee and cultural transitions, and new complexities and technical dependencies—all layered atop legacy infrastructure. Without proper documentation and understanding, seamless execution can be nearly impossible.

The knowledge gap is the reality that no single individual or employee can be the subject matter expert for every dependency, system, or best practice in the IT organization. For example, an employee with 5+ years of tenure may have the tribal knowledge to tackle a legacy infrastructure incident, but the on-call responder who’s been working eight months might not have that nuanced understanding of the same underlying infrastructure.

The Skills Gap

The skills gap is the reality that different users have different technical skill sets. Similar to the knowledge gap, the skills gap stems from employee specialization in new technologies and complexities across an organization’s digital infrastructure. As new systems and processes are introduced, often the subject matter experts (SMEs) that built or implemented them are the only people who can properly triage a problem when an incident inevitably occurs. This specialization bottleneck can negatively impact the response lifecycle, burn out your specialists, and reduce the efficiency of your remediation efforts. This gap is especially evident during periods of attrition, when the one or two specialists who left were the only ones with the understanding and skills needed for a given system or service.

Not everyone knows how to administer a database or automate a continuous integration pipeline. In fact, a company’s most highly skilled technicians are usually in such demand that anything that can offload their work helps scale the business. Constantly escalating repetitive diagnostics and procedures to SMEs creates unplanned work and distracts experts from the high-value work they should be prioritizing. Demand for talent in these technical roles continues to grow, which makes this particular gap all the more important for organizations to tackle long term.

The Access Gap

Finally, the access gap is related to maintaining security posture according to today’s best practices. Following the principle of least privilege, super-user credentials should not be widely distributed or shared among IT staff. But when responders lack access to the tools, information, and systems they need, the results are prolonged resolution times, inefficient remediation, and less time for SMEs to focus on high-value work.

So how can PagerDuty help you bridge this automation gap, improve the overall agility of your IT operations, and enable your teams to innovate faster?

Apply Automation

PagerDuty’s automation capabilities enable your end-users to do what previously only your expert engineers were able to do. The platform is designed to bridge these gaps by safely delegating automation for use by other stakeholders, eliminating escalation interruptions and dramatically reducing wait times. These processes can incorporate existing task automation as individual steps in an operational workflow, abstracting the specific context of each step from the process user, while providing a consistent operational experience.

The PagerDuty® Process Automation portfolio consists of the following offerings:

PagerDuty® Automation Actions. A PagerDuty add-on that curates and connects responders in PagerDuty to automated diagnostics and remediation for services involved in incidents.
PagerDuty® Runbook Automation. A SaaS service that enables engineers to standardize and automate runbook procedures, and delegate services as self-service operations.
PagerDuty® Process Automation On-Prem. A self-hosted software cluster that gives engineers the ability to standardize and automate end-to-end operational workflows, and safely delegate them as self-service operations to stakeholders.

PagerDuty’s automation capabilities also help organizations safely close the access gap by providing the ability to invoke automated workflows without needing to explicitly share credentials or keys with end-users. The platform integrates with single sign-on (SSO) to enable role-based access control, and logs all activity at both process and step levels to meet compliance requirements.
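
To make the idea of closing the access gap concrete, here is a minimal Python sketch of the general pattern: a delegation layer holds the credentials, checks the caller's role before running a workflow, and records an audit entry for the process and for each step. This is an illustration only, not PagerDuty's implementation or API. The names Workflow, Delegator, invoke, and visible_workflows are hypothetical, and the roles are assumed to arrive from an SSO provider.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Workflow:
    """Hypothetical workflow: a name, the roles allowed to run it, and ordered steps."""
    name: str
    allowed_roles: frozenset
    steps: tuple  # tuple of (step_name, callable that receives the credential)


class Delegator:
    """Hypothetical delegation layer that holds credentials on behalf of end-users."""

    def __init__(self, secrets):
        self._secrets = secrets  # workflow name -> credential, never returned to callers
        self.audit_log = []

    def _log(self, user, workflow, step, outcome):
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "workflow": workflow,
            "step": step,  # None marks a process-level entry
            "outcome": outcome,
        })

    def visible_workflows(self, user_roles, workflows):
        # Users only see workflows that at least one of their roles may invoke.
        return [w for w in workflows if w.allowed_roles & set(user_roles)]

    def invoke(self, user, user_roles, workflow):
        if not (workflow.allowed_roles & set(user_roles)):
            self._log(user, workflow.name, None, "denied")
            raise PermissionError(f"{user} may not run {workflow.name}")
        self._log(user, workflow.name, None, "started")
        outputs = []
        for step_name, step in workflow.steps:
            # The credential is passed to the step, not to the user.
            outputs.append(step(self._secrets[workflow.name]))
            self._log(user, workflow.name, step_name, "ok")
        self._log(user, workflow.name, None, "finished")
        return outputs
```

In this sketch a responder with the right role calls invoke and gets back only the step outputs, while the credential stays inside the Delegator and every attempt, including a denied one, leaves an audit entry.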

To learn more about how PagerDuty’s automation capabilities can help you on your journey to close the automation gap, visit https://www.pagerduty.com/use-cases/automation/ today.

The post Closing the Gap: Deploying Automation the Right Way appeared first on PagerDuty.