Automation | Categories | PagerDuty

Democratize Automation with AI-Generated Runbooks by Ranjana Devaji

Ranjana Devaji — Thu, 31 Aug 2023 12:00:57 +0000

Operational efficiency is as critical within the IT and engineering teams as any other part of the business. Automating repetitive tasks and reducing escalations within and to these teams is of immense value.

While automation saves time and boosts productivity, the complexity of developing automation can be a limiting factor and bottleneck. Generative AI is a paradigm shift here, in that it brings consumer-style simplicity to assisting in the development of enterprise-grade automation.

With the new interface of generative AI, organizations can democratize automation and therefore increase the number of individuals that are contributing to authoring and harnessing automation.

To help our customers achieve their goals with automation, we are excited to announce public Early Access for AI-generated Runbooks. Starting today, Runbook Automation users can describe the task they wish to automate in plain English and let AI build an automation template for that task.

AI-generated Runbooks lower the barrier to entry for new automation developers and speed up the creation of new automation for experienced authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task.

Simply sign up for the PagerDuty Runbook Automation Trial if you are not an existing Runbook Automation user. For existing Runbook Automation customers, App administrators can enable this feature at any time.

Tangible Benefits from Leveraging AI-generated Runbooks

Better Development timelines

Are you a seasoned automation engineer? AI-generated runbooks will help you save time and effort.

Creating self-service automation for tasks involves sifting through documentation, identifying the right calls/commands, and then transposing them into individual job steps manually.
With AI-generated runbooks, authors can generate these on the fly and faster than ever before.

Here’s a quick look at what provisioning access to apps in Okta looks like with and without AI-generated Runbooks:

Faster Onboarding

Get good fast with example jobs for reference. Start with tasks you’re familiar with and see how these tasks operate within the Runbook Automation platform.

Democratize Technical Automation

With AI-generated Runbooks, less experienced automation-authors can quickly go from thought to development to implementation. This broadens the scope of people within an organization that can leverage a technical tool such as Runbook Automation.

Conquering Blank Slate Problems

A typical conundrum for users in the face of automation is “Where do I start?”. With AI-generated runbooks, users can now hit the ground running by creating baseline versions of Jobs for their variety of use cases.

Build with Best Practices for Optimal Results

The AI-generated Runbooks use fine-tuned “prompt engineering” that embeds the known best-practices from the engineers here on the Process Automation team at PagerDuty.
For example, all Jobs are created with a ReadMe that explains the prerequisites for invoking that Job. And all secrets used within the Job are retrieved from Key Storage – rather than requested from the user.

GenAI – Security & Data

A common concern we hear from customers considering generative AI is the security of their data, and the potential of giving competitors an advantage through model training. PagerDuty’s AI-generated Runbooks feature is opt-in, meaning you must enable it before you can use it. Furthermore, as stated in the feature documentation:

The only data sent to the generative AI model is the text entered into the prompt field. No other data about your environment, existing Jobs or the source of the prompt is sent to the AI model. Furthermore, the AI model is not trained on the text entered into the prompt.

Read our Guidelines for the Safe and Secure Use of Generative AI to learn more about how we’re working with and building our AI-powered features.

See AI-generated Runbooks in Action

AI-generated Runbooks propel automation into a new realm, where your operations are empowered like never before. Embrace the future of automation with PagerDuty, and witness the transformation it brings to your operational landscape.

The post Democratize Automation with AI-Generated Runbooks appeared first on PagerDuty.

Gartner Market Guide: Embedding Automation Into the Enterprise by Joseph Mandros

Joseph Mandros — Thu, 20 Jul 2023 13:00:17 +0000

Read the Gartner Market Guide today!

“Existing workload automation strategies are unable to cope with the expansion in complexity of workload types, volumes and locations driven by evolving business demand, as per Gartner. Digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. Cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.”

The lack of collaboration and automation both within and beyond IT departments creates isolated capabilities across business and IT teams, hindering the pace of digital business. Additionally, as technical debt accumulates and the need to showcase the value of investments grows, cost optimization becomes an ongoing challenge.

By embracing these key findings in a recent Market Guide published by Gartner®, we believe businesses can streamline their operations and enhance efficiency in the face of expanding complexities. Let’s take a brief look at our understanding of the key findings from the Gartner Market Guide for Service Orchestration and Automation Platforms.

Workload Automation Challenges

Businesses face a myriad of challenges when it comes to managing workloads effectively. According to Gartner, “existing workload strategies are unable to cope with the expansion in complexity of workload types, volumes, and locations driven by evolving business demands.” Automation strategies that once sufficed can no longer handle this growing complexity.

To address these challenges, organizations need to adopt intelligent automation solutions that can adapt to changing requirements. Intelligent workload automation leverages technologies such as artificial intelligence (AI) and machine learning (ML) to dynamically allocate resources, optimize scheduling, and automate repetitive tasks.

Collaboration and Automation: Breaking Silos

According to Gartner, digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. The lack of collaboration and automation between these teams can significantly hinder digital business initiatives. Silos create barriers, slowing down decision-making processes and impeding the flow of information and ideas.

To overcome these challenges, organizations must foster a culture of collaboration and implement automation solutions that span across business and IT functions. By integrating workflows and sharing information seamlessly, teams can work together in harmony, driving innovation and accelerating digital initiatives. Collaborative automation tools, such as workflow management platforms and project management software, can facilitate effective communication, collaboration, and information sharing, leading to faster time-to-market and improved customer satisfaction.

The Evolving Challenge of Cost Optimization

According to Gartner, “cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.” As digital infrastructures expand, organizations accumulate technical debt: a burden caused by outdated technologies, inefficient processes, and legacy systems. This technical debt not only impedes agility and innovation but also increases operational costs.

To address this challenge, businesses must prioritize cost optimization through strategic investments and ongoing evaluation of their technology portfolios. Embracing cloud-based solutions, leveraging automation, and adopting agile practices can help organizations reduce technical debt and achieve greater cost efficiency.

Additionally, organizations should establish clear metrics and processes to measure and demonstrate the business value of technology investments, enabling informed decision-making and resource allocation.

Conclusion

We believe the key findings from the report underscore the need for organizations to stay agile, adaptive, and innovative in their approach to workload management, ensuring they can effectively meet evolving business demands and drive sustainable growth in the ever-changing digital landscape.

“According to Gartner, it is recommended to drive collaboration across business and IT teams by democratizing access to automated capabilities through feedback-driven self-service automation solutions.

Unlock the business value of orchestrated I&O automation by implementing expanded service orchestration and event-driven workflows to drive agility, innovation and cost optimization efforts.

Meet today and tomorrow’s business demands by service orchestration and automation platforms (SOAPs) delivering the needed agility and efficiency. Embed agility and efficiency into orchestrated IT processes to meet business demands by using SOAPs.”

To learn how PagerDuty can help you on your automation journey, click here.

___________________________________

Gartner Market Guide for Service Orchestration and Automation Platforms, Chris Sanderson, Daniel Betts, Hassan Ennaciri, 23 January 2023.

The post Gartner Market Guide: Embedding Automation Into the Enterprise appeared first on PagerDuty.

PagerDuty Extends Operations Cloud Leadership into AIOps and Automation by Jonathan Rende

Jonathan Rende — Tue, 11 Jul 2023 22:51:12 +0000

Forrester Names PagerDuty a Leader in first-ever Process-Centric AIOps Wave

From helping pioneer the DevOps movement to establishing best practices around service ownership to being the standard in incident response, PagerDuty has a long history of leadership. PagerDuty is honored to add to this list and now be recognized as a leader in the AIOps and Automation space by Forrester. To explain why PagerDuty was listed as a leader, it’s important to look at our current economic climate and compare it to the past.

It’s been more than a decade since the last time the Three C’s–Cost Control, Consolidation (of vendors) and Compliance–received so much oversight and scrutiny. Just like in 2008, centralized decision making and cost controls are driving organizations to consolidate entire vendor suites versus only best of breed products to do the job.

It’s no surprise that additional financial oversights are now a part of every purchase, every budget item, and every activity for IT and development. Everyone expects more–and they expect it right now, not next quarter or even next year.

AIOps, however, has always promised to do more. For us, more is about exceeding SLAs, and improving availability and reliability. More is about cost savings because fewer humans should be needed in a major outage and incident. More should not only be about being more responsive, but be about preventing issues in the first place.

What’s Changed Since the Last Financial Crisis

In 2008, important financial institutions failed because of their credit and lending practices. This started a market downturn in which the global economy contracted. The result was cost controls and a need to improve business efficiency, which in turn drove more centralized decision making. It was a stark contrast to the strategy of top-line growth at all costs, which typically results in distributed decision making and vendor/tool sprawl.

So, what’s different now and why is PagerDuty’s AIOps a leading solution to look into?

Now (vs 2008) machine learning is and should be an operational part of every data centric digital business. PagerDuty’s AIOps solution has progressed fast over the last four years to help both reduce the time to resolve issues by 25% and reduce unnecessary interruptions (noise) by over 90%.
- By combining our event correlation (Intelligent event grouping) and event orchestration (event rules engine) with existing observability processes, we better target which experts are needed for which problems. We make escalation policies more effective and powerful.
- PagerDuty’s AIOps product can make those experts more productive as well when they do get called in to a major incident by automating the diagnostics process and pinpointing the offending or culprit services responsible for the problem.
- Equally important, by combining event rules with automation jobs (event-driven automation), an entire class of lower-priority problems can be remediated without human intervention, eliminating the need for responders or experts to engage at all.
- Lastly…with Generative AI, we just potentially democratized the tools that will further broaden use even faster.
Now (vs 2008) there is no need to hire expensive professional services or bring in tons of white lab coat experts to configure systems. You can and should demand to see the value in days and weeks, not months or longer. PagerDuty’s AIOps offers 5-10x reduction in time to value over alternative solutions.
Now (vs 2008) we have proven, high value products that offer AI as an integrated part of your existing practices vs separate bespoke solutions in event management. Whether you have centralized IT Operations (e.g., network operating center with SREs) or decentralized operating models with service ownership by developers or a combination of both (hybrid model), there is no need to have or add new vendors or build new approaches. PagerDuty’s AIOps solution works in all models.
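
To make the event-driven automation idea from the list above more concrete, here is a minimal sketch of how an event rule might decide between running an automation job and paging a responder. It is an illustration only, not PagerDuty's event orchestration API: the rule format, the field names (`service`, `summary`, `priority`), and the job names are all hypothetical.

```python
# Hypothetical rules: each one maps a match condition to an automation job name.
RULES = [
    {"service": "web", "summary_contains": "disk full", "job": "clear-temp-files"},
    {"service": "db", "summary_contains": "replica lag", "job": "restart-replication"},
]


def route_event(event, rules=RULES):
    """Return ("automate", job) for known low-priority problems, else page a human."""
    summary = event.get("summary", "").lower()
    for rule in rules:
        if (
            event.get("priority") == "low"
            and event.get("service") == rule["service"]
            and rule["summary_contains"] in summary
        ):
            return ("automate", rule["job"])
    # Anything unmatched, or not low priority, still goes to a responder.
    return ("page_responder", None)


print(route_event({"service": "web", "summary": "Disk full on web-03", "priority": "low"}))
print(route_event({"service": "web", "summary": "Checkout errors", "priority": "high"}))
```

The first event matches a rule and is handed to a job; the second falls through to a human responder, which is the default for anything the rules do not explicitly cover.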

The promise of off-the-shelf AIOps products is a reality. There are real products from established leaders like PagerDuty to apply against your needs and requirements.

And now, we’re proud to share PagerDuty’s AIOps leadership as part of the recent Forrester Wave. Consider this your personal guide to help in your AIOps journey. Enjoy.

The post PagerDuty Extends Operations Cloud Leadership into AIOps and Automation appeared first on PagerDuty.

What is Zero Trust Security and Why Should You Care? by Joseph Mandros

Joseph Mandros — Tue, 13 Jun 2023 13:00:45 +0000

Automation has become a game changer for businesses seeking efficiency and scalability in an unclear and volatile macroeconomic landscape. Streamlining processes, improving productivity, and reducing the incidence of human error are just a few benefits that automation brings.

However, as organizations embrace automation, it’s crucial to ensure modern security measures are in place to protect these new and evolving assets. While other security models control the majority of the narrative across the business landscape, zero trust is quickly emerging as a necessary security implementation concept.

With our recent release of the next-generation architecture for PagerDuty Runbook Automation and PagerDuty Process Automation, we are positioned as the ideal partner to help organizations implement and grow within a zero trust security architecture for the modern enterprise.

To learn more, keep reading and/or register for our webinar about zero trust security happening this Thursday, June 15th, at 6 A.M. PT and 11 A.M. PT.

What is zero trust security?

Zero trust security is a model that challenges the traditional perimeter-based security approach by assuming that no user or device can be inherently trusted—regardless of their location. It emphasizes continuous verification and validation of identities, devices, and network traffic before granting access to resources. It achieves this through multi-factor authentication, granular access controls, encryption, and monitoring, enabling organizations to minimize the risk of data breaches and unauthorized access.

By shifting the traditional perimeter-based security paradigm and adopting a “trust no one” approach, zero trust security offers a holistic framework that aligns seamlessly with modern automation initiatives. Additionally, it can positively impact the process evolution of a business’ inner workings as the world becomes increasingly complex and prone to bank-breaking threats.

Source: https://www.microsoft.com/en-us/security/business/zero-trust

What’s the big deal?

Zero trust security often stands out as a superior approach compared to traditional security models, largely due to its fundamental shift to a modern technological mindset and comprehensive implementation.

Unlike perimeter-based security models that rely on the assumption that internal networks are inherently trustworthy, zero trust security adopts a “trust no one” philosophy. It implements strict access controls, continuous authentication, and rigorous monitoring at every level, ensuring that every user, device, and network component is treated as potentially untrusted. This approach significantly reduces the attack surface and prevents lateral movement within the network, making it highly effective against both external threats and insider risks.

Additionally, zero trust security provides adaptive access controls that dynamically adjust privileges based on context, bolstering security without impeding productivity. By combining strong authentication, encryption, and segmentation, zero trust security offers a holistic and proactive defense strategy that fortifies organizations against sophisticated threats, making it a superior choice for today’s deep field of dynamic and interconnected digital landscapes.
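
As a rough illustration of the "trust no one" idea described above, the following sketch evaluates each request on its own merits instead of trusting anything because of where it comes from. The `AccessRequest` fields, the `POLICY` table, and the user and resource names are hypothetical and exist only for this example; a real deployment would rely on an identity provider and a policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    network: str            # recorded for auditing, never used to grant trust
    mfa_verified: bool
    device_compliant: bool
    resource: str
    action: str


# Hypothetical least-privilege policy: (user, resource) -> allowed actions.
POLICY = {
    ("alice", "prod-db"): {"read"},
    ("bob", "prod-db"): {"read", "write"},
}


def authorize(request: AccessRequest) -> bool:
    """Check identity, device, and policy on every single request."""
    if not request.mfa_verified:
        return False
    if not request.device_compliant:
        return False
    allowed_actions = POLICY.get((request.user, request.resource), set())
    return request.action in allowed_actions


# Being on the corporate LAN does not help a request that lacks MFA.
print(authorize(AccessRequest("bob", "corp-lan", False, True, "prod-db", "write")))
# A verified user on a compliant device is allowed only what the policy grants.
print(authorize(AccessRequest("alice", "home-wifi", True, True, "prod-db", "read")))
print(authorize(AccessRequest("alice", "home-wifi", True, True, "prod-db", "write")))
```

Note that `network` never appears in the decision: under this model an internal address earns no more trust than an external one.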

Businesses of all sizes can benefit from implementing a security model like zero trust. Contributing factors include:

- Protecting Sensitive Data: Zero trust security ensures that access to sensitive data is strictly controlled and authenticated, reducing the risk of unauthorized access, data breaches, and potential financial and reputational damages.
- Mitigating Insider Threats: Zero trust security addresses the risk of insider threats by assuming that no user or device should be implicitly trusted. This helps organizations identify and address potential risks before they cause harm.
- Adapting to Evolving Cyber Threats: Traditional security models often rely on perimeter-based defenses, assuming that internal network traffic is safe. However, modern cyber threats, such as advanced persistent threats and zero-day exploits, can bypass traditional defenses. Zero trust security takes a more granular approach, implementing continuous auditing, multi-factor authentication, and strict access controls to protect against these evolving threats.
- Supporting Remote and Mobile Workforces: With the rise of remote work and the increasing use of mobile devices, businesses face new challenges in securing their networks and data. Zero trust security allows organizations to implement secure access controls, regardless of the user’s location or device. This flexibility ensures that employees can work remotely while maintaining a strong security posture.
- Meeting Compliance and Regulatory Requirements: Implementing zero trust security can help organizations meet compliance and regulatory requirements by enforcing access controls, monitoring data usage, and demonstrating a proactive approach to cybersecurity.
- Building Customer Trust: In today’s data-driven world, customers value the security and privacy of their personal information. By implementing robust zero trust security measures, businesses can build trust with their customers, demonstrating their commitment to protecting sensitive data and mitigating cyber risks.

PagerDuty Process Automation + Zero Trust

Digital transformation initiatives rely on cloud technologies to rapidly scale the business, but automating operations and cloud infrastructure brings new security challenges. The main challenge is that engineers need the most secure protocols to run automation in restricted application environments that mandate a zero trust architecture, where direct SSH zone access is deprecated.

Additionally, significant engineering effort is required to deploy and manage automation that performs well across hundreds of remote environments and geographical regions. Lastly, creating resilient automation runbooks is time consuming and prone to error when coordinating within a variety of complex environments.

With PagerDuty Runbook Automation, engineers can now run automation from a central system that triggers the execution through enhanced Runners or AWS SSM within the remote environments—without needing to rely on SSH firewall rules.

PagerDuty Runbook Automation dispatching tasks to remote environments using zero-trust principles.

The new Runners can leverage common plugins like Ansible and Kubernetes and customers can create new types of runbooks where engineers target many remote secure environments and explicitly state where and how tasks will be independently routed and executed within each environment. This enables better performance, scale, and fault tolerance.

For customers with high security requirements, PagerDuty Runbook Automation and Process Automation can now enable connectivity without the need to open ports in their firewalls, such as SSH, enabling remote operations. This new functionality simplifies secure connectivity to automation by reducing the need for customers to deploy their own bastion or jump host and public endpoints.
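
One way to picture how tasks can reach a remote environment without opening inbound ports is a pull model, in which a runner inside the environment makes outbound calls to a central system to ask for work. The sketch below simulates that pattern in memory. It assumes a polling design for the sake of illustration and does not reproduce the actual Runner protocol; `CentralServer`, `Runner`, and their methods are hypothetical names.

```python
from collections import defaultdict, deque


class CentralServer:
    """Stand-in for the central automation system; queues tasks per runner."""

    def __init__(self):
        self._queues = defaultdict(deque)
        self.results = []

    def dispatch(self, runner_id, command):
        self._queues[runner_id].append(command)

    def poll(self, runner_id):
        # Called by the runner: in a real system this would be an outbound request.
        queue = self._queues[runner_id]
        return queue.popleft() if queue else None

    def report(self, runner_id, command, output):
        self.results.append((runner_id, command, output))


class Runner:
    """Lives inside the restricted environment and only initiates connections."""

    def __init__(self, runner_id, server):
        self.runner_id = runner_id
        self.server = server

    def run_once(self):
        command = self.server.poll(self.runner_id)
        if command is None:
            return False
        output = f"{self.runner_id} ran: {command}"  # placeholder for local execution
        self.server.report(self.runner_id, command, output)
        return True


server = CentralServer()
runner = Runner("eu-west-runner", server)
server.dispatch("eu-west-runner", "uptime")
runner.run_once()
print(server.results)
```

Because the runner starts every exchange, the central system never needs a route into the environment, which is the property the firewall discussion above depends on.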

To learn more about zero trust security and PagerDuty Process Automation, be sure to register for the webinar happening this Thursday, June 15th, at 6 A.M. PT and 11 A.M. PT.

The post What is Zero Trust Security and Why Should You Care? appeared first on PagerDuty.

Take Advantage of the New Product Trial of Runbook Automation for Incident Resolution by Jorge Villamariona

Jorge Villamariona — Mon, 12 Jun 2023 12:00:48 +0000

The PagerDuty Operations Cloud is the platform that enables our customers to manage the full lifecycle of urgent incidents. Many of our customers are leveraging Process Automation to augment their incident response teams and as a key driver to grow and scale their capabilities.

The work resulting from urgent incidents cannot be postponed because it impacts a company’s revenue or ability to service customers. Often, this work is repetitive and could be delegated to first responders. However, the deeper context needed to make accurate diagnosis and remediation of these incidents is locked away in production environments and requires knowledge, skills and access privileges from specialists. Responders frequently have to escalate the incident to already overworked specialists—a time-consuming process that can be disruptive, frustrating and repetitive.

By automating repetitive and time-consuming tasks from your incident resolution process, you can free up your engineers to focus on higher-value activities that require creativity and critical thinking. This, in turn, can lead to shorter MTTR, better customer experiences, faster innovation, revenue protection and improved profitability.

PagerDuty Automated Incident Resolution provides pre-built and customizable diagnostic and remediation capabilities that help first-responders determine the cause and initiate remediation within their production environment, which saves time and requires fewer individuals to assist with the response. Automating this repetition can speed up MTTR by 25% and reduce costs and interruptions by at least 50%.

In order to make it easier for our customers to realize how Automated Incident Resolution can speed up their MTTR, we rolled out the In-Product Trial of Runbook Automation for Incident Resolution last month. This trial is exclusively available to our Business and Digital Operations customers. PagerDuty users can request a trial from the automation tab using the Web UI:

The account owner will see an approval request via email. Upon approval, the account owner can set up a trial for Automation Actions and get a fully functional Runbook Automation instance in just a few minutes. Users will see a visual guide to help them get started authoring automation, including: Creating a Runbook Automation (RBA) Instance, Adding a Runner (a program that allows you to execute automation jobs in your environment), Adding Automation Actions (which allows you to invoke automation jobs and workflows from PagerDuty), Running Actions (from the PagerDuty incident details page), and viewing the output from the automation.

We encourage you to fully take advantage of this trial to further automate and optimize your incident response process. We are looking forward to hearing your Incident Resolution Automation Success story.

The post Take Advantage of the New Product Trial of Runbook Automation for Incident Resolution appeared first on PagerDuty.

AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst by Heath Newburn

Heath Newburn — Fri, 09 Jun 2023 12:00:36 +0000

At the beginning of 2023, I had a great conversation with Carlos Casanova, a Forrester Principal Analyst, in a webinar about how AIOps can help drive successful organizational change. In that conversation, Carlos divided the AIOps market into two camps: technology-centric (primarily APM/Observability players) and process-centric. PagerDuty is a process-centric solution leveraging multiple technologies.

With process-centric AIOps solutions, organizations gain additional context and insights into their data. This reduces the time to act, helps improve data quality, enhances decision-making, improves routing and notification efficiency, and ultimately increases the value of services delivered by IT.

This ability to increase speed with greater context shrinks the time to resolve critical incidents. An important thing to note is that the initial routing can be to a virtual operator, meaning that automation could gather additional triage/debug information or potentially complete a fix before engaging a human responder.

Throughout our conversation, Carlos and I kept returning to the theme of creating better context for responders. When I asked him about what capabilities he sees as most important for solving core AIOps use cases, he said, “Quickly identifying the correlation across disparate alerts drastically reduces the noise that individuals are dealing with. Providing all impacted individuals with this clean data signal is vital to improving operations. With this data, individuals can more easily and quickly garner insight into what is truly going on in the environment. They can then quickly determine the right actions to take, decide who needs to be involved for faster remediation, and reduce the amount of effort necessary, which frees up time for other events and alerts.”

But teams often struggle with getting started. We agreed that the cost of waiting and planning probably outweighs the cost of starting and iterating. He added, “The overall initiative may look daunting, but there are achievable quick wins. Waiting is not recommended. Start with small tactical efforts that roll up to your larger and longer-term strategic goals to show progress, demonstrate value, and build momentum.”

So speed is also a continuous theme: quickly getting context, rapidly responding with automation, and starting the process immediately to see these wins. But we also know that the pressure has continued to grow.

Teams have been affected by the economic downturn and slowdown. When I asked him about how teams can increase efficiency and measure success, we spoke about automation being key to success.

Carlos responded, “Simple scenarios that occur often are great candidates for automating all or part of their remediation. Fully or even partially automating five or 10 simple scenarios instantly frees up large amounts of time for individuals to focus on the more complex scenarios that organizations might not feel comfortable automating.”

But we also have to recognize the forming, storming, and norming before we get to performing in projects. There will be changes to how we measure and think about success that we have to embrace.

“AIOps can also empower IT to alleviate workloads to help their delivery teams ‘do more with less.’ It’s important to remember that these changes invalidate existing metrics. You must establish new baselines, since individuals will no longer be performing the simple and low-level actions. For example, a technician manually resolves 300 incidents per week. Thirty are simple and have easily automated remediations. The MTTR on these might drop by 90%. Elimination of the simple incidents, however, only allows the technician to take on 10 medium-complexity incidents in their place. That means the technician will handle 20 fewer incidents per week. The average MTTR for the technician will go up, and incidents will stay in their queue longer, with a higher ratio of medium- and high-complexity incidents,” Carlos said.
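
The arithmetic in Carlos’s example can be checked with a short sketch. The incident counts (300 per week, 30 simple, 10 medium-complexity taken on in their place) come from the quote. The per-incident resolution times are hypothetical values chosen only to show the direction of the effect; they are not figures from Forrester or PagerDuty.

```python
# Counts come from the quoted example; the minute values below are hypothetical.
SIMPLE_MINUTES = 10   # assumed time to resolve a simple incident by hand
MEDIUM_MINUTES = 60   # assumed time for a medium-complexity incident
OTHER_MINUTES = 45    # assumed average for the remaining 270 incidents


def average_mttr(workload):
    """workload is a list of (incident_count, minutes_each) pairs."""
    total_incidents = sum(count for count, _ in workload)
    total_minutes = sum(count * minutes for count, minutes in workload)
    return total_incidents, total_minutes / total_incidents


# Before automation: 30 simple incidents plus 270 others, all handled manually.
before = [(30, SIMPLE_MINUTES), (270, OTHER_MINUTES)]
# After automation: the 30 simple incidents leave the queue, 10 medium ones arrive.
after = [(10, MEDIUM_MINUTES), (270, OTHER_MINUTES)]

count_before, mttr_before = average_mttr(before)
count_after, mttr_after = average_mttr(after)

print(count_before, round(mttr_before, 1))  # 300 41.5
print(count_after, round(mttr_after, 1))    # 280 45.5
```

With these assumed times the technician handles 20 fewer incidents per week and their personal average MTTR rises, even though the organization as a whole is better off. That is why the old baseline stops being a fair measure.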

One of the most common questions I run into is how to get started. Traditionally, AIOps is viewed as a potentially years-long initiative. It can be daunting to begin the journey with so much uncertainty and change. PagerDuty has greatly simplified the process by crafting a one-click process for event correlation so teams can see value immediately, but this isn’t the end of the journey to AIOps.

Carlos shared his insights on getting started, as well as facing the reduction in available OpEx. “Budgets are always a challenge, but to a large extent, you can overcome that hurdle by demonstrating and clearly articulating the value of AIOps. Develop a narrative for your business case that speaks to the value of improved experiences with the organization. Demonstrate how improved routing and notifications with enhanced contextually relevant data enables the same workforce to handle more workloads with less effort. Explain how patterns and trends empower lower-level resources to execute more advanced actions because they are provided suggestive actions that are based on the more experienced and senior staff members. All of this helps organizations deal with the economic challenges they’re currently facing while also improving the quality of products and services they deliver. It’s important for organizations to demonstrate their chosen solution has a fast time to value. For example, to improve user experiences, how quickly can the solution provide complete visualizations of transactions to support personnel to resolve an outage? To provide a faster response time, how quickly can the solution analyze the environment and correlate new alerts into singular incidents that can be handled immediately or in an automated fashion? Time to value is vital in difficult economic times.”

Time to value can be even more important than ROI for many of our customers. Speed is what will delineate winners and losers in digital battlegrounds. How quickly we can deal with inevitable issues and iterate improvements is what sets teams apart from competitors and provides an excellent customer experience.

As I&O leaders work through economic uncertainty that’s forcing them to cut costs and do more with less, they require new tools and approaches that help them scale and optimize their existing resources. AIOps provides teams with a reliable way to process high volumes of data and events, manage routing and response in real-time, and help teams resolve incidents faster. If you’re interested in learning how to tackle those challenges for your business, watch this webinar to hear the rest of my conversation with Carlos.

The post AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst appeared first on PagerDuty.

Debug State Capture for Traditional Infrastructure & Apps by Justyn Roberts

Justyn Roberts — Thu, 25 May 2023 13:00:59 +0000

In our previous blogs on Capturing Application State and using Ephemeral Containers for Debugging Kubernetes, we discussed the value of being able to deploy specific tools to gather diagnostics for later analysis, while also providing the responder to the incident the means to resolve infrastructure or application issues.

This strikes a balance between the need to restore a service as quickly as possible and the need to ensure enough debug data is available for a later permanent resolution, all while allowing a development team to keep a container running lean and performant.

By capturing both application and environment state when the incident occurs, any responder or service owner spends less time context switching between tools, credentials, and environments—enabling more accurate and faster responses and problem resolution.

The techniques discussed in the prior blogs in this series focused on modern, cloud-native platforms like Kubernetes and the unique approaches needed for containers, especially containers that do not natively ship with debugging tools.

Not everyone is able or willing to move every application to cloud native, and many of us still work within a hybrid scenario of both containerized and traditional applications.

Even without the ephemeral nature of containers and the strict policies of container images, there is still a need to capture in-the-moment evidence to help with root-cause analysis in order to avoid future occurrences of incidents.

Let’s look at use cases describing the ability to capture state automatically in the event of a failure or decreased performance, and pick some interesting scenarios to dive into for a deeper look.

This is a non-exhaustive list, but here are some examples of how debug state capture is used in traditional application environments:

Infrastructure & Network

Top resource-consuming processes on one or more infrastructure components
TCP dump; thread/memory/core dump

Database

Top resource-consuming queries
Current query state
Execution of application specific queries

Application specific

Java – Run thread/heap dump with tools like jstack
Windows – Proc Dump
Python – Running thread dump
All – Application specific log files

Additional Log Files

Debug state capture can grab whole or partial logs from any file that may not be captured by a log aggregator.

PagerDuty Process Automation provides many pre-built template workflows for capturing application and environment state as part of the automated diagnostics project. These workflows are flexible and extendable so that they can be customized to work for your particular use-cases.

Taking a Deeper Dive

Let us take a closer look at some specific examples of capturing environment state that could prove useful at identifying the long-term fix for an incident.

Use case 1 – Gather database debug

We can use the SQL RUN Step in Process Automation to either add an inline statement or execute an existing script. As my database is MariaDB (a fork of MySQL), I can run the following MySQL query:

SHOW FULL PROCESSLIST;

(Note: credentials are derived from my existing external store and passed securely as I execute the step as part of a workflow, so I can safely delegate without exposing info)

I pass the output to my incident platform (in my case, PagerDuty, of course), and set the job to collect this data automatically if an incident occurs within the database service.

This info is now automatically available to my responder in their app or chatops tool, and within any post mortem. In this case, I can see someone is running a benchmark test at the point of incident! As with the previous blog posts, it would also be easy to post more complex versions of this to a storage environment like an AWS S3 Bucket for later analysis.
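To make the idea concrete, here is a minimal Python sketch of how the captured process list could be reduced to the lines worth posting on an incident. It assumes the rows have already been fetched as dictionaries keyed by the columns that SHOW FULL PROCESSLIST returns; the sample rows, the 30-second threshold, and the function names are illustrative assumptions, not part of Process Automation.

```python
from typing import Dict, List

# Hypothetical rows shaped like the SHOW FULL PROCESSLIST columns
# (Id, User, Host, db, Command, Time, State, Info). All values are made up.
SAMPLE_ROWS: List[Dict[str, object]] = [
    {"Id": 11, "User": "app", "Host": "10.0.0.5:51234", "db": "shop",
     "Command": "Sleep", "Time": 2, "State": "", "Info": None},
    {"Id": 12, "User": "bench", "Host": "10.0.0.9:40022", "db": "shop",
     "Command": "Query", "Time": 95, "State": "Sending data",
     "Info": "SELECT BENCHMARK(100000000, MD5('x'))"},
]


def long_running(rows: List[Dict[str, object]], min_seconds: int = 30) -> List[Dict[str, object]]:
    """Return non-idle statements that have run for at least min_seconds."""
    return [
        row for row in rows
        if row["Command"] != "Sleep" and int(row["Time"]) >= min_seconds
    ]


def summarize(rows: List[Dict[str, object]]) -> List[str]:
    """Format one line per long-running statement for an incident note."""
    return [
        f"{row['User']}@{row['Host']} {row['Time']}s: {row['Info']}"
        for row in long_running(rows)
    ]
```

Calling summarize(SAMPLE_ROWS) keeps only the benchmark query, which is the kind of clue a responder would want to see first.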

Use case 2 – Gather application debug

My observability tool is very quick to let me know WHEN an application has failed, but it does not always tell me WHY it failed. This second use case runs an ad hoc command against my Python application using py-spy, a sampling profiler, in conjunction with one of our automation plugins to move files securely to S3 for later retrieval.

This outputs data directly to my S3 storage:

This example surfaces worker states for my Python app at a thread level, puts them straight into the hands of my developer, and stores them for as long as they might need to reference them.
Of course, these commands are not exclusive, and I could easily chain multiple checks to provide a broader view.
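The exact command is not reproduced in this post, so the following is only a sketch of what such a step might assemble. It uses py-spy's dump subcommand with its --pid option, which prints the current stack of each thread in a running process. The storage key layout, the service name, and the function names are hypothetical, and the upload itself is left to whichever plugin or tool you use.

```python
from datetime import datetime, timezone
from typing import List


def py_spy_dump_command(pid: int) -> List[str]:
    """Build the py-spy call that prints the current stack of every thread."""
    return ["py-spy", "dump", "--pid", str(pid)]


def object_key(service: str, pid: int, now: datetime) -> str:
    """Build a hypothetical storage key so dumps sort by service and time."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"debug/{service}/{stamp}-pid{pid}-threads.txt"
```

Keeping the timestamp in UTC inside the key makes it easier to line a dump up with the incident timeline later.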

Use case 3 – Traditional Infrastructure debug state capture

For the third use case, I need to deploy a set of bash commands to a remote machine and run them again at the trigger event. This primarily surfaces diagnostics such as open files and network connections, but it also runs bpftrace, a tool that can be used for tracing specific calls:

Process Automation allows me to define and deploy a whole script and store the output for gathering a snapshot of my environment state:
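The script itself is not reproduced in this post. As a rough illustration of the pattern, the Python sketch below runs a set of diagnostic commands and keeps either their output or the reason they could not be captured, so that one missing tool does not stop the rest of the snapshot. The command set (lsof and ss) and all names are assumptions for illustration; bpftrace is left out because it needs root privileges and a tracing program specific to the problem.

```python
import subprocess
from typing import Dict, List


def capture_snapshot(commands: Dict[str, List[str]], timeout: float = 30.0) -> Dict[str, str]:
    """Run each diagnostic command and keep its output, or the reason it failed."""
    snapshot: Dict[str, str] = {}
    for name, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing tool or a hung command should not stop the other captures.
            snapshot[name] = f"not captured: {exc}"
            continue
        if result.returncode == 0:
            snapshot[name] = result.stdout
        else:
            snapshot[name] = f"exit {result.returncode}: {result.stderr}"
    return snapshot


# Hypothetical command set for a Linux host; adjust to your environment.
EXAMPLE_COMMANDS: Dict[str, List[str]] = {
    "open_files": ["lsof", "-n"],
    "network_connections": ["ss", "-tan"],
}
```

The returned dictionary can then be written to a file or attached to the incident, one section per command.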

Conclusion

Signals from monitoring tools, even in traditional environments, benefit from broader visibility to allow any responder, DevOps engineer or SRE to make quick and safe decisions. Developers also often need additional information and the ability to capture state when problems arise, as they might not be on hand immediately.

Debug State Capture enables this by providing additional context for a responder, reducing time spent digging around in different tools, and adding the capability to collect deeper datasets for subsequent analysis.

Curious to learn more? Get started today with a trial of Runbook Automation.

The post Debug State Capture for Traditional Infrastructure & Apps appeared first on PagerDuty.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers by Jake Cohen

Jake Cohen — Tue, 02 May 2023 13:00:01 +0000

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is an application running in a container that gets redeployed (perhaps to a prior version or the same version) simply to solve the immediate issue. For companies where every millisecond of performance and every second of uptime has consequential impacts on the customer experience, these types of short-term fixes are a necessity. The costs to the business become significant, though, when engineers are tasked with developing the long-term solution to these incidents. For both major and (recurring) minor incidents, engineers have to spend inordinate amounts of time gathering evidence of the state of the application and environment when the incident occurred.

While a good portion of this diagnostic data resides in monitoring tools and therefore persists, there are times when it is necessary to get a shell in a container to retrieve information that is only available for the lifetime of the container. In Kubernetes, this is done using the kubectl exec command. With the right parameters, users can get a live shell in their running container and start executing commands to retrieve diagnostics. For example, once a user has a shell in a Java container, they can invoke jstack to get a thread dump of their application.

But many operations teams do not let anyone exec into production pods (which is where critical incidents happen), or the number of people that can is very slim—for both security reasons and due to the limited number of people that are familiar with operating in Kubernetes. Consequently, in order to retrieve diagnostic data during an incident, individuals with Kubernetes access and expertise regularly need to be pulled in for help. This process drives up the cost of incidents by increasing MTTR, as well as the number of people that need to get involved.

For these reasons, it is best to use automation that removes the need for users to exec into running pods. With this automation architecture, when an issue occurs, an automated runbook is invoked, and that runbook retrieves the debug data, sends it to a persistent storage location (S3, Blob Storage, SFTP server, etc), and then informs the engineers where they can locate and use the debug data.

PagerDuty Process Automation provides a pre-built, templatized runbook for exactly this use case: when an alert creates an incident inside PagerDuty, this can automatically (or by the click of a button) trigger the runbook to execute commands in the pod, send the output to a persistent storage, and provide details on the location of that data in the incident.

Link to debug data is provided to engineers during and after the incident
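As a rough sketch of the first and last steps of that flow, the Python below assembles a non-interactive kubectl exec call and formats the note that points engineers to the stored output. The namespace, pod, and container names, the assumption that the Java process runs as PID 1 in the container, and the storage URL are all hypothetical; this is not the template's actual implementation.

```python
from typing import List


def kubectl_exec_command(namespace: str, pod: str, container: str, tool: List[str]) -> List[str]:
    """Build a kubectl exec call that runs `tool` inside one container of a pod."""
    return ["kubectl", "exec", "-n", namespace, pod, "-c", container, "--", *tool]


def incident_note(storage_url: str) -> str:
    """Format the text that tells responders where the debug data was stored."""
    return f"Debug data captured and stored at: {storage_url}"


# Example: a thread dump, assuming the JVM is PID 1 in the container.
THREAD_DUMP = kubectl_exec_command("prod", "web-0", "app", ["jstack", "1"])
```

Building the command as a list of arguments, rather than one shell string, avoids quoting problems when pod or namespace names come from incident data.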

Users of both our commercial automation products (Process Automation and Runbook Automation) and open source Rundeck can follow the instructions here to download and get started with the automated runbook.

This automated runbook is great when the container image already has the command-line utilities (binaries) needed for debugging. For example, many containerized Java apps ship with the jstack utility in the container image; however, what happens when the debugging utilities are not shipped as part of the container image? Or, as is increasingly commonplace, what happens when the container is “distro-less,” and therefore will not even provide a shell?

This is where Kubernetes Ephemeral Containers come into play—providing users a mechanism to attach a container (of any image) to a running pod without the need to modify the pod definition or redeploy the pod.

By sharing the process namespace, the ephemeral container can use its debugging utilities for another container in the pod—even if the original container is in a crashed state. Here is a blog by Ivan Velichko that goes into great detail about process-namespace sharing with ephemeral containers:

Source: https://iximiuz.com/en/posts/kubernetes-ephemeral-containers/

Similar to using kubectl exec, leveraging ephemeral containers properly still requires access to executing kubectl commands on the Kubernetes cluster—which is rarely available to those outside operations. And just as before, knowing how to properly construct the command takes a superior level of familiarity with Kubernetes:

kubectl debug -it -n ${namespace} -c debugger --image=busybox --share-processes ${pod_name}
(Sample command for using Kubernetes Ephemeral Containers)
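For readers who want to script this, here is a small Python sketch that assembles a similar command from its parts. One deliberate difference from the sample above, and an assumption on my part about what suits your cluster: it passes kubectl debug's --target option, which names the container whose processes the ephemeral container should target. The pod, namespace, and container names are hypothetical.

```python
from typing import List


def kubectl_debug_command(
    namespace: str,
    pod: str,
    target_container: str,
    image: str = "busybox",
    debugger_name: str = "debugger",
) -> List[str]:
    """Build a kubectl debug call that attaches an ephemeral container to a pod."""
    return [
        "kubectl", "debug", "-it",
        "-n", namespace,
        "-c", debugger_name,               # name given to the ephemeral container
        f"--image={image}",                # image that carries the debugging tools
        f"--target={target_container}",    # container whose processes we want to see
        pod,
    ]
```

Swapping the image for one that ships the tools you need (for example, a JDK image for jstack) is the main knob to turn.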

To accommodate users that have containers without debugging utilities or distro-less containers, we have built a new Kubernetes plugin that harnesses the ephemeral containers functionality:

We have used this plugin in a template for an automated runbook that also captures diagnostic output and sends the output to a persistent location. Process Automation and Runbook Automation users can get started with this template job by downloading it as part of the Automated Diagnostics Project here.

If you do not yet have a Process Automation or Runbook Automation account, click here to get started with PagerDuty’s automation products.

The post Debugging Kubernetes with Automated Runbooks & Ephemeral Containers appeared first on PagerDuty.

PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments by Joseph Mandros

Joseph Mandros — Tue, 28 Mar 2023 13:00:59 +0000

Be sure to register for the launch webinar on Thursday, March 30th to learn more about the latest release from the PagerDuty Operations Cloud.

Rundeck by PagerDuty has long helped organizations bridge operational silos and automate away IT tasks so teams can focus more time on building and less time putting out fires. And while this mission still rings true today, our vision is to extend this reality and revolutionize all operations while continuing to build trust.

To resolve high-impact work faster and more efficiently, the PagerDuty Operations Cloud delivers value across every IT environment; whether it be pre-production or production, isolated or secure, multi-cloud or on premise—you name it. We want to meet our customers where they are and deliver the value they need.

Starting today, that vision is now a reality.

We are thrilled to introduce a next-generation architecture for PagerDuty Runbook Automation and PagerDuty Process Automation that simplifies how our customers manage automation across cloud, remote, and hybrid environments.

This latest functionality, among others, is why Runbook Automation is an integral part of the PagerDuty Operations Cloud. Now PagerDuty helps automate across any infrastructure, multi-zoned hybrid environment, network, and more to resolve that unplanned, time-sensitive, and high-impact work we know about all too well.

Standardizing automation across secure infrastructure

It’s clear that automation has become a necessity in order for businesses to keep pace with the rapid transformations happening across the technical landscape. These businesses have to sustain growth and transformation while also doing more with the same, or even fewer, resources. Additionally, segregated environments and disparate services add complexity via hybrid cloud realities and increasing security and regulatory requirements. This sprawl of IT environments has led to a new dimension of organizational silos, along with departmental and technical silos.

One thing is for sure: When built, conventional automation tooling didn’t anticipate the complexity of security requirements in modern distributed environments. As a result, engineers have to manually execute tasks for operations within each environment, causing long wait times, more personnel time consumed, and higher levels of engineering toil. To solve this problem of fragmented automation, something more is needed. Teams need full visibility across their entire infrastructure and the ability to seamlessly execute distributed automation jobs—without having to manually build new automated operations into each project and environment.

With this new functionality, instead of having to manually invoke an automation step in each environment, engineers can now manage and run automated tasks and distribute that automation across their many segregated environments from a single point of administration.

As a result, teams will be able to:

Operate faster by enabling automated operations across cloud and data center environments
Simplify security when operating in high-compliance and zero-trust architectures
Eliminate toil by speeding up task resolution and reducing personnel time across all zones, environments, and networks

In order to better understand how this is made possible by the new functionality, let’s touch on some of the challenges we are looking to solve for our current and future customers.

Enabling scale and efficiency with security in mind

While it is true that automation can unlock new levels of scale and potential for innovation, it also brings with it critical challenges around added complexity, connectivity, and security. For technology teams, this means additional dependencies inside isolated environments that need to be maintained, distributed network endpoints to keep in check, and islands of fragmented automation spread across remote and local environments that need to be securely managed and run.

One of the bigger challenges that we hear from our customers is around managing and running automation across environments with high security and compliance requirements. In many cases, engineers have to manually manage each of their several isolated environments due to the many security nuances and process dependencies within each zone.

Now, PagerDuty Runbook Automation can be that connectivity conduit across our customers’ distributed operations, which have strict requirements for:

Disparate environments? No problem: Runbook Automation and Process Automation can now authorize orchestration of automation steps in remote environments as if they were local, and allow incorporation of many environments in the same job definition. This eliminates the network silos that typically compromise automation and require manual log-ins to run properly in those environments.
Compliance audits? No problem: Runbook Automation and Process Automation simplify compliance by embedding access control and logging into automation, and now extend these capabilities into remote environments, all from a centralized control plane.
Zero trust security? No problem: For customers with high security requirements, Runbook Automation and Process Automation can now enable connectivity without the need to open ports in their firewalls, such as SSH, enabling remote operations. This new functionality simplifies secure connectivity to automation by reducing the need for customers to deploy their own bastion or jump host and public endpoints.

Example diagram of PagerDuty Runbook Automation running an automated diagnostic process in remote environments to capture environmental state.

New Runner functionality

The Runner is a remote execution point purpose-built for node steps to run on specified endpoints, rather than from the automation server itself. The Runner, available for both Process Automation and Runbook Automation, securely opens up network communication between data centers, remote environments, and the automation cluster.

The new release offers a next-generation Runner that is now integrated with common infrastructure such as Ansible, Docker, and Kubernetes that execute locally within the private network. The new architecture now allows job authors to develop automated jobs that incorporate multiple environments.

New feature highlights

Run automation anywhere with next-generation Runners that provide secure and resilient connectivity from within remote environments.
Support complex architectures and jobs with distributed automation steps that enable the orchestration of standardized automation to work across any environment.
Simplify management with an enhanced Runner UI and APIs that simplify administration of Runners from the central automation environment, including configuration, status, and managing credentials.
Integrate your existing stack with plugins available on remote Runners for common technologies like Ansible, WinRM, Kubernetes, and Docker that can execute in local and remote environments.

Process Automation and Runbook Automation can now provide the same breadth of automation workflows with execution steps for Ansible or Kubernetes in remote environments that will only continue to strengthen as we blaze this trail of new distributed automation capabilities for our customers.

Looking ahead

These new automation features from Runbook Automation and Process Automation are just the beginning, and strengthen the value of the PagerDuty Operations Cloud by providing more flexibility for our customers to create triggered workflows across a wider variety of secure environments.

Register for our webinar on Thursday, March 30th to hear more about the latest release from the PagerDuty Process Automation portfolio. If you have any questions or are interested in learning more, make sure to contact your account manager and visit our Process Automation page.

The post PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments appeared first on PagerDuty.

Calculating Business Value of Automation in PagerDuty Process Automation by Greg Chase

Greg Chase — Wed, 01 Mar 2023 14:00:40 +0000

Budgets in IT departments are tight these days, so proving a return on investment is essential for justifying or expanding a project. The good news is that automation saves money by reducing the amount of human effort required. It is similar to investing in a robot vacuum cleaner. Despite the upfront cost, you save time (and money) by not having humans do the vacuuming.

Reporting the value delivered by an automation program can be challenging since the value depends heavily on what is being automated. Your project proposal may forecast time and cost savings by automating certain manual tasks. Tracking and reporting those savings is how you show the business impact of your projects. So how can you simplify tracking and reporting?

We have a feature in PagerDuty Process Automation that can help: the ROI Metric Data plugin. The ROI Metric Data plugin follows the simple principle that every time an automation runs, it delivers value. The automation developer specifies value metrics by defining key values such as hours-saved:10 for their automation.

Whenever the job executes, these metric values are added to the log entry of the execution. The plugin also provides an endpoint to extract the JSON records of these runs, along with other metadata about the executions, making it possible to compile, calculate, and analyze these metrics over time.
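To illustrate what analyzing these records could look like, here is a minimal Python sketch that totals one metric per job from an exported JSON document. The record layout (the job, status, and metrics fields) is a hypothetical shape chosen for the example; check the plugin documentation for the actual structure returned by the endpoint.

```python
import json
from collections import defaultdict
from typing import Dict

# Hypothetical export; the real field names depend on the plugin's endpoint.
SAMPLE_EXPORT = """
[
  {"job": "data-transfer", "status": "succeeded", "metrics": {"hours_saved": "10"}},
  {"job": "data-transfer", "status": "succeeded", "metrics": {"hours_saved": "10"}},
  {"job": "restart-service", "status": "succeeded", "metrics": {"hours_saved": "1.25"}},
  {"job": "cleanup", "status": "succeeded", "metrics": {}}
]
"""


def total_by_job(raw_json: str, key: str) -> Dict[str, float]:
    """Sum one metric across all execution records, grouped by job name."""
    totals: Dict[str, float] = defaultdict(float)
    for record in json.loads(raw_json):
        value = record.get("metrics", {}).get(key)
        if value is not None:
            # Runs that did not log this metric are skipped, not counted as zero.
            totals[record["job"]] += float(value)
    return dict(totals)
```

With the sample data, the data-transfer job totals 20 hours and restart-service totals 1.25, while cleanup is left out because it logged no hours_saved value.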

Here are some patterns you can follow to track the business value delivered by your automation projects.

Reporting savings from reduced labor costs

The most direct benefit of automating a task is the cost savings of the labor it replaces. Take this use case shared by Robert Powers from Brinks at PagerDuty Summit 2022. Their as-is process was a recurring data transfer job that took a staff member 5 to 10 hours to complete manually.

By automating the process with PagerDuty Process Automation, they turned this process from being ¼ of one person’s job every week into an automated task that takes zero human time.

Cost, opportunity and benefit criteria of a data transfer automation project

To use the ROI Metric Data Plugin to track the value generated in this scenario, you would simply define a metric hours_saved with a value of 10 to include this metric in the execution records of this process. This will give you an easy metric to be able to export to show total hours saved per execution of this process. We chose this arbitrary key-value approach since these values can change over time as you add capabilities to your automated job. This way, you can compare the value of newer versions of your automation to older versions when charting data—provided you don’t change the key names.

For your own scenario, you will want to determine how much time is spent by workers manually completing tasks that you will be automating. This can be as accurate as you want the end result to be. Estimates are OK, or you can develop an average time spent through observations. The average or estimate will be the value you pair with a key such as hours_saved. You can break these out by employee job type if you want to track cost savings or changes in workload distribution. Simply define more key-value pairs: DBA_hours_saved, senior_engineer_hours_saved. If you want to calculate a return on investment, you’ll also want to keep track of the hours needed to create the automations. You can also define values in monetary terms, or convert hours to monetary value during your analysis.

Here we have created two key-value pairs to be logged per job run: Hours_Saved : 1.25, and Dollars_Saved : 250.

Upload the job execution data to your favorite reporting tool, such as Tableau. You can chart the compilation of your different metrics by user and job over time. For example, you can show hours saved from manual executions by users vs. scheduled job runs. You can calculate money saved either directly from metrics you have defined, or by converting different hours metrics to costs.

Here is an example of charting the logged data showing increasing money and time savings from scheduled job runs and user-invoked job runs.

Converting these metrics to return on investment requires adding the costs associated with the implementation of automation. In the customer scenario we shared above, 20 FTE hours (assuming equivalent labor costs) was the cost to create the automated process. If this includes maintenance over a year, this looks like: 520 FTE Hours Saved – 20 FTE Hours to automate = 500 hours saved in just the first year of operation.
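The arithmetic can be written out as a short Python sketch. It assumes the upper figure of 10 hours per weekly run over 52 weeks, which is what yields the 520 hours quoted above. The ratio function uses the usual definition of return on investment (net gain divided by cost) and is an addition for illustration, not a figure from the customer.

```python
def net_hours_saved(hours_per_run: float, runs_per_year: int, hours_to_build: float) -> float:
    """Gross hours saved in a year minus the hours spent building the automation."""
    return hours_per_run * runs_per_year - hours_to_build


def roi_ratio(net_hours: float, hours_to_build: float) -> float:
    """Net hours saved per hour invested."""
    return net_hours / hours_to_build


# 10 hours x 52 weekly runs = 520 hours; minus 20 hours to automate = 500 hours.
FIRST_YEAR_NET = net_hours_saved(10, 52, 20)
FIRST_YEAR_ROI = roi_ratio(FIRST_YEAR_NET, 20)
```

Under these assumptions, each hour spent building the automation returns 25 hours in the first year.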

Adjusting metrics by automation outcome

Following the principle that automation delivers value whenever that automation runs, we may wish to calculate value according to the outcome of these runs. This would mean filtering out automation runs that are unsuccessful.

There are different reasons why an automation execution may be unsuccessful. There may be problems with the job definition itself, or errors reported by nodes and workflow steps that don’t otherwise terminate the job. In such cases, you may wish to filter those executions out of your value calculation.

Example job run with a failed step

When running analytics we can choose to filter out unsuccessful runs due to external failures from integrated systems.

Example chart showing hours and money saved and job status
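A minimal Python sketch of that filtering step follows. The record fields and the status value "succeeded" are assumptions chosen for the example; use whatever status values your exported execution data actually contains.

```python
from typing import Dict, List, Sequence

# Hypothetical execution records; field names and statuses are illustrative.
RUNS: List[Dict[str, object]] = [
    {"job": "data-transfer", "status": "succeeded", "hours_saved": 10},
    {"job": "data-transfer", "status": "failed", "hours_saved": 10},
    {"job": "data-transfer", "status": "succeeded", "hours_saved": 10},
]


def successful_runs(records: List[Dict[str, object]],
                    ok_statuses: Sequence[str] = ("succeeded",)) -> List[Dict[str, object]]:
    """Keep only the executions whose outcome should count toward value."""
    return [record for record in records if record.get("status") in ok_statuses]


def credited_hours(records: List[Dict[str, object]]) -> float:
    """Total hours saved across successful executions only."""
    return sum(float(record.get("hours_saved", 0)) for record in successful_runs(records))
```

Here the failed run is excluded, so only 20 of the 30 logged hours are credited.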

The ROI Metric Data Plugin is available in PagerDuty Process Automation as of Version 4.7, and is also available as part of PagerDuty Runbook Automation. To learn more about working with the ROI Metric Data Plugin, check out the Process Automation Documentation.

If you are not already a user of PagerDuty Process Automation or PagerDuty Runbook Automation, schedule a demonstration or trial today!

The post Calculating Business Value of Automation in PagerDuty Process Automation appeared first on PagerDuty.