runbook | Tags | PagerDuty

Democratize Automation with AI-Generated Runbooks by Ranjana Devaji

Ranjana Devaji — Thu, 31 Aug 2023 12:00:57 +0000

Operational efficiency is as critical within IT and engineering teams as in any other part of the business. Automating repetitive tasks and reducing escalations within and to these teams is of immense value.

While automation saves time and boosts productivity, the complexity of developing automation can be a limiting factor and a bottleneck. Generative AI is a paradigm shift here, in that it brings consumer-style simplicity to the development of enterprise-grade automation.

With the new interface of generative AI, organizations can democratize automation and thereby increase the number of people who author and use automation.

To help our customers achieve their goals with automation, we are excited to announce public Early Access for AI-generated Runbooks. Starting today, Runbook Automation users can write the task they wish to automate in plain English and let AI build an automation template for that particular task.

AI-generated Runbooks lower the barrier to entry for new automation developers and speed up the creation of new automation for experienced automation authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task.

Simply sign up for the PagerDuty Runbook Automation Trial if you are not an existing Runbook Automation user. For existing Runbook Automation customers, App administrators can enable this feature at any time.

Tangible Benefits from Leveraging AI-generated Runbooks

Better Development Timelines

Are you a seasoned automation engineer? AI-generated runbooks will help you save time and effort.

Creating self-service automation for tasks involves sifting through documentation, identifying the right calls/commands, and then transposing them into individual job steps manually.
With AI-generated runbooks, authors can generate these on the fly and faster than ever before.

Here’s a quick look at what provisioning access to apps in Okta looks like with and without AI-generated Runbooks:

Faster Onboarding

Get good fast with example jobs for reference. Start with tasks you’re familiar with and see how these tasks operate within the Runbook Automation platform.

Democratize Technical Automation

With AI-generated Runbooks, less experienced automation authors can quickly go from thought to development to implementation. This broadens the range of people within an organization who can leverage a technical tool such as Runbook Automation.

Conquering Blank Slate Problems

A typical conundrum for users facing automation is “Where do I start?” With AI-generated runbooks, users can now hit the ground running by creating baseline versions of Jobs for their various use cases.

Build with Best Practices for Optimal Results

The AI-generated Runbooks use fine-tuned “prompt engineering” that embeds the known best practices of the engineers here on the Process Automation team at PagerDuty.
For example, all Jobs are created with a ReadMe that explains the prerequisites for invoking that Job, and all secrets used within the Job are retrieved from Key Storage rather than requested from the user.

GenAI – Security & Data

A common concern we hear from customers considering generative AI is the security of their data, and the potential of giving competitors an advantage through model training. PagerDuty’s AI-generated Runbooks feature is opt-in, meaning you must enable it before you can use it. Furthermore, as stated in the feature documentation:

The only data sent to the generative AI model is the text entered into the prompt field. No other data about your environment, existing Jobs or the source of the prompt is sent to the AI model. Furthermore, the AI model is not trained on the text entered into the prompt.

Read our Guidelines for the Safe and Secure Use of Generative AI to learn more about how we’re working with and building our AI-powered features.

See AI-generated Runbooks in Action

AI-generated Runbooks propel automation into a new realm, where your operations are empowered like never before. Embrace the future of automation with PagerDuty, and witness the transformation it brings to your operational landscape.

The post Democratize Automation with AI-Generated Runbooks appeared first on PagerDuty.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers by Jake Cohen

Jake Cohen — Tue, 02 May 2023 13:00:01 +0000

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common concrete example is an application running in a container that gets redeployed (perhaps to a prior version or the same version) simply to solve the immediate issue. For companies where every millisecond of performance and every second of uptime has consequential impacts on the customer experience, these types of short-term fixes are a necessity. The costs to the business become significant, though, when engineers are tasked with developing the long-term solution to these incidents. For both major and (recurring) minor incidents, engineers have to spend inordinate amounts of time gathering evidence of the state of the application and environment when the incident occurred.

While a good portion of this diagnostic data resides in monitoring tools and therefore persists, there are times when it is necessary to get a shell in a container to retrieve information that is only available for the lifetime of the container. In Kubernetes, this is done using the kubectl exec command. With the right parameters, users can get a live shell in their running container and start executing commands to retrieve diagnostics. For example, once a user has a shell in a Java container, they can invoke jstack to get a thread dump of their application.
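As a rough illustration of the kubectl exec step described above, the sketch below assembles the command for a thread dump. The namespace, pod, and container names are hypothetical, and it assumes the JVM runs as PID 1 inside the container, which is common but not guaranteed.

```python
import shlex
from typing import List


def build_jstack_command(namespace: str, pod: str, container: str,
                         java_pid: int = 1) -> List[str]:
    """Return the argv for running jstack inside a container via kubectl exec.

    java_pid defaults to 1 on the assumption that the JVM is the
    container's main process; adjust it if that is not the case.
    """
    return [
        "kubectl", "exec",
        "-n", namespace,      # namespace of the pod
        pod,
        "-c", container,      # container within the pod
        "--",                 # everything after this runs in the container
        "jstack", str(java_pid),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    argv = build_jstack_command("payments", "checkout-7d9f", "app")
    print(shlex.join(argv))
```

Building the command as a list rather than a string keeps it safe to pass to subprocess.run without shell quoting issues.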

But many operations teams do not let anyone exec into production pods (which is where critical incidents happen), or they allow only a handful of people to do so, both for security reasons and because few people are familiar with operating in Kubernetes. Consequently, in order to retrieve diagnostic data during an incident, individuals with Kubernetes access and expertise regularly need to be pulled in for help. This process drives up the cost of incidents by increasing both MTTR and the number of people who need to get involved.

For these reasons, it is best to use automation that removes the need for users to exec into running pods. With this automation architecture, when an issue occurs, an automated runbook is invoked, and that runbook retrieves the debug data, sends it to a persistent storage location (S3, Blob Storage, SFTP server, etc.), and then informs the engineers where they can locate and use the debug data.
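The three stages described above (retrieve, store, inform) can be sketched as a small pipeline. This is a minimal illustration, not the PagerDuty runbook itself: the DebugArtifact type, the function names, and the in-memory stand-ins for storage and notification are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DebugArtifact:
    """Diagnostic output captured from a pod (hypothetical structure)."""
    name: str
    content: str


def run_diagnostic_runbook(
    incident_id: str,
    collect: Callable[[], DebugArtifact],
    store: Callable[[DebugArtifact], str],
    notify: Callable[[str, str], None],
) -> str:
    """Collect debug data, persist it, and tell responders where it is."""
    artifact = collect()            # 1. retrieve the debug data
    location = store(artifact)      # 2. send it to persistent storage
    notify(incident_id, location)   # 3. inform engineers of the location
    return location


if __name__ == "__main__":
    # In-memory stand-ins so the sketch runs without any cloud services.
    storage: Dict[str, str] = {}
    notes: List[Tuple[str, str]] = []

    def fake_collect() -> DebugArtifact:
        return DebugArtifact("thread-dump.txt", "example thread dump")

    def fake_store(artifact: DebugArtifact) -> str:
        storage[artifact.name] = artifact.content
        return f"memory://debug/{artifact.name}"

    def fake_notify(incident_id: str, location: str) -> None:
        notes.append((incident_id, location))

    print(run_diagnostic_runbook("INC-1", fake_collect, fake_store, fake_notify))
```

Passing the three stages in as functions keeps the flow independent of any particular storage backend or incident tool.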

PagerDuty Process Automation provides a pre-built, templatized runbook for exactly this use case: when an alert creates an incident inside PagerDuty, this can automatically (or at the click of a button) trigger the runbook to execute commands in the pod, send the output to persistent storage, and provide details on the location of that data in the incident.

Link to debug data is provided to engineers during and after the incident

Users of both our commercial automation products (Process Automation and Runbook Automation) and open source Rundeck can follow the instructions here to download and get started with the automated runbook.

This automated runbook is great when the container image already has the command-line utilities (binaries) needed for debugging. For example, many containerized Java apps ship with the jstack utility in the container image; however, what happens when the debugging utilities are not shipped as part of the container image? Or, as is increasingly commonplace, what happens when the container is “distro-less,” and therefore will not even provide a shell?

This is where Kubernetes Ephemeral Containers come into play—providing users a mechanism to attach a container (of any image) to a running pod without the need to modify the pod definition or redeploy the pod.

By sharing the process namespace, the ephemeral container can use its debugging utilities for another container in the pod—even if the original container is in a crashed state. Here is a blog by Ivan Velichko that goes into great detail about process-namespace sharing with ephemeral containers:

Source: https://iximiuz.com/en/posts/kubernetes-ephemeral-containers/

Similar to using kubectl exec, leveraging ephemeral containers properly still requires the ability to execute kubectl commands on the Kubernetes cluster, which is rarely available to those outside operations. And just as before, knowing how to properly construct the command takes a high level of familiarity with Kubernetes:

kubectl debug -it -n ${namespace} -c debugger --image=busybox --share-processes ${pod_name}
(Sample command for using Kubernetes Ephemeral Containers)
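To show how automation might assemble such a command from parameters, here is a minimal sketch that mirrors the flags of the sample command above. The function and argument names are hypothetical. The optional target_container argument is an addition not present in the sample: it maps to kubectl debug's --target flag, which points the ephemeral container at a specific container's process namespace.

```python
from typing import List, Optional


def build_debug_command(
    namespace: str,
    pod_name: str,
    image: str = "busybox",
    debugger_name: str = "debugger",
    target_container: Optional[str] = None,
) -> List[str]:
    """Return the argv for attaching an ephemeral debug container to a pod."""
    argv = [
        "kubectl", "debug", "-it",
        "-n", namespace,
        "-c", debugger_name,       # name given to the ephemeral container
        f"--image={image}",        # image that carries the debugging tools
        "--share-processes",       # copied from the sample command
    ]
    if target_container:
        # Assumption: the caller knows which container to inspect.
        argv.append(f"--target={target_container}")
    argv.append(pod_name)
    return argv


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(build_debug_command("payments", "checkout-7d9f")))
```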

To accommodate users who have containers without debugging utilities or distro-less containers, we have built a new Kubernetes plugin that harnesses the ephemeral containers functionality.

We have used this plugin in a template for an automated runbook that also captures diagnostic output and sends the output to a persistent location. Process Automation and Runbook Automation users can get started with this template job by downloading it as part of the Automated Diagnostics Project here.

If you do not yet have a Process Automation or Runbook Automation account, click here to get started with PagerDuty’s automation products.

Getting Started Workshop: Rundeck By PagerDuty by Nisha Prajapati

Nisha Prajapati — Mon, 12 Dec 2022 20:49:04 +0000

Now Available on AWS Marketplace: PagerDuty® Runbook Automation and PagerDuty® Process Automation On Prem by Inga Weizman

Inga Weizman — Thu, 19 May 2022 19:13:09 +0000

We are excited to announce that PagerDuty® Runbook Automation and PagerDuty® Process Automation On Prem are now available on the AWS Marketplace. AWS is the leading global cloud provider, and with more than 200 different cloud services, it makes it simple and attractive to build and grow your cloud-native business and/or migrate your existing infrastructure to the cloud, so you can begin to take advantage of the unlimited scale, agility, and flexibility the cloud offers.

As organizations flock to the cloud and begin to transition and transform from a centralized, monolithic architecture to a hybrid environment, this newfound freedom can lead to more incidents due to the dynamic nature of the cloud. Organizations will look to adopt and align new technologies, operating processes, and people as they look to scale and grow in the cloud.

Own To Innovate

Adopting Service Ownership is paramount for a successful transformation. It’s the “you build it, you own it” framework that puts your developers closer to your customers in order to speed up innovation and deliver high-quality code. With this newly gained freedom to move faster and innovate, organizations are able to deliver more value to their customers. However, this newfound freedom can also cause downtime that will impact your customers, put your brand at risk, pull your developers away from planned work, and ultimately slow down the pace of innovation.

In the cloud, the majority of incidents occur at the application level, so it’s critical to have complete visibility across all your services, quickly orchestrate a streamlined response, and automate as much as possible to fix issues without human intervention. For businesses to scale, grow, and move fast, automation plays a key role in operational maturity: it frees up developers’ time so they can innovate more, without sacrificing the customer experience or getting bogged down in tickets.

The PagerDuty Operations Cloud helps digital businesses manage all aspects of urgent and mission-critical work. It integrates into and across enterprises, people, and technology to identify, escalate, automate, and resolve urgent and time-sensitive work for these businesses before customers, employees, or the business’s reputation are impacted.

Automate to Deliver More for Customers

PagerDuty® Process Automation On Prem and PagerDuty® Runbook Automation provide an automation platform that allows cloud ops teams to standardize and automate operational procedures, and then safely delegate them as self-service requests to other stakeholders. AWS Console, administrative functions, instances, and software can all be incorporated as nodes and steps in automated sequences. Integrations with SSO, secrets management, and job-level audit logging ensure proper access control and compliance.

With PagerDuty Process Automation and PagerDuty Runbook Automation, teams can:

Resolve requests in minutes, either without human intervention or by giving first responders automated runbook actions
Optimize security and compliance with authentication, access control, logging of every activity, and context checking that ensures users only invoke actions at the right times
Spend more time innovating for customers rather than being pulled away to close tickets and manage incidents

Not sure which version to choose? Check out our blog “Five Considerations for Choosing Self-Managed Automation vs. SaaS Automation” for some suggestions.

Check out our Automation offerings here and learn more in our upcoming webinar on May 26th.

4 Automation Talks To Watch from Summit 2021 by Mandi Walls

Mandi Walls — Tue, 13 Jul 2021 13:00:46 +0000

PagerDuty Summit 2021 is a wrap, but that doesn’t mean you have to miss out on all of the great content we presented! All of the sessions will be available on the Summit Platform for a few more weeks.

Automation is a big topic this year. Teams work in environments that just keep getting more complex, so deploying automation into your response processes can help manage all the bits and pieces that need to get done—and get done fast. Rundeck joined PagerDuty last year, and this year we got to see some great content about automation at Summit. Plug Rundeck into your PagerDuty account to not just identify urgent work, but to get it done—automatically, safely, and only interrupting responders when it’s absolutely necessary.

Here are four on-demand sessions on automation that I’d recommend you check out.

Customer Use Cases

David Morse, a Principal Site Reliability Engineer at Parsons, joins Arturo Suarez Martin of Rundeck for “Adventures in Operational Automation.” Parsons has been around for 75 years and is pivoting to more digitally centered projects. David and his team provide tools to other developers to make them more efficient.

Automation is a big part of their projects, but they started by focusing on automating small pieces that could save time and linking those pieces together into bigger processes. Their successes with automation created interest across Parsons, and more teams got excited to adopt automation.

Ali Soheili and Andrea Valenti from Trimble present “How PagerDuty & Rundeck Drives Operational Maturity.” Ali is a Senior DevOps Engineering Lead, and Andrea is the Director of Cloud Engineering and Infrastructure. They shared Trimble’s story of a multi-layer organizational transformation to consolidate requirements across many teams into a single cloud team. They found benefits such as lengthening the duration of their on-call rotations, avoiding burnout, and reusing work across their projects.

Trimble uses Rundeck to help automate business processes like provisioning infrastructure, creating data sandboxes, and copying data from development to production for developers. They reduced the time it takes for these tasks to be completed from days to minutes, freeing up time for the SREs and helping developers to work more efficiently.

Rundeck also helps Trimble manage their incidents automatically instead of alerting human responders. Check out Ali and Andrea’s talk for all of the details, as well as a demo of one of their self-healing jobs using Rundeck.

PagerDuty Talk You Missed

Are you part of a “you build it, you run it” organization? Check out this session featuring Jake Cohen, “Adopting and Maturing to Service Ownership with PagerDuty and Rundeck” for a look at how teams can operate at a fast pace and at scale—while still maintaining valid and safe service ownership.

If you’re already using Rundeck, don’t miss out on the new features available in Rundeck 3.4, including enhancements to the security and compliance features and a slick new UI. Forrest Evans takes you through all of the details in “What’s New In Rundeck 3.4.” Additionally, don’t miss your opportunity to take a look at the new Rundeck content!

Also, we announced PagerDuty Runbook Actions. This integrated runbook automation for your PagerDuty account will give your teams the ability to incorporate more powerful automation into response workflows. Are you already a PagerDuty user and want to be in on the early access program for Runbook Actions? Runbook Actions will be generally available later this year. You can sign up for early access at https://www.pagerduty.com/whats-new/.

Catch Summit 2021 On Demand Until July 25th

Visit https://www.pagerduty.com/events/summit-2021/ for free access to the Summit recordings on demand until July 25th.

Note that you will need to create an account on this portal since it is not tied to your PagerDuty account. If this is your first time visiting the PDU portal, please click on the ‘Sign Up’ tab and create an account, using the access code pdusummit21.

If you have questions or just want to chat with us, join us in the PagerDuty Community at https://community.pagerduty.com. There you’ll have access to challenges to earn points for swag, and you’ll be the first to know about new features and events from PagerDuty and Rundeck.

From Ticket-Time to Real-Time: Changing the Status Quo of Operations Work by PagerDuty

PagerDuty — Tue, 15 Jun 2021 13:00:22 +0000

This blog was previously published on May 27th, 2021.

2020 Was…Rough

Keeping a digital business running has never been an easy job, especially over the last year. 2020 forced many businesses to accelerate their digital transformation initiatives faster than anyone imagined! Customers are demanding more capacity and reliability, the business is releasing more new services faster than ever before, and companies are learning to use new remote working models, straining systems and people.

Complexity is the New Normal

In operations, there has always been a mix of legacy and new applications. But the level of system complexity has increased with the rise of the public cloud, containers, and microservices, even for mid-sized SaaS companies.

Visual representation of services for a mid-size SaaS company

Operations teams are used to dealing with failures. However, with the rising scale and complexity of today’s services, problems and failures happen more often and can be much more difficult to solve. On top of all of that, there’s also the pressure to open things up so the organization can move faster, but also to lock things down and remain compliant.

Needless to say, staying ahead is no easy feat. How does a business go faster, but at the same time avoid risk? Enter the concept of real-time operations.

Why Real-Time Operations?

Everyone agrees that speed is a competitive advantage. So how does a business move faster? It’s almost impossible if Ops is in a reactive state. Unfortunately, that’s where a lot of businesses are today. We call this reactive state ticket-time operations.

Life in Operations has always been a mix of planned and unplanned work. Ops teams are frequently interrupted by someone who needs them to do something or they are interrupting someone with a request.

It’s an endless stream of requests in the form of tickets – often asking to do the same task over and over again. For example, the development teams may need the network team to make a change to a firewall rule every time there’s a new release. The network team has to drop what they are doing to make the change… but that change also needs to be approved by the security team before it goes live. Now the network team is interrupting the security team and waiting for them to help. Meanwhile everyone is juggling their own work.

The industry has gotten used to this way of working, and the results aren’t great. Engineers feel frustrated, overworked, and under-utilized, and business owners feel like everything takes too long, costs too much, and breaks too often.

So, here we are today. The demands of IT Operations are pushing things to the breaking point. It’s no longer sustainable to operate under the slow, high-friction, and high-cost burden of the ticket-time operating model. Instead, Operations needs to shift to what we call real-time Operations.

What do we mean by “real time”? Real-time is the ability to make decisions and take action at the speed of the business. It means instant communication and decision making. Instead of keeping information and control inside silos, it means distributing control out to the organization and letting people work at their own pace with end-to-end control.

Three Ways to Enable Real-Time Operations

1. Monitoring, Observability, and AIOps

Monitoring is an age-old practice that has traditionally been the domain of the Operations side of the house. Monitoring is about looking for patterns or events that are similar to those seen previously and alerting the appropriate folks when those conditions are triggered.

The ‘new’ kid on the block is observability, which measures how well you can understand a system’s internal states from its external outputs. Observability tools and methods help us interrogate our services to figure out what is really going on.

It’s built on:

Events: Is this discrete event something that has happened before?
Metrics: Looking at those events and asking – are things getting better or worse?
Distributed tracing: Looking within the new distributed infrastructures and understanding how these events cross through each component.

Although monitoring is traditionally owned by the operations side, we are seeing observability also driven by developers. Together, monitoring and observability help achieve real-time operations by creating deeper visibility between teams and helping us learn how systems work day-to-day.

Last but not least, there’s AIOps. AIOps is about combining tool capabilities to understand what’s happening in real time. AIOps provides solutions similar to existing event management solutions, but includes added capabilities required for today’s complex, modern environments such as machine learning, automation, flexible data collection and ingestion, powerful visualizations, and more. It’s about taking all the information and signals from all the infrastructure, aggregating metrics, reducing noise, improving correlation and understanding, and spotting patterns. Learn how to use AIOps for better Incident Management.
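To make "reducing noise" concrete, the sketch below shows one simple technique: grouping repeated alerts that share a service and summary within a time window. It is only an illustration of deduplication, not a description of how PagerDuty's AIOps features work. The alert fields and the five-minute window are assumptions.

```python
from typing import Dict, List, Tuple

Alert = Dict[str, object]


def group_alerts(alerts: List[Alert], window_seconds: int = 300) -> List[List[Alert]]:
    """Group alerts with the same (service, summary) into one bucket.

    An alert joins an existing group when it arrives within
    window_seconds of that group's first alert; otherwise it starts
    a new group.
    """
    groups: List[List[Alert]] = []
    latest_group: Dict[Tuple[object, object], int] = {}  # key -> index in groups

    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = (alert["service"], alert["summary"])
        index = latest_group.get(key)
        if (index is not None
                and alert["timestamp"] - groups[index][0]["timestamp"] <= window_seconds):
            groups[index].append(alert)
        else:
            groups.append([alert])
            latest_group[key] = len(groups) - 1
    return groups


if __name__ == "__main__":
    # Hypothetical alerts; timestamps are seconds since an arbitrary start.
    sample = [
        {"service": "api", "summary": "5xx spike", "timestamp": 0},
        {"service": "api", "summary": "5xx spike", "timestamp": 60},
        {"service": "db", "summary": "disk full", "timestamp": 30},
    ]
    for group in group_alerts(sample):
        print(group[0]["service"], group[0]["summary"], len(group))
```

Responders then see one entry per group instead of one page per raw alert.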

2. Service Ownership

In an increasingly complex digital world, the notion of service ownership becomes more and more important.

Organizations need to know:

What happens when something goes wrong?
What are the dependencies?
And who is the person responsible?

The service ownership practice helps build a map that answers these questions and helps businesses understand the interactions among the teams and technical systems they work with.

Services will fail; it’s a fact of operating. How a company responds when there’s a failure can make all the difference between keeping or losing customers.

Full-service ownership helps streamline the incident response lifecycle by empowering engineers to own their services in production, which reduces the number of handoffs and can significantly reduce MTTR when incidents occur. Placing subject matter experts, with direct knowledge of the systems they support, in the role of first responders helps decrease the inevitable chaos and panic that arise from uncertainty.

3. Self-Service Operations

For organizations trying to move from a reactive ticket-driven approach to a proactive approach, the self-service operations model is a key real-time operations enabler.

What does “real time” mean when it comes to self-service? Rather than having information and control be inside functional silos, self-service delegates control to the right people in the organization.

Part of self-service is communicating intelligence, like sharing system context, visibility, service ownership, the right runbooks, and decision support. The other part is freeing up subject matter experts to do work that adds value to the business – rather than continually getting interrupted by requests.

In an incident management scenario, this means first responders have the information and the control they need to be able to take action or to have AI take action on their behalf. This results in faster resolution and fewer disruptive escalations!

Self-Service With Runbook Automation

You can create self-service with runbook automation. Runbook automation allows the subject matter experts to define workflows that span different tools, scripts, APIs, permissions, credentials, and command line procedures and delegate that process to the people who need it.

Runbook automation enables the right people to safely complete tasks that previously only subject matter experts could do. It also allows your subject matter experts to take their best practices and turn them into common practices used by everyone.
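The delegation idea can be sketched in a few lines: an expert defines the steps once and names the roles allowed to run them. This is a conceptual illustration only. The Runbook class, role names, and steps are hypothetical and do not represent the Rundeck or PagerDuty job model.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Runbook:
    """A named sequence of steps that approved roles may invoke."""
    name: str
    steps: List[Callable[[], str]]
    allowed_roles: Set[str] = field(default_factory=set)

    def run(self, user: str, user_roles: Iterable[str]) -> List[str]:
        """Run every step in order if the user holds an allowed role."""
        if not self.allowed_roles.intersection(user_roles):
            raise PermissionError(f"{user} may not run {self.name}")
        return [step() for step in self.steps]


if __name__ == "__main__":
    # The expert encodes the procedure once...
    restart_cache = Runbook(
        name="restart-cache",
        steps=[lambda: "drained connections", lambda: "restarted cache"],
        allowed_roles={"first-responder"},
    )
    # ...and a first responder runs it without escalating.
    print(restart_cache.run("alex", ["first-responder"]))
```

The access check is what makes delegation safe: people outside the allowed roles cannot invoke the steps at all.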

Runbook automation can be used across the full life cycle. For incident response, responders can diagnose an issue and have at their fingertips the automated actions that they would normally have to escalate to the experts. This works for normal day-to-day service requests too. For provisioning, change, and maintenance tasks, instead of constantly waiting for someone to do something for you, runbook automation enables folks to complete the task for themselves. Learn more about self-service operations.

Our opportunity to transform how operations work gets done spans the entire operations lifecycle. Applying a real-time operations focus to these other Ops tasks can make a big difference in improving business velocity! To learn how PagerDuty can help, sign up for a free 14-day trial today.

Let’s Re-Run the Runbook! by Traci Myers

Traci Myers — Wed, 05 May 2021 23:49:13 +0000

ROI of Runbook Automation for Incident Management by Traci Myers

Traci Myers — Wed, 05 May 2021 23:18:45 +0000

You’ve Built It and Run It, Now Delegate It by Catherine Craglow

Catherine Craglow — Fri, 30 Apr 2021 22:57:03 +0000

Modernizing Incident Response: Quicker Resolution and Fewer Escalations by Catherine Craglow

Catherine Craglow — Wed, 28 Apr 2021 06:16:23 +0000
