
How to Maximize Time Savings and Reduce Toil During Incident Response by Laura Chu

Laura Chu — Mon, 31 Jul 2023 12:00:44 +0000

Incidents are a costly burden on businesses. Despite assembling the right people and teams, the manual work, tool setup and prolonged tasks can negatively impact customer experience. The need for adaptable processes to address diverse incident types further complicates the situation.

This is where the PagerDuty Operations Cloud steps in. It streamlines and automates all the various manual steps in the incident response process. The result is a cohesive and end-to-end incident management experience that frees up responders to focus on the critical thinking requirements to resolve the incident.

At the heart of the PagerDuty Operations Cloud lies Incident Response–the backbone for effectively managing an orchestrated response to address customer-impacting incidents. To help our customers build a resilient approach to digital operations, we aim to deliver a solution that is:

Automated to eliminate inefficiencies
Flexible to accommodate each team’s specific processes
Proactive to learn from failure and repeat incidents

This year, PagerDuty has introduced Incident Workflows, Custom Fields on Incidents and Status Update Notification Templates. These latest additions work in concert to further streamline incident management processes, enabling you and your team to focus on resolving incidents and delivering exceptional digital experiences to your customers. With every minute mattering in incident response, saving time during every step of the process becomes crucial, leading to a positive and impactful transformation in your business operations.

Here are Three Ways to Cut Down Incident Time

Experience significant time savings with Incident Workflows

Incident Workflows, a powerful capability within PagerDuty, empowers you to easily customize workflows for different incidents and automate manual steps by integrating them into a unified process. With Incident Workflows, actions can be orchestrated based on the incident type via a customizable, user-friendly no-code/low-code builder.

For example, let’s say your incident process requires five manual steps. With Incident Workflows, you can automate the entire process.

Responders no longer need to worry about manual steps once the Incident Workflow is configured. Instead, they can initiate the appropriate Incident Workflow (e.g., P1, P2), allowing the PagerDuty Operations Cloud to coordinate the right teams to promptly address and resolve incidents. This gives teams more time back to focus on the task at hand: resolving the incident.

Take advantage of our latest generally available Incident Workflow Templates, which enable you to quickly operationalize best practices for managing major incidents, standardizing collaboration tools and ensuring the right stakeholders are informed with the latest updates. These templates are designed to empower responders, who have not previously used Incident Workflows, to quickly adopt and implement this functionality, leading to faster incident resolution.

Better context for faster incident resolution

Context is key for responders during incidents. Having the right information is essential for sharing with other responders and helps guide their actions, such as sending status updates or writing a postmortem. Details such as “data regions” or “customer impact” help teams prioritize efforts effectively. To assist with this, PagerDuty introduced Custom Fields on Incidents.

This new feature allows teams to easily extract important incident data from any system of record and place it where responders can access it, whether on the incident details page or in a status update. PagerDuty empowers responders to save valuable time during triage and make more informed decisions by including relevant critical data.

Simplify stakeholder updates with notification templates

Effective communication with key stakeholders during incidents is crucial. However, crafting these notifications can be time-consuming and resource-intensive. By using Status Update Notification Templates, you can leverage customizable templates that alleviate the strain of writing communications, streamline the process and reduce the time and effort required to share critical updates.

These templates eliminate the guesswork in formatting updates by providing pre-designed templates tailored to your organization’s needs. With Status Update Notification Templates, you can streamline the process of sharing incident updates, ensuring clear and consistent communication with stakeholders.

Get 1+1=3 with the PagerDuty Operations Cloud

These features work great alone, but together they provide a better end-to-end incident management experience. With Incident Workflows, sending templated status updates becomes effortless, and soon, you’ll be able to include Custom Fields directly in those updates. For instance, imagine using a custom field to add an object like “data region” and seamlessly launch an Incident Workflow that includes a status update with the same custom field. In the near future, a responder will be able to automatically populate the same information to a Jira ticket or reassign the incident to the right regional responder.

This powerful orchestration across a unified platform allows you to streamline work across the entire incident lifecycle for maximum time savings, resulting in faster resolution and better customer experiences without impacting revenue.

Watch a demonstration of how these features work together.

Dynamic Digital Ecosystem

PagerDuty brings all of these capabilities to a desktop web interface, mobile application, chat experience and API, so you can work in whichever of the four suits you best.

Don’t Wait, Try it Out

PagerDuty empowers you to streamline your incident response process by leveraging the PagerDuty Operations Cloud with Incident Workflows and integrating various tools and templates. This integration optimizes your incident management, ensuring fast and effective response. As a result, your organization can experience reduced operating costs while freeing up resources to prioritize innovation and growth.

Curious to see these features in action? Embark on our Product Tour or try our free 14-day trial to witness the power of the PagerDuty Operations Cloud firsthand.

The post How to Maximize Time Savings and Reduce Toil During Incident Response appeared first on PagerDuty.

Strengthen Your DORA Metrics with PagerDuty by Mandi Walls

Mandi Walls — Fri, 30 Jun 2023 12:00:56 +0000

For technical teams, the findings from DORA provide a model for measuring and improving performance. With almost a decade of data gathered from more than 33,000 professionals worldwide, the capabilities and frameworks detailed by the research help teams pinpoint areas for improvement and areas to celebrate.

The team at DORA categorizes capabilities into three sections: Technical Capabilities, Process Capabilities and Cultural Capabilities. These are all important considerations for teams hoping to use the DORA findings to improve their own performance.

The DORA research is tool agnostic, and not prescriptive on how teams should go about improving their performance or prioritizing their goals, so it can be challenging to put together a tactical plan. With the myriad tools available to teams today, putting together the best combination for your team seems daunting.

If your team is using the DORA metrics as part of your improvement goals, the PagerDuty Operations Cloud can help. As a central component of your environment, the visibility PagerDuty brings will help with a number of key metrics and complement the tools that impact others. Several PagerDuty features fit well into the capabilities described by the DORA team.

Reducing Unplanned Work

A key benefit of using PagerDuty for incident response is reducing the overall time spent on unplanned work across the organization. This is important in several ways unrelated to the DORA capabilities, primarily around making better use of resources and focusing on work that provides the most value to the organization. Less time spent in incidents = more time spent delighting customers.

One of the challenges of incidents and unplanned work, from an organizational culture perspective, is that they can be invisible: the time isn’t tracked in work plans or documented as part of the time required to build a new feature. Making unplanned work visible helps teams manage the burden and work towards improving outcomes.

Work in process limits and visual management capabilities can both be improved by deploying PagerDuty to capture the impact of unplanned work and incidents on your team. Including data from PagerDuty related to how many engineers are impacted by out-of-hours incidents, overnight or “sleep time” incidents, and even work-day incidents gives managers additional insight into how teams are performing.
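As a rough illustration of that kind of reporting, the sketch below buckets incident creation times into business-hours, off-hours and sleep-hours groups. The hour boundaries, function names and sample timestamps are all assumptions made for this example; they are not taken from PagerDuty’s own analytics definitions, and real data would come from your incident export.

```python
from collections import Counter
from datetime import datetime

# Hypothetical boundaries (local time); adjust to your team's working agreement.
BUSINESS_START, BUSINESS_END = 8, 18   # 08:00-17:59, Monday to Friday
SLEEP_START, SLEEP_END = 22, 6         # 22:00-05:59, any day


def classify_incident(created_at: datetime) -> str:
    """Bucket an incident by the local time at which it was created."""
    hour = created_at.hour
    if hour >= SLEEP_START or hour < SLEEP_END:
        return "sleep_hours"
    is_weekday = created_at.weekday() < 5  # Monday is 0, Sunday is 6
    if is_weekday and BUSINESS_START <= hour < BUSINESS_END:
        return "business_hours"
    return "off_hours"


def interruption_summary(created_times: list[datetime]) -> Counter:
    """Count incidents per bucket so unplanned work becomes visible."""
    return Counter(classify_incident(t) for t in created_times)


# Made-up sample data standing in for exported incident timestamps.
incidents = [
    datetime(2023, 6, 26, 10, 15),  # Monday morning
    datetime(2023, 6, 27, 23, 40),  # Tuesday late night
    datetime(2023, 6, 28, 19, 5),   # Wednesday evening
    datetime(2023, 7, 1, 14, 30),   # Saturday afternoon
]
summary = interruption_summary(incidents)
```

A per-team or per-responder version of the same count is usually more useful to a manager than a single total, since it shows who is absorbing the overnight load.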

Integrations and Extensions

Integrations and extensions are powerful features that place PagerDuty at the center of your operational capabilities.

Integrations allow PagerDuty to receive information and alerts from other services, interrogate them, assign them to services, and initiate incidents. PagerDuty integrates with many third party services that provide monitoring, observability and tracing functionality for various types of events in your environment.

Extensions help you streamline your PagerDuty workflows with third-party tools like Slack, Microsoft Teams, Jira Cloud and Zoom that enhance your incident response experience.

Integrations and extensions mean that your teams can bring any number of tools with them to PagerDuty. The flexibility provided by more than 700 integrations gives all of your teams the tools they need, whether they are running machine learning applications, web platforms or databases.

Selecting the right tool for the job saves teams time and confusion, but it shouldn’t sacrifice the ability to respond to incidents and preserve reliability. Make PagerDuty part of your cross-team baseline, as described in the capability Empowering Teams to Choose Tools.

Change Events

Change events are non-alerting events that can be sent to PagerDuty from your build and deploy tools. They give your teams insight into what has changed in a service, and are an invaluable first-stop when investigating incidents.
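For a sense of what a build or deploy tool might send, here is a minimal sketch that assembles a change event and posts it over HTTPS. The endpoint URL and field names follow the Events API v2 change event format as commonly documented and should be checked against the current PagerDuty documentation. The routing key, service name and build details are placeholders invented for this example.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumed endpoint for Events API v2 change events; verify before use.
CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(routing_key, summary, source, details=None, when=None):
    """Assemble the JSON body for a change event (no network access)."""
    when = when or datetime.now(timezone.utc)
    return {
        "routing_key": routing_key,
        "payload": {
            "summary": summary,
            "source": source,
            "timestamp": when.isoformat(),
            "custom_details": details or {},
        },
    }


def send_change_event(event: dict) -> int:
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        CHANGE_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Placeholder values; a real pipeline would inject these at deploy time.
event = build_change_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Deployed checkout-service build 142 to production",
    source="ci-pipeline",
    details={"build": 142, "environment": "production"},
    when=datetime(2023, 6, 30, 12, 0, tzinfo=timezone.utc),
)
```

Calling `send_change_event(event)` as the last step of a deploy job would record the change against the service that owns the integration key. It is left uncalled here so the sketch runs without network access.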

Having a good continuous delivery practice is a key capability for DORA-aligned teams, and the research shows that teams using continuous delivery practices spend less time on unplanned work. Change events help your team shorten the time they do spend on unplanned work.

Change events can also provide wider visibility for deployment changes in your environment, improving your deployment automation, and helping streamline change approval. Even when teams are using different build and deployment tools, their change events can be captured by PagerDuty to help manage service reliability.

Automation

PagerDuty’s various Automation solutions play an important role in not just incident response, but in the completion of general technical tasks. Process Automation, Runbook Automation, and Automation Actions all contribute to teams spending less time on simple well-understood tasks and more time on work that adds value.

Many DORA capabilities emphasize automation, so using a general-purpose tool provides teams with a single interface for automating many tasks. One of the key capabilities strengthened by automation is cloud infrastructure, in which on-demand self-service is a requirement. Teams using PagerDuty’s automation solutions can create jobs and delegate work to whomever in the organization requires the tasks to be completed, creating true self-service workflows.

Terraform Provider

Related to automation is PagerDuty’s support for Infrastructure as Code (IaC) via the Terraform provider. IaC and similar solutions help teams track their changes to infrastructure and other components via Version Control, another of DORA’s technical capabilities.

Managing a large PagerDuty environment can be complicated using only the UI, but making use of Terraform to create objects and provide teams with templates helps everyone on the team improve the reproducibility and traceability of their changes.

Service Graph and Business Services

Finally, PagerDuty’s Service Graph and Business Services features enable teams to create relationships among services, illuminating the impact of incidents when they happen in large environments. Status Pages give the entire organization a place to look for service impacts during an incident, and how they relate to the customer.

Business services in PagerDuty are representations of user-facing features; users might see odd behaviors in a shopping cart experience, but will have no idea which backend service is causing the behavior. Building the relationships in the service graph provides data to your organization about the health of user-facing features and capabilities while also helping responders troubleshoot issues with dependencies.
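The idea of walking from a failing backend service up to the user-facing features that depend on it can be sketched as a small graph traversal. The service names and the dependency map below are hypothetical, and this is an illustration of the concept, not of how the Service Graph is implemented.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the services in its list.
DEPENDS_ON = {
    "Shopping Cart": ["cart-api", "pricing-service"],
    "Checkout": ["cart-api", "payments-service"],
    "cart-api": ["inventory-db"],
    "pricing-service": [],
    "payments-service": [],
    "inventory-db": [],
}
# User-facing features, as opposed to backend technical services.
BUSINESS_SERVICES = {"Shopping Cart", "Checkout"}


def impacted_business_services(failing: str) -> set[str]:
    """Return every business service that depends, directly or not, on `failing`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failing}
    queue = deque([failing])
    while queue:  # breadth-first walk up the dependency chain
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen & BUSINESS_SERVICES
```

With this map, a failure in `inventory-db` reaches both “Shopping Cart” and “Checkout” through `cart-api`, while a failure in `payments-service` reaches only “Checkout”.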

These features will help strengthen a team’s Monitoring and observability capability, as well as the Monitoring systems to inform business decisions capability.

Do more DORA with PagerDuty

The tools your team uses should help your organization reach its goals. Implementing PagerDuty’s features will help your organization improve not only in responding to incidents, but also in creating reliable services your users love. To learn more, visit our website and join our community forum.

The post Strengthen Your DORA Metrics with PagerDuty appeared first on PagerDuty.

What is Runbook Automation? by Catherine Craglow

Catherine Craglow — Wed, 03 May 2023 15:00:35 +0000

The post What is Runbook Automation? appeared first on PagerDuty.

Humans, Robots, and Incident Response by Nisha Prajapati

Nisha Prajapati — Fri, 03 Mar 2023 21:40:04 +0000

The post Humans, Robots, and Incident Response appeared first on PagerDuty.

A Deep-Dive Into PagerDuty’s New Incident Workflows by Ariel Russo

Ariel Russo — Mon, 13 Feb 2023 14:00:57 +0000

The more that automation can remove toil and take care of rote tasks in incident response, the faster teams can focus on problem identification and resolution. And the faster they can resolve an incident, the sooner they can get back to building new products and services.

One of the most highly anticipated announcements from our November launch was Incident Workflows to automate manual steps of incident response. Today, we are excited to share that this feature is now generally available. If you’re a PagerDuty customer, you can start using them today!

What are incident workflows?

Think about all of the manual steps in your incident response process–paging subject matter experts, setting up a conference bridge, establishing a Slack channel, sending out status updates… the list goes on. Remembering to take each of these steps every time an incident strikes adds unnecessary cognitive load to an already stressful process. These repeatable, manual tasks are perfect candidates for automation.

Incident Workflows automatically orchestrate the actions you already know you’ll need to take based on the type of incident at hand. Serving as an upgrade from Response Plays, now you can use a no-code/low-code builder to create customizable incident workflows that will reduce the manual work required to escalate, mobilize, and coordinate the right incident response for any use case. You can automatically trigger common incident response actions using if-this-then-that logic, such as adding a responder, subscribing stakeholders, or spinning up a conference bridge.

Let’s take a closer look at the functionality that makes Incident Workflows so powerful.

Conditional Triggers

Different incident types require different remediation steps. Conditional triggers allow you to create logic to kick off a workflow when criteria for certain incident fields are met–like changes to priority, urgency, or status. For example, you can now define a major incident workflow that is automatically triggered for all P1 and P2 incidents. And for the users who would like to double down on automation, you could even set priority using Event Orchestration and have your Incident Workflow pick up that priority change as a trigger for the major incident workflow.

You can also utilize manual triggers that allow a responder to start a workflow directly from the incident details page. When a manual trigger is added, Incident Workflows can be triggered from the PagerDuty web app, mobile app, Slack, Microsoft Teams, or directly through the API.
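To make the if-this-then-that idea concrete, the sketch below models a conditional trigger as a predicate over incident fields paired with a list of actions. The `Incident` and `Workflow` classes, the field values and the action names are all hypothetical and exist only for this illustration; in PagerDuty itself, workflows are configured in the no-code/low-code builder, not in Python.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Incident:
    title: str
    priority: str  # e.g. "P1" through "P5"
    urgency: str   # e.g. "high" or "low"
    status: str    # e.g. "triggered", "acknowledged", "resolved"


@dataclass
class Workflow:
    name: str
    condition: Callable[[Incident], bool]  # the "if this" part
    actions: list[str]                     # the "then that" part


def triggered_actions(incident: Incident, workflows: list[Workflow]) -> list[str]:
    """Collect the actions of every workflow whose condition matches the incident."""
    actions: list[str] = []
    for workflow in workflows:
        if workflow.condition(incident):
            actions.extend(workflow.actions)
    return actions


# A major incident workflow that fires for all P1 and P2 incidents.
major_incident = Workflow(
    name="Major incident",
    condition=lambda i: i.priority in {"P1", "P2"},
    actions=[
        "add responder",
        "subscribe stakeholders",
        "create Slack channel",
        "start conference bridge",
    ],
)

p1 = Incident("Checkout errors spiking", priority="P1", urgency="high", status="triggered")
p3 = Incident("Slow report export", priority="P3", urgency="low", status="triggered")
```

Here `triggered_actions(p1, [major_incident])` returns all four actions, while the P3 incident matches nothing and returns an empty list. A manual trigger would simply bypass the condition check.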

Enhanced CollabOps Actions

Save time setting up collaboration channels by configuring your workflow to do it for you. Use workflow actions to spin up a Zoom conference bridge or create a per-incident Slack channel that includes all incident responders and incident updates. P.S. If you’re a Microsoft Teams user, we’re working on comparable functionality–stay tuned.

Easy Communications & Coordination

Communicating to internal and external stakeholders in a timely and transparent way is essential to effective incident response. As my colleague, Hannah Culver, explained, having a prepared business response plan in place can help internal cross-functional teams work better together, while also helping customer-facing teams preserve customer trust through proactive communications. Incident Workflows make this piece of the puzzle easier by allowing users to automate adding responders to the incident, subscribing stakeholders, and posting status updates, so you can keep everyone who needs to know about the incident informed in real time.

Resolve On-the-Go

Incident Workflows are also available for PagerDuty Mobile on iOS and Android devices. Whether walking your dog or away from your desk, PagerDuty Mobile alerts you about the top open incidents you need to take action on. You can manually trigger a preconfigured incident workflow of custom response steps, from your incident detail screen, and start working with the right teams to resolve the incident from your preferred mobile devices. Learn more about triggering Incident Workflows from your mobile device in the Knowledge Base.

End-to-End Automation on a Unified Platform

Incident Workflows were built to pair with other automation features on the PagerDuty platform to help teams resolve faster and minimize impact to revenue. For example, Event Orchestration can simultaneously change priority to automatically trigger an Incident Workflow, while also kicking off automated diagnostics via Automation Actions so that the script is already run by the time the right responders are called and get to the incident-specific Slack channel that the Incident Workflow has spun up.

Conclusion

During Incident Response, responders should be free to focus on their core responsibility – resolving the incident. Incident Workflows give them the gift of being able to automate those other essential, yet tedious tasks in a standardized, repeatable way. By standardizing your incident response processes, you can ensure the right actions are taken across teams to reduce your time to incident resolution. Given that Incident Workflows are already included in Business and Digital Operations plans, it makes sense to utilize this great new functionality at no additional cost. Why piece together your tech, when you can achieve end-to-end incident response all in one place? Your management team asking about tool consolidation will certainly appreciate it.

This is just the first step in PagerDuty’s journey towards offering more extensibility for our customers to configure responses for their unique use cases. To learn more about Incident Workflows, check out our KB article and try it out today. For more in-depth learning, you can also sign up for the Incident Workflows PDU course on-demand.

The post A Deep-Dive Into PagerDuty’s New Incident Workflows appeared first on PagerDuty.

Automation Seasons Freezings Wrap Up and New Year’s Resolutions by Madeline Stack

Madeline Stack — Thu, 22 Dec 2022 14:00:38 +0000

It’s that time of year where you may feel pressured to pick your New Year’s resolutions. Well, we went ahead and tried to give you a head start. 2023 is the year we tame toil so we can focus on the fun stuff like engineering and innovation.

Hopefully you have had the chance to follow along with us for the month of December for Seasons Freezings, the time of year you are locked out of production, so you have time to explore new ideas like automation. If you missed it, this blog links to Seasons Freezings content you can consider even when you’re back in prod.

Resolution 1: More engineering, less toil.

Customers still tell us that toil is plaguing them. A survey run by PagerDuty found that 43% of respondents spend between 11% and 30% of their time doing routine, manual tasks. That is roughly 8 days a year total spent on interruptions and repetitive work. In the new year, we encourage you to take back some of this time by implementing automation into your existing processes.

You can get started simply by developing a list of repetitive tasks and “below the value line” work you can automate away. This could include restarting servers, fetching logs, opening, closing and updating tickets, provisioning infrastructure, updating user accounts, and so on.

Next, organize subject matter experts and stakeholders, agree on how to standardize completing these repetitive tasks, and automate them using PagerDuty Process Automation products or Rundeck open source. Find out more about toil by checking out this blog by Damon Edwards.

Once you’ve standardized the way to complete these operational tasks, maximize the value of this automation by delegating it out as self-service to end users. Start exploring self-service automation in your organization today. Watch this video to learn:

High-impact use cases for delegating IT processes as self-service automation
Design principles for creating self-service automation
How to fit delegated self-service to the way your end users work

You’re well on your way to “automate. delegate. celebrate!”

Make sure to watch this Twitch episode on how PagerDuty manages change freezes, thawing from a freeze, and how you can utilize a freeze to help tackle toil.

Resolution 2: Reduce interruptions by automating diagnostics

Tired of pesky pages for the same incident? While we can’t suggest you ignore the noise, PagerDuty’s Automated Diagnostics solution is a fantastic way to reduce interruptions for common, recurring incidents. The key is automating your most common troubleshooting procedures and giving your responders the ability to invoke automation inside the PagerDuty interface. By automating diagnostics, you can limit the disruptions of day-to-day work for expert engineers and reduce resolution time across the board.

Here are some useful Automated Diagnostics resources:

This blog details the issues around the initial stages of a response process and why your teams should care about automated diagnostics.
We even have some out-of-the-box job templates for automating common diagnostics for Kubernetes, Linux, and other common components. Read this blog to learn more.
Read the automated diagnostics solutions guide for more information on setting up the solution, use cases, and diagnostic examples.

Resolution 3: Get more out of existing automation with integrations

We know many companies have a lot of automation that they use today. The challenge can be making that automation secure and widely available for others to use. Many organizations use PagerDuty Process Automation or Rundeck to centralize and standardize automation.

To make this easy, PagerDuty Process Automation and Rundeck come bundled with many out-of-the-box plugins. Plugins provide a wrapper around scripts, interfaces, and utilities. Take a look at the full list.

One very popular pairing is integrating Ansible into PagerDuty Process Automation or Rundeck. Many users integrate Ansible playbooks into PagerDuty Process Automation or Rundeck to orchestrate and schedule workflows across multiple tools. Watch this video to learn the benefits of using PagerDuty Process Automation or Rundeck and Ansible, and some helpful tips to get started.

We would love to hear your new year’s resolutions for 2023. Tag Rundeck on Twitter to share!

The post Automation Seasons Freezings Wrap Up and New Year’s Resolutions appeared first on PagerDuty.

Tickets Make Operations Unnecessarily Miserable by Damon Edwards

Damon Edwards — Fri, 16 Dec 2022 14:00:40 +0000

IT Operations has always been difficult. There is always too much work to do—and not enough time to do it. The frequent interruptions and high levels of toil certainly don’t help. Moreover, there is relentless pressure from executives that question why everything takes too long, breaks too often, and costs too much.

In search of improvement, we have repeatedly bet on new tools to improve our work. We’ve cycled through new platforms (from virtualization, to cloud, to containers, to Kubernetes) and new automation (e.g., Ansible, Terraform, Pulumi). While each has provided benefits, would an average engineer say that the stress and overload on operations has fundamentally changed?

Enterprises have also spent the past two decades liberally applying management frameworks like ITIL and COBIT. Again, would an average engineer say things have gotten better or worse?

In the midst of all of this, there is conventional wisdom that rarely gets questioned—the extensive use of tickets to manage operations work.

Tickets have become the go-to work management tool in operations organizations. Need something done? Open a ticket. Someone needs something from you? A ticket will appear in your queue.

Ticket-driven ways of working have become so ubiquitous that few think twice about adding more ticket queues across an organization.

However, what if we were wrong about tickets? What if ticket queues were a significant source of operational strife hiding in plain sight? Let’s examine how ticket queues are often causing more harm than good.

What is wrong with tickets?

It’s the queues

Tickets on their own are relatively innocuous as they are just records. The issue is where you put those tickets. Tickets get put into ticket queues, and that is where the problems start.

Queues add delay, increase risks, add variability, add overhead, lower quality, and decrease motivation.

These aren’t my opinions. The cost of queues comes from extensive research in fields as diverse as physical manufacturing and product development. Queuing Theory is its own respected area of academic study.

In the rest of this post, I’ll use “tickets” and “queues” somewhat interchangeably. Just know that it is the queue part that causes the problems.

Tickets increase communication problems

Whenever parties are forced to communicate via ticket queues, there are going to be disconnects and miscommunication. Think about all of the problems you’ve probably had with tickets:

Requests are misunderstood.
The person reading a request lacks the context or is working in a different context.
The requester makes the wrong request or doesn’t understand the ramifications of what they are requesting.
As a request sits in the queue, the request parameters change (unknown to either party) or are no longer valid.

Tickets delay feedback and learning

Almost every modern management philosophy (from Lean to DevOps and the Lean Startup) hinges on the concept of the organization learning quicker. All strive for increasingly tighter feedback loops so that analysis and decisions happen faster.

However, as Scott Prugh likes to remind us all, “Queues don’t learn.” Queues work against feedback loops by both injecting delays and stripping each request of its context. It is tough to become a learning organization when there is rampant use of ticket-driven request queues.

Tickets encourage bottlenecks

The nature of how teams work through ticket queues encourages bottlenecks. First, ticket queues are often used where there are capacity mismatches. For example, ticket queues are commonly used to buffer requests made of specialist teams (e.g., network, database, security) who are largely outnumbered by people who need them to do something. These requesting teams “stuff” the queue with requests, which causes the queue length and response times to grow. Because queues delay feedback, the requestors are often unaware of (or don’t care about) the capacity mismatch and continue to stuff the queue. This behavior increases both the queue length and response time, worsening the bottleneck.

Additionally, as queue lengths grow, the teams responsible for working the queue will instinctively look inward to protect their capacity. This natural reaction leads to optimizations for the team behind the queue—often at the expense of the broader organization. For example, if a firewall team only makes non-emergency changes on Mondays and Thursdays, it creates a delay for the rest of the organization even if it helps the firewall team optimize its workload.
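The way response times balloon as requests approach a team’s capacity can be illustrated with a worked example from basic queuing theory. The sketch below assumes the simplest textbook model, a single-server M/M/1 queue with random arrivals and service times. That model is an assumption chosen for illustration, not something this post specifies, and the ticket rates are made up.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Average time a request spends waiting plus being worked, for an M/M/1 queue.

    Uses the standard result W = 1 / (mu - lambda), where lambda is the
    arrival rate and mu is the service rate, both in requests per day.
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrivals meet or exceed capacity: the queue grows without bound")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical specialist team that can complete 10 tickets per day.
CAPACITY = 10.0
turnaround_days = {
    rate: mean_time_in_system(rate, CAPACITY)
    for rate in (5.0, 8.0, 9.0, 9.5)
}
```

Under these assumptions, going from 5 to 9.5 requests per day (50% to 95% utilization) raises the average turnaround from 0.2 days to 2 days. The requesters see none of this until their own tickets slow down, which is why stuffing the queue continues.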

Tickets reinforce siloed ways of working

Ticket queues act as buffers that allow teams to continue working in a siloed, disconnected manner. The more disconnected teams become, the more they behave like siloed pools of specialist labor.

Requests are made of these specialists via ticket queues, and requests are fulfilled as one-offs through semi-manual or manual efforts. Variability is high. Priorities and context are difficult to gauge.

As with the reaction to bottlenecks, the primary management focus is on protecting team capacity (not the needs of the broader organization). The more that silos’ effects are reinforced, the more disconnects, mistakes, and delays there will be. Ticket queues are the enablers of this downward spiral.

Tickets create snowflakes

The default FIFO (First In, First Out) nature of ticket queues encourages one-offs. When a ticket comes up to the top of the queue, the people working the ticket queue will spring into action, attempt to garner as much context as possible in the limited time they have, do what they think is correct from their perspective, and then move on to the next ticket.

This way of working—jumping from one request to another, each with seemingly disconnected context—is a leading cause of snowflakes. “Snowflakes” is a term that describes something that may be technically correct (even perfect) but a one-off that is not reproducible. A manually updated server is the classic example of a snowflake. You may be able to get it into a working state, but in all likelihood, it is going to be slightly different from other servers in your fleet (and often in undetectable ways).

The cost of snowflakes might seem minimal at first, but as environments grow, the costs quickly compound and create an expensive and seemingly intractable condition. As Keith Chambers has famously pointed out, “How many enterprises have had ‘snow days’ where some small, unexpected variation results in an incident that kills a team’s capacity for hours or days?”

Tickets obscure the value stream

So much of the work of the Lean, Agile, and DevOps movements has been about breaking down barriers to build a systemic view of the work that needs to be done to deliver value (this end-to-end systemic view is often referred to as the “value stream”). Because context matters in all knowledge work, understanding where each piece of work fits into the broader system is critical.

After all the work that has been done to provide transparency and build up context, breaking it down into a series of individual tickets obscures the value stream and scatters the context. In fact, breaking work down into smaller and smaller units is often seen as a ticket system best practice.

Tickets add management overhead

Ticket queues don’t just appear and take care of themselves. Someone has to set up the queue and define the rules (rules that also often add the overhead of other people needing to learn how to work within or around those rules). Someone has to maintain the ticket system itself. Priorities, conflicts, and politics also need to be continuously managed (often through an expensive project management overlay). All of this work has a cost and requires time and effort that could be spent on other value-adding work.

What are tickets good for?

Don’t get me wrong; tickets aren’t all bad. Tickets are just overused and/or used for the wrong reasons.

In my opinion, ticket systems are useful for raising true exceptions (e.g., logging bugs or enhancement requests). Also, there is some merit to using ticket systems to document human-to-human communication when approvals are unavoidable.

Ticket queues are also useful when each request is atomic and isolated (e.g., traditional customer helpdesk or selling tickets at a box office). However, most of these requests are prime candidates for self-service automation.

When it comes to the complex knowledge work required to deliver and operate enterprise software-based services, the evidence seems clear that ticket queues are costly at best and destructive at worst.

How do we get rid of as many ticket queues as possible?

As more organizations become aware of the toxic side effects of ticket-driven request queues, I see the same basic pattern emerging to remove as much work from queues as possible:

Redesign the work to avoid handoffs wherever possible

Forward-thinking organizations have been focusing on creating “service ownership” or “product-aligned teams” that can handle as much of the lifecycle as possible (without needing to hand off work to other teams). Eliminating handoffs, naturally, cuts down on the need for ticket queues.

Apply self-service automation to eliminate remaining ticket queues

Wherever ticket queues can’t be eliminated, replacing the queues with self-service is an excellent alternative. Self-service automation enables both the definition and execution of operations activity to be delegated across traditional organizational boundaries. By replacing ticket queues with pull-based self-service interfaces, wait time is eliminated, feedback loops are shortened, breaks in context are avoided, and costly toil is eliminated. The few ticket queues that remain are the ones for true exceptions and one-offs.

It is time to take action

It is time, as an industry, to question the conventional wisdom of ticket queues. Tickets have their place, but have been overused to everyone’s detriment. Here at PagerDuty, we’re working with our users to find and enable better ways of working that make it easier to get things done. We hope you can join us on this mission.

The post Tickets Make Operations Unnecessarily Miserable appeared first on PagerDuty.

Doing More with Less: Building Greater Operational Efficiency with PagerDuty by Nancy Lee

Nancy Lee — Wed, 14 Dec 2022 14:00:13 +0000

How many of us can say with confidence that we know a tool inside and out? If you’re like most, you probably use just a small fraction of a product’s features. When it comes to feature-rich software like Microsoft Word or Excel, it’s a safe bet that most users are aware of fewer than half of the features, and use even fewer on a regular basis. And the longer we’ve been using a piece of software, the more likely we are to fall into this trap of feature underutilization.

I started noticing this in my own life a year and a half ago when a coworker who had recently joined the team told me she found a more efficient way to generate closed captions for our instructional videos. I asked if it was a tool in her Adobe Creative Suite.

“Nope, it’s actually YouTube!” she replied.

“What? That’s amazing!” I said. “How did we not know about this?” I was shocked. For the past 6 months, we had been paying for a separate tool for its closed captioning capabilities when, all along, we could’ve used YouTube’s free captioning feature in our Google accounts.

More recently, I had my proverbial mind blown yet again when I learned of Slack’s reminder feature. Making my to-do list for the next day, I was scheduling reminders in my Google calendar to follow up with a teammate, call my doctor, and pay the gas bill. My husband looked on in amusement as I added one event after another in my calendar.

“What are you doing?” he asked.

“Setting reminders for the things I have to do tomorrow,” I replied, mildly annoyed at this interruption to my sacred routine.

“Why don’t you use the Slack reminder feature?” he said. “That way, you’re not filling up half your calendar with reminders and making it hard for people to book meetings.”

“I had no idea you could do that!” Like the YouTube incident, I was incredulous that I was only learning of this feature now.

As I started scheduling Slack reminders for the following day, I wondered how often we hear our customers use that phrase — “I had no idea you could do that!” It’s not surprising when you think about it. We often purchase a tool for a specific use case. In our haste to implement a solution, we approach the task with blinders on, paying attention to only those features that will help us achieve our goal. “Problem solved!” we declare. Never mind that we only learned a tenth of the software’s capabilities. Years later, we’re still clicking the same buttons and following the same scripts, oblivious to the slew of new features that promise to enhance our user experience.

It’s human nature to take the path of least resistance. But at a time when many tech companies are being asked to manage costs and do more with less, perhaps a good place to start looking for efficiencies is in our existing investments.

One business area that shines a light on this is Customer Education. At PagerDuty, customer training and enablement sits with PagerDuty University. A comment we often see in our course evaluations is “I had no idea PagerDuty could do [fill in the blank with a feature that’s existed for months or even years]!” Some customers may have started using PagerDuty for on-call management and alerting, and never ventured beyond those basic capabilities. They’ve become so accustomed to using PagerDuty for a single use case that they don’t realize its product portfolio actually encompasses multiple solutions for use cases across their digital operations.

For organizations facing pressure from the current macroeconomic environment, PagerDuty’s end-to-end digital operations capabilities can help consolidate tool spend and boost productivity by reducing context switching. PagerDuty University helps customers by driving awareness of this end-to-end experience, from pre-incident creation (enriching and routing events) to post-incident mobilization (response automation) to business-wide orchestration (automated stakeholder communication) and beyond. Rather than investing in point solutions that address a single problem, our customers can leverage the solutions they need, when they need them, adopting additional capabilities and products as they continue to evolve their Digital Operations with PagerDuty.

Those of us who work in Customer Education understand that it’s our job not only to improve a customer’s time to value, but also to ensure that they continue to see the return on their investment post-onboarding and beyond. For PagerDuty University, that means making sure that our customers receive proper enablement on PagerDuty’s advanced capabilities such as Event Intelligence and Incident Workflows (in Early Access!), as well as other products and use cases such as Customer Service Operations and Process Automation. Tool consolidation, cost savings, automating away toil, and a better customer experience are among the biggest returns our customers walk away with post-training.

Our instructor-led training courses are centered around achieving customer goals. Rather than training customers on every PagerDuty feature, we first try to understand what business challenges they’re trying to solve, and build training that guides them efficiently to reaching those goals. Often in SaaS, we talk about time to value — we like to think of our technical training team as “guides to value.”

PagerDuty University’s free, on-demand training complements our instructor-led training by digging into each product feature, situated in real-life scenarios so users always understand the larger context in which these features are used and the problems they solve. Our self-paced eLearning modules are suitable for customers who are trialing a free account, those who want to check out new features, or those who simply prefer the self-serve aspect of on-demand training.

It should come as no surprise that those of us who work in Education Services love learning. We use that love of learning to drive customer success, which sits at the heart of everything. Whether it’s driving adoption, improving onboarding, or imparting industry best practices, we strive to make sure that we never hear one of our customers say “I had no idea PagerDuty could do that!”

The post Doing More with Less: Building Greater Operational Efficiency with PagerDuty appeared first on PagerDuty.

Getting Started Workshop: Rundeck By PagerDuty by Nisha Prajapati

Nisha Prajapati — Mon, 12 Dec 2022 20:49:04 +0000

The post Getting Started Workshop: Rundeck By PagerDuty appeared first on PagerDuty.

Toil: Still Plaguing Engineering Teams by Damon Edwards

Damon Edwards — Tue, 06 Dec 2022 14:00:00 +0000

This post is an update of a popular post authored by Damon Edwards.

Our industry has always had localized expressions for work that was necessary but didn’t move the company forward. The SRE movement calls this type of work “toil.”

The concept of toil is a unifying force because it provides an impartial framework for identifying — then containing — the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn’t move the company forward.

Why Toil Matters

Unfortunately, “not enough time and too much to do” describes the default working conditions inside operations orgs. There is an unlimited supply of planned and unplanned work — new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.

With only so many hours in the day, how do you make sure what you’re working on actually makes a difference?

How do you make sure your team and your broader organization maximize the kinds of work that add value and find ways to eliminate work that doesn’t? After all, organization and team decisions dictate the majority of your work.

To maximize both the value of your engineering organization and the human potential of your colleagues, you need an objective framework to identify and contain the “wrong” kind of work and maximize the “right” kind of work. Understanding what toil is — and keeping the amount of toil contained — provides economic benefits to your company and improves the work lives of fellow engineers.

What is the Definition of Toil?

Google first popularized both the term “toil” and the SRE movement, and the concept has since spread into IT operations more broadly.

In a nutshell, SRE is about injecting software engineering practices — and a new mindset — into IT operations to create highly reliable and highly scalable systems. Interest in the topic of SRE has skyrocketed since Google published their Site Reliability Engineering book.

In the book, Vivek Rau articulates an excellent definition: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

The more of these attributes a task has, the more confidence you can have in classifying the work as “toil.” However, just because work is classified as toil doesn’t mean that a task is frivolous or unnecessary. On the contrary, most organizations would grind to a halt if the toil didn’t get done.
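As a rough illustration of that attribute-based reasoning, the Python sketch below counts how many of the six attributes from the definition apply to a task. The `Task` class, the example tasks, and the idea of a simple count are assumptions made for this example; the definition itself does not prescribe any scoring method or threshold.

```python
from dataclasses import dataclass

# The six attributes named in the definition of toil quoted above.
TOIL_ATTRIBUTES = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly",
)


@dataclass
class Task:
    """A hypothetical task record with one flag per toil attribute."""
    name: str
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    tactical: bool = False
    no_enduring_value: bool = False
    scales_linearly: bool = False


def toil_attribute_count(task: Task) -> int:
    """Count how many toil attributes apply; more matches, more confidence."""
    return sum(1 for attr in TOIL_ATTRIBUTES if getattr(task, attr))


# Illustrative tasks only; how a real task is flagged is a judgment call.
restart = Task(
    "restart stuck worker",
    manual=True, repetitive=True, automatable=True,
    tactical=True, no_enduring_value=True, scales_linearly=True,
)
review = Task("quarterly capacity review", manual=True, repetitive=True)
design = Task("design new autoscaling policy")

for task in (restart, review, design):
    print(f"{task.name}: {toil_attribute_count(task)} of {len(TOIL_ATTRIBUTES)}")
```

The count is only a conversation starter: a task with a high count may still be necessary today, which is the point the surrounding text makes.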

A goal of “no toil” sounds nice in theory. However, in reality, a “no toil” goal is not attainable in a business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil. Just because a task is necessary to deliver value to a customer doesn’t mean that it is always value-adding work. Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers). Long-term, we should want to eliminate the need for the toil.

The best we can hope for is to be effective at reducing toil and keeping toil at a manageable level across the organization. Toil will come from sources you already know about but just haven’t had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, service failover). Toil will also come from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, changing config settings).

What Should People Be Doing Instead of Toil?

Instead of engineers spending time on non-value-adding toil, you want them spending as much of their time as possible on value-adding engineering work.

Also pulling from Vivek Rau’s helpful definitions, engineering work can be defined as the creative and innovative work that requires human judgment, has enduring value, and can be leveraged by others.

Working in an organization with a high ratio of engineering work to toil feels like everyone is swimming towards a goal. Working in an organization with a low ratio of engineering work to toil feels more like you are treading water, at best, or sinking, at worst.
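One hedged way to put a number on that ratio is to tally logged hours by category. The sketch below assumes a hypothetical time log of `(kind, hours)` pairs, where `kind` is either `"engineering"` or `"toil"`; neither the log format nor the function name comes from any particular tool.

```python
from typing import Iterable, Tuple


def engineering_to_toil_ratio(entries: Iterable[Tuple[str, float]]) -> float:
    """Return engineering hours divided by toil hours.

    Returns infinity when no toil was logged, so callers never divide by zero.
    """
    engineering = 0.0
    toil = 0.0
    for kind, hours in entries:
        if kind == "engineering":
            engineering += hours
        elif kind == "toil":
            toil += hours
        else:
            raise ValueError(f"unknown kind of work: {kind!r}")
    if toil == 0:
        return float("inf")
    return engineering / toil


# Hypothetical week for a small team: 30 engineering hours, 10 toil hours.
week = [
    ("toil", 6.0),
    ("engineering", 10.0),
    ("toil", 4.0),
    ("engineering", 20.0),
]
print(engineering_to_toil_ratio(week))  # 3.0
```

Tracking the trend of this number over time is likely more useful than any single reading, since one unusual week can skew it.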

High Levels of Toil Are Toxic

Toil may seem innocuous in small amounts. However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization.

For the individual, high levels of toil lead to:

Discontent and a lack of feeling of accomplishment
Burnout
More errors, leading to time-consuming rework to fix
No time to learn new skills
Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)

For the organization, high levels of toil lead to:

Shortages of team capacity
Excessive operational support costs
Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
Inability to retain top talent (and acquire top talent once word gets out about how the organization functions)

One of the most dangerous aspects of toil is that it requires engineering work to eliminate it.

Reducing toil requires engineering time to either build supporting automation to automate away the need for manual intervention or enhance the system to alleviate the need for the intervention in the first place.

Engineering work needed to reduce toil will typically be a choice of creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service to not require maintenance intervention.

Toil eats up the time needed to do the engineering work that will prevent future toil. If you aren’t careful, the level of toil in an organization can increase to a point where the organization won’t have the capacity needed to stop it. If we use the Technical Debt metaphor, this would be “engineering bankruptcy.”

The SRE model of working, and all of the benefits that come with it, depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional long-suffering sysadmin with a new title.

Why PagerDuty Cares About Toil

One of our main goals is to improve the work-lives of operations professionals. Reducing toil and maximizing engineering time does just that.

Our users have often shown us how they use PagerDuty Process Automation and Rundeck in their efforts to reduce toil.

Benefits include:

Reducing variation and errors by standardizing procedures.
Making it easier to do the engineering work that reduces toil by automating tasks that previously required manual effort.
Stopping one team from creating toil for another by enabling self-service, so that others can perform operations tasks themselves.

The post Toil: Still Plaguing Engineering Teams appeared first on PagerDuty.