DevOps | Categories | PagerDuty https://www.pagerduty.com/blog/category/devops/ Build It | Ship It | Own It Thu, 29 Jun 2023 21:46:14 +0000

Strengthen Your DORA Metrics with PagerDuty by Mandi Walls https://www.pagerduty.com/blog/strengthen-your-dora-metrics-with-pagerduty/ Fri, 30 Jun 2023 12:00:56 +0000

The post Strengthen Your DORA Metrics with PagerDuty appeared first on PagerDuty.

For technical teams, the findings from DORA provide a model for measuring and improving performance. With almost a decade of data gathered from more than 33,000 professionals worldwide, the capabilities and frameworks detailed by the research help teams pinpoint areas for improvement and areas to celebrate. 

The team at DORA categorizes capabilities into three sections: Technical Capabilities, Process Capabilities and Cultural Capabilities. These are all important considerations for teams hoping to use the DORA findings to improve their own performance. 

The DORA research is tool agnostic, and not prescriptive on how teams should go about improving their performance or prioritizing their goals, so it can be challenging to put together a tactical plan. With the myriad tools available to teams today, putting together the best combination for your team seems daunting.

If your team is using the DORA metrics as part of your improvement goals, the PagerDuty Operations Cloud can help. As a central component of your environment, the visibility PagerDuty brings will help with a number of key metrics and complement the tools that impact others. Several PagerDuty features fit well into the capabilities described by the DORA team.

Reducing Unplanned Work

A key benefit of using PagerDuty for incident response is reducing the overall time spent on unplanned work across the organization. This is important in several ways unrelated to the DORA capabilities, primarily around making better use of resources and focusing on work that provides the most value to the organization. Less time spent in incidents = more time spent delighting customers.

One of the challenges of incidents and unplanned work, from an organizational culture perspective, is that they can be invisible: the time isn’t tracked in work plans or documented as part of the time required to build a new feature. Making unplanned work visible helps teams manage the burden and work towards improving outcomes.

Work in process limits and visual management capabilities can both be improved by deploying PagerDuty to capture the impact of unplanned work and incidents on your team. Including data from PagerDuty related to how many engineers are impacted by out-of-hours incidents, overnight or “sleep time” incidents, and even work-day incidents gives managers additional insight into how teams are performing. 
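As a rough illustration of the breakdown described above, the sketch below buckets incident start times into work-day, out-of-hours, and sleep-time interruptions. The hour boundaries, function names, and sample timestamps are assumptions made for this example, not PagerDuty's definitions; adjust them to your team's schedule.

```python
from collections import Counter
from datetime import datetime

# Assumed boundaries, for illustration only; adjust to your team's schedule.
BUSINESS_START, BUSINESS_END = 8, 18   # 08:00-18:00 on weekdays
SLEEP_START, SLEEP_END = 22, 8         # 22:00-08:00 every day


def classify_incident(created_at: datetime) -> str:
    """Label an incident by when it interrupted the responder (local time)."""
    hour = created_at.hour
    if hour >= SLEEP_START or hour < SLEEP_END:
        return "sleep_hours"
    if created_at.weekday() < 5 and BUSINESS_START <= hour < BUSINESS_END:
        return "business_hours"
    return "off_hours"


def summarize(incident_times):
    """Count incidents per category so the interruption load is visible."""
    return Counter(classify_incident(t) for t in incident_times)


# Hypothetical incident start times.
sample = [
    datetime(2023, 6, 26, 10, 15),  # Monday morning
    datetime(2023, 6, 26, 19, 40),  # Monday evening
    datetime(2023, 6, 27, 3, 5),    # Tuesday, overnight
    datetime(2023, 7, 1, 14, 0),    # Saturday afternoon
]
print(summarize(sample))
```

A summary like this can sit next to a team's planned work so that managers see interruptions and roadmap items in the same place.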

Integrations and Extensions

Integrations and extensions are powerful features that place PagerDuty at the center of your operational capabilities. 

Integrations allow PagerDuty to receive information and alerts from other services, interrogate them, assign them to services, and initiate incidents. PagerDuty integrates with many third party services that provide monitoring, observability and tracing functionality for various types of events in your environment.

Extensions help you streamline your PagerDuty workflows with third-party tools like Slack, Microsoft Teams, Jira Cloud and Zoom that enhance your incident response experience.

Integrations and extensions mean that your teams can bring any number of tools with them to PagerDuty. The flexibility provided by more than 700 integrations gives all of your teams the tools they need, whether they are running machine learning applications, web platforms or databases. 

Selecting the right tool for the job saves teams time and confusion, but it shouldn’t sacrifice the ability to respond to incidents and preserve reliability. Make PagerDuty part of your cross-team baseline, as described in the capability Empowering Teams to Choose Tools.

Change Events

Change events are non-alerting events that can be sent to PagerDuty from your build and deploy tools. They give your teams insight into what has changed in a service, and are an invaluable first stop when investigating incidents.

Having a good continuous delivery practice is a key capability for DORA-aligned teams, and the research shows that teams using continuous delivery practices spend less time on unplanned work. Change events can shorten the time your team does spend on unplanned work.

Change events can also provide wider visibility for deployment changes in your environment, improving your deployment automation, and helping streamline change approval. Even when teams are using different build and deployment tools, their change events can be captured by PagerDuty to help manage service reliability.
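To make this concrete, here is a minimal sketch of how a deploy step might build and send a change event. It assumes the Events API v2 change-event endpoint and field names as I understand them (`routing_key`, `payload.summary`, `payload.source`, `payload.timestamp`, `payload.custom_details`, `links`); check these against the current PagerDuty documentation before relying on them. The integration key, service name, and build number are hypothetical placeholders.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumed Events API v2 change-event endpoint; confirm against current docs.
CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(routing_key, summary, source, details=None, link=None):
    """Build the JSON body for a change event (illustrative field choices)."""
    event = {
        "routing_key": routing_key,
        "payload": {
            "summary": summary,
            "source": source,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "custom_details": details or {},
        },
    }
    if link:
        event["links"] = [{"href": link, "text": "Build details"}]
    return event


def send_change_event(event):
    """POST the event; call this from a deploy step once the key is real."""
    request = urllib.request.Request(
        CHANGE_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical values; nothing is sent until send_change_event() is called.
event = build_change_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Deployed checkout-service build 1842 to production",
    source="ci-pipeline",
    details={"build_number": "1842", "environment": "production"},
)
print(json.dumps(event, indent=2))
```

Keeping payload construction separate from the network call lets different build tools share the same event shape, which matters when teams use different pipelines.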

Automation

PagerDuty’s various Automation solutions play an important role in not just incident response, but in the completion of general technical tasks. Process Automation, Runbook Automation, and Automation Actions all contribute to teams spending less time on simple well-understood tasks and more time on work that adds value.

Many DORA capabilities emphasize automation, so using a general-purpose tool provides teams with a single interface for automating many tasks. One of the key capabilities strengthened by automation is cloud infrastructure, in which on-demand self-service is a requirement. Teams using PagerDuty’s automation solutions can create jobs and delegate work to whoever in the organization requires the tasks to be completed, creating true self-service workflows.

Terraform Provider

Related to automation is PagerDuty’s support for Infrastructure as Code (IaC) via the Terraform provider. IaC and similar solutions help teams track their changes to infrastructure and other components via Version Control, another of DORA’s technical capabilities.

Managing a large PagerDuty environment can be complicated using only the UI, but making use of Terraform to create objects and provide teams with templates helps everyone on the team improve the reproducibility and traceability of their changes.

Service Graph and Business Services

Finally, PagerDuty’s Service Graph and Business Services features enable teams to create relationships among services, illuminating the impact of incidents when they happen in large environments. Status Pages give the entire organization a place to look for service impacts during an incident, and how they relate to the customer.

Business services in PagerDuty are representations of user-facing features; users might see odd behaviors in a shopping cart experience, but will have no idea which backend service is causing the behavior. Building the relationships in the service graph provides data to your organization about the health of user-facing features and capabilities while also helping responders troubleshoot issues with dependencies.
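The relationship between backend services and user-facing features can be pictured as a dependency graph. The sketch below is not PagerDuty's implementation; it is a plain breadth-first walk over a hypothetical dependency map, and every service name in it is made up for illustration. It shows how a failing backend service can be traced up to the business services it affects.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the services in its list.
DEPENDS_ON = {
    "Shopping Cart": ["cart-api", "pricing-api"],
    "Checkout": ["cart-api", "payments-api"],
    "cart-api": ["inventory-db"],
    "pricing-api": [],
    "payments-api": [],
    "inventory-db": [],
}
BUSINESS_SERVICES = {"Shopping Cart", "Checkout"}


def impacted_business_services(failing_service):
    """Walk the graph upward from a failing service to the features it affects."""
    # Invert the map so we can ask "who depends on X?"
    dependents = {name: [] for name in DEPENDS_ON}
    for service, dependencies in DEPENDS_ON.items():
        for dependency in dependencies:
            dependents[dependency].append(service)

    seen, queue = {failing_service}, deque([failing_service])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & BUSINESS_SERVICES)


print(impacted_business_services("inventory-db"))
print(impacted_business_services("payments-api"))
```

The same map read in the other direction gives responders the list of dependencies to check when a user-facing feature misbehaves.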

These features will help strengthen a team’s monitoring and observability capability, as well as the Monitoring systems to inform business decisions capability.

Do more DORA with PagerDuty

The tools your team uses should help your organization reach its goals. Implementing PagerDuty’s features will help your organization improve not only in responding to incidents, but also in creating reliable services your users love. To learn more, visit our website and join our community forum.

Automation Seasons Freezings Wrap Up and New Year’s Resolutions by Madeline Stack https://www.pagerduty.com/blog/seasons-freezings-wrap-up-and-new-years-resolutions/ Thu, 22 Dec 2022 14:00:38 +0000

The post Automation Seasons Freezings Wrap Up and New Year’s Resolutions appeared first on PagerDuty.

It’s that time of year where you may feel pressured to pick your New Year’s resolutions. Well, we went ahead and tried to give you a head start. 2023 is the year we tame toil so we can focus on the fun stuff like engineering and innovation. 

Hopefully you have had the chance to follow along with us for the month of December for Seasons Freezings, the time of year you are locked out of production, so you have time to explore new ideas like automation 🙂. If you missed it, this blog links to Seasons Freezings content you can consider even when you’re back in prod.

Image of snowflake with "Happy seasons freezings! It's time to #automate" text.

Resolution 1: More engineering, less toil.

Customers still tell us that toil is plaguing them. A survey run by PagerDuty found that 43% of respondents spend between 11% and 30% of their time doing routine, manual tasks. That is roughly 8 days a year total spent on interruptions and repetitive work. In the new year, we encourage you to take back some of this time by implementing automation into your existing processes. 

You can get started simply by developing a list of repetitive tasks and “below the value line” work you can automate away. This could include restarting servers, fetching logs, opening/closing/updating tickets, provisioning infrastructure, updating user accounts, and more.

Next, organize subject matter experts and stakeholders, agree on how to standardize completing these repetitive tasks, and automate them using PagerDuty Process Automation products or Rundeck open source. Find out more about toil by checking out this blog by Damon Edwards.

Animated image of speedometer going from "toil" to "engineering time"

Once you’ve standardized the way to complete these operational tasks, maximize the value of this automation by delegating it out as self-service to end users. Start exploring self-service automation in your organization today. Watch this video to learn: 

  • High-impact use cases for delegating IT processes as self-service automation
  • Design principles for creating self-service automation
  • How to fit delegated self-service to the way your end users work 

You’re well on your way to “automate. delegate. celebrate!”

Make sure to watch this Twitch episode on how PagerDuty manages change freezes, thawing from a freeze, and how you can utilize a freeze to help tackle toil. 

Resolution 2: Reduce interruptions by automating diagnostics

Tired of pesky pages for the same incident? While we can’t suggest you ignore the noise, PagerDuty’s Automated Diagnostics solution is a fantastic way to reduce interruptions for common, recurring incidents. The key is automating your most common troubleshooting procedures and giving your responders the ability to invoke automation inside the PagerDuty interface. By automating diagnostics, you can limit the disruptions of day-to-day work for expert engineers and reduce resolution time across the board. 

Here are some useful Automated Diagnostics resources: 

  • This blog details the issues around the initial stages of a response process and why your teams should care about automated diagnostics.
  • We even have some out-of-the-box job templates for automating common diagnostics for Kubernetes, Linux, and other common components. Read this blog to learn more. 
  • Read the automated diagnostics solutions guide for more information on setting up the solution, use cases, and diagnostic examples.  
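As a small illustration of the idea (not one of PagerDuty's job templates), the sketch below gathers a few basic host facts that a responder might otherwise fetch by hand at the start of an incident. The function names and the choice of checks are assumptions for this example; a real diagnostic job would target whatever your team checks first.

```python
import platform
import shutil
import socket
from datetime import datetime, timezone


def collect_diagnostics(path="/"):
    """Gather a few basic host facts a responder would otherwise fetch by hand."""
    usage = shutil.disk_usage(path)
    used_percent = round(usage.used / usage.total * 100, 1) if usage.total else 0.0
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "platform": platform.platform(),
        "disk_path": path,
        "disk_used_percent": used_percent,
    }


def format_note(diagnostics):
    """Render the results as plain text suitable for pasting into an incident note."""
    return "\n".join(f"{key}: {value}" for key, value in diagnostics.items())


print(format_note(collect_diagnostics()))
```

Running the same checks the same way every time is what lets a first responder invoke them without paging the expert who normally would.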

Resolution 3: Get more out of existing automation with integrations

We know many companies have a lot of automation that they use today. The challenge can be making that automation secure and widely available for others to use. Many organizations use PagerDuty Process Automation or Rundeck to centralize and standardize automation.

To make this easy, PagerDuty Process Automation and Rundeck come bundled with many out-of-the-box plugins. Plugins provide a wrapper around scripts, interfaces, and utilities. Take a look at the full list.

One very popular pairing is integrating Ansible into PagerDuty Process Automation or Rundeck. Many users integrate Ansible playbooks into PagerDuty Process Automation or Rundeck to orchestrate and schedule workflows across multiple tools. Watch this video to learn the benefits of using PagerDuty Process Automation or Rundeck and Ansible, and some helpful tips to get started. 

We would love to hear your new year’s resolutions for 2023. Tag Rundeck on Twitter to share! 

Contact us to learn more about PagerDuty Process Automation.

Tickets Make Operations Unnecessarily Miserable by Damon Edwards https://www.pagerduty.com/blog/tickets-make-operations-unnecessarily-miserable/ Fri, 16 Dec 2022 14:00:40 +0000

The post Tickets Make Operations Unnecessarily Miserable appeared first on PagerDuty.

IT Operations has always been difficult. There is always too much work to do and not enough time to do it. The frequent interruptions and high levels of toil certainly don’t help. Moreover, there is relentless pressure from executives who question why everything takes too long, breaks too often, and costs too much.

In search of improvement, we have repeatedly bet on new tools to improve our work. We’ve cycled through new platforms (from virtualization, to cloud, to containers, to Kubernetes) and new automation (e.g., Ansible, Terraform, Pulumi). While each has provided benefits, would an average engineer say that the stress and overload on operations has fundamentally changed?

Enterprises have also spent the past two decades liberally applying management frameworks like ITIL and COBIT. Again, would an average engineer say things have gotten better or worse?

In the midst of all of this, there is conventional wisdom that rarely gets questioned—the extensive use of tickets to manage operations work. 

Tickets have become the go-to work management tool in operations organizations. Need something done? Open a ticket. Someone needs something from you? A ticket will appear in your queue.

Ticket-driven ways of working have become so ubiquitous that few think twice about adding more ticket queues across an organization. 

However, what if we were wrong about tickets? What if ticket queues were a significant source of operational strife hiding in plain sight? Let’s examine how ticket queues are often causing more harm than good.

Graphic of person moving through ticket cycle

What is wrong with tickets?

It’s the queues

Tickets on their own are relatively innocuous as they are just records. The issue is where you put those tickets. Tickets get put into ticket queues, and that is where the problems start.

Queues add delay, increase risks, add variability, add overhead, lower quality, and decrease motivation.

These aren’t my opinions; the cost of queues is documented by extensive research in fields as diverse as physical manufacturing and product development. Queuing Theory is its own respected area of academic study.

In the rest of this post, I’ll use “tickets” and “queues” somewhat interchangeably. Just know that it is the queue part that causes the problems.

Image illustrating queue creation leading to money and time spent

Tickets increase communication problems

Whenever parties are forced to communicate via ticket queues, there are going to be disconnects and miscommunication. Think about all of the problems you’ve probably had with tickets: Requests being misunderstood. The person reading a request lacks the context or is working in a different context. The requester makes the wrong request or doesn’t understand the ramifications of what they are requesting. As a request sits in the queue, the request parameters have (unknown to either party) changed or are no longer valid.

Illustration demonstrating ticket queue being a barrier  to communication

Tickets delay feedback and learning

Almost every modern management philosophy (from Lean to DevOps and the Lean Startup) hinges on the concept of the organization learning quicker. All strive for increasingly tighter feedback loops so that analysis and decisions happen faster.

However, as Scott Prugh likes to remind us all, “Queues don’t learn.” Queues work against feedback loops by both injecting delays and stripping each request of its context. It is tough to become a learning organization when there is rampant use of ticket-driven request queues.

Illustration demonstrating feedback loop and what happens when ticket queues are introduced

Tickets encourage bottlenecks

The nature of how teams work through ticket queues encourages bottlenecks. First, ticket queues are often used where there are capacity mismatches. For example, ticket queues are commonly used to buffer requests made of specialist teams (e.g., network, database, security) who are largely outnumbered by people who need them to do something. These requesting teams “stuff” the queue with requests, which causes the queue length and response times to grow. Because queues delay feedback, the requesters are often unaware of (or don’t care about) the capacity mismatch and continue to stuff the queue. This behavior increases both the queue length and response time, worsening the bottleneck.

Additionally, as queue lengths grow, the teams responsible for working the queue will instinctively look inward to protect their capacity. This natural reaction leads to optimizations for the team behind the queue—often at the expense of the broader organization. For example, if a firewall team only makes non-emergency changes on Mondays and Thursdays, it creates a delay for the rest of the organization even if it helps the firewall team optimize its workload.
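The capacity mismatch described above can be illustrated with the textbook M/M/1 queue, a deliberately simple model that the post itself does not cite. With arrival rate λ and service rate μ, utilization is ρ = λ/μ, the average number of requests waiting is ρ²/(1 − ρ), and the average time a request spends in the system is 1/(μ − λ). The team capacity and arrival rates below are assumed numbers chosen only to show the shape of the curve.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (utilization, avg requests waiting, avg time in system) for M/M/1."""
    if arrival_rate >= service_rate:
        raise ValueError("queue grows without bound when arrivals reach capacity")
    utilization = arrival_rate / service_rate
    avg_waiting = utilization ** 2 / (1 - utilization)
    avg_time_in_system = 1 / (service_rate - arrival_rate)
    return utilization, avg_waiting, avg_time_in_system


SERVICE_RATE = 10.0  # tickets the specialist team can close per day (assumed)

for arrival_rate in (5.0, 8.0, 9.0, 9.5):
    utilization, waiting, days = mm1_metrics(arrival_rate, SERVICE_RATE)
    print(
        f"utilization {utilization:.0%}: "
        f"{waiting:.2f} tickets waiting on average, "
        f"{days:.2f} days per ticket"
    )
```

In this model, moving from 90% to 95% utilization doubles the average time a ticket spends in the system, from one day to two, which is why a few extra requests to an already busy team have an outsized effect.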

Tickets reinforce siloed ways of working

Ticket queues act as buffers that allow teams to continue working in a siloed, disconnected manner. The more disconnected teams become, the more they behave like siloed pools of specialist labor.

Requests are made of these specialists via ticket queues, and requests are fulfilled as one-offs through semi-manual or manual efforts. Variability is high. Priorities and context are difficult to gauge.

As with the reaction to bottlenecks, the primary management focus is on protecting team capacity (not the needs of the broader organization). The more that silos’ effects are reinforced, the more disconnects, mistakes, and delays there will be. Ticket queues are the enablers of this downward spiral.

Illustration of siloed specialist labor pools and the breakdown of communication

Tickets create snowflakes

The default FIFO (First In, First Out) nature of ticket queues encourages one-offs. When a ticket comes up to the top of the queue, the people working the ticket queue will spring into action, attempt to garner as much context as possible in the limited time they have, do what they think is correct from their perspective, and then move on to the next ticket.

This way of working—jumping from one request to another, each with seemingly disconnected context—is a leading cause of snowflakes. “Snowflakes” is a term that describes something that may be technically correct (even perfect) but a one-off that is not reproducible. A manually updated server is the classic example of a snowflake. You may be able to get it into a working state, but in all likelihood, it is going to be slightly different from other servers in your fleet (and often in undetectable ways).

The cost of snowflakes might seem minimal at first, but as environments grow, the costs quickly compound and create an expensive and seemingly intractable condition. As Keith Chambers has famously pointed out, “How many enterprises have had ‘snow days’ where some small, unexpected variation results in an incident that kills a team’s capacity for hours or days?”
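One way to see how snowflakes hide in a fleet is to compare each server's settings against a shared baseline. The sketch below is only an illustration; the server names, settings, and values are hypothetical, and real drift detection would read configuration from the hosts themselves or from your configuration management tool.

```python
def find_drift(baseline, servers):
    """Return, per server, the settings that differ from the baseline."""
    drift = {}
    for name, config in servers.items():
        differences = {
            key: (baseline.get(key), config.get(key))
            for key in sorted(baseline.keys() | config.keys())
            if baseline.get(key) != config.get(key)
        }
        if differences:
            drift[name] = differences
    return drift


# Hypothetical fleet: web-02 and web-03 were each "fixed" by hand once.
BASELINE = {"openssl": "3.0.8", "max_connections": 500, "ntp": "enabled"}
SERVERS = {
    "web-01": {"openssl": "3.0.8", "max_connections": 500, "ntp": "enabled"},
    "web-02": {"openssl": "3.0.8", "max_connections": 650, "ntp": "enabled"},
    "web-03": {"openssl": "3.0.7", "max_connections": 500},
}

for server, differences in find_drift(BASELINE, SERVERS).items():
    print(server, differences)
```

Each tuple holds the baseline value followed by the value found on the server, with `None` marking a setting that is missing altogether.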

Image of a ticket queue being compared to a "snowflake maker"

Tickets obscure the value stream

So much of the work of the Lean, Agile, and DevOps movements has been about breaking down barriers to build a systemic view of the work that needs to be done to deliver value (this end-to-end systemic view is often referred to as the “value stream”). Because context matters in all knowledge work, understanding where each piece of work fits into the broader system is critical.

After all the work that has been done to provide transparency and build up context, breaking it down into a series of individual tickets obscures the value stream and scatters the context. In fact, breaking work down into smaller and smaller units is often seen as a ticket system best practice.

Illustration demonstrating how queues break down shared/common goals across teams

Tickets add management overhead

Ticket queues don’t just appear and take care of themselves. Someone has to set up the queue and define the rules (rules that also often add the overhead of other people needing to learn how to work within or around those rules). Someone has to maintain the ticket system itself. Priorities, conflicts, and politics also need to be continuously managed (often through an expensive project management overlay). All of this work has a cost and requires time and effort that could be spent on other value-adding work.

What are tickets good for?

Don’t get me wrong; tickets aren’t all bad. Tickets are just overused and/or used for the wrong reasons.

In my opinion, ticket systems are useful for raising true exceptions (e.g., logging bugs or enhancement requests). Also, there is some merit to using ticket systems to document human-to-human communication when approvals are unavoidable.

Ticket queues are also useful when each request is atomic and isolated (e.g., traditional customer helpdesk or selling tickets at a box office). However, most of these requests are prime candidates for self-service automation.

When it comes to the complex knowledge work required to deliver and operate enterprise software-based services, the evidence seems clear that ticket queues are costly at best and destructive at worst.

How do we get rid of as many ticket queues as possible?

As more organizations become aware of the toxic side effects of ticket-driven request queues, I see the same basic pattern emerging to remove as much work from queues as possible:

  1. Redesign the work to avoid handoffs wherever possible

Forward-thinking organizations have been focusing on creating “service ownership” or “product-aligned teams” that can handle as much of the lifecycle as possible (without needing to hand off work to other teams). Eliminating handoffs, naturally, cuts down on the need for ticket queues.

  2. Apply self-service automation to eliminate remaining ticket queues

Wherever ticket queues can’t be eliminated, replacing the queues with self-service is an excellent alternative. Self-service automation enables both the definition and execution of operations activity to be delegated across traditional organizational boundaries. By replacing ticket queues with pull-based self-service interfaces, wait time is eliminated, feedback loops are shortened, breaks in context are avoided, and costly toil is eliminated. The few ticket queues that remain are the ones for true exceptions and one-offs.

Illustration demonstrating the value of self-service automation

It is time to take action

It is time, as an industry, to question the conventional wisdom of ticket queues. Tickets have their place, but have been overused to everyone’s detriment. Here at PagerDuty, we’re working with our users to find and enable better ways of working that make it easier to get things done. We hope you can join us on this mission.

 Contact us to learn more about PagerDuty Process Automation.

Toil: Still Plaguing Engineering Teams by Damon Edwards https://www.pagerduty.com/blog/toil-still-plaguing-engineering-teams/ Tue, 06 Dec 2022 14:00:00 +0000

The post Toil: Still Plaguing Engineering Teams appeared first on PagerDuty.

This blog is an update from a popular blog authored by Damon Edwards. 

Our industry has always had localized expressions for work that was necessary but didn’t move the company forward. The SRE movement calls this type of work “toil.”

The concept of toil is a unifying force because it provides an impartial framework for identifying — then containing — the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn’t move the company forward.

Why Toil Matters

Unfortunately, “not enough time and too much to do” describes the default working conditions inside operations orgs. There is an unlimited supply of planned and unplanned work — new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.

With only so many hours in the day, how do you make sure what you’re working on actually makes a difference?

How do you make sure your team and your broader organization maximize the kinds of work that add value and find ways to eliminate work that doesn’t? After all, organization and team decisions dictate the majority of your work.

To maximize both the value of your engineering organization and the human potential of your colleagues, you need an objective framework to identify and contain the “wrong” kind of work and maximize the “right” kind of work. Understanding what toil is — and keeping the amount of toil contained — provides economic benefits to your company and improves the work lives of fellow engineers. 

What is the Definition of Toil?

Google and the SRE movement first popularized the term “toil,” and it has since spread into IT operations more broadly.

In a nutshell, SRE is about injecting software engineering practices — and a new mindset — into IT operations to create highly reliable and highly scalable systems. Interest in the topic of SRE has skyrocketed since Google published their Site Reliability Engineering book.

In the book, Vivek Rau articulates an excellent definition, “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” 

The more of these attributes a task has, the more confidence you can have in classifying the work as “toil.” However, just because work is classified as toil doesn’t mean that a task is frivolous or unnecessary. On the contrary, most organizations would grind to a halt if the toil didn’t get done.
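The "more attributes, more confidence" idea can be expressed as a simple count. The sketch below encodes the six attributes from the definition above as booleans and tallies them per task. The class name, field names, and example tasks are assumptions for illustration, and the count is a conversation aid rather than a formal metric.

```python
from dataclasses import dataclass, fields


@dataclass
class ToilAttributes:
    """The six attributes from the definition above, one flag each."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_score(task):
    """Count how many of the six attributes apply (0 to 6)."""
    return sum(getattr(task, field.name) for field in fields(task))


# Hypothetical tasks, scored by hand.
tasks = {
    "restart a hung service": ToilAttributes(True, True, True, True, True, True),
    "add a user account on request": ToilAttributes(True, True, True, True, True, False),
    "design a new failover strategy": ToilAttributes(True, False, False, False, False, False),
}

for name, attributes in sorted(tasks.items(), key=lambda item: -toil_score(item[1])):
    print(f"{toil_score(attributes)}/6  {name}")
```

Sorting a backlog this way gives a team a shared, impartial starting point for deciding which work to automate first.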

A goal of “no toil” sounds nice in theory. However, in reality, a “no toil” goal is not attainable in a business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil. Just because a task is necessary to deliver value to a customer doesn’t mean that it is always value-adding work. Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers). Long-term, we should want to eliminate the need for the toil.

The best we can hope for is to be effective at reducing toil and keeping toil at a manageable level across the organization. Toil will come from sources you already know about but just haven’t had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, service failover). Toil will also come from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, changing config settings).

What Should People Be Doing Instead of Toil?

Instead of engineers spending time on non-value-adding toil, you want them spending as much of their time as possible on value-adding engineering work.

Also pulling from Vivek Rau’s helpful definitions, engineering work can be defined as the creative and innovative work that requires human judgment, has enduring value, and can be leveraged by others.

Table of what constitutes "toil" and "engineering work"

Working in an organization with a high ratio of engineering work to toil feels like everyone is swimming towards a goal. Working in an organization with a low ratio of engineering work to toil feels more like you are treading water, at best, or sinking, at worst.

High Levels of Toil Are Toxic

Toil may seem innocuous in small amounts. However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization. 

Image of skull and cross bones with the word "toil"

For the individual, high levels of toil lead to:

  • Discontent and a lack of feeling of accomplishment
  • Burnout
  • More errors, leading to time-consuming rework to fix
  • No time to learn new skills
  • Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)

For the organization, high levels of toil lead to:

  • Shortages of team capacity
  • Excessive operational support costs
  • Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
  • Inability to retain top talent (and acquire top talent once word gets out about how the organization functions)

One of the most dangerous aspects of toil is that it requires engineering work to eliminate it. 

Reducing toil requires engineering time to either build supporting automation to automate away the need for manual intervention or enhance the system to alleviate the need for the intervention in the first place.

Engineering work needed to reduce toil will typically be a choice of creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service to not require maintenance intervention.

Toil eats up the time needed to do the engineering work that will prevent future toil. If you aren’t careful, the level of toil in an organization can increase to a point where the organization won’t have the capacity needed to stop it. If we use the Technical Debt metaphor, this would be “engineering bankruptcy.”

Visual illustration of toil at manageable percentage of capacity vs. unmanageable percentage of capacity

The SRE model of working, and all of the benefits that come with it, depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional, long-suffering sysadmin with a new title.

Why PagerDuty Cares About Toil

One of our main goals is to improve the work-lives of operations professionals. Reducing toil and maximizing engineering time does just that.

Our users have often shown us how they use PagerDuty Process Automation and Rundeck in their efforts to reduce toil. 

Benefits include: 

  • Reducing variation and errors, and the toil they cause, by standardizing procedures.
  • Making it easier to do the engineering work that reduces toil by automating tasks that previously required a lot of manual effort.
  • Stopping one team from creating toil for another by enabling self-service, so that others can do operations tasks themselves.

Contact us to learn more about PagerDuty Runbook Automation.

The post Toil: Still Plaguing Engineering Teams appeared first on PagerDuty.

PagerDuty and DataOps: Enabling Organizations to Improve Decision Making with Better Data by Jorge Villamariona https://www.pagerduty.com/blog/pagerduty-and-dataops/ Thu, 27 Oct 2022 01:00:01 +0000 https://www.pagerduty.com/?p=79281 This blog was co-authored by Jorge Villamariona from Product Marketing and May Tong from Technology Ecosystem Introduction Many organizations have been digitally transforming their operations...

The post PagerDuty and DataOps: Enabling Organizations to Improve Decision Making with Better Data appeared first on PagerDuty.

This blog was co-authored by Jorge Villamariona from Product Marketing and May Tong from Technology Ecosystem

Introduction

Many organizations have been digitally transforming their operations, and the majority of them are moving to the cloud. With this transformation, data teams have to analyze ever larger and more complex data sets so that downstream teams can make faster and more accurate decisions every day. Consequently, most organizations need to work with customer data, product data, usage data, advertising data, and financial data. Some of these datasets are structured, some are semi-structured, and some are unstructured. In short, there are endless amounts of data of various types arriving from multiple sources at increasing rates.

As the volume, velocity, and variety (commonly known as the 3Vs) of big data grew, traditional approaches to managing the data lifecycle started to fall short. Concurrently, toward the end of the first decade of the 2000s, software development teams started adopting agile methodologies for the software development lifecycle. These methodologies became known as DevOps (a portmanteau of Development and Operations). The following diagram illustrates the DevOps process at a high level.

 

DevOps Process


Meanwhile, data professionals took a page from their next-door software development colleagues and started applying DevOps methodologies and concepts to their own complex data environments. This is what brought about the DataOps approach.

So, what is DataOps?

DataOps is the practice of combining software and data engineering, quality assurance, and infrastructure operations into a single nimble organization. DataOps optimizes how organizations develop and deploy data applications. It leverages process evolution, organizational alignment, and multiple technologies to enable relationships among everyone who participates in producing, moving, transforming, and consuming data: developers, data engineers, data scientists, analysts, and business users. It fosters collaboration, removes silos, and gives teams the ability to use data across the organization to make better business decisions. Overall, DataOps helps teams collect and prepare data, analyze it, and make faster and more accurate decisions from a complete data set. DataOps also reduces data downtime and failures by monitoring data for quality.

What Problems Does DataOps Solve?

DataOps addresses a number of common challenges in your organization’s data environments, among them:

  1. Removing silos and promoting collaboration between teams: Data engineers, scientists, and analysts must collaborate. This requires a massive cultural shift. Companies need to allow their employees to iterate rapidly on data-driven ideas.
  2. Improving efficiency and agility: The time spent responding to bugs and defects can be dramatically reduced through greater communication and collaboration between teams and through the use of automation.
  3. Improving data quality: DataOps gives data professionals the ability to automatically format data and use multiple data sources to help teams analyze the data and make better decisions.
  4. Eliminating data downtime and failures: Data teams monitor the data for quality.

What is Data Observability?

“Data observability” provides the tools and methodologies to monitor and manage the health of an organization’s data across multiple tools and across the complete data lifecycle. Data observability allows organizations to proactively correct problems in real-time before the problems impact business users.

What is the relationship between Data Observability and DataOps?

Data observability is a framework that enables DataOps.  DataOps teams use agile approaches to extract business value from enterprise data. But any problems with incorrect or inaccurate data could create serious challenges, especially if issues (aka data downtime) are not detected before they impact the business. Fortunately, with AI-powered data observability, organizations can detect, resolve and prevent data downtime.

Data observability tools are concerned with five properties of data: freshness, statistical distribution, volume, schema, and lineage. The correct use of data observability tools results in better quality data, enhanced trust, and a more operationally mature environment.
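As a rough illustration of what three of those checks (freshness, volume, and schema) might look like in practice, here is a minimal sketch. The function names, thresholds, and column names are hypothetical; commercial data observability tools are far more sophisticated and typically learn their thresholds from history.

```python
from datetime import datetime, timedelta


def is_stale(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness check: True when the newest load is older than max_age."""
    return now - last_loaded_at > max_age


def volume_out_of_range(row_count: int, recent_counts: list[int],
                        tolerance: float = 0.5) -> bool:
    """Volume check: True when row_count deviates from the recent average
    by more than `tolerance`, expressed as a fraction of that average."""
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(recent_counts) / len(recent_counts)
    return abs(row_count - baseline) > tolerance * baseline


def schema_drift(expected: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Schema check: columns missing from, or unexpectedly added to, a table."""
    return {"missing": expected - actual, "added": actual - expected}
```

Any check that fails is a candidate for an alert to the data team before business users notice the problem.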

Who are the stakeholders in DataOps?

Building a strong centralized data team that develops relationships with all of the departments within an organization is a key factor in achieving data operational maturity. The data team usually publishes the most relevant datasets, thus ensuring that decisions, analyses, and data models are built from a single source of truth. At the other end of the spectrum are the data analysts and line-of-business users who consume these datasets by asking questions and extracting answers from the data. Carefully and intentionally defining roles and responsibilities helps organizations avoid conflicts, redundancies, and inefficiencies.

DataOps Personas

Here are the most common profiles (aka personas) that take part in the data lifecycle:

  • Data Engineers: These data professionals are in charge of capturing the data and building the pipelines that bring it from the source systems into data stores so that analysts and data scientists can access it. They publish core datasets after cleansing and transforming the data. They are in charge of providing timely data that is clean, curated, and accessible to those who need it. In the most traditional data environments, the ETL (Extraction, Transformation, and Loading) acronym appears in their title.
  • Data Scientists: These data professionals apply their knowledge of statistics to build predictive and prescriptive models. Their most common environments are Scala, Python, and R. Aside from statistics, they are generally experts in data mining, machine learning, and deep learning. The financial industry, for example, has traditionally referred to them as quants because of their solid background in mathematics.
  • Data Analysts/Business Analysts: These data professionals are generally part of line-of-business or functional groups (sales, marketing, etc.). They are familiar with how the organization operates, its strategic objectives, and where and how data is needed. They transform business questions into data queries. They have a deep understanding of the information and key metrics executives need to measure and achieve their goals. They are experts at using front-end BI (Business Intelligence) tools.
  • Data Platform Administrators: They manage the infrastructure so that it works well, has ample capacity, and provides high quality of service to every department relying on it. They are responsible for transactional databases, data warehouses, data lakes, BI tools, and so on. Additionally, they establish the access policies and control infrastructure and licensing costs.
  • Line of Business Data Consumers: They are the final users of the data, and generally use it to make decisions. They rely on BI tools and are responsible for taking action based on what the data says. For example, sales leaders may decide to invest more in a particular geography based on sales activity. Similarly, marketing managers may decide to allocate campaign funds to certain types of campaigns based on ROI metrics.
  • Chief Data Officer: This person oversees the whole data team operation. Typically they report to the CEO or CTO, and sometimes to the CIO.

 

DataOps Stakeholders and their tools

Stakeholders in the DataOps process at PagerDuty

The diagram above places the stakeholders in their traditional area of responsibility within the DataOps process at PagerDuty.  Undoubtedly, there will be varying degrees of overlap in different organizations.

DataOps at PagerDuty

At PagerDuty we have implemented a DataOps practice that leverages PagerDuty and a handful of our technology partners. By applying PagerDuty and DataOps principles we have been able to:

  • Move away from several data warehouses to a single data warehouse where datasets from MuleSoft, Segment, Fivetran, Kafka, and Spark pipelines get consolidated into a single source of truth.
  • Meet data SLAs from multiple data workloads by taking advantage of automation and data technology partnerships.
  • Leverage observability for detection, resolution, and prevention of incidents with our data, before users learn about them.
  • Shift the focus of the data team from administrative tasks to data-driven insights and data science.
  • Future-proof our data environment to meet the demands of proliferating data use cases.  These range from BI to new Artificial Intelligence (AI) applications from over 400 internal users in multiple departments and thousands of customers.

 

DevOps Process at PagerDuty

DataOps Environment at PagerDuty

The diagram above depicts several of the key components that make up our DataOps environment. While every organization’s data needs and data environment are different, you can see that our problems and architecture are not all that unusual (multiple data warehouses, multiple ETL tools, strict SLAs, sprawling demand for datasets). More than likely, you are already spotting several shared high-level problems as well as architectural similarities with your own data environment.

You can also leverage PagerDuty in your DataOps environment

The PagerDuty digital operations platform alerts data teams and downstream data users and consumers as soon as data issues arise to prevent data downtime. We are excited to announce our six currently published DataOps or data-related integrations within our ecosystem. These technology partners solve data pipeline and data quality problems across the organization.  They improve collaboration, reduce friction, and reduce data failures by improving alignment:

  • Monte Carlo: Provides end-to-end data observability, solving data downtime before it happens.
  • Lightup: Helps enterprises achieve great data quality at cloud scale.
  • Arize: Provides a Machine Learning (ML) observability platform to monitor, troubleshoot, and resolve ML model issues.
  • WhyLabs: Prevents costly AI failures by providing data and model monitoring.
  • Prefect: Builds and monitors data pipelines with real-time alerting.
  • Astronomer: Reduces data downtime with real-time data monitoring on pipelines.

 

Image of PagerDuty integrations

PagerDuty DataOps Ecosystem

Most importantly, these new DataOps integrations with PagerDuty cover key areas such as: data pipeline orchestration, testing and production quality, deployment automation, and data science/ML model management.  We encourage you to try PagerDuty along with some of these PagerDuty ecosystem technology partners to help you drive tighter collaboration amongst cross-functional teams and achieve better and faster decisions with less data downtime.  Similarly, if you are thinking about building a PagerDuty Integration, please sign up for a developer account to get started.
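If you want a feel for what a custom integration involves, here is a minimal sketch of a data pipeline raising an alert through the PagerDuty Events API v2. The routing key, summary, and source values are placeholders, and the helper names (`build_trigger_event`, `send_event`) are our own for illustration; consult the Events API documentation before relying on any field.

```python
import json
import urllib.request
from typing import Optional

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "error",
                        dedup_key: Optional[str] = None) -> dict:
    """Build the JSON body for a 'trigger' event (hypothetical helper)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key is not None:
        # Repeated events with the same dedup_key group into one alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event: dict) -> int:
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A pipeline could call `send_event(build_trigger_event(...))` whenever a data quality check fails, using a stable `dedup_key` per dataset so repeated failures do not page the team multiple times.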

Create and Manage Maintenance Windows Through PagerDuty Mobile App by Laura Chu https://www.pagerduty.com/blog/new-mobile-maintenance-windows/ Wed, 28 Sep 2022 13:00:07 +0000 https://www.pagerduty.com/?p=78801 In order to respond in real-time to urgent, critical digital incidents, on-call responders must be able to take action from anywhere.  But when on-call responders...

The post Create and Manage Maintenance Windows Through PagerDuty Mobile App appeared first on PagerDuty.

In order to respond in real-time to urgent, critical digital incidents, on-call responders must be able to take action from anywhere. 

But when on-call responders become overwhelmed with alerts, they often just “ignore them” because they cannot tell the difference between a real alert and a false one. For example, when a service is down for maintenance or upgrades, the outage can trigger multiple incidents, so the responder receives alerts that do not pertain to a real incident. Other times, however, a service triggers critical incidents that require the responder to dive into the problem and solve it fast. 

On-call teams need a better solution that is more intuitive and flexible–one that allows them to disable a service as well as pause incident alerts on their mobile device, so they can focus on what matters: solving an incident without interruptions. 

We believe effective incident management empowers teams to do their jobs more efficiently while minimizing interruptions. That’s why we are excited to announce the general availability of Maintenance Windows through the PagerDuty Mobile App. 

Maintenance Windows help responders temporarily disable a service, including all of its integrations, while it is in maintenance mode. When a service is in a maintenance window, all of the service’s integrations are effectively “switched off” so that no new incidents will trigger. 

Easy to create, update and delete maintenance windows from anywhere:

Creating Maintenance Windows within the mobile app takes just a few simple steps: 

  1. Choose the Service Directory from the hamburger menu and select your preferred service.
  2. Tap on settings and tap “create maintenance window.”
  3. Enter a description to explain why this maintenance is happening.
  4. Schedule the beginning and end date (and time) for the maintenance. 
  5. Once the maintenance window expires, the service exits the maintenance mode, and new incidents can be triggered again.

Creating a maintenance window

You can delete an existing maintenance window by going to settings and tapping on “end maintenance window.”

List of active maintenance windows

A maintenance window for multiple services:

  • PagerDuty’s mobile experience allows for the creation of a maintenance window on one service at a time. Users who want to create a maintenance window covering multiple services can do so through the web application.
  • A maintenance window covering multiple services can still be updated or deleted through the mobile app. 
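For teams that prefer to script multi-service maintenance windows, the PagerDuty REST API also exposes them. The sketch below only builds the request body for a `POST /maintenance_windows` call; the service IDs are placeholders, the helper name is our own, and the exact fields should be checked against the API reference before use.

```python
from datetime import datetime


def build_maintenance_window(service_ids: list[str], start: datetime,
                             end: datetime, description: str) -> dict:
    """Build a request body covering one or more services (hypothetical helper)."""
    if not service_ids:
        raise ValueError("at least one service ID is required")
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            # Timezone-aware datetimes serialize to ISO 8601 with an offset.
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": service_id, "type": "service_reference"}
                for service_id in service_ids
            ],
        }
    }
```

Sending the body requires an authenticated request with your API token, which is left out here to keep the sketch self-contained.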

This latest addition to PagerDuty Mobile empowers on-call teams to manage and respond to incidents without sacrificing time and work-life balance. We’re continuing to improve the PagerDuty mobile experience by giving teams the trusted information to continue serving their customers better. 

You can learn more about PagerDuty Mobile and the Maintenance Windows through our Knowledge Base Articles. Or try it out using the following QR code to download:

QR code for downloading PagerDuty Mobile on iOS

iOS

QR code for downloading PagerDuty Mobile on Android

Android

Want to learn more about PagerDuty Incident Response and how it works with our mobile app? Sign up for the free 14-day trial and experience how PagerDuty can help your teams respond faster and more efficiently, and drive innovation across your Operations Cloud.

IHS Markit: Centralizing Incident Management With PagerDuty & ServiceNow by Lisa Duckrow https://www.pagerduty.com/blog/ihsmarkit/ Wed, 17 Aug 2022 13:00:55 +0000 https://www.pagerduty.com/?p=77957 In today’s digital world, organizations are constantly undergoing change. They’re moving to the cloud and rolling out DevOps at scale—all in the name of driving...

The post IHS Markit: Centralizing Incident Management With PagerDuty & ServiceNow appeared first on PagerDuty.

In today’s digital world, organizations are constantly undergoing change. They’re moving to the cloud and rolling out DevOps at scale—all in the name of driving innovation. But moving from a monolith to microservices can lead to applications becoming increasingly distributed. When problems arise, customers don’t care how many teams and services you have, or how complex your architecture is. They only care that your services work when they need them to.

To this end, bringing everything—teams, services, data—under centralized management is key. Urgent work cannot be held up by centralized ticketing tools.

This is where combining IT service management tools with a digital operations platform can bridge the gap between central IT and decentralized teams. Enter PagerDuty and ServiceNow—by combining the two, responders gain access to automation to drive action without delay, enabling a real-time response in seconds while maintaining a complete history of all activities. This combination also streamlines the business response to incidents, keeping stakeholders updated.

This better-together approach is representative of the incident response processes used in the modern enterprise stack today. One PagerDuty customer benefiting from it is IHS Markit.

The Culture Clash

IHS Markit provides analytics and intelligence to financial service providers, governments, and other major industries. Headquartered in London, UK, it employs 16,000 people globally.

IHS Markit needed to bring together a rapidly growing number of hybrid operations to gain full visibility across the business and manage incidents from a centralized command center. The company had grown through acquisition and now offered around 700 customer-facing services and 300 internal services. Tracking incidents at this scale was incredibly challenging, and was made harder by the conflicting requirements of different areas of the business.

  • The DevOps team wanted to remain “agile, autonomous, and awesome,” with full control over all its monitoring needs. A core requirement from DevOps was that the team did not want to raise tickets or have to log on to ServiceNow.
  • The operations command center (OCC) team was rooted in a more traditional IT Infrastructure Library (ITIL) structure and based its system on ServiceNow. The team wanted better scheduling and escalation policies, but with “zero impact to the existing, mature incident management processes.”
  • Compliance wanted to track controls and records in a common system of record in ServiceNow, particularly as IHS Markit has many products under various regulatory regimes.
  • Management requested global oversight across all teams, whether the team was more aligned with DevOps or sat within the more traditional ITIL side. Management wanted ServiceNow to provide this visibility.

IHS Markit already had PagerDuty in place, but wanted to expand its use. John Kennedy, Director of Observability at IHS Markit, explained, “We wanted to bring incident management together into one enterprise offering that was horizontal across the company and properly managed.”

A Solution for Everyone

To achieve this, IHS Markit integrated ServiceNow incidents with PagerDuty. IHS Markit worked with PagerDuty’s customer success team to customize the PagerDuty platform to accommodate all requirements and improve operations. 

This enabled the DevOps team to maintain ownership of their services within PagerDuty. For these teams, the ServiceNow integration was introduced “by stealth”—everything was tracked and recorded in ServiceNow, without them ever having to log into the platform. 

For the OCC team, PagerDuty’s integration with ServiceNow ensured the existing incident management process remained intact. Everyone could monitor major incidents via PagerDuty dashboards, even if they were not yet onboarded in PagerDuty. With one click, incident managers could quickly bring in specialized teams with diverse skills, including senior executives or product experts.

This also fulfilled compliance and management’s visibility requirements, as PagerDuty gave them a single pane of glass through which they could view the entirety of the system.

“All of our major incident management was now being done in PagerDuty, and if the incident occurred outside of it, then our major incident managers would sync it up with PagerDuty,” John explained. “On top of this, they’re using response plays to bring in executives to help us make quick decisions. As a result, we’re getting major benefits, especially on MTTR.”

This better-together approach means that central IT has visibility and access across distributed teams. This will be essential as IHS Markit continues its growth journey.

What’s next for IHS Markit?

Looking ahead, IHS Markit will continue to centralize visibility. “There is a huge expansion of agile and DevOps methodology across the company, which means we need to think about the next evolution of our converged model for incident management,” John said.

Maintaining DevOps’ ability to be “agile and autonomous” will also be a major focus. “We need them to be able to create their own technical services, so that means thinking about the technical services in ServiceNow and whether they need to be hooked into our hierarchy there,” John explained. “Governance is important too—how we maintain the quality of the system and how that’s governed centrally.”

As digital transformation continues and teams are more distributed than ever, it’s key that business processes for managing urgent work can operate in real-time. To find out more about how PagerDuty can enhance ServiceNow and other ITSM tools for faster resolution times and enhanced coordination, check out these resources:

How Your ITSM Tool & PagerDuty Make a Dynamic Duo for Real-Time Work

Enhance your ITSM

Solutions brief: Extend ITSM Workflows with PagerDuty

And, if you’re ready to see PagerDuty in action, try us out for free for 14 days.

Get to the Root (Cause Analysis) in 5 Easy Steps by PagerDuty University https://www.pagerduty.com/blog/get-to-the-root-cause-analysis-in-5-easy-steps/ Wed, 10 Aug 2022 13:00:45 +0000 https://www.pagerduty.com/?p=77861 What is one of the first things you should do when you are assigned an incident via PagerDuty? If you immediately thought “Acknowledge!” you are...

The post Get to the Root (Cause Analysis) in 5 Easy Steps appeared first on PagerDuty.

What is one of the first things you should do when you are assigned an incident via PagerDuty? If you immediately thought “Acknowledge!” you are not wrong, but after that, it’s all about resolving the issue as quickly and painlessly as possible. The first step to resolution is to investigate what caused the incident in the first place so you can easily get a fix in place.

In the PagerDuty platform, Root Cause Analysis* refers to a set of features that aims to provide you, the responder, with as much context and actionable intelligence as possible. By surfacing past and related incidents, as well as insights into incident frequency, responders will have tools to quickly gain the situational awareness they need to determine probable root cause and speed up triage, and ultimately resolve faster. Likely origin points based on historical data will also be highlighted to help add context. 

Here are the five places on the incident details page to help you investigate the potential root causes:

  1. Outlier Incident
    When you first open an incident, look for the Outlier Incident classification label. It sits directly under the incident name and reads “Frequent,” “Rare,” or “Anomaly.” Based on this label, you can quickly gauge whether this incident has occurred before and how you might respond to it based on past experience. Hover over the label to read the definition of each classification.
    Outlier Incident classification label of "Frequent," "Rare," or "Anomaly."
  2. Past Incidents
    Once you have determined how frequently the incident has occurred on the service, navigate to the Past Incidents tab further down the page. A heat map shows when previous incidents like the open one have occurred over the last six months. Look for patterns in the colors (darker colors mean a higher concentration of incidents) or hover over the heat map to see further details about the relevant incidents. Below that are details about the top five past incidents like the open incident (if there are any!), along with when they occurred and who last changed each one. Note: that person is a great resource if you want to ask what they did or see their notes on the incident! To open the incident details page for any past incident, click its hyperlinked title.
    Past Incidents heat map
  3. Related Incidents
    Another quick source of information is the Related Incidents tab. Here you can see whether any ongoing incidents across all services might be related to your issue, unlike Past Incidents, which only shows similar incidents on the same service. Understanding the scope of an incident across the business (is this isolated or part of a larger problem?) can help you understand the impact and quickly identify who you need to collaborate with to fix the problem.
    View of Related Incidents tab
  4. Probable Origins
    Jump-start your triage with the Probable Origins widget on the incident details page. This widget calculates the likely origin percentage based on historical data, such as whether an incident occurred directly before or after an event similar to the current open incident.
    Screenshot of Probable Origins widget
  5. Change Correlation
    Lastly, knowing about any changes to your infrastructure or code that might have caused the incident can greatly accelerate resolution. Change Correlation, displayed under Recent Changes on the incident details page, shows the three recent change events that are most relevant to an incident based on time, related services, or PagerDuty’s machine learning. Each recent change event indicates why the platform surfaced it, helping you narrow down potential causes.
    Screenshot of Change Correlation display
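To build some intuition for the last item, here is a deliberately simplified sketch of ranking change events by time and service. It is not PagerDuty's algorithm (which also uses machine learning); the `ChangeEvent` shape, the 24-hour lookback, and the ranking rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    """A deploy or config change (hypothetical record shape)."""
    summary: str
    service: str
    timestamp: datetime


def recent_relevant_changes(incident_service: str, incident_time: datetime,
                            changes: list[ChangeEvent],
                            related_services: frozenset = frozenset(),
                            lookback: timedelta = timedelta(hours=24),
                            limit: int = 3) -> list[ChangeEvent]:
    """Return up to `limit` changes that happened shortly before the incident
    on the same service or on a related one."""
    candidates = [
        change for change in changes
        # Keep only changes made before the incident, within the lookback.
        if timedelta(0) <= incident_time - change.timestamp <= lookback
        and (change.service == incident_service
             or change.service in related_services)
    ]
    # Same-service changes rank first; within each group, most recent first.
    candidates.sort(key=lambda change: (change.service != incident_service,
                                        incident_time - change.timestamp))
    return candidates[:limit]
```

Even this naive version captures the core idea: a change made minutes before an incident on the same or a dependent service deserves a look before anything else.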

Knowledge check! True or false: The Past Incidents tab displays Resolved Incidents from the same service, while Related Incidents will display only Open Incidents on other services. (see answer at the bottom of the page)

How’d you do? Remember, these are five places you can look to quickly gain context and jump-start your triaging efforts. 

To solve incidents faster and help reduce downtime further, combine this set of Root Cause Analysis features with Noise Reduction and Event Orchestration capabilities. If you need a refresher, take PagerDuty University’s Event Intelligence courses and then show off your ability to work smarter, not harder, by completing the Event Intelligence Certification!

Resources for Next Steps:

Event Intelligence Courses can be found on the PagerDuty University eLearning Portal.

  • Noise Reduction
  • Event Orchestration
  • Root Cause Analysis

Event Intelligence Certification Exam information can be found on this page under “Specialty Product Certification.” As a celebration of this new series launching, we are offering complimentary registration for the exam for 30 days, so register now!

*Footnote: While we refer to this category of features as Root Cause Analysis, PagerDuty is not predicting or identifying root cause. Rather, our features help to create context around incidents to drive faster resolution. It’s also worth noting that there has been an industry shift to adopt the term probable or proximate cause rather than suggesting that there is any one true “root cause.”

Knowledge Check Answer: False. While the statement is correct that Past Incidents only displays resolved incidents from the past that were on the same service, Related Incidents will look at other active incidents – open and recently resolved – across ALL services (including the service your current incident is on) to find if any incidents are related to your current incident.

More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response by Hannah Culver https://www.pagerduty.com/blog/revamped-mobile-app-for-incident-response-2022/ Tue, 12 Jul 2022 13:00:43 +0000 https://www.pagerduty.com/?p=77078 2020 revolutionized how we work. Many went from full-time office work to 100% remote overnight. And now that in-office is once again on the horizon,...

The post More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response appeared first on PagerDuty.

2020 revolutionized how we work. Many went from full-time office work to 100% remote overnight. And now that in-office is once again on the horizon, companies are thinking of ways to continue to work flexibly. However, this comes with increased challenges, and a need for tools that match this working style.

The PagerDuty mobile application is well recognized, with a 4.8-star rating on the App Store and Google Play. We understand how important it is to reach the right people immediately – that’s why we’ve made significant investments in iOS and Android to help responders resolve critical work from anywhere, anytime. 

This blog post covers some of the most exciting improvements: a new navigation interface that surfaces the information you need most; improved incident intelligence drawn from past and present incidents; and automation that triggers diagnostics and remediation actions. 

Easily navigate to find the information you need most

As a responder, you need to know when you’re on call and what services you’re on call for. If and when you do get called, it’s crucial that you can identify how the technical services you’re responsible for are performing. And most importantly, you want to see all this information at a glance in your mobile app, rather than navigating through multiple screens and digging for information buried deep within an app.

With the new PagerDuty mobile home screen experience, the most important information that responders need is readily available. This redesign puts the top open incidents, on-call shifts and impacted technical services front and center, reducing the number of taps needed to navigate through the app. 

The redesigned home screen is now available for early access. If you’d like to try it out, you can fill out this early access form and choose the New Mobile Home Experience selection, and we’ll send you instructions.

Part of working flexibly means having critical incident context at your disposal during the moments that matter most. When an incident begins, you need to get up to speed as quickly as possible to begin making decisions on how to best mitigate impact.

One way to do this is with the new mobile incident details screen. This screen gives you a cleaner visual experience and access to your most important features to help you respond to incidents faster. The most important information about an incident is available to you immediately, such as notes from other responders working on the incident, change events, past incidents, and the latest alerts.

A new carousel on the updated version of the mobile incident details screen also allows you to run a play, add a priority or note, post a status update, and more.

Gain critical incident context through the past or present

When you experience an incident, one of the biggest hurdles to jump is answering the question, “Have we seen this before?” If you have, resolution might be as straightforward as running a play or executing an automation sequence that worked before. But it often can be difficult to find that historical context, and that’s time wasted that nobody can afford.  

  • The Past Incidents feature on the PagerDuty mobile app displays incidents with similar metadata that were generated on the same service as the current, active incident. This additional context facilitates accurate triage and reduces resolution time. For example, you can see whether you, or someone on your team, has been involved in a similar previous incident, and dive into details to discover what remediation steps were taken. 

  • Change Events – Changes within the system are often the culprit behind incidents. They are often overlooked because it can be hard to pinpoint exactly which change caused the incident, especially when many organizations are shipping new code dozens or even hundreds of times per day. However, “Gartner estimates that approximately 85% of all performance incidents can be traced.” Change Events let you look at changes impacting your environment and help you establish the potential root cause. Change Events can be viewed in two areas of the mobile app: the new mobile incident details screen and service details. Either tap on an incident and scroll to Change Events, or navigate to the Service Directory and select a service to view a maximum of two Change Events. Event details displayed include the date and time, summary, service, type, links, and source.

  • Service Dependencies – Another important piece of information during incident response is the impact radius of an issue. One way to glean this information is by understanding Service Dependencies. If a large, customer-facing business service is experiencing problems due to a technical service incident, you’ll need to respond faster and with more contextual intelligence. With Service Dependencies in the mobile app, you can view which services are affected to better understand the scope of the problem. Service Dependencies are listed within each service’s profile in the Service Directory.

Leverage automation for faster response

As technology environments become more complex, it’s more important than ever to conserve people’s time and cognitive resources. This means ensuring that machines, not humans, serve as the first line of defense. 

Automating repetitive manual tasks and well-understood incidents can divert unnecessary toil away from responders so that they can focus on their day jobs, and are only called for the incidents where they’re needed most. One way to do this is with automation that you can run with the tap of your finger.

PagerDuty Automation Actions is now generally available within PagerDuty mobile, empowering you to trigger automated diagnostics and take remediation actions from anywhere, anytime. It improves productivity by automating repeated diagnostic and remediation steps, replacing the toil of manual tasks. In addition to running the scripts, you can view previously run scripts and output reports directly from the mobile app. These update in real time, meaning you never miss a thing.

These latest additions to the PagerDuty mobile application help responders work the way they want without sacrificing time, quality, or customer experience. Flexible work is here to stay, and PagerDuty’s powerful mobile application is committed to helping you make the most of it. If you haven’t tried our mobile application in a while, it’s time to take a second look. Use the QR code to download the app for either iOS or Android.

Important: Ensure your mobile experience is secure

With so many great new features added to PagerDuty mobile, we are introducing new minimum OS requirements to ensure the mobile app continues to be innovative and secure and to improve the user experience. Starting June 27, 2022, future versions of the PagerDuty mobile app will require Android 9.0 or iOS 14.0, or higher. Please ensure your device is upgraded to continue receiving mobile app updates.

The post More Powerful than Ever: PagerDuty’s Revamped Mobile App is Primed for Even Better Incident Response appeared first on PagerDuty.

]]>
Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend by Laura Chu https://www.pagerduty.com/blog/why-operational-maturity-helps-businesses-reduce-the-great-resignation-trend/ Thu, 30 Jun 2022 13:00:15 +0000 https://www.pagerduty.com/?p=76920 Why Operational Maturity helps businesses reduce the great resignation trend The past few years have led to fundamental business and cultural shifts for both companies...

The post Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend appeared first on PagerDuty.

]]>
Why Operational Maturity helps businesses reduce the great resignation trend

The past few years have led to fundamental business and cultural shifts for both companies and employees. 

Covid-19 has brought opportunities for companies that invested early in digital operations, while others struggled to maintain the status quo. That struggle gave rise to record employee burnout and to what is now commonly referred to as the Great Resignation.

According to a CNBC survey, 4.5 million American workers quit their jobs in March 2022, primarily because they were burned out, unhappy with their jobs, or reevaluating their lives. And a 2021 PagerDuty user survey found that 64% of respondents said their organization had seen a rise in turnover, while only 34% said there had been no increase in turnover in the past year.

As the Great Resignation continues, businesses are looking for new ways to keep employees productive and reduce turnover. We know from working with customers that investing in best practices makes a big difference. 

Our Digital Operations Maturity Model steps through common behaviors and codifies cohorts by their readiness to handle real-time operational challenges. Teams can uplevel their maturity and become more proactive by investing in best practices and processes that can help ease their journey with digital transformation. 

What is the Digital Operations Maturity Model?

The Digital Operations Maturity Model helps response teams assess their current maturity stage and architect a new digital operation that prepares them to detect, triage, mobilize for, respond to, and resolve outages and system failures in less time.

The model encompasses five levels of operations maturity:

  • Manual – Developers are slogging through the issues at hand; requests require ad-hoc responses, sometimes throughout the night.
  • Reactive – System failures or customer complaints always lead to firefighting mode, without much coordination or planning.
  • Responsive – Issues are resolved as they occur, with streamlined coordination and planning.
  • Proactive – Issue management is seamless and coordinated, and issues are fixed before customers notice.
  • Preventative – Teams stay ahead of issues before they start and continuously learn from past and current incidents.

The goal of the Digital Operations Maturity Model is to help DevOps teams move toward the Proactive and Preventative stages so they can maintain the consistency, reliability, and resilience of their IT infrastructure. Teams at these stages should experience fewer service failures, faster resolution of customer-facing issues, and less employee burnout, because they have a more predictable workflow and better work-life balance.

The benefits of improved operational maturity

In 2021, Tejere Oteri, a Product Marketing Analyst at PagerDuty, conducted a survey to understand the effect of good operations practices on business impact, operational health, and human factors. When analyzing the survey results, Tejere noticed differences in the way participants from Reactive organizations responded versus those from Preventative organizations.

When asked about their team’s workload, 33% of those from Reactive organizations felt their workload was spread evenly, but the majority wanted to see some workload improvement. When the same question was asked of Preventative organizations, an overwhelming 83% felt their workload was spread evenly, and surprisingly, no respondents disagreed.

Tejere also found that Reactive organizations were twice as likely to experience increased employee turnover as Preventative organizations. Specifically, the data analysis showed that people in Preventative organizations give up fewer personal hours and sleep hours to address work incidents than those in Responsive organizations, leading to less burnout and a lower likelihood of high turnover.

What if we mapped the maturity model to product usage?

To introduce the operations maturity model to our customers and identify trends across our customer base, we mapped the model to product usage behaviors. Scott Bastek, Sr Product Analytics Manager at PagerDuty, looked at each maturity stage through the lens of PagerDuty product adoption and made some interesting findings:

  • Reactive teams rely on monitoring tools to identify incidents, and have not gone as far as configuring a robust response to reduce MTTR
  • Responsive teams can resolve issues as they occur by putting effort into on-call schedules, and by staging multiple levels of defense to maintain the status quo
  • Proactive teams deploy advanced incident response functionality like service dependencies or change events to understand and diagnose issues once they occur
  • Preventative teams leverage noise reduction, event orchestration, or analytics reports to prevent incidents from happening

Where are you on the digital operations maturity model?

Leaders concerned with employee burnout and attrition should assess their current stage of maturity and implement new operational processes by consulting the examples shown in the operational maturity model. The best way to achieve the highest operational maturity is by investing in your DevOps team with on-call scheduling, coordinated issue management, event intelligence, and process automation tools. These investments will help DevOps teams become more proactive and preventative with all incidents, and reduce mean time to acknowledge (MTTA) and mean time to resolve (MTTR) across the organization.
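For teams that want to track these two metrics themselves, here is a minimal sketch of how MTTA and MTTR can be computed from incident timestamps. It is an illustration only, not PagerDuty's implementation: the Incident class, its field names, and the sample data are all hypothetical, and it assumes MTTR is measured from the moment an incident is triggered until it is resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for this sketch.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_duration(durations):
    """Return the arithmetic mean of a collection of timedeltas."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mtta(incidents):
    """Mean time to acknowledge: trigger -> first acknowledgement."""
    return mean_duration(i.acknowledged_at - i.triggered_at for i in incidents)


def mttr(incidents):
    """Mean time to resolve: trigger -> resolution (assumed definition)."""
    return mean_duration(i.resolved_at - i.triggered_at for i in incidents)


# Made-up sample data for illustration.
incidents = [
    Incident(
        triggered_at=datetime(2022, 6, 1, 9, 0),
        acknowledged_at=datetime(2022, 6, 1, 9, 4),
        resolved_at=datetime(2022, 6, 1, 9, 40),
    ),
    Incident(
        triggered_at=datetime(2022, 6, 2, 14, 0),
        acknowledged_at=datetime(2022, 6, 2, 14, 2),
        resolved_at=datetime(2022, 6, 2, 14, 20),
    ),
]

print("MTTA:", mtta(incidents))  # MTTA: 0:03:00
print("MTTR:", mttr(incidents))  # MTTR: 0:30:00
```

Tracking both numbers over time, rather than as a one-off snapshot, is what shows whether new on-call and automation practices are actually moving a team toward the Proactive and Preventative stages.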

Watch “Getting from Reactive to Proactive and Beyond” to learn more about the Digital Operations Maturity Model. In the video, Scott Bastek and Tejere Oteri dive into the levels of operations maturity, and provide more statistical analysis and insights on their findings.

You can also download PagerDuty’s latest eBook and the latest State of the Digital Operations Report for examples of how organizations can achieve the highest operational maturity level.

The post Why Operational Maturity Helps Businesses Reduce the Great Resignation Trend appeared first on PagerDuty.

]]>