Best Practices & Insights | Categories | PagerDuty
https://www.pagerduty.com/blog/category/insights/
Build It | Ship It | Own It
Thu, 07 Sep 2023 21:13:22 +0000

How to Ace Your Services with PagerDuty
by Débora Cambé
https://www.pagerduty.com/blog/how-to-ace-your-services-with-pagerduty/
Wed, 06 Sep 2023 12:00:58 +0000

It’s finals week for the US Open, one of the most celebrated sports events in the world. Tennis is my favorite sport to watch as I’m fascinated by the strength, composure and endurance each player displays while standing by themselves on the court, sometimes during incredibly long matches: the current record is 11 hours and 5 minutes.

Tennis players are fully accountable for the outcome of their matches at every single stage. Their performance directly impacts whether they win or lose. If this sounds familiar, that’s because it is. Service Ownership follows the same approach: “you build it, you own it”. In the context of DevOps, you’re not working alone. But there are definitely lessons to learn from tennis when it comes to building healthy, resilient services. 

The parallel started drawing itself when interviewing Leeor Engel, Director of Engineering for the Incident Response product line. Keep reading and find out his take on how to ace services and how the PagerDuty team used PagerDuty’s own Service Standards functionality to improve the overall maturity of their services.  

What is Service Standards?

When pivoting to a Service Ownership model, organizations often struggle to get clear visibility into their many services and to standardize their configurations. Launched a year ago for all PagerDuty plans, Service Standards guides teams to configure their services better, while helping managers and administrators scale these standards across the organization.

With Service Standards, PagerDuty provides nine standards that each service should fulfill to have the depth and context required to be considered well-configured. Each standard can be toggled on or off.

PagerDuty’s Customer Zero: PagerDuty

After the launch of Service Standards, PagerDuty was its own customer zero. Leeor walks us through the motivation behind this effort: “You wanna get adoption and figure out what the gaps are, get feedback, figure out ways to improve [the product]. Then there was an organizational goal. We talk a lot about what makes a service well configured and what does good look like. So we did a big push to get PagerDuty to be customer zero for that feature. We basically got every team to review all their services. And we actually found that many services did not meet the standards.”

Services varied considerably in their standard compliance, but “under 50%” were fully compliant. Approximately four months later, the goal to reach 100% compliance was achieved. But it’s a constant work in progress to keep it that way: “It can be very difficult, depending on the type of service, to get 10 out of 10 [standards]. So our goal was to get 100% of services to be at least 80% compliant. We got there. But then there’s an ongoing effort to maintain that because new services are created all the time, and it’s easy to forget this. And so our continuous process is what catches those stragglers and gets them compliant.”

If you also want to ace your services, here are four lessons you can draw from tennis dynamics to get there:

Warm-Up

You might have identified the need to standardize your services to play in the best practices court. But maybe your organization has dozens, even hundreds, of services and that feels overwhelming. Where and how should you start?

Lesson #1: Start with the baseline

In tennis, the baseline is where each game begins. It’s where players serve and it’s the foundation for their positioning and strategy. Without a well developed baseline play, there’s no chance of winning. But it needs to be built gradually.  

Similarly, standards work as a service’s baseline level of quality, consistency, and functionality. It’s not about achieving perfection from the outset but rather about having a structured foundation to build upon. Take it from Leeor: “You want to focus on systemic things and define any standard as a starting point. Don’t worry about it being perfect. Just get it in place and have a continuous monitoring regime. And that’s gonna move the needle the most, because that’s going to expose all these other problems you might have in your processes that you need to improve, whatever it might be. It’ll be sort of the gateway to exposing those things and then addressing them, continuously improving.”

Lesson #2: Adapt to the surface

Every tennis player has their own style of play, but they must adapt to the surface they’re playing on, each enabling different dynamics. On grass, for example, rallies are usually shorter, as the ball bounces low and players need to get to it faster – playing the net successfully and mastering the volley is key to success.

In the context of services, recognizing each team’s unique circumstances is a crucial first step when determining which standards that team’s service should follow. As Leeor explains, “teams can have pretty different needs in terms of their services. Sometimes their integration set up is a little bit different. Sometimes they’re not monitoring things that are directly based on code deployments. For example, one of our Service Standards is having at least one change integration – we may have services that don’t. They may be triage services that have email integrations or things like that. Those services still provide value and they need a standard, but they need a slightly different one. There isn’t a one-size-fits-all that works for everyone.”

Win the game

The foundations are set: you have defined your service’s boundaries and standards according to the needs of the team that owns it. Now you need to ensure those standards are complied with. How?

Lesson #3: Avoid unforced errors

An unforced error happens when a player loses a point even though their ability to execute it was completely in their control, i.e. not forced by the opponent.

Teams are responsible for keeping their service standards in check, but in the fast-paced DevOps world that can be tough; services change or new ones might be created depending on business needs. Leeor highlights three essential steps to successfully maintain the balance of your service standards and avoid the unforced error trap:

  • Monitor: With the new PagerDuty Service Standards API you can pull your service standards on a regular basis. This allows you to confirm if the standards are in line with the service needs, if they might need to change or if it makes sense to create exemptions.
  • Report: Create a reporting regime where you define a regular cadence to assess the state of all the services. With PagerDuty Service Standards it’s easy to do so, as the service performance data can be exported out of PagerDuty by the admins and shared as needed to drive accountability and show progress. Admins also have the option to make standards publicly available for the rest of the organization to view. 
  • Educate and be educated: Leeor explains how talking directly and frequently with team owners can raise awareness and educate on the importance of complying with service standards: “For example, business services were not uniformly used across all teams and it’s actually pretty useful. Even just to have a parent business service for your area. Then you can leverage capabilities like the Service Graph or Business Impact features. A system where you can see all your services at a bird’s eye view.” It can also help surface different use cases: “Over time, we developed this process where we could have some exemptions. An example would be testing a service that isn’t in production yet, and it doesn’t yet have the escalation policy. So we set up an exemption process – which ideally was temporary – and we set up some exclusions around specific standards.” 
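
The monitor-and-report loop above can be sketched in a few lines. This is a minimal illustration, not PagerDuty's implementation: the `ServiceScore` record, its field names and the sample standard names are hypothetical, and in practice the pass/fail results would come from the Service Standards API or an admin export. The 80% threshold mirrors the goal Leeor describes, and exempt standards are simply left out of the calculation.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceScore:
    """Pass/fail results for one service (hypothetical record shape)."""
    name: str
    results: dict[str, bool]  # standard name -> passed?
    exempt: set[str] = field(default_factory=set)  # standards excluded for this service

    def compliance(self) -> float:
        """Share of non-exempt standards that pass, from 0.0 to 1.0."""
        counted = [ok for std, ok in self.results.items() if std not in self.exempt]
        if not counted:
            return 1.0  # nothing left to check once exemptions are applied
        return sum(counted) / len(counted)


def stragglers(scores: list[ServiceScore], threshold: float = 0.8) -> list[str]:
    """Names of services whose compliance falls below the threshold."""
    return sorted(s.name for s in scores if s.compliance() < threshold)


# Sample data; service and standard names are made up for illustration.
scores = [
    ServiceScore("checkout", {"escalation_policy": True, "description": True,
                              "change_integration": True}),
    ServiceScore("email-triage", {"escalation_policy": True, "description": True,
                                  "change_integration": False},
                 exempt={"change_integration"}),
    ServiceScore("new-prototype", {"escalation_policy": False, "description": True,
                                   "change_integration": False}),
]

print(stragglers(scores))
```

Running a check like this on a regular cadence is one way to catch newly created services before they drift out of compliance.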

Win the match

Lesson #4: Continuously improve

The beauty of tennis is that the course of a match can change instantly. There is no time limit to a game or even a set, and players don’t depend only on variables they can control: there’s the opponent’s focus and physical condition, the weather, and even the audience. Are they cheering you on?

Tennis is a game of continuous improvement and the same happens with services. Well configured services help scale Service Ownership best practices which, in turn, drive the organization’s operational maturity level.

Here’s Leeor’s number one piece of advice to get there: “The key thing is reporting. Of course you need to establish what your standard is and that may look a little different depending on the business. But really the critical thing is the continuous monitoring and reporting. Mistakes happen, things get missed, humans are humans, right? So you need some process that catches the things that fall through the cracks. Define a standard and continuously monitor it, like you would do with any other process. You’re trying to continuously improve. You need to monitor it.”

Start Acing Your Services

Put all these lessons into practice with the PagerDuty Operations Cloud, the essential platform to get your services in shape and manage all unplanned, time-sensitive, critical work across the enterprise. Learn more here and try our free 14-day trial.

The post How to Ace Your Services with PagerDuty appeared first on PagerDuty.
6 Best Practices for Seamless Notifications with International SMS
by Cristina Dias
https://www.pagerduty.com/blog/6-best-practices-for-seamless-notifications-with-international-sms/
Tue, 05 Sep 2023 12:00:43 +0000

There’s no denying it: in today’s interconnected world, Application-to-Person (A2P) SMS notifications have become an integral part of our daily lives. Whether it’s receiving crucial banking alerts, getting updates from our favorite retailers, or even surfacing a notification from PagerDuty when your service is down–SMS keeps us informed and connected. But have you ever wondered about the intricacies behind this seemingly straightforward technology? It’s more complex than you might think.

Here at PagerDuty, we are dedicated to making sure that notifications reach their intended recipients, and we’ve learned a lot about international SMS along the way. In this blog post, we delve into the best practices and critical factors that can make or break your SMS notifications game, so that you know how to configure your notifications to never miss a page from PagerDuty.

Navigating the Challenges of International SMS

Who better to share learnings than the woman behind the curtain, responsible for our notifications experience? We interviewed Abby Allen, Senior Product Manager of the Notifications Experience Team, to help shed light on the challenges of international A2P SMS and offer some insights for improving SMS deliverability. 

The Illusion of Reliability

Contrary to popular belief, Abby warns that “SMS is far less reliable than people think!” A2P SMS often faces disruptions due to network outages or planned maintenance, affecting message delivery. This unreliability is hidden behind the seamless communication we experience in person-to-person SMS, which is very opaque by design and helps prevent recipients from noticing issues with the underlying carrier networks. Imagine your business-critical alert going undelivered due to an SMS outage–the consequences could be detrimental. Therefore, PagerDuty “monitors our international deliverability and encourages everyone to have a backup channel to reach their users. If your system also offers SMS, make sure there’s a backup communication option for your users, too.”

Global Audience, Unique Needs

Expanding your SMS strategy beyond local borders is an opportunity to tap into a vast international market. Nonetheless, each region has its own regulations, carrier restrictions and user preferences.

Abby highlights some particularly challenging regulations. For instance, nearly every country requires opt-in to confirm a recipient actually wants SMS from you. An easy way to do that is to ask the recipient to “Reply Yes” or click a link to confirm. However…

  • France, Vietnam, and many others require all A2P SMS to be sent from an alphanumeric sender ID. This kind of SMS sender ID is required in a growing number of countries but doesn’t allow replies at all. Your “Reply Yes” won’t work for millions of international users.
  • Norway and other countries will actively block any URL from standard link shorteners like bit.ly. 
  • China and Romania block all SMS with any URL. This makes a workflow with a link to click to confirm opt-in challenging.
  • Singapore: to send SMS to Singapore, you must register your content templates with the Singapore government and pay substantial fees. Any content that doesn’t comply with your registered templates is subject to blocking and filtering. Even if you have a great workflow that doesn’t require replies or links, you still need to get it approved by a government entity. 
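
One way to make these constraints actionable is to encode them as data and let the opt-in workflow adapt per country. The sketch below is a simplified illustration: the country sets contain only the examples named in this article, the `SmsPlan` and `plan_for` names are hypothetical, and treating every unlisted country as unrestricted is an assumption that a real system would replace with a maintained rules table.

```python
from dataclasses import dataclass
from enum import Enum


class OptIn(Enum):
    REPLY_YES = "reply_yes"        # recipient replies to the SMS
    CONFIRM_LINK = "confirm_link"  # recipient clicks a link in the SMS
    OUT_OF_BAND = "out_of_band"    # confirm elsewhere, e.g. in-app or on the web


# ISO country codes for the examples in the article only (not a complete list).
NO_REPLIES = {"FR", "VN"}        # alphanumeric sender ID required, replies impossible
NO_URLS = {"CN", "RO"}           # any URL in the message is blocked
NO_LINK_SHORTENERS = {"NO"}      # shortened URLs such as bit.ly are blocked
REGISTERED_TEMPLATES = {"SG"}    # content must match pre-registered templates


@dataclass(frozen=True)
class SmsPlan:
    opt_in: OptIn
    full_urls_only: bool
    needs_registered_template: bool


def plan_for(country: str) -> SmsPlan:
    """Pick an opt-in method that the country's known constraints allow."""
    if country not in NO_REPLIES:
        opt_in = OptIn.REPLY_YES
    elif country not in NO_URLS:
        opt_in = OptIn.CONFIRM_LINK
    else:
        opt_in = OptIn.OUT_OF_BAND
    return SmsPlan(
        opt_in=opt_in,
        full_urls_only=country in NO_LINK_SHORTENERS,
        needs_registered_template=country in REGISTERED_TEMPLATES,
    )


print(plan_for("FR"))
print(plan_for("SG"))
```

Keeping the rules in data rather than scattered through code makes it easier to update them when a country changes its requirements.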

Neglecting the needs of your international audience can lead to missed opportunities and a less effective communication strategy.

Legal Compliance Matters

International SMS law is a complex web of regulations that can have financial repercussions for non-compliance. Sending unauthorized messages, violating content restrictions, or spamming recipients can result in penalties. Familiarize yourself with the regulations in your target regions to ensure your SMS campaigns are both effective and legally sound.

6 Best Practices for Optimizing International SMS and Notifications Settings

Understanding the challenges is only half the battle. Let’s explore actionable strategies to ensure your SMS notifications hit the mark–regardless of borders.

1. Default to Push, Keep SMS as Backup

Staying proactive is key. Regularly monitor the deliverability of your international SMS to identify potential issues. However, relying solely on SMS is risky. Consider SMS as a supplementary channel rather than your primary one. At the time of writing, push notifications are subject to far less international regulation and provide significantly more opportunities for engaging, interactive messaging. Abby recommends push notifications as one of the most reliable alternatives to ensure your message reaches its destination, along with email and phone calls.
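
For teams building their own notification system, the "push first, SMS as backup" idea can be sketched as an ordered list of channels tried in turn. Everything here is hypothetical: the `notify` function, the sender callables and their behavior are stand-ins for real provider integrations, and stopping at the first confirmed delivery is a simplification (a paging system may deliberately notify on several channels).

```python
from typing import Callable, Optional

Sender = Callable[[str], bool]  # returns True when delivery is confirmed


def notify(message: str, channels: list[tuple[str, Sender]]) -> Optional[str]:
    """Try each channel in priority order; return the first one that delivers."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # An outage on one channel should not stop the fallback chain.
            continue
    return None  # every channel failed; the caller should escalate


# Stub senders standing in for real integrations.
def push(message: str) -> bool:
    raise ConnectionError("push service unreachable")


def sms(message: str) -> bool:
    return True


def email(message: str) -> bool:
    return True


used = notify("Service checkout is down", [("push", push), ("sms", sms), ("email", email)])
print(used)
```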

If you’re a PagerDuty customer, why don’t you give the PagerDuty mobile app a go? This critical component of the PagerDuty Operations Cloud empowers our users with unmatched convenience, agility, and adaptability, ensuring a flawless orchestration of incident management and collaboration across borders. Better yet, push notifications within the app typically deliver 4 to 6 times faster than SMS.

2. Give Other Messaging Platforms a Go

Instead of sticking to traditional SMS, consider exploring other messaging platforms, like Slack, Microsoft Teams or WhatsApp. These platforms provide an international experience with fewer regulatory hurdles, making it easier to reach your global audience seamlessly.

PagerDuty is opening Early Access for Slack as an incident contact method to customers the week of September 20. If you’re interested in participating in the program, you can sign up here. Starting incident response in chat enables customers to immediately reap the benefits of improved MTTA and MTTR. The use of chat will directly tie collaboration to incident management, minimize context switching, and automate manual tasks, enabling faster incident remediation.

3. Tailor Your SMS Content to Industry Regulations

Before hitting send, ensure your SMS content adheres to the regulations of the target region. Certain industries, such as banking, finance and adult content, face stringent content restrictions to protect citizens against spam. If you manage an application that supports any of these industries, involve your legal team to avoid violations that could lead to hefty fines and damage your brand’s reputation.

4. Plan for Change

SMS regulations are a moving target. What works today may not work tomorrow due to shifting requirements. Prepare for change by having backup communication channels in place and being adaptable to evolving regulations.

5. Optimize for User Experience

Put yourself in your users’ shoes. Craft SMS messages that are concise, relevant, and valuable. This user-centric approach enhances engagement and minimizes the chances of your messages being marked as spam.

6. Monitor Success at the Country Level

When conducting international SMS deliveries, it’s advisable not to depend solely on a single global SMS deliverability metric to validate the effectiveness of your campaigns. Instead, consider delving into insights at the country-specific level. This approach ensures that the achievements of larger and established markets with successful SMS campaigns don’t overshadow potential deliverability hurdles in your emerging ones.
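
A small worked example shows how a healthy global number can hide a struggling market. The data below is invented for illustration, and the `delivery_rates` helper and the 90% alert threshold are assumptions rather than recommended values.

```python
from collections import defaultdict


def delivery_rates(records):
    """records: iterable of (country, delivered) pairs.

    Returns (overall_rate, {country: rate}).
    """
    sent = defaultdict(int)
    delivered_ok = defaultdict(int)
    for country, delivered in records:
        sent[country] += 1
        delivered_ok[country] += int(delivered)
    if not sent:
        raise ValueError("no delivery records to analyze")
    per_country = {c: delivered_ok[c] / sent[c] for c in sent}
    overall = sum(delivered_ok.values()) / sum(sent.values())
    return overall, per_country


# Invented sample: a large established market and a small emerging one.
records = (
    [("US", True)] * 970 + [("US", False)] * 30
    + [("BR", True)] * 12 + [("BR", False)] * 8
)

overall, per_country = delivery_rates(records)
needs_attention = sorted(c for c, rate in per_country.items() if rate < 0.9)

print(f"overall: {overall:.1%}")
print({c: f"{r:.1%}" for c, r in per_country.items()})
print(needs_attention)
```

In this sample the overall rate is above 96% even though only 60% of messages reach the smaller market, which is exactly the pattern a single global metric would miss.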

As technology bridges global divides, international SMS remains a potent tool for businesses to connect with audiences worldwide. However, the path to SMS excellence is paved with challenges that demand proactive measures. By embracing best practices such as diverse messaging platforms, regulatory compliance and robust monitoring, you can unlock the true potential of international SMS and deliver unparalleled user experiences.

Ready to Dive Deeper?

Abby, along with PagerDuty colleagues Girish Shankarraman and Vivek Raj Saxena, joined Mandi Walls for a “How To Happy Hour” Twitch stream to share these learnings and more. Catch the recording to learn from their expertise and enhance your SMS strategies. Join in the discussion in the comments section.

If you’re interested in trying the PagerDuty mobile application, you can simply scan or click the QR Codes below and download it today.

 

The post 6 Best Practices for Seamless Notifications with International SMS appeared first on PagerDuty.
10 Years of Failure Friday at PagerDuty: Fostering Resilience, Learning and Reliability
by Cristina Dias
https://www.pagerduty.com/blog/10-years-of-failure-friday-at-pagerduty-fostering-resilience-learning-and-reliability/
Tue, 25 Jul 2023 12:00:28 +0000

In today’s fast-paced and ever-evolving world of technology, failure is inevitable. Organizations should embrace failure as a learning opportunity for how to build and deliver more resilient services. At PagerDuty, we’ve practiced Failure Friday for 10 years now. Failure Friday–a practice inspired by the chaos engineering space–involves intentionally injecting failures into our systems to improve reliability and foster a proactive engineering culture.

We’ve interviewed Stevenson Jean-Pierre (SJP), a Senior Engineering Manager, and Mandi Walls, a DevOps Advocate, to help us understand the Failure Friday practice at PagerDuty and how the practice is adopted across the industry.

The origins and evolution of Failure Friday

On June 28, we marked ten years of Failure Friday at PagerDuty. What started out as a weekly practice quickly evolved beyond that. As the teams involved continued to learn and grow, Failure Friday evolved with them over the decade.

SJP manages database reliability and infrastructure teams. For Failure Friday process leaders like him, this initiative serves multiple purposes. It deepens engineers’ understanding of systems, allows teams to get creative while testing failure scenarios, and enables controlled production changes with key stakeholders involved. As SJP explains, “Instead of waiting for things to fail in natural ways in our environments, we produce failure scenarios in our own infrastructure to better understand how they work.”

While the core concept of Failure Friday remains consistent, SJP and his teams have evolved their approach over time. Initially, automated failure testing–randomly disrupting parts of the infrastructure or stopping services–was common. Now, the focus has shifted towards intentional failure testing, targeting areas like load and performance. This deliberate shift allows teams to gain actionable insights into specific failure modes and bottlenecks, and to optimize their systems more effectively.

Additionally, Failure Friday is no longer limited to a specific day of the week and occurs based on the team’s needs. SJP emphasizes that failure can happen any day, and embraces this concept beyond Fridays, calling it “Failure Any Day.”

So how does it work?

The Failure Friday process at PagerDuty involves planning and executing failure scenarios, with specific objectives and hypotheses. SJP and his team identify the system to be tested, define the failure scenario, and select the stakeholders involved. An Incident Commander (IC) leads the process, ensuring it mirrors the response during a real incident. The process is documented and analyzed, allowing stakeholders to provide their perspectives. A postmortem is conducted to identify areas for improvement, with its scope depending on the scenario being tested. If there are enough surprises or lessons learned, the team gives the postmortem a more formal treatment or invites a wider audience, as the level of detail and rigor warrants. Otherwise, it is conducted by the team that owned the Failure Friday.

For SJP, one key takeaway from Failure Friday is the realization of how complex digital infrastructure has become with all of its interconnected data and systems. Emergent behaviors can surprise even experienced engineers. By inducing failures, his teams gain a more holistic understanding of the system and uncover hidden dependencies. This knowledge equips SJP’s teams to better handle real-world incidents and prevent customer impact. It empowers them to design more robust software and ultimately enhance system reliability. 

SJP mentioned that Failure Friday also builds trust within the organization. The engineering teams have successfully built a culture of reliability, where a proactive approach to failure is embraced and valued.

Fostering a culture of innovation and learning in DevOps with Failure Friday

Failure Friday is a practice that has gained popularity in the DevOps community. It’s used as a means to foster innovation, enhance system reliability, and improve the customer experience. Mandi shed light on the significance of Failure Friday in the DevOps sphere and its impact on organizational culture.

For her, the clear focus of Failure Friday is the customer experience. Mandi emphasized the need for graceful error handling, clear communication to users, and minimizing disruptive errors that result in poor customer experiences. For her, simple actions can significantly improve customer experience: “Do you have more graceful handling of certain errors? Do you pop up a nice message to the user? Or do you just report a 503 error? That’s not a very good customer experience.”

According to Mandi, implementing Failure Friday or similar practices may come with benefits and challenges for organizations. These practices foster collaboration among various stakeholders, including engineers, product managers and business owners. Integrating Failure Friday into DevOps processes promotes better alignment and understanding of failure impacts. Additionally, Failure Friday contributes to developing a positive organizational culture by creating a low-stakes environment for open discussions and learning. This encourages psychological safety and a blame-free atmosphere, facilitating honest conversations and a proactive approach to system resilience.

However, challenges may arise in introducing intentional errors into production systems due to stability concerns and limited testing capabilities. Tools and services can help mitigate these concerns, making testing in production more secure and accessible. Fostering collaboration between stakeholders requires effective communication and coordination, often necessitating adjustments to existing workflows and structures. Moreover, embracing Failure Friday and cultivating a blame-free environment may require a shift in organizational culture, which can be challenging but essential for the success of such initiatives.

Ultimately, for Mandi, Failure Friday positively impacts team collaboration and communication within an organization. The practice encourages teams to engage in honest discussions, enhances trust, and fosters a proactive approach to system resilience and customer satisfaction. At the end of the day, investing in building resilience will pay off in better digital experiences for your customers.

How to start Failure Friday in your organization

For those interested in experimenting with Failure Friday, SJP suggests starting small and gradually scaling up. Mandi’s top suggestion is for organizations to prioritize building psychological safety and creating blame-free environments. As she says, “Failure Friday is not just a practice; it’s an opportunity to foster a culture of collaboration and resilience.”

SJP strongly believes that “other teams and organizations can benefit from adopting a similar approach.” Recognizing that failures are inevitable, he emphasizes the value of understanding systems comprehensively and adopting a failure-mode mindset. Whether running full Failure Friday exercises or starting with tabletop exercises, organizations can enhance their engineering practices and cultivate a culture of resilience.

If you’re eager to learn more and join the discussion…

Don’t miss out on a chance to learn from industry experts and discover how Failure Friday can revolutionize your organization and DevOps practices. Watch the recorded Twitch stream and be part of the dialogue in the comments. Get ready to embrace failure as a pathway to success in the world of technology!

The post 10 Years of Failure Friday at PagerDuty: Fostering Resilience, Learning and Reliability appeared first on PagerDuty.
Strengthen Your DORA Metrics with PagerDuty
by Mandi Walls
https://www.pagerduty.com/blog/strengthen-your-dora-metrics-with-pagerduty/
Fri, 30 Jun 2023 12:00:56 +0000

For technical teams, the findings from DORA provide a model for measuring and improving performance. With almost a decade of data gathered from more than 33,000 professionals worldwide, the capabilities and frameworks detailed by the research help teams pinpoint areas for improvement and areas to celebrate. 

The team at DORA categorizes capabilities into three sections: Technical Capabilities, Process Capabilities and Cultural Capabilities. These are all important considerations for teams hoping to use the DORA findings to improve their own performance. 

The DORA research is tool agnostic, and not prescriptive on how teams should go about improving their performance or prioritizing their goals, so it can be challenging to put together a tactical plan. With the myriad tools available to teams today, putting together the best combination for your team seems daunting.

If your team is using the DORA metrics as part of your improvement goals, the PagerDuty Operations Cloud can help. As a central component of your environment, the visibility PagerDuty brings will help with a number of key metrics and complement the tools that impact others. Several PagerDuty features fit well into the capabilities described by the DORA team.

Reducing Unplanned Work

A key benefit of using PagerDuty for incident response is reducing the overall time spent on unplanned work across the organization. This is important in several ways unrelated to the DORA capabilities, primarily around making better use of resources and focusing on work that provides the most value to the organization. Less time spent in incidents = more time spent delighting customers.

One of the challenges of incidents and unplanned work from an organizational culture perspective is that they can be invisible: the time isn’t tracked in work plans or documented as part of the time required to build a new feature. Making unplanned work visible helps teams manage the burden and work towards improving outcomes.

Work in process limits and visual management capabilities can both be improved by deploying PagerDuty to capture the impact of unplanned work and incidents on your team. Including data from PagerDuty related to how many engineers are impacted by out-of-hours incidents, overnight or “sleep time” incidents, and even work-day incidents gives managers additional insight into how teams are performing. 
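
As a rough illustration of that breakdown, incidents can be bucketed by when they started. The sketch below is not how PagerDuty computes its analytics: the hour boundaries (sleep from 22:00 to 08:00, business hours from 08:00 to 18:00 on weekdays) are assumptions chosen for the example, and the timestamps are assumed to already be in the responder's local time.

```python
from collections import Counter
from datetime import datetime


def classify(started_at: datetime) -> str:
    """Bucket an incident start time; boundaries are illustrative assumptions."""
    if started_at.hour >= 22 or started_at.hour < 8:
        return "sleep"
    if started_at.weekday() >= 5 or started_at.hour >= 18:
        return "off_hours"  # weekends and weekday evenings
    return "business_hours"


# Invented incident start times (responder-local).
incidents = [
    datetime(2023, 6, 26, 10, 0),   # Monday morning
    datetime(2023, 6, 27, 3, 15),   # Tuesday, middle of the night
    datetime(2023, 6, 28, 19, 30),  # Wednesday evening
    datetime(2023, 7, 1, 14, 0),    # Saturday afternoon
]

breakdown = Counter(classify(ts) for ts in incidents)
print(dict(breakdown))
```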

Integrations and Extensions

Integrations and extensions are powerful features that place PagerDuty at the center of your operational capabilities. 

Integrations allow PagerDuty to receive information and alerts from other services, interrogate them, assign them to services, and initiate incidents. PagerDuty integrates with many third party services that provide monitoring, observability and tracing functionality for various types of events in your environment.

Extensions help you streamline your PagerDuty workflows with third-party tools like Slack, Microsoft Teams, Jira Cloud and Zoom that enhance your incident response experience.

Integrations and extensions mean that your teams can bring any number of tools with them to PagerDuty. The flexibility provided by more than 700 integrations gives all of your teams the tools they need, whether they are running machine learning applications, web platforms or databases. 

Selecting the right tool for the job saves teams time and confusion, but it shouldn’t sacrifice the ability to respond to incidents and preserve reliability. Make PagerDuty part of your cross-team baseline, as described in the capability Empowering Teams to Choose Tools.

Change Events

Change events are non-alerting events that can be sent to PagerDuty from your build and deploy tools. They give your teams insight into what has changed in a service, and are an invaluable first-stop when investigating incidents. 
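
As a sketch of what sending one looks like from a deploy script, the example below builds a change event and posts it to the Events API v2 change endpoint using only the Python standard library. The payload shape follows the Events API v2 documentation as best understood here, so check the current reference before relying on it. The routing key, source name and link are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone

CHANGE_EVENTS_URL = "https://events.pagerduty.com/v2/change/enqueue"


def build_change_event(routing_key, summary, source, details=None, link=None):
    """Assemble the JSON body for a change event."""
    event = {
        "routing_key": routing_key,  # integration key of the target service
        "payload": {
            "summary": summary,
            "source": source,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
    if details:
        event["payload"]["custom_details"] = details
    if link:
        event["links"] = [{"href": link, "text": "View deployment"}]
    return event


def send_change_event(event):
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        CHANGE_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


event = build_change_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Deployed checkout-service v1.4.2",
    source="ci-pipeline",
    details={"commit": "abc123", "environment": "production"},
    link="https://ci.example.com/builds/42",
)
print(json.dumps(event, indent=2))
# send_change_event(event)  # uncomment once a real integration key is set
```

Calling the sender at the end of each successful deployment gives responders a timeline of recent changes to consult when an incident opens.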

Having a good continuous delivery practice is a key capability for DORA-aligned teams, and the research shows that teams using continuous delivery practices spend less time on unplanned work. Your team can further reduce the time it does spend on unplanned work by using change events.

Change events can also provide wider visibility for deployment changes in your environment, improving your deployment automation, and helping streamline change approval. Even when teams are using different build and deployment tools, their change events can be captured by PagerDuty to help manage service reliability.

Automation

PagerDuty’s various Automation solutions play an important role in not just incident response, but in the completion of general technical tasks. Process Automation, Runbook Automation, and Automation Actions all contribute to teams spending less time on simple well-understood tasks and more time on work that adds value.

Many DORA capabilities emphasize automation, so using a general-purpose tool provides teams with a single interface for automating many tasks. One of the key capabilities strengthened by automation is cloud infrastructure, in which on-demand self-service is a requirement. Teams using PagerDuty’s automation solutions can create jobs and delegate work to whomever in the organization requires the tasks to be completed, creating true self-service workflows.

Terraform Provider

Related to automation is PagerDuty’s support for Infrastructure as Code (IaC) via the Terraform provider. IaC and similar solutions help teams track their changes to infrastructure and other components via Version Control, another of DORA’s technical capabilities.

Managing a large PagerDuty environment can be complicated using only the UI, but making use of Terraform to create objects and provide teams with templates helps everyone on the team improve the reproducibility and traceability of their changes.

Service Graph and Business Services

Finally, PagerDuty’s Service Graph and Business Services features enable teams to create relationships among services, illuminating the impact of incidents when they happen in large environments. Status Pages give the entire organization a place to look for service impacts during an incident, and how they relate to the customer.

Business services in PagerDuty are representations of user-facing features; users might see odd behaviors in a shopping cart experience, but will have no idea which backend service is causing the behavior. Building the relationships in the service graph provides data to your organization about the health of user-facing features and capabilities while also helping responders troubleshoot issues with dependencies.
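To make the idea of walking from a failing backend to the user-facing feature concrete, here is a small Python sketch. The service names and the dependency map are invented for illustration, and this is not how PagerDuty stores or queries its service graph; it only shows the kind of traversal a dependency graph makes possible.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "Shopping Cart": ["cart-api", "payments-api"],
    "Search": ["search-api"],
    "cart-api": ["inventory-db"],
    "payments-api": ["inventory-db"],
    "search-api": ["search-index"],
}
# Hypothetical set of user-facing business services.
BUSINESS_SERVICES = {"Shopping Cart", "Search"}


def impacted_business_services(failing_service):
    """Walk the graph upward from a failing service to user-facing features."""
    # Invert the map so we can go from a dependency to its dependents.
    dependents = {}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failing_service}
    queue = deque([failing_service])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & BUSINESS_SERVICES)
```

In this toy graph, a problem in `inventory-db` surfaces as an impact on the shopping cart, which is the connection a shopper could never make on their own.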

These features will help strengthen a team’s Monitoring and observability capability, as well as the Monitoring systems to inform business decisions capability.

Do more DORA with PagerDuty

The tools your team uses should help your organization reach its goals. Implementing PagerDuty’s features will help your organization improve not only in responding to incidents, but also in creating reliable services your users love. To learn more, visit our website and join our community forum.

The post Strengthen Your DORA Metrics with PagerDuty appeared first on PagerDuty.

Updating Your Tools for API Scopes by Mandi Walls https://www.pagerduty.com/blog/updating-your-tools-for-api-scopes/ Wed, 24 May 2023 12:00:53 +0000 https://www.pagerduty.com/?p=82646 The PagerDuty REST API provides  200+ endpoints for users to programmatically access objects and workflows in the PagerDuty platform. Teams leverage these APIs to streamline...

The post Updating Your Tools for API Scopes appeared first on PagerDuty.

The PagerDuty REST API provides 200+ endpoints for users to programmatically access objects and workflows in the PagerDuty platform. Teams leverage these APIs to streamline creating and managing users, teams, services and other components for their environment.

Up until now, access to the REST API has been authorized and authenticated via API Keys. These keys are managed in the web UI and provide all-or-nothing access to the objects in the account, which makes them too permissive for many teams. To address this, PagerDuty Engineering has been working on a comprehensive set of API scopes for use with OAuth2.0 tokens.

Each object in the PagerDuty REST API will have at least one scope, read, and many objects will also have write. You’ll be able to tune your apps to have only the access they are required to have to work correctly, without worrying that they’ll also have access to everything else in your account.

Existing API keys will continue to work for the time being (stay tuned for potential EOL in the future), but migrating to Scoped OAuth will help teams manage access and adhere to the principle of least privilege.

For an introduction to API Scopes, check out this video on our YouTube channel.

Set up an App

The first change you’ll notice when setting up API access with scopes is that the access is managed via an app. These won’t be managed via the “API Access Keys” section under the “Integrations” menu; you’ll want to head to “App Registration” (formerly known as “Developer Mode”) instead, and proceed through the app configuration process. Setting these up is limited to the administrators and owners of your account. 

When the app is created, you’ll have the option to add Scoped OAuth to it:

Screen Capture of the PagerDuty platform web UI. There are two description boxes, one for “Scoped OAuth” and one for “Classic User OAuth”. They both have buttons marked “Add”.

For apps that support Scoped OAuth, the next dialog will allow you to select which objects should be available to Tokens derived from this app access. You’ll be able to select as many or as few as are appropriate for your use case:

A screen capture of the PagerDuty platform web UI, featuring the top portion of a table as part of a form. The top title line of the table includes three columns: “Resource”, “Read Access” and “Write Access”. Read Access and Write Access both have unchecked checkboxes beside them as options for the form to submit the same selection for all following options. Subsequent lines on the table list individual resources. Each resource also includes “Read Access” with a checkbox, and some resources also include “Write Access” with a checkbox. All boxes are unchecked.

After clicking Save, you’ll be presented with a popover window containing two important pieces of information: the Client ID and Client Secret that will be used to provision tokens for this app access.

A screen capture of the PagerDuty platform web UI, showing a dialog box. The title of the box is “New client secret”. The instructions read “Make sure to copy your client secret now as you will not be able to access it again once you leave this page. If you lose access, you will have to delete the OAuth application and create a new one.” There are two text boxes. The first one is labeled “Client ID” and contains an alphanumeric string. There is a button beside the text box to copy the data to the clipboard. The second box is labeled “Client Secret” and also contains part of a string with most of the characters blurred out. It also has a copy-to-clipboard button. The bottom of the dialog box contains a red button labeled “Hide Client Secret Forever”.

You’ll need these to provision tokens, so store them in a vault or other location or app you use for certificates, passwords, or secrets.

Finding Scopes

As you can see from the above screen capture, there are a lot of potential scopes your tokens might need, depending on which objects will be accessed via the API. 

Fortunately, the API documentation has been updated to include the necessary scopes for all of the object endpoints. Each type of request will have a note that includes the scope and whether the request will require read or write access. Broadly, though, requests using GET methods – used for listing or getting information – will only require read access, while PUT, POST, and DELETE requests will require write access.
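The broad rule above can be sketched as a small Python helper. This is only a rule of thumb under the assumption that a scope is written as the resource name followed by `.read` or `.write`; the note on each endpoint in the API documentation remains the authority, and not every resource offers write access.

```python
# HTTP methods that, per the general rule, need write access.
WRITE_METHODS = {"PUT", "POST", "DELETE"}


def required_scope(resource, method):
    """Guess the scope a request needs from its HTTP method."""
    method = method.upper()
    if method == "GET":
        access = "read"
    elif method in WRITE_METHODS:
        access = "write"
    else:
        raise ValueError(f"no rule of thumb for method {method!r}")
    return f"{resource}.{access}"
```

For example, `required_scope("users:contact_methods", "GET")` yields `users:contact_methods.read`, the same string shown in the documentation example.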

An example from the PagerDuty API documentation. The text reads “Scoped OAuth requires: users:contact_methods.read”.

Requesting Tokens

Tokens, associated with the Scoped Client within your app, will be requested from https://identity.pagerduty.com/oauth/token using the credentials you received when the app was created. You’ll find the request format in the API docs here. The other required data are your region (US or EU) and subdomain (youraccount.pagerduty.com). Each token you request must specify which scopes it will be valid for.

curl -i --request POST \
  https://identity.pagerduty.com/oauth/token \
  --header "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "client_id={CLIENT_ID}" \
  --data-urlencode "client_secret={CLIENT_SECRET}" \
  --data-urlencode "scope=as_account-us.companysubdomain incidents.read services.read"

The scopes included in a token can be the full set of scopes included in the app, or a subset of those scopes, depending on how your organization prefers to manage the tokens. The tokens are returned in a JSON document:

{
  "access_token": "pdus+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "scope": "as_account-us.companysubdomain incidents.read services.read",
  "token_type": "bearer",
  "expires_in": 2592000
}
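For scripts that are not written in shell, the curl request above can be sketched in Python using only the standard library. The function name and argument names here are illustrative, not part of any PagerDuty client; the function only builds the request so that it can be inspected before anything is sent.

```python
import urllib.parse
import urllib.request

TOKEN_URL = "https://identity.pagerduty.com/oauth/token"


def build_token_request(client_id, client_secret, region, subdomain, scopes):
    """Build (but do not send) the POST request for a scoped token."""
    # The scope string starts with the account identifier, followed by
    # the space-separated scopes this token should carry.
    scope = " ".join([f"as_account-{region}.{subdomain}", *scopes])
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` and decoding the response body as JSON should give a document shaped like the one above.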

These tokens have an expiration! This is a big change from the never-expiring API keys. As the expires_in value above shows (2,592,000 seconds), tokens expire after 30 days and will have to be rotated before then.
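The `expires_in` value in the sample response is given in seconds, and 2592000 seconds works out to 30 days. A minimal Python sketch of tracking that expiry follows; the one-day safety margin and the function names are assumptions for illustration, not a PagerDuty recommendation.

```python
from datetime import datetime, timedelta, timezone


def token_expiry(expires_in, issued_at=None):
    """Return the moment a token stops being valid."""
    issued_at = issued_at or datetime.now(timezone.utc)
    return issued_at + timedelta(seconds=expires_in)


def needs_rotation(expires_at, now=None, margin=timedelta(days=1)):
    """True once we are within `margin` of the expiry time."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - margin
```

Storing the computed expiry next to the token in your secrets store lets a scheduled job request a replacement before scripts start failing.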

Additionally, to help folks with repository scanners ensure that tokens are not checked in, each PagerDuty token will start with “pd<region>+”, so either pdus+ or pdeu+ so that tokens are easily identified.
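A repository scanner can key off that prefix. The Python sketch below assumes the characters after the `+` are letters, digits, underscores or hyphens; that character set is a guess for illustration, so adjust the pattern to match the tokens you actually see.

```python
import re

# "pd" + region ("us" or "eu") + "+" followed by the token body.
# The allowed body characters are an assumption, not a documented format.
TOKEN_PATTERN = re.compile(r"\bpd(?:us|eu)\+[A-Za-z0-9_\-]+")


def find_tokens(text):
    """Return every substring that looks like a PagerDuty OAuth token."""
    return TOKEN_PATTERN.findall(text)
```

Running `find_tokens` over each changed file in a pre-commit hook is one way to catch a token before it reaches the repository.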

Once a token is created, how it is distributed and managed is up to you. The tokens are not listed in the web UI or otherwise available on the platform. From within the app page you’ll have the ability to revoke all the tokens associated with the app, but not individual ones.

Using Tokens

If you’ve been accessing the API via custom scripts, you’ll need to do some updates to use the new tokens.  

For shell scripts that use command line tools like curl or wget to make their HTTPS requests, you’ll need to update the Authorization header:

--header "Authorization: Bearer $TOKEN" 

Similarly, in tools like Postman, you’ll want to set Authorization to a Bearer token. For more on how to do this in Postman, see the Postman documentation.

If you are using any of the various client libraries for PagerDuty’s API, check the documentation for those projects to determine if a code change is required. For example, in pdpyras, a session constructor is available specifically for OAuth2.0 tokens:

import pdpyras

# REST API v2 with an OAuth2 access token:
session_oauth = pdpyras.APISession(OAUTH_TOKEN, auth_type='oauth2')
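For Python scripts that do not use a client library, the same Bearer header can be set with the standard library. This is a minimal sketch; the helper name is made up for illustration and the token value is supplied by the caller.

```python
import urllib.request

API_BASE = "https://api.pagerduty.com"


def build_api_request(path, token):
    """Build a GET request that authenticates with a Bearer token."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        method="GET",
    )
```

A call such as `build_api_request("/incidents", token)` can then be passed to `urllib.request.urlopen`, provided the token was requested with the `incidents.read` scope.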

Depending on the programming languages you are using, there may be more sophisticated solutions or support for OAuth2.0 tokens available. Sample code for several languages is also included in the documentation on the developer site.

Managing tokens and apps

How you design your app and token granularity is up to you and the security requirements of your organization. There are a few suggested methods to help you get started. 

Who Provisions Tokens

Because the Scoped OAuth tokens will have a 30-day expiration, organizations with a lot of teams programmatically accessing the PagerDuty API will likely find it easier to share the Client ID and Client Secret with individual teams and allow them to provision their own tokens. Administrators may choose to create apps for each team or type of application to control who will have access to the API and which objects.

Smaller teams, or those who need more control over access to resources, may want to restrict the Client ID and Client Secret to the account administrators, who will then create tokens and share them with teams via secure storage.

Fully Scoped Apps, Limited Scope Tokens

For use cases that will require access to a lot of different objects, you can provision the app to include all scopes. Since tokens can be requested with a subset of all allowed scopes, individual tokens can be constrained to just what the client application will use in the API.

This method can cut down on the number of apps that need to be managed in the PagerDuty account. Administrators can provision apps for teams or departments, and provide tokens with more limited scopes.
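One way to picture this pattern is a helper that checks a requested scope list against what the app allows before building the scope string for a token request. The app scope set below is a hypothetical example, and the check only mirrors on the client side what the platform enforces on its own.

```python
# Hypothetical set of scopes granted to a broadly scoped app.
APP_SCOPES = {"incidents.read", "incidents.write", "services.read", "users.read"}


def token_scope_string(region, subdomain, requested, app_scopes=APP_SCOPES):
    """Build the scope parameter for a token limited to `requested`."""
    extra = set(requested) - set(app_scopes)
    if extra:
        raise ValueError(f"not granted to the app: {sorted(extra)}")
    return " ".join([f"as_account-{region}.{subdomain}", *sorted(requested)])
```

A reporting script might ask for only `incidents.read` and `services.read`, even though the app behind it could grant write access as well.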

Try it out and let us know

Scoped OAuth is available now as an Early Access feature, and will be generally available on all accounts at the end of May 2023. Try it out and let us know what you think! Sign up for early access at https://www.pagerduty.com/early-access/ or contact your account team to learn more. You can also join our community forum if you have questions.

The 4 Types of Incidents as Zombies from ‘The Last of Us’ by Hannah Culver https://www.pagerduty.com/blog/4-incident-types-as-zombies/ Thu, 18 May 2023 12:00:06 +0000 https://www.pagerduty.com/?p=82498 Seems like everyone has watched or is watching “The Last of Us.” This show is based on a video game of the same name. It...

The post The 4 Types of Incidents as Zombies from ‘The Last of Us’ appeared first on PagerDuty.

Seems like everyone has watched or is watching “The Last of Us.” This show is based on a video game of the same name. It features Pedro Pascal (from “The Mandalorian”) and his latest surrogate child, Bella Ramsey (from “Game of Thrones”). But this adventure is challenging for a plethora of reasons. Most notably, zombies. In 2003, a fungus, Cordyceps, brought on a global zombie pandemic. Twenty years later, a few humans are trying to endure and survive in what’s left. Spoiler alert for anyone who hasn’t watched season one yet: it’s hard. And zombies are scary.

While incident response is rarely life or death, it can be an adrenaline spike akin to watching the show. And some of the incidents you may face have similarities to the zombies we’ve seen so far in “The Last of Us.” These incidents have a “headshot” that can help you survive against all odds.

Runners

The first zombies we see in “The Last of Us” are runners. These are fresh and may still look human compared to ones that have been infected longer. While easy to kill, there’s one factor that makes them dangerous: you never expect them. They’re novel. For those who were around for the 2003 end of the world, zombies were only a work of fiction. Nobody prepared for the end of the world (except for Bill, *sob*). For those people hanging on in 2023, runners are still jarring. They’re usually someone you know. A friend. And they change fast, as we saw at the end of episode five.

If we had to compare this one to an incident, it’d be the one that happens out of nowhere and is rare or an anomaly. The system is fine! Then it’s not and you’re thinking, “How did I miss that?” So what do you do about it? Look for the signal in the noise that tells you something is going wrong. For those infected, this could be twitching, coughing or unexpected mood swings.

There are similar warning signs for an incident. Latency a bit high? Could be nothing. But combine it with customer support noting an increase in complaints about slowness? You may have a runner. Monitoring only gets you so far. You need to make sense of the data, both from machines and humans. Correlating that data with changes in your ecosystem can help you attack your runner before it bites you.

Stalkers

This is your garden variety of zombies. They’re not too difficult to kill. They don’t have any special abilities to speak of. And you can almost always expect them. Going down into the basement of an abandoned gas station? Of course you’ll find a stalker. Empty mall? Yep, you should have known, Ellie and Riley! Stalkers aren’t fun by any means, and can be deadly. More often than not, though, the average survivor can take care of a stalker. But what happens when there are a few stalkers all at once? Or you find 12 stalkers back to back, all in the same day? What if you’re fighting two to three stalkers every day for a year?

You can see where I’m going with this. Stalkers are like death by a thousand cuts. The more you have to tangle with them, the more dangerous they are. Like your most common incidents. They’re not fire drills, they’re annoying. And one isn’t so bad, but one every single day hurts. It takes time away from value-add work to fix something that you’ll need to fix again soon.

Automation isn’t something that Joel and Ellie can do in their world. But in our zombie-free existence, we can apply it to make incident response more efficient. For well-understood issues and incidents that happen frequently, crafting auto-remediation to resolve the problem without human intervention can immediately add time back to your day. And, it’s a great way to drive automation initiatives within the organization. Solving this small but frequent problem has a direct ROI associated with it. Leverage that to further automation initiatives for other types of incidents.

Clickers

Clickers are ominous, obsessive hunters that are harder to kill. As they’re blind, they use echolocation to hunt their prey. Headshots don’t work as their heads are armored with tough fungus. They’re one of the most feared and hated types of zombies in “The Last of Us,” and it’s easy to see why. Can you imagine coming up against this thing and realizing your typical solution doesn’t work the way it should? And against a more dangerous enemy?

This one may be the hardest to correlate to an incident, because clickers seem to be almost impossible to kill in the show. Everyone’s advice? Run. Before they hear you. But with incidents, you can’t do that. So, if this zombie was an incident, it would be the one that only two or three people have seen before. You’ve heard about this issue, and it’s from deep in the tech stack. But not enough people who knew about this incident shared with the class. When it happens, it feels like a bigger issue than it is.

Like a knife to the neck of a clicker, there’s a solution to this type of incident. And success comes down to the same thing: knowledge and a plan. If you know that a clicker’s head has armor, you go for the neck. It’s close combat, but effective. And since enough people have survived clickers, the knowledge spread across the surviving population.

For an incident, the best way to fix your clickers is documentation, runbooks, and historical context. Someone knows how to resolve the problem. If they share this knowledge, teams can document the process and create a runbook for the next time this scary (but repairable) problem happens. Additionally, teams can rely on AI to surface past incident data. Look-alike incidents have lots we can learn from. This past incident data helps teams understand what worked for an incident and what didn’t. If you don’t have AI to assist, you can always scan through old retrospectives as well for this historical context. Centralizing all this information is also important so that everyone can find it. That way, you may not know how to solve every problem that happens, but you know how to find that knowledge. There’s power in that, even if there’s no perfect “headshot.”

Bloaters

Bloaters look more like the demogorgons in “Stranger Things” than something that was, at one point, a human. They kill most people in the vicinity either by brute force or toxic clumps of fungus that they toss in the air like grenades. We only saw one of these in “The Last of Us” so far and it made quite the impression, annihilating most of the fighting population of Kansas City. Bloaters should be avoided at ALL costs. And any signs of them should be dealt with early before the issue compounds. Remember how the zombies were filling up the tunnels and the rebels had other initiatives to take care of? Yeah, that was technical debt and someone should have fixed it.

But that’s the way it goes. You know there’s a problem, even if you don’t know exactly how it’ll manifest. Then you’ve got a major incident on your hands–a bloater. And the best and only real way to deal with these is with a coordinated, end-to-end incident response. Make sure that you understand key components of incident response such as:

  • Escalation policies
  • Roles and responsibilities during the incident
  • Communication standards, both internal and external
  • Workflows that you can trigger automatically to take the heavy lifting off responders

With these plans in place, you will be able to resolve the incident more smoothly, faster, and with less customer impact.

What zombie are you worried most about?

What’s keeping you up at night? Fear of an impending bloater, or notifications about yet another stalker? While we may not find the cure for zombies in “The Last of Us,” we can work on technology incidents and make those easier and less catastrophic for us and our customers.

PagerDuty is here to help you improve your digital operations. Whatever challenges you’re facing right now, our team can help you endure and thrive, not just survive. Check out our weekly demos to learn more.

Top 3 Incident Response Problems AIOps Can Help Your Teams Solve by Hannah Culver https://www.pagerduty.com/blog/top-3-incident-response-problems-aiops-can-solve/ Thu, 20 Apr 2023 12:00:58 +0000 https://www.pagerduty.com/?p=81946 More data for data’s sake doesn’t help anyone. What organizations need is more information–actionable insight. With data coming from incoming streams of events and alerts,...

The post Top 3 Incident Response Problems AIOps Can Help Your Teams Solve appeared first on PagerDuty.

More data for data’s sake doesn’t help anyone. What organizations need is more information–actionable insight. With data coming from incoming streams of events and alerts, teams don’t have enough time to look at each one. And they struggle to parse and consolidate this data in order to figure out what they need to do next to resolve an incident. Processing this data to make it more usable and helpful during incident response often results in a rote series of manual, repetitive tasks each time an incident occurs, wasting time. It’s no wonder teams are increasingly turning to AIOps and automation for help. AIOps helps teams turn data into information and reduce that manual work. Let’s break down three ways AIOps allows teams to overcome challenges and reduce customer disruption.

Reducing noise for fewer incidents

Not every alert should become an incident. Yet for many organizations, this is what happens. Even if you’re only experiencing one problem, you may receive dozens or hundreds of pings for the same issue. This is distracting and bogs responders down. Noise should be your first thing to focus on because eliminating it:

  • Gives responders back time when they don’t need to filter out what’s important from what’s irrelevant.
  • Decreases the cognitive load that responders carry. Responders don’t need to think about 63 different alerts. They can focus on the one that matters. This reduces on-call anxiety.
  • Reduces the distractions that get in a responder’s way during an incident. This helps responders focus on getting a fix in place faster.

To reduce noise, you can analyze the noisiest incidents you’re facing. Which ones are the same incident? Take a look at the alerts you’re receiving and see if there’s a way to group them based on event data that you gather from your monitoring tools. What’s loudest? This is an opportunity to fine tune your monitoring tools so they’re only sending you what’s most valuable. Keep in mind that this often requires routine maintenance. Monitoring tools become messy, especially when data is scattered across vendors. You’ll want to gut check this whenever you notice noise levels are increasing.

PagerDuty AIOps makes it easier to reduce alert noise within a single tool. Users can set PagerDuty to ingest and deduplicate events from those disparate signals. Then PagerDuty AIOps groups the events into an existing incident, which prevents a new incident from being created. Teams have access to event data in the form of alerts without extra notifications. The result is that teams can better weather alert storms by bringing focus to what’s needed.
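To show the general idea of grouping rather than PagerDuty’s actual approach, here is a deliberately simple Python sketch that folds alerts sharing a service and a deduplication key into one incident. The field names `service` and `dedup_key` are assumptions for this example, and PagerDuty AIOps grouping is more sophisticated than a plain key match.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Group alerts that share a (service, dedup_key) pair into one incident."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[(alert["service"], alert["dedup_key"])].append(alert)
    return dict(incidents)
```

With this grouping, three alerts about the same full disk become one incident holding three alerts, so the responder is notified once and still sees all of the event data.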

Gaining context for better triage

Technically, all the information a responder needs to resolve an incident exists. But, it’s buried within multiple disparate streams of data. Humans alone cannot condense all this data into succinct actionable insights. This means teams spend a long time looking for answers to questions that they can leverage machine learning (ML) to find instead. ML can look at both historical event data and human interaction. Then ML translates the analyzed data into actionable insights. With ML, teams can answer key questions such as:

  • Where should my team look first?
  • Are other teams working on the same problem?
  • Is this a common incident or completely new?
  • Have we seen this before; how was it resolved?
  • Did any relevant changes occur before this incident?

But developing your own ML can be a daunting task. It requires time and resources such as headcount. Many organizations choose to partner with a vendor instead.

PagerDuty AIOps ML algorithms help surface critical information such as:

  • Probable Origin: determines probable cause based on previous incidents affecting your service.
  • Related Incidents: shares if a current incident is affecting your service.
  • Outlier Incidents: whether this incident happens frequently, rarely, or is a total anomaly.
  • Past Incidents: look at the incident details and see how responders resolved it in the past.
  • Change Correlation: connects with your change integrations to show changes to your service, then leverages ML to correlate patterns between change events and incidents.

Each time this information is surfaced for your team without having to manually dig, you get to resolve the incident faster. That decreased MTTR provides you with more time to focus on value-add initiatives.

Self-healing by crafting auto-remediation

One initiative you can focus on to spend less time firefighting is automation. This is where you can orchestrate a fix and self-heal before the problem even becomes an incident. It’s resolved before it hits a responder. Now someone gets to sleep through the night instead of responding to a notification. But this initiative can seem very intimidating. The reality is that starting small and tackling low-hanging fruit can make self-healing easier than you may expect.

You can identify well-understood resolution scenarios where you can automate the response. These may be scenarios that your team would classify as frequent, or ones where the resolution is straightforward. Teams can then create automation to resolve these without human intervention. Then, as that automation starts to take effect, your teams will start to free up time to work on new automation initiatives.

PagerDuty’s Event Orchestration helps teams create automation that spans the entire technical ecosystem. Event Orchestration enriches and routes events, then kicks off automation to self-heal. This feature allows users to trigger remediations for well-understood incidents via webhook. For more complex issues where auto-remediation might not be a possibility, teams can also leverage automation to kick off diagnostics. This builds upon the triage information responders have when they first view their incident.

Looking to get started with AIOps?

AIOps can help teams see fewer incidents and faster resolution. PagerDuty can help you achieve this, and more, with PagerDuty AIOps. See PagerDuty AIOps in action by requesting a trial or taking our product tour. In the market for AIOps? Read our buyer’s guide.

Calculating Business Value of Automation in PagerDuty Process Automation by Greg Chase https://www.pagerduty.com/blog/calculating-business-value-of-automation-in-pagerduty-process-automation/ Wed, 01 Mar 2023 14:00:40 +0000 https://www.pagerduty.com/?p=81366 Budgets in IT departments are tight these days, so proving a return on investment is essential for justifying or expanding a project. The good news...

The post Calculating Business Value of Automation in PagerDuty Process Automation appeared first on PagerDuty.

Budgets in IT departments are tight these days, so proving a return on investment is essential for justifying or expanding a project. The good news is that automation saves money by reducing the amount of human effort required. It is similar to investing in a robot vacuum cleaner. Despite the upfront cost, you save time (and money) by not having humans do the vacuuming. 

Reporting the value delivered by an automation program can be challenging since the value depends heavily on what is being automated. Your project proposal may forecast time and cost savings by automating certain manual tasks. Tracking and reporting those savings is how you show the business impact of your projects. So how can you simplify tracking and reporting?

We have a feature in PagerDuty Process Automation that can help: the ROI Metric Data plugin. The ROI Metric Data plugin follows the simple principle that every time an automation runs, it delivers value. The automation developer specifies value metrics by defining key-value pairs such as hours_saved: 10 for their automation.

Whenever the job executes, these metric values are added to the log entry of the execution. The plugin also provides an end point to extract the JSON records of these runs, along with other metadata about the executions—making it possible to compile, calculate, and analyze these metrics over time.

Here are some patterns you can follow to track the business value delivered by your automation projects.

Reporting savings from reduced labor costs

The most direct benefit of automating a task is the cost savings of the labor it replaces. Take this use case shared by Robert Powers from Brinks at PagerDuty Summit 2022. Their as-is process was a recurring data transfer job that took a staff member 5 to 10 hours to complete manually.

By automating the process with PagerDuty Process Automation, they turned this process from being ¼ of one person’s job every week into an automated task that takes zero human time.

Chart showing manual cost of running a job at 10 human hours per week vs. cost to automate 20 human hours total.

Cost, opportunity and benefit criteria of a data transfer automation project

To use the ROI Metric Data Plugin to track the value generated in this scenario, you would simply define a metric hours_saved with a value of 10 to include this metric in the execution records of this process. This will give you an easy metric to be able to export to show total hours saved per execution of this process. We chose this arbitrary key-value approach since these values can change over time as you add capabilities to your automated job. This way, you can compare the value of newer versions of your automation to older versions when charting data—provided you don’t change the key names.

For your own scenario, you will want to determine how much time is spent by workers manually completing tasks that you will be automating. This can be as accurate as you want the end result to be. Estimates are OK, or you can develop an average time spent through observations. The average or estimate will be the value you pair with a key such as hours_saved. You can break these out by employee job type if you want to track costs savings, or changes in workload distribution. Simply define more key-value pairs: DBA_hours_saved, senior_engineer_hours_saved. If you want to calculate a return on investment, you’ll also want to keep track of the hours needed to create the automations. You can also define values in monetary value, or convert hours to monetary value during your analysis.

Screenshot from PagerDuty Process Automation showing how to enter key-value pairs for ROI Metrics Data.

Here we have created two key-value pairs to be logged per job run: Hours_Saved : 1.25, and Dollars_Saved : 250.

Upload the job execution data to your favorite reporting tool, such as Tableau. You can chart the compilation of your different metrics by user, and job over time. For example, you can show hours saved from manual executions by users vs. scheduled job runs. You can calculate money saved either directly from metrics you have defined, or by converting different hours metrics to costs.

Graph in Tableau visualizing increasing hours and money saved per user.

Here is an example of charting the logged data showing increasing money and time savings from scheduled job runs and user-invoked job runs.
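If you want to sanity-check the numbers before charting them, the same grouping can be sketched in a few lines of Python. The record layout below (user, trigger, and the two metric keys) is a hypothetical shape chosen for illustration, not the export format of PagerDuty Process Automation.

```python
from collections import defaultdict

# Hypothetical execution records; field names are illustrative only.
executions = [
    {"user": "alice", "trigger": "user", "Hours_Saved": 1.25, "Dollars_Saved": 250},
    {"user": "alice", "trigger": "scheduled", "Hours_Saved": 1.25, "Dollars_Saved": 250},
    {"user": "bob", "trigger": "user", "Hours_Saved": 1.25, "Dollars_Saved": 250},
    {"user": "alice", "trigger": "user", "Hours_Saved": 1.25, "Dollars_Saved": 250},
]


def totals_by(records, field, metric):
    """Sum one metric across records, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[field]] += record[metric]
    return dict(totals)


print(totals_by(executions, "user", "Hours_Saved"))
print(totals_by(executions, "trigger", "Dollars_Saved"))
```

Grouping by trigger separates user-invoked runs from scheduled runs, which is the comparison the chart above makes.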

Converting these metrics to return on investment requires adding the costs associated with implementing the automation. In the customer scenario we shared above, the cost to create the automated process was 20 FTE hours (assuming equivalent labor costs). If this includes maintenance over a year, this looks like: 520 FTE hours saved - 20 FTE hours to automate = 500 hours saved in just the first year of operation.
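The arithmetic can be written as a short sketch. The net figure reproduces the 520 and 20 hour values from the scenario; the ratio (net benefit divided by cost) is a common way to express ROI and is added here as an assumption, since the post itself only reports net hours. The function names are illustrative.

```python
def net_hours_saved(hours_saved: float, hours_to_automate: float) -> float:
    """Hours saved after subtracting the effort spent building the automation."""
    return hours_saved - hours_to_automate


def roi_ratio(hours_saved: float, hours_to_automate: float) -> float:
    """Net benefit divided by cost, assuming equivalent labor costs per hour."""
    if hours_to_automate <= 0:
        raise ValueError("hours_to_automate must be positive")
    return net_hours_saved(hours_saved, hours_to_automate) / hours_to_automate


print(net_hours_saved(520, 20))  # 500
print(roi_ratio(520, 20))        # 25.0
```

Under these assumptions, every hour invested in building the automation returns 25 net hours in the first year.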

Adjusting metrics by automation outcome

Following the principle that automation delivers value whenever that automation runs, we may wish to calculate value according to the outcome of these runs. This would mean filtering out automation runs that are unsuccessful.

There are different reasons why an automation execution may be unsuccessful. There may be problems with the job definition itself, or errors reported by nodes and workflow steps that don’t otherwise terminate the job. You may wish to filter such unsuccessful executions out of your value calculation.

Screenshot of PagerDuty Process Automation showing detailed status information of an automation run, including 2 completed steps, 1 failed step, and overall failed status.

Example job run with a failed step

When running analytics, we can choose to filter out runs that were unsuccessful due to external failures from integrated systems.

Graph in Tableau visualizing increasing hours and money for failed and successful automation runs.

Example chart showing hours and money saved and job status
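A minimal sketch of that filter is shown below. It assumes each exported record carries a status field with values such as "succeeded" and "failed"; those names, the job name, and the hours_saved helper are hypothetical, so adjust them to match whatever your export contains.

```python
# Hypothetical execution records; "status" values are illustrative.
executions = [
    {"job": "restart-service", "status": "succeeded", "Hours_Saved": 1.25},
    {"job": "restart-service", "status": "failed", "Hours_Saved": 1.25},
    {"job": "restart-service", "status": "succeeded", "Hours_Saved": 1.25},
]


def hours_saved(records, counted_statuses=("succeeded",)):
    """Total Hours_Saved, counting only runs whose status is in counted_statuses."""
    return sum(r["Hours_Saved"] for r in records if r["status"] in counted_statuses)


print(hours_saved(executions))                           # successful runs only
print(hours_saved(executions, ("succeeded", "failed")))  # every run
```

Comparing the two totals shows how much claimed value comes from runs that did not actually complete.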

The ROI Metric Data Plugin is available in PagerDuty Process Automation as of Version 4.7, and is also available as part of PagerDuty Runbook Automation. To learn more about working with the ROI Metric Data Plugin, check out the Process Automation Documentation.  

If you are not already a user of PagerDuty Process Automation or PagerDuty Runbook Automation, schedule a demonstration or trial today!

Doing More with Less: Building Greater Operational Efficiency with PagerDuty by Nancy Lee https://www.pagerduty.com/blog/doing-more-with-less-building-greater-operational-efficiency-with-pagerduty/ Wed, 14 Dec 2022 14:00:13 +0000 https://www.pagerduty.com/?p=80617 How many of us can say with confidence that we know a tool inside and out? If you’re like most, you probably use just a...

The post Doing More with Less: Building Greater Operational Efficiency with PagerDuty appeared first on PagerDuty.

How many of us can say with confidence that we know a tool inside and out? If you’re like most, you probably use just a small fraction of a product’s features. When it comes to feature-rich software like Microsoft Word or Excel, it’s a safe bet that most users are aware of less than half of the features, and use even fewer on a regular basis. And the longer we’ve been using a piece of software, the more likely we are to fall into this trap of feature underutilization.

I started noticing this in my own life a year and a half ago when a coworker who had recently joined the team told me she found a more efficient way to generate closed captions for our instructional videos. I asked if it was a tool in her Adobe Creative Suite. 

“Nope, it’s actually YouTube!” she replied.

“What? That’s amazing!” I said. “How did we not know about this?” I was shocked. For the past 6 months, we had been paying for a separate tool for its closed captioning capabilities when, all along, we could’ve used YouTube’s free captioning feature in our Google accounts. 

More recently, I had my proverbial mind blown yet again when I learned of Slack’s reminder feature. Making my to-do list for the next day, I was scheduling reminders in my Google calendar to follow up with a teammate, call my doctor, and pay the gas bill. My husband looked on in amusement as I added one event after another in my calendar.

“What are you doing?” he asked.  

“Setting reminders for the things I have to do tomorrow,” I replied, mildly annoyed at this interruption to my sacred routine.

“Why don’t you use the Slack reminder feature?” he said. “That way, you’re not filling up half your calendar with reminders and making it hard for people to book meetings.” 

“I had no idea you could do that!” Like the YouTube incident, I was incredulous that I was only learning of this feature now.

As I started scheduling Slack reminders for the following day, I wondered how often we hear our customers use that phrase — “I had no idea you could do that!” It’s not surprising when you think about it. We often purchase a tool for a specific use case. In our haste to implement a solution, we approach the task with blinders on, paying attention to only those features that will help us achieve our goal. “Problem solved!” we declare. Never mind that we only learned a tenth of the software’s capabilities. Years later, we’re still clicking the same buttons and following the same scripts, oblivious to the slew of new features that promise to enhance our user experience.  

It’s human nature to take the path of least resistance. But at a time when many tech companies are being asked to manage costs and do more with less, perhaps a good place to start looking for efficiencies is in our existing investments.

One business area that shines a light on this is Customer Education. At PagerDuty, customer training and enablement sits with PagerDuty University. A comment we often see in our course evaluations is “I had no idea PagerDuty could do [fill in the blank with a feature that’s existed for months or even years]!” Some customers may have started using PagerDuty for on-call management and alerting, and never ventured beyond those basic capabilities. They’ve become so accustomed to using PagerDuty for a single use case that they don’t realize its product portfolio actually encompasses multiple solutions for use cases across their digital operations. 

For organizations facing pressure from the current macroeconomic environment, PagerDuty’s end-to-end digital operations capabilities can help consolidate tool spend and boost productivity by reducing context switching. PagerDuty University helps customers by driving awareness of this end-to-end experience, from pre-incident creation (enriching and routing events) to post-incident mobilization (response automation) to business-wide orchestration (automated stakeholder communication) and beyond. Rather than investing in point solutions that address a single problem, our customers can leverage the solutions they need, when they need them, adopting additional capabilities and products as they continue to evolve their Digital Operations with PagerDuty.

Those of us who work in Customer Education understand that it’s our job to not only improve a customer’s time to value, but to ensure that they continue to see the return on their investment post-onboarding and beyond. For PagerDuty University, that means making sure that our customers receive proper enablement on PagerDuty’s advanced capabilities such as Event Intelligence and Incident Workflows (in Early Access!), as well as other products and use cases such as Customer Service Operations and Process Automation. Tool consolidation, cost savings, automating away toil, better customer experience: these are some of the biggest returns our customers walk away with post-training.

Our instructor-led training courses are centered around achieving customer goals. Rather than training customers on every PagerDuty feature, we first try to understand what business challenges they’re trying to solve, and build training that guides them efficiently to reaching those goals. Often in SaaS, we talk about time to value — we like to think of our technical training team as “guides to value.” 

PagerDuty University’s free, on-demand training complements our instructor-led training by digging into each product feature, situated in real-life scenarios so users always understand the larger context in which these features are used and the problems they solve. Our self-paced eLearning modules are suitable for customers who are trialing a free account, those who want to check out new features, or those who simply prefer the self-serve aspect of on-demand training. 

It should come as no surprise that those of us who work in Education Services love learning. We use that love of learning to drive customer success, which sits at the heart of everything. Whether it’s driving adoption, improving onboarding, or imparting industry best practices, we strive to make sure that we never hear one of our customers say “I had no idea PagerDuty could do that!”

4 Challenges Facing CXOs in A World of Digital Everything by Dormain Drewitz https://www.pagerduty.com/blog/4-challenges-facing-cxos-in-a-world-of-digital-everything/ Wed, 19 Oct 2022 13:00:52 +0000 https://www.pagerduty.com/?p=79148 As a busy executive, taking time to attend an event and listen to sessions is a luxury. And yet, I know that many of my...

The post 4 Challenges Facing CXOs in A World of Digital Everything appeared first on PagerDuty.

As a busy executive, taking time to attend an event and listen to sessions is a luxury. And yet, I know that many of my best breakthrough ideas on how to lead my teams have come from taking those moments to tune into new ideas. The challenge is figuring out where the hidden nuggets of wisdom are buried in a mountain of content.

PagerDuty Summit 2022 generated about 15 hours of content, much of it for site reliability engineers, platform teams, and engineering managers. But there were also some great takeaways for more senior executives. As executives, we’re driving growth through innovation, operational efficiency, and risk mitigation. And many of our biggest challenges lie in how we lead the people on our teams through the changes to drive those outcomes.

I’ve parsed through much of the content from the PagerDuty Summit to find the most useful insights for executives. Many of them come from other senior leaders who face similar challenges. Dig into the full replays linked from each section.

Shaping a culture of accountability

Accountability always seems to roll uphill. While executives are and should be accountable, a lack of localized accountability leads to “not my problem” syndrome. But distributing accountability is hard as organizations grow and systems become complex.

The “you build it, you run it” mantra of DevOps is about embracing a culture of accountability. As I’ve written before, the litmus test of any culture is what happens when things go wrong. What are the artifacts? Are folks happy to be accountable when the dashboards are green, but reluctant when systems are down?  

When team members don’t feel like they are equipped to deal with issues, the accountability flows uphill. For some great insights on how tech leaders across industries are empowering their teams when things go wrong, catch the “Delivering Value: A two-way Street” keynote panel from the Sydney Summit. You’ll hear from Andre Lachmann, Director of Technology at Nine; Sourav Lala, Head of Cloud & Platform Services at Carsales; and Gurnam Madan, Head of Engineering at Wesdigital.

  • Madan talks about getting trust right with the right foundations. What does that mean? Hear him describe adhering to SLAs and SLOs with PagerDuty. He also recommends embracing chaos engineering to break things in a controlled environment. “Don’t wait for things to break.”
  • Lachmann highlights the importance of being able to get the right person involved when pressure is high. After all, “Every day is like the Super Bowl in media.”
  • Lala sets the bar high, aiming to resolve incidents before teams notice, let alone customers. He brings cloud, automation, and PagerDuty to make this possible.

Doing more with the same: Operational Efficiency

Even without the backdrop of economic uncertainty, find me an organization that isn’t trying to do more with the same. According to the Gartner® report “The Chief Technology Officer’s First 100 Days,” operational efficiency is one of the common metrics used to measure CTO success.¹ But getting more out of the resources you already have does have a cost: change.

Leading your organization to change their behaviors will make or break an efficiency initiative. Jamie Vernon, SVP of IT at ResultsCX shared some great insights on leading change in his session, “Cultural Adoption of Automation.” After all, automation holds great promise for driving operational efficiency, if you can manage the change to adopt it.

Vernon stresses understanding your authentic reasons to automate. “Your motivations, your passion, your enthusiasm, your being genuine and authentic… is going to be part of how you sell it to your stakeholders.” Humans are wired to find patterns and settle into routines. People need a compelling reason to manage the discomfort of change.

For teams that are asking for more staff to help, Vernon shares a great idea. He describes how they anthropomorphized their automation efforts into RITA. The team trained RITA like they would onboard a junior member of the team.

Another session that offered a great blueprint for enabling teams to change was from Schneider. Jared Vils and Dana Dickrell covered a lot in “The AIOps Outcome That Smashed It,” including how they onboarded 480 responders across 74 teams. About halfway through, they share the internal video that they produced to get the organization engaged. If you have initiatives that should be driving operational efficiency that are stalling, you’ll find some great ideas here.

Connecting the dots on Customer Experience

According to Forrester Research, 82% of customers say they are likely to spend more with a brand that makes them feel appreciated and respected.² Anyone in leadership positions knows that cultivating positive feelings takes deliberate effort. One negative experience can wipe out five positive ones.

Delivering a positive experience is an “all hands on deck” situation, but not every team experiences that the same way. In her fireside chat, Fiona Gill, Vice President of Customer Success for Americas at Anaplan, highlights the need for internal collaboration. After all, customer-facing teams get frequent and direct feedback from customers. Developers and engineers, however, get less exposure to customers.

It’s easy to work in silos, but that puts the customer experience at risk. Gill highlights how folks in the field see firsthand whether customers are having a good experience. This is a critical signal for engineering teams. How are your engineering teams getting that input?

The hard stuff: Change, legacy, and trust

Like accountability, “hard problems” also tend to roll uphill to the executive suite. But successfully tackling hard problems requires bringing the organization along. On hard problems, I recommend the “Real Talk: PagerDuty’s Product Impact” keynote panel from the Sydney Summit. It features Fiona Muller, Group Principal Products and Services at Telstra, and Iain Phillips, General Manager of Site Reliability Engineering at Xero.

Changing behaviors is hard. We covered this earlier with respect to accountability and operational efficiency. But the same goes for anything, including resiliency engineering. “We treat SREs as an enablement team,” says Iain Phillips. Making it someone’s job to enable change goes a long way. But forcing change doesn’t work. As an example, hear Fiona Muller explain how tools at Telstra get promoted organically rather than by mandate.

Dealing with legacy (or “heritage”) is also hard. And it’s everywhere (yes, even at startups and so-called cloud-natives). A strictly “bi-modal” approach can leave teams working on heritage apps demoralized. How are you bringing your heritage apps and infrastructure along as your teams adopt DevOps and SRE practices? Fiona Muller shares insights from Telstra’s journey.

Finally, trust is hard earned. As Jenn Tejada, PagerDuty’s CEO, likes to say, “Trust is earned in drops and lost in buckets.” Fiona Muller echoed this in her discussion of why her team spends so much time on security and reliability: “People have a really long memory for the bad things and a short memory for the good things.”

Hopefully, as a leader, you’re taking time every day to invest in yourself and learn. Let me know if these insights from your peers have been useful!

Footnotes

¹Gartner, The Chief Technology Officer’s First 100 Days, Samantha Searle, Nick Jones, Arun Chandrasekaran, September 7, 2022. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

²Forrester Research, Customer Service Unplugged: How To Scale Empathetic Customer Service, Max Ball, July 26, 2022
