PagerDuty Build It | Ship It | Own It

Gartner Market Guide: Embedding Automation Into the Enterprise by Joseph Mandros https://www.pagerduty.com/blog/gartner-market-guide-2023/ Thu, 20 Jul 2023 13:00:17 +0000

Read the Gartner Market Guide today!

__

“Existing workload automation strategies are unable to cope with the expansion in complexity of workload types, volumes and locations driven by evolving business demand, as per Gartner. Digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. Cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.”

The lack of collaboration and automation both within and beyond IT departments creates isolated capabilities across business and IT teams, hindering the pace of digital business. Additionally, as technical debt accumulates and the need to showcase the value of investments grows, cost optimization becomes an ongoing challenge.

By embracing these key findings in a recent Market Guide published by Gartner®, we believe businesses can streamline their operations and enhance efficiency in the face of expanding complexities. Let’s take a brief look at our understanding of the key findings from the Gartner Market Guide for Service Orchestration and Automation Platforms.

Workload Automation Challenges

Businesses face a myriad of challenges when it comes to managing workloads effectively. “According to Gartner, existing workload strategies are unable to cope with the expansion in complexity of workload types, volumes, and locations driven by evolving business demands.” Automation strategies that once sufficed are no longer equipped to handle the expanding complexities driven by evolving business demands.

To address these challenges, organizations need to adopt intelligent automation solutions that can adapt to changing requirements. Intelligent workload automation leverages technologies such as artificial intelligence (AI) and machine learning (ML) to dynamically allocate resources, optimize scheduling, and automate repetitive tasks.

Collaboration and Automation: Breaking Silos

According to Gartner, digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. The lack of collaboration and automation between these teams can significantly hinder digital business initiatives. Silos create barriers, slowing down decision-making processes and impeding the flow of information and ideas.

To overcome these challenges, organizations must foster a culture of collaboration and implement automation solutions that span across business and IT functions. By integrating workflows and sharing information seamlessly, teams can work together in harmony, driving innovation and accelerating digital initiatives. Collaborative automation tools, such as workflow management platforms and project management software, can facilitate effective communication, collaboration, and information sharing, leading to faster time-to-market and improved customer satisfaction.

The Evolving Challenge of Cost Optimization

“According to Gartner, cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.” As digital infrastructures expand, organizations accumulate technical debt—a burden caused by outdated technologies, inefficient processes, and legacy systems. This technical debt not only impedes agility and innovation but also increases operational costs.

To address this challenge, businesses must prioritize cost optimization through strategic investments and ongoing evaluation of their technology portfolios. Embracing cloud-based solutions, leveraging automation, and adopting agile practices can help organizations reduce technical debt and achieve greater cost efficiency.

Additionally, organizations should establish clear metrics and processes to measure and demonstrate the business value of technology investments, enabling informed decision-making and resource allocation.

Conclusion

We believe the key findings from the report underscore the need for organizations to stay agile, adaptive, and innovative in their approach to workload management, ensuring they can effectively meet evolving business demands and drive sustainable growth in the ever-changing digital landscape.

“According to Gartner, it is recommended to drive collaboration across business and IT teams by democratizing access to automated capabilities through feedback-driven self-service automation solutions.

Unlock the business value of orchestrated I&O automation by implementing expanded service orchestration and event-driven workflows to drive agility, innovation and cost optimization efforts.

Meet today and tomorrow’s business demands by service orchestration and automation platforms (SOAPs) delivering the needed agility and efficiency. Embed agility and efficiency into orchestrated IT processes to meet business demands by using SOAPs.”

To learn how PagerDuty can help you on your automation journey, click here.

___________________________________

Gartner Market Guide for Service Orchestration and Automation Platforms, Chris Sanderson, Daniel Betts, Hassan Ennaciri, 23 January 2023.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

The post Gartner Market Guide: Embedding Automation Into the Enterprise appeared first on PagerDuty.

What is Zero Trust Security and Why Should You Care? by Joseph Mandros https://www.pagerduty.com/blog/what-is-zero-trust-security-and-why-should-you-care/ Tue, 13 Jun 2023 13:00:45 +0000

Automation has become a game changer for businesses seeking efficiency and scalability in an unclear and volatile macroeconomic landscape. Streamlining processes, improving productivity, and reducing the incidence of human error are just a few of the benefits that automation brings.

However, as organizations embrace automation, it’s crucial to ensure modern security measures are in place to protect these new and evolving assets. While other security models control the majority of the narrative across the business landscape, zero trust is quickly emerging as a necessary security implementation concept.

With our recent release of the next-generation architecture for PagerDuty Runbook Automation and PagerDuty Process Automation, we are positioned as the ideal partner to help organizations implement and grow within a zero trust security architecture for the modern enterprise.

To learn more, keep reading or register for our webinar about zero trust security, happening this Thursday, June 15th, at 6 A.M. PT and 11 A.M. PT.

What is zero trust security?

Zero trust security is a model that challenges the traditional perimeter-based security approach by assuming that no user or device can be inherently trusted—regardless of their location. It emphasizes continuous verification and validation of identities, devices, and network traffic before granting access to resources. It achieves this through multi-factor authentication, granular access controls, encryption, and monitoring, enabling organizations to minimize the risk of data breaches and unauthorized access.

By shifting the traditional perimeter-based security paradigm and adopting a “trust no one” approach, zero trust security offers a holistic framework that aligns seamlessly with modern automation initiatives. Additionally, it can positively impact the evolution of a business’s internal processes as the world becomes increasingly complex and prone to bank-breaking threats.

Source: https://www.microsoft.com/en-us/security/business/zero-trust

What’s the big deal?

Zero trust security often stands out as a superior approach compared to traditional security models, largely due to its fundamental shift to a modern technological mindset and comprehensive implementation.

Unlike perimeter-based security models that rely on the assumption that internal networks are inherently trustworthy, zero trust security adopts a “trust no one” philosophy. It implements strict access controls, continuous authentication, and rigorous monitoring at every level, ensuring that every user, device, and network component is treated as potentially untrusted. This approach significantly reduces the attack surface and prevents lateral movement within the network, making it highly effective against both external threats and insider risks.

Additionally, zero trust security provides adaptive access controls that dynamically adjust privileges based on context, bolstering security without impeding productivity. By combining strong authentication, encryption, and segmentation, zero trust security offers a holistic and proactive defense strategy that fortifies organizations against sophisticated threats, making it a superior choice for today’s deep field of dynamic and interconnected digital landscapes.
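
To make the “trust no one” idea concrete, here is a minimal sketch of a deny-by-default access check that re-verifies every request and adjusts privileges based on context. Everything in it is an assumption for illustration: the AccessRequest fields, the GRANTS table, and the rule that drops write access outside the corporate network are hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical request context; field names are illustrative only."""
    user_id: str
    mfa_verified: bool
    device_compliant: bool
    network: str   # e.g. "corporate" or "public"; never trusted on its own
    resource: str
    action: str    # "read" or "write"


# Hypothetical policy table: which (user, resource) pairs may perform which actions.
GRANTS = {
    ("alice", "payroll-db"): {"read", "write"},
    ("bob", "payroll-db"): {"read"},
}


def evaluate(request: AccessRequest) -> bool:
    """Deny by default; every request is re-verified, whatever its origin."""
    if not request.mfa_verified:
        return False
    if not request.device_compliant:
        return False
    # Copy the grant set so the adaptive rule below never mutates the policy table.
    allowed = set(GRANTS.get((request.user_id, request.resource), set()))
    # Adaptive control: outside the corporate network, drop write privileges.
    if request.network != "corporate":
        allowed.discard("write")
    return request.action in allowed
```

Note that being on the corporate network never grants access by itself; it only avoids a privilege reduction, which is the key difference from a perimeter-based model.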

Businesses of all sizes can benefit from implementing a security model like zero trust in several ways:

  • Protecting Sensitive Data: Zero trust security ensures that access to sensitive data is strictly controlled and authenticated, reducing the risk of unauthorized access, data breaches, and potential financial and reputational damages.
  • Mitigating Insider Threats: Zero trust security addresses the risk of insider threats by assuming that no user or device should be implicitly trusted. This helps organizations identify and address potential risks before they cause harm.
  • Adapting to Evolving Cyber Threats: Traditional security models often rely on perimeter-based defenses, assuming that internal network traffic is safe. However, modern cyber threats—such as advanced persistent threats and zero-day exploits—can bypass traditional defenses. Zero trust security takes a more granular approach, implementing continuous auditing, multi-factor authentication, and strict access controls to protect against these evolving threats.
  • Supporting Remote and Mobile Workforces: With the rise of remote work and the increasing use of mobile devices, businesses face new challenges in securing their networks and data. Zero trust security allows organizations to implement secure access controls, regardless of the user’s location or device. This flexibility ensures that employees can work remotely while maintaining a strong security posture.
  • Meeting Compliance and Regulatory Requirements: Implementing zero trust security can help organizations meet compliance and regulatory requirements by enforcing access controls, monitoring data usage, and demonstrating a proactive approach to cybersecurity.
  • Building Customer Trust: In today’s data-driven world, customers value the security and privacy of their personal information. By implementing robust zero-trust security measures, businesses can build trust with their customers, demonstrating their commitment to protecting sensitive data and mitigating cyber risks.

PagerDuty Process Automation + Zero Trust

Digital transformation initiatives rely on cloud technologies to rapidly scale the business, but automating operations and cloud infrastructure brings new security challenges. The main challenge is that engineers need secure protocols to run automation in restricted application environments that mandate a zero trust architecture, where direct SSH zone access is deprecated.

Additionally, significant engineering effort is required to deploy and manage automation that performs well across hundreds of remote environments and geographical regions. Lastly, creating resilient automation runbooks is time consuming and prone to error when coordinating within a variety of complex environments.

With PagerDuty Runbook Automation, engineers can now run automation from a central system that triggers the execution through enhanced Runners or AWS SSM within the remote environments—without needing to rely on SSH firewall rules.

PagerDuty Runbook Automation dispatching tasks to remote environments using zero-trust principles.

The new Runners can leverage common plugins like Ansible and Kubernetes and customers can create new types of runbooks where engineers target many remote secure environments and explicitly state where and how tasks will be independently routed and executed within each environment. This enables better performance, scale, and fault tolerance.

For customers with high security requirements, PagerDuty Runbook Automation and Process Automation can now enable remote operations without the need to open firewall ports such as SSH. This new functionality simplifies secure connectivity to automation by reducing the need for customers to deploy their own bastion or jump hosts and public endpoints.

To learn more about zero trust security and PagerDuty Process Automation, be sure to register for the webinar happening this Thursday, June 15th, at 6 A.M. PT and 11 A.M. PT.

The post What is Zero Trust Security and Why Should You Care? appeared first on PagerDuty.

PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments by Joseph Mandros https://www.pagerduty.com/blog/new-enhancements-runbook-automation/ Tue, 28 Mar 2023 13:00:59 +0000

Be sure to register for the launch webinar on Thursday, March 30th to learn more about the latest release from the PagerDuty Operations Cloud.


Rundeck by PagerDuty has long helped organizations bridge operational silos and automate away IT tasks so teams can focus more time on building and less time putting out fires. And while this mission still rings true today, our vision is to extend this reality and revolutionize all operations while continuing to build trust.

To resolve high-impact work faster and more efficiently, the PagerDuty Operations Cloud delivers value across every IT environment, whether it be pre-production or production, isolated or secure, multi-cloud or on-premises, you name it. We want to meet our customers where they are and deliver the value they need.

Starting today, that vision is now a reality. 

We are thrilled to introduce a next-generation architecture for PagerDuty Runbook Automation and PagerDuty Process Automation that simplifies how our customers manage automation across cloud, remote, and hybrid environments.

This latest functionality, among others, is why Runbook Automation is an integral part of the PagerDuty Operations Cloud. Now PagerDuty helps automate across any infrastructure, multi-zoned hybrid environment, network, and more to resolve that unplanned, time-sensitive, and high-impact work we know about all too well.

Standardizing automation across secure infrastructure

It’s clear that automation has become a necessity for businesses to keep pace with the rapid transformations happening across the technical landscape. These businesses also have to sustain growth and transformation while doing more with the same, or even fewer, resources. Additionally, segregated environments and disparate services add complexity via hybrid cloud realities and increasing security and regulatory requirements. This sprawl of IT environments has led to a new dimension of organizational silos, along with departmental and technical silos.

One thing is for sure: When built, conventional automation tooling didn’t anticipate the complexity of security requirements in modern distributed environments. As a result, engineers have to manually execute tasks for operations within each environment, causing long wait times, more personnel time consumed, and higher levels of engineering toil. To solve this problem of fragmented automation, something more is needed. Teams need full visibility across their entire infrastructure and the ability to seamlessly execute distributed automation jobs—without having to manually build new automated operations into each project and environment.

With this new functionality, instead of having to manually invoke an automation step in each environment, engineers can now manage and run automated tasks and distribute that automation across their many segregated environments from a single administration.

As a result, teams will be able to: 

  • Operate faster by enabling automated operations across cloud and data center environments
  • Simplify security when operating in high-compliance and zero-trust architectures
  • Eliminate toil by speeding up task resolution and reducing personnel time across all zones, environments, and networks

In order to better understand how this is made possible by the new functionality, let’s touch on some of the challenges we are looking to solve for our current and future customers.

Enabling scale and efficiency with security in mind

While it is true that automation can unlock new levels of scale and potential for innovation, it also brings with it critical challenges around added complexity, connectivity, and security. For technology teams, this means additional dependencies inside isolated environments that need to be maintained, distributed network endpoints to keep in check, and islands of fragmented automation spread across remote and local environments that need to be securely managed and run.

One of the bigger challenges that we hear from our customers is around managing and running automation across environments with high security and compliance requirements. In many cases, engineers have to manually manage each of their several isolated environments due to the many security nuances and process dependencies within each zone.

Now, PagerDuty Runbook Automation can be that connectivity conduit across our customers’ distributed operations, including those with strict requirements:

  • Disparate environments? No problem: Runbook Automation and Process Automation can now authorize orchestration of automation steps in remote environments as if they were local, and allow many environments to be incorporated in the same job definition. This eliminates the network silos that typically compromise automation and force manual log-ins to run properly in those environments.
  • Compliance audits? No problem: Runbook Automation and Process Automation simplify compliance by embedding access control and logging into automation, and now extend these capabilities into remote environments, all from a centralized control plane.
  • Zero trust security? No problem: For customers with high security requirements, Runbook Automation and Process Automation can now enable remote operations without the need to open firewall ports such as SSH. This new functionality simplifies secure connectivity to automation by reducing the need for customers to deploy their own bastion or jump hosts and public endpoints.

Example diagram of PagerDuty Runbook Automation running an automated diagnostic process in remote environments to capture environmental state.

New Runner functionality

The Runner is a remote execution point purpose-built for node steps to run on specified endpoints, rather than from the automation server itself. The Runner, available for both Process Automation and Runbook Automation, securely opens up network communication between data centers, remote environments, and the automation cluster.

The new release offers a next-generation Runner that integrates with common infrastructure tools such as Ansible, Docker, and Kubernetes, which execute locally within the private network. The new architecture allows job authors to develop automated jobs that incorporate multiple environments.
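
One common way to reach remote environments without opening inbound firewall ports is for the agent inside each environment to connect outward and pull the work routed to it. The sketch below illustrates that pattern with an in-memory queue. It is an assumption for illustration only: this post does not describe the Runner's actual protocol, and the Dispatcher, Runner, and Task classes here are hypothetical names, not PagerDuty APIs.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Task:
    environment: str  # which remote environment should execute this step
    command: str


class Dispatcher:
    """Central automation server: queues steps, never connects into environments."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def submit(self, task: Task) -> None:
        self._queues[task.environment].append(task)

    def poll(self, environment: str):
        """Called by a runner over an outbound connection; returns a Task or None."""
        queue = self._queues[environment]
        return queue.popleft() if queue else None


class Runner:
    """Lives inside one environment and pulls only the work routed to it."""

    def __init__(self, environment: str, dispatcher: Dispatcher) -> None:
        self.environment = environment
        self.dispatcher = dispatcher
        self.executed = []

    def run_pending(self) -> int:
        """Drain this environment's queue and return how many steps were handled."""
        count = 0
        while (task := self.dispatcher.poll(self.environment)) is not None:
            # A real runner would invoke a plugin here; this sketch only records the step.
            self.executed.append(task.command)
            count += 1
        return count
```

A single job definition could submit tasks for several environments, and each environment's runner would pick up only its own steps, which mirrors the idea of stating where each task is routed and executed.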

New feature highlights

  • Run automation anywhere with next-generation Runners that provide secure and resilient connectivity from within remote environments.
  • Support complex architectures and jobs with distributed automation steps that enable the orchestration of standardized automation to work across any environment.
  • Simplify management with an enhanced Runner UI and APIs that simplify administration of Runners from the central automation environment, including configuration, status, and managing credentials.
  • Integrate your existing stack with plugins available on remote Runners for common technologies like Ansible, WinRM, Kubernetes, and Docker that can execute in local and remote environments.

Process Automation and Runbook Automation can now provide the same breadth of automation workflows, with execution steps for Ansible or Kubernetes, in remote environments. These capabilities will only continue to strengthen as we blaze this trail of new distributed automation capabilities for our customers.

Looking ahead

These new automation features from Runbook Automation and Process Automation are just the beginning, and strengthen the value of the PagerDuty Operations Cloud by providing more flexibility for our customers to create triggered workflows across a wider variety of secure environments.

Register for our webinar on Thursday, March 30th to hear more about the latest release from the PagerDuty Process Automation portfolio. If you have any questions or are interested in learning more, make sure to contact your account manager and visit our Process Automation page.

The post PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments appeared first on PagerDuty.

Automating Common Diagnostics for Kubernetes, Linux, and other Common Components by Joseph Mandros https://www.pagerduty.com/blog/common-diagnostics-common-components/ Wed, 27 Jul 2022 13:00:45 +0000

Watch our Automated Diagnostics webinar on demand to learn about common diagnostics for common components and how we provide out-of-the-box job templates for you to get started right away.


This is the second piece in a series about automated diagnostics, a common use case for the PagerDuty Process Automation portfolio.

In the last piece, we talked about the basics around automated diagnostics and how teams can use the solution to reduce escalations to specialists and empower responders to take action faster. In this blog, we’re going to talk about some basic diagnostics examples for components that are most relevant to our users.

But before we jump in, let’s make clear what automated diagnostics isn’t, based on some audience feedback on Twitter from the last article:

  1. Automated diagnostics is different from alert correlation. Alert correlation depends on a specified depth of signals, as well as an engine that can properly identify said correlated signals. Automated diagnostics is meant to help the first responder triangulate the source of the issue to either fix the issue faster themselves, or escalate more accurately.
  2. Automated diagnostics is different from monitoring. Monitoring is purpose-built to identify undesired states in performance or activity. This means that most monitoring is not purpose-built to emulate a first-responder’s activities to validate a true positive, or identify the first actions to take. Monitoring is focused on raising the alert. Automated diagnostics is focused on determining how to fix an issue once the alert is already created. 

That said, automated diagnostics can certainly make use of data collected by monitoring tools, since most people don’t apply thresholds to every datapoint they collect. In fact, one of our more commonly used diagnostics integrations queries CloudWatch logs. While we might consider a log aggregator a monitoring tool, sometimes the first steps of an investigation are to look at data in the monitoring tool that exists purely for diagnosing issues.

Providing responders with on-demand or pre-run diagnostic capabilities for their own environments can help a first responder quickly determine probable cause, thereby pulling in fewer individuals to assist with the incident. By providing first-responders with “diagnostic” data that is typically only retrievable by domain experts, the need to pull in more people for troubleshooting incidents is reduced significantly. This in turn drives down the cost of incidents and reduces mean time to response (MTTR) by automating the investigative steps that are typically manual in nature. 

The status quo: Automation in incident response

Operations managers are often excited about the idea of enabling self-healing or auto-remediation. It’s a natural inclination to assume that speeding up resolution through automation means “applying a cure.” But more often than not, the industry theory of “no two incidents are truly identical” rears its head. When you have a high degree of variability, this reduces the value of such potential automation since it’s less likely to be run. For example, restarting a core service may be the right way to fix today’s issue, but it could lead to a cascading failure—and an even bigger incident—tomorrow.

*The reader now switches cognitive gears to the initial stages of a response.*

But you know what tends to be highly repetitive? The investigative steps a responder takes to begin diagnosing what went wrong. More repetitive action means more value to gain from applying automation. For example, let’s say an incident kicks off within your Kubernetes distribution. No matter the nature of the incident, whether it involves your image repository or your load balancer, you’re likely still going to take the same diagnostic step of pulling your Kubernetes logs.

These diagnostic steps usually stay the same for a given component, regardless of the priority of the incident. Automated diagnostics can be applied to heterogeneous incidents: it doesn’t have to be purpose-built for a single recurring incident, and it can be customized around all sorts of common incident types and severities specific to your environment, for almost any common component. Think of it like going to the doctor’s office. Whether you are going to urgent care for a specific complaint or just an annual checkup, they still take your temperature, blood pressure, and weight when you walk in.

Common Examples

Every developer environment is different, but many environments are also quite similar once you really pop the hood. In the beginning stages of a response, most diagnostics will come from three main data sources:

  • Application data
  • System data
  • Environment data

There are several examples of common diagnostics and components that can be automatically pulled at the beginning of a response. This not only helps the responder better understand the severity of the incident, but also helps ensure the responder doesn’t pull in too many specialists and interrupt their normal day of work. For example, let’s look at Kubernetes (k8s) as a component for a responder during an incident. When an incident happens within a k8s environment, the infrastructure engineer who maintains the technology would typically perform actions such as:

  • Tail logs from k8s pod
  • Retrieve logs from k8s by selector label
  • Check image repo
  • Describe deployment
  • Execute command in pod

One thing all of these actions have in common? A typical L1 responder ACK’ing an incident doesn’t know how to orchestrate these actions because it’s just not their area of expertise. But with the out-of-the-box jobs from PagerDuty’s Automated Diagnostics, the L1 responder can automatically run these diagnostics and execute these jobs, which speeds up the response and reduces escalations to the infrastructure engineer responsible for the k8s environment.
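
As a rough illustration of what wrapping these steps looks like, the sketch below builds the kubectl command line for each of the five actions listed above. It is not the content of PagerDuty's job templates, which this post does not show. The function names and defaults are hypothetical, and “check image repo” is interpreted here as listing the container images a deployment references.

```python
import subprocess
from typing import List


def tail_pod_logs(pod: str, namespace: str = "default", lines: int = 100) -> List[str]:
    """Last `lines` log lines from a single pod."""
    return ["kubectl", "logs", pod, "-n", namespace, f"--tail={lines}"]


def logs_by_selector(selector: str, namespace: str = "default") -> List[str]:
    """Logs from every pod matching a label selector such as 'app=web'."""
    return ["kubectl", "logs", "-l", selector, "-n", namespace]


def deployment_images(deployment: str, namespace: str = "default") -> List[str]:
    """Container images referenced by a deployment (one reading of 'check image repo')."""
    return ["kubectl", "get", "deployment", deployment, "-n", namespace,
            "-o", "jsonpath={.spec.template.spec.containers[*].image}"]


def describe_deployment(deployment: str, namespace: str = "default") -> List[str]:
    """Human-readable deployment status, including recent events."""
    return ["kubectl", "describe", "deployment", deployment, "-n", namespace]


def exec_in_pod(pod: str, command: List[str], namespace: str = "default") -> List[str]:
    """Run an arbitrary command inside a pod."""
    return ["kubectl", "exec", pod, "-n", namespace, "--", *command]


def run(command: List[str]) -> str:
    """Execute a built command and return its output; requires kubectl and cluster access."""
    return subprocess.run(command, capture_output=True, text=True, check=True).stdout
```

For example, run(tail_pod_logs("web-1", namespace="prod")) would return the last 100 log lines of a pod named web-1, assuming such a pod exists. Building the argument list separately from executing it keeps each diagnostic easy to review before it is delegated to a first responder.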

Some common diagnostics and alert examples include: 

  • CPU/Memory Consuming Services
    • Common alert: High CPU/Memory
    • Common question: Which service(s) are consuming CPU/Memory?
  • File size / Disk Consumption
    • Common alert: High disk usage
    • Common question: Which files/directories are consuming the most space?
  • System Logs: Linux/Windows Commands
    • Common alert: Server/service issues
    • Common question: Is it an OS issue or app issue?
  • SQL Database Commands
    • Common alert: Database blocks/deadlocks
    • Common question: Is there a long-running query blocking other database requests?
  • Host Availability
    • Common alert: Host down
    • Common question: Is it actually down or is it a false-positive reachability issue?
  • Application Error: Application Logs or traces
    • Common alert: 4xx/5xx error codes
    • Common question: What is the stack-trace?
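
To show how one of these questions can be answered automatically, here is a minimal sketch for “Which files/directories are consuming the most space?” using only the Python standard library. The function name and its parameters are hypothetical, and this is not one of the out-of-the-box job templates.

```python
import heapq
import os
from typing import List, Tuple


def largest_files(root: str, limit: int = 5) -> List[Tuple[int, str]]:
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself rather than following it.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # Files can vanish or be unreadable mid-scan; skip them.
                continue
    return heapq.nlargest(limit, sizes)
```

Attaching the output of a check like this to a disk-usage incident gives the first responder the answer to the common question before anyone has to log in to the host.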

A few examples of some common diagnostics for known components:

  • CloudWatch Logs: Surface specific application and VPC logs.
  • ECS: View stopped ECS task errors.
  • ELB: Debug unavailable target-group instances.
  • Kubernetes: Retrieve logs from Pods by selector label.
  • Linux: Retrieve service status.
  • Nginx: Retrieve error logs.
  • Redis: Retrieve slow log entries.

And these are just some of the more than 30 out-of-the-box job templates we have built for our users, which you can find in the Automated Diagnostics solution guide. To use the Automated Diagnostics solution, you must have either a PagerDuty Runbook Automation license or a Process Automation (previously Rundeck Enterprise) license. See the FAQ for details on how to use it. If you do not have a license for either of these products, contact us to learn more.

Automating diagnostics within PagerDuty

Incidents that notify responders are filled with information provided by monitoring tools that have a “myopic” view of the alert(s). A common example: high CPU usage triggers an alert, which notifies a responder. But the information contained in the alert is surface-level in that it does not specify what might be causing the CPU spike.

Diagnostic data is the deeper-level information that helps answer the “why” and “where” questions of incidents. Even though some monitoring and correlation tools provide some help in providing root-cause analysis for users, most fall short in their ability to emulate a responder’s investigative/troubleshooting steps of collating disparate data-sources into a unified view. By providing responders with on-demand or pre-run diagnostic capabilities, the odds of the first responder resolving the issue on their own increase, as well as the probability of pulling in fewer individuals to assist with the incident. Enter Automated Diagnostics. 

Want to learn more about common diagnostics for the components you use? Register for our September 14th webinar event of the same name, hosted by Justyn Roberts, Senior Solutions Consultant, PagerDuty. New to Process Automation? Request a demo. Already using PagerDuty Process Automation? Check out the automated diagnostics solution guide to see the end-to-end process of achieving the full solution. Questions? Reach out to me directly on Twitter @sordnam and let’s chat!

The post Automating Common Diagnostics for Kubernetes, Linux, and other Common Components appeared first on PagerDuty.

Extending Automation Actions Across the PagerDuty Platform by Joseph Mandros https://www.pagerduty.com/blog/summit-automation-actions-updates/ Tue, 07 Jun 2022 13:00:03 +0000 https://www.pagerduty.com/?p=76695 It’s day one of PagerDuty Summit, and we are looking forward to a full day of expert presenters, actionable content, and educational sessions to boost...

The post Extending Automation Actions Across the PagerDuty Platform appeared first on PagerDuty.

It’s day one of PagerDuty Summit, and we are looking forward to a full day of expert presenters, actionable content, and educational sessions to boost your PagerDuty IQ and show you new ways to improve your team’s operational excellence.

One point you will continue to hear throughout the duration of the conference echoes our greater mission: To revolutionize operations so teams spend less time on reactive, break-fix work and more time delivering new innovation. At PagerDuty, we see this future of operations extending beyond the digital teams that build and run software to all teams in the organization. While many PagerDuty products and features exist to make this mission a reality, we are going to focus on the latest and greatest with PagerDuty Automation Actions®, part of the PagerDuty Process Automation® portfolio.

New Updates and Integrations With Automation Actions 

Automation Actions connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams can reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents. 

We launched Automation Actions last year to help organizations take simple first steps toward automation. Now, Automation Actions is integrated across the entire PagerDuty platform so that all users can remove manual, time-consuming, repetitive tasks, such as diagnosing issues when brought into bridge calls.

Let’s look at some of the latest and greatest automation capabilities with Automation Actions:

  • Automation Actions in Incident Response. Teams can now run automated diagnostics and remediate incidents directly within PagerDuty. This integration will improve productivity and remove toil by automating repetitive, manual tasks, and give time back to your engineers to focus on innovation.

  • Automation Actions for Customer Service Ops. This integration gives customer service agents the ability to validate customer problems and capture critical information via automation to diagnose and resolve cases faster. Agents are now empowered to validate customer-impacting issues and run automated actions directly from the PagerDuty app in Service Cloud.

  • Automation Actions for Event Orchestration. By combining nested event rules with machine learning and precise, targeted automation triggers, it’s now possible to action an incident before responders even get paged. This integration with Event Orchestration helps teams automate common diagnostics and enable self-healing for recurring and well-understood types of incidents, resulting in reduced MTTR and escalations to specialists.

  • Automation Actions in PagerDuty’s Mobile App. Everything you love about Automation Actions is now mobile! Invoke the same automation from Automation Actions and resolve common incidents directly from the PagerDuty mobile app.

  • Automation Actions in Slack.  With this integration, incident responders can deploy scriptable diagnostics and remediation actions directly from Slack.
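To illustrate the idea of acting on an incident before anyone is paged, here is a small sketch of rule-based matching from an incoming event to an automated action. It does not represent Event Orchestration's actual rule format or API; the `Rule` class, the event field names, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if every field in `match` equals the event's value, run `action`."""
    match: Dict[str, str]
    action: str


# Illustrative rules for recurring, well-understood incident types.
RULES: List[Rule] = [
    Rule(match={"service": "checkout", "alert": "high_cpu"}, action="collect-top-processes"),
    Rule(match={"alert": "disk_full"}, action="report-disk-usage"),
]


def select_action(event: Dict[str, str], rules: List[Rule] = RULES) -> Optional[str]:
    """Return the first matching action name, or None to fall through to normal paging."""
    for rule in rules:
        if all(event.get(key) == value for key, value in rule.match.items()):
            return rule.action
    return None


if __name__ == "__main__":
    print(select_action({"service": "checkout", "alert": "high_cpu"}))
    print(select_action({"service": "search", "alert": "latency"}))
```

Rules are evaluated in order, so more specific matches should be listed first. Events that match nothing return `None` and would follow the usual notification path.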

Automation Everywhere

To be ready for anything with increasing digital complexity and dependencies, operations must transform from being manual, rigid, and ticket queue-based, to a continuously improving system that focuses on outcomes and customer experience, delivers operational speed and resilience, and is heavily automated by machine learning and AI. Only then can teams move toward a more proactive posture, and reduce the burden of manual work to avoid burnout and preserve focus. With Automation Actions, teams not only have the ability to reach this operational milestone, but to excel and continue to mature their automation capabilities.

Be sure to check out related Automation Actions sessions at PagerDuty Summit:

  • Normalize Automation, Sean Noble, Principal Product Manager, PagerDuty
  • Is it the Cloud? App? Database? Reduce Escalations by giving first responders automated diagnostics, Jake Cohen, Senior Product Manager, PagerDuty

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

 

What is Automated Diagnostics and Why Should You Care? by Joseph Mandros https://www.pagerduty.com/blog/what-is-automated-diagnostics-why-should-you-care/ Fri, 03 Jun 2022 13:00:35 +0000 https://www.pagerduty.com/?p=76267 How do you measure the cost of an incident? A lot of people in technology talk about the cost of an incident solely from the...

The post What is Automated Diagnostics and Why Should You Care? appeared first on PagerDuty.

How do you measure the cost of an incident?

A lot of people in technology talk about the cost of an incident solely from the perspective of downtime, or the number of customers and employees impacted. And on the surface, that is often a fair angle to take. It makes the headlines, and customer reputation and trust are critical to the success of any business.

But another direct cost of incidents that is infrequently acknowledged is the number of people who need to get involved during an incident, whether that’s to help investigate the root cause, troubleshoot and resolve the incident, or absolve their team of responsibility. This cost applies regardless of whether the incident is severe enough to impact your customers.

According to PagerDuty data, 50% of a responder’s time is spent determining who is best to pull in for additional support (and trying to figure out if there’s actually a problem) in x environment, or with y service. Given this statistic, this means that 50% of an incident’s lifespan is spent on the beginning stages of an incident (the diagnostic and triage phases), rather than on actual remediative actions.

The bottom line? The cost of people-hours and the number of manual actions taken per incident can get steep—fast.

Automating Your Incident Response

Applying automation to the early, recurring stages of the incident, including diagnosing the severity of the incident and understanding the genetic makeup of what went awry (and how), is critical to the success of the eventual remediation of the incident.

Automation is also important from a people perspective, ensuring your teams aren’t getting burnt out by the same, repetitive actions every time an incident kicks off. Ensuring the diagnostic data is available to first responders is paramount to the routing efficiency and overall workflow of the incident response.

Before we go any further, let’s first define diagnostic data. Diagnostic data is data retrieved by incident responders that is typically more specific than the information provided by monitoring tools. For example, whereas monitoring tools will alert you when there is a spike in CPU or memory, incident responders investigate by looking at the processes consuming the most CPU and memory. In this case, the process names or IDs and their associated compute consumption are the “diagnostic data.”
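As a rough illustration of that CPU example, the sketch below turns process-listing output into the diagnostic data a responder would look for. It assumes the text came from something like `ps -eo pid,comm,%cpu,%mem` on Linux. The `top_consumers` function, the `ProcessUsage` type, and the sample output are made up for this example.

```python
from typing import List, NamedTuple


class ProcessUsage(NamedTuple):
    pid: int
    name: str
    cpu_percent: float
    mem_percent: float


def top_consumers(ps_output: str, limit: int = 3) -> List[ProcessUsage]:
    """Parse 'PID COMMAND %CPU %MEM' rows and return the heaviest CPU users first."""
    rows: List[ProcessUsage] = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            # Rows whose command name contains spaces are skipped in this sketch.
            continue
        pid, name, cpu, mem = fields
        rows.append(ProcessUsage(int(pid), name, float(cpu), float(mem)))
    rows.sort(key=lambda proc: proc.cpu_percent, reverse=True)
    return rows[:limit]


# Made-up sample output, for illustration only.
SAMPLE = """\
  PID COMMAND         %CPU %MEM
  101 postgres        12.5  8.1
  202 java            91.0 34.2
  303 nginx            1.2  0.4
"""

if __name__ == "__main__":
    for proc in top_consumers(SAMPLE, limit=2):
        print(f"{proc.name} (pid {proc.pid}): {proc.cpu_percent}% CPU, {proc.mem_percent}% MEM")
```

The monitoring alert says "CPU is high"; the parsed rows say which process is responsible, which is what lets a first responder route the incident correctly.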

So now that we have defined diagnostic data, why should you care about Automated Diagnostics? Because implementing an Automated Diagnostics practice can drive down the cost of incidents through both reduced incident duration and fewer responders paged.

The Problem with MTTR

Perhaps “problem” is the wrong word here, but hear me out: MTTR as a metric is too broad to return granular, actionable insights. Mean time to repair (MTTR) has been a staple maintainability metric in the IT universe for decades. And while it has many applications and does a great job of explicating the rate of general recovery, its Achilles’ heel is exactly that: generality. And now that we can safely infer that 50% of a responder’s time is spent determining who is best to pull in for additional support, we’ve started looking at other metrics within the MTTR timeline, such as MTTT (mean time to triage) or MTTI (mean time to investigate).

MTTI/MTTT: The average time between the detection of an IT incident and when the organization begins to investigate its cause and solution. This denotes the time between MTTD (mean time to detect) and the start of MTTR (mean time to repair). 

At PagerDuty, we measure this as the time span between when your first responder “acks” and when your resolver “acks.” This metric helps us dig into what’s actually happening under the hood during an incident. After observing our own data, we’ve been able to infer that MTTI is one of the most time-consuming factors of MTTR. And in modern business, when a task requires time and attention from engineers, then that task is an expensive one for the business. Really expensive.
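As a worked example of that definition, the sketch below averages the gap between the first responder's acknowledgment and the resolver's acknowledgment across a set of incidents. The function name and the timestamps are invented for illustration. It does not reflect how PagerDuty computes the metric internally.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Tuple

AckPair = Tuple[datetime, datetime]  # (first responder ack, resolver ack)


def mean_time_to_investigate(acks: Iterable[AckPair]) -> timedelta:
    """Average the time between the first responder's ack and the resolver's ack."""
    gaps = [(resolver_ack - first_ack).total_seconds() for first_ack, resolver_ack in acks]
    if not gaps:
        raise ValueError("at least one incident is required")
    return timedelta(seconds=mean(gaps))


if __name__ == "__main__":
    # Made-up incidents: gaps of 10 and 30 minutes.
    incidents = [
        (datetime(2022, 6, 1, 9, 0), datetime(2022, 6, 1, 9, 10)),
        (datetime(2022, 6, 2, 14, 0), datetime(2022, 6, 2, 14, 30)),
    ]
    print(mean_time_to_investigate(incidents))
```

Tracking this number separately from MTTR shows how much of each incident is spent finding the right person rather than fixing the problem.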

Using Automated Diagnostics

Now let’s bring this back around to MTTI and automated diagnostics. MTTI is not only lengthened by the technical tasks of responders manually pulling diagnostic data and having to decipher which team to escalate to based on x service and y incident. It’s also about the people and their limitations, depending on the specific expertise that is required to begin resolution. For example, in many cases, the first responder doesn’t know how to investigate the issue from the database or network ‘perspectives.’ That may be due to a lack of skills (such as a background in databases or networks), access, or tribal knowledge (e.g., that a specific app component depends on a complex integration with a third-party service).

By automating these investigative and debugging tasks, in addition to having the ability to delegate these actions across teams and responders, you will experience a positively cascading effect on MTTI, and eventually, MTTR.

So why should you care about automated diagnostics?

With automated diagnostics, you can:

  • Reduce escalations to scarce experts by designing paths to provide the first-responders with information that would typically be manually gathered
  • Distribute subject matter expertise across response teams
  • Invoke secure automation behind firewalls and VPCs
  • Troubleshoot and resolve faster without a human-assisted action required
  • Improve the speed of enablement to new engineers and ensure optimal efficiency at all levels of the incident response organization

Getting Started

You made your decision. Now it’s time to blaze the trail, but where do you start?

To use some marketing slang: don’t try to boil the ocean. Trial some actions that are low in both complexity and risk. This could be taking a deeper look at some of your noisiest services, or you could run some simple data pulls from various monitoring applications, disk usage, etc. But it’s important to have a strategy for the long-term rollout and vision of this functionality. Sure, you can write a script that pulls data from numerous sources and appends that to an incident. But that is far from scalable.
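For a sense of what such a one-off script looks like, here is a minimal sketch that reads local disk usage and prepares a request to attach it to an incident as a note. It assumes the PagerDuty REST API's incident notes endpoint and token-style authorization header. The function name, incident ID, token, and email are placeholders. The request is only built, not sent, so you can inspect it first.

```python
import json
import shutil
import urllib.request


def build_disk_usage_note(
    incident_id: str, api_token: str, from_email: str, path: str = "/"
) -> urllib.request.Request:
    """Build (but do not send) a request that adds a disk-usage note to an incident."""
    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100
    free_gib = usage.free // 2**30
    content = f"Disk usage for {path}: {percent_used:.1f}% used ({free_gib} GiB free)"
    body = json.dumps({"note": {"content": content}}).encode("utf-8")
    return urllib.request.Request(
        # Assumed endpoint; check the current API reference before relying on it.
        url=f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
            "From": from_email,
        },
    )


if __name__ == "__main__":
    request = build_disk_usage_note("PABC123", "example-token", "responder@example.com")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(request)
```

A script like this is fine for a trial, but every new data source means another hand-written function, credential, and host to run it on, which is why it does not scale.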

It’s important to think about the various infrastructure pieces and tools you will want to pull diagnostic data from. You will want a standardized approach for interfacing with your heterogeneous and dynamic environments.

To learn more about automated diagnostics, check out some of our how-to articles, which we will be continuing to publish throughout the year. Additionally, look out for a session on all things Automated Diagnostics from Jake Cohen during PagerDuty Summit next week!

For more resources about PagerDuty’s Process Automation portfolio, visit this page and get in touch with your account manager today.

Any questions? Feel free to ask away on Twitter @sordnam

Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions by Joseph Mandros https://www.pagerduty.com/blog/democratize-capabilities-automation-actions/ Mon, 04 Apr 2022 13:00:55 +0000 https://www.pagerduty.com/?p=75355 Let’s face it. Incidents can be expensive—really expensive. But the high cost of incidents within a production environment isn’t always due to a compromised service...

The post Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions appeared first on PagerDuty.

Let’s face it. Incidents can be expensive, really expensive. But the high cost of incidents within a production environment isn’t always due to a compromised service or negative customer experience. According to PagerDuty response data, over 50% of an incident’s lifespan is spent with first responders in the investigation and mobilization phases (what we call ‘triage’). In other words, that time goes to determining what might have gone wrong and calling the right person to fix it.

With the above statistic in mind, it’s clear that the shadow expense of your incident lifecycle is that of your people’s time—the engineer who discovered the incident, the on-call engineer who responded to the issue and determined root cause, and every other subject matter expert that gets looped into the incident lifecycle. And when you sprinkle in manual processes to the entire response timeline, things can get pricey. Very pricey.

The fact of the matter is, your developer organization’s time is just as valuable and important as the business’s bottom line. And as service and application development continues to grow in complexity, “time saved” becomes an even more important metric to track, quantify, and continuously improve. Finding a way to automate different aspects of the incident response process can help save your team’s time and bolster efficiency across the board. How can you do this, you ask? Enter PagerDuty® Automation Actions (formerly PagerDuty Rundeck Actions).

PagerDuty® Automation Actions

The PagerDuty® Automation Actions add-on connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

PagerDuty® Automation Actions connects automated diagnostics and remediation to the incident response workflow. Automated Diagnostics are a set of actions for production services that your responders can automatically invoke when an incident occurs. Rather than having to escalate to expert specialists who manually run common tests, responders can safely and securely invoke this automation themselves from within PagerDuty and see responses delivered in real time back to your incident timeline.

Run designated actions such as service restarts, diagnostics, and more

With these diagnostic tests, responders can more efficiently escalate the incident to the right specialist for resolution, rather than involving a large group or escalating up the typical responder ladder. The specialists will be able to see the results of those common diagnostics and can get started right away. 

Teams can also invoke these actions and collaborate on the incident directly from their Slack instance. This eliminates the need to access a service through a terminal and context switch between windows, creating a faster and more efficient way to resolve incidents while also reducing escalations to specialists. As you mature your use of automated diagnostics, you can extend it to automated remediation and to triggering via Event Intelligence.

PagerDuty® Automation Actions helps solve four main problem areas within an organization’s response process:

  • Siloed expertise. First-line responders don’t know the genetic makeup of every single application or service within an organization’s environment.
  • Consistent interruptions to specialists. Responders escalate to the engineer they think is the specialist of that application or service, taking time away from innovation and slowing time-to-resolution.
  • Repetitive and manual diagnostic steps. The first steps when an incident kicks off are often the same. These same manual steps have to be performed before you can begin resolving the incident.
  • Complex and sprawling production environments. Knowing which systems to access and what actions to take can take time. Additionally, not every responder has the authority to access specific production systems, often making the escalation process difficult and time-consuming.

PagerDuty® Automation Actions solves the above issues by:

  • Delegating automation across teams. Deploy automated procedures to first-line responders that are typically invoked by specialists.
  • Resolving incidents faster with fewer escalations. By creating automation for common requests and operations, teams can spend less time figuring out who to escalate to and more time on a fix.
  • Triggering human-assisted/self-healing automation. Invoke diagnostic actions before responders are even paged using PagerDuty’s Event Orchestration.
  • Safely invoking automation with security in mind. Responders only see actions they have the authorization to invoke for impacted systems in an incident, and all actions are logged to maintain a strong security posture.
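The last point, scoped visibility plus logging, can be sketched in a few lines. This is an illustrative model only and does not describe how PagerDuty implements authorization: the `Action` and `AuditLog` types, role names, and service names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Action:
    """Hypothetical automation action tied to one service and a set of allowed roles."""
    name: str
    service: str
    allowed_roles: FrozenSet[str]


@dataclass
class AuditLog:
    entries: List[str] = field(default_factory=list)


def visible_actions(
    actions: List[Action], responder_roles: Set[str], impacted_services: Set[str]
) -> List[Action]:
    """Show only actions for impacted services that the responder may invoke."""
    return [
        action
        for action in actions
        if action.service in impacted_services and action.allowed_roles & responder_roles
    ]


def invoke(action: Action, responder: str, responder_roles: Set[str], audit: AuditLog) -> bool:
    """Record every attempt, allowed or denied, and report whether it may proceed."""
    allowed = bool(action.allowed_roles & responder_roles)
    outcome = "invoked" if allowed else "was denied"
    audit.entries.append(f"{responder} {outcome} {action.name}")
    return allowed


if __name__ == "__main__":
    catalog = [
        Action("restart-checkout", "checkout", frozenset({"sre"})),
        Action("tail-checkout-logs", "checkout", frozenset({"sre", "first-responder"})),
        Action("flush-search-cache", "search", frozenset({"first-responder"})),
    ]
    log = AuditLog()
    shown = visible_actions(catalog, {"first-responder"}, {"checkout"})
    print([action.name for action in shown])
    invoke(shown[0], "alex", {"first-responder"}, log)
    print(log.entries)
```

Filtering by both role and impacted service keeps the list short during an incident, and logging denied attempts as well as successful ones gives a complete trail for review.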

 

To summarize the above with some quick bullets, PagerDuty® Automation Actions helps teams:

  • Decrease response times by up to 30 minutes and MTTR by up to 25%
  • Reduce the volume of incidents that get escalated up the ladder
  • Distribute subject matter expertise across response teams
  • Trigger human-assisting and self-healing automation before responders even get paged
  • Invoke secure automation behind firewalls and VPCs
  • Deploy automated actions in place of manual procedures
  • Enrich incident documentation for smoother postmortems and reduced operator work

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

Closing the Gap: Deploying Automation the Right Way by Joseph Mandros https://www.pagerduty.com/blog/closing-the-gap-deploying-automation-the-right-way/ Thu, 17 Mar 2022 13:00:25 +0000 https://www.pagerduty.com/?p=74289 Automation in the enterprise is nothing new. Engineers have been working with automation tools and frameworks for decades. From configuration management tools, to continuous integration...

The post Closing the Gap: Deploying Automation the Right Way appeared first on PagerDuty.

Automation in the enterprise is nothing new. Engineers have been working with automation tools and frameworks for decades. From configuration management tools, to continuous integration and delivery pipelines to cloud formation, you name it—automation is part of the fabric of nearly any technology use case in the business landscape. If the previous statement is true, then why does automation still seem to pair with so much manual work? 

The answer? There is still a palpable lag in the widespread adoption of automation across IT and the rest of an organization. For example, the majority of companies today have some form of digital offering—whether that’s a product, service, helpdesk, or other customer-facing application. And most companies leverage some degree of automation within their IT organization to deploy or maintain that service offering. However, even though automation is utilized, the full value is often unrealized in production. Its use is typically pigeonholed in small pockets of the business, and only the employees who implemented and/or built it are able to use and apply it. 

We call this the automation gap.

The Automation Gap

The automation gap is a scenario where the use of automation within an organization exists in islands—with each specialized operational unit leveraging pockets of automation in silos. Additionally, the existence of an automation gap means most employees (besides the subject matter experts) either lack the tribal knowledge, skill sets, and/or access privileges across the organization to actually use it. 

Through the lens of IT, an automation gap slows down operations, and can lead to bigger business problems such as:

  • Limited innovation capacity because of constant escalations to specialists
  • Increased SLA penalties and error rates from slowed incident resolutions
  • Specialist burnout due to inundation of unplanned work and requests
  • Ultimately unhappy customers and lost revenue

Considering the pace of innovation and rise in consumer expectations in the digital world, the negative externalities of an automation gap will only worsen—or widen—as the days go by. But before you can address the issues and bridge the chasm, you need to be able to identify and understand the underlying factors that contribute to the existence of the gap at its foundation.

The characteristics of an automation gap can be categorized into three main subcomponents:

  • The Knowledge Gap
  • The Skills Gap
  • The Access Gap

 

 

 

 

 

 

 

The Knowledge Gap

Digital-first organizations of all sizes have to constantly transform to meet customer needs, keep pace with innovation, and stay ahead of the competition. In order to see success in this evolution, the digital infrastructure needs to adapt and evolve in parallel. But evolutions can’t just happen overnight. They occur through years of digital transformation, employee and cultural transitions, and new complexities and technical dependencies—all layered atop legacy infrastructure. Without proper documentation and understanding, seamless execution can be nearly impossible.

The knowledge gap is the understanding that no single individual or employee can be the subject matter expert for every dependency, system, or best practice in the IT organization. For example, a former employee with a 5+ year standing may have the tribal knowledge to tackle a legacy infrastructure incident, but the on-call responder who’s been working eight months might not have that nuanced understanding of the same underlying infrastructure.

The Skills Gap

The skills gap is the reality that different users have different technical skill sets. Similar to the knowledge gap, the skills gap stems from employee specialization of new technologies and complexities across an organization’s digital infrastructure. As new systems and processes are introduced, often the subject matter experts (SMEs) that built or implemented them are the only people who can properly triage a problem when an incident inevitably occurs. This specialization bottleneck can negatively impact the response lifecycle, burn out your specialists, and reduce the efficiency of your remediation efforts. This gap is especially evident during periods of attrition, where one or two specialists who left were the only ones with the understanding and skills needed for X system or Y service.

Not everyone knows how to administer a database, or automate a continuous integration pipeline. In fact, a company’s most highly skilled technicians are usually in such demand, anything that can offload their work helps scale the business. To constantly escalate repetitive diagnostics and procedures to SMEs creates unplanned work and acts as a distraction to otherwise high-value work experts should be prioritizing. Talent acquisition around these technical roles continues to grow, which makes this particular gap all the more important to tackle for organizations long term.

The Access Gap

Finally, an access gap is related to maintaining security posture according to today’s best practices. Following the principle of least privilege, super-user credentials should not be widely distributed or shared among IT staff. But without proper access to tools, information, and systems, teams face prolonged resolution times, inefficient remediation, and less time for SMEs to focus on high-value work.

So how can PagerDuty help you bridge this automation gap, improve the overall agility of your IT operations, and enable your teams to innovate faster?

Apply Automation

PagerDuty’s automation capabilities enable your end-users to do what previously only your expert engineers were able to do. The platform is designed to bridge these gaps by safely delegating automation for use by other stakeholders, eliminating escalation interruptions and dramatically reducing wait times. These processes can incorporate existing task automation as individual steps in an operational workflow, abstracting the specific context of each step from the process user, while providing a consistent operational experience.

The PagerDuty® Process Automation portfolio consists of the following offerings:

  • PagerDuty® Automation Actions. A PagerDuty add-on that curates and connects responders in PagerDuty to automated diagnostics and remediation for services involved in incidents.
  • PagerDuty® Runbook Automation. A SaaS service that enables engineers to standardize and automate runbook procedures, and delegate services as self-service operations.
  • PagerDuty® Process Automation On-Prem. A self-hosted software cluster that gives engineers the ability to standardize and automate end-to-end operational workflows, and safely delegate them as self-service operations to stakeholders.

PagerDuty’s automation capabilities also help organizations safely close the access gap by providing the ability to invoke automated workflows without needing to explicitly share credentials or keys with end-users. It integrates with single sign-on (SSO) to enable role-based access control, and logs all activity at both process and step levels to meet compliance requirements.

To learn more about how PagerDuty’s automation capabilities can help you on your journey to close the automation gap, visit https://www.pagerduty.com/use-cases/automation/ today.

Now You can Invoke PagerDuty Rundeck Actions Within the PagerDuty Slack Integration by Joseph Mandros https://www.pagerduty.com/blog/rundeck-actions-slack-integration-launch/ Thu, 03 Feb 2022 14:00:54 +0000 https://www.pagerduty.com/?p=73589 Collaboratively diagnose problems to resolve incidents faster. Last year, we released PagerDuty Rundeck Actions, a PagerDuty add-on product that connects responders to automated diagnostics and...

The post Now You can Invoke PagerDuty Rundeck Actions Within the PagerDuty Slack Integration appeared first on PagerDuty.

Collaboratively diagnose problems to resolve incidents faster.

Last year, we released PagerDuty Rundeck Actions, a PagerDuty add-on product that connects responders to automated diagnostics and remediation for common problems directly in the PagerDuty incident response workflow. After working with our customers and listening to the community, we are excited to announce that PagerDuty Rundeck Actions now integrates with PagerDuty’s Slack integration.

Fusing Automation With Collaboration

With this latest integration, responders now have the ability to deploy automated diagnostics and remediation actions directly from a Slack channel. This eliminates the need to access a service through a terminal and context switch between windows, creating a faster and more efficient way to resolve incidents—while also reducing escalations to specialists.

The days of pulling up multiple windows on your double monitors to address a single problem are over. Not only can you communicate incident status to your stakeholders, you can also deploy repair actions from the same window. When an incident occurs, responders can quickly create incident-specific channels within their Slack instance to collaborate with affected teams and stakeholders, run diagnostic steps, and invoke automation to remediate the issue in real time.

CollabOps in Action

First-line responders and personnel can leverage the connections they built across the IT department (specifically applications and services integrated with PagerDuty + Rundeck) and deploy a chatbot to run the actions for you. Instead of escalating and passing off the problem up the ladder, this integration allows you to drop the incident right into a dedicated channel with the right people for the job to collaborate on a fix. And, while this is all happening, the integration also proactively captures and records the logistics of the incident as it happens, making the documentation process fully transparent and accessible to all stakeholders.

Rundeck Actions: How it Works

With PagerDuty Rundeck Actions, engineers can create and delegate automated actions for tedious and recurring diagnostic procedures to responders, reducing time wasted on repetitive tasks. This also includes automation for common mitigation approaches such as fail-over and other remediation tactics. 

Simple, repeatable cures for known issues can also be triggered without human intervention using event triggers, turning urgent issues into a resolved afterthought. To help responders working in PagerDuty accelerate resolution of incidents, Rundeck Actions connects automated diagnostics and remediation to the incident response workflow. 

Automated Diagnostics are a set of automated actions for production services that your responders can invoke when an incident occurs. Rather than having to escalate to expert specialists who manually run common tests, first responders can safely invoke this automation themselves from within PagerDuty and see responses delivered in real time back to your incident timeline. 

With PagerDuty Rundeck Actions, teams can:

  • Decrease response times by up to 30 minutes
  • Distribute subject matter expertise across response teams via Slack
  • Trigger human-assisting and self-healing automation before responders are paged
  • Invoke secure automation behind firewalls and VPCs
  • Deploy automated actions in place of manual procedures
  • Enrich incident documentation for smoother postmortems and reduced operator work

To learn more about how Rundeck Actions can work for you, check out the knowledge base. Additionally, you can contact your account manager or request a demo today.

 

A Complete Guide to Enterprise DevOps by Joseph Mandros https://www.pagerduty.com/blog/complete-guide-enterprise-devops/ Tue, 06 Apr 2021 13:00:02 +0000 https://www.pagerduty.com/?p=68681 It’s easy to assume that DevOps only works for start-ups that build their culture from scratch, or for tech giants with cloud-native roots. But in...

The post A Complete Guide to Enterprise DevOps appeared first on PagerDuty.

]]>
It’s easy to assume that DevOps only works for start-ups that build their culture from scratch, or for tech giants with cloud-native roots. But in reality, DevOps best practices can benefit everyone—from agile new businesses to decades-old enterprises. As a result, DevOps adoption is on the rise, with 74% of enterprises adopting DevOps in some form. Organizations that don’t make this shift risk being disrupted by those that have achieved greater agility, automation, and communication.

With all that being said, implementing a DevOps culture can be a daunting task. Cultural change is challenging, especially for enterprises with entrenched modes of operation and countless legacy processes and services. But it’s not impossible. This blog explains how enterprises can achieve the change necessary to begin, and sustain, a DevOps journey.

What is DevOps?

DevOps is an approach to organizational development that’s been gaining popularity over the last decade or so. Although it has become associated with a variety of software tools and platforms, DevOps itself is centered around the cultural transformation of development and operations organizations. It is rooted in cultural values and people rather than just a specific toolset, process, or developmental architecture.

Components of DevOps

DevOps can be broken down into six core values:

  • Agility: The ability to adapt quickly to embrace new technologies and services, and seamlessly scale tools and processes.
  • Collaboration: Breaking down organizational “silos” existing between developers, ITOps admins, and business stakeholders to enable cross-team collaboration.
  • Code Ownership: By emphasizing the principle of “owning your code,” DevOps encourages developers to participate in all steps of software delivery, from writing and deploying code to monitoring applications in production.
  • Automation: DevOps emphasizes the automation of processes, from code builds and deploys to application monitoring, driving better agility and collaboration.
  • Continuous Learning: By collecting metrics and building continuous feedback loops, DevOps lets organizations constantly assess their performance and continuously improve.
  • Communication: Effective, organization-wide communication is the foundation to implementing these values.

Why Enterprises Are Making the Move to DevOps

DevOps adoption is growing, with 26% of organizations saying they have adopted DevOps across all projects, compared to just 12% in 2017. Enterprises can gain a myriad of benefits from embracing DevOps best practices, which can be implemented at companies of all types and sizes. DevOps also supports improved, enterprise-wide communication, generating value by saving time and money.

Organizations that embrace DevOps can also use employees’ time more efficiently. This helps deliver better, faster updates to customers and enables fast response times to market changes. Finally, it allows organizations to continuously improve by making ongoing, incremental updates to products, services, and processes.

DevOps is a Journey

To embrace DevOps fully, it’s essential to understand that it cannot just be implemented and forgotten about. Building a DevOps culture is a continuous process. An organization’s DevOps journey typically follows the paths outlined below as it evolves from a basic embrace of DevOps values toward an advanced DevOps culture:

Software Delivery Practices

Traditional: Waterfall model. Releases every 1-2 years. Deployments require planning.
Beginning: Quarterly releases. Development and ITOps may interact, but changes require manual handoff.
Intermediate: Coordination between Development, ITOps and QA. Deployments impact fewer services. Faster software releases and less downtime.
Advanced: Continuous integration, deployment, and delivery. Releases throughout the day, zero downtime.

Troubleshooting Processes

Traditional: Development/QA send ad hoc manual requests to ITOps. ITOps is not automatically notified of problems.
Beginning: Development/QA have a process for sending requests to ITOps, but goals are uncoordinated.
Intermediate: ITOps enables some self-service access for Development. ITOps is engaged with Development for troubleshooting.
Advanced: Teams collaborate throughout the software delivery lifecycle. Collective accountability for maximizing performance. Recovery is quick.

Environment and Tooling

Traditional: Static test environments and manual change management. Incidents are handled on an ad hoc basis. No parity between development and production environments.
Beginning: Some environment parity, automated build and testing, and automated alerts are in place. Alert response and escalations are manual.
Intermediate: Service-oriented monitoring and automated test-and-build are in place. Deployment may require specialized ITOps skillsets.
Advanced: Complete automation across test-and-build. Incident response mobilization. Development and production environments completely and automatically integrated.

Challenges in Enterprise DevOps

Adoption comes with many challenges, however. For one, it can be difficult to facilitate communication and collaboration across the many different teams that contribute to software creation and management. This is further complicated when teams exist within teams – for example, a development team may have subsets of developers all working on different projects.

Many enterprises are also dependent on legacy systems and applications and may mistakenly believe that DevOps only works in a DevOps-centric organization. In reality, DevOps techniques can be applied to legacy systems – for instance, using modern alerting tools to enable continuous monitoring of legacy applications.
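
As a rough illustration of continuous monitoring for a legacy application, the Python sketch below decides when a series of health checks should raise an alert. It assumes something else (a cron job, a script, an agent) produces one boolean per check; the function name `first_alert` and the three-failure threshold are illustrative choices, not part of any specific alerting tool.

```python
from typing import Iterable, Optional


def first_alert(check_results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check at which an alert should fire.

    check_results holds one boolean per health check (True = healthy).
    An alert fires once `threshold` consecutive checks have failed.
    Returns None if that never happens.
    """
    failures = 0
    for index, healthy in enumerate(check_results):
        # A healthy check resets the streak; a failed one extends it.
        failures = 0 if healthy else failures + 1
        if failures >= threshold:
            return index
    return None
```

Requiring several consecutive failures is one common way to avoid alerting on a single transient blip; the right threshold depends on how often the legacy system is checked and how quickly problems need attention.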

Concerns around security, governance, and compliance can also prevent enterprises from adapting quickly. So, organizations looking to embrace DevOps must find ways to balance it with change management and risk mitigation. Finally, DevOps culture requires flexible roles—for instance, ITOps engineers must understand what developers are doing, and vice versa. However, not all engineers have the necessary skill sets to do this. As such, enterprises may need to invest in additional IT recruitment or training for current engineers.

Enabling and Consulting DevOps Transformation for the Enterprise

Overcoming these barriers requires enterprises to do the following:

  • Start Small: Enterprises can begin by encouraging the adoption of DevOps principles on a small scale within certain groups. Once the model has been proven there, it can be expanded to the rest of the organization.
  • Improve Communication: Organizations need communication tools that can aggregate both human and machine data on the state of software and systems. The tools should make it easy to understand the nature of an issue and coordinate responses in real-time.
  • Focus on Integration: Enterprises that rely on legacy software and struggle to coordinate operations across teams should integrate their software tools and workflows as much as possible. This will help make data available to personnel when they need it.
  • Implement Flexible Roles: Members of different teams – from engineers to ITOps – should be equipped with the knowledge and communication skills to collaborate outside of their specialist areas. To find out more about essential DevOps roles, click here.
  • Think Holistically: Enterprises must integrate the workflows of everyone who impacts software delivery, not just developers and ITOps. For instance, the legal department is involved in software licensing and contracts, and may be needed if downtime means an SLA is broken with clients. So, legal must be integrated into the organization’s DevOps culture. The same applies to other departments—including HR, PR, and executive leadership.
  • Implement the Right Tooling: Enterprises seeking to embrace DevOps should build a toolset that enables agility, collaboration, automation, communication, and legacy compatibility if required.
  • Enable Cultural Change: Embracing DevOps is not as simple as installing the right software. Enterprises should identify their cultural transformation goals and select the software tools that help achieve them.

Achieving Enterprise DevOps With PagerDuty

DevOps adoption is only set to increase. However, in order to reap the benefits, organizations must build continuous improvement into their culture and commit to their ongoing DevOps journey. With PagerDuty, enterprises can build efficient communication into the core of their culture, enabling them to continue their DevOps journey.

To learn more about how PagerDuty can help your developer organization, click here.

]]>