runbook automation | Tags | PagerDuty

Democratize Automation with AI-Generated Runbooks by Ranjana Devaji

Ranjana Devaji — Thu, 31 Aug 2023 12:00:57 +0000

Operational efficiency is as critical within IT and engineering teams as in any other part of the business. Automating repetitive tasks and reducing escalations within and to these teams is of immense value.

While automation saves time and boosts productivity, the complexity of developing automation can be a limiting factor and bottleneck. Generative AI is a paradigm shift here, in that it brings consumer-style simplicity to assisting in the development of enterprise-grade automation.

With the new interface of generative AI, organizations can democratize automation and thereby increase the number of individuals who author and use automation.

To help our customers achieve their goals with automation, we are excited to announce public Early Access for AI-generated Runbooks. Starting today, Runbook Automation users can write the task they wish to automate in plain English and let AI build an automation template for that particular task.

AI-generated Runbooks lower the barrier to entry for new automation developers and speed up the creation of new automation for experienced automation authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task.

Simply sign up for the PagerDuty Runbook Automation Trial if you are not an existing Runbook Automation user. For existing Runbook Automation customers, App administrators can enable this feature at any time.

Tangible Benefits from Leveraging AI-generated Runbooks

Better Development Timelines

Are you a seasoned automation engineer? AI-generated runbooks will help you save time and effort.

Creating self-service automation for tasks involves sifting through documentation, identifying the right calls/commands, and then transposing them into individual job steps manually.
With AI-generated runbooks, authors can generate these on the fly and faster than ever before.

Here’s a quick look at what provisioning access to apps in Okta looks like with and without AI-generated Runbooks:

Faster Onboarding

Get good fast with example jobs for reference. Start with tasks you’re familiar with and see how these tasks operate within the Runbook Automation platform.

Democratize Technical Automation

With AI-generated Runbooks, less experienced automation authors can quickly go from thought to development to implementation. This broadens the range of people within an organization who can leverage a technical tool such as Runbook Automation.

Conquering Blank Slate Problems

A typical conundrum for users facing automation is “Where do I start?” With AI-generated runbooks, users can now hit the ground running by creating baseline versions of Jobs for a variety of use cases.

Build with Best Practices for Optimal Results

The AI-generated Runbooks use fine-tuned “prompt engineering” that embeds known best practices from the engineers on the Process Automation team at PagerDuty.
For example, all Jobs are created with a ReadMe that explains the prerequisites for invoking that Job, and all secrets used within the Job are retrieved from Key Storage rather than requested from the user.

GenAI – Security & Data

A common concern we hear from customers who are considering generative AI is the security of their data, and the possibility of giving competitors an advantage through model training. PagerDuty’s AI-generated Runbooks feature is opt-in, meaning you must enable it before you can use it. Furthermore, as stated in the feature documentation:

The only data sent to the generative AI model is the text entered into the prompt field. No other data about your environment, existing Jobs or the source of the prompt is sent to the AI model. Furthermore, the AI model is not trained on the text entered into the prompt.

Read our Guidelines for the Safe and Secure Use of Generative AI to learn more about how we’re working with and building our AI-powered features.

See AI-generated Runbooks in Action

AI-generated Runbooks propel automation into a new realm, where your operations are empowered like never before. Embrace the future of automation with PagerDuty, and witness the transformation it brings to your operational landscape.

The post Democratize Automation with AI-Generated Runbooks appeared first on PagerDuty.

Gartner Market Guide: Embedding Automation Into the Enterprise by Joseph Mandros

Joseph Mandros — Thu, 20 Jul 2023 13:00:17 +0000

Read the Gartner Market Guide today!

“Existing workload automation strategies are unable to cope with the expansion in complexity of workload types, volumes and locations driven by evolving business demand, as per Gartner. Digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. Cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.”

The lack of collaboration and automation both within and beyond IT departments creates isolated capabilities across business and IT teams, hindering the pace of digital business. Additionally, as technical debt accumulates and the need to showcase the value of investments grows, cost optimization becomes an ongoing challenge.

By embracing these key findings in a recent Market Guide published by Gartner®, we believe businesses can streamline their operations and enhance efficiency in the face of expanding complexities. Let’s take a brief look at our understanding of the key findings from the Gartner Market Guide for Service Orchestration and Automation Platforms.

Workload Automation Challenges

Businesses face a myriad of challenges when it comes to managing workloads effectively. “According to Gartner, existing workload strategies are unable to cope with the expansion in complexity of workload types, volumes, and locations driven by evolving business demands.” Automation strategies that once sufficed are no longer equipped to handle the expanding complexities driven by evolving business demands.

To address these challenges, organizations need to adopt intelligent automation solutions that can adapt to changing requirements. Intelligent workload automation leverages technologies such as artificial intelligence (AI) and machine learning (ML) to dynamically allocate resources, optimize scheduling, and automate repetitive tasks.

Collaboration and Automation: Breaking Silos

According to Gartner, digital business is slowed without collaboration and automation inside and outside of IT, leading to siloes of capabilities across business and IT teams. The lack of collaboration and automation between these teams can significantly hinder digital business initiatives. Silos create barriers, slowing down decision-making processes and impeding the flow of information and ideas.

To overcome these challenges, organizations must foster a culture of collaboration and implement automation solutions that span across business and IT functions. By integrating workflows and sharing information seamlessly, teams can work together in harmony, driving innovation and accelerating digital initiatives. Collaborative automation tools, such as workflow management platforms and project management software, can facilitate effective communication, collaboration, and information sharing, leading to faster time-to-market and improved customer satisfaction.

The Evolving Challenge of Cost Optimization

“According to Gartner, cost optimization is an evolving challenge, driven by technical debt and requirements to demonstrate business value of investments.” As digital infrastructures expand, organizations accumulate technical debt—a burden caused by outdated technologies, inefficient processes, and legacy systems. This technical debt not only impedes agility and innovation but also increases operational costs.

To address this challenge, businesses must prioritize cost optimization through strategic investments and ongoing evaluation of their technology portfolios. Embracing cloud-based solutions, leveraging automation, and adopting agile practices can help organizations reduce technical debt and achieve greater cost efficiency.

Additionally, organizations should establish clear metrics and processes to measure and demonstrate the business value of technology investments, enabling informed decision-making and resource allocation.

Conclusion

We believe the key findings from the report underscore the need for organizations to stay agile, adaptive, and innovative in their approach to workload management, ensuring they can effectively meet evolving business demands and drive sustainable growth in the ever-changing digital landscape.

“According to Gartner, it is recommended to drive collaboration across business and IT teams by democratizing access to automated capabilities through feedback-driven self-service automation solutions.

Unlock the business value of orchestrated I&O automation by implementing expanded service orchestration and event-driven workflows to drive agility, innovation and cost optimization efforts.

Meet today and tomorrow’s business demands by service orchestration and automation platforms (SOAPs) delivering the needed agility and efficiency. Embed agility and efficiency into orchestrated IT processes to meet business demands by using SOAPs.”

To learn how PagerDuty can help you on your automation journey, click here.

___________________________________

Gartner Market Guide for Service Orchestration and Automation Platforms, Chris Sanderson, Daniel Betts, Hassan Ennaciri, 23 January 2023.


Take Advantage of the New Product Trial of Runbook Automation for Incident Resolution by Jorge Villamariona

Jorge Villamariona — Mon, 12 Jun 2023 12:00:48 +0000

The PagerDuty Operations Cloud is the platform that enables our customers to manage the full lifecycle of urgent incidents. Many of our customers are leveraging Process Automation to augment their incident response teams and as a key driver to grow and scale their capabilities.

The work resulting from urgent incidents cannot be postponed because it impacts a company’s revenue or ability to service customers. Often, this work is repetitive and could be delegated to first responders. However, the deeper context needed to accurately diagnose and remediate these incidents is locked away in production environments and requires knowledge, skills and access privileges from specialists. Responders frequently have to escalate the incident to already overworked specialists, a time-consuming process that can be disruptive, frustrating and repetitive.

By automating repetitive and time-consuming tasks from your incident resolution process, you can free up your engineers to focus on higher-value activities that require creativity and critical thinking. This, in turn, can lead to shorter MTTR, better customer experiences, faster innovation, revenue protection and improved profitability.

PagerDuty Automated Incident Resolution provides pre-built and customizable diagnostic and remediation capabilities that help first responders determine the cause and initiate remediation within their production environment, which saves time and requires fewer individuals to assist with the response. Automating this repetition can shorten MTTR by 25% and reduce costs and interruptions by at least 50%.

In order to make it easier for our customers to realize how Automated Incident Resolution can speed up their MTTR, we rolled out the In-Product Trial of Runbook Automation for Incident Resolution last month. This trial is exclusively available to our Business and Digital Operations customers. PagerDuty users can request a trial from the automation tab using the Web UI:

The account owner will see an approval request via email. Upon approval, the account owner can set up a trial for Automation Actions and get a fully functional Runbook Automation instance in just a few minutes. Users will see a visual guide to help them get started authoring automation, including:

Creating a Runbook Automation (RBA) instance
Adding a Runner (a program that allows you to execute automation jobs in your environment)
Adding Automation Actions (which allow you to invoke automation jobs and workflows from PagerDuty)
Running Actions (from the PagerDuty incident details page)
Viewing the output from the automation

We encourage you to take full advantage of this trial to further automate and optimize your incident response process. We look forward to hearing your Incident Resolution Automation success story.


Debug State Capture for Traditional Infrastructure & Apps by Justyn Roberts

Justyn Roberts — Thu, 25 May 2023 13:00:59 +0000

In our previous blogs on Capturing Application State and using Ephemeral Containers for Debugging Kubernetes, we discussed the value of being able to deploy specific tools to gather diagnostics for later analysis, while also giving the incident responder the means to resolve infrastructure or application issues.

This strikes a balance between the need to restore a service as quickly as possible and the need to ensure enough debug data is available for a later permanent resolution, all while allowing a development team to keep a container running lean and in a performant way.

By capturing both application and environment state when the incident occurs, any responder or service owner spends less time context switching between tools, credentials, and environments—enabling more accurate and faster responses and problem resolution.

The techniques discussed in the prior blogs in this series focused on modern, cloud-native platforms like Kubernetes, and the unique approaches needed for containers, especially containers that do not natively ship with debugging tools.

Not everyone is able or willing to move every application to cloud native, and many of us still work within a hybrid scenario of both containerized and traditional applications.

Even without the ephemeral nature of containers and the strict policies of container images, there is still a need to capture in-the-moment evidence to help with root-cause analysis in order to avoid future occurrences of incidents.

Let’s look at use cases describing the ability to capture state automatically in the event of a failure or decreased performance, and pick some interesting scenarios to dive into for a deeper look.

This is a non-exhaustive list, but here are some examples of how debug state capture is used in traditional application environments:

Infrastructure & Network

Top resource-consuming processes on one or more infrastructure components
TCP dump; thread/memory/core dump

Database

Top resource-consuming queries
Current query state
Execution of application-specific queries

Application specific

Java – Run thread/heap dump with tools like jstack
Windows – Proc Dump
Python – Running thread dump
All – Application specific log files

Additional Log Files

Debug state capture can grab whole or partial logs from any file that may not be captured by a log aggregator.

PagerDuty Process Automation provides many pre-built template workflows for capturing application and environment state as part of the automated diagnostics project. These workflows are flexible and extendable so that they can be customized to work for your particular use-cases.

Taking a Deeper Dive

Let us take a closer look at some specific examples of capturing environment state that could prove useful in identifying the long-term fix for an incident.

Use case 1 – Gather database debug

We can use the SQL RUN Step in Process Automation to either add an inline statement or execute an existing script. As my application is MariaDB (a fork of MySQL), I can use the following parameters to run the MySQL query:

SHOW FULL PROCESSLIST;

(Note: credentials are derived from my existing external store and passed securely as I execute the step as part of a workflow, so I can safely delegate without exposing info)

I pass the output to my incident platform (in my case, PagerDuty, of course), and set the job to collect automatically if an incident occurs within the database service.

This info is now automatically available to my responder in their app or ChatOps tool, and within any postmortem. In this case, I can see someone is running a benchmark test at the point of incident! As with the previous blog posts, it would also be easy to post more complex versions of this to a storage environment like an AWS S3 Bucket for later analysis.
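
As a rough illustration of how processlist output might be triaged before it is attached to an incident, the sketch below keeps only the sessions that are actively running a statement. It assumes the rows have already been fetched as dictionaries keyed by the usual SHOW FULL PROCESSLIST column names; the function names, the threshold, and the sample rows are hypothetical and are not part of Process Automation.

```python
# Hypothetical sample rows, shaped like the output of SHOW FULL PROCESSLIST.
SAMPLE_ROWS = [
    {"Id": 7, "User": "app", "Host": "10.0.0.5:51234", "db": "shop",
     "Command": "Sleep", "Time": 120, "State": "", "Info": None},
    {"Id": 12, "User": "bench", "Host": "10.0.0.9:40022", "db": "shop",
     "Command": "Query", "Time": 42, "State": "Sending data",
     "Info": "SELECT BENCHMARK(100000000, MD5('x'))"},
    {"Id": 15, "User": "app", "Host": "10.0.0.5:51240", "db": "shop",
     "Command": "Query", "Time": 0, "State": "starting",
     "Info": "SHOW FULL PROCESSLIST"},
]


def active_statements(rows, min_seconds=5):
    """Return sessions running a statement for at least min_seconds, slowest first."""
    hits = [
        row for row in rows
        if row["Command"] != "Sleep"          # idle connections are not interesting
        and row["Info"]                       # skip sessions with no statement text
        and int(row["Time"]) >= min_seconds   # Time is seconds in the current state
    ]
    return sorted(hits, key=lambda row: int(row["Time"]), reverse=True)


def incident_note(rows):
    """Format the filtered rows as plain text suitable for an incident note."""
    return "\n".join(
        f"{row['Time']}s {row['User']}@{row['Host']}: {row['Info']}" for row in rows
    )


if __name__ == "__main__":
    print(incident_note(active_statements(SAMPLE_ROWS)))
```

With the sample rows, only the benchmark session passes the default five-second threshold, which mirrors the kind of finding described above.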

Use case 2 – Gather application debug

My observability tool is very quick to let me know WHEN an application has failed, but it does not always tell me WHY it failed. This second use case runs an ad hoc command against my Python application using py-spy, a sampling profiler, in conjunction with one of our automation plugins to move files securely to S3 for later retrieval.

The output goes directly to my S3 storage:

This example surfaces worker states for my Python app at a thread level, puts them straight into the hands of my developer, and stores them for as long as they might be needed for reference.
Of course, these commands are not exclusive, and I could easily chain multiple checks to provide a broader view.
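
The exact command and upload step used in this use case are not reproduced here, so the following is only a minimal sketch of the idea. It assumes a one-off thread dump via `py-spy dump --pid`, and it invents a key layout for the stored object; the function names and the `debug-state/` prefix are hypothetical, and the upload itself is left to whichever plugin or client is in use.

```python
from datetime import datetime, timezone


def pyspy_dump_command(pid):
    """Build the argv for a one-off py-spy thread dump of a running process."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_object_key(service, incident_id, captured_at):
    """Build a hypothetical object key that groups dumps by service and incident."""
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"debug-state/{service}/{incident_id}/{stamp}-py-spy-dump.txt"


if __name__ == "__main__":
    # Print what would be run and where the result would be stored.
    print(" ".join(pyspy_dump_command(4242)))
    print(dump_object_key("checkout", "INC123", datetime.now(timezone.utc)))
```

Keying the object by service, incident, and capture time is one way to make a dump easy to find after the incident; any naming scheme that preserves those three pieces would serve the same purpose.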

Use case 3 – Traditional Infrastructure debug state capture

For the third use case, I need to deploy a set of bash commands to a remote machine and run them at the trigger event. This primarily surfaces diagnostics such as open files and network connections, but it also runs bpftrace, a tool that can be used for tracing specific calls:

Process Automation allows me to define and deploy a whole script and store the output for gathering a snapshot of my environment state:
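
The original script is not reproduced here. As a hedged sketch of the same pattern, the helper below runs a named set of diagnostic commands and collects their output into one snapshot, recording a note instead of failing when a tool is missing or slow. The function name and the example command set are assumptions; a bpftrace invocation could be added to the dictionary in the same way, but it is omitted because it needs elevated privileges and a tracing program.

```python
import subprocess
import sys


def capture_snapshot(commands, timeout=30):
    """Run each named command and return a dict of name -> output or failure note."""
    snapshot = {}
    for name, argv in commands.items():
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except OSError as exc:
            # The tool is not installed or cannot be executed on this machine.
            snapshot[name] = f"unavailable: {exc}"
            continue
        except subprocess.TimeoutExpired:
            snapshot[name] = f"timed out after {timeout}s"
            continue
        if done.returncode == 0:
            snapshot[name] = done.stdout
        else:
            snapshot[name] = f"exit {done.returncode}: {done.stderr}"
    return snapshot


if __name__ == "__main__":
    # Hypothetical command set for open files and network connections.
    diagnostics = {
        "open_files": ["lsof", "-nP"],
        "sockets": ["ss", "-tan"],
        "python_version": [sys.executable, "--version"],
    }
    for name, output in capture_snapshot(diagnostics).items():
        print(f"== {name} ==\n{output}")
```

Recording failures inline rather than raising keeps the snapshot useful even when one of the tools is absent on the target machine.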

Conclusion

Signals from monitoring tools, even in traditional environments, benefit from broader visibility to allow any responder, DevOps engineer or SRE to make quick and safe decisions. Developers also often need additional information and the ability to capture state when problems arise, as they might not be on hand immediately.

Debug State Capture enables this, providing additional context for a responder, reducing time spent digging around in different tools, and adding the capability to collect deeper datasets for subsequent analysis.

Curious to learn more? Get started today with a trial of Runbook Automation.


What is Runbook Automation? by Catherine Craglow

Catherine Craglow — Wed, 03 May 2023 15:00:35 +0000


Debugging Kubernetes with Automated Runbooks & Ephemeral Containers by Jake Cohen

Jake Cohen — Tue, 02 May 2023 13:00:01 +0000

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is when an application is running in a container and the container is redeployed (perhaps to a prior version or the same version) simply to solve the immediate issue. For companies where every millisecond of performance and every second of uptime have consequential impacts on the customer experience, these types of short-term fixes are a necessity. The costs to the business become significant, though, when engineers are tasked with developing the long-term solution to these incidents. For both major and (recurring) minor incidents, engineers have to spend inordinate amounts of time gathering evidence of the state of the application and environment when the incident occurred.

While a good portion of this diagnostic data resides in monitoring tools and therefore persists, there are times when it is necessary to get a shell in a container to retrieve information that is only available for the lifetime of the container. In Kubernetes, this is done using the kubectl exec command. With the right parameters, users can get a live shell in their running container and start executing commands to retrieve diagnostics. For example, once a user has a shell in a Java container, they can invoke jstack to get a thread dump of their application.
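
To make the kubectl exec approach concrete, here is a small sketch that assembles the command for a jstack thread dump. The function name is hypothetical, and the default of PID 1 assumes the Java process is the container's main process, which will not hold for every image.

```python
import shlex


def kubectl_exec_jstack(namespace, pod, container, java_pid=1):
    """Build the argv that runs jstack inside a running container via kubectl exec."""
    return [
        "kubectl", "exec",
        "-n", namespace,        # namespace of the pod
        pod,
        "-c", container,        # container within the pod
        "--",                   # everything after this runs inside the container
        "jstack", str(java_pid),
    ]


if __name__ == "__main__":
    print(shlex.join(kubectl_exec_jstack("prod", "web-7d9f", "app")))
```

Building the argument list separately from running it makes the command easy to log or review before anything touches a production pod.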

But many operations teams do not let anyone exec into production pods (which is where critical incidents happen), or the number of people who can is very small, both for security reasons and because few people are familiar with operating in Kubernetes. Consequently, in order to retrieve diagnostic data during an incident, individuals with Kubernetes access and expertise regularly need to be pulled in for help. This process drives up the cost of incidents by increasing MTTR, as well as the number of people that need to get involved.

For these reasons, it is best to use automation that removes the need for users to exec into running pods. With this automation architecture, when an issue occurs, an automated runbook is invoked, and that runbook retrieves the debug data, sends it to a persistent storage location (S3, Blob Storage, SFTP server, etc), and then informs the engineers where they can locate and use the debug data.
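
The three stages described above (retrieve, persist, inform) can be sketched as a small function that takes each stage as a callable. This is an illustration of the flow only, not the PagerDuty template; the function name, parameter names, and message wording are assumptions.

```python
def run_debug_runbook(capture, store, notify):
    """Capture debug data, persist it, and tell engineers where it is.

    capture() returns the debug data as text.
    store(data) persists the data and returns its location.
    notify(message) delivers a message to the responders.
    """
    debug_data = capture()
    location = store(debug_data)
    notify(f"Debug data available at {location}")
    return location


if __name__ == "__main__":
    # Stand-ins for a real capture command, storage backend, and incident note.
    saved = {}

    def store_in_memory(data):
        saved["dump.txt"] = data
        return "memory://dump.txt"

    run_debug_runbook(lambda: "example thread dump", store_in_memory, print)
```

Passing the stages in as callables keeps the flow independent of the storage backend, so the same runbook shape works whether the data lands in S3, Blob Storage, or an SFTP server.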

PagerDuty Process Automation provides a pre-built, templatized runbook for exactly this use case: when an alert creates an incident inside PagerDuty, this can automatically (or at the click of a button) trigger the runbook to execute commands in the pod, send the output to persistent storage, and provide details on the location of that data in the incident.

Link to debug data is provided to engineers during and after the incident

Users of both our commercial automation products (Process Automation and Runbook Automation) and open source Rundeck can follow the instructions here to download and get started with the automated runbook.

This automated runbook is great when the container image already has the command-line utilities (binaries) needed for debugging. For example, many containerized Java apps ship with the jstack utility in the container image; however, what happens when the debugging utilities are not shipped as part of the container image? Or, as is increasingly commonplace, what happens when the container is “distro-less,” and therefore will not even provide a shell?

This is where Kubernetes Ephemeral Containers come into play—providing users a mechanism to attach a container (of any image) to a running pod without the need to modify the pod definition or redeploy the pod.

By sharing the process namespace, the ephemeral container can use its debugging utilities for another container in the pod—even if the original container is in a crashed state. Here is a blog by Ivan Velichko that goes into great detail about process-namespace sharing with ephemeral containers:

Source: https://iximiuz.com/en/posts/kubernetes-ephemeral-containers/

Similar to using kubectl exec, leveraging ephemeral containers properly still requires permission to execute kubectl commands on the Kubernetes cluster, which is rarely available to those outside operations. And just as before, knowing how to properly construct the command takes a high level of familiarity with Kubernetes:

kubectl debug -it -n ${namespace} -c debugger --image=busybox --share-processes ${pod_name}
(Sample command for using Kubernetes Ephemeral Containers)
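
As an illustration of wrapping that command so responders do not have to construct it by hand, the sketch below fills in the placeholders of the sample command. The function and parameter names are hypothetical. The optional `target` argument maps to kubectl's `--target` flag for pointing the ephemeral container at a specific container's processes; it is an addition that does not appear in the sample command.

```python
import shlex


def kubectl_debug_command(namespace, pod_name, image="busybox",
                          debugger_name="debugger", target=None):
    """Build the argv for attaching an ephemeral debug container to a pod."""
    argv = [
        "kubectl", "debug", "-it",
        "-n", namespace,
        "-c", debugger_name,        # name given to the ephemeral container
        f"--image={image}",         # image that carries the debugging tools
        "--share-processes",
    ]
    if target is not None:
        argv.append(f"--target={target}")
    argv.append(pod_name)
    return argv


if __name__ == "__main__":
    print(shlex.join(kubectl_debug_command("prod", "web-7d9f")))
```

The `-it` flags mirror the interactive sample; an unattended runbook would more likely drop them and pass a command to run after a `--` separator.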

To accommodate users that have containers without debugging utilities or distro-less containers, we have built a new Kubernetes plugin that harnesses the ephemeral containers functionality:

We have used this plugin in a template for an automated runbook that also captures diagnostic output and sends the output to a persistent location. Process Automation and Runbook Automation users can get started with this template job by downloading it as part of the Automated Diagnostics Project here.

If you do not yet have a Process Automation or Runbook Automation account, click here to get started with PagerDuty’s automation products.


PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments by Joseph Mandros

Joseph Mandros — Tue, 28 Mar 2023 13:00:59 +0000

Be sure to register for the launch webinar on Thursday, March 30th to learn more about the latest release from the PagerDuty Operations Cloud.

Rundeck by PagerDuty has long helped organizations bridge operational silos and automate away IT tasks so teams can focus more time on building and less time putting out fires. And while this mission still rings true today, our vision is to extend this reality and revolutionize all operations while continuing to build trust.

To resolve high-impact work faster and more efficiently, the PagerDuty Operations Cloud delivers value across every IT environment, whether it be pre-production or production, isolated or secure, multi-cloud or on-premises, you name it. We want to meet our customers where they are and deliver the value they need.

Starting today, that vision is now a reality.

We are thrilled to introduce a next-generation architecture for PagerDuty Runbook Automation and PagerDuty Process Automation that simplifies how our customers manage automation across cloud, remote, and hybrid environments.

This latest functionality, among others, is why Runbook Automation is an integral part of the PagerDuty Operations Cloud. Now PagerDuty helps automate across any infrastructure, multi-zoned hybrid environment, network, and more to resolve that unplanned, time-sensitive, and high-impact work we know about all too well.

Standardizing automation across secure infrastructure

It’s clear that automation has become a necessity in order for businesses to keep pace with the rapid transformations happening across the technical landscape. These businesses also have to sustain growth and transformation while doing more with the same, or even fewer, resources. Additionally, segregated environments and disparate services add complexity via hybrid cloud realities and increasing security and regulatory requirements. This sprawl of IT environments has led to a new dimension of organizational silos, along with departmental and technical silos.

One thing is for sure: When built, conventional automation tooling didn’t anticipate the complexity of security requirements in modern distributed environments. As a result, engineers have to manually execute tasks for operations within each environment, causing long wait times, more personnel time consumed, and higher levels of engineering toil. To solve this problem of fragmented automation, something more is needed. Teams need full visibility across their entire infrastructure and the ability to seamlessly execute distributed automation jobs—without having to manually build new automated operations into each project and environment.

With this new functionality, instead of having to manually invoke an automation step in each environment, engineers can now manage and run automated tasks and distribute that automation across their many segregated environments from a single point of administration.

As a result, teams will be able to:

Operate faster by enabling automated operations across cloud and data center environments
Simplify security when operating in high-compliance and zero-trust architectures
Eliminate toil by speeding up task resolution and reducing personnel time across all zones, environments, and networks

In order to better understand how this is made possible by the new functionality, let’s touch on some of the challenges we are looking to solve for our current and future customers.

Enabling scale and efficiency with security in mind

While it is true that automation can unlock new levels of scale and potential for innovation, it also brings with it critical challenges around added complexity, connectivity, and security. For technology teams, this means additional dependencies inside isolated environments that need to be maintained, distributed network endpoints to keep in check, and islands of fragmented automation spread across remote and local environments that need to be securely managed and run.

One of the bigger challenges that we hear from our customers is around managing and running automation across environments with high security and compliance requirements. In many cases, engineers have to manually manage each of their several isolated environments due to the many security nuances and process dependencies within each zone.

Now, PagerDuty Runbook Automation can be that connectivity conduit across our customers’ distributed operations, including those with strict requirements:

Disparate environments? No problem: Runbook Automation and Process Automation can now authorize orchestration of automation steps in remote environments as if they were local, and allow incorporation of many environments in the same job definition. This eliminates the network silos that typically compromise automation and thus require manual log-ins to properly run in those environments.
Compliance audits? No problem: Runbook Automation and Process Automation now simplify compliance by embedding access control and logging into automation, now extending these capabilities into remote environments—all from a centralized control plane.
Zero trust security? No problem: For customers with high security requirements, Runbook Automation and Process Automation can now enable connectivity without the need to open ports in their firewalls, such as SSH, enabling remote operations. This new functionality simplifies secure connectivity to automation by reducing the need for customers to deploy their own bastion or jump host and public endpoints.

Example diagram of PagerDuty Runbook Automation running an automated diagnostic process in remote environments to capture environmental state.

New Runner functionality

The Runner is a remote execution point purpose-built for node steps to run on specified endpoints, rather than from the automation server itself. The Runner, available for both Process Automation and Runbook Automation, securely opens up network communication between data centers, remote environments, and the automation cluster.

The new release offers a next-generation Runner that is now integrated with common infrastructure such as Ansible, Docker, and Kubernetes that execute locally within the private network. The new architecture now allows job authors to develop automated jobs that incorporate multiple environments.

New feature highlights

Run automation anywhere with next-generation Runners that provide secure and resilient connectivity from within remote environments.
Support complex architectures and jobs with distributed automation steps that enable the orchestration of standardized automation to work across any environment.
Simplify management with an enhanced Runner UI and APIs that simplify administration of Runners from the central automation environment, including configuration, status, and managing credentials.
Integrate your existing stack with plugins available on remote Runners for common technologies like Ansible, WinRM, Kubernetes, and Docker that can execute in local and remote environments.

Process Automation and Runbook Automation can now provide the same breadth of automation workflows with execution steps for Ansible or Kubernetes in remote environments that will only continue to strengthen as we blaze this trail of new distributed automation capabilities for our customers.

Looking ahead

These new automation features from Runbook Automation and Process Automation are just the beginning. They strengthen the value of the PagerDuty Operations Cloud by giving our customers more flexibility to create triggered workflows across a wider variety of secure environments.

Register for our webinar on Thursday, March 30th to hear more about the latest release from the PagerDuty Process Automation portfolio. If you have any questions or are interested in learning more, make sure to contact your account manager and visit our Process Automation page.

The post PagerDuty Announces New Automation Enhancements That Simplify Operations Across Distributed and Zero Trust Environments appeared first on PagerDuty.

Getting Started Workshop: Rundeck By PagerDuty by Nisha Prajapati

Nisha Prajapati — Mon, 12 Dec 2022 20:49:04 +0000

The post Getting Started Workshop: Rundeck By PagerDuty appeared first on PagerDuty.

Automated Diagnostics by Joseph Mandros

Joseph Mandros — Tue, 27 Sep 2022 22:34:21 +0000

The post Automated Diagnostics appeared first on PagerDuty.

New! Common Automated Diagnostics for AWS Users by Jake Cohen

Jake Cohen — Wed, 03 Aug 2022 13:00:02 +0000

Modern cloud architectures centered on AWS are typically a composite of ~250 AWS services and workflows implemented by over 25,000 SaaS services, in-house-developed services, and legacy systems. When incidents occur in these environments, whether or not a company has built out a centralized cloud platform, distinct expertise is often a necessity. Because of this complexity at scale, first responders often have to escalate to several different service owners or expert engineers to gather diagnostics before it is possible to determine who should ultimately resolve an issue.

When it comes to incident response, it’s critical that these new cloud environments integrate seamlessly with an organization’s existing critical applications and services, both old and new. To enhance service quality and make it easier for responders to cross that bridge of expertise, we are excited to announce the immediate availability of new AWS plugin integrations for automated diagnostics.

New AWS Plugins for Automated Diagnostics

Our new AWS plugins for Automated Diagnostics provide deeper coverage for customers who also use AWS, making it easier to get automated diagnostics up and running quickly in their AWS environment.

The new AWS plugins for Automated Diagnostics include:

CloudWatch Logs plugin. This plugin retrieves diagnostic data from AWS infrastructure and applications. Now users can more easily run automated diagnostics for AWS across multiple accounts and products.
Systems Manager plugin. This plugin lets users automate tasks such as configuration management, patching, and deploying monitoring and security tooling agents, for faster and more accurate execution.
ECS Remote Command plugin. This plugin provides a mechanism to execute commands on containers. This enables developers and operators to retrieve diagnostic data from their running applications in real-time before redeploying their services.
Lambda Custom Code Workflow plugin. This plugin creates, executes, and optionally deletes a new Lambda function from the custom code provided as input in a Job step. Users can execute custom scripts as steps in jobs without having to install any software.
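
To make the create, execute, and optionally delete lifecycle concrete, here is a minimal Python sketch of that pattern. It is an illustration only, not the plugin's actual implementation. The function names, the handler file name, and the runtime choice are assumptions, and the client argument is assumed to behave like the AWS SDK for Python's Lambda client (boto3.client("lambda")).

```python
import io
import json
import zipfile


def package_code(source, filename="handler.py"):
    """Zip a single source file in memory for an inline Lambda upload."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, source)
    return buffer.getvalue()


def run_ephemeral_function(client, name, role_arn, source, payload, keep=False):
    """Create a Lambda function, invoke it once, and delete it unless keep=True.

    Assumption: `client` behaves like boto3.client("lambda").
    """
    client.create_function(
        FunctionName=name,
        Runtime="python3.12",          # assumed runtime for this sketch
        Role=role_arn,                 # execution role supplied by the caller
        Handler="handler.handler",     # handler.py, function named `handler`
        Code={"ZipFile": package_code(source)},
    )
    try:
        # A new function starts in a Pending state; wait until it is Active.
        client.get_waiter("function_active_v2").wait(FunctionName=name)
        response = client.invoke(
            FunctionName=name,
            Payload=json.dumps(payload).encode("utf-8"),
        )
        return json.loads(response["Payload"].read())
    finally:
        # The "optionally delete" part of the lifecycle.
        if not keep:
            client.delete_function(FunctionName=name)
```

With real credentials, a call might look like run_ephemeral_function(boto3.client("lambda"), "diag-check", role_arn, source, {}), where the function name and role are placeholders. The try/finally block removes the temporary function even when the invocation fails.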

Sound complex? Don’t worry, we thought of everything :).

New Auto-Diagnostic Job Templates for AWS Users

We also released new pre-built templates for AWS, so you can start enhancing incident details for your specific environments immediately. These are purpose-built to be used with minimal configuration. Instead of starting from scratch, users now have a library of curated, ready-to-use job definitions that retrieve data for investigating, debugging, and triaging incidents during a response.

New users can start automating diagnostics for AWS faster, and existing users can easily add AWS diagnostics to their existing PagerDuty Process Automation project.

Some example job templates include:

AWS – EC2	Instance Status & Associated IAM Roles	Retrieve EC2 Instance Status and Associated IAM Roles	Remote command (or SSM)
AWS – ECS	Stopped ECS Task Errors	Checks stopped ECS Tasks for errors and provides detailed information on the reason for the errors.	Stopped ECS Tasks
AWS – ELB	Retrieve ELB Targets Health Status	Retrieve the list of unhealthy Targets in the Load Balancer’s associated Target Groups.	ELB Instance Statuses
AWS – RDS	Check Database Storage Status	Checks RDS database for the instance status.	RDS Instance Status
AWS – VPC	IP addresses using UDP transfer protocol	Query CloudWatch logs to identify any hosts using the UDP transfer protocol.	CloudWatch Logs
AWS – VPC	Top 10 Hosts by Throughput on Subnet	Query CloudWatch logs to identify the top 10 hosts by throughput on a given subnet.	CloudWatch Logs
AWS – VPC	Top 10 Source IP Addresses with Highest Rejected Requests	Query CloudWatch logs to identify the top 10 source IP addresses with the most rejected requests.	CloudWatch Logs
AWS – VPC	Top 10 Web-Server Requestors by Public IP	Query CloudWatch logs to identify the top 10 public-IP requestors to a web server (e.g. Nginx).	CloudWatch Logs
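
As a rough illustration of what a template such as "Top 10 Source IP Addresses with Highest Rejected Requests" computes, the Python sketch below tallies rejected traffic by source address. It does not reproduce the template's actual job definition. It assumes the input lines are VPC flow log records in the default (version 2) field order, and the function and variable names are hypothetical.

```python
from collections import Counter

# Default (version 2) VPC flow log field order.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one flow log line into a dict; return None if it is malformed."""
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


def top_rejected_sources(lines, limit=10):
    """Return (source address, count) pairs for the most-rejected sources."""
    counts = Counter()
    for line in lines:
        record = parse_record(line)
        if record and record["action"] == "REJECT":
            counts[record["srcaddr"]] += 1
    return counts.most_common(limit)


# Made-up sample records using documentation-reserved IP ranges.
sample = [
    "2 123456789012 eni-0a1b2c3d 203.0.113.5 10.0.0.12 44321 22 6 1 40 1660000000 1660000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.5 10.0.0.12 44322 22 6 1 40 1660000000 1660000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 198.51.100.7 10.0.0.12 51000 3389 6 1 40 1660000000 1660000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.12 10.0.0.20 443 50000 6 10 8400 1660000000 1660000060 ACCEPT OK",
]
print(top_rejected_sources(sample))
```

The other VPC templates follow the same shape: filter the records on one field (for example, protocol 17 for UDP), then aggregate by another.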

And this is just the tip of the iceberg! We will continue to develop and build upon our existing plugins to ensure that our customers who use AWS are well equipped to invoke automation wherever it is needed, including providing some interactive guides.

Want to learn more about common diagnostics? Register for our webinar event, “Common Diagnostics for Common Components,” on September 14th. Request a demo to see automated diagnostics with PagerDuty Process Automation in action.

Already using PagerDuty Process Automation? Check out the Automated Diagnostics solution guide to see the end-to-end process of achieving the full solution.

The post New! Common Automated Diagnostics for AWS Users appeared first on PagerDuty.