automated diagnostics | Tags | PagerDuty

Debug State Capture for Traditional Infrastructure & Apps by Justyn Roberts

Justyn Roberts — Thu, 25 May 2023 13:00:59 +0000

In our previous blogs on Capturing Application State and using Ephemeral Containers for Debugging Kubernetes, we discussed the value of being able to deploy specific tools to gather diagnostics for later analysis, while also providing the responder to the incident the means to resolve infrastructure or application issues.

This strikes a balance between restoring a service as quickly as possible and ensuring enough debug data is available for a later permanent resolution, all while allowing a development team to keep a container lean and performant.

By capturing both application and environment state when the incident occurs, any responder or service owner spends less time context switching between tools, credentials, and environments—enabling more accurate and faster responses and problem resolution.

The techniques discussed in the prior blogs in this series focused on modern, cloud-native platforms like Kubernetes, and the unique approaches needed for containers, especially containers that do not natively ship with debugging tools.

Not everyone is able or willing to move every application to cloud native, and many of us still work within a hybrid scenario of both containerized and traditional applications.

Even without the ephemeral nature of containers and the strict policies of container images, there is still a need to capture in-the-moment evidence to help with root-cause analysis in order to avoid future occurrences of incidents.

Let’s look at use cases describing the ability to capture state automatically in the event of a failure or decreased performance, and pick some interesting scenarios to dive into for a deeper look.

This is a non-exhaustive list, but here are some examples of how debug state capture is used in traditional application environments:

Infrastructure & Network

Top resource-consuming processes on one or more infrastructure components
TCP dump; thread/memory/core dump

Database

Top resource consuming queries
Current query state
Execution of application specific queries

Application specific

Java – Run thread/heap dump with tools like jstack
Windows – Proc Dump
Python – Running thread dump
All – Application specific log files

Additional Log Files

Debug state capture can grab whole or partial logs from any file that may not be captured by a log aggregator.
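To make one of the application-specific captures above concrete, here is a minimal sketch of a Python thread dump taken from inside the process itself. It uses only the standard library; the function name `dump_threads` is made up for illustration and is not part of any PagerDuty template.

```python
import sys
import threading
import traceback


def dump_threads() -> str:
    """Return a text snapshot of the call stack of every live thread."""
    # Map thread ids to readable names where the threading module knows them.
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "unknown")
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"Thread {name} (id={thread_id}):\n{stack}")
    return "\n".join(sections)
```

Calling `print(dump_threads())` from a signal handler or a debug endpoint would give a responder the in-the-moment state. An external tool is still needed when the process cannot be modified.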

PagerDuty Process Automation provides many pre-built template workflows for capturing application and environment state as part of the automated diagnostics project. These workflows are flexible and extendable so that they can be customized to work for your particular use-cases.

Taking a Deeper Dive

Let us take a closer look at some specific examples of capturing environment state that could prove useful at identifying the long-term fix for an incident.

Use case 1 – Gather database debug

We can use the SQL RUN Step in Process Automation to add either an inline statement or execute an existing script. As my database is MariaDB (a fork of MySQL), I can run the following MySQL statement:

SHOW FULL PROCESSLIST;

(Note: credentials are derived from my existing external store and passed securely as I execute the step as part of a workflow, so I can safely delegate without exposing info)

I pass the output to my incident platform (in my case, PagerDuty, of course), and set the job to run automatically if an incident occurs within the database service.

This info is now automatically available to my responder in their app or ChatOps tool, and within any postmortem. In this case, I can see someone is running a benchmark test at the point of incident! As with the previous blog posts, it would also be easy to post more complex versions of this to a storage environment like an AWS S3 bucket for later analysis.
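Once the process list has been captured, a small filter can put the likely offenders at the top. The sketch below is illustrative only: it assumes the rows have already been fetched as dictionaries keyed by the standard `SHOW FULL PROCESSLIST` column names (`Id`, `User`, `Command`, `Time`, `Info`), and the function names and the 30-second threshold are hypothetical.

```python
from typing import Iterable, List, Mapping


def _seconds(row: Mapping) -> int:
    """Time column as an int; treats a missing or NULL value as 0."""
    return int(row.get("Time") or 0)


def long_running(rows: Iterable[Mapping], min_seconds: int = 30) -> List[Mapping]:
    """Keep non-idle statements running at least min_seconds, slowest first."""
    active = [
        r for r in rows
        if r.get("Command") != "Sleep" and _seconds(r) >= min_seconds
    ]
    return sorted(active, key=_seconds, reverse=True)


def summarize(rows: Iterable[Mapping]) -> str:
    """One tab-separated line per statement, suitable for an incident note."""
    return "\n".join(
        f'{r["Id"]}\t{r["User"]}\t{_seconds(r)}s\t{r.get("Info") or ""}'
        for r in rows
    )
```

Idle connections (`Command` equal to `Sleep`) are dropped so that a long-lived but harmless pooled connection does not bury the statement that matters.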

Use case 2 – Gather application debug

My observability tool is very quick to let me know WHEN an application has failed, but it does not always tell me WHY it failed. This second use case runs an ad hoc command against my Python application using py-spy, a sampling profiler, in conjunction with one of our automation plugins to move files securely to S3 for later retrieval.

The output goes directly to my S3 storage.

This example highlights worker states for my Python app at a thread level, puts them straight into the hands of my developer, and stores them for as long as they might need to reference them.
Of course, these commands are not exclusive, and I could easily chain multiple checks to provide a broader view.
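For readers who want to try something similar outside Process Automation, the following is a rough sketch rather than the workflow used above. It runs `py-spy dump --pid <pid>` and uploads the text to S3 with boto3 in place of the automation plugin. The function names, the key layout, and the bucket are all assumptions made for illustration.

```python
import subprocess
from datetime import datetime, timezone


def py_spy_dump_command(pid: int) -> list[str]:
    """Command line that asks py-spy for the current stack of every thread."""
    return ["py-spy", "dump", "--pid", str(pid)]


def s3_key(service: str, pid: int, now: datetime) -> str:
    """Hypothetical key layout: one object per capture, sortable by time."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"debug-state/{service}/{stamp}-pid{pid}-threads.txt"


def capture_and_upload(pid: int, service: str, bucket: str) -> str:
    """Run py-spy against a live process and store the dump in S3."""
    result = subprocess.run(
        py_spy_dump_command(pid), capture_output=True, text=True, check=True
    )
    import boto3  # third-party; imported lazily so the helpers above work without it

    key = s3_key(service, pid, datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=result.stdout.encode("utf-8")
    )
    return key
```

Credentials are deliberately absent from the sketch; in the workflow described above they come from an external store, and boto3 would likewise pick them up from its normal credential chain.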

Use case 3 – Traditional Infrastructure debug state capture

For the third use case, I need to deploy a set of bash commands to a remote machine and run them when the trigger event occurs. This primarily surfaces diagnostics such as open files and network connections, but it also runs bpftrace, a tool that can be used for tracing specific calls.

Process Automation allows me to define and deploy a whole script and store its output as a snapshot of my environment state.
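The same idea can be sketched in a few lines of Python: run a named set of diagnostic commands, keep whatever each one prints, and never let one failing tool abort the capture. The command set below (`lsof -n`, `ss -tan`) is an assumed example, not the script used in this post, and bpftrace is left out because a trace needs its own stop condition.

```python
import subprocess
from typing import Dict, List

# Assumed example commands; swap in whatever the host actually provides.
DEFAULT_COMMANDS: Dict[str, List[str]] = {
    "open_files": ["lsof", "-n"],
    "network_connections": ["ss", "-tan"],
}


def capture_snapshot(commands: Dict[str, List[str]], timeout: int = 30) -> Dict[str, str]:
    """Run each command and return its output (or a short failure note) by name."""
    snapshot: Dict[str, str] = {}
    for name, argv in commands.items():
        try:
            done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except FileNotFoundError:
            snapshot[name] = f"command not found: {argv[0]}"
        except subprocess.TimeoutExpired:
            snapshot[name] = f"timed out after {timeout}s"
        else:
            if done.returncode == 0:
                snapshot[name] = done.stdout
            else:
                snapshot[name] = f"exit {done.returncode}: {done.stderr}"
    return snapshot
```

Recording a failure note instead of raising keeps the snapshot useful on hosts where one of the tools is missing.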

Conclusion

Signals from monitoring tools, even in traditional environments, benefit from broader visibility to allow any responder, DevOps engineer or SRE to make quick and safe decisions. Developers also often need additional information and the ability to capture state when problems arise, as they might not be on hand immediately.

Debug State Capture enables this: it provides additional context for a responder, reduces time spent digging around in different tools, and adds the capability to collect deeper datasets for subsequent analysis.

Curious to learn more? Get started today with a trial of Runbook Automation.

The post Debug State Capture for Traditional Infrastructure & Apps appeared first on PagerDuty.

Quick! Grab all the evidence: Capturing application state for post-incident forensics. by Jake Cohen

Jake Cohen — Thu, 09 Feb 2023 14:00:42 +0000

This is the first in a multi-part blog series. In the next blog, we’ll provide a template job that leverages Kubernetes Ephemeral Containers to capture evidence from applications running in Kubernetes.

Everyone loves a good mystery thriller. Ok, not everyone – but Hollywood certainly does. Whether it’s Sherlock Holmes or Hercule Poirot, audiences clearly enjoy a page-turning plot of hunting down the culprit for some heinous crime. Many individuals, myself included, prefer the mystery-thriller where, at the end of the story, the mystery is “solved.” While cliff-hangers can evoke your imagination to think through the probable outcomes (remember the end of Chris Nolan’s Inception?), when it comes to detective stories, I admittedly prefer to know the answer of “who dunnit.”

Source: https://www.dazeddigital.com/artsandculture/article/24949/1/christopher-nolan-explains-the-spinning-top-in-inception

For a detective story to have full closure, the evidence of the crime is ideally crystal clear and provided in great detail such that the true culprit will be uncovered and hopefully brought to justice.

In the world of technical operations and maintaining uptime for critical services, a clear conflict arises that is akin to the Hollywood detective story. When critical applications suffer performance degradation—or worse yet, a full outage—engineers rush to find the (apparent) cause of the incident, such that they can remediate the issue as fast as possible. Teams will use the tools at their disposal to track down and isolate the overloaded compute resource, the hung query, or the maxed-out queue and quickly take action to rectify the issue.

As it turns out, however, this is only the beginning of the detective thriller. This is the type of detective mystery where the witnesses perceive the culprit to be the person who happened to be in the wrong place at the wrong time. But as more evidence is uncovered, it eventually becomes clear that the “true culprit” was an evil mastermind puppeteering a grander scheme from afar. Unfortunately for our “detective” engineers, when the (apparent) root cause is remediated, there is a high likelihood that they have [un]knowingly eliminated evidence that points to the underlying culprit: restarting a service or redeploying a pod eliminates valuable forensics evidence.

You can imagine the chief inspector pulling their hair out when they realize that poorly timed rain washed away the fingerprints that would have given them their strongest lead.

Source: https://www.defendyourcase.com/criminal-defense-blog/2020/february/are-fingerprints-at-the-crime-scene-enough-evide/

These days, developers and operations engineers struggle with this same tug-of-war between restoring services as quickly as possible while not losing critical evidence that would help them identify the code-level root cause of their incidents.

But wait, isn’t that what monitoring tools are for? The answer is: sometimes. Depending on the issue, configuration or code-level errors can be tracked down using sophisticated observability tools. However, developers often need even more granular data that isn’t captured by monitoring tools, simply because this debug-level data isn’t needed for alerting or service restoration. Data such as heap, thread, and TCP dumps, top resource-consuming database queries, and stack traces is used to identify the “true culprit,” but most often is not needed to restore service. Gathering this data takes time, and we all know that, during an incident, restoring service availability takes precedence over everything else.

Unfortunately, the adoption and proliferation of containerized applications and container orchestration has only heightened this tug-of-war struggle for two primary reasons:

Microservice architectures provide faster methods for safely restoring availability, such as redeploying a pod.
Fewer debugging utilities are available in these environments since developers and operations engineers want to minimize the surface area of their container images.

To accommodate the opposing forces in this tug-of-war, a solution is needed that can act at “instant speed” so that evidence is captured and persisted, and service is restored immediately afterward.

Such a solution is provided with PagerDuty’s Operations Cloud. By harnessing runbooks that are instantly triggered when an issue is detected, debug-level evidence can be captured and sent to a persistent storage service – such as S3 – and services can be restored using known fixes. With a large library of prebuilt integrations for both on-premise and cloud environments, and a growing list of template runbooks, PagerDuty users can achieve this seemingly audacious goal to cut down on both MTTR and time spent replicating bugs to resolve tech debt tickets. Existing PagerDuty customers can request a trial of Runbook Automation here, while new users can get started with PagerDuty Incident Response here.
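The ordering that matters here, capture first, persist second, restore last, can be shown with a small sketch. This is not how PagerDuty runbooks are implemented; `capture_then_restore` and its three callables are hypothetical stand-ins for runbook steps.

```python
from typing import Callable, Dict


def capture_then_restore(
    capture_steps: Dict[str, Callable[[], str]],
    persist: Callable[[Dict[str, str]], None],
    restore: Callable[[], None],
) -> Dict[str, str]:
    """Collect evidence, store it, then run the known fix."""
    evidence: Dict[str, str] = {}
    for name, step in capture_steps.items():
        try:
            evidence[name] = step()
        except Exception as exc:  # a missing tool must not block the restore
            evidence[name] = f"capture failed: {exc!r}"
    try:
        persist(evidence)
    finally:
        # Service restoration always runs, even if persistence fails.
        restore()
    return evidence
```

The `finally` clause encodes the priority described above: evidence is valuable, but restoring availability still takes precedence when something in the capture path goes wrong.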

Additionally, be sure to check out our Automated Diagnostics Solution Guide to see some of these example runbooks.

Stay inquisitive, my fellow detectives.

PagerDuty at re:Invent 2022 Launches Automated Diagnostics for AWS that Enables Organizations to Resolve Incidents Faster So They Can Innovate More by Inga Weizman

Inga Weizman — Mon, 28 Nov 2022 14:00:03 +0000

It’s that time of the year! PagerDuty is coming back to Sin City for AWS re:Invent 2022! The global conference brings together organizations of all sizes and is set to explore themes of modernization, automation, and resiliency in the cloud. With current economic conditions, enterprises are looking to scale operations and optimize costs while delivering always-on, digital experiences to their customers. Automation plays a key role in supporting operational and cost efficiency. This year, we are excited to bring along a new solution to the re:Invent floor: Automated Diagnostics for AWS, which helps engineering teams have more time for innovation and fewer interruptions. We are also proud to be a Platinum Sponsor at re:Invent, deepening our long-term relationship with AWS and delivering automated CloudOps to our joint customers.

Cloud is eating the world

According to Gartner, “by 2023, 40% of all enterprise workloads will be deployed in cloud infrastructure and platform services, which is an increase from 20% in 2020.” This quote further drives home the reality that cloud adoption continues to be a top priority for enterprises looking to further digitize their services and backend infrastructure. AWS gives you unprecedented scale, agility, and speed of innovation, but teams face increasing complexity and ever-growing dependencies across systems, processes, and their organizations. This complex situation threatens to put the customer and employee experience — not to mention revenue — at risk.

As organizations migrate to the cloud and deploy cloud-native architectures, the increased complexity can cause more (expensive) incidents. Many organizations run in complex cloud architectures containing several interconnected services — many existing ephemerally — that are deployed across different availability zones and accounts. When incidents happen, it can take a long time to resolve them without understanding the root cause or who has the proper access privileges and subject matter expertise. This means lots of escalations and developers being pulled away from high-value work.

Incidents can get expensive, really expensive. A major retailer can lose upwards of $200K in revenue for every minute the site is down. Incidents also incur productivity costs, as engineers are working on fixing the problem instead of building new features and focusing on innovation. A poor customer experience because of or during an incident can further cost an organization in the form of damaged brand reputation. And when you add up all of those factors, the cost of an incident is much higher than you may have accounted for.

Resiliency matters

Resiliency is essential to ensuring that your customers enjoy their digital experiences with little to no interruption. The uncomfortable reality? Things will inevitably break and services will go down. It happens to all of us. What really matters is how fast you can recover and get your services back in the green, and whether you can ensure similar incidents don’t occur again in the future. Ensuring you have full visibility across your hybrid infrastructure and making sure you can detect and diagnose issues quickly is essential to the continuity of your business and all your services.

Resilience doesn’t just happen; it’s a shared responsibility. Customers have to set up their infrastructure, operations, and people in a way that helps them endure and quickly respond to incidents. Defining clear ownership and accountability by having teams build and own their services is an essential part of ensuring that you can have focused, real-time incident response.

PagerDuty empowers teams with end-to-end incident response and advanced automation capabilities that quickly and accurately orchestrate the right response, every time. Process Automation helps teams to quickly diagnose and resolve incidents by significantly reducing the number of escalations and MTTR so engineering teams can focus on continuous improvement and innovation.

Too many humans, too little time

Modern cloud architectures for AWS customers are composites of some 250 AWS services and 25,000 SaaS workflows available in the market, combined with in-house developed software and other legacy systems.

When incidents occur in these complex cloud environments, access to full cloud stack expertise is often needed to determine probable root cause, rule out other possibilities among dependencies, and check for false positives. This may require a first responder to escalate to several expert engineers to gather these diagnostics to determine who the ultimate resolver should be.

First line responders often lack know-how and access to gather diagnostic content in AWS environments. Many first line responders are generalists, and lack technical knowledge of what investigations are needed to diagnose specific issues in services. First responders also lack superuser access to be able to execute technical investigations due to security policies.

This means first responders typically must escalate to multiple experts to get the data they need to triage an incident, consuming more staff time to resolve the incident and interrupting more team members. For serious outages this needlessly extends the length of time it takes to resolve an incident, takes engineers away from high-value work, and increases the overall cost of an outage. Automation can play a key role in not only resolving incidents faster, but in arming first responders with the diagnostic data they need to resolve incidents on their own, thus safeguarding valuable engineering time.

Automated Diagnostics for AWS

With Automated Diagnostics for AWS, incident responders can quickly triage incidents themselves, reducing the need to escalate for help, speeding up resolution for customers, and increasing operational efficiency. Automated Diagnostics for AWS in PagerDuty provides frequently used, pre-built diagnostics job templates for commonly used services, including Amazon EC2, AWS Lambda, Amazon ECS, Amazon RDS, and more. Customers can easily configure these template jobs to work in their specific environments and extend the diagnostics steps in a workflow. Automated Diagnostics for AWS also allows customers to quickly design their own diagnostic jobs for AWS, and corrective automation for mitigation and remediation that can be invoked by responders within PagerDuty Incident Response, or triggered by PagerDuty Event Intelligence.

Customer Service teams and stakeholders are coordinated with real-time status information to deliver a better customer support experience. Automation helps internal teams operate more efficiently by shaving 25 minutes off MTTR, reducing the number of people required to resolve an incident and decreasing the number of escalations by 40%, saving time and money while improving the customer experience.

Automated Diagnostics for AWS:

Gives first responders the power to triage, mitigate, and resolve incidents, improving MTTR across the board.
Reduces escalations to engineers by using pre-built job templates and plugin integrations to critical AWS tools and services.
Enables teams to continuously improve the efficiency of incident response within their AWS environments, giving time back to engineers.

Learn more about Automated Diagnostics for AWS or get started here.

Meet PagerDuty at AWS re:Invent

There will be plenty of opportunities to meet our team at re:Invent, pick up some swag, say hello to Pagey and attend lightning talks at our booth.

Stop by our booth #3819 to get a demo of our product offerings, including Process Automation, Incident Response, Event Intelligence, and Customer Service, and see how the PagerDuty Operations Cloud can help you uplevel your digital operations. We will also have plenty of lightning talks from our partners – learn more here.

If you would like to avoid the crowds and schedule a meeting or demo in one of our meeting rooms, just submit your request here. Our team can tailor the conversation to your specific needs, and share more about PagerDuty and AWS.

Monday November 28, 2:30 pm at Venetian Theatre, Level 2, Session #PRT217

Join us for our panel of industry leaders from Salesforce, Netflix, SailPoint and Benefitfocus as they discuss how they have transformed their organizations with PagerDuty and AWS.

Wednesday, November 30, 6:00-8:00 pm

Join our leaders and your industry peers at Matteo’s Ristorante Italiano for evening cocktails and conversation. Learn more about how PagerDuty helps some of the most innovative companies in the world deliver a superior customer experience.

To learn more about PagerDuty and AWS click here or watch this webinar.

Automated Diagnostics by Joseph Mandros

Joseph Mandros — Tue, 27 Sep 2022 22:34:21 +0000

Accelerating Incident Resolution with PagerDuty through Automated Diagnostics for AWS by Nisha Prajapati

Nisha Prajapati — Fri, 26 Aug 2022 18:09:33 +0000

Improve Efficiency of Incident Response with Automated Diagnostics for AWS in PagerDuty by Nisha Prajapati

Nisha Prajapati — Thu, 25 Aug 2022 21:11:39 +0000

New! Common Automated Diagnostics for AWS Users by Jake Cohen

Jake Cohen — Wed, 03 Aug 2022 13:00:02 +0000

Today’s modern cloud architectures centered on AWS are typically a composite of ~250 AWS services and workflows implemented by over 25,000 SaaS services, in-house-developed services, and legacy systems. When incidents fire off in these environments, whether or not a company has built out a centralized cloud platform, distinct expertise is often a necessity. Because of this scaled complexity, first responders find themselves having to escalate to several different service owners or expert engineers to gather diagnostics before it’s possible to determine who the ultimate resolver of an issue should be.

When it comes to incident response, it’s critical that these new cloud environments seamlessly integrate with an organization’s existing critical applications and services, both old and new. To enhance service quality and make it easier for responders to cross that bridge of expertise, we are excited to announce the immediate availability of new AWS plug-in integrations for automated diagnostics.

New AWS Plugins for Automated Diagnostics

Our new AWS plugins for Automated Diagnostics help provide deeper coverage for customers that are also users of AWS, making it easier to get up and running with automated diagnostics in their AWS environment quickly.

The new AWS plugins for Automated Diagnostics include:

CloudWatch Logs plugin. This plugin retrieves diagnostic data from AWS infrastructure and applications. Now users can more easily run automated diagnostics for AWS across multiple accounts and products.
Systems Manager plugin. This plugin lets users automate tasks such as configuration management, patching, and deploying monitoring and security tooling agents, so that they run faster and more accurately.
ECS Remote Command plugin. This plugin provides a mechanism to execute commands on containers. This enables developers and operators to retrieve diagnostic data from their running applications in real-time before redeploying their services.
Lambda Custom Code Workflow plugin. Create, execute, and optionally delete a new Lambda function with the custom code provided in a Job step as its input. Execute custom scripts as steps in jobs without having to install any software.

Sound complex? Don’t worry, we thought of everything :).

New Auto-Diagnostic Job Templates for AWS Users

We also released new pre-built templates for AWS, so you can start enhancing incident details for your specific environments immediately. These are purpose-built to be used with minimal configuration. Instead of starting from scratch, users now have a library of curated, ready-to-use job definitions that retrieve data for investigating, debugging, and triaging incidents during a response.

New users can start automating diagnostics for AWS faster and existing users can easily add AWS diagnostics to their existing PagerDuty Process Automation project.

Some example job templates include:

AWS – EC2	Instance Status & Associated IAM Roles	Retrieve EC2 Instance Status and Associated IAM Roles	Remote command (or SSM)
AWS – ECS	Stopped ECS Task Errors	Checks stopped ECS Tasks for errors and provides detailed information on the reason for the errors.	Stopped ECS Tasks
AWS – ELB	Retrieve ELB Targets Health Status	Retrieve the list of unhealthy Targets in the Load Balancer’s associated Target Groups.	ELB Instance Statuses
AWS – RDS	Check Database Storage Status	Checks RDS database for the instance status.	RDS Instance Status
AWS – VPC	IP addresses using UDP transfer protocol	Query CloudWatch logs to identify any hosts using the UDP transfer protocol.	CloudWatch Logs
AWS – VPC	Top 10 Hosts by Throughput on Subnet	Query CloudWatch logs to identify the top 10 hosts by throughput on a given subnet.	CloudWatch Logs
AWS – VPC	Top 10 Source IP Addresses with Highest Rejected Requests	Query CloudWatch logs to identify the top 10 source-IP addresses with the highest rejected-requests.	CloudWatch Logs
AWS – VPC	Top 10 Web-Server Requestors by Public IP	Query CloudWatch logs to identify the top 10 public-IP requestors to our web-server (e.g. Nginx).	CloudWatch Logs
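To give a feel for what a template such as "Top 10 Source IP Addresses with Highest Rejected Requests" computes, here is a minimal local sketch. The templates themselves query CloudWatch Logs; this example instead counts over raw VPC Flow Log lines in the default (version 2) record format, and the function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

# Field order of the default VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line: str) -> Optional[Dict[str, str]]:
    """Split one space-separated record; return None if the shape is wrong."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def top_rejected_sources(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Source addresses with the most REJECT records, highest count first."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_flow_record(line)
        if record and record["action"] == "REJECT":
            counts[record["srcaddr"]] += 1
    return counts.most_common(n)
```

Filtering on `protocol == "17"` instead of the action field would give the UDP variant listed in the table.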

And this is just the tip of the iceberg! We will continue to develop and build upon our existing plugins to ensure our customers that use AWS are well-equipped to invoke automation wherever it is needed, including providing some interactive guides.

Want to learn more about common diagnostics? Register for our webinar event, “Common Diagnostics for Common Components,” on September 14th. Request a demo to see automated diagnostics with PagerDuty Process Automation in action.

Already using PagerDuty Process Automation? Check out the Automated Diagnostics solution guide to see the end-to-end process of achieving the full solution.

Automating Common Diagnostics for Kubernetes, Linux, and other Common Components by Joseph Mandros

Joseph Mandros — Wed, 27 Jul 2022 13:00:45 +0000

Watch our Automated Diagnostics webinar on demand to learn about common diagnostics for common components and how we provide out-of-the-box job templates for you to get started right away.

This is the second piece in a series about automated diagnostics, a common use case for the PagerDuty Process Automation portfolio.

In the last piece, we talked about the basics around automated diagnostics and how teams can use the solution to reduce escalations to specialists and empower responders to take action faster. In this blog, we’re going to talk about some basic diagnostics examples for components that are most relevant to our users.

But before we jump in, let’s make clear what automated diagnostics isn’t, based on some audience feedback on Twitter from the last article:

Automated diagnostics is different from alert correlation. Alert correlation depends on a specified depth of signals, as well as an engine that can properly identify said correlated signals. Automated diagnostics is meant to help the first responder triangulate the source of the issue to either fix the issue faster themselves, or escalate more accurately.
Automated diagnostics is different from monitoring. Monitoring is purpose-built to identify undesired states in performance or activity. This means that most monitoring is not purpose-built to emulate a first-responder’s activities to validate a true positive, or identify the first actions to take. Monitoring is focused on raising the alert. Automated diagnostics is focused on determining how to fix an issue once the alert is already created.

That said, automated diagnostics can certainly make use of data collected by monitoring tools; most people don’t apply thresholds to every datapoint they collect. In fact, one of our more commonly used diagnostics integrations is to query CloudWatch Logs. While we might consider a log aggregator a monitoring tool, sometimes the first steps of investigation are to look at the data in the monitoring tool that exists purely for diagnosing issues.

Providing responders with on-demand or pre-run diagnostic capabilities for their own environments can help a first responder quickly determine probable cause, thereby pulling in fewer individuals to assist with the incident. By providing first-responders with “diagnostic” data that is typically only retrievable by domain experts, the need to pull in more people for troubleshooting incidents is reduced significantly. This in turn drives down the cost of incidents and reduces mean time to response (MTTR) by automating the investigative steps that are typically manual in nature.

The status quo: Automation in incident response

Operations managers are often excited about the idea of enabling self-healing or auto-remediation. It’s a natural inclination to assume that speeding up resolution through automation means “applying a cure.” But more often than not, the industry theory of “no two incidents are truly identical” rears its head. When you have a high degree of variability, this reduces the value of such potential automation since it’s less likely to be run. For example, restarting a core service may be the right way to fix today’s issue, but it could lead to a cascading failure—and an even bigger incident—tomorrow.

*The reader now switches cognitive gears to the initial stages of a response.*

But you know what tends to be highly repetitive? The same investigative steps a responder takes to begin to diagnose what went wrong and determine what happened. More repetitive action means more value to gain from applying automation. For example, let’s say an incident kicks off within your Kubernetes distribution. No matter the nature of the incident, whether it be something within your image repository or your load balancer, you’re likely still going to take the same diagnostic step of pulling your Kubernetes logs.

These diagnostic steps often remain static, for the most part, depending on the component you’re working with, no matter the priority of the incident that occurs. Automated diagnostics can be applied to heterogeneous incidents. It doesn’t have to be purpose-built for the same recurring incident; it can be applied to and customized around all sorts of common incident types and severities, specific to your environment, for almost any common component. Think of it like going to the doctor’s office. Whether you are going to urgent care for a specific complaint or just an annual checkup, they still take your temperature, blood pressure, and weight when you walk in.

Common Examples

Every developer environment is different, but many environments are also quite similar once you really pop the hood. In the beginning stages of a response, most diagnostics will come from three main data sources:

Application data
System data
Environment data

There are several examples of common diagnostics and components that can be automatically pulled during the beginning of a response. This not only helps the responder better understand the severity of the incident, but also helps ensure the responder doesn’t pull in too many specialists and interrupt their normal day of work. For example, let’s look at Kubernetes (k8s) as a component for a responder during an incident. When an incident happens within a k8s environment, the infrastructure engineer who maintains the technology would typically perform actions such as:

Tail logs from k8s pod
Retrieve logs from k8s by selector label
Check image repo
Describe deployment
Execute command in pod

One thing all of these actions have in common? A typical L1 responder ACK’ing an incident doesn’t know how to orchestrate these actions; it’s just not their area of expertise. But with the out-of-the-box jobs from PagerDuty’s Automated Diagnostics, the L1 responder can automatically run these diagnostics and execute these jobs, which speeds up the response and reduces escalations to the infrastructure engineer responsible for the k8s environment.
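For reference, most of the actions in that list map onto ordinary `kubectl` invocations. The sketch below only builds the command lines and is not taken from the job templates: the function name, the `ps aux` placeholder for the exec step, and the default tail length are assumptions. "Check image repo" is omitted because it depends on the registry in use.

```python
from typing import Dict, List


def k8s_diagnostic_commands(
    namespace: str, pod: str, deployment: str, selector: str, tail: int = 200
) -> Dict[str, List[str]]:
    """Build kubectl command lines for the common first-look diagnostics."""
    base = ["kubectl", "--namespace", namespace]
    return {
        "tail_pod_logs": base + ["logs", pod, f"--tail={tail}"],
        "logs_by_selector": base + ["logs", "--selector", selector, f"--tail={tail}"],
        "describe_deployment": base + ["describe", "deployment", deployment],
        # Placeholder command; minimal images may not ship ps at all.
        "exec_in_pod": base + ["exec", pod, "--", "ps", "aux"],
    }
```

Each value can be passed directly to `subprocess.run`. Wrapping the commands in a job is what lets an L1 responder run them without holding cluster credentials personally.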

Some common diagnostics and alert examples include:

CPU/Memory Consuming Services
- Common alert: High CPU/Memory
- Common question: Which service(s) are consuming CPU/Memory?
File size / Disk Consumption
- Common alert: High disk usage
- Common question: Which files/directories are consuming the most space?
System Logs: Linux/Windows Commands
- Common alert: Server/service issues
- Common question: Is it an OS issue or app issue?
SQL Database Commands
- Common alert: Database blocks/deadlocks
- Common question: Is there a long-running query blocking other database requests?
Host Availability
- Common alert: Host down
- Common question: Is it actually down or is it a false-positive reachability issue?
Application Error: Application Logs or traces
- Common alert: 400/500 error codes
- Common question: What is the stack-trace?
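As one illustration of the list above, the disk consumption question ("Which files/directories are consuming the most space?") can be answered with a short script. The sketch below uses only the Python standard library; the function name and the choice to count only the files sitting directly in each directory are assumptions made for this example, not the behavior of any PagerDuty job.

```python
import os


def largest_directories(root, top_n=5):
    """Return the top_n (bytes, path) pairs under root, largest first.

    Each directory's size counts only the regular files directly inside
    it (not its subdirectories), so a deep tree shows where the bytes
    actually live. Symlinks are skipped to avoid double counting.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it
                continue
        sizes.append((total, dirpath))
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    for size, path in largest_directories("."):
        print(f"{size:>12}  {path}")
```

Attaching output like this to the incident gives the responder a starting point before they ever log in to the host.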

A few examples of some common diagnostics for known components:

CloudWatch Logs: Surface specific application and VPC logs.
ECS: View stopped ECS task errors.
ELB: Debug unavailable target-group instances.
Kubernetes: Retrieve logs from Pods by selector label.
Linux: Retrieve service status.
Nginx: Retrieve error logs.
Redis: Retrieve slow log entries.

And these are just some of the over 30 out-of-the-box job templates we have built for our users, which you can find in the Automated Diagnostics solution guide. To use the Automated Diagnostics Solution, you must have either a PagerDuty Runbook Automation license or a Process Automation (previously Rundeck Enterprise) license. See the FAQ for details on how to use it. If you do not have a license for either of these products, contact us to learn more.

Automating diagnostics within PagerDuty

Incidents that notify responders are filled with information provided by monitoring tools that have a “myopic” view of the alert(s). A common example: high CPU usage triggers an alert, which notifies a responder. But the information contained in the alert is surface-level, in that it does not specify what might have caused the CPU spike.

Diagnostic data is the deeper-level information that helps answer the “why” and “where” questions of incidents. Even though some monitoring and correlation tools offer users some root-cause analysis, most fall short in their ability to emulate a responder’s investigative and troubleshooting steps of collating disparate data sources into a unified view. By providing responders with on-demand or pre-run diagnostic capabilities, the odds of the first responder resolving the issue on their own increase, and fewer individuals are likely to be pulled in to assist with the incident. Enter Automated Diagnostics.
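The idea of collating disparate data sources into a unified view can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Automated Diagnostics is implemented: `run_diagnostics`, the report layout, and the two example checks are all hypothetical names chosen for this example. Each check is a plain callable, and a failing check is recorded rather than allowed to stop the others.

```python
import shutil
import socket
from datetime import datetime, timezone


def run_diagnostics(checks):
    """Run each named check and collect results (or errors) in one report.

    checks maps a label to a zero-argument callable. A check that raises
    is recorded with its error so one broken probe cannot hide the rest.
    """
    report = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "results": {},
    }
    for name, check in checks.items():
        try:
            report["results"][name] = {"ok": True, "output": check()}
        except Exception as exc:
            report["results"][name] = {
                "ok": False,
                "error": f"{type(exc).__name__}: {exc}",
            }
    return report


def disk_free_percent(path="."):
    """Percentage of free space on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return round(100 * usage.free / usage.total, 1)


if __name__ == "__main__":
    report = run_diagnostics({
        "hostname": socket.gethostname,
        "disk_free_percent": disk_free_percent,
    })
    for name, result in report["results"].items():
        print(name, result)
```

A report like this could be attached to an incident as a note, so the responder sees system and environment data side by side with the original alert.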

Want to learn more about common diagnostics for the components you use? Register for our September 14th webinar event of the same name, hosted by Justyn Roberts, Senior Solutions Consultant, PagerDuty. New to Process Automation? Request a demo. Already using PagerDuty Process Automation? Check out the automated diagnostics solution guide to see the end-to-end process of achieving the full solution. Questions? Reach out to me directly on Twitter @sordnam and let’s chat!

The post Automating Common Diagnostics for Kubernetes, Linux, and other Common Components appeared first on PagerDuty.

Common Diagnostics for Common Components by Nisha Prajapati

Nisha Prajapati — Thu, 30 Jun 2022 15:33:09 +0000

The post Common Diagnostics for Common Components appeared first on PagerDuty.

Extending Automation Actions Across the PagerDuty Platform by Joseph Mandros

Joseph Mandros — Tue, 07 Jun 2022 13:00:03 +0000

It’s day one of PagerDuty Summit, and we are looking forward to a full day of expert presenters, actionable content, and educational sessions to boost your PagerDuty IQ and show you new ways to improve your team’s operational excellence.

One point you will continue to hear throughout the duration of the conference echoes our greater mission: To revolutionize operations so teams spend less time on reactive, break-fix work and more time delivering new innovation. At PagerDuty, we see this future of operations extending beyond the digital teams that build and run software to all teams in the organization. While many PagerDuty products and features exist to make this mission a reality, we are going to focus on the latest and greatest with PagerDuty Automation Actions®, part of the PagerDuty Process Automation® portfolio.

New Updates and Integrations With Automation Actions

Automation Actions connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams can reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

We launched Automation Actions last year to help organizations get started quickly with simple first steps toward automation. Now, Automation Actions is integrated across the entire PagerDuty platform so that all users can remove manual, time-consuming, repetitive tasks, such as diagnosing issues when brought into bridge calls.

Let’s look at some of the latest and greatest automation capabilities with Automation Actions:

Automation Actions in Incident Response. Teams can now run automated diagnostics and remediate incidents directly within PagerDuty. This integration will improve productivity and remove toil by automating repetitive, manual tasks, and give time back to your engineers to focus on innovation.

Automation Actions for Customer Service Ops. This integration gives customer service agents the ability to validate customer problems and capture critical information via automation to diagnose and resolve cases faster. Agents are now empowered to validate customer-impacting issues and run automated actions directly from the PagerDuty app in Service Cloud.

Automation Actions for Event Orchestration. By combining nested event rules with machine learning and precise, targeted automation triggers, it’s now possible to action an incident before responders even get paged. This integration with Event Orchestration helps teams automate common diagnostics and enable self-healing for recurring and well-understood types of incidents, resulting in reduced MTTR and escalations to specialists.

Automation Actions in PagerDuty’s Mobile App. Everything you love about Automation Actions is now mobile! Invoke the same automation from Automation Actions and resolve common incidents directly from the PagerDuty mobile app.

Automation Actions in Slack. With this integration, incident responders can deploy scriptable diagnostics and remediation actions directly from Slack.

Automation Everywhere

To be ready for anything with increasing digital complexity and dependencies, operations must transform from being manual, rigid, and ticket queue-based, to a continuously improving system that focuses on outcomes and customer experience, delivers operational speed and resilience, and is heavily automated by machine learning and AI. Only then can teams move toward a more proactive posture, and reduce the burden of manual work to avoid burnout and preserve focus. With Automation Actions, teams not only have the ability to reach this operational milestone, but to excel and continue to mature their automation capabilities.

Be sure to check out related Automation Actions sessions at PagerDuty Summit:

Normalize Automation, Sean Noble, Principal Product Manager, PagerDuty
Is it the Cloud? App? Database? Reduce Escalations by giving first responders automated diagnostics, Jake Cohen, Senior Product Manager, PagerDuty

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

The post Extending Automation Actions Across the PagerDuty Platform appeared first on PagerDuty.