With the decision to bring the development and management of its website in-house, this retail giant saw an opportunity to reinvent its technology ecosystem. The website would be the first to use a brand new API platform, and the company needed real-time visibility into the system to manage and diagnose issues. A Principal Developer shared, “The first application that was going to use new APIs on the new platform was the website. Building a successful online retail experience was critical for our strategy moving forward.”
While the engineering team was mature, more scalable processes for incident management were needed to support a changing environment. “We needed all sorts of different technologies to support this initiative on the new platform,” explained the Principal Developer. While there was logging in place, there was no easy way to alert the team about an issue. Somebody had to review and understand the logs, determine if the alert warranted a call, and figure out who to actually call. Over time, engineering’s rapid delivery made it challenging to track down the right person at the right time.
The retailer needed to address these challenges to ensure a highly available website. An outage could result in lost productivity, missed revenue, and negative brand impact. After a review, the team defined several technical requirements needed to improve incident response:
To achieve this, the company needed a platform that could enrich the available information by linking dependencies between systems and syncing information with ITSM and APM tools. This would show who was affected by an incident and which capabilities might be disrupted, and ensure critical work was sent to the right teams quickly.
PagerDuty was selected as a scalable, easy-to-use digital operations platform. PagerDuty integrated with the retailer’s existing services, providing end-to-end visibility across the ecosystem. This allowed the team to build an orchestrated process for critical work and supported a culture of product ownership.
A tight integration with ServiceNow immediately proved valuable for incident response: mapping priorities, syncing notes between the two platforms, and closing incidents on either platform. “It was really great to have a lot of this integration provided out of the box with very minimal work required,” shared the Principal Developer. A Jira integration handled alerts that didn’t need to go through the formal ITSM process, such as byproducts of other issues. The team used the workflows inside Jira to manage these alerts, syncing notes between the two platforms. This integration encouraged a more resilient application design, steered the team toward quality logging, and ensured quality tickets were created.
PagerDuty’s ML-powered event management, Event Intelligence, helped automate incident response. Change events provided situational awareness, surfacing critical information about recent deployments and releases in the code repository. This was especially useful for Terraform projects, showing when and where a merge happened and who performed it.
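As an illustration of the change events described above, here is a minimal sketch of the kind of payload a deployment pipeline might assemble after a merge. The field names (`routing_key`, `payload`, `summary`, `timestamp`, `source`, `custom_details`, `links`) reflect our understanding of the PagerDuty Events API v2 change event format and should be checked against the official documentation. The function name, repository name, and URL are hypothetical, and nothing here is taken from the retailer's actual setup.

```python
from datetime import datetime, timezone


def build_change_event(routing_key, repo, merged_by, merged_at, change_url):
    """Assemble a change event body describing who merged what, where, and when.

    Assumption: the key names mirror the Events API v2 change event shape;
    verify them against PagerDuty's documentation before sending anything.
    """
    return {
        "routing_key": routing_key,  # integration key for the target service
        "payload": {
            "summary": f"Merge to {repo} by {merged_by}",
            "timestamp": merged_at.isoformat(),  # when the merge happened
            "source": repo,  # where the change happened
            "custom_details": {"merged_by": merged_by},  # who did it
        },
        "links": [{"href": change_url, "text": "Merged change"}],
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    event = build_change_event(
        routing_key="EXAMPLE_KEY",
        repo="infra/terraform-network",
        merged_by="alice",
        merged_at=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        change_url="https://example.com/changes/1",
    )
    print(event["payload"]["summary"])
```

The sketch only builds the body; sending it, retrying on failure, and storing the integration key securely are left out.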
With key integrations in place, the team built out technical services to route incidents. Engineers were empowered to take ownership of the database of technical services, tracking who owns what. Dependencies were created across these technical services, enabling the correlation of issues across APIs. Over time, PagerDuty could determine potential contributing factors of an incident and narrow down the correct engineer. “We’re seeing the benefit of an AI lens over our services and dependencies,” said the Principal Developer. “PagerDuty has helped us become a lot more confident in our services, and provided us with a source of truth from an engineering lens on a technical service and its status.”
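To make the idea of service ownership and dependencies concrete, the sketch below models a small catalog of technical services and walks the dependency graph to find which alerting services sit beneath an affected one. This is a simplified illustration under our own assumptions, not PagerDuty's correlation logic: the service names, team names, and the "deepest alerting dependency is the likely origin" heuristic are all hypothetical.

```python
from collections import deque

# Hypothetical catalog: each technical service has an owner and dependencies.
SERVICES = {
    "website": {"owner": "web-team", "depends_on": ["cart-api", "catalog-api"]},
    "cart-api": {"owner": "cart-team", "depends_on": ["pricing-api"]},
    "catalog-api": {"owner": "catalog-team", "depends_on": []},
    "pricing-api": {"owner": "pricing-team", "depends_on": []},
}


def candidate_causes(service, alerting):
    """Return alerting services reachable from `service`, nearest first.

    Uses breadth-first search over the `depends_on` edges, so results are
    ordered by distance from the affected service.
    """
    seen = {service}
    queue = deque([service])
    found = []
    while queue:
        current = queue.popleft()
        if current in alerting:
            found.append(current)
        for dep in SERVICES[current]["depends_on"]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return found


def likely_owner(service, alerting):
    """Pick the owner of the deepest alerting dependency (assumed heuristic)."""
    causes = candidate_causes(service, alerting)
    if not causes:
        return None
    return SERVICES[causes[-1]]["owner"]


if __name__ == "__main__":
    alerting = {"website", "cart-api", "pricing-api"}
    print(candidate_causes("website", alerting))
    print(likely_owner("website", alerting))
```

Keeping ownership next to the dependency data is what lets a single lookup answer both "what is likely involved" and "who should be paged".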
It was critical for the retailer to understand an incident’s impact on the business. Using PagerDuty’s Business Services, it was able to effectively communicate information to the right business stakeholders. Even better, owners of a business service could subscribe to alerts or view the status dashboard to be informed about what was happening and when a resolution was in place. “Using PagerDuty has enabled our service desk to know immediately what capability might be disrupted with a particular incident,” explained the Principal Developer.
With PagerDuty, the retailer successfully launched its new website in-house on the new API platform. With better incident response operations in place, the company is set up to deliver an amazing online retail experience to customers while keeping its own employees happy.
The team has achieved:
The Principal Developer shared, “PagerDuty is helping us understand our applications, visualize our product health, and enable a culture of ownership.”
There are plans for continuous improvement across the organization. The company is exploring PagerDuty Analytics, including intelligent dashboards to measure the impact of incidents on teams, and will introduce postmortems to avoid repeating mistakes. It’s also actively investigating ways to best implement more Event Intelligence features to help the team reduce noise and drive down resolution time. To further streamline operations, PagerDuty will be rolled out to other teams including corporate infrastructure and security.
“Looking back on it all, we met the objectives and have a very strong platform which we can build on,” said the Principal Developer.
To learn more about how PagerDuty is helping retailers deliver amazing customer experiences, visit www.pagerduty.com/customers or www.pagerduty.com/industries/retail/, and start a 14-day free trial today.