Toil: Still Plaguing Engineering Teams
This post is an update of a popular post authored by Damon Edwards.
Our industry has always had localized expressions for work that was necessary but didn’t move the company forward. The SRE movement calls this type of work “toil.”
The concept of toil is a unifying force because it provides an impartial framework for identifying — then containing — the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn’t move the company forward.
Why Toil Matters
Unfortunately, “not enough time and too much to do” describes the default working conditions inside operations orgs. There is an unlimited supply of planned and unplanned work — new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.
With only so many hours in the day, how do you make sure what you’re working on actually makes a difference?
How do you make sure your team and your broader organization maximize the kinds of work that add value and find ways to eliminate work that doesn’t? After all, organization and team decisions dictate the majority of your work.
To maximize both the value of your engineering organization and the human potential of your colleagues, you need an objective framework to identify and contain the “wrong” kind of work and maximize the “right” kind of work. Understanding what toil is — and keeping the amount of toil contained — provides economic benefits to your company and improves the work lives of fellow engineers.
What is the Definition of Toil?
Google first popularized the term “toil” along with the SRE movement, and the term has since carried over into IT operations.
In a nutshell, SRE is about injecting software engineering practices — and a new mindset — into IT operations to create highly reliable and highly scalable systems. Interest in the topic of SRE has skyrocketed since Google published their Site Reliability Engineering book.
In the book, Vivek Rau articulates an excellent definition: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
The more of these attributes a task has, the more confidence you can have in classifying the work as “toil.” However, just because work is classified as toil doesn’t mean that a task is frivolous or unnecessary. On the contrary, most organizations would grind to a halt if the toil didn’t get done.
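As a rough illustration of that heuristic, the sketch below counts how many of the six attributes from the definition apply to a task. The `Task` class, the `toil_score` function, and the example tasks are hypothetical names made up for this illustration. They do not come from the SRE book, and the count is only a confidence signal, not an official scoring method.

```python
from dataclasses import dataclass

# The six attributes named in the definition quoted above.
TOIL_ATTRIBUTES = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly",
)


@dataclass
class Task:
    """Hypothetical record of a task and which toil attributes apply to it."""
    name: str
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    tactical: bool = False
    no_enduring_value: bool = False
    scales_linearly: bool = False


def toil_score(task: Task) -> int:
    """Count how many of the six toil attributes the task has (0 to 6)."""
    return sum(bool(getattr(task, attr)) for attr in TOIL_ATTRIBUTES)


# Made-up example tasks, for illustration only.
restart = Task(
    "restart a stuck worker by hand",
    manual=True,
    repetitive=True,
    automatable=True,
    tactical=True,
    no_enduring_value=True,
    scales_linearly=True,
)
dns_change = Task(
    "apply a DNS change from a ticket",
    manual=True,
    repetitive=True,
    automatable=True,
)
design = Task("design a new caching layer")

# List tasks from most toil-like to least toil-like.
for task in sorted([restart, dns_change, design], key=toil_score, reverse=True):
    print(f"{task.name}: {toil_score(task)}/6 toil attributes")
```

A higher count means more confidence that the work is toil. A high score does not mean the task is unnecessary.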
A goal of “no toil” sounds nice in theory. However, in reality, a “no toil” goal is not attainable in a business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil. Just because a task is necessary to deliver value to a customer doesn’t mean that it is always value-adding work. Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers). Long-term, we should want to eliminate the need for the toil.
The best we can hope for is to be effective at reducing toil and keeping toil at a manageable level across the organization. Toil will come from sources you already know about but just haven’t had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, service failover). Toil will also come from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, changing config settings).
What Should People Be Doing Instead of Toil?
Instead of engineers spending time on non-value-adding toil, you want them spending as much of their time as possible on value-adding engineering work.
Also pulling from Vivek Rau’s helpful definitions, engineering work can be defined as the creative and innovative work that requires human judgment, has enduring value, and can be leveraged by others.
Working in an organization with a high ratio of engineering work to toil feels like everyone is swimming towards a goal. Working in an organization with a low ratio of engineering work to toil feels more like you are treading water, at best, or sinking, at worst.
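One way to make that ratio concrete is to compute it from logged hours. The sketch below assumes a team already tags its time as either toil or engineering work. The function name and the hour figures are hypothetical and only meant to show the arithmetic.

```python
def engineering_to_toil_ratio(engineering_hours: float, toil_hours: float) -> float:
    """Return engineering hours divided by toil hours.

    A result above 1.0 means more time went to engineering than to toil.
    With zero toil hours the ratio is reported as infinity.
    """
    if engineering_hours < 0 or toil_hours < 0:
        raise ValueError("hours must be non-negative")
    if toil_hours == 0:
        return float("inf")
    return engineering_hours / toil_hours


# Made-up weekly figures for two teams, for illustration only.
swimming_team = engineering_to_toil_ratio(engineering_hours=30, toil_hours=10)
treading_team = engineering_to_toil_ratio(engineering_hours=8, toil_hours=32)

print(f"Team A ratio: {swimming_team:.2f}")  # 30 / 10
print(f"Team B ratio: {treading_team:.2f}")  # 8 / 32
```

Tracking this number over time, rather than reading a single week in isolation, is likely the more useful signal of whether toil is growing or shrinking.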
High Levels of Toil Are Toxic
Toil may seem innocuous in small amounts. However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization.
For the individual, high levels of toil lead to:
- Discontent and a lack of feeling of accomplishment
- Burnout
- More errors, leading to time-consuming rework to fix
- No time to learn new skills
- Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)
For the organization, high levels of toil lead to:
- Shortages of team capacity
- Excessive operational support costs
- Inability to make progress on strategic initiatives (the “everybody is busy, but nothing is getting done” syndrome)
- Inability to retain top talent (and acquire top talent once word gets out about how the organization functions)
One of the most dangerous aspects of toil is that eliminating it requires engineering work.
Reducing toil requires engineering time, either to build supporting automation that removes the need for manual intervention or to enhance the system so that the intervention is not needed in the first place.
Engineering work needed to reduce toil will typically be a choice of creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service to not require maintenance intervention.
Toil eats up the time needed to do the engineering work that will prevent future toil. If you aren’t careful, the level of toil in an organization can increase to a point where the organization won’t have the capacity needed to stop it. If we use the technical debt metaphor, this would be “engineering bankruptcy.”
The SRE model of working, and all of the benefits that come with it, depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional, long-suffering sysadmin with a new title.
Why PagerDuty Cares About Toil
One of our main goals is to improve the work-lives of operations professionals. Reducing toil and maximizing engineering time does just that.
Our users have often shown us how they use PagerDuty Process Automation and Rundeck in their efforts to reduce toil.
Benefits include:
- Reducing variation and errors by standardizing procedures.
- Making it easier to do the engineering work that reduces toil by automating tasks that were previously done by hand.
- Stopping one team from creating toil for another by enabling self-service, so others can do operations tasks themselves.
Contact us to learn more about PagerDuty Runbook Automation.