Software Forestry 0x06: The Controlled Burn
Did you know earthworms aren’t native to North America? Sounds crazy, but it’s true; or at least it has been since the glaciers of the last ice age scoured the continent down to the bedrock and took the earthworms with them. North America certainly has earthworms now, but as a recently introduced invasive species. (For everyone who just thought “citation needed”, Invasive earthworms of North America.)
As such, the biomes of North America have very different lifecycles than their counterparts in Eurasia do. In, say, a Redwood Forest, organic matter builds up in a way it doesn’t across the water. Things still rot, there’s still fungus and microbes and bugs and things, but there isn’t a population of worms actively breaking everything down. The biomass decays more slowly. Some buildup is a good thing: it provides a habitat for smaller plants and animals. But if it builds up too much, it can start choking plants out before it can break down into nutrients.
So what happens is, the forest catches on fire. In a forest with earthworms, a fire is pretty much always a bad thing. Not so much in the Redwoods, or other Californian forests. The trees are fire resistant, the fire clears away the excess debris and frees those nutrients, and many species of cone-bearing conifer trees (redwoods, pines, cypresses, and the like) have what are called “serotinous” cones, which means they only release their seeds after a fire. Some are literally covered in a layer of resin that has to melt off before the seeds can sprout. The fire rips through, clears out the debris, and the new plants can sprout in the newly fertilized ground. Fire isn’t a hazard to be endured; it’s been adopted as a critical part of the entire ecosystem’s lifecycle.
Without human intervention, fires happen semi-regularly due to lightning. Of course, that’s a little unpredictable and doesn’t always turn out great. But the real problem is when humans prevent fires from taking hold, and then no matter how much you “sweep the forest,” the debris and overgrowth build up and build up, until you get the really huge fires we’ve been having out here.
The people who used to live here (Before, ahh… a bunch of other people “showed up and took over” who only knew how to manage forests with earthworms) knew what the solution was: the Controlled Burn. You choose a time, make some space, and carefully set the fire, making sure it does what it needs to do in the area you’ve made safe, but keep it out of places where the people are. In CA at least, we’re starting to adopt controlled burns as an intentional management technique again, a few hundred years later. (The biology, politics, history, and sociology of setting a forest on fire on purpose are beyond our scope here, but you get the general idea.)
I think a lot of Software Forests are like this too.
Every place I’ve ever worked has struggled with figuring out how to plan and estimate ongoing maintenance outside of a couple of very narrow cases. If it’s something specific, like a library upgrade, or a bug, you can usually scope and plan that without too much trouble. But anything larger is a struggle, because those larger maintenance and care efforts are harder to estimate, especially when there isn’t a specific & measurable customer-facing impact. You don’t have a “thing” you can write a bug on. You don’t know what the issues are, specifically, it’s just acting bad.
The problem requires sustained focus, the kind that lasts long enough to actually make a difference. And that’s hard to get.
One of the reasons why Cutting Trails is so effective is that it doesn’t take that much more time than the work the trail is being cut towards. Back when estimating via Fibonacci Sequence was all the rage, the extra work to cut the trail usually didn’t get you up to the next Fibonacci number.
Furthermore, the effort to get in and actually estimate and scope some significant maintenance work is often more work than the actual changes. It’s wasteful to spend a week investigating and then write up a plan for someone to do later. You’re already in there!
Finally, rarely is there a direct advocate. There’s nearly always someone who acts as the Voice of the Customer, or the Voice of the Business, but very rarely is anyone the Voice of the Forest.
(I suspect this is one of the places where agile leads us astray. The need to have everything be a defined amount of work that someone can do in under a week or two makes it incredibly easy to just not do work that doesn’t lend itself to being clearly defined ahead of time.)
So the overgrowth and debris build up, and you get the software equivalent of an unchecked forest fire: “We need to just rewrite all of this.”
No you don’t! What you need are some Controlled Burns.
It goes like this:
Most Forests have more than one application, for a wide definition of “application.” There’s always at least one that’s limping along, choked with Overgrowth. Choose one. Find a single person to volunteer. (Or get volun-told.) Clear their schedule for a month. Point them at the app with overgrowth and let them loose to fix stuff.
We try to be process-agnostic here at Software Forestry, but we acknowledge most folks these days are doing something agile, or at least agile-adjacent. Two-week sprints seem to have settled as the “standard” increment size, so a month is two sprints. That’s not nothing! You gotta mean it to “lose” a resource for that much time. But also, you should be able to absorb four weeks of vacation in a quarter, and this is less disruptive than that. Maybe schedule it as one sprint with the option to extend to a second depending on how things look “next week.”
It helps, but isn’t mandatory, to have success metrics ahead of time. Sometimes, the right move is to send the person in there and assume you’ll find something to paint a bullseye around. But most of the time you’ll want to have some kind of measurement you can do a before-and-after comparison with. The easiest ones are usually performance related, because you can measure them objectively, and they probably aren’t getting handled as part of the normal “run the business” work. Things like “we currently process x transactions per second, we need to get that to 2x,” or “cut RAM use by 10%,” or “why is this so laggy sometimes?”
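If it helps to make that concrete, here is a minimal sketch of writing the before-and-after targets down where a script can check them. Everything in it is made up for illustration: the `Metric` class, the metric names, and the numbers are assumptions, not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One before-and-after measurement for a Controlled Burn (hypothetical helper)."""
    name: str
    before: float                  # value measured before the burn starts
    target: float                  # value we agreed to aim for
    higher_is_better: bool = True  # throughput: True; RAM or latency: False

    def met(self, after: float) -> bool:
        """Did the measurement taken after the burn reach the target?"""
        if self.higher_is_better:
            return after >= self.target
        return after <= self.target

    def change_pct(self, after: float) -> float:
        """Percent change relative to the starting value."""
        return (after - self.before) / self.before * 100.0


# Illustrative numbers only: "get x transactions per second to 2x"
throughput = Metric("transactions/sec", before=120.0, target=240.0)

# Illustrative numbers only: "cut RAM use by 10%"
ram = Metric("RAM (MB)", before=2048.0, target=2048.0 * 0.9,
             higher_is_better=False)

for metric, after in [(throughput, 250.0), (ram, 1800.0)]:
    status = "met" if metric.met(after) else "not met"
    print(f"{metric.name}: {metric.before} -> {after} "
          f"({metric.change_pct(after):+.1f}%), target {status}")
```

The point isn’t the code, it’s that the “before” number gets captured on day one, while you still can.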
I did a Controlled Burn once on a system that needed to, effectively, scan every record in a series of database tables to check for things that needed to be deleted off of a storage device. It scanned everything, then started over and scanned everything again. When I started, it was taking over a day to get through a cycle, and that time was increasing, because it wasn’t keeping up with the amount of new work sliding in. No one knew why it took that long, and everyone with real experience with that app was long gone from the company. After a month of dedicated focus, it got through a cycle in less than two hours. Fixed a couple bits of buggy behavior while I was at it. No big changes, no re-architecture, no platform changes, just a month of dedicated focus and cleanup. A Controlled Burn.
This is the time to get that refactoring done—fix that class hierarchy, split that object into some collaborators. Write a bunch of tests. Refactor until you can write a bunch of tests. Fix that thing in the build process everyone hates. Attach some profilers and see where the time is going.
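For the “attach some profilers” part, here is a rough sketch of what a first look can be, assuming the app is in Python and using the standard library’s `cProfile` and `pstats`. The `slow_scan` function is a hypothetical stand-in with a deliberately planted hotspot, not code from any real system, and `profile_call` is just a convenience wrapper invented for this example.

```python
import cProfile
import io
import pstats


def slow_scan(records):
    """Hypothetical stand-in for an overgrown code path."""
    seen = []
    for record in records:
        # Membership test on a list is O(n); this is the planted hotspot.
        if record not in seen:
            seen.append(record)
    return len(seen)


def profile_call(func, *args, top=5):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_scan, list(range(500)) * 2)
print(report)
```

On a real system you’d point this at an actual entry point with production-shaped data, but the loop is the same: measure, fix the top item, measure again.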
Dig in, focus, and burn off as much overgrowth as you can. And then leave a list of things to do next time. You should know enough now to do a reasonable job scoping and estimating the next steps; write those up for the to-do list. Plant some seeds for new growth. You shouldn’t have to do a Controlled Burn on the same system twice.
Deploying this kind of directed focus can be incredibly powerful. The average team can absorb maybe one or two of these a year, so deploy them with purpose.
Sometimes, all the care in the world won’t do the trick, and you really do need to replace a system. Next time: The Trellis Pattern