software forestry Gabriel L. Helman

Software Forestry 0x07: The Trellis Pattern

In general, rewriting a system from scratch is a mistake. There are times, however, when replacing a system makes sense.

You’ve checked off every type in the Overgrowth Catalogue, entropy is winning, and someone does a back-of-the-envelope cost analysis and finally says “you know, I think it really would be cheaper to replace this thing.”

Part of what makes this tricky is that presumably, this system has customers and is bringing in revenue. (If it didn’t you could just shut it down and build something new without a hassle.) And so, you need to build and deploy a new system while the old one is still running.

The solution is to use the old system as a Trellis.

The ultimate goal is to eventually replace the legacy system, but you can’t do it all at once. Instead, use the existing system as a Trellis, supporting the new system as it grows alongside.

This way, you can thoughtfully replace a part at a time, carefully roll it out to customers, all while maintaining the existing—and working—functionality. The system as a whole will evolve from the current condition to the improved final form.

As you work through each capability, you can either use the legacy system as a blueprint to base the new one on, or harvest code for re‐use.

The great thing about a Trellis is that the Trellis and the Tree are partners. They share the same goal: for the tree to outgrow the trellis. But the trellis can be an active partner. I have a tree right now that still has part of a trellis holding up some branches. It’s one of those trees that got a little too big a little too fast, put on a little too much fruit. A few years ago it had supports and trellises all around; now it’s down to just one whimsically-leaning stick. If things go well, I’ll be able to pull that out next summer.

The old system isn’t abandoned, it transitions into a new role. You can add features to make it work better as a trellis: an extra API endpoint here, a data export job there. The new system calls into the old one to accomplish something, and then you throw the switch and the old system starts calling into the new one to do that same thing.
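The call-direction switch described above can be sketched as a thin routing layer that every caller goes through. This is a minimal Python sketch under stated assumptions: `LegacyBilling`, `NewBilling`, `TrellisRouter`, and the `invoice_total` capability are hypothetical names, and a real system would more likely throw the switch through configuration or a feature-flag service than an in-memory set.

```python
class LegacyBilling:
    """Hypothetical legacy implementation (the Trellis)."""
    name = "legacy"

    def invoice_total(self, cents: list[int]) -> int:
        return sum(cents)


class NewBilling:
    """Hypothetical replacement implementation (the new growth)."""
    name = "new"

    def invoice_total(self, cents: list[int]) -> int:
        return sum(cents)


class TrellisRouter:
    """Single entry point used by both old and new callers.

    Each capability is served by the legacy system until its switch is
    thrown; after that, the same call lands in the new system.
    """

    def __init__(self, legacy: LegacyBilling, new: NewBilling) -> None:
        self.legacy = legacy
        self.new = new
        self.switched: set[str] = set()  # capabilities already moved over

    def throw_switch(self, capability: str) -> None:
        self.switched.add(capability)

    def route(self, capability: str):
        # Unswitched capabilities keep using the legacy implementation.
        return self.new if capability in self.switched else self.legacy

    def invoice_total(self, cents: list[int]) -> int:
        return self.route("invoice_total").invoice_total(cents)
```

Because callers only ever talk to the router, moving one capability at a time is invisible to them, and once every capability has been switched, the legacy class can be retired without touching any caller.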

Eventually the new system pulls away from the Trellis, grows beyond it. And if you do it right, the trellis falls away when its job is done. Ideally, in such a way that you could never tell it was there in the first place.

Sometimes, you can schedule the last switchover and have a big party when you turn the old system off. But if you’re really lucky, there comes a day where you realize the old load balancer crashed a month ago and no one noticed.

🌲

There’s a social & emotional aspect to this as well, which goes almost entirely undiscussed.

If we’re replacing a system, it’s probably been around for a while. We’re probably talking about systems with a decade+ of run time. The original architects may have moved on, but there are people who work on it, probably people who have built a whole career out of keeping it running.

There are always some emotions when it comes time to start replacing the old stuff. Some stereotypes exist for a reason, and the sorts of people who become successful software engineers or business people tend to be the sorts of folks for whom “empathy” was their dump stat. There’s something galling about the newly arrived executive talking about moving on from the old and busted system, or the new tech lead espousing how much better the future is going to be. The old system may have gotten choked out by overgrowth, left behind by the new growth of the tech industry, but it’s still running, still pulling in revenue. It’s paying for the electricity in the projector that’s showing the slide about how bad it is. It deserves respect, and so do the people who worked on it.

That’s the point: the Trellis is a good thing, it’s positive. The old system, and the old system’s staff, have a key role to play. Everyone is on the same team, everyone has the same goal.

🌲

There’s an existing term that’s often used for a pattern similar to this. I am, of course, talking about the Strangler Fig pattern. I hate this term, and I hate the usual shorthand of “strangler pattern” even more.

Really? Your mental model for bringing in new software is that it’s an invasive parasite that slowly drains nutrients and kills its host? There are worse ways to go than being strangled, but not by much.

This isn’t an isolated style of metaphor, either. I used to work with someone—who was someone I liked, by the way—who used to say that every system being replaced needed someone to be an executioner and an undertaker.

Really? Your mental model for something ending is violent, state-mandated death?

If Software Forestry has a central thesis, it is this: the ways we talk about what we do and how we do it matter. I can think of no stronger example of what I mean than otherwise sane professionals describing their work as murdering the work of their colleagues, and then being surprised when there’s resistance.

What we do isn’t violent or murderous, it is collaborative and constructive.

What I dislike the most about Strangler Figs, though, is that a Strangler Fig can never exceed the original host, only succeed it. The Fig is bound to the host forever, at first for sustenance, and then, even after the host has died and rotted away, the Fig has an empty space where the host once stood, a ghost haunting the parasite that it can never fully escape from.

🌲

So if we’re going to use a real tree as our example for how to do this, let’s use my favorite trees.

Let me tell you about the Coastal Redwoods.

The Redwood forest is a whole ecosystem to itself, not just the trees, but the various other plants growing beneath them. When a redwood gets to the end of its life, it falls over. But that fallen tree then serves as the foundation to a whole new mini-ecosystem. The ferns and sorrel cover the fallen trunk. Seedlings sprout up in the newly exposed sunlight. Burls or other nodes sprout new trees from the base of the old, meaning maybe the tree really didn’t die at all, it just transitioned. From one tree springs a whole new generation of the forest.

There are deaths other than murder, and there are endings other than death.

Let’s replace software like a redwood falls: a loud noise, and then an explosion of new possibilities.

Software Forestry 0x06: The Controlled Burn

Did you know earthworms aren’t native to North America? Sounds crazy, but it’s true; or at least it has been since the glaciers of the last ice age scoured the continent down to the bedrock and took the earthworms with them. North America certainly has earthworms now, but as a recently introduced invasive species. (For everyone who just thought “citation needed”, see Invasive earthworms of North America.)

As such, the biomes of North America have very different lifecycles than their counterparts in Eurasia do. In, say, a Redwood Forest, organic matter builds up in a way it doesn’t across the water. Things still rot, there’s still fungus and microbes and bugs and things, but there isn’t a population of worms actively breaking everything down. The biomass decays slower. Some buildup is a good thing, it provides a habitat for smaller plants and animals, but if it builds up too much, it can start choking plants out before it can break down into nutrients.

So what happens is, the forest catches on fire. In a forest with earthworms, a fire is pretty much always a bad thing. Not so much in the Redwoods, or other Californian forests. The trees are fire resistant, the fire clears away the excess debris, frees those nutrients, and many species of cone-bearing conifers (redwoods, pines, cypresses, and the like) have what are called “serotinous” cones, which means they only germinate after a fire. Some are literally covered in a layer of resin that has to melt off before the seeds can sprout. The fire rips through, clears out the debris, and the new plants can sprout in the newly fertilized ground. Fire isn’t a hazard to be endured; it’s been adopted as a critical part of the entire ecosystem’s lifecycle.

Without human intervention, fires happen semi-regularly due to lightning. Of course, that’s a little unpredictable and doesn’t always turn out great. But the real problem is when humans prevent fires from taking hold, and then no matter how much you “sweep the forest,” the debris and overgrowth build up and build up, until you get the really huge fires we’ve been having out here.

The people who used to live here (Before, ahh… a bunch of other people “showed up and took over” who only knew how to manage forests with earthworms) knew what the solution was: the Controlled Burn. You choose a time, make some space, and carefully set the fire, making sure it does what it needs to do in the area you’ve made safe, but keep it out of places where the people are. In CA at least, we’re starting to adopt controlled burns as an intentional management technique again, a few hundred years later. (The biology, politics, history, and sociology of setting a forest on fire on purpose are beyond our scope here, but you get the general idea.)

I think a lot of Software Forests are like this too.

Every place I’ve ever worked has struggled with figuring out how to plan and estimate ongoing maintenance outside of a couple of very narrow cases. If it’s something specific, like a library upgrade, or a bug, you can usually scope and plan that without too much trouble. But anything larger is a struggle, because those larger maintenance and care efforts are harder to estimate, especially when there isn’t a specific & measurable customer-facing impact. You don’t have a “thing” you can write a bug on. You don’t know what the issues are, specifically, it’s just acting bad.

The problem requires sustained focus, the kind that lasts long enough to actually make a difference. And that’s hard to get.

One of the reasons why Cutting Trails is so effective is that it doesn’t take that much more time than the work the trail is being cut towards. Back when estimating via Fibonacci Sequence was all the rage, the extra work to cut the trail usually didn’t get you up to the next Fibonacci number.

Furthermore, the effort to get in and actually estimate and scope some significant maintenance work is often more work than the actual changes. It’s wasteful to spend a week investigating and then write up a plan for someone to do later. You’re already in there!

Finally, rarely is there a direct advocate. There’s nearly always someone who acts as the Voice of the Customer, or the Voice of the Business, but very rarely is anyone the Voice of the Forest.

(I suspect this is one of the places where agile leads us astray. The need to have everything be a defined amount of work that someone can do in under a week or two makes it incredibly easy to just not do work that doesn’t lend itself to being clearly defined ahead of time.)

So the overgrowth and debris builds up, and you get the software equivalent of an unchecked forest fire: “We need to just rewrite all of this.”

No you don’t! What you need are some Controlled Burns.

It goes like this:

Most Forests have more than one application, for a wide definition of “application.” There’s always at least one that’s limping along, choked with Overgrowth. Choose one. Find a single person to volunteer. (Or get volun-told.) Clear their schedule for a month. Point them at the app with overgrowth and let them loose to fix stuff.

We try to be process-agnostic here at Software Forestry, but we acknowledge most folks these days are doing something agile, or at least agile adjacent. Two-week sprints seem to have settled as the “standard” increment size, so a month is two sprints. That’s not nothing! You gotta mean it to “lose” a resource for that much time. But also, you should be able to absorb four weeks of vacation in a quarter, and this is less disruptive than that. Maybe schedule it as one sprint with the option to extend to a second depending on how things look “next week.”

It helps, but isn’t mandatory, to have success metrics ahead of time. Sometimes, the right move is to send the person in there and assume you’ll find something to paint a bullseye around. But most of the time you’ll want to have some kind of measurement you can do a before-and-after comparison with. The easiest ones are usually performance related, because you can measure those objectively, but probably aren’t getting handled as part of the normal “run the business.” Things like “we currently process x transactions per second, we need to get that to 2x,” or “cut RAM use by 10%,” or “why is this so laggy sometimes?”
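For a before-and-after number like “x transactions per second,” even a crude harness helps, as long as the same sample is measured both times. This is a minimal Python sketch, assuming the work can be invoked one item at a time; `throughput` is a hypothetical helper built on the standard library’s `time.perf_counter`, and in a real system you would more likely lean on existing metrics or a profiler.

```python
import time
from typing import Callable, Iterable


def throughput(process: Callable[[object], object], items: Iterable[object]) -> float:
    """Return items processed per second for one pass over `items`."""
    batch = list(items)  # materialize so repeated runs see identical input
    start = time.perf_counter()
    for item in batch:
        process(item)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing on trivially small batches.
    return len(batch) / elapsed if elapsed > 0 else float("inf")
```

Recording `throughput(old_handler, sample)` before the Burn and `throughput(new_handler, sample)` at the end (both handler names hypothetical) turns a goal like “get that to 2x” into a simple ratio check.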

I did a Controlled Burn once on a system that needed to, effectively, scan every record in a series of database tables to check for things that needed to be deleted off of a storage device. It scanned everything, then started over and scanned everything again. When I started, it was taking over a day to get through a cycle, and that time was increasing, because it wasn’t keeping up with the amount of new work sliding in. No one knew why it took that long, and everyone with real experience with that app was long gone from the company. After a month of dedicated focus, it got through a cycle in less than two hours. Fixed a couple bits of buggy behavior while I was at it. No big changes, no re-architecture, no platform changes, just a month of dedicated focus and cleanup. A Controlled Burn.

This is the time to get that refactoring done—fix that class hierarchy, split that object into some collaborators. Write a bunch of tests. Refactor until you can write a bunch of tests. Fix that thing in the build process everyone hates. Attach some profilers and see where the time is going.

Dig in, focus, and burn off as much overgrowth as you can. And then leave a list of things to do next time. You should know enough now to do a reasonable job scoping and estimating the next steps; write those up for the to-do list. Plant some seeds for new growth. You shouldn’t have to do a Controlled Burn on the same system twice.

Deploying this kind of directed focus can be incredibly powerful. The average team can absorb maybe one or two of these a year, so deploy them with purpose.

🌲

Sometimes, all the care in the world won’t do the trick, and you really do need to replace a system. Next time: The Trellis Pattern

Software Forestry 0x05: Cutting Trails

The lived reality of most Software Foresters is that we spend all our time with large systems we didn’t see the start of, and won’t see the end of. There’s a lot of takes out there about what makes a system “Legacy” and not “just old”, but one of the key things is that Legacy Systems are code without continuity of philosophy.

Because almost certainly, the person who designed and built it in the first place isn’t still here. The person that person trained probably isn’t still here. Given the average tenure in tech roles, it’s possible the staff has turned over half-a-dozen or more times.

Let me tell you a personal example. A few lifetimes ago, I worked on one of these two-decade old Software Forests. Big web monolith, written in what was essentially a homebrew MVC-esque framework which itself was written on top of a long-deprecated 3rd party UI framework. The file layout was just weird. There clearly was a logic to the organization, but it was gone, like tears in the rain.

Early on, I had a task where I needed to add an option to a report generator. From the user perspective, I needed to add an option to a combobox on a web form, and then when the user clicked the Generate button, read that new option and punch the file out differently.

I couldn’t find the code! I finally asked my boss, “hey, is there any way to tell which file has which UI pages?” The response was, “no, you just gotta search.”

As they say, Greppability Is an Underrated Code Metric.

(Actually, I’ve worked on two big systems now where I had essentially this exact conversation. The other one was the one where one of the other engineers described it as having been built by “someone who knew everything about how JSP tags worked except when to use them.”)

So you search for the distinctive text in the button, or the combo box, or something on the page. You find the UI. Then you start tracing in. Following the path of execution, dropping into undocumented methods with unhelpful names, bouncing into weird classes with no clear design, strange boundaries, one minute a function with a thousand lines, the next minute an inheritance hierarchy 10 levels of abstraction deep to call one line.

At this point you start itching. “I could rewrite all of this,” you think. “I could get this stood up in a weekend with Spring Boot/React/Ruby on Rails/Elixir/AWS Lambdas/Cool Thing I Used Last”. You start gazing meaningfully at the copy of the Refactoring book on the shelf. But you gotta choke down that urge to rebuild everything. You have a bug you have to fix or a feature to deploy. But it’s not actually going to get better if you keep digging the hole deeper. You gotta stop digging, and leave things better for the next person.

You need to Cut a Trail.

1. Leave Trail Markers.

First thing is, you have to figure out what it does now before you change anything. In a sane, kind world, there would be documentation, diagrams, clear training. And that does happen, but very, very rarely. If you’re very lucky, there’s someone who can explain it to you. Otherwise, you have to do some Forensic Architecture.

Talk to people. Add some logging. Keep clicking and watching what it does. Step through it in a debugger, if you can and if that helps, although I’ve personally found that just getting a debugger working on a live system is often more work than it’s worth for the information you get out of it. But most of all, read. Read the code closely, load as much of that system’s state and behavior into your mind as you can. Read it like you’re in High School English and trying to pull the symbolism out of The Grapes of Wrath or The Great Gatsby. That weird function call is the light at the end of the pier, those abstractions are the eyes on the billboard. What do they mean? Why are they here? How does any of this work?

You’ll get there, that’s what we do. There’s a point where you’ll finally understand enough to make the change you want to make. The trick is to stop at this point, and write everything down. There’s a “haha only serious” joke that code comments are for yourself in six months, but—no. Your audience here is you, a week ago. Write down everything you needed to know when you started this. Every language has a different way to do embedded documentation or comments, but they all have a way to do it. Document every method or function that your explored call path went through. Write down the thing you didn’t understand when you started, the strange overloaded behavior of that one parameter, what that verb really means in the function name, as much as possible, why it does what it does.
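As an illustration of what those trail markers might look like in Python, here is a hypothetical legacy function (`gen_rpt` and its parameters are invented for this example) after someone has traced through it and written down what they wish they had known on day one: the overloaded parameter, what the verb in the name really means, and what each mode does.

```python
def gen_rpt(rows, mode=0, width=None):
    """Format report rows of (name, amount) pairs for export.

    Trail markers (hypothetical notes left after tracing this call path):
    - "gen" here means "format"; nothing is sent or written by this function.
    - `mode` is overloaded: 0 means comma-separated text, 1 means
      fixed-width text, and any negative value means "dry run": the raw
      rows are returned unformatted.
    - `width` is only read when mode == 1; it pads the name column and
      defaults to 10.
    """
    if mode < 0:
        return rows  # dry run: hand back the input untouched
    if mode == 1:
        w = width or 10
        return "\n".join(f"{name:<{w}}{amount}" for name, amount in rows)
    return "\n".join(f"{name},{amount}" for name, amount in rows)
```

The code itself is unchanged by this exercise; the value is that the next person reading `gen_rpt` does not have to rediscover what `mode=-1` does.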

Take an hour and draw the diagram you wish you’d had. Write down your notes. And then leave all that somewhere that other people can find it. If you’re using a language where the embedded documentation system can pull in external files, check that stuff right into the codebase. Most places have an internal wiki. Make a page for your team if there isn’t one. Under that, make a page for the app if it doesn’t have one. Then put all that you’ve learned there.

Something else to make sure to document early on: terminology. Everyone uses the same words to mean totally different things. My personal favorite example: no two companies on earth use the word “flywheel” the same way. It doesn’t matter what it was supposed to mean! Ask. Then write it down. The weird noun you tripped over at the start of this? Put the internal definition somewhere you would have found it.

People frequently object that they don’t have the time to do this, to which I say: you’ve already done the hard part! 90% of the time for this task was figuring it out! Writing it down will take a fraction of the time you already had to spend, I promise. And when you’re back here in a year, the time you save in being able to reload all that mental state is going to more than pay for that afternoon you spent here.

2. Write Tests.

Tests are really underrated as a documentation and exploration technique. I mean, using them to actually test is good too! But for our purposes we’re not talking about a formal TDD or Red-Green-Refactor–style approach. That weird function? Slap some mocks and stubs together and see what it does. Throw some weird data at it. Your goal isn’t to prove it correct, but to act like one of those Edwardian Scientists trying to figure out how air works.

Another Forest I inherited once, which was a large app that customers paid real money to use, had a test suite of 1 test—which failed. But that was great, because there was already a place to write and run tests.

Tests are a net benefit, they don’t all have to be thorough, or fall into strict unit/integration/acceptance boundaries. Sometimes, it’s okay to put a couple of little weird ones in there that exist to help explain what some undocumented code does.
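A small Python example of that kind of exploratory test, using the standard library’s `unittest` and `unittest.mock`. The legacy function `lookup_discount` and its `repo.find` dependency are hypothetical; the point is that each test records what the code does today, surprises included, rather than what it ought to do.

```python
import unittest
from unittest import mock


def lookup_discount(customer_id, repo):
    """Hypothetical legacy code under exploration."""
    record = repo.find(str(customer_id).strip())
    if not record:
        return 0
    return int(record.get("disc", "0").rstrip("%"))


class LookupDiscountExploration(unittest.TestCase):
    """Exploratory tests: they pin down current behavior, not intent."""

    def test_unknown_customer_gets_zero(self):
        repo = mock.Mock()
        repo.find.return_value = None
        self.assertEqual(lookup_discount("42", repo), 0)

    def test_ids_are_stringified_and_stripped(self):
        repo = mock.Mock()
        repo.find.return_value = {"disc": "15%"}
        lookup_discount(" 42 ", repo)
        repo.find.assert_called_once_with("42")

    def test_percent_sign_is_dropped(self):
        repo = mock.Mock()
        repo.find.return_value = {"disc": "15%"}
        self.assertEqual(lookup_discount(42, repo), 15)

    def test_missing_field_means_no_discount(self):
        repo = mock.Mock()
        repo.find.return_value = {"name": "x"}  # a record, but no "disc" key
        self.assertEqual(lookup_discount(42, repo), 0)
```

Run it with `python -m unittest` from the directory containing the file. Each test name doubles as a one-line note about a behavior someone had to discover the hard way.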

If you’re unlucky enough to run into a Forest with no test runner, trust me, take the time to bolt one on. It doesn’t have to be perfect! But you’ll make that time back faster than you’d believe.

When you get done, in addition to whatever “normal” Unit or Integration tests your process requires or requests, write a really small test that demonstrates what you had to do. Link that back to the notes you wrote, or the documentation you checked in.

3. A Little Cleanup, some Limited Refactoring

Once you have it figured out, and have a test or two, there’s usually two strong responses: either “I need to replace all of this right now”, or “This is such a mess it’ll never get better.”

So, the good news is that both of those are wrong! It can get better, and you really probably shouldn’t rework everything today.

What you should do is a little cleanup. Make something better. Fix those parameter names, rename that strangely named function. Heck, just fix the tenses of the local variables. Do a little refactoring on that gross class, split up some responsibilities. It’s always okay to slide in another shim or interface layer to add some separation between tangled things.

(Don’t leave a huge mess in the VC diff, though, please.)

Leave the trail a little cleaner than when you found it. Doesn’t have to be a lot, we don’t need to re-landscape the whole forest today.

4. Write the New Stuff Right (as possible)

A lot of the time, you know at least one thing the original implementors didn’t: you know how the next decade went. It’s very easy to come in much later, and realize how things should have been done in the first place, because the system has a lot more years on it now than it used to. So, as much as you can, build the new stuff the right way. Drop that shim layer in, encapsulate the new stuff, lay it out right. Leave yourself a trail to follow when you come back and refactor the rest of it into shape.

But the flip side of that is:

5. Don’t be a Jerk About It

Everyone has worked on a codebase where there’s “that” module, or library, or area, where “that guy” had a whole new idea about how the system should be architected, and it’s totally out of place with everything else. A grove of palm trees in the middle of a redwood forest. Don’t be that guy.

I worked on an e-commerce system once where the Java package name was something like com.company.services.store, and then right next to it was com.company.services.store2. The #2 package was one former employee’s personal project to refactor the whole system; they had left years before with it half done, but of course it was all in production, and it was a crapshoot which version other parts of the system called into. Don’t do that.

After you’re gone, when someone looks at the version control change log for this part of the system, you want them to see your name and think “oh, this one had the right idea.”

Software Forestry is a group project, for the long term. Most of the time, “consistency and familiarity” are more valuable than some kind of quixotic quest for the ideal engineering. It’s okay, we’ll get there. Keep leaving it better than you found it. It’ll be worth it.

🌲

You’ll be amazed what your overgrown codebase looks like after a couple months of doing this. That tangled overgrowth starts to look positively tidy.

But sometimes, just cutting trails doesn’t get you there. Next Time: The Controlled Burn.

Software Forestry 0x04: Library Upgrade Week

Here at Software Forestry we do occasionally try to solve problems instead of just turning them into lists of smaller problems, so we’re gonna do a little mood pivot here and start talking about how to manage some of those forms of overgrowth we talked about last time.

First up: let me tell you the good news about Library Upgrade Week.

Just about any decently sized software system uses a variety of third party libraries. And why wouldn’t you? The multitude of high-quality libraries and frameworks out there is probably the best outcome of both the Open Source movement and Object-Oriented Software design. The specific mechanics vary between languages and their practitioners’ cultures, but the upshot is that very rarely does anyone build everything from scratch. There’s no need to go all historical reenactment and write your own XML parser.

Generally, people keep those libraries pinned and stay on one fixed version, rather than schlepping in a new version every time an update happens. This is a good thing! Change is risk, and risk should be taken on intentionally. The downside is that those libraries keep moving forward, the version you’re using slips out of date, and now you have a bunch of Overgrowth. And so that means you need to upgrade them.

The upshot of all that is on a semi-regular basis, we all need to slurp in a bunch of new code no-one on the payroll wrote, and don’t really know how to test. Un-Reviewed Code Is Tech Debt, and one of the mantras of writing good tests is “don’t test the framework”, so this is always a little iffy.

It’s incredibly easy to just keep letting those weeds grow a little longer. “The new version doesn’t have anything we need”, “there’s no bugs”, “if it ain’t broke don’t fix it”, and so on. They always take too long, don't usually deliver immediate gratification, and are hard to schedule. It’s no fun, and no one likes to do it.

The trick is to turn it into a party.

It works like this: set aside the last week of the quarter to concentrate on 3rd party library upgrades. Regardless of what your formal planning cycle or process is, most businesses tend to operate on quarters, and there’s usually a little dead time at the end of the quarter you can repurpose.

The Process:

  1. Form squads. Each squad is a group of like-minded individuals focused on a single 3rd party library or framework. Squads are encouraged to be cross-team. Each squad will focus on updating that 3rd party library in all applications or places where it is used. The intent is to make this a group event, where people can help each other out. Participation is not mandatory.

  2. Share squad membership and goals ahead of time. Leadership should reserve the right to veto libraries as “too scary” or “not scary enough”. Libraries with high-severity alerts or known CVEs are good candidates.

  3. That week, each squad self organizes and works as a group through any issues caused by the upgrade. Other than major outages or incidents, squad members should be excused from other “run the business” type work for that week; or rather, the library upgrades are “the business.” Have fun!

  4. On that Friday hold the Library Upgrade Week Show-n-Tell. Every team should demo what they did, how they did it, and what it took to pull it off. Tell war stories, hold a happy hour, swap jokes. If a squad doesn’t finish, that’s okay! The expectation is that they’ll have learned a lot about what it’ll take to finish, and that work will be captured in the relevant team’s todo lists. If you’re in a process with short develop-deploy increments (like sprints) you can make the library upgrade(s) a release on its own. Ideally you already have a way to sign off a release as not containing regressions, and so a short release with just a library upgrade is a great way to make sure you didn’t knock some dominos over.

But wait! There's more! All participants will vote on awards to give to squads, for things like:

  • Error message with least hits on Stack Overflow
  • Largest version number jump
  • Most lines changed
  • Fewest lines changed
  • Best team name
  • Best presentation

Go nuts! Have a great time!

🌲

Yes, it’s a little silly, but that’s the point. I’ve deployed a version of this at a couple of jobs now, and it’s remarkable how effective it is. The first couple of cycles people hit the “easy” ones: uprev the logging library or a JSON parser or something. But then, once people know that Library Upgrade Week is coming, they start thinking about the harder stuff, and you start getting people saying they want to take a swing at the main framework, or the main language version, or something else load-bearing. It’s remarkable how much progress two or three people can make on a problem that looks unsolvable when they have an uninterrupted week to chew on it. (If you genuinely can’t spare a handful of folks to do some weeding four weeks out of the year, that’s a much larger problem than out-of-date libraries, and you should go solve that problem first. Like, right now.)

There’s an instinct to take the core idea of scheduling this kind of maintenance a few times a year, but leave off the part where it’s a party. This is a mistake. This is work people want to do even less than their usual work; the trick is to make everything around it fun.

We’re Foresters, and both we and the Forest are here long term. The long term health of both depends on the care of the Forest being something that the Foresters enjoy, and it’s okay to stack that deck in your favor.

Software Forestry 0x03: Overgrowth, Catalogued

Previously, we talked about What We Talk About When We Talk About Tech Debt, and that one of the things that makes that debt metaphor challenging is that it has expanded to encompass all manner of Overgrowth, not all of which fits that financial mental model.

From a Forestry perspective, not all the things that have been absorbed by “debt” are necessarily bad, and aren’t always avoidable. Taking a long-term, stewardship-focused view, there’s a bunch of stuff that’s more like emergent properties of a long-running project, as opposed to getting spendy with the credit card.

So, if not debt, what are we talking about when we talk about tech debt?

It’s easy to get over-excited about Lists of Things, but I got into computer science from the applied philosophy side, rather than the applied math side. I think there are maybe seven categories of “Overgrowth” that are different enough to make it useful to talk about them separately:

1. Actual Tech Debt.

Situations where someone made an explicit decision to do something “not as good” in order to ship faster. There are two broad subcategories here: using a hacky or unsustainable design to move faster, and cutting scope to hit a date.

In fairness, the original Martin Fowler post just talks about “cruft” in a broad sense, but generally speaking “Formal” (orthodox?) tech debt assumes a conscious choice to accept that debt.

This is the category where the debt analogy works the best. “I can’t buy this now with cash on hand, but I can take on more credit.” (Of course, this also includes “wait, what’s a variable rate?”)

In my experience, this is the least common species of Overgrowth, and the most straightforwardly self-correcting. All development processes have some kind of “things to do next” list or backlog, regardless of the formal name. When making that decision to take on the debt, you put an item on the todo list to pay it off.

That list of cut features becomes the nucleus of the plan for the next major version, or update, or DLC. Sometimes, the schedule did you a favor, you realize it was a bad idea, and that cut feature debt gets written off instead of paid off.

The more internal or infrastructure-type items become the ones you talk about with the phrase “we gotta do something about…”: the logging system, the metrics observability, that validation system, adding internationalization. Sometimes this isn’t a big formal effort, just a recognition that the next piece of work in that area is going to take a couple extra days to tidy up the mess we left last time.

Fundamentally, paying this off is a scheduling and planning problem, not a technical one. You had to have some kind of an idea about what the work was to make the decision to defer the work, so you can use that same understanding to find it a spot on the schedule.

That makes this the only category where you can actually pay it off. There’s a bounded amount of work you can plan around. If the work keeps getting deferred, or rescheduled, or kicked down the road, you need to stop and ask yourself if this is actually debt or something aspirational that went septic on you.

2. We made the right decision, but then things happened.

Sometimes you make the right decisions, don’t choose to take on any debt, and then things happen and the world imposes work on you anyway.

The classic example: Third party libraries move forward, the new version isn’t cleanly backwards compatible, and the version you’re using suddenly has a critical security flaw. This isn’t tech debt, you didn’t take out a loan! This is more like tech property taxes.

This is also a planning problem, but trickier, because it’s on someone else’s schedule. Unlike the tech debt above, this isn’t something you can pay down once. Those libraries or frameworks are going to keep getting updated, and you need to find a way to stay on top of them without making it a huge effort every time.

Of course, if they stop getting updated you don’t have an ongoing scheduling problem anymore, but you have the next category…

3. It seemed like a good idea at the time.

Sometimes you just guess wrong, and the rest of the world zigs instead of zags. You do your research, weigh the pros and cons, build what you think is the right thing, and then it’s suddenly a few years later and your CEO is asking why your best-in-class data rich web UI console won’t load on his new iPad, and you have to tell him it’s because it was written in Flash.

You can’t always guess right, and sometimes you’re left with something unsupported and with no future. This is very common; there’s a whole lot of systems out there that went all-in on XML-RPC, or RMI, or GWT, or Angular 1, or Delphi, or ColdFusion, or something else that looked like it was going to be the future right up until it wasn’t.

Personally, I find this to be the most irritating. Like Han Solo would say, it’s not your fault! This was all fine, and then someone you never met makes a strategic decision, and now you have to decide how or if you’re going to replace the discontinued tech. It’s really easy to get into a “if it ain’t broke don’t fix it” headspace, right up until you grind to a halt because you can’t hire anyone who knows how to add a new screen to the app anymore. This is when you start using phrases like “modernization effort”.

4. We did the best we could but there are better options now.

There’s a lot more stuff available than there used to be, and so sometimes you roll onto a new project and discover a home-brew ORM, or a hand-rolled messaging queue, or a strange pattern, and you stop and realize that oh wait, this was written before “the thing I would use” existed. (My favorite example of this is when you find a class full of static final constants in an old Java codebase and realize this was from before Java 5 added enums.)
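For anyone who hasn’t run into that fossil, here’s a minimal Java sketch of the two styles side by side. The OrderStatus names are hypothetical, made up for illustration rather than taken from any real codebase; the point is just what the int-constant holder class looked like, and what Java 5 enums replaced it with.

```java
import java.util.Locale;

public class Main {
    // Pre-Java 5 style (hypothetical names): a holder class of int constants.
    // Any int compiles as a "status", so bad values only show up at runtime.
    static final class OrderStatusConstants {
        static final int PENDING = 0;
        static final int SHIPPED = 1;
        static final int CANCELLED = 2;

        private OrderStatusConstants() {} // never instantiated
    }

    // Java 5 and later: a real enum, so the compiler rejects values outside the set.
    enum OrderStatus { PENDING, SHIPPED, CANCELLED }

    // Old style: has to defend against "impossible" ints like 42.
    static String describe(int status) {
        switch (status) {
            case OrderStatusConstants.PENDING:   return "pending";
            case OrderStatusConstants.SHIPPED:   return "shipped";
            case OrderStatusConstants.CANCELLED: return "cancelled";
            default:                             return "unknown";
        }
    }

    // Enum style: only the declared values can be passed in.
    static String describe(OrderStatus status) {
        return status.name().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(describe(OrderStatusConstants.SHIPPED)); // shipped
        System.out.println(describe(42));                           // unknown
        System.out.println(describe(OrderStatus.SHIPPED));          // shipped
    }
}
```

Neither version is broken, which is sort of the point: the old one works fine, it just predates the better tool.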

A lot of the time, the custom, hand-rolled thing isn’t necessarily “worse” than some more recent library or framework, but you have to have some serious conversations about where to spend your time; if something isn’t your core business and has become a commodity, it’s probably not worth pouring more effort into maintaining your custom version. Everyone wants to build the framework, but no one really wants to maintain the framework. Is our custom JSON serializer really still worth putting effort into?

Like the previous, it’s probably time to take a deep breath and talk about re-designing; but unlike the previous, the person who designed the current version is probably still on the payroll. This usually isn’t a technical problem so much as it is a grief management one.

5. We solved a different problem.

Things change. You built the right thing at the time, but now we got new customers, shifted markets, increased scale, maybe the feds passed a law. The business requirements have changed. Yesterday, this was the right thing, and now it isn’t.

For example: Maybe you had a perfectly functional app to sell mp3 files to customers to download and play on their laptops, and now you have to retrofit that into a subscription-based music streaming platform for smartphones.

This is a good problem to have! But you still gotta find a way to re-landscape that forest.

6. Context Drift.

There’s a pithy line that Legacy Code is “code without tests,” but I think that’s only part of the problem. Legacy code is code without continuity of philosophy. Why was it built this way? There’s no one left who knows! A system gets built in a certain context, and as time passes that context changes, and the further away we get from the original context, the more overgrown and weedy the system appears to become. Tests—good tests—are one way to preserve context, but not the only way.

A whole lot of what’s called “cruft” is here, because it’s harder to read code than to write it. A lot of that “cruft” is congealed knowledge. That weird custom string utility that’s only used in the one place? Sure, maybe someone didn’t understand the standard library, or maybe you don’t know about the weekend the client API started handing back malformed data and they wouldn’t fix it, and even worse, this still happens at unpredictable times.

This is both the easiest and least glamorous to treat, because the trick here is documentation. Don’t just document what the code does, document why the code does what it does, why it was built this way. A very small amount of effort while something is being planted goes a long way towards making sure the context is preserved. As Henry Jones Sr. says, you write it down so you don’t have to remember.

To put all that another way: Documentation debt is still tech debt.

7. Not debt, just an old mistake.

The one no one likes to talk about. For whatever reason, someone didn’t do A-quality work. This isn’t necessarily because they were incompetent or careless; sometimes shit happens, you know? This is the flip side of the original Tech Debt category: it wasn’t on purpose, but sometimes people are in a hurry, or need to leave early, or just can’t think of anything better.

And so for whatever the reason, the doors aren’t straight, there’s a bunch of unpainted plywood, those stairs aren’t up to code. Weeds everywhere. You gotta spend some time re-cutting your trails through the forest.

🌲

As we said at the start, each of those types of Overgrowth has its own root causes, but also needs a different kind of forest management. Next week, we start talking about techniques to keep the Overgrowth under control.


Software Forestry 0x02: What We Talk About When We Talk About Tech Debt

Tell me if this sounds familiar: you start a new job, or roll onto a new project, or even have a job interview, and someone says in a slightly hushed tone, words to the effect of “Now, I don’t want to scare you, but we have a lot of tech debt.” Or maybe, “this system is all Legacy Code.” Usually followed by something like “but don’t worry! We’ve got a modernization effort!”

Everyone seems to be absolutely drowning in “tech debt”; hardly a day goes by where you don’t read another article about some system with some terrible problem that was caused by being out of date, deferred maintenance, “in debt.” We constantly joke about the fragile house-of-cards nature of basically everything. Everyone is hacking their way, pun absolutely intended, through overgrown forests.

There’s a lot to unpack from all that. Other engineering disciplines don’t beg to rebuild their bridges or chemical plants after a couple of years, but they also don’t need to; they build them to last. How does this happen? Why is it like this?

For starters, I think this is one of those places where our metaphors are leading us wrong.

I can’t now remember when I first heard the term Technical Debt. I think it was the early twenty-teens. The place I was working in the mid-aughts had a lot of tech debt, but I don’t ever remember anyone using that term; the place I was working in the early teens also had a lot, and we definitely called it that.

One of the things metaphors are for is to make it easier to talk to people with a different background—software developers and business folks, for example. We might use different jargon in our respective areas of expertise, but if we can find an area of shared understanding, we can use that to talk about the same thing. “Debt” seems like a kind of obvious middle-ground: basically everyone who participates in the modern economy has a basic, gut-level understanding of how debt works.

Except, do they have the same understanding?

Personally, I think “debt” is a terrible metaphor, bordering on catastrophic. Here’s why: it has very, very different moral dimensions depending on who’s talking about it.

To the math and engineering types who popularized the term, “debt” is obviously bad, bordering on immoral. They’re the kind of people who played enough D&D as kids to understand how probability works, probably don’t gamble, probably pay off their credit cards in full every month. Obviously we don’t want to run up debt! We need to pay that back down! Can’t let it build up! Cue that scene in Ghostbusters where Egon is talking about the interest on Ray’s two mortgages.

Meanwhile, when the business-background folks making the decisions about where to put their investments hear that they can rack up “debt” to get features faster, but can pay it off in their own time with no measurable penalty or interest, they make the obvious-to-them choice to rack up a hell of a lot of it. They debt-financed the company with real money, why not the software with metaphorical currency? “We can do it, but that’ll add tech debt” means something completely different to the technical and business staff.

Even worse, “debt” as a metaphor implies that it ends. In real life, you can actually pay the debt off; pay off the house, end the car payment, pay back the municipal bonds, keep your credit cards up to date, whatever. But keeping your systems “debt free” is a process, a way of working, not really something you can save up and pay off.

I’m not sure any single metaphor has done more damage to our industry’s ability to understand itself than “tech debt.”

Of course, the definition of “tech debt” expanded and has come to encompass everything about a software system that makes it hard to work on or that the developers don’t like. “Cruft” is the word Fowler uses. “Tech debt”, “legacy”, “lack of maintenance” all kind of swirl into a big mish-mash, meaning, roughly, “old code that’s hard to work on.” Which makes it even less useful as a metaphor, because it covers a lot of different kinds of challenges, each of which calls for different techniques to treat and prevent. In fairness, Fowler takes a swing at categorizing tech debts via the Technical Debt Quadrant, which isn’t terrible, but is a little too abstract to reflect the lived reality.

This is a place where our Forestry metaphor offers up an obvious alternate metaphor: Overgrowth. Which gets close to the heart of what the problem feels like: that we built a perfectly fine system, and now, after no action on our part, it’s not fine. Weeds. There’s that sense that it gets worse when you’re not looking.

There’s something very vexing about this. As Joel said: As if source code rusted. But somehow, that system that was just fine not that long ago is old and hard to work on now. We talk about maintenance, but the kind of maintenance a computer system needs is very different from a giant engine that needs to get oiled or it’ll break down.

I think a big part of the reason why it seems so endemic nowadays is that there was a whole lot of appetite to rewrite “everything for the web” in either Java or .Net around the turn of the century, at the same time a lot of other software got rebuilt mostly from scratch to support, depending on your field, Linux, Mac OS X, or post-NT Windows. There hasn’t been a similar “replant the forest” mood since, so by the teens everyone had a decade-old system with no external impetus to rebuild it. For a lot of fields, this was the first point where we had to think in terms of long term maintenance instead of the OS vendor forcing a rebuild. (We all became mainframe programmers, metaphorically speaking.) And so, even though the Fowler article dates to ’03, and the term is older than that, “Tech Debt” became a twenty-teens concern. Construction stopped being the main concern, replaced with care and feeding.

Software Forests need a different kind of tending than the old rewrite-updates-rewrite again loop. As Foresters, we know the codebases we work on were here before us, and will continue on after us. The occasional greenfield side project, the occasional big new feature, but mostly our job is to keep the forest healthy and hand it along to the next Forester. It takes a different, longer term, continuous world view than counting down the number of car payments left.

Of course, there’s more than one way a forest can get out of hand. Next Time: Types of Overgrowth, catalogued


Software Forestry 0x01: Somewhere Between a Flower Pot and a Rainforest

“Software” covers a lot of ground. As there are a lot of different kinds and ecosystems of forests, there are a lot of kinds and ecosystems of software. And like forests, each of those kinds of software has their own goals, objectives, constraints, rules, needs.

One of the big challenges when reading about software “processes” or “best practices” or even just plain general advice is that people so rarely state up front what kind of software they’re talking about. And that leads to a lot of bad outcomes, where people take a technique or a process or an architecture that’s intrinsically linked to its originating context out of that context, recommend it, and then it gets applied to situations that are wildly inappropriate. Just like “leaves falling off” means something very different in an evergreen redwood forest than it does in one full of deciduous oak trees, different kinds of software projects need different care. As practitioners, it’s very easy for us to talk past each other.

(This generally gets cited in cases like “if you aren’t a massive social network with a dedicated performance team you probably don’t need React,” but also, pop quiz: what kind of software were all the signers of the Agile Manifesto writing at the time they wrote and signed it?1)

So, before we delve into the practice of Software Forestry, let’s orient ourselves in the landscape. What kinds of software are there?

As usual for our industry, one of the best pieces written on this is a twenty-year-old Joel on Software article,2 where he breaks software up into Five Worlds:

  1. Shrinkwrap (which he further subdivides into Open Source, Consultingware, and Commercial web based)
  2. Internal
  3. Embedded
  4. Games
  5. Throwaway

And that’s still a pretty good list! I especially like the way he buckets not based on design or architecture but more on economic models and business constraints.

I’d argue that in the years since that was written, “Commercial web-based” has evolved to be more like what he calls “Internal” than “Shrinkwrap”; or more to the point, those feel less like discrete categories than they do like convenient locations on a continuous spectrum. Widening that out a little, all five of those categories feel like the intersections of several spectrums.

I think spectrums are a good way to view the landscape of modern software development. Not discrete buckets or binary yes/no questions, but continuous ranges where various projects land somewhere in between the extremes.

And so, in the spirit of an enthusiastic “yes, and”, I’d like to offer up what I think are the five most interesting or influential spectrums for talking about kinds of software, which we can express as questions sketching out a left-to-right spectrum:

  1. Is it a Flower Pot or a Sprawling Forest?
  2. Does it Run on the Customer’s Computers or the Company’s Computers?
  3. Are the Users Paid to Use It or do they Pay to Use It?
  4. How Often Do Your Customers Pay You?
  5. How Much Does it Matter to the Users?
🌲

Is it a Flower Pot or a Sprawling Forest?

This isn’t about size or scale, necessarily, as much as it is about overall “complexity”, the number of parts. On one end, you have small, single-purpose scripts running on one machine; on the other end, you have sprawling systems with multiple farms or clusters interacting with each other over custom messaging busses.

How many computers does it need? How many different applications work together? Different languages? How many versions do you have to maintain at once? What scale does it operate at?3 How many people can draw an accurate diagram from memory?

This has huge impacts on not only the technology, but things like team structure, coordination, and planning. Joel’s Shrinkwrap and Internal categories are on the right here, the other three are more towards the left.

🌳

Does it Run on the Customer’s Computers or the Company’s Computers?

To put that another way, how much of it works without an internet connection? Almost nothing is on one end or the other; no one ships dumb terminals or desktop software that can’t call home anymore.

Web apps are pretty far to the right, depending on how complex the in-browser client app is. Mobile apps are usually in the middle somewhere, with a strong dependency on server-side resources, but also will usually work in airplane mode. Single-player Games are pretty far to the left, only needing server components for things like updates and achievement tracking; multiplayer starts moving right. Embedded software is all the way to the left. Joel’s Shrinkwrap is left of center, Internal is all the way to the right.

This has huge implications for development processes; as an example, I started my career in what we then called “Desktop Software”. Deployment was an installer which got burned to a disk. Spinning up a new test system was unbelievably easy: pull a fresh copy of the installer and install it into a VM! Working in a microservice mesh environment, there are days that feels like the software equivalent of Greek fire, a secret long lost. In a world of sprawling services, spinning up a new environment is sometimes an insurmountable task.

A final way to look at this: how involved do your users have to be with an update?

🌲

Are the Users Paid to Use It or do they Pay to Use It?

What kind of alternate options do the people actually using the software have? Can they use something else? A lot of times you see this talked about as being “IT vs commercial,” but it’s broader than that. On the extreme ends here, the user can always choose to play a different mobile game, but if they want to renew their driver’s license, the DMV webpage is the only game in town. And the software their company had custom built to do their job is even less optional.

Another very closely related way of looking at this: Are your Customers and Users the same people? That is, are the people looking at the screen and clicking buttons the same people who cut the check to pay for it? The oft-repeated “if you’re not the customer you’re the product” is a point center-left of this spectrum.

The distance between the people paying and the people using has profound effects on the design and feedback loops for a software project. As an extreme example, one of the major—maybe the most significant—differences between Microsoft and Apple is that Microsoft is very good at selling things to CIOs, and Apple is very good at selling things to individuals, and neither is any good at selling things the other direction.

Bluntly, the things your users care about and that you get feedback on are very, very different depending on if they paid you or if they’re getting paid to use it.

Joel’s Internal category is all the way to the left here, the others are mostly over on the right side.

🌳

How Often Do Your Customers Pay You?

This feels like the aspect that’s exploded in complexity since that original Joel piece. The traditional answer to this was “once, and maybe a second time for big upgrades.” Now though, you’ve got subscriptions, live service models, “in-app purchases”, and a whole universe of models around charging a middle-man fee on other transactions. This gets even stranger for internal or mostly-internal tools; in my corporate life, I describe this spectrum as a line where the two ends are labeled “CAPEX” and “OPEX”.

Joel’s piece doesn’t really talk about business models, but the assumption seems to be a turn-of-the-century Microsoft “pay once and then for upgrades” model.

🌲

How Much Does it Matter to the Users?

Years and years ago, I worked on one of the computer systems backing the State of California’s welfare system. And on my first day, the boss opened with “however you feel about welfare, politically, if this system goes down, someone can’t feed their kids, and we’re not going to let that happen.” “Will this make a kid hungry” infused everything we did.

Some software matters. Embedded pacemakers. The phone system. Fly-by-wire flight control. Banks.

And some, frankly, doesn’t. If that mobile game glitches out, well, that’s annoying, but it was almost my appointment time anyway, you know?

Everyone likes to believe that what they’re working on is very important, but they also like to be able to say “look, this isn’t aerospace” as a way to skip more testing. And that’s okay; there’s a lot of software where, if it goes down for an hour or two, or glitches out on launch and needs a patch, that’s not a real problem. A minor inconvenience for a few people, forgotten about the next day.

As always, it’s a spectrum. There’s plenty of stuff in the middle: does a restaurant website matter? In the grand scheme of things, not a lot, but if the hours are wrong that’ll start having an impact on the bottom line. In my experience, there’s a strong perception bias towards the middle of this spectrum.

Joel touches on this with Embedded, but mostly seems to be fairly casual about how critical the other categories are.

🌳

There are plenty of other possible spectrums, but over the last twenty years those are the ones I’ve found myself thinking about the most. And I think the combination does a reasonable job sketching out the landscape of modern software.

A lot of things in software development are basically the same regardless of what kind of software you’re developing, but not everything. Like Joel says, it’s not like Id was hiring consultants to make UML diagrams for DOOM, and so it’s important to remember where you are in the landscape before taking advice or adopting someone’s “best practices.”

As follows from the name, Software Forestry is concerned with forests—the bigger systems, with a lot of parts, that matter, with paying customers. In general, the things more on the right side of those spectrums.

As Joel said 22 years ago, we can still learn something from each other regardless of where we all stand on those spectrums, but we need to remember where we’re standing.

🌲

Next Time: What We Talk About When We Talk About Tech Debt


  1. I don’t know, and the point is you don’t either, because they didn’t say.
  2. This almost certainly won’t be the last Software Forestry post to act as extended midrash on a Joel On Software post.
  3. Is it web scale?

Software Forestry 0x00: Time For More Metaphors

Software is a young field. Creating software as a mainstream profession is barely 70 years old, depending on when you start counting. Its legends are still, if just, living memory.

Young enough that it still doesn’t have much of its own language. Other than the purely technical jargon, it’s mostly borrowed words. What’s the verb for making software? Program? Develop? Write? Similarly, what’s the name for someone who makes software? Programmer? Developer? We’ve settled, more or less, on Engineer, but what we do has little in common with other branches of engineering. Even the word “computer” is borrowed; not that long ago a computer was something like an accountant, a person who computed.1 None of this is a failing, but it is an indication of how young a field this is.

This extends to the metaphors we use to talk about the practice of creating that software. Metaphors are a cognitive shortcut, a way to borrow a different context to make the current one easier to talk about. But they can also be limiting; you can trap yourself in the boundaries of the context you borrowed.

Not that we’re short on metaphors, far from it! In keeping with the traditions of American Business, we use a lot of terms from both Sports (“Team”, “Sprint,” “Scrum”) and the Military (“Test Fire,” “Strategy vs. Tactics”). The seminal Code Complete proposed “Construction”. Knuth called it both an Art and a branch of Literature. We co-opted the term “Architecture” to talk about larger designs. In recent years, you see a lot of talk about “Craft.” “Maintenance-Oriented Programming.” For a while, I used movies. (The spec is the script! Specialized roles all coming together! But that was a very leaky abstraction.)

The wide spread of metaphors in use shows how slippery software can be, how flexible it is conceptually. We haven’t quite managed to grow a set of terms native to the field, so we keep shopping around looking for more.

I bring this up because what’s interesting about the choice of metaphors isn’t so much the direct metaphors themselves but the way they reflect the underlying philosophy of the people who chose them.

There are two things I don’t like about a lot of those borrowed metaphors. First, most of them are Zero Sum. They assume someone is going to lose, and maybe even worse, they assume that someone is going to win. I’d be willing to entertain that that might be a useful way to think about a business as a whole in some contexts, but for a software team, that’s useless to the point of being harmful. There’s no group of people a normal software team interacts with that they can “beat”. Everyone succeeds and fails together, and they do it over the long term.

Second, most of them assume a clearly defined end state: win the game, win the battle, finish the building. Most modern software isn’t like that. It doesn’t get put in a box in Egghead anymore. Software is an ongoing effort, it’s maintained, updated, tended. Even software that’s not a service gets ongoing patches, subscriptions, DLC, the next version. There isn’t a point where it is complete, so much as ongoing refinement and care. It’s nurtured. Software is a continuous practice of maintenance and tending.

As such, I’m always looking for new metaphors; new ways of thinking about how we create, maintain, and care for software. This is something I’ve spent a lot of time stewing on over the last two decades and change. I’ve watched a lot of otherwise smart people fail to find a way to talk about what they were doing because they didn’t have the language. To quote every infomercial: there has to be a better way.

What are we looking for? Situations where groups of people come together to accomplish a goal, something fundamentally creative, but with strict constraints, both physical and by convention. Where there’s competition, but not zero sum, where everyone can be successful. Most importantly, a conceptual space that assumes an ongoing effort, without a defined ending. A metaphor backed by a philosophy centered around long-term commitment and the way software projects sprawl and grow.

“Gardening” has some appeal here, but that’s a little too precious and small-scale, and doesn’t really capture the team aspect.2 We want something larger, with people working together, assuming a time scale beyond a single person’s career, something focused on sustainable management.

So, I have a new software metaphor to propose: Software Forestry.

These days, software isn’t built so much as it’s grown, increment by increment. Software systems aren’t a garden, they’re a forest, filled with a whole ecosystem of different inhabitants, with different sizes, needs, uses. It’s tended by a team of practitioners who, like foresters, maintain its health and shape that growth. We’re not engineers as much as caretakers. New shoots are tended, branches pruned, fertilizer applied, older plants taken care of, the next season’s new trees planned for. But that software isn’t there for its own sake, and as foresters we’re most concerned with how that software can serve people. We’re focused on sustainability; we know now that the new software we write today is the legacy software of tomorrow. Also “Software Forestry” means the acronym is SWF, which I find hilarious. And personally, I really like trees.3 Like with trees, if we do our jobs right this stuff will still be there long after we’ve moved on.

It’s easy to get too precious about this, and let the metaphor run away with you; that’s why there were all those Black Belts and Ninjas running around a few years ago. I’m not going to start an organization to certify Software Rangers.4 But I think a mindset around care and tending, around seasons, around long-term stewardship, around thinking of software systems as ecosystems, is a much healthier relationship to the software industry we actually have than telling your team with all seriousness that we have to get better at blocking and tackling. We’re never going to win a game, because there’s no game to win. But we might grow a healthy forest of software, and encourage healthier foresters.

Software Forestry is a new weekly feature on Icecano. Join us on Fridays as we look at approaches to growing better software. Next Time: What kind of software forests are we growing?


  1. My grandmother was a “civilian computer” during the war; she computed tables describing when and how to release bombs from planes to hit a target. The bombs in those tables were larger than normal, needing new tables computed late in the war. She thought nothing of this at the time, but years later realized she had been working out tables for atomic bombs. Her work went unused; she became a minister.

  2. Gardening seems to pop up every couple of years; searching the web turns up quite a few abandoned swings at Software Gardening as a concept.
  3. I did briefly consider “Software Arborists”, but that’s a little too narrow.
  4. Although I assume the Dúnedain would make excellent programmers.