A year or two ago I had an opportunity to sit down with AWS's Marcin Kowalski in a cafeteria and discuss the problems of software development at almost-unimaginable scale. I walked away with a new (for me) conception of software engineering that is part engineering, part organic biology, and I've found this perspective has shifted my approach to software development in a powerful and immensely helpful way.
As Computer Scientists and Software Engineers, we've been trained to employ precision in algorithm design, architecture and implementation: Everything must be Perfect. Everything must be Done Right.
For smaller, isolated projects, this engineering approach is critical, sound and practical, but once we begin to creep into the world of integrated solutions and micro-services it rapidly begins to break down.
We simply don't have time to rewrite everything.
We don't have time to build "perfect" solutions.
We can't predict the future.
Context matters. Circumstances matter. The properties and requirements of an emergent system take precedence over those of its constituent parts.
"You cannot upgrade an airplane wing in-flight"
In the world of hardware, it is unusual to be able to repair or upgrade something broken, damaged or inferior without disconnecting it or shutting it down. In contrast, software can generally be upgraded "live" using a kind of sleight-of-hand to switch between different iterations of a project, and these days the cost of cloning an entire solution is often negligible.
For many solutions, though, this doesn't apply and it doesn't really scale. At present, for example, I build software that operates on cargo ships, and the cost of a repair or major overhaul while at sea (or even the cost of getting service technicians on board while in dock during a pandemic) is exceedingly high. For a large cloud provider, even the tiniest risk of outage is entirely unacceptable.
While it's safe to say that there are more software developers around than ever before, good ones don't come cheap; between failing post-pandemic economies and an interminable lack of resources, companies can rarely afford to throw development hours into anything that isn't critical to their business - no matter how crucial R&D efforts can be to any organization's long-term success, no matter how much better a product or component might be if it were redesigned or rewritten.
It is due to these circumstances that all companies, large and small, must operate iteratively and rely on incremental improvements.
Defining "Technical Debt"
"Technical Debt" is a well-known term, but I've come to believe that it's not, in fact, a real thing. As a term we use it to describe anything that's built... sub-optimally. A "temporary" hack, or a workaround, that we pretend we'll get around to sorting out at some unspecified point in the future (I addressed this in a previous article).
While it's true that "technical debt" describes imperfections that we should try to avoid, it has negative connotations that are not necessarily deserved. It's too easy to view our predecessors and our former selves unfavourably, but I find it more constructive to frame these rushed hacks, workarounds and "suboptimal" decisions as unavoidable constructs that were good ideas at the time. I say "were" good ideas rather than "seemed like" because, just as in biological evolution, we generally select for short-term advantages and adapt to our current environments rather than preparing for an uncertain future.
I don't have any alternative terms to offer, but I would like to talk about "legitimate" sources of technical debt as opposed to "illegitimate" sources:
A legitimate source of technical debt is a hack or work-around demanded by an unavoidable external source, such as an upstream dependency, or urgent and unanticipated customer pain.
An illegitimate source of technical debt is a hack or work-around required to implement unplanned or unsupported features or use-cases generated with artificial urgency by an internal source, such as a product manager, a marketing team, or unhealthy company politics.
Regardless of whether the technical debt was generated legitimately or not, what we really have at any given moment is the current state of the software, and whether it is operational or not. It might be worth investing time in figuring out where in an organization technical debt is coming from and putting processes in place to reduce it, but there will always be technical debt and it isn't very constructive to cry over spilled milk.
Caring for your Iron Giant
For many years I've complained about the broken state of our "global software ecosystem", drawing on my experiences with proprietary monoliths and micro-service architectures, as well as the enormous emergent system of interdependent open-source software projects that we all know and love (or love to hate). I hadn't yet[1] seen the 1999 film "The Iron Giant", but during that fateful conversation I was struck with an image of a giant, humanoid robot made up of millions of tiny parts and slowly, inexorably walking forward towards some unseen goal.
While the Iron Giant behaves as a single organism, whenever it is damaged or smashed into its individual components, it becomes clear that it is actually a self-organizing and self-repairing community much like those of plants or animals. Come to think of it, individual human beings are, in fact, self-organizing and self-repairing communities of cells, organs and bacteria as well!
In organic (biological) evolution, organisms are primarily occupied with survival. In order to survive, an organism must not only consume what it needs to operate and propagate, but it must do so in a way that is compatible with the ecosystem in which it resides. When it comes to the most fundamental parts of an organism - its cells - there are neither opportunities nor capabilities for a fresh start, but in the long run those cells will fail if they cannot collectively adapt to their environment.
That's exactly where slow and iterative evolution comes in. It's the software equivalent of evolving DNA, only it's more like an epigenetic response to environmental pressure and somewhat less random. Small changes that make our software stronger and more resilient, small changes that can be made not only without disabling our machines, or bringing them to an outright halt, but that can be made without even breaking their stride.
Theory, Practice and Ideals
"Prototypes" are generally developed rapidly in order to gather information about what's possible and viable.
"Minimum Viable Products" are usually intended as marginally longer-term learning tools, where investment in future planning and architectural extensibility might be welcomed but is generally not budgeted for.
Over the course of the past few decades the software industry has moved from detailed and lengthy waterfall design processes towards fail-fast iterative approaches, but even if the models have shifted, for individual developers the ideal of building perfect software that will last for all eternity prevails. In a typical modern-day scenario, software solutions tend to begin their journeys into production as prototypes and MVPs, then bear that legacy to the ends of their lifetimes.
Idealistic developers and managers tend to find "legacy code" a source of frustration, when in fact it is an inevitable outcome both of the evolution of the software and of the developers themselves. Legacy code is akin to a machine of a different age still operating long after we would have expected it to fail, having taken on more responsibilities than it was originally intended for. Perhaps, in our fast-paced and ever-shifting software landscape, it would benefit us to consider this an impressive feat of engineering!
Please don't get me wrong - I am not advocating for poorly designed architecture or poorly written code. What I am advocating for is dropping emotion-bound perfectionism and taking a more pragmatic approach to design and development that takes circumstances and context into account. I will be the first to admit to compulsive "boy-scouting" - to a fault, I always try to improve any code I work on or around - but for the sake of our own sanity and the success of our enterprises we need to realize that our time and our resources are precious and limited, that we cannot and should not fix or replace everything all in one go, and that it's absolutely acceptable to fix things iteratively rather than tear them down and start all over again.
In most cases, the truth is that we don't need new software, just better software. Why reinvent the wheel when we can take someone else's slightly squarish wheel and make it round?
For every element of technical debt we encounter, it would be helpful to ask the following questions:
"If the solution has been written in an inappropriate[2] language, will we be better off maintaining it as is, or migrating it to a new language completely? What are the short and long-term trade-offs?"
"If a solution is not sufficiently extensible for our needs, should we invest in rebuilding it, retrofitting it, hacking in what we can? Or should we attempt to shift users to an entirely new solution?"
These are nuanced decisions to make, and it's easy for us to let our biases and prejudices make those decisions for us and get in the way of doing what's right at the time. To illustrate this with an example from my own personal experience:
I once worked on a PHP product that was fraught with evil Anti-patterns, Bugs, catastrophically poor Code and Design, alongside a team of PHP developers that was thoroughly uninterested in developing in any other language. Was PHP itself to blame? Partially[3], and as someone who really doesn't approve of PHP[4], I found that the easiest solution to reach for was a migration to something like Node.js. In reality, though, what I was looking at was a monolithic beast built with limited resources that was somehow or other managing to pull its weight. It made no sense to throw away the code (or the developers), so we had to take a different approach: finding ways to iteratively improve the organism (both the code and developers) without letting any part fail. This turned out to be a complex problem, but entirely solvable, whereas any solution that didn't include that legacy would have engendered wholesale chaos and would have been unlikely to result in success.
A Taoist's Summary
If there's any take-away to this article, let it be this: What we have are obstacles and challenges. What we need are solutions. It doesn't benefit us to come at those obstacles and challenges as if they're somehow in the wrong. If we accept that everything that has brought us to the current state of the solution had a context and purpose, and we accept that we are currently in yet another situation that has context and purpose, then we can ask the questions and provide the solutions that let us influence our software's evolution in a healthy direction, even if it's only just a nudge.
Above all, let us remember that what we are building is just a tiny moving part amongst a myriad of moving parts, and that our role as engineers is not only to keep our Iron Giant operational, but to help it to be a hero and not a villain.
[1] I picked it up recently for my five-year-old, and it's now one of my favourite movies.
[2] "Inappropriate" here could mean poorly fitting the solution, or no longer supported, or hard to find developers for, or just unnecessarily difficult to work with.
[3] I've had a number of PHP gigs, even written a couple of prototypes for myself using it, and it's always immediately apparent that almost anything else would be an improvement.
[4] To be fair, neither does its creator.