Is code an asset or a liability?
We code because we want to solve a specific problem – from finding the shortest path to the airport to building a new strategy game.
Over time, as the solution’s complexity grows, it gets harder to keep it running smoothly. With the shortest path, each new data type included (e.g. traffic conditions) makes integrating the next one (say, weather) harder. In the case of a strategy game, each added character class makes it harder to keep the classes balanced. Remember StarCraft and the patches trying to fix the imbalance?
The good old days
Once you’re in serious production, code changes get harder. Even innocent changes, like 200ms extra latency on a trip to the database, can easily snowball into an outage under the right circumstances. Dealing with this complexity adds delays to your timelines.
These delays are impossible to plan for, because you can only learn how a change behaves once it is in production, hence all the canary release shenanigans. This comes on top of distractions such as customer bugs, dependency updates, keeping your hosting bill under control, etc.
The team that was so successful in shipping things is suddenly struggling to make even simple changes. So you start hearing folk tales about the good old days when “all of this was built in a quarter with a team of three.”
What changed? Why are we so slow now?
So, what has changed since the good old days?
Well, we’re in production, and keeping production reliable, bug-free, and maintainable requires a certain level of capacity from the team. Let’s call this the required carrying capacity.
If the team invests less than the required carrying capacity, the product will suffer outages, unexpected bugs, poor performance, security issues, and, as a result, unhappy customers. These problems won’t surface immediately – they may surface months later, so connecting cause and effect is hard.
The key lesson: each time you launch something, you add to the required carrying capacity. Not investing the required carrying capacity into your software is akin to not servicing your car – it ends up costing you more later.
The vicious cycle begins
As required carrying capacity eats into a team’s total capacity, teams deprioritize maintenance work: they delay dependency upgrades, let dashboards go stale, and ignore the tail performance implications of certain changes. This work is typically deprioritized in favor of new features, which, when launched, add to the required carrying capacity, and so the vicious cycle begins.
Eventually, things get so bad in production (frequent outages, bugs), or code gets so hard to change that teams go into crisis mode and either declare bankruptcy (e.g., by doing a big v2 project) or deliberately focus all their capacity on getting things under control.
In the good old days we had no required carrying capacity, so we shipped a ton of stuff fast. Everything we shipped added to the required carrying capacity, which eventually ate up all our capacity.
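To make that dynamic concrete, here is a toy simulation. It is only a sketch under loud assumptions: the function name `simulate`, the person-week units, and all the numbers are made up for illustration, every shipped feature is assumed to cost the same fixed upkeep each quarter, and the team is assumed to pay that upkeep in full before building anything new.

```python
def simulate(total_capacity, feature_cost, upkeep_per_feature, quarters):
    """Return one (features_shipped, required_carrying_capacity) pair per quarter.

    All quantities are in person-weeks per quarter (an assumed unit).
    """
    required = 0  # nothing is in production yet: the good old days
    history = []
    for _ in range(quarters):
        # Upkeep is paid first; only the remainder can go into new features.
        free = max(total_capacity - required, 0)
        shipped = free // feature_cost
        # Every launched feature adds to the upkeep owed from now on.
        required += shipped * upkeep_per_feature
        history.append((shipped, required))
    return history


# Hypothetical numbers: 3 people x 12 weeks per quarter, 4 weeks to build
# a feature, 1 week per quarter to keep each shipped feature healthy.
history = simulate(total_capacity=36, feature_cost=4, upkeep_per_feature=1, quarters=12)
for quarter, (shipped, required) in enumerate(history, start=1):
    print(f"Q{quarter}: shipped {shipped}, required carrying capacity {required}/36")
```

With these invented numbers the team ships nine features in its first quarter and none by the eleventh, even though nobody left and nobody got worse at their job. Real teams are messier, since they tend to skip upkeep rather than pay it, which postpones the slowdown and makes it worse.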
Can we bring back the good old days?
It might be tempting to measure the required carrying capacity to support discussions around feature-maintenance trade-offs, but that’s impractical because you can only measure the required carrying capacity after the fact. For example, you can’t accurately measure the impact on production operations until you have an outage, or code maintainability until you get stuck working on the code.
That being said, some crude proxy metrics could include counting Jira tickets with a given label or production lines of code per engineer. As usual with engineering productivity metrics, buyer beware – chasing them directly is going to be harmful. They’re more appropriately used as one of many signals when comparing multiple teams to see if there are potential improvement ideas to be learned across teams.
Required carrying capacity is not meant to be directly or accurately measured – it’s a mental framework to use to trigger important conversations before they’re obviously needed. To borrow from Henry Mintzberg – if you can’t measure it, you’d better manage it.
You can only bring back the good old days if you keep your required carrying capacity low. You can do that by killing features, investing in automated tests and observability, and continuously simplifying your architecture to keep the cognitive load low.
New people add complexity
It might be tempting to increase the team’s total capacity by adding people to compensate for the increased required carrying capacity. That can work to an extent, as long as these people are properly onboarded, i.e., given the appropriate mentorship and time to learn the context. Done badly, adding people adds more complexity, further increasing the required carrying capacity.
Extreme programming (XP) was a neat attempt from the 90s that never became mainstream, even though its distant cousins like TDD or Scrum did. The core idea was to continuously evolve all aspects of the software being built – quality, customer requirements, and testability. Intuitively, it feels like XP teams should have a lower required carrying capacity than others. Probably one of the main reasons XP failed to become the main way we build software today is that it requires a lot of discipline, and that makes it fragile – it’s easy to relapse under pressure.
So, how can you keep your team disciplined and with reasonable required carrying capacity?
It’s a long game that requires continuous work. The key takeaway is that required carrying capacity needs to be a recurring topic in your team.
Here are some ideas that might help with that:
- Try to estimate your required carrying capacity over the past month: how much unexpected or maintenance work can be attributed to it? How many production issues have you had?
- Use this analysis to work out what could have prevented that work. That should inspire your next investment in reducing required carrying capacity.
- Try to have a slightly lower required carrying capacity each month compared to the previous month.
- When you have multiple teams, try comparing them to see which ones are doing better and why, and let that inspire new work to reduce required carrying capacity.
- When planning your roadmap, all features need to carry their own weight: each must justify the required carrying capacity it adds. Consider doing regular spring cleaning like Google.
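The first idea on the list can start as a rough tally rather than a tool. The sketch below assumes you can export last month's work items with hours and labels; the `WorkItem` shape, the label names, and the sample data are all hypothetical, and nothing here talks to a real ticketing system. Treat the result as a conversation starter, not a target to chase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    title: str
    hours: int
    labels: frozenset


# Hypothetical labels marking work done just to keep production healthy.
CARRYING_LABELS = frozenset({"maintenance", "unplanned", "incident"})


def carrying_share(items):
    """Fraction of logged hours spent on carrying work (0.0 if nothing was logged)."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    carrying = sum(item.hours for item in items if item.labels & CARRYING_LABELS)
    return carrying / total


def incident_count(items):
    """Number of work items labeled as production incidents."""
    return sum(1 for item in items if "incident" in item.labels)


# Made-up sample of last month's work.
last_month = [
    WorkItem("New onboarding flow", 30, frozenset({"feature"})),
    WorkItem("Upgrade database driver", 10, frozenset({"maintenance"})),
    WorkItem("Checkout outage follow-up", 6, frozenset({"incident", "unplanned"})),
    WorkItem("Fix flaky deploy script", 4, frozenset({"unplanned"})),
]

print(f"carrying share: {carrying_share(last_month):.0%}")
print(f"production incidents: {incident_count(last_month)}")
```

Tracking the same crude number month over month is what makes it useful: the absolute value matters less than whether it is drifting up or down.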
This article was written by:
Ivan Klarić:
Ivan has been building teams and software in diverse domains (from real-time betting to search systems) for the past 20 years, of which the past 10 years have been in scale-ups and big tech. Still loves coding.
Franjo Stipanović:
Franjo is an experienced software and security engineer with a strong background in engineering and security leadership. Worked on projects across diverse tech stacks, from small startups to global corporations.