The Bloat Point

Mon 24 February 2014

Have you ever felt that there was a point where your codebase escaped from you? Where obvious bugs go unseen for long periods of time, and files lurk in the source tree, unauditable and ready to devour flesh [1]?

This is what I like to call "the bloat point." It’s the point where no one person can keep the details of the entire software project in their head, and code bloat becomes inevitable. Steve Yegge wrote at length about the problem of code size, and I agree with his sentiment - bloat turns a code base rotten very quickly. It’s tricky determining exactly where the bloat point lies.

There’s no one number that can be used for determining the lines of code (LOC) where one hits the bloat point, because all languages read differently. However, I’ve found that reading roughly 2,000 LOC/day is possible (it varies from person to person). If the codebase is a million lines, it would take 500 days to read every single line. And since codebases are rarely static, you can imagine it could stretch to 2 years before you could say you’ve read the codebase completely. Keep in mind, this is just reading - we haven’t even factored in reporting bugs, fixing bugs, or anything like that.

The FreeBSD codebase currently sits around 11.5 million LOC [2]. At 2,000 LOC/day, it would take you almost 16 years to fully read the source tree, assuming it isn’t touched while you’re going through it. Last I heard, the Linux kernel alone has passed the 15 million LOC mark (much of its bulk is device drivers).
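
As a rough sketch of the arithmetic above: the snippet below assumes a fixed pace of 2,000 LOC/day, reading every day of the year, and a source tree that never changes while you read it. The function names are made up for illustration, and the FreeBSD figure is the approximate count quoted above.

```python
# Assumed reading pace from the text; it varies from person to person.
LOC_PER_DAY = 2_000


def reading_days(total_loc, loc_per_day=LOC_PER_DAY):
    """Days needed to read total_loc lines once, assuming a frozen codebase."""
    return total_loc / loc_per_day


def reading_years(total_loc, loc_per_day=LOC_PER_DAY):
    """Same estimate in years, assuming you read all 365 days of the year."""
    return reading_days(total_loc, loc_per_day) / 365


for name, loc in [("1M-line codebase", 1_000_000), ("FreeBSD (approx.)", 11_500_000)]:
    print(f"{name}: {reading_days(loc):.0f} days, {reading_years(loc):.1f} years")
```

Halving the assumed pace doubles every estimate, so these numbers are only as good as the 2,000 LOC/day guess.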

This is a problem, and not one that’s often talked about because it kills a few sacred cows. There are real human limits on how much information any one person can process, and even rockstar programmers need to eat and sleep (even if only a little of both). This isn’t a matter of 1337ness, but of there being only so many hours in the day.

There are several strategies for managing a large codebase. A common strategy is to have ‘domain experts’ who oversee select swathes of code. However, this still leaves projects susceptible to bloat, since the experts only know about duplication within their own section of code. As the size increases, so does the need for more domain experts. Since skill level varies across a group, the quality of the project will be uneven. Breaking a large codebase into smaller chunks to accommodate cognitive limits still means a large codebase in the aggregate.

One of the goals I’ve set in my new project NiceBSD is to have a useful base system under 1 million LOC. While this will be accomplished using a variety of methods (using Go, reducing source duplication, reducing the responsibility of the main source tree), the number is set with the explicit idea that one person should be able to review and understand the system as a whole.

Battling bloat is everyone’s responsibility, and the process of fighting it leads to better code.

[1] Yes, I’m aware that this was OK’d by FreeBSD core in 2004. I think it was short-sighted to put this in the source tree, as this is just as bad as a binary. There’s no easy way for a third party to audit the code.
[2] Found using David Wheeler’s sloccount.