Kaizen: Make the code base we all work in a little bit better every time we interact with it

by Dean Dieker


I think there’s a universal tendency amongst software engineers to share the latest ‘buggy code’ or ‘horrible design’ found in the codebase. As engineers we pride ourselves on our craft, and we know that diving into even a 6-month-old feature can be fraught with rabbit holes and poorly commented functionality that takes a fair amount of reasoning to understand. Further, our fellow engineers encourage us to share these ‘nuggets’, and will (rightly!) commiserate with us as we describe the latest ‘gotcha’ we discovered while implementing v1.3 of that feature written by some-guy-who-doesn’t-work-here-anymore-and-oh-god-why-don’t-we-have-a-type-system.

During these discussions it’s also common to brainstorm solutions, everything from small tweaks and specific error handling to HTTP timeouts and cleaner business logic (with comments!) to complete refactors. As engineers we also pride ourselves on being able to collaborate to come up with elegant (and hopefully straightforward) solutions to the problems we encounter in our day-to-day work. Further, any engineer who has solved a problem directly (whether it was their own or someone else’s) knows the feeling of ecstasy that comes with committing a fix for that bug or that broken piece of code.

 

Of course, an engineer’s job is to balance these refactors and fixes with the needs of the business. It’s impossible to devote all of our time to fixing bugs or paying down tech debt; instead, we have to balance new development with ‘just the right amount’ of code cleanup and refactoring. While this sounds fantastic, along with this autonomy comes a lot of responsibility: the responsibility to actually make time for that refactoring and code improvement work, while still ensuring that the features we collaborate with Product on get done on time and to spec.

Logfiles and PIDs

It was during one of these exact discussions that the idea of simply “doing it” popped into my head. We were in the Tapjoy Boston kitchen discussing a frequent issue with tailing logs when diving into production issues. Our production servers run multiple Unicorn workers that all write to the same log file. It wasn’t unusual to be following a specific request, only to have log messages from other requests handled by the same server interleaved with it, making it difficult to tell which messages went together.

Take the following code for example:

Does “Doing some stuff” occur before or after rendering a 204? Does “Doing some more stuff” always happen in conjunction with “Doing some stuff” or is that simply a log statement from a different process writing to the same log, or a different type of request altogether?  It was impossible to follow the flow of a request all the way through.

Naturally, for the first few minutes our group of engineers complained about how annoying it was that this problem wasn’t already solved for us by Rails. “After all, logging is a solved problem,” we declared, followed by the arguments for and against various logging packages. Eventually, we came around to declaring that it would be trivial to implement a solution that would prepend the PID to every log line... and everyone walked back to their desks to do their work for the day.

For whatever reason, that day I felt differently about the problem. I decided that it would be worth the ‘trivial’ amount of time to figure out a solution for prepending the PID. I would defer working on my product-related tasks for a few minutes while I chased down a solution that I hoped would save future engineers cumulative hours of debugging and sifting through logs. After about 15 minutes of furious googling and a little testing on my local machine and a test server, the PR was up and ready for review. The next day it was in production.

The code change consisted of modifying one line from

buffer << message

to

buffer << "[#{Process.pid}] #{message}"
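For anyone who wants to try the idea outside of our codebase, here is a minimal sketch using Ruby’s standard `Logger`. It is not the code we shipped: the `build_pid_logger` helper is a hypothetical name, and the formatter deliberately drops the severity and timestamp to keep the example short.

```ruby
require "logger"
require "stringio"

# Hypothetical helper (not from our codebase): build a Logger whose
# formatter prefixes every line with the PID of the writing process.
def build_pid_logger(io)
  logger = Logger.new(io)
  # Logger calls the formatter with severity, time, progname and message.
  logger.formatter = proc do |_severity, _time, _progname, message|
    "[#{Process.pid}] #{message}\n"
  end
  logger
end

# Log into an in-memory buffer so the result is easy to inspect.
log_output = StringIO.new
logger = build_pid_logger(log_output)
logger.info("Doing some stuff")
logger.info("Doing some more stuff")

puts log_output.string
```

Because each Unicorn worker is a separate process, `Process.pid` differs per worker, so every line carries the identity of the worker that wrote it.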

The logs would then include the PID prepended to any messages, making our example from before clear as day:

Moreover, once the PID of the worker in question was known, anyone doing a production diagnosis could filter the logfile to view only that worker’s logs. With that simple change, we were able to shave time (and frustration!) off of every single production investigation, and all for about 15 minutes’ worth of work. In fact, the next few times there were issues that required spelunking through logs, I received direct feedback from the engineers involved about how this simple change improved the experience of diving into production logs.
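Filtering by worker can be sketched in a few lines of Ruby. This is only an illustration: `lines_for_pid` is a hypothetical helper, the PIDs and sample lines are made up, and it assumes every line starts with the `[pid] ` prefix described above.

```ruby
# Hypothetical helper: keep only the lines written by one worker,
# assuming each line starts with "[<pid>] ".
def lines_for_pid(lines, pid)
  prefix = "[#{pid}] "
  lines.select { |line| line.start_with?(prefix) }
end

# Made-up sample lines from two workers, for illustration only.
sample_log = [
  "[4242] Doing some stuff",
  "[4243] Doing some stuff",
  "[4242] Doing some more stuff",
]

# Print only what worker 4242 wrote.
puts lines_for_pid(sample_log, 4242)
```

The helper accepts anything that responds to `select`, so it should work just as well on the lines of a real logfile. On a server, a fixed-string `grep -F` for the bracketed PID would do the same job without any Ruby at all.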

Kaizen

I felt really good about the PID change. I felt so good that I wanted to share my experience and hopefully inspire others to dive in and find areas of the code that we could improve. I wished there were a backlog of these small tasks that I could tackle as I found free time in my day, time that I might otherwise have spent browsing Facebook, checking email, or getting lost in the internet. The result was the birth of a (first Trello, now JIRA) board: The Kaizen Board.

The term Kaizen combines the Japanese words for “change” and “good,” and is usually translated as “change for the better” or simply “improvement.” My deeper understanding of Kaizen and its role in the Toyota Production System came from a book I was reading at the time, The Machine That Changed the World. Kaizen is probably a term more familiar to students of martial arts and practitioners of Lean Production than it is to software developers. It was a practice first introduced formally by Toyota as part of the Toyota Production System, and included small bounties for improvements that any employee could suggest or implement to make their work more efficient. My vision for the Kaizen board was exactly that: small, continuous improvement to our codebase to make engineering work more efficient.

The Kaizen board was a home for tasks that any engineer could add and any engineer could feel free to pull across, and where pairing on issues was strongly encouraged. It also existed above and beyond sprint commitments, which forced tasks to be kept small in scope. It was a literal manifestation of a desire to make the code base better.

Evolution

Initially I started with a firm Kaizen “process” of announcing the tasks delivered at the end of every two-week sprint. Now, nearly a year and a half after the Kaizen board’s inception, the process has become much lighter touch and I see tasks move across the board on their own, without a lot of outside encouragement.

Kaizen began as a dream to make the code base we all work in a little bit better every time we interact with it. I’m excited to see engineers acting in the spirit of Kaizen even without the formal announcements, or any kind of incentive (such as existed with Toyota’s implementation of Kaizen). To date, more than half of engineering has participated (everyone from our Operations to our Product teams), and we’ve moved over 70 Kaizen tasks across the board. It has become part of the culture to the extent that PRs cross the line with a `kaizen` branch prefix (similar to how we use `feature` prefixes or JIRA ticket prefixes to keep GitHub branches organized).

Together we’re making our codebase better, one small incremental improvement at a time, and that’s one of the things that makes me proud to work with my fellow team members.