2016 Engineering Reflections

by Dan Kleiman

As we approached the end of 2016, we asked members of the engineering team to think back over their work and answer three questions:

  1. What are you most proud of shipping, or what did you most enjoy working on, in 2016?
  2. What is your new favorite tool that you started using in 2016?
  3. What is the biggest engineering lesson you learned in 2016?

We thought it would be interesting to examine some of their answers. In doing so, we found a few key themes for Tapjoy Engineering in 2016.

Pride of Ownership and Hunger to Get Involved

One interesting theme that emerged was the split between our more experienced engineers and newer members of the team.

With multiple services and data stores and a complex infrastructure, the Tapjoy codebase can be a lot to get your head around. In response to the first question, many of our more senior team members shared accomplishments that showed their pride in making our infrastructure more understandable, more accessible, and, of course, more reliable and performant.

Ed Healy took pride in two documentation projects: one capturing how we route traffic from our public endpoints, the other covering the architecture and administration of our MemSQL cluster.

Ed says, "I pursued the routing question because a co-worker came up and asked me a general question about how traffic flows through our system. I had a general idea of how it worked, since I manage all the individual infrastructure pieces, but seeing as I had been here for 3 years at that point and didn't have a satisfactory answer or documentation available to point to, I took it upon myself to review and document it."

When the engineer who led the MemSQL project was getting ready to leave the company, Ed again took on the work of ensuring the project would be accessible and understandable.

"Since the Operations Team was going to be responsible for the upkeep of this production system, I wanted to be absolutely certain the available documentation was as accurate and as thorough as possible instead of simply trusting the documentation would be sufficient when we ran into our first crisis situation. I then went about reviewing and performing as many of the admin tasks as possible, expanding and revising with future use in mind."

As a member of the Operations Team, for Ed, "the highest and best use of time will likely be any endeavor that makes my fellow engineers more effective at their jobs, either via writing tools or sharing any insight I can into how our systems actually work."

By contrast, facing such large, complex systems, newer engineers took pride in learning how to get involved and make a positive contribution to the team. Many highlights included new integrations, special customer-facing features, or upgrades to internal tools.

Even though sprint goals can feel arbitrary or burdensome, one engineer cited hitting them as an important way to build trust between engineering and the product team.

Optimization vs Building New Features

When it came to the type of work our engineers did, there were two clear camps: Optimization and Building New Features.

Optimization projects touched every aspect of our infrastructure. Some focused on reworking core code paths. Others upgraded our data stores to be more reliable and resilient. We had projects to consolidate existing microservices, creating highly performant new services. We even had a team that was tasked with ruthlessly pruning out old, unused code, greatly enhancing the maintainability of our core product.

New, refined features emerged from the larger refactoring projects. Elaine Uy reported that her team took one refactor as an opportunity to build "a shiny new front end system for composing ads that we needed to integrate with the monolithic, complicated ad models that actually render content at request time. The first couple of MVPs were functional, but not clean, and incurred a lot of tech debt. Going back to refactor the code to be DRY, better tested, better encapsulated, and easier to maintain and extend felt like a breath of fresh air that made day-to-day work more enjoyable."

For another team, 2016 saw the release of a real-time reporting system that had been in the works for almost two years. Greg Sabatino describes our New Reporting Pipeline, or NRP, as "an exactly-once processing engine that can aggregate ~2.5 billion messages a day into an hourly time series data store with under a 5-minute lag from event origination to end-user report availability."

Building out NRP required cutting over existing customers who were using legacy reporting systems, along with 5 years of legacy data, with zero downtime. In addition, we wanted to enhance reporting with new data segmentation, add real-time roll-ups, improve accuracy and availability, and cut the cost of service delivery.
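The core idea behind an "exactly-once" hourly roll-up like the one Greg describes can be sketched in a few lines. This is a hypothetical illustration, not NRP's actual implementation: it deduplicates messages by ID, so a redelivered message is counted only once, and folds event values into hourly buckets. A real pipeline would persist the dedup state and the buckets in durable stores rather than in memory.

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_hourly(messages, seen_ids):
    """Fold raw event messages into hourly buckets, skipping any message ID
    already processed (the "exactly-once" guarantee, in miniature).
    Each message is a dict: {"id": str, "ts": epoch seconds, "value": int}."""
    buckets = defaultdict(int)
    for msg in messages:
        if msg["id"] in seen_ids:   # duplicate delivery: ignore it
            continue
        seen_ids.add(msg["id"])
        # Truncate the timestamp to the top of its UTC hour.
        hour = datetime.fromtimestamp(msg["ts"], tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0)
        buckets[hour] += msg["value"]
    return dict(buckets)

# Message "a" is delivered twice but counted once.
events = [
    {"id": "a", "ts": 1480000000, "value": 3},
    {"id": "b", "ts": 1480000100, "value": 2},
    {"id": "a", "ts": 1480000000, "value": 3},  # duplicate delivery
]
seen = set()
rollup = aggregate_hourly(events, seen)
```

At scale, the interesting problems are exactly the ones this sketch hides: keeping the `seen_ids` set bounded, surviving restarts, and doing all of it within the 5-minute lag budget.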

Longtime Tapjoy engineer Aaron Pfeifer explains, "we don't have that many projects that tend to go so long without being able to say that it's fully live. Over the years, different reporting systems had evolved, but we wanted to build a unified pipeline, designed to work now and designed to scale out to meet our future needs. Being able to see all of that hard work and effort all come together gives me a great sense of accomplishment."

Monitoring and Measuring

There was interesting overlap between "lessons learned" and "favorite tool," too, when it came to monitoring and measuring. Throughout the year, the team emphasized putting the right monitoring in place for any change. The most widely adopted tool for ongoing monitoring was SignalFx. By implementing monitoring along with new features or other code changes, engineers had to think beyond "what" they were releasing and understand "how" it would impact existing systems.

In working on the reporting pipeline, Greg Sabatino cited the usefulness of SignalFx to provide a real-time view into both "open-source and custom software components as well as server instances hosting them. With near-zero code required, it allows for transparency into each component, highlighting any trends or acute problems in a complex system."

At Tapjoy, we have also developed internal tools that mimic production services and databases in a development environment, or simulate end-user views of ads and interactions. These tools are critical for planning how and what to measure when code goes live.

Elaine Uy describes one such internal tool, TJSH, or "Tapjoy Shell," as a command line tool that integrates "all parts of your dev workflow so that you only need to issue one command instead of changing boxes and remembering startup script incantations. That handles a lot of context switching for you." Not only does it speed up work for our engineers, but because we maintain it as a team, anyone, even newer members of the team, can contribute to it and improve it. As a manager, Elaine also points out that "it reduces the amount of knowledge transfer I had to do for incoming new hires. All of the commands it issues are easily accessible from the tool's repo, so it works like shared documentation."

When it came to measuring impact, there was lots of love for BigQuery and MemSQL, as well. With these two fast, easy-to-use views into large swaths of data, engineers were able to quickly gain insight into the results of their work. BigQuery in particular was a big upgrade for Nick Martin, who explained that "our previous tools were very slow, like most GUI-based database engines, which simply did not give us meaningful turnaround time on queries over billions of rows. They lacked really solid ways of exporting data too, with the exception of something like CSV files. All of this is built into the BigQuery client."

Lessons Learned

Finally, in their own words, here is a collection of lessons learned in 2016 from Tapjoy engineers:

  • The first step is to do it by hand. You need to experience it before you understand it. Do it manually so you can understand the problem, e.g. curl the endpoints with test data to understand, or recreate the error locally.
  • Always question the first principles of the work you are doing and have at least a satisfactory answer, no matter what stage in the project you're at.
  • Working through issues and hitting sprint goals and deadlines makes a huge difference in future planning and helps build trust between engineering and product.
  • Asking questions and being involved is the best way to learn.
  • I was harshly reminded that: if it can fail, it will.
  • Considering all of the possible failure scenarios is hard, as is keeping on top of your Time To Recovery after a catastrophic failure.
  • A small feature scope is a happy feature scope.
  • Monitor everything.
  • Trust but verify everything.
  • Know what you're doing as best you can before you do it. And document it so you don't forget!
  • Performance profiling is an important process for any new service.
  • Solve the problem at hand, not all the ones you see. I'm finding that as I become more experienced, I notice more things about code that I don't like, or that are inefficient or broken. It's sometimes challenging to focus on just what you set out to do and not get sidetracked.

Thanks to everyone at Tapjoy engineering who shared their reflections. And thanks for reading along. Here's to a thoughtful 2017, full of well-monitored work we can be proud of!