Here at Tapjoy, we believe in using the right tool for the right problem. With a platform as large and globally-reaching as ours, we've long since moved beyond the idea of anything being a one-size-fits-all solution. We use a wide range of technologies, primarily around messaging and data storage: from SQS, SNS, and Kafka (to name a few), to Amazon's RDS for MySQL, PostgreSQL, and, coming soon, Google's BigQuery.
As you can imagine, with a stack that large and diverse, Tapjoy also has a fleet of internal applications and services. And while we're always building out new product features, and the code to support them, we also have a huge array of older systems that run our platform. As any engineer can tell you, a large part of the job is helping to maintain and manage these older systems, as quite often they are of critical importance to the day-to-day operation of the company.
So when one of these systems began to show signs of age, we started to look into what was going wrong, and what we could fix. One of the important things about Engineering at Tapjoy is that we have an emphasis on moving quickly. If we can, for example, swap out an ailing piece of an older application or service while still maintaining the functionality and guarantees of that service, we'll take that option first, before considering a larger, more costly full-scale rewrite. This doesn't mean we simply continue to bolt things on; rather, we look for ways to remove just enough of what isn't scaling, and replace it with something that will, so that we can continue to focus on pushing out new products, features, and innovations into the AdTech/MarTech space.
Replacing the Engine
One such system is responsible for a constant stream of incoming data from one of our primary processing pipelines. The difficult part about this data is that it's in part financial in nature - it deals with and records bits of money, essentially, as they move around the system. That means that, per our own guidelines, it must be stored in an ACID-compliant system. Right off the bat, we've already cut out a wide swath of possible contenders.
But ACID compliance alone isn't the only metric by which we needed to choose a new home for this data. At Tapjoy, we've long outgrown systems that aren't horizontally scalable when it comes to capacity requirements. If we can't respond to a capacity issue by throwing more hardware at the problem, we don't consider the problem solved. We prefer push-button solutions, but we also have a world-class team of DevOps engineers for a reason - so long as human intervention is low enough, we're comfortable with systems that may require someone to walk through a few steps to correct some problems.
Additionally, the project isn't a success if we only solve today's problems. We operate at a certain scale now, but month over month, we continue to see our scale rise. Any system we put in place has to handle some multiple of our current traffic, with the basic benchmark being an order of magnitude. If we choose something that addresses the first two points, but not future scalability, we haven't solved the problem - we've put it into a shinier bucket.
Those three points alone - ACID compliance, horizontal scalability, and handling at least one order of magnitude over what we do today - would be more than enough to cut the list of candidates down to a lean set of contenders. But there was one more bullet point on the list we needed to hit: we needed to make as few changes as possible to the underlying data model, and to the application writing and querying this data. There are a lot of reasons for this, but the important ones boil down to the following:
Time - We needed to get a solution in place sooner rather than later.
Effort - A full rewrite would be a large undertaking, and we only had a few resources available to work on the project.
Risk - Old code works. It may not be beautiful, but when it comes to revamping parts of a financial pipeline, it's often a wise idea to change only what you need.
So now we needed something that works with the existing data model, and even more, with as much of the existing code as possible. This includes things like the ORM (Object-Relational Mapper - a layer between the database and the application code) itself. The data currently lived in MySQL, and if we could get it into something as close to MySQL as possible that still met the above three conditions, we'd have a winner on our hands.
Making a Choice
Truth be told, there weren't a lot of choices for us to look at here, with all these constraints in mind:
We could move to a self-hosted, self-managed sharded MySQL. While this is possible, not a single person on the team (neither engineering nor ops) was excited about having to administer this kind of setup. In addition, for the size and scale of data we were talking about, we've traditionally kept it out of MySQL. This was the rare exception, and we didn't want to put it right back into the same system we were unhappy with.
We looked at the new AuroraDB option in RDS. Conceptually, it was very promising. It had wire compatibility with MySQL, meaning we could switch to it by simply changing our connection string. The main problem was performance: in our pilot, it suffered from poor throughput across a battery of tests, so we passed on it. A completely managed offering like RDS is incredibly tempting, but without the performance to match, we couldn't pull the trigger.
Our DataScience team had been using a product called MemSQL for a year, and had reported a lot of good things about it. It scaled horizontally, it had stellar throughput, and it was ACID compliant. Even better, it too was wire-compatible with MySQL. If the title of the blog post alone wasn't enough of a giveaway, MemSQL is the option we chose.
This would be a pretty short, and somewhat boring, blog post if the story ended there. MemSQL had everything we wanted, and after an initial pilot, we confirmed it would meet our stated requirements. But the proof is in the pudding. We set out to build a cluster, shadow-write data to it, and put it through a battery of tests to make sure it would hold up to our use cases.
The first order of business was simply getting the ability to easily switch between the old and new databases in the application. While not a MemSQL specific topic, it's an important aspect of any cutover project - the ability to seamlessly rollback to the old system if something goes wrong. At Tapjoy, everything runs through our Chef cookbooks. So we came up with a simple integration plan that relied on a Chef-controlled environment variable to instruct a given instance of the application which database it should use for only the relevant models (not everything needed a new home, after all).
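As a rough sketch, that toggle can be as simple as an environment variable (written out by Chef) selecting a connection config for just the affected models. The variable name and config hashes below are hypothetical, not our actual setup:

```ruby
# Hypothetical sketch of a Chef-controlled datastore toggle. Both
# entries use the same MySQL adapter - wire compatibility is what
# makes the switch a pure configuration change.
DATASTORE_CONFIGS = {
  "mysql"  => { adapter: "mysql2", host: "rds.internal",    database: "ledger" },
  "memsql" => { adapter: "mysql2", host: "memsql.internal", database: "ledger" }
}.freeze

def ledger_datastore_config
  # Default to the old datastore, so an unset flag always falls
  # back to the known-good system.
  target = ENV.fetch("LEDGER_DATASTORE", "mysql")
  DATASTORE_CONFIGS.fetch(target) do
    raise ArgumentError, "unknown datastore: #{target}"
  end
end
```

Because the flag lives in Chef, a rollback is just a re-converge with the old value - no code deploy required.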
Once we had that settled, we began to shadow all of the writes to MemSQL. Because we couldn't do traditional replication, we took a more brute-force approach. We hooked into our ORM layer, and for any of the impacted models where a commit was successfully completed, we pushed a copy of the data into a queue. Then, a second set of servers simply consumed from the queue, and used the aforementioned environment flag to determine where to write the data. In this way, we had 1 set of servers writing to our "primary" store, and a "mirrored" cluster, writing to the test store. With this, we not only had a copy of all our production traffic, we had a way to swap between which data store was considered primary, and which was the backup. When we went to do our cutover, this would let us swap what was written where seamlessly.
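In sketch form, that shadow-write loop looked something like the following. An in-memory `Queue` stands in for the real message queue, and the store and flag names are illustrative:

```ruby
# Illustrative shadow-write pipeline. In production the queue was a
# real message broker and the stores were the two databases.
SHADOW_QUEUE = Queue.new

# Called from an ORM after-commit hook on the impacted models, so
# only successfully committed rows get mirrored.
def mirror_after_commit(record)
  SHADOW_QUEUE << record.dup
end

# Run by the second set of consumer servers. An environment flag
# decides which datastore this consumer group writes to, so swapping
# primary and mirror is a single toggle.
def drain_shadow_writes(stores)
  target = ENV.fetch("SHADOW_TARGET", "memsql")
  stores.fetch(target) << SHADOW_QUEUE.pop until SHADOW_QUEUE.empty?
end
```

The key property is that the consumer is symmetric: point the flag at the other store, and the same machinery mirrors writes in the opposite direction during a cutover.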
One critical piece of work that often gets overlooked is the effort required to productionalize any piece of software into your operational environment. Learning how to deploy it, configure it, monitor it, and so on is always a lot of work. We spent a fair amount of time integrating MemSQL into our Chef and autoscaling environments, so that all of our tests, and our hoped-for production rollout, could involve as much automation as possible. So while it isn't interesting or exciting, always factor in, up front, the time it takes to put a new piece of technology into your stack. Your ops team should always get the final say on whether a piece of infrastructure is ready for production, so make sure you work with them throughout any process like this.
Kicking the Tires
Finally, we get to putting MemSQL through its paces. After locking ourselves into a meeting room that had been commandeered for the purpose, we went to work. We increased throughput to over 20x our current rate of data. We kicked off migrations. We scaled the cluster up and down, and rebalanced it. We stopped MemSQL processes. We `kill -9`-ed MemSQL processes. We re-mapped iptables rules to simulate a network partition. We hard-terminated nodes out from under the cluster itself. We did everything we could think of to induce the error conditions we would expect to see in the wild. Throughout all of that testing, we were able to successfully recover from every error condition but one.
One of our tests was a pretty egregious termination of the Master node. In MemSQL terminology, the Master node isn't so much the "brains" of the operation as a highly-privileged coordinator of certain operations. Specifically, managing the cluster as a whole, or running any DDL (Data Definition Language - anything dealing with table structure or definition), must occur on the Master. If the Master goes down, the database remains in the state it was in when the Master ran, but you lose the ability to alter the cluster or run DDL until a new Master is promoted from the pool of Aggregators (lesser coordinators that you use to run queries). During our test, we eventually got our promoted Master into a corrupt state, where it could not take over for the cluster as a whole. This turned out to be bittersweet. The upside was that we never saw any data dropped when this happened. In fact, we didn't see a single error in the application at all. The cluster effectively lost its head, but continued to thrive. However, even after following their documentation on cluster recovery to the letter, we wound up in a bad state.
Luckily, one of the non-technical strengths of MemSQL is the support team behind it. When we reached out with our issue, we were put in touch with their lead support engineer immediately, and walked them through the steps we had taken. After a series of emails and a video conference, we found the problem. While the product itself is fairly robust, the documentation has not kept up with the pace of development in all cases. The article we had followed on how to recover from an outage wasn't fully updated, and there was a step we had missed. To their credit, MemSQL admitted the page wasn't up to standard, and had us provide direct feedback on what we, as engineers, would want to see from it. While it's unfortunate we ran into the problem, as an engineer responsible for putting a new piece of technology into production, this was one of the best possible outcomes. Not only did we receive prompt help, we were also invited to help shape the resolution on their end so that future users of the product would have a better experience.
With this new knowledge in hand, we repeated our entire battery of tests, and were able to successfully recover from each outage with minimal to no errors in our application. Another note for any engineers out there looking to put new tech into production: spend a dedicated amount of time trying to break it. Throw everything you have at it. Nothing will survive every possible problem, but knowing how a system responds to various issues, and having real experience recovering from those issues to inform your SOP (Standard Operating Procedures), will be invaluable. Never put anything into production without understanding at least some of its expected failure cases. Anyone can understand how something works - it takes real effort to understand how something fails.
Once we were confident in the technology itself, it was basically a waiting game. The data has a useful lifetime of roughly 30 days - it's used primarily for weekly or monthly aggregates in specific reports. So, once we had a rolling 35 days of data (we wanted the extra runway of a work-week to respond to any problems before they became show-stoppers for the reports), we planned a cutover.
Earlier I mentioned that we had a simple switch in place, controlled via Chef, that allowed us to toggle which datastore was primary and which was secondary. This allowed the rollout to consist primarily of a single Chef deploy. And, like all major rollouts, it turned out to be incredibly important that we had a simple, safe rollback procedure and ensured data was always written to both datastores. During the rollout, we actually hit a glitch and needed to roll back. After a night of frantic hair-pulling and worrying, we were able to isolate the issue to something in our own deployment process, not MemSQL. The next day, after fixing the bug, we repeated the rollout and swapped the primary and secondary datastores with no application-level issues.
One corner we had decided we could cut early was the indices. The old table had a plethora of indices designed to support various reporting workloads, but in initial testing, query latency in MemSQL was so many leagues ahead of RDS that we decided we could live without some of them. Here are some examples of the latency decrease we're talking about. A few caveats on methodology: the RDS times are largely averages pulled from New Relic under existing cluster load, while the MemSQL times are for the worst-performing query we could generate (so, over the largest possible data set), measured while the cluster was under 20x read/write load to ensure future scalability would hold up. Also keep in mind that MemSQL had only a single index, on date, so queries filtering on individual partners (vs. global queries) actually take a performance hit as the CPUs kick on to do the filtering:
| Query | RDS MySQL | MemSQL |
| --- | --- | --- |
| Sum data point for single customer | 68 seconds | 1.23 seconds |
| Sum data point for entire platform for 1 month | 28 minutes | 1.1 seconds |
| Distinct list of all customers who did business recently | 24 seconds | 3 seconds |
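For context, the queries behind those numbers are ordinary reporting aggregates. The table and column names below are invented for illustration - the real schema differs, but the shape is the same:

```ruby
# Illustrative only: the shape of the reporting queries we benchmarked.
# The date column carried the table's single index, so the customer_id
# filter is the part that burns CPU when no index backs it.
def customer_sum_sql(customer_id, from_date, to_date)
  <<~SQL
    SELECT SUM(amount)
    FROM ledger_entries
    WHERE recorded_on BETWEEN '#{from_date}' AND '#{to_date}'
      AND customer_id = #{Integer(customer_id)}
  SQL
end
```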
Leaving the indices off was important initially because, while MemSQL was a superior product, it was also more expensive to run. With each index eating up more of the precious memory available on each Leaf node, we wanted to avoid as many of them as we could. Unfortunately, a few days later, we learned about a gap in our performance testing. While latency was still great, the additional CPU burn associated with running these queries at scale resulted in overwhelmed nodes in our cluster. After some initial debugging and testing, we determined we had been a little too cautious, and that we would need to support these indices after all. One of the other benefits of MemSQL is that it allows online table operations - something MySQL on RDS does not do out of the box. Adding our missing indices required only minor delays in processing data (while the table never truly went "offline", ingress was fairly throttled while the operation ran). For a table of over 330 million rows comprising around 600 gigs of data, each index operation took roughly 15 to 20 minutes, and never resulted in actual downtime.
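Adding a missing index back is a single DDL statement run against the Master aggregator, roughly like this (the table and column names are placeholders, not our real schema):

```ruby
# Placeholder names: builds the DDL statement run for each missing
# index. MemSQL performs the ALTER online, throttling ingest rather
# than taking the table offline for the duration.
def add_index_ddl(table, column)
  "ALTER TABLE #{table} ADD INDEX idx_#{column} (#{column})"
end
```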
The takeaway here for us was two-fold: never use latency as the only metric by which you verify something is working, and our ability to recover from an incorrect decision about the database itself is now much greater, giving us the leeway to make changes as we need to.
MemSQL turned out to be the right solution to our problem: We wanted a horizontally scalable, ACID compliant, MySQL compatible database. Truthfully, with those requirements in mind, there aren't many other options at all. But luckily, our primary option turned out to be a strong piece of technology that we believe we can trust to hold our data.
A few takeaways, for anyone looking at doing this type of cutover:
When cutting over a critical piece of production infrastructure, always have a backup plan. Even if that plan is to fall back to the old system, never run without a net.
Try to make the cutover as simple as possible. We got ours boiled down to a single deploy of Chef, and even further to a single environment variable that could be tuned globally, or per-cluster.
Put every piece of infrastructure you want to introduce through failure testing. Try to break it. Force it to fail. Understanding how something works when all is right in the world is easy. The same can't be said for failures.
Latency isn't the only thing to watch for. Memory, Disk, Network, and possibly most importantly CPU are all things to keep an eye on when testing performance. Don't get latency-myopia!
Make sure you understand how to correct issues with the system after you've gone live. If you need to change things after the fact, understand how to avoid or minimize downtime - and always communicate outward to engineering about any downtime! Your ops team will thank you!