BigRubyConf was a great opportunity to meet and talk with some really great developers. The slides for my talk on high-scale job processing are available online via speakerdeck.
I'm excited to announce I've been accepted to speak at BigRuby Conference in Dallas on February 20th. This is (as far as I'm aware) the only conference in the US focused specifically on high-scale Ruby. I'll be giving a talk on the tools, techniques, and interesting problems from the trenches of scaling up job processing over the last two years. This is an exciting opportunity to both share what we've been up to at Tapjoy as well as to learn from other folks working on big Ruby applications. I'll be posting a follow-up after the conference with more information on my talk, and any other useful notes from the conference as a whole.
There are few things in engineering as difficult as swapping out an entire back-end solution without incurring downtime. It is a significant challenge whether your platform is in its infancy or one like ours that needs to replicate between multiple data centers with hundreds of thousands of reads and writes per second.
Over the past year, we've worked closely with the good folks at Basho and were able to make this transition with minimal disruption while increasing the efficiency and reliability of our systems.
We're happy to share that story here.
It is a very exciting time to be an engineer, especially an engineer at Tapjoy! One of the many reasons is that Tapjoy Engineering is dedicated to education. In fact, each engineer has a budget to take classes or go to conferences. It is with that education-focused spirit that we wanted to take it a step further, so we partnered with RailsBridge to host a free workshop at our San Francisco office teaching Ruby on Rails to the public.
For those who aren't familiar with RailsBridge, it was founded in 2009 to help encourage more women to get into development. Since then, they've created a curriculum and a set of guidelines on how to teach Ruby on Rails to groups with varying degrees of programming knowledge.
In August of this year, we put together a team of Tapjoy employees who were excited about hosting an event with RailsBridge, and we got to work planning how to pull it off. We decided to run the event the same way we plan any engineering effort at Tapjoy: with a Trello board and weekly meetings!
Ok, maybe we were a little more lax with deliverable dates and point values for tasks, but all in all it worked really well.
We had more than 60 students and 20 teachers during the event. It was a tremendous turnout. Check out the pictures below! One of the reasons RailsBridge works so well is that there is a large community of professional developers who love to come teach at the events.
We were fortunate to have so many teachers on the day of the event, and we were able to run seven classes of varying levels. These classes were designed for all levels, from people who haven't used the command line before all the way up to folks who've programmed in other languages and want to learn Rails. In the end, the average class size was about eight students with two teachers, so it was a very effective and intimate learning environment.
We started off on Friday night with "InstallFest," a three-hour window in which students get help setting up their environment: installing Ruby, Rails, Git, Heroku, and anything else they need to hit the ground running for the structured lessons the next day.
Saturday was dedicated to taking students through the excellent RailsBridge curriculum. It's set up so a student can copy and paste each line and get through to the end on their own. The beauty of the event, however, is that there are teachers to expand on and explain what's really happening, as well as answer any and all questions.
While the setup and dedicated audience are critical, the real magic of RailsBridge is the attitude and tone set at the event. These are very accepting and egoless events. People learning a new technical skill can often feel intimidated by the material and by how much they don't know, but the attitude at RailsBridge is that everyone goes through this, and we look for ways to make people feel comfortable with what they know. There is even a teacher training workshop given to those teaching the classes. During that workshop, one of the primary topics is "how to make people feel technically and socially comfortable".
That is where RailsBridge really shines: in creating a community of teachers and students who are all genuinely happy to help each other learn. It is very much like the Ruby community at large.
So if you are a Rails developer in the Bay Area, I highly recommend volunteering to TA or teach at one of the nearly monthly workshops.
If you aren't in the Bay Area, workshops are hosted all over the country. If no one is putting one on near you and you like organizing events, try hosting one in your town!
Day 2 from the Tapjoy offsite!
Every week, we take time out to either plan our upcoming two week "sprint" cycle (don't be thrown off by the wording there, I'd hardly call what we do SCRUM), or we take some time to checkpoint where we are in our progress. This week was a full planning day, which meant a lot of team collaboration, organizing themes for the upcoming two weeks, and making sure we were aligning behind core technology and product innovations.
There's a reason why we take planning so seriously, and why we are constantly re-evaluating our own process. Our planning process is meant to facilitate efficiency during the rest of the cycle, by reducing the number of roadblocks while engineers are focusing on building, optimizing, and shipping. Outsiders may look at our planning days as a waste of time, but we view them as time optimizers. It's easy to fall into patterns where you're constantly iterating and working, but you're never taking a step back to truly look at what you're trying to accomplish. That's the purpose of our planning days. That's why we take them so seriously.
The planning day for the offsite was made a bit more special by the fact that we were all able to collaborate in one room. There were engineers floating in and out of meetings they normally would have never been a part of, and they gained useful insights into what other teams are working on and building. It was also an opportunity for more ad-hoc conversations to happen around long term goals, visions for where teams want to go, and what we as colleagues can do to help one another.
We ended the evening in downtown Santa Rosa watching the Red Sox vs. Cardinals game. We have a large contingent of Boston natives and transplants in our Boston office, so needless to say there were a lot of eyes glued to big-screen TVs all evening. There was even a moment where one of our colleagues wandered off into the streets and ended up joining a band practice (not joking) via his amazing trumpet skills. Check out the video below; that really is our colleague playing the trumpet.
Welcome to Tapjoy Santa Rosa, our temporary home away from home for the Engineering Offsite 2013! As a highly distributed team, it's infrequent that we're all able to get together in one place at the same time, which is why every engineer on our team has taken the opportunity to come to the offsite this year. People packed onto six-hour flights, probably caused a bit more noise than they should have, and hiked their way up to Santa Rosa via San Francisco.
The first night in town was casual, fun, and a time for everyone to catch up and re-connect. Tomorrow is a planning day, and that's where the real interesting bits begin. Here's a photo of everyone chatting away as the dinner wound down, but the night picked up:
That's it for night one from Tapjoy Santa Rosa. Talk to you all tomorrow.
At Tapjoy we're constantly evaluating our technical infrastructure to ensure we're using the right tools in the right places. We've spent a lot of time over the past year working on our message queuing infrastructure. As part of using the right tool for the job, we use a variety of queue systems for different purposes. We recently did a dive into our job processing infrastructure to determine the best way to improve overall stability and reliability without interrupting existing operations. Given that the existing infrastructure was processing nearly a million unique jobs an hour, this was not a trivial problem. Our evaluation looked at a few key areas: Durability, Availability and Performance.
When we talk about Durability as it relates to queuing, what we really want to know is how likely it is that once a message has been sent to the queue that it will be successfully delivered. This means that the queue needs to be as tolerant as possible to failure scenarios such as a process dying, a disk being full, an entire server disappearing, or a significant network partition or failure.
Availability in this case is how likely we'll be able to successfully write a message to the queue. The ideal failure scenario for a queuing system is one in which it fails to allow reads, but continues successfully writing new data to the queue in a durable fashion. Obviously, this is not possible at all times, so just as important is being clear when it's not able to safely write data. The worst failure case is one in which it silently fails to write data.
Finally, let's discuss Performance. There are three things to measure when looking at queue performance: write throughput, read throughput, and time spent in queue. All three are fairly simple to test, but each requires fairly specific knowledge of the particular queue systems you're working with to determine the right way to get these metrics.
Breaking down our job queue use cases, we found two distinct scenarios. In the first, it is critically important that we're eventually able to deliver all messages. In the second, it is more important that we're able to quickly process most messages. So, we have a need for a less latency-sensitive, loss-intolerant system and also a very latency-sensitive, loss-tolerant system (although the amount of loss we will tolerate is exceedingly small). Using our previous terms, we need a highly Durable, highly Available system and a high Performance, highly Available system.
With those parameters in place, we set out to look at a variety of queue systems to see what fit our two use cases best. We quickly narrowed our choices down to two systems, Amazon's SQS and RabbitMQ, based on our existing infrastructure and in-house expertise. You can read more about our use of RabbitMQ as a message bus in one of our previous posts. So let's talk more about SQS as it relates to job processing.
We chose SQS for our first use case, where we were willing to trade a bit of Performance (specifically latency/time in queue) for high Durability and high Availability. SQS, for those who aren't familiar, is Amazon Web Services' message queue system. It provides a simple HTTP API for both publishing and consuming messages, which hides the complexities of managing a widely distributed, highly durable job queue system.
What makes a job queue?
Without going too far off the rails here, let's talk a bit about job queues. Job queues are a subset of the use cases for generalized Message queue systems. In a job queue, we don't care (as much) about things like fanout (multiple delivery) or strict message ordering. However, we do care a lot about things like message durability, retry, timeouts, and locking. Let's look at how those concepts map to SQS.
We want to ensure that once a message has been fetched from the queue, it won't be fetched again until it has been completed (or processing has reached an error condition). SQS handles this with a concept called Visibility. When you create a queue, you specify the Visibility of messages that go into that queue. As a message is fetched from the queue it is marked as "Not Visible", and is ineligible to be fetched again. Once a timeout has been reached internally, SQS will automatically make that message "Visible" again and eligible to be reprocessed. This sounds like the exact behavior we want here! But wait, there's a catch…
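To make that lifecycle concrete, here's a toy in-memory model of the behavior; it is purely illustrative (the class and method names are our own, and this is not how SQS is implemented). A fetched message is hidden until its visibility timeout elapses, and deleting it on successful completion prevents redelivery:

```ruby
# Toy model of SQS-style visibility. A received message becomes invisible
# for `visibility_timeout` seconds; if it isn't deleted in that window,
# it becomes eligible to be received (and processed) again.
class ToyQueue
  Message = Struct.new(:id, :body, :invisible_until)

  def initialize(visibility_timeout:)
    @visibility_timeout = visibility_timeout
    @messages = []
  end

  def send_message(id, body)
    @messages << Message.new(id, body, 0)
  end

  # Returns the first visible message (or nil), hiding it for the timeout window.
  def receive_message(now:)
    msg = @messages.find { |m| m.invisible_until <= now }
    msg.invisible_until = now + @visibility_timeout if msg
    msg
  end

  # "Completing" a job deletes the message so it can never reappear.
  def delete_message(id)
    @messages.reject! { |m| m.id == id }
  end
end
```

A worker that receives a message but crashes before deleting it simply lets the timeout lapse, and the message comes back, which is exactly the retry behavior described above.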
Despite the Visibility concept, SQS only guarantees "at least once", not "only once", delivery. In practice, you may receive the same message multiple times if you're fetching from the queue over multiple connections nearly simultaneously. In a significantly scaled job processing system, this is almost guaranteed to occur. The good news is that this is a relatively easy problem to solve with memcached. In short, we attempt to add the message ID to memcached: if the add succeeds, we're the first to see the message and we begin processing; if it fails, another worker has already claimed it. Because memcached's add operation is atomic, only one worker can win that race. This process ensures that we get incredibly close to "only once" delivery of our jobs.
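Sketched in Ruby, the de-duplication check looks something like this. In production the cache would be a real memcached client (e.g. the Dalli gem), whose atomic `add` succeeds only for the first writer; here a plain Hash stands in so the example is self-contained, and the method names are ours:

```ruby
# Stand-in for memcached. Dalli's `add` stores a key only if it does not
# already exist, and does so atomically on the server; this Hash-backed
# version just mimics that return-value contract for illustration.
class StubCache
  def initialize
    @store = {}
  end

  # Returns true only for the first writer of a given key.
  def add(key, value)
    return false if @store.key?(key)
    @store[key] = value
    true
  end
end

CACHE = StubCache.new

# Process a job only if we are the first worker to claim its message ID.
def process_once(message_id)
  return :duplicate unless CACHE.add("job:#{message_id}", Time.now.to_i)
  # ... actual job processing would happen here ...
  :processed
end
```

The key design point is that the check and the write are a single atomic operation; a separate "get, then set" would leave a window where two workers both see the ID as absent.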
Message durability is just baked into SQS. You don't have to specify on a per-message basis that messages should persist, or what strategy to use for that persistence. There is a hard cap that messages can only live in a queue for 14 days before they will be expunged, but we're building a job queue! We want to churn through these things as quickly as we can; however, we want to tolerate some general delays as well. A limit of 14 days has been more than sufficient for our usage thus far.
Retrying messages is an incredibly common feature in a job system. SQS handles the simple case automatically and provides some information for you to handle more advanced things on your own. By default, any message that hits its visibility timeout for any reason is put back into the queue for reprocessing. All messages are automatically retried by default. If you want to do more advanced things like exponential back-off or maximum retries, you'll need to do a bit of work. To facilitate this, SQS has a concept of per-message "Receives". Every message has an ApproximateReceiveCount, which is the number of times that message has been fetched from the queue. By default this data is not sent down with the message; however, when fetching a message, you can request additional parameters by name. Using the receive count, you can then implement more advanced error recovery scenarios in your processing code.
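As a hypothetical illustration (the helper name and constants below are our own, not part of SQS), the receive count can drive an exponential back-off and a maximum-retry cutoff; the resulting delay would then be applied to the in-flight message, for example via SQS's ChangeMessageVisibility action:

```ruby
# Illustrative retry policy driven by SQS's ApproximateReceiveCount.
MAX_RETRIES    = 10      # give up after this many attempts
BASE_DELAY     = 5       # seconds before the first retry
MAX_VISIBILITY = 43_200  # SQS caps a message's visibility timeout at 12 hours

# Returns seconds to delay the next attempt, or nil to stop retrying
# (e.g. drop the message or route it to a dead-letter queue).
def backoff_for(receive_count)
  return nil if receive_count > MAX_RETRIES
  [BASE_DELAY * (2**(receive_count - 1)), MAX_VISIBILITY].min
end
```

Doubling the delay on each attempt keeps transient failures cheap to retry while preventing a permanently broken job from hammering the system.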
The official Ruby AWS SDK provides a wrapper around SQS itself, but it doesn't really provide any of the higher level "job processing" code we need to be able to use this as a full-fledged job processing system. After some research, we couldn't identify any existing Ruby based solutions that interfaced with SQS and fit into our overall infrastructure in a way we were comfortable with. Given that we have a lot of in-house experience with this problem space, we went ahead and rolled our own job processor, which we'll hopefully be releasing to the public soon.
Over the last year, we've found SQS to be well within our expectations of availability. There have been a few brief partial outages largely affecting the ability to fetch messages from the queue. There have been two instances of being unable to write to a particular queue (both of which were temporary and isolated to a specific queue). In the failed write scenarios, the service provided adequate information back to the client so that we were able to handle the failures at the client, and not lose messages.
Remember earlier when I mentioned measuring the time a message spends in the queue? Let's get back to that.
Time spent to get a single message out of a particular queue is variable in SQS. To keep up the level of durability that SQS provides, it distributes a message across many servers. Each time you attempt to fetch messages from a queue, you only hit a subset of the servers that the queue is distributed across. The likelihood that you get no messages back for a particular request increases as the total number of messages in the queue decreases. It may take several requests to get a message when a queue is nearly empty. In an attempt to reduce the impact of this, the SQS team implemented long-polling. Long-polling lets you poll many servers on each request for messages, thereby reducing the number of empty requests. The tradeoff is that each request is much longer than the non-long-polling requests. In practice, we found issues when combining this setting with fetching batches of messages, and have not yet gone back to investigate how to balance these two options.
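For reference, with the official aws-sdk-sqs gem both behaviors are controlled per request: `wait_time_seconds` (up to 20 seconds) enables long-polling, and `max_number_of_messages` (up to 10) enables batching. The sketch below only builds the request options rather than calling AWS, and the queue URL is a placeholder:

```ruby
# Request options for a long-polling, batched receive. In real code these
# would be passed to Aws::SQS::Client#receive_message.
LONG_POLL_OPTS = {
  queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
  wait_time_seconds: 20,       # long-poll: wait up to 20s for a message to arrive
  max_number_of_messages: 10,  # batch: return up to 10 messages per request
}
```

These are the two knobs whose interaction we found tricky to balance in practice.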
In theory we would be able to push more messages (better write and read volume) through our RabbitMQ infrastructure than we can with SQS for the same number of clients and consumers. But, we would be trading durability and availability for performance. Coupled with the significantly lower operational costs of SQS, we've found this to be a definite win. At some point in the future that equation could turn over; however, we don't anticipate that being anytime soon.
Wow, we left out something really important. Regardless of what solution we use, we need to have solid monitoring and alerting around the state of our job system. Since we're in AWS, that means CloudWatch. CloudWatch is a solid tool for providing high-level information about the health of the system. It also allows you to configure simple alerts based on thresholds. We mainly use this to watch for sudden queue growth, which is often a symptom of a bigger problem elsewhere in the system that we should be looking into. CloudWatch gives us a solid baseline, but it's not without a few warts.
Our use of CloudWatch for high-level monitoring has had a secondary effect on our system: we have adopted a queue-per-job strategy for managing our queues. This is somewhat less than ideal in a few cases. In an ideal job queue system, you can design your queues around priorities or resources. This lets you scale your worker pools based on the performance or availability of particular resources, or based on the number of high-priority jobs going into the system. You then separately monitor the number of each type of job going into, and coming out of, the system. This lets you inspect the volume of particular types of jobs, while still scaling out the system based on performance or resource availability. But using just CloudWatch for monitoring, you're unable to know anything at a level more granular than "queue". We're actively working on adapting our internal monitoring tools to do the job type tracking, which will allow us to begin to collapse our queues down and still be able to easily inspect the system.
In conclusion, we use SQS to process a staggering number of jobs every day. It is one part of our infrastructure that we now spend very little time worrying about, as it mostly “just works”. That isn’t to say it’s perfect, but we’ve found that the tradeoffs Amazon made with SQS work well for our job processing infrastructure. Stay tuned. We’ll continue to share more information, experiences, and code around these problems in the coming months.