Abstracting S3 for Fun and Profit in Go

by Craig Smith


At Tapjoy we have a team that focuses on optimizing our infrastructure. We've identified areas that can be rewritten to reduce our cloud expenses as well as to prepare our infrastructure to scale up with our traffic growth. To this end we're employing more and more Go, because it's a strongly typed, compiled language that is easy to learn and gives performance benefits right out of the box compared to other very-high-level languages (VHLLs). Of late we've been looking to inject Go into compute-heavy areas that would normally require scaling up clusters and increasing our costs. Without giving away too much of the secret sauce, we're breaking off pieces of a monolith and rewriting them in Go. More of a migration, if you will, and less of a microservices story.

One of the problems we faced in this migration was that the models we needed were partially computed, and the focus of the migration was on porting the logic, not porting those models and computations. We didn't have the time to do that work. The models were, however, behind JSON hyper-schema managed endpoints, so we could easily retrieve the raw and computed data in JSON format. We found that generating and retrieving the JSON document containing all of the required models (5,000+ of them; 43MB raw, 7MB compressed) took too long for an HTTP GET. Instead we elected to periodically generate the JSON document and push the data to S3. The new Go service periodically retrieves the data from S3 and deserializes the JSON back into the models we need. This decoupling allows the new service to update its data periodically without blocking requests, with the downside that the data might be slightly out-of-date, yet still well within our SLA. We left the database behind, for now, but avoided the cost of porting over a lot more code than we wanted. Leveraging the existing legacy system to do much of the work also let us focus on the key business issue: reducing our cluster size by migrating a compute-intensive area to a more efficient language.

Synchronizing our models from S3 proved quite simple and efficient--it takes roughly 3.5 seconds to download the 7MB compressed JSON file, decompress it, and deserialize it into our structs. In development mode we found it useful, nay, necessary at times to override S3 and read from a file. If a field type changed during development, we would have to modify the hyper-schema definition in the legacy system, re-generate the JSON blob, and upload that document to S3 so the new service could read it. But now the JSON blob's schema has changed, and that breaks the service for everyone else on the team. We found it was better to read from a local file for the duration of development and testing. How could we add this simply? We could have found a package to mock S3, but we wanted something lighter and under our control.

Let's start with a contrived code example that we'll morph to use a simple pattern to abstract S3:
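What follows is a minimal sketch of such an example, using the aws-sdk-go client; the bucket name, key, and Person fields are illustrative:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// Person stands in for the real models we deserialize.
type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

// downloadS3Data fetches a JSON document from S3 and
// deserializes it into a slice of Person structs.
func downloadS3Data(svc *s3.S3, bucket, key string) ([]Person, error) {
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()

	var buf bytes.Buffer
	if _, err := io.Copy(&buf, out.Body); err != nil {
		return nil, err
	}

	var people []Person
	if err := json.Unmarshal(buf.Bytes(), &people); err != nil {
		return nil, err
	}
	return people, nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	// Illustrative bucket and key.
	people, err := downloadS3Data(svc, "my-bucket", "people.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d people\n", len(people))
}
```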


Note this is a contrived example; normally we would place Person in its own file with associated methods.

Clearly this code has a direct dependency on S3. In both production and development we would have to download the JSON data from S3. We could wire in an override to the bucket and key, but then we would be littering S3 with temporary files. As mentioned previously, it's better to override the S3 interaction altogether and read from a local file. We can do this by refactoring the code to abstract retrieving the data, then injecting an object that reads from S3 (the default) or from a file (development). In this simplified example there is only one place where we call S3 directly:
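Excerpted from the sketch above, it's the GetObject call:

```go
out, err := svc.GetObject(&s3.GetObjectInput{
	Bucket: aws.String(bucket),
	Key:    aws.String(key),
})
```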

If we can abstract the call to GetObject, we can remove the dependency on S3. Go has a simplified approach to interfaces, one that blends with the idea that there is no inheritance hierarchy. An interface is defined as a set of method signatures that a concrete type must implement in order to satisfy the interface. However, one needn't explicitly state that a struct (the closest relative to an object in Go) implements an interface. When assigning to an interface type, the Go compiler merely checks whether all of the methods are implemented by the concrete type; if not, it generates a compilation error. With that in mind, let's create our S3 interface:
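A sketch of the interface, mirroring the one SDK method we actually use:

```go
// S3Fetcher abstracts the single S3 call we depend on. The real
// *s3.S3 client already has a GetObject method with this exact
// signature, so it satisfies the interface without modification.
type S3Fetcher interface {
	GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error)
}
```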

We'll use that in downloadS3Data and propagate it all the way up to main. It's easier to reason about the refactor by looking at just the pieces that change:

Changes to use S3Fetcher abstraction
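Sketched against the example above, only the plumbing changes: downloadS3Data accepts the interface, and main hands the concrete client through it:

```go
// downloadS3Data now depends on the S3Fetcher interface instead
// of the concrete *s3.S3 client; the body is unchanged.
func downloadS3Data(fetcher S3Fetcher, bucket, key string) ([]Person, error) {
	out, err := fetcher.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()

	var buf bytes.Buffer
	if _, err := io.Copy(&buf, out.Body); err != nil {
		return nil, err
	}

	var people []Person
	if err := json.Unmarshal(buf.Bytes(), &people); err != nil {
		return nil, err
	}
	return people, nil
}

// main still builds the real S3 client, but passes it along as
// an S3Fetcher.
func main() {
	var fetcher S3Fetcher = s3.New(session.Must(session.NewSession()))
	people, err := downloadS3Data(fetcher, "my-bucket", "people.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d people\n", len(people))
}
```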

Now for the special sauce, the implementation of the file-based fetcher:
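A minimal sketch, assuming the fetcher is configured with a local filename:

```go
// fileFetcher reads from a local file instead of S3; the
// filename takes the place of the bucket/key pair.
type fileFetcher struct {
	filename string
}
```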


Now we need to implement the S3Fetcher interface by writing the GetObject method for the fileFetcher type:
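A sketch, slurping the whole file with ioutil.ReadFile (the NewMockBody helper is defined below):

```go
// GetObject satisfies the S3Fetcher interface. It ignores the
// bucket and key in the input and returns the local file's
// contents dressed up as an S3 response.
func (f *fileFetcher) GetObject(input *s3.GetObjectInput) (*s3.GetObjectOutput, error) {
	data, err := ioutil.ReadFile(f.filename)
	if err != nil {
		return nil, err
	}
	return &s3.GetObjectOutput{
		Body: NewMockBody(string(data)),
	}, nil
}
```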

The Body field in the GetObjectOutput struct is where we'll stash the data from the file. Body is declared as an io.ReadCloser, so to inject our data source into GetObjectOutput we need to conform to that interface. We'll do that with the MockBody struct and its accompanying receiver methods:
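A sketch of MockBody, backed by a strings.Reader:

```go
// MockBody satisfies io.ReadCloser, the type of the Body field,
// by wrapping the file's contents in a strings.Reader.
type MockBody struct {
	reader *strings.Reader
}

// Read delegates to the embedded strings.Reader.
func (m *MockBody) Read(p []byte) (n int, err error) {
	return m.reader.Read(p)
}

// Close is a NOP; an in-memory string has nothing to release.
func (m *MockBody) Close() error {
	return nil
}
```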

NewMockBody creates a MockBody struct whose Read and Close methods serve up the embedded string when called via the Copy function in downloadS3Data:
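A minimal constructor sketch:

```go
// NewMockBody wraps a string in a MockBody, ready to be stashed
// in the Body field of a GetObjectOutput.
func NewMockBody(data string) *MockBody {
	return &MockBody{reader: strings.NewReader(data)}
}
```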

The result: Body points to our MockBody object; Copy calls Read on the MockBody to retrieve the data inside the Reader; and finally, the defer needs the Close function, which, when reading from a string, is a NOP.
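Putting it together, choosing a fetcher becomes a startup decision. A sketch, assuming a hypothetical -local flag:

```go
func main() {
	// Hypothetical flag; any configuration mechanism would do.
	localFile := flag.String("local", "", "read models from this file instead of S3")
	flag.Parse()

	var fetcher S3Fetcher
	if *localFile != "" {
		fetcher = &fileFetcher{filename: *localFile}
	} else {
		fetcher = s3.New(session.Must(session.NewSession()))
	}

	// Illustrative bucket and key, ignored by the fileFetcher.
	people, err := downloadS3Data(fetcher, "my-bucket", "people.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d people\n", len(people))
}
```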

So that's how we switch between reading from a file and reading from S3 (and back!). As an exercise for the reader consider changing this to read from a hardcoded string instead.

This blog entry was originally about unit testing Go code that has an S3 dependency. The groundwork for that turned out to be too long and became this post. In a subsequent blog post we’ll dive into using this abstraction to unit test using the powerful, flexible, somewhat confusing, but extremely useful, gomock.