Testing Spark Streaming Applications

by Ramesh Sen


Like many companies dealing with large volumes of data, Tapjoy has been moving towards a streaming data architecture using Apache Kafka and Apache Spark Streaming. While the foundations for building such a system are pretty well documented at this point, one area in which it’s tough to find much information is testing. One possible approach is covered in another post, but I came up with a slightly different technique using specs2 Scopes.

 

A few notes

This article is only covering unit testing of the transformations used in a Spark Streaming application. While full end-to-end integration tests are undeniably valuable, they become very complex in a system like this with many distinct subsystems and would likely need to run in a dedicated environment. Also, since our current use cases don’t involve any windowed computation, I haven’t gone to the trouble of checking values after each slide as discussed in the other post.

 

The setup

In order to model something a little more interesting than word count, let’s use a stream of (simplified) purchase data encoded as JSON that we want to aggregate in one-hour intervals. Our incoming messages look like this:

{"item_id": "abc123", "amount": 123.45, "time": 1431504603105}

And our application code boils down to a transformation that parses each message and sums the purchase amounts per item in hourly buckets. Here is a minimal sketch of that transformation, assuming json4s for the JSON parsing; the names Purchase, PurchaseAggregator, parsePurchase, and aggregatePurchases are just illustrative.
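
import org.apache.spark.streaming.dstream.DStream
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Illustrative model of one purchase message.
case class Purchase(itemId: String, amount: Double, time: Long)

object PurchaseAggregator {
  private val HourMillis = 60 * 60 * 1000L
  private implicit val formats: Formats = DefaultFormats

  // Decode a single JSON message into a Purchase.
  def parsePurchase(json: String): Purchase = {
    val parsed = parse(json)
    Purchase(
      (parsed \ "item_id").extract[String],
      (parsed \ "amount").extract[Double],
      (parsed \ "time").extract[Long]
    )
  }

  // Sum the purchase amounts per (item, hour bucket).
  def aggregatePurchases(messages: DStream[String]): DStream[((String, Long), Double)] =
    messages
      .map(parsePurchase)
      .map(p => ((p.itemId, p.time / HourMillis), p.amount))
      .reduceByKey(_ + _)
}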

So how do we test it already?

Spark provides a few facilities that come in handy at this point. One is the makeRDD method that allows you to turn any Scala sequence into an RDD, and the other is the ConstantInputDStream that returns the same RDD at each batch interval. We also have to take some steps to ensure that tests don’t interfere with each other, force Spark to actually compute our stream at some point, and provide some mechanism to collect the output of our transformation. In order to accomplish those goals we’ll use a specs2 Scope that sets up the StreamingContext and executes a foreachRDD block that appends each element of the output stream to a mutable List.
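
A minimal sketch of such a scope might look like the following; the names StreamingScope and collector, the local[2] master, and the one-second batch interval are illustrative choices rather than anything required by the technique.

import scala.collection.mutable.ListBuffer

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{ConstantInputDStream, DStream}
import org.specs2.mutable.After
import org.specs2.specification.Scope

class StreamingScope[T](input: Seq[String])
                       (implicit fun: DStream[String] => DStream[T])
  extends Scope with After {

  private val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("spark-streaming-test")
    .set("spark.ui.enabled", "false")

  val ssc = new StreamingContext(conf, Seconds(1))

  // Every element emitted by the transformation under test ends up in this list.
  val collector = ListBuffer.empty[T]

  // ConstantInputDStream feeds the same RDD of raw messages at every batch interval.
  private val stream = new ConstantInputDStream(ssc, ssc.sparkContext.makeRDD(input))

  // Apply the transformation; foreachRDD forces Spark to actually compute the stream.
  fun(stream).foreachRDD(rdd => collector ++= rdd.collect())

  ssc.start()
  // Block long enough for at least one batch to complete before the example body runs.
  ssc.awaitTerminationOrTimeout(3000)

  // Stop everything after each test group so the next scope gets a fresh context.
  def after: Any = ssc.stop(stopSparkContext = true, stopGracefully = false)
}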

The class signature probably needs a little explanation - it’s parameterized by the output type of our transformation function, which is passed in as an implicit parameter so that we can test multiple input scenarios without explicitly passing the function to the constructor each time. It uses the specs2 After scope so that it can stop the StreamingContext after each test group. The rest of the class should be self-explanatory, so let’s look at how we use it in a test.
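A spec using the scope could look something like this; the scenarios, expected values, and the PurchaseAggregatorSpec name are illustrative.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.streaming.dstream.DStream
import org.specs2.mutable.Specification

// Shadow Predef.conforms so it can't compete with our implicit transformation function.
import Predef.{conforms => _, _}

class PurchaseAggregatorSpec extends Specification {

  // Quiet Spark's chatty logging so the test output stays readable.
  Logger.getRootLogger.setLevel(Level.ERROR)

  // Spark's embedded actor systems clash if examples run in parallel.
  sequential

  // The transformation under test, bound implicitly so every scope picks it up.
  implicit val aggregate: DStream[String] => DStream[((String, Long), Double)] =
    PurchaseAggregator.aggregatePurchases

  private val hourMillis = 60 * 60 * 1000L

  "aggregatePurchases" should {

    "sum amounts for the same item within the same hour" in new StreamingScope(Seq(
      """{"item_id": "abc123", "amount": 1.0, "time": 1431504603105}""",
      """{"item_id": "abc123", "amount": 2.5, "time": 1431504603205}"""
    )) {
      collector must contain((("abc123", 1431504603105L / hourMillis), 3.5))
    }

    "keep different items separate" in new StreamingScope(Seq(
      """{"item_id": "abc123", "amount": 1.0, "time": 1431504603105}""",
      """{"item_id": "def456", "amount": 2.0, "time": 1431504603105}"""
    )) {
      collector.map { case ((item, _), _) => item }.toSet must_== Set("abc123", "def456")
    }
  }
}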

(The special Predef import is necessary to avoid an ambiguous implicit between conforms and fun.) First, we turn the log level way down so it doesn’t clutter the output, then we tell specs2 to execute the tests sequentially rather than in parallel. This is necessary because the actor systems of the different Spark processes wind up conflicting with each other if you try to run the tests in parallel, though I wonder if this might be a bug as it’s generally not a problem to run multiple Spark processes simultaneously. As mentioned above, the method we’re testing is bound to an implicit val so we don’t have to explicitly pass it to each test -- if you want to test multiple transformation methods in the same class, group them separately. The processed records are appended to the collector list, which can then be checked for the expected results.

 

Obviously you’d want to check more than 2 scenarios, but at this point it’s just a matter of creating the desired test cases and verifying the output. Hopefully this has been a helpful overview of how to test Spark Streaming applications, and I’d love to hear feedback in the comments or on HN or Reddit.

 

Think this is interesting? We're hiring! See our current openings here.