Monitoring Distributed Systems - Riak KV on SignalFx

by Geoffrey Clark

Running a high-throughput distributed infrastructure means paying close attention to a wide variety of metrics. As a member of the Operations team at Tapjoy, I use those metrics to help scale and tune our infrastructure to meet application demand. It’s also important to make sure those systems are highly available. To help meet that reliability requirement, we use Riak KV as the backing datastore for many of our systems. As we scale our existing systems up, or add new applications and their associated load, the performance of the cluster can change. These increased workloads also increase our dependency on Riak, which makes monitoring Riak KV an important part of my job.

As our cluster has grown over time, our ability to diagnose and address production issues has improved through necessity. Some of our previous monitoring solutions were delayed by as much as 5 minutes. When working with Riak - a core component of our services - we wanted greater granularity. It’s crucial that we know within seconds what impact an increase in user traffic or a configuration change may have on the cluster. In addition, we wanted the ability to see both aggregated data and individual node metrics, all in one place.

We found SignalFx, set it up, and we’ve been happy with the results. If you want to learn more about our choice of SignalFx, you can do so here. This post details how we configured Riak KV to send metrics to SignalFx, in case you want to do the same.


Configuring Collectd

SignalFx supports a variety of channels through which you can ship metrics. Our method of choice is collectd. It’s a lightweight client that can be installed on every node and supports a wide variety of community plugins and collection methods. Collectd is baked into all of our AWS AMIs. We use Chef to manage the configuration of our nodes, including the collectd configuration. SignalFx recommends using the write_http plugin to send metrics with collectd.

LoadPlugin write_http
<Plugin "write_http">
  <URL "<%= node['collectd']['signalfx']['url'] %>">
    User "<%= node['collectd']['signalfx']['user'] %>"
    Password "<%= node['collectd']['signalfx']['token'] %>"
    VerifyPeer false
    VerifyHost false
    Format "JSON"
  </URL>
</Plugin>
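
To make the `Format "JSON"` setting a little more concrete, here is a minimal Python sketch of roughly the shape of a value list that collectd serializes for write_http. The function name, the host name, and the default interval are illustrative assumptions, not part of our Chef template.

```python
import json
import time


def build_collectd_payload(host, plugin, plugin_instance, type_instance,
                           value, interval=10.0, now=None):
    """Build a one-element, collectd-style JSON payload for a single gauge.

    Assumption: this mirrors the general layout of collectd's JSON output;
    it is a sketch for illustration, not a replacement for write_http.
    """
    return json.dumps([
        {
            "values": [value],
            "dstypes": ["gauge"],
            "dsnames": ["value"],
            "time": time.time() if now is None else now,
            "interval": interval,
            "host": host,
            "plugin": plugin,
            "plugin_instance": plugin_instance,
            "type": "gauge",
            "type_instance": type_instance,
        }
    ])


if __name__ == "__main__":
    # Hypothetical host and instance names, for illustration only.
    print(build_collectd_payload("riak-node-1", "curl_json", "riak",
                                 "vnode_gets", 1200))
```

The plugin and type fields are what later surface in metric names such as gauge.vnode_gets.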

For our Riak cluster, we apply the curl_json plugin with the configuration provided by Basho to collect metrics from the HTTP stats endpoint available here.

<Plugin curl_json>
  <URL "http://localhost:8098/stats">
    Instance "riak@"
    <Key "vnode_gets">
      Type "gauge"
    </Key>
  </URL>
</Plugin>
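
As a rough illustration of what curl_json does with this configuration, the Python sketch below fetches the stats document and keeps only the keys we care about. The function names and the key list are assumptions made for the example; in production, collectd does this work for us.

```python
import json
from urllib.request import urlopen

# Assumption: these keys mirror the <Key> entries in the collectd template.
WANTED_KEYS = ("vnode_gets", "vnode_puts")


def select_metrics(stats, keys=WANTED_KEYS):
    """Return only the requested numeric stats, skipping any that are absent."""
    return {k: stats[k] for k in keys if isinstance(stats.get(k), (int, float))}


def fetch_stats(url="http://localhost:8098/stats", timeout=5):
    """Fetch and decode the Riak stats JSON document from the HTTP endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a Riak node listening locally on port 8098.
    print(select_metrics(fetch_stats()))
```

Non-numeric entries in the stats document, such as node names, are filtered out because they cannot be shipped as gauges.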

With this template, we get a great baseline for metric collection. Running a `riak-admin status` command will give you an excellent idea of what metrics can be collected. Additionally, our cluster is configured for Multi-Datacenter Replication. Valuable replication metrics are available from the riak-repl stats endpoint.

  <URL "http://localhost:8098/riak-repl/stats">
    Instance "riak@"
    <Key "rt_source_errors">
      Type "counter"
    </Key>
  </URL>


Generating Graphs in SignalFx

Once we have metrics configured and shipping to SignalFx, we’re in good shape. The datapoints are showing up in a Catalog. Generating graphs and dashboards is the next step. In our case, we have two Riak dashboards - one for each AWS Availability Zone. Let’s take a look at setting up a graph that aggregates the overall vNode Puts and Gets for a cluster in Availability Zone us-east-1d.

Knowing the name of the metric is helpful, but if you don’t, the search feature in SignalFx is quite useful - it includes several methods for wildcard searching. We found vnode_puts and vnode_gets, both of which come in as gauge types.

This graph is not useful to us in its current state for a few reasons. First, it includes all 60 of our nodes. Let’s apply a filter. Since SignalFx also integrates with AWS metrics, we can use any EC2 filters or tags. Here, we’ve applied the aws_availability_zone filter to show only nodes in us-east-1d.

Seeing individual node metrics might be useful, but in our example we really want to see the cluster activity as a whole - a sum of this data.
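
Conceptually, filtering by Availability Zone and then summing across nodes amounts to something like the Python sketch below. The datapoint layout and function name are assumptions for illustration; SignalFx does this for us in the chart builder.

```python
from collections import defaultdict


def sum_by_metric(datapoints, zone):
    """Sum the values of each metric across all nodes in one Availability Zone.

    Assumption: each datapoint is a dict with "metric", "value", and an
    "aws_availability_zone" dimension, loosely modeled on what we see in charts.
    """
    totals = defaultdict(float)
    for point in datapoints:
        if point["aws_availability_zone"] == zone:
            totals[point["metric"]] += point["value"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample values, not real cluster data.
    points = [
        {"metric": "vnode_gets", "value": 100, "aws_availability_zone": "us-east-1d"},
        {"metric": "vnode_gets", "value": 200, "aws_availability_zone": "us-east-1d"},
        {"metric": "vnode_puts", "value": 50, "aws_availability_zone": "us-east-1d"},
        {"metric": "vnode_gets", "value": 999, "aws_availability_zone": "us-east-1b"},
    ]
    print(sum_by_metric(points, "us-east-1d"))
```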

So how did we get to a summed graph? We added the Sum > Aggregation analytical function, which adds all node data together. We also changed the graph type from Line to Area, which fills the plots with color for a better visualization of the sum. Then, we changed “Color By:” from Dimension to Metric; the right choice here varies based on your collection methods. Things are looking great now, but are they accurate? Take a look at the numbers - 5 million gets seems high.

The color changed, yes, but we also applied a scale of 1/60. According to the Basho documentation for Riak stats, puts and gets are reported as totals over 1-minute intervals. Since we want the SignalFx graph to show per-second rates, this scale converts the numbers accordingly.
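
The arithmetic behind the 1/60 scale is simple enough to sketch in Python. The function name is made up, and the 5 million input is just the round figure mentioned above, used here as an illustrative value.

```python
SECONDS_PER_MINUTE = 60


def per_second(per_minute_total):
    """Convert a one-minute total into an average per-second rate."""
    return per_minute_total / SECONDS_PER_MINUTE


if __name__ == "__main__":
    # Illustrative input: roughly 5 million gets summed over one minute.
    print(round(per_second(5_000_000)))
```

So a summed reading of about 5 million gets per minute works out to roughly 83,000 gets per second across the cluster.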

SignalFx filters and analytical functions are powerful and extensive. The ability to adjust and manipulate data also makes our collection mechanism - Collectd - very simple.


Implementing Detectors

From here, we can create meaningful thresholds that act as triggers for other activity, like notifying us. These are called detectors. Taking the previous example, creating a detector for put operations exceeding some determined high water mark is simple.

[Screenshot: implementing detectors]

Creating a rule based on the metric (signal) gauge.vnode_puts, we can see some of the available options. In this case, we are going to send an alert to PagerDuty if we exceed 30000 vNode Puts for 70% of a 10-minute duration. This is a great way to take noise out of a graph and keep those ops pages to a minimum. Detector rules are very flexible. They can be as simple as a static value and static duration, or they can include dynamic thresholds derived from other metrics, with duration defined on the fly.
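
To show why a "percent of duration" rule cuts down on noise, here is a minimal Python sketch of the logic as we understand it. The function and parameter names are assumptions for illustration, and this is not how SignalFx implements detectors internally.

```python
def should_alert(samples, threshold=30000, min_fraction=0.7):
    """Return True when enough samples in the window exceed the threshold.

    samples: values observed within the evaluation window (for example,
    the vnode_puts readings from the last 10 minutes).
    """
    if not samples:
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


if __name__ == "__main__":
    # Made-up readings: a single spike in an otherwise quiet window.
    window = [12000] * 9 + [45000]
    print(should_alert(window))
```

A short spike only puts one or two samples over the threshold, so it stays well under the 70% mark and nobody gets paged.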


Real-time Metrics Make the Difference

For Tapjoy, the benefit of SignalFx is that it promises data visualization with up-to-the-second granularity. We know our systems are having issues as soon as those issues start. Previously, we were responding to application problems after a five or ten minute delay. By the time we detected the production issue and traced it back to a configuration change or instance failure, however rare, in our Riak KV cluster, we were already experiencing application downtime. In our work at Tapjoy, that time easily translates into tens of thousands of dollars lost. Now we know in just a few seconds. In many instances we have resolved cluster issues before seeing an application-level impact.

My coworker, Weston Jossey, discusses more on how SignalFx works for us as well as how Riak KV enables our mobile strategy. We’re excited to use both these technologies to empower our DevOps practices.


Geoff Clark
Senior DevOps Engineer at Tapjoy

