Monitoring


On the Rally Engineering blog we’ve written many articles on systems’ performance, monitoring, resilience, and recovery. An interesting event just happened at our Denver production facility that pushed our systems to their limits. It helped us answer the question:

How well does production survive in the face of all-out failures?

Perhaps inspired by Netflix’s Chaos Monkey, some local wildlife found its way into our server room and literally tore apart our hardware.

Enter the “Rally Chaos Raccoon”

Last weekend the Chaos Raccoon, whom we now call “Cyril”*, chewed his way into our production server room. From security video footage we saw that at first he was cautious and quiet, just hiding behind the Dell 4210s, where it’s warm.

But after a full day of no food Cyril got angry.

Pissed angry.

He gnawed on the mounting brackets of our UPS and ripped open a SAN unit in use by app-server-01.

The instant app-01 failed, our backup app-server spun into action and took over its production traffic. The whole process had to happen automatically: when Aaron tried to enter the cage for a manual failover, Cyril attacked and bit his cheek (yes, that cheek).


As Aaron left the server room, seeking medical attention and rabies shots, the Chaos Raccoon ripped into another Dell 4210 and chewed its innards into shreds. Our system smoothly recovered, load-balancing network traffic whilst simultaneously paging the operations team with another warning. When Cyril finally chewed through the UPS he electrocuted himself to death.

It smelled bad.

In the future we’ll probably test our processes with a “Bad-ass Bear”, “Crazy Coy Carp”, or “Janky Jackalope”. Opening up your production systems to wildlife attack demonstrates confidence in monitoring, recovery, and backup processes. It stretches your failover strategies to their limits. You may think your systems are ready for anything but when a raccoon attacks there are no rules.

If you’re going to implement your own Chaos Raccoon we recommend you first deploy an array of recovery tools and test with non-endangered creatures. It’s the organic, ecologically sound option.

* real names changed to protect the innocent

One of the (many) great things about Rally culture is the Hackathon concept. We get a period of time (traditionally it was one week in every 8, this time out will be longer due to a strategic rescheduling) to work on a project of our choice, with the caveat that the project will benefit the company. Pretty wide remit, huh? We love Hackathon as it gives us time to work on pet peeves and/or pet projects we might not otherwise be able to. It’s a chance to work with people outside of your normal team, learn a new language, build something really cool, prototype designs in a safe environment, etc. In short, it’s awesomeness.

Our current performance tests are written in Java and run via JMeter but they don’t actually give us true client side metrics since they don’t spin up a browser. We know how well the Java performs, but (as an example) all that JavaScript we’ve been writing about may affect our application performance for whatever reason. In short, we won’t really know for sure how fast the system is or isn’t until after the code is released and production monitoring takes over. Unsurprisingly we aren’t particularly comfortable with that, and want to get a holistic view of our overall system performance prior to release. One way we think we can get that view is by spinning up browsers and exercising the application as a user would. Since our browser test framework already spins up 100s of mechanical ‘users’ to exercise the application, being able to leverage it to track ‘real world’ usage metrics should give us a valuable insight into how the system really behaves.

We’ve invested heavily in the framework to make it better, and during that investment we spent time refactoring it in a way that makes it fairly easy to re-purpose certain aspects. That’s a story for another post, but put simply, our tests use generic behaviors that call the implementation-specific driver for the page or panel. Last time out The Chauncey and I started some work re-purposing our existing browser test framework to be used in performance tests. We had a good proof of concept: JMeter was calling the Ruby scripts, parameters were passed between the right places, and some of the appropriate metrics were being written to the appropriate logs. Then we ran out of time and had to go back to work… :-( But we didn’t throw away what we had. Instead we tucked it away in a branch for safe-keeping, with some notes about what was outstanding and what needed some more work.

In case you haven’t guessed by now, our plan is to finish up that work in this coming Hackathon. We plan to take the drivers and write a collection of performance behaviors to give us more bang for the bucks we spent on the tests and gain that holistic view. Furthermore, our current performance tests have a fairly long feedback loop, so I also want a faster execution time so that performance doesn’t suffer from a lack of attention simply because of the time it takes to get results.

There are some hurdles to overcome, however. Our framework is very much geared to functional testing in the Arrange/Act/Assert pattern, but to gather “real world performance” numbers we’ll need to write tests that can use pre-existing data. Currently our tests are data-independent in that each one creates _everything_ it needs via the WSAPI endpoint. In order to get meaningful metrics from the tests they’ll need to use large datasets, so we’ll have to make it possible to pass a user to a test and then manipulate the data available to that user, and then do the same operations for a different user and its respective data. So I figure it’ll end up being more Collect/Act/Assert (I’m fairly sure I made that up but hopefully you get the idea). Perhaps the assert portion will simply be along the lines of !exception, I’m not quite sure how it will look yet. We might even end up with functional performance tests, where the two are combined and we measure our functional tests – while that concept doesn’t feel quite right to me while I’m writing this, I’m open to the possibility.
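
To make the Collect/Act/Assert idea a little more concrete, here is a minimal sketch in Ruby. It is not code from our framework: timed_behavior and the shape of the hash it returns are hypothetical names invented for illustration. It times a block with a monotonic clock and treats "no exception was raised" as the only assertion.

```ruby
# Hypothetical helper: run a behavior, time it, and record whether it raised.
def timed_behavior(name)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  error = nil
  begin
    yield
  rescue StandardError => e
    # The "assert" step is simply: did the behavior raise?
    error = e
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { :name => name, :seconds => elapsed, :passed => error.nil?, :error => error }
end

# The block stands in for a real browser-driven behavior.
result = timed_behavior("open backlog page") { sleep 0.01 }
puts "#{result[:name]}: #{'%.3f' % result[:seconds]}s passed=#{result[:passed]}"
```

A real version would presumably write the timing to a log rather than print it, and the block would drive a browser against a pre-existing dataset.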

Regardless, I’m pretty excited about the chance to crack on with this project. I’ll report back once the Hackathon is over, or if the project [fails | reaches its minimum capability state and we can think about plans for putting it into production].

As we’ve mentioned in previous posts, we monitor a lot of different metrics. One of the challenges you face as your list of metrics grows is the impact metric collection has on your service. Typically, monitoring methods expose a single metric per request, so if I want to observe the current heap allocation I have to make one request, and if I want to observe the number of active threads I have to make another. These requests all add up, and when you query them every minute, like we do, there are concerns about being able to query all the metrics you want within a one-minute interval. There are also concerns about what impact all these requests have on the application itself.

Recently we have started trying to use a metrics library provided by Yammer to expose information about our application via JSON. Besides making it easy to access the metrics (they are exposed via HTTP), this library also makes it possible to collect those metrics in a single request which delivers a blast of JSON with everything in it. We’ve found adding new metrics to be pretty easy & you get a lot of metrics out of the box.
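
As a rough illustration of the single-request approach, the sketch below pulls the whole JSON document in one HTTP call and parses it. The URL and the fetch_metrics / parse_metrics names are assumptions made up for this example, not what our application actually exposes.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical endpoint; substitute wherever your application serves its metrics.
METRICS_URL = "http://localhost:8081/metrics"

# Turn the response body into a nested Ruby hash.
def parse_metrics(body)
  JSON.parse(body)
end

# One HTTP GET returns every metric at once.
def fetch_metrics(url = METRICS_URL)
  parse_metrics(Net::HTTP.get(URI(url)))
end

# Parsing a canned response, so the example runs without a live server.
sample = '{"jvm": {"thread_count": 42, "memory": {"heap_usage": 0.31}}}'
metrics = parse_metrics(sample)
puts metrics["jvm"]["thread_count"]
```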

Having solved the problem of multiple queries nicely, the next problem was the overwhelming number of metrics that are now exposed. Being the lazy Ops guy who has to configure all of those into the monitoring system, I opted to automate the process so that when new metrics are added to the JSON we use our monitoring system’s API to add the new metrics & immediately start to collect data for them. This could be done for any respectable monitoring system but in our case we’re using Zabbix, so my example is based on that.

Moving from JSON to key/value pairs

The first challenge was that JSON presents data in a hierarchy and our monitoring system wants to assign a value to a key. We opted to just “flatten” the JSON structure so that the path to any given value is the key – thus we go from something like this:

"jvm" : {
  "memory" : {
    "memory_pool_usages" : {
      "Code Cache" : 0.029173533121744793
    }
  }
}

To this:

jvm.memory.memory_pool_usages.Code_Cache = 0.029173533121744793

There are lots of examples of how to do this in whatever language you choose to use – here’s how we’re doing it in Ruby:

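# Flatten a nested hash into a single-level hash whose keys are the
# delimiter-joined paths to each value.
def hash_flatten(input = {}, output = {}, options = {})
  delimiter = options[:delimiter] || "."
  input.each do |key, value|
    key = options[:prefix].nil? ? key.to_s : "#{options[:prefix]}#{delimiter}#{key}"
    if value.is_a? Hash
      # Recurse, carrying the path so far and the caller's delimiter
      hash_flatten(value, output, :prefix => key, :delimiter => delimiter)
    else
      # Replace whitespace so the key is a single token
      clean_key = key.gsub(/\s/, "_")
      output[clean_key] = value
    end
  end
  output
end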

Once we have a list of key/value pairs we dump these metrics into a file and use the zabbix_sender tool to deliver all those metrics to Zabbix every minute.
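
Here is a minimal sketch of that step, not our production script. It assumes the zabbix_sender input-file layout of one "<hostname> <key> <value>" line per metric and the -z (server) and -i (input file) options; the sender_lines and send_metrics names, the host name, and the sample value are placeholders.

```ruby
# Build one line per metric in the layout zabbix_sender reads from a file:
#   <hostname> <key> <value>
# Assumes keys and values contain no whitespace (the flattening step
# already replaces whitespace in keys).
def sender_lines(hostname, metrics)
  metrics.map { |key, value| "#{hostname} #{key} #{value}" }
end

# Write the lines to a file and hand that file to zabbix_sender.
# Returns whatever Kernel#system returns (true on a zero exit status).
def send_metrics(zabbix_server, hostname, metrics, path)
  File.write(path, sender_lines(hostname, metrics).join("\n") + "\n")
  system("zabbix_sender", "-z", zabbix_server, "-i", path)
end

# Only the formatting is exercised here; send_metrics needs a real server.
lines = sender_lines("app-server-01", { "jvm.thread_count" => 42 })
puts lines
```

Running send_metrics from cron (or a loop) once a minute would match the collection interval described above.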

The next issue is how to make sure all the metrics in the JSON show up in our monitoring system without Ops having to do any extra work. For this we leverage the Zabbix API.

Creating Metrics via the Zabbix API

We are using Templates to associate a list of monitored items with a host. This makes it easy for us to add a new host & apply a few templates to get all the metrics we want associated with that host. When we are adding new metrics via the API we want to make sure we add them to the Template, not to the individual host being monitored. The API has an authorization process that I’m not going to cover here; it’s covered pretty well in their documentation. We are going to focus on creating new metrics.

The Zabbix API offers a convenient method “item.exists” to query a particular key to see if it already exists. We use this to confirm a key is missing before we create it. Creating a key involves a fair amount of information about the key you want to create, so here is the code (ruby) and then we’ll talk about it a bit:

def create_item(key, apps, hostid, value_type)
  # Resolve each application name to its numeric ID
  appids = apps.collect { |app| get_app_id(app) }

  req_json = jinit("item.create")
  req_json[:params] = {
    :description => key,
    :key_ => key,
    :hostid => hostid,
    :applications => appids,
    :type => 2,
    :history => 90,
    :value_type => value_type
  }

  api_req(JSON.generate(req_json))
end

Ok, so when this is called we pass a number of parameters to create_item:

  • key – this is the name of the item (the “flattened” JSON path to the value listed above).
  • apps – this is an array of application names we want this key to be added to. Refer to the Zabbix manual for the relevance of these – just know you can associate multiple apps with a single key.
  • hostid – In our case, this is the template ID since we’re using templates, but this could be any host ID you want to associate this item with. The Zabbix API re-uses the hostid to pass the templateid for items associated with templates.
  • value_type – this defines the type of value this is: string, float, int, etc
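
As an aside on value_type: Zabbix uses 0 for numeric (float), 1 for character, and 3 for numeric (unsigned) items. The helper below is a hypothetical sketch of how a value_type could be guessed from a parsed JSON value; guess_value_type is not part of the script described here.

```ruby
# Zabbix item value_type codes used in this sketch
VALUE_TYPE_FLOAT    = 0
VALUE_TYPE_CHAR     = 1
VALUE_TYPE_UNSIGNED = 3

# Hypothetical helper: pick a value_type from the Ruby class of a value.
def guess_value_type(value)
  case value
  when Integer
    # The unsigned type cannot hold negatives, so fall back to float
    value >= 0 ? VALUE_TYPE_UNSIGNED : VALUE_TYPE_FLOAT
  when Float
    VALUE_TYPE_FLOAT
  else
    VALUE_TYPE_CHAR
  end
end

puts guess_value_type(0.029173533121744793)
puts guess_value_type(17)
```

Guessing from a single sample is fragile: a gauge that happens to read 3 now may read 3.5 later, so defaulting numbers to float may be the safer choice.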

We have to iterate over the list of apps provided and obtain their numeric IDs (apps.collect). To do this we have another bit of code called get_app_id() which resolves an app name to an app ID via the API. We set up the basic JSON request structure with a call to jinit() along with the API method we want to use, in this case ‘item.create’. Using jinit() takes care of making sure we have an authentication token, keeps track of request IDs and sets up the basic structure that’s common to all API requests.

This is what jinit looks like:

def jinit(method)
  # If we don't have a token and we aren't
  # being called to setup authentication then
  # we setup auth first
  if !$auth_token && method != "user.login"
    auth_setup
  end
  json = {
    :jsonrpc => "2.0",
    :method => method,
    :auth => $auth_token,
    :id => get_id(),
  }
end

Once we get back a hash from jinit we complete the params portion of the hash with all the values we were passed. The two static values, type & history, are common to all our requests. We keep 90 days of per-minute data on monitors & setting “type” to “2” means that this is a “Zabbix Trapper” key type which is required to receive data sent to Zabbix via the zabbix_sender utility.

This generates a JSON request that looks something like this:

{
  "auth": "12345678901234567890",
  "method": "item.create",
  "id": 1,
  "params": {
    "key_": "test.db.query.composites.duration.p95",
    "history": 90,
    "hostid": "100100000010157",
    "applications": [
      "100100000001773"
    ],
    "type": 2,
    "description": "test.db.query.composites.duration.p95",
    "value_type": 0
  },
  "jsonrpc": "2.0"
}

Zabbix responds to this type of request with the itemid that was created as the result and keeps the “id” value the same as your request so you can associate the two:

{
  "result": {
    "itemids": [
      "100100000039405"
    ]
  },
  "id": 1,
  "jsonrpc": "2.0"
}
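
A small sketch of handling that response follows. The created_item_ids helper is hypothetical; it relies only on the JSON-RPC 2.0 convention that a failed call carries an "error" member instead of "result", and on the "id" echo described above.

```ruby
require "json"

# Hypothetical helper: check a response against the request id it should
# answer, and return the IDs of the items that were created.
def created_item_ids(response_body, request_id)
  response = JSON.parse(response_body)
  if response["error"]
    raise "Zabbix API error: #{response["error"].inspect}"
  end
  unless response["id"] == request_id
    raise "Response id #{response["id"].inspect} does not match request id #{request_id}"
  end
  response["result"]["itemids"]
end

# The response shown above, condensed onto one line
body = '{"result":{"itemids":["100100000039405"]},"id":1,"jsonrpc":"2.0"}'
puts created_item_ids(body, 1).inspect
```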

Now that the key is created, Zabbix will collect data about this value any time it’s delivered via zabbix_sender.
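
Putting the pieces together, the check-then-create loop can be sketched roughly as follows. This is illustrative glue rather than our actual script: ensure_items is a hypothetical name, and the two lambdas stand in for the item.exists query and the create_item method shown earlier.

```ruby
# Hypothetical glue: create every key the monitoring system does not yet know.
# exists and create are callables so the sketch runs without a Zabbix server.
# Returns the list of keys that were missing.
def ensure_items(keys, exists, create)
  keys.reject { |key| exists.call(key) }.each { |key| create.call(key) }
end

known = ["jvm.thread_count"]   # keys already present in the template
created = []                   # records what would have been created

ensure_items(
  ["jvm.thread_count", "jvm.uptime"],
  lambda { |key| known.include?(key) },
  lambda { |key| created << key }
)
puts created.inspect
```

Because existing keys are skipped, running this on every collection cycle is harmless: only keys that are new in the JSON trigger a create call.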

How do you remove metrics that are no longer in use?

We don’t. Partly this is because I’m paranoid and wouldn’t want to clobber months or years of data because of a bug. It’s also because metrics are reasonably easy to remove via the Zabbix UI, or via the API using a purpose-built script, and overall we rarely remove metrics from our systems. We also found that on application restart, individual metrics that had not yet been incremented would not appear in the JSON. We’re working on fixing this for consistency, but we wouldn’t want a problem like that in the application to cause us to remove metric data; we would rather just fail to collect.

Wrapping up

This method works well with a variety of monitoring systems but relies on the metrics being discoverable without knowing what they are. Typically this means they need to be exposed via JSON or XML and provided to the polling client with a single request. There is an additional side effect of doing this that may not be clear. If you make it automatic and easy to expose & graph metrics in your application then Developers can suddenly add metrics & observe those very easily without asking anyone. More metrics means better understanding & hopefully happier Dev & Ops folks.