As we’ve mentioned in previous posts, we monitor a lot of different metrics. One of the challenges as your list of metrics grows is the impact metric collection has on your service. Typically, monitoring methods expose a single metric per request, so if I want to observe the current heap allocation I have to make one request, and if I want to observe the number of active threads I have to make another. These requests add up, and when you query them every minute, like we do, there are concerns about being able to query all the metrics you want within a one-minute interval. There are also concerns about the impact all these requests have on the application itself.
Recently we have started trying to use a metrics library provided by Yammer to expose information about our application via JSON. Besides making it easy to access the metrics (they are exposed via HTTP), this library also makes it possible to collect those metrics in a single request which delivers a blast of JSON with everything in it. We’ve found adding new metrics to be pretty easy & you get a lot of metrics out of the box.
Having solved the problem of multiple queries nicely, the next problem was the overwhelming number of metrics that are now exposed. Being the lazy Ops guy who has to configure all of those into the monitoring system, I opted to automate the process so that when new metrics are added to the JSON we use our monitoring system’s API to add the new metrics & immediately start to collect data for them. This could be done for any respectable monitoring system but in our case we’re using Zabbix, so my example is based on that.
Moving from JSON to key/value pairs
The first challenge was that JSON presents data as a hierarchy, while our monitoring system wants to assign a value to a key. We opted to just “flatten” the JSON structure so that the path to any given value is the key – thus we go from something like this:
"jvm" : {
  "memory" : {
    "memory_pool_usages" : {
      "Code Cache" : 0.029173533121744793
    }
  }
}
To this:
jvm.memory.memory_pool_usages.Code_Cache = 0.029173533121744793
There are lots of examples of how to do this in whatever language you choose to use – here’s how we’re doing it in Ruby:
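def hash_flatten(input = {}, output = {}, options = {})
  delimiter = options[:delimiter] || "."
  input.each do |key, value|
    # Build the full path to this value from the parent prefix
    key = options[:prefix].nil? ? key.to_s : "#{options[:prefix]}#{delimiter}#{key}"
    if value.is_a? Hash
      # Recurse, carrying the path so far and the caller's delimiter
      hash_flatten(value, output, :prefix => key, :delimiter => delimiter)
    else
      # Replace whitespace so the path is usable as an item key
      clean_key = key.gsub(/\s/, "_")
      output[clean_key] = value
    end
  end
  output
end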
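As a quick sanity check of the flattening idea, here is a minimal, self-contained sketch that parses the sample JSON from above and prints the resulting pairs. The helper name flatten_metrics is hypothetical; it is a compact variant of the same approach, written only so this snippet runs on its own.

```ruby
require "json"

# Hypothetical compact variant of the flattening approach: walk nested
# hashes and join the path segments with a delimiter.
def flatten_metrics(input, prefix = nil, delimiter = ".", output = {})
  input.each do |key, value|
    path = prefix ? "#{prefix}#{delimiter}#{key}" : key.to_s
    if value.is_a?(Hash)
      flatten_metrics(value, path, delimiter, output)
    else
      # Swap whitespace for underscores so the path works as an item key
      output[path.gsub(/\s/, "_")] = value
    end
  end
  output
end

sample = JSON.parse('{"jvm": {"memory": {"memory_pool_usages": {"Code Cache": 0.029173533121744793}}}}')
flat = flatten_metrics(sample)
flat.each { |key, value| puts "#{key} = #{value}" }
```

The only key produced for this input is jvm.memory.memory_pool_usages.Code_Cache, which matches the flattened form shown earlier.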
Once we have a list of key/value pairs we dump these metrics into a file and use the zabbix_sender tool to deliver all those metrics to Zabbix every minute.
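For anyone who hasn't used zabbix_sender with an input file, here is a rough sketch of that step. The input file holds one whitespace-separated "hostname key value" triple per line. The helper name sender_lines, the host name, the metric names and the file location are all placeholders for illustration, and the sketch assumes the values are plain numbers that need no quoting.

```ruby
require "tmpdir"

# Hypothetical helper: turn flattened metrics into zabbix_sender input
# lines of the form "<hostname> <key> <value>".
# Assumes numeric values; strings containing spaces would need quoting.
def sender_lines(hostname, metrics)
  metrics.map { |key, value| "#{hostname} #{key} #{value}" }
end

# Placeholder data; the hostname must match the host name known to Zabbix
metrics = { "jvm.thread_count" => 42, "jvm.memory.heap_usage" => 0.25 }
lines = sender_lines("app01.example.com", metrics)

path = File.join(Dir.tmpdir, "zabbix_metrics.txt")
File.write(path, lines.join("\n") + "\n")

# Then, from cron or a wrapper script (server name is a placeholder):
#   zabbix_sender -z zabbix.example.com -i <path to the file above>
```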
The next issue is how to make sure all the metrics in the JSON show up in our monitoring system without Ops having to do any extra work. For this we leverage the Zabbix API.
Creating Metrics via the Zabbix API
We are using Templates to associate a list of monitored items with a host. This makes it easy for us to add a new host & apply a few templates to get all the metrics we want associated with that host. When we are adding new metrics via the API we want to make sure we add them to the Template, not to the individual host being monitored. The API has an authorization process that I’m not going to cover here; it’s covered pretty well in their documentation. We are going to focus on creating new metrics.
The Zabbix API offers a convenient method, “item.exists”, to query a particular key to see if it already exists. We use this to confirm a key is missing before we create it. Creating a key involves a fair amount of information about the key you want to create, so here is the code (Ruby) and then we’ll talk about it a bit:
def create_item(key, apps, hostid, value_type)
  # Get a list of application IDs
  appids = apps.collect { |app| get_app_id(app) }
  req_json = jinit("item.create")
  req_json[:params] = {
    :description => key,
    :key_ => key,
    :hostid => hostid,
    :applications => appids,
    :type => 2,
    :history => 90,
    :value_type => value_type
  }
  api_req(JSON.generate(req_json))
end
Ok, so when this is called we pass a number of parameters to create_item:
- key – this is the name of the item (the “flattened” JSON path to the value listed above).
- apps – this is an array of application names we want this key to be added to. Refer to the Zabbix manual for the relevance of these – just know you can associate multiple apps with a single key.
- hostid – In our case, this is the template ID since we’re using templates, but this could be any host ID you want to associate this item with. The Zabbix API re-uses the hostid to pass the templateid for items associated with templates.
- value_type – this defines the type of value this is: string, float, int, etc
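The value_type argument is a numeric code in the Zabbix API. Here is a rough sketch of how one might pick that code from a sample value in the JSON. The helper guess_value_type is hypothetical and not part of our script, and the fallback choices (negative integers treated as floats, everything non-numeric treated as text) are assumptions you may want to change.

```ruby
# Zabbix item value_type codes:
#   0 = numeric (float), 1 = character, 3 = numeric (unsigned), 4 = text
# Hypothetical helper: guess a code from one sample value.
def guess_value_type(value)
  case value
  when Integer
    # The unsigned type cannot hold negatives, so fall back to float
    value >= 0 ? 3 : 0
  when Float
    0
  else
    # Strings and anything unexpected are stored as text (assumption;
    # code 1, character, is an alternative for short strings)
    4
  end
end

puts guess_value_type(42)
puts guess_value_type(0.029173533121744793)
puts guess_value_type("RUNNABLE")
```

One caveat with guessing from a single sample: a metric that is really a float can show up as a whole number on the first poll, so it may be safer to default numeric values to float when in doubt.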
We have to iterate over the list of apps provided and obtain their numeric IDs (apps.collect). To do this we have another bit of code called get_app_id() which resolves an app name to an app ID via the API. We set up the basic JSON request structure with a call to jinit(), passing the API method we want to use, in this case ‘item.create’. Using jinit() takes care of making sure we have an authentication token, keeps track of request IDs and sets up the basic structure that’s common to all API requests.
This is what jinit looks like:
def jinit(method)
  # If we don't have a token and we aren't
  # being called to set up authentication then
  # we set up auth first
  if !$auth_token && method != "user.login"
    auth_setup
  end
  json = {
    :jsonrpc => "2.0",
    :method => method,
    :auth => $auth_token,
    :id => get_id(),
  }
end
Once we get back a hash from jinit we complete the params portion of the hash with all the values we were passed. The two static values, type & history, are common to all our requests. We keep 90 days of per-minute data on monitors, and setting “type” to “2” means that this is a “Zabbix Trapper” item, which is required to receive data sent to Zabbix via the zabbix_sender utility.
This generates a JSON request that looks something like this:
{
  "auth": "12345678901234567890",
  "method": "item.create",
  "id": 1,
  "params": {
    "key_": "test.db.query.composites.duration.p95",
    "history": 90,
    "hostid": "100100000010157",
    "applications": [
      "100100000001773"
    ],
    "type": 2,
    "description": "test.db.query.composites.duration.p95",
    "value_type": 0
  },
  "jsonrpc": "2.0"
}
Zabbix responds to this type of request with the itemid that was created as the result and keeps the “id” value the same as your request so you can associate the two:
{
  "result": {
    "itemids": [
      "100100000039405"
    ]
  },
  "id": 1,
  "jsonrpc": "2.0"
}
Now that the key is created, Zabbix will collect data for it any time a value is delivered via zabbix_sender.
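Putting the pieces together, the "check, then create" flow can be sketched roughly as below. This is not our production script: rpc, ensure_item and the transport callable are hypothetical names, the request IDs are hard-coded, applications are left out for brevity, and a stub stands in for the real HTTP call so the sketch runs offline. It also assumes item.exists accepts hostid and key_ and returns a boolean, so check that against the API documentation for your Zabbix version.

```ruby
# Send one JSON-RPC call through `transport` (a hypothetical callable that
# takes a request hash and returns the parsed reply) and return its result.
def rpc(transport, method, params, auth, id)
  reply = transport.call({
    :jsonrpc => "2.0", :method => method, :params => params,
    :auth => auth, :id => id
  })
  # JSON-RPC 2.0 reports failures in an "error" member instead of "result"
  raise "#{method} failed: #{reply["error"].inspect}" if reply["error"]
  # The reply echoes the request id, which lets us pair the two up
  raise "#{method}: reply id does not match request" unless reply["id"] == id
  reply["result"]
end

# Returns the new itemid, or nil when the key already exists on the template.
def ensure_item(transport, auth, templateid, key, value_type)
  exists = rpc(transport, "item.exists", { :hostid => templateid, :key_ => key }, auth, 1)
  return nil if exists
  result = rpc(transport, "item.create", {
    :description => key, :key_ => key, :hostid => templateid,
    :type => 2, :history => 90, :value_type => value_type
  }, auth, 2)
  result["itemids"].first
end

# Stub standing in for Zabbix; real code would POST the JSON to the API URL.
known_keys = ["jvm.thread_count"]
fake_zabbix = lambda do |req|
  result = case req[:method]
           when "item.exists" then known_keys.include?(req[:params][:key_])
           when "item.create" then { "itemids" => ["100100000039405"] }
           end
  { "jsonrpc" => "2.0", "id" => req[:id], "result" => result }
end

created = ensure_item(fake_zabbix, "token", "100100000010157", "jvm.heap_used", 0)
skipped = ensure_item(fake_zabbix, "token", "100100000010157", "jvm.thread_count", 0)
```

Passing the transport in as an argument keeps the flow testable without a Zabbix server, which is handy when a bug here could otherwise create a pile of junk items on a template.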
How do you remove metrics that are no longer in use?
We don’t. Partly this is because I’m paranoid and wouldn’t want to clobber months or years of data because of a bug. Another reason is that the metrics are reasonably easy to remove via the Zabbix UI, or via the API using a purpose-built script, and overall we rarely remove metrics from our systems. We also found that on application restart, if certain individual metrics had not been incremented they would not appear in the JSON. We’re working on fixing this for consistency, but we wouldn’t want a problem like that in the application to cause us to remove metric data; we would rather just fail to collect.
Wrapping up
This method works well with a variety of monitoring systems but relies on the metrics being discoverable without knowing what they are. Typically this means they need to be exposed via JSON or XML and provided to the polling client with a single request. There is an additional side effect of doing this that may not be clear. If you make it automatic and easy to expose & graph metrics in your application then Developers can suddenly add metrics & observe those very easily without asking anyone. More metrics means better understanding & hopefully happier Dev & Ops folks.