Fri 30 Sep 2011
The value of service instrumentation – an example
I feel like we do a pretty good job of instrumenting our systems here at Rally. I know part of the reason for our instrumentation quality is our constant attention to how we can improve it & a feeling that it isn’t good enough. The reality is that it should never be good enough – things are always changing and you are always trying to get visibility into some dark corner of your system. All that said, we had a recent event that provided a good example of how we understand a high load event on one of our servers and I wanted to talk about how that went.
We use some pretty standard tools – we have invented a few things here and there to help us but in general these are all tools you can use yourself – the main components of our instrumentation are:
- Cacti for on going trending data
- Splunk for ad-hoc slicing and dicing of log data
- Postgres data warehouse which stores data extracted from our logs
- Tableau for visualization of more complex data & report generation
Todays story is mostly focused on Cacti & Splunk.
It starts with what you alert on
The investigation begins because we receive notifications of high load on one of our application servers. We set fairly low thresholds for load on our systems and keep them tuned pretty close to ‘normal’ so that any increase triggers an investigation. This can get annoying sometimes because it doesn’t always indicate a real problem – but it very often does indicate a change in behavior & sometimes shows us a problem in the making. We want to know when problems are brewing and certainly want to know before our customers do – so this approach suits us well (at the cost of some false positives). In this particular case, had we not set such low thresholds for load alerting we probably would not have noticed this issue because it never actually caused any problems with service delivery.
Here’s a load graph from a 24 hour period of this system – the window of concern is from 1pm to 3pm:

The ramp you see in the morning is pretty normal, but the ramp at 1pm is unusual and crossed our threshold and began to fire alerts.
Digging in – what’s driving the load
OS load is generally a poor indicator of system performance but it’s often useful as a broad view into the system activity trend and is useful for identifying anomalies. Beyond that – load doesn’t provide much insight into what is going on and you have to get more stats. Here are a few more graphs showing some other characteristics of this same system during this time.
Note that the rest of the graphs focus on a window from 12:30pm to 3pm to focus on the event. These graphs are all specific to the JVM on which this problem was being observed – they aren’t system-wide graphs. It’s important to be able to focus in on a single system as well as having system-wide views into activity.
Here is a graph showing a message queue used by the application – it shows that messages are beginning to back up in the system starting around 1pm.

And here is a graph showing our garbage collection activity in this same window. You can see that this activity corresponds to the messages collecting in the queue above.

Looking at these 2 graphs we are starting to understand why our load is occurring. We have an increase of GC activity which typically will drive system load. That GC activity is likely more of a symptom than a cause. What these graphs do not tell us is whether the increasing message queue is due to an increase in message volume or an increase in message processing time – for that we need to start looking at individual requests.
We have to dig further – per request information
At some point aggregate statistics only help so much. The graphs above are general indicators to help focus a search but they are not enough to really identify what is going on. At some point you have to start looking at the activity going on during that time at a per-request level to weed out the noise and focus on those events which are driving this activity.
We use Splunk for this today – it does a good job of taking raw log data and allowing us to slice and dice it various ways to find specific activity. In order to make this useful we’ve instrumented our logs to expose a lot of detail about each request. We will talk more about that in a future post, but for this post just know that our ability to correlate log data to this event is highly reliant on having a lot of detail in those log messages that is easy to parse by a tool like Splunk.
We have a log message which indicates when message processing rates exceed a threshold – the application logs this message to help us identify outliers. Looking for this message in splunk gives us an interesting looking timeline:

This is interesting, so during this same window we know that there was an increase in messages that took longer than we like them to.
Because these log messages are also instrumented with the customers ID we can look at all the activity for that customer and see if there’s any correlation to get an idea of volume.
Here’s a histogram of the volume of requests from that customer during this time:

Now we are starting to understand what is going on. There was a dramatic increase in activity around 1pm which doesn’t directly correlate to the increase in processing times & GC activity but it’s pretty close.
What is also interesting is that while the number of messages taking over our threshold (first Splunk graph) seems to follow the same trend as the GC graph, the overall volume for this customer doesn’t follow that same trend. The timeline is similar – but it seems like there might be some specific requests in here which are driving these issues – not just the overall volume of requests.
One of the other important pieces of information we include in the log message for each request is the URI of the request. We can exclude endpoints which cannot generate a message (since the message queue is the real issue here) and understand what activity is driving the creation of these messages.
When we do that we get this histogram:

Looking at this you might think “That looks nothing like the pattern above” and you would be right – except for a few things:
- The start time of these requests correlates much more closely with the start of our event than the overall increase in request volume does.
- The start/end of these requests correlate with plateaus and positive/negative slopes in the message queue graph.
- If you envision these requests driving message generation and it taking time for the message processing to complete – you can start to correlate the graphs.
We can now look at these events & the notifications they produced and build a picture around what drove the other behavior in the system.
These graphs are great, but what was the actual problem you found?
The bottom line in this example is that we have a situation which isn’t causing us problems right now but which could grow into a problem in the future. The actual source of the event is complex and is a combination of the data this customer has, the way they are manipulating this information & the way our system is architected to handle that.
But this post isn’t about the source of the event – it’s about visibility.
During this event the system performed fine for our customers. Despite the graphs in this post, the warnings that triggered this investigation were just that – warnings. We didn’t have to take any action because there was no problem to resolve, the customer was using the system as it was intended and the system was working fine.
We know this is true because we have data to show us this is true.
We have graphs showing us server side response time, client side response time, failure rates, etc – all of those looked great. The thing that didn’t look great was the system load & it triggered an investigation just like it’s supposed to.
What do we do with all this data we have now that we’ve investigated?
Continuous improvement is an important concept at Rally and we take it very seriously. This event is something we want to understand better. We want to make sure this doesn’t turn into a problem in the future.
Shortly after this investigation concluded and we had a good idea of what occurred the Operations team summarized the findings in a document & that was distributed to the Engineering & Operations teams. Each week the Engineering & Operations teams meet for a meeting we call “Dev/Ops sync” – this issue was raised there and discussed. After some discussion this event had an owner in Engineering & a goal of investigating the effort to improve our handling of this for the future.
This effort will turn into a story and be prioritized for work in the future. We don’t stop there.
Another important group we have at Rally is our “Architecture Council”. This is a group of Engineering & Operations members who are passionate about the way that Rally evolves our service. Architecture Council was formed to track and address problems just like this – things that aren’t problems yet but will probably become problems in the future. That group has the ability to prioritize work into Engineering teams & get it done.
As time goes on, if we discover events like this being repeated they will continue to be raised at Dev/Ops sync & Architecture Council. Data such as the impact of these events on the system, the impact on customer experience, and the resources we consume each time they occur will drive the priority.
The important idea here is that it’s data driven, and it all starts and ends with visibility. You have to instrument your systems to know what they are doing & why. You then have to communicate those events to the team so that improvements can happen.
