Wed 16 Nov 2011
Per-request logging Part II: The Response
In Part I of this post we talked about the general format you want to consider for logging as well as the broad categories of information you want to log (Session, Request & Response detail). I am going to focus this particular post on the response logging. Yeah, I know the blog post is titled “Per-request logging”, but for every request there usually should be a response and neglecting to log that response is a huge gap in understanding your system.
Why is the response so important?
With request and session detail you should be able to create a stream of requests from a particular session & have a fair amount of detail about timing of requests and a picture of what was being done during that session. What we want in response detail is an understanding of how the system… well… responded. This can go pretty deep when you consider how many back-end components might be involved in satisfying a single request to your application. Start with the basics and then expand over time.
Sure, there are a lot of ways to measure metrics but when they go into logs they are easy to preserve. This means that if we have a problem down the road and want to go back and look at our logs, the detail is already there in many cases. Often times we find that for scenarios we might otherwise have to reproduce, there is enough log data to point us pretty near where the problem lies. This saves time & frustration for everyone.
What do you log in the response?
Most of our system is contained in our main application, so besides the database there aren’t a lot of other components involved. As such, our metrics are very focused on understanding how the application & database responded to a request. Here are some key metrics you would find in the response log information for our application:
- responseTime – Total response time. This is the time between us receiving the request and us sending a response to the client and includes any db interactions, etc.
- cpuTime – How many CPU cycles were spent generating this response in the application.
- heapAllocated – How much heap was allocated for this request/response.
- executedStatements – How many separate SQL statements were executed to produce this response.
- totalDbTime – How much time was spent interacting with the database
These are the big general values, the things that we can easily look at to get a general idea of where we are spending resources on a request, but we don’t stop there. For any given response we also detail a lot of information about the SQL interaction including:
- tcp connect time
- cache interaction time
- sql prepare time
- time spent fetching rows
- time waiting for sql to execute
We break down as much detail as is reasonable so that for every request we have information about resources consumed & time spent. We focus a lot on the database because that tends to be a strong indicator for us and is often where we spend a lot of time optimizing our performance. Your application may have different pressure points & other areas where more detail is interesting and you should focus on instrumenting those areas as much as you can.
Now I’m logging all this data, what do I do with it?
Ultimately, logging all this data needs to produce useful knowledge about your service. You’re investing Engineering effort, Operations effort & infrastructure expense to create, store and process these logs – they better produce some results! The infrastructure that drives much of the analysis at Rally is possibly a topic for another post, but for now let me give just a few simple examples of how we can use the above data to produce something interesting.
One of the things we are constantly trying to optimize is heap usage. This ties directly to the performance of our application. You’ll notice one of the items we log above is heapAllocated. In this way, we can understand how much heap is being allocated per-request and whether that changes based on the customers account metrics, the type of request, etc. Do we allocate more heap for certain requests for customers who have more artifacts (stories, defects, etc)? What specific requests allocate the largest amount of heap? We’ve discovered some real monsters asking this question.
We can collect heapAllocated along with response time for each request in a single session and turn this into a request stream showing the events in sequence. I have taken a handful of requests from a testing session I was working with a while back and put them into a table below. The first 3 columns are just our timestamp, the request method & the URI that the user was interacting with. The first column after the URI is the response time in milliseconds, the last column is the heap allocated. When you look at this, one request quickly jumps out at you as having the highest response time as well as the largest amount of heap allocated. You can also start to see a correlation between response time & heap allocated. This isn’t surprising, but you can start to see how this sort of detail & correlation could be valuable.

Looking at this data in a table can make it difficult to pick out trends which is where visualization comes in. We use Splunk for this to get quick histograms of data queries and Tableau for our more complex visualizations. Visualization is key to finding outliers in many cases. It’s also an important tool when trying to make a point to an impatient audience.
Another thing we can do is perform forensics on logs to understand what precisely occurred during an event that drove load in another area of our system. Take a look at the post on the value of service instrumentation for another detailed example (with graphs!) of how per-request logging can be used.
Wrapping up
Rally uses the data in our logs every single day to understand the behavior of our systems. The amount of knowledge that can be gained from logging and understanding per-request data is massive and can lead to real improvements in your applications and service.
Try to think of how you might augment logs you already have to give you more knowledge about your service & the behavior or your customers. Getting the application changes in place to collect this data is one thing, doing something valuable with the data once you have it is another. You can start small by adding a little detail at a time to the logs & then reporting on that information to show value. Once you can prove that this detail in the logs has business value it’ll be much easier to justify adding more.
Good Luck!
