We love us some metrics. We pull data from all kinds of places including our monitoring system agent, from log data, from mbeans & from client timers. We pull all this stuff together into a variety of tools and slice and dice it into pretty graphs & ultimately turn it into knowledge. You really cannot collect too much data about your service and we could talk all day about all the ways we watch things today and how we want to watch things in the future. But, since we don’t have all day, I wanted to focus this post on per-request logging
Why per-request logging?
There are a variety of ways to observe the performance of your systems & applications. Most folks are familiar with what we would call aggregate metrics. Aggregate metrics represent a group of measurements over time. This could be load average over the last 5 minutes, the number of transactions in the last minute, or the number of seconds spent waiting for the database over some period of time. Aggregate metrics are good because they are usually lightweight and easy to collect and they provide you with a view into changing trends. They are great for plugging into tools like Cacti or Nagios for trending and alerting.
However, once you identify a trend change, how do you get more detail about that change & understand why it occurred? This is where per-request metrics come in. As you’ll see in this post, per-request metrics let you understand specifically what types of activity drove changes in a trend & allow you to get very specific about where you spend your time diagnosing your system. They also provide you with forensics data to use after the fact so that you can reproduce what led up to a problem, which is sometimes the only way to reproduce & fix it.
It all starts with your log format.
At Rally we have a pretty verbose log format to track requests. Below is a sample of what we log for every request that goes through our system. This isn’t 100% of the data we log by any means, this is just what is logged for each request/response transaction. Any errors, exceptions, or other data points are logged separately but are often identified in this message. For example, if a particular request triggers an exception, that information will be included in this log message so that we can tie the exception back to the request that generated it.
Note that we log this message at the completion of a request so we have detail about the entire transaction when this message is logged. Be aware that if your application has a problem during a transaction and fails to complete a request it may not generate a log message – logging some detail at the start of message processing helps identify those cases (but I haven’t included an example of that here).
Important Note: Rally takes the security of customer data very seriously. One thing our logs do not contain is any information about the data in a customers account. We purposefully exclude anything which might be confidential from the logs so that we can safely allow our Operations & Engineering staff to view these logs for diagnostic purposes. Keep this in mind when deciding on a log format.
The request section
<request timestamp=”2011-10-05 14:24:10.841″
sessionID=”host02103leaaiinaxh1ksh3lolqg9pq.host02″
httpStatus=”200″
requestID=”host02103leaaiinaxh1ksh3lolqg9pq.host0240537943″
subscriptionID=”100″
subscriptionOID=”400059
userOID=”3670880549″
workspaceOid=”41529001″
projectOid=”154487468″
user-agent=”Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.81 Safari/535.2″
method=”POST” server=”rally1.rallydev.com”
uri=”/slm/porthole/detail.sp”
queryString=”"
remoteHost=”192.168.1.1″
realRemoteHost=”192.168.1.1″
rallyProxy=”" build=”32″ version=”2011.09.24″
oid=”4324499112″
hash=”/detail/userstory/4324499112″
detail-view=”/ar/view.sp”>
The response section
<response
responseTime=”73″
cpuTime=”40.0″
blockedCount=”14″ blockedTime=”0″
waitedCount=”14″ waitedTime=”1″
heapAllocated=”7408456″
totalBytesWritten=”13646″/>
<queries
executedStatements=”18″
totalDbTime=”43.66″
objectsBuilt=”63″
cacheHits=”74″ cacheMiss=”1″
totalQueryCount=”92″/>
<profOps totalTime=”43.40″>
<profOp opName=”register” count=”2″ time=”0.16″/>
<profOp opName=”transactions” count=”4″ time=”0.44″/>
<profOp opName=”connect” count=”18″ time=”0.22″/>
<profOp opName=”UnitOfWorkCommits” count=”2″ time=”0.00″/>
<profOp opName=”cache” count=”72″ time=”0.13″/>
<profOp opName=”sql prepare” count=”18″ time=”1.72″/>
<profOp opName=”object building” count=”63″ time=”0.52″/>
<profOp opName=”row fetch” count=”18″ time=”1.87″/>
<profOp opName=”sql execute” count=”18″ time=”37.78″/>
</profOps>
</response>
</request>
Instead of focusing on what each value means, I’d like to focus on what you can do with this type of detail. You’ll notice in the request section we have “sessionID”, this allows us to tie all the requests in a single session together. Our customers will sometimes access our tool from multiple tabs or multiple browsers – this helps us differentiate between those different sessions. You’ll also notice we have values like the “uri”, “queryString”, etc – these help us identify what exactly the user was doing. We don’t have 100% of the picture without the users actual data but this is good enough most of the time to understand what was going on.
Creating a click-stream.
One of the first things we can do with logging like this is create a click stream like this:
2011-11-02 07:26:10.903 : GET : /slm/rally.sp
2011-11-02 07:26:12.684 : GET : /slm/quality/data.sp
2011-11-02 07:26:13.081 : GET : /slm/webservice/x/project/729766.js
2011-11-02 07:26:13.353 : GET : /slm/navigationModel.sp
2011-11-02 07:26:13.864 : POST : /slm/pjt/recentJson.sp
2011-11-02 07:26:14.029 : GET : /slm/pjt/build.sp
2011-11-02 07:26:13.899 : GET : /slm/webservice/1.27/TypeDefinition.js
2011-11-02 07:26:14.179 : POST : /slm/defaultDashboard.sp
2011-11-02 07:26:14.476 : GET : /slm/webservice/1.27/Workspace.js
This shows us the sequence of events a user followed from beginning to end. We can include a lot of detail about each request depending on what we’re trying to diagnose. If we’re trying to understand why a customer generated an exception, we can reproduce their activity. If we are trying to understand why a customer is complaining of slow performance we can look at the timing information on each request. If we are trying to perform forensics for a customer who doesn’t understand how their data ended up the way it did, we can look at the sequence of events and help them understand that.
There is all kinds of value in being able to reproduce the metrics around a specific users session and it’s impossible to do without this type of log detail.
Monitoring System impact
One of the things we are constantly trying to optimize is heap usage. This ties directly to the performance of our application. You’ll notice a section in the above log data where we log heapAllocated. In this way, we can understand how much heap is being allocated per-request and whether that changes based on the customers account metrics. Do we allocate more heap for certain requests for customers who have more artifacts (stories, defects, etc)? What specific requests allocate the largest amount of heap (we’ve discovered some real monsters asking this question)?
We can add heapAllocated to that same clickstream along with response time for each request. The first column after the URI is the response time in ms, the 2nd column is the heap allocated. You can start to see a correlation between response time & heap allocated. Not surprising, but you can see how this could be valuable.
2011-11-02 07:26:10.903 : GET : /slm/rally.sp : 40 : 1371256
2011-11-02 07:26:12.684 : GET : /slm/quality/data.sp : 144 : 9296648
2011-11-02 07:26:13.081 : GET : /slm/webservice/x/project/729766.js : 103 : 4343616
2011-11-02 07:26:13.353 : GET : /slm/navigationModel.sp : 167 : 7927536
2011-11-02 07:26:13.864 : POST : /slm/pjt/recentJson.sp : 8 : 1902968
2011-11-02 07:26:14.029 : GET : /slm/pjt/build.sp : 16 : 1881680
2011-11-02 07:26:13.899 : GET : /slm/webservice/1.27/TypeDefinition.js : 237 : 11227208
2011-11-02 07:26:14.179 : POST : /slm/defaultDashboard.sp : 12 : 1903088
2011-11-02 07:26:14.476 : GET : /slm/webservice/1.27/Workspace.js : 57 : 2757768
We can also understand precisely where the system is slow. A customer clicks on their Backlog page and it takes 5 seconds to load – why? Looking at the log data maybe we see that the total responseTime was 4.5 seconds – that’s pretty darn slow. We see that totalDbTime was 4.45 seconds – most of our time was spent waiting for the db, but why? We can look at executedStatements and see that we performed 100 SQL queries for this one request and our “sql execute” time was over 4.3 seconds. Time to look at some SQL tuning.
Identifying excessive transactions
Ever have an event where your request rate went way up and you needed to figure out who was driving that load and why? I suspect anyone who offers an API to their customers has this problem & Rally is no different.

We love us some metrics. We pull data from all kinds of places including our monitoring system agent, from log data, from mbeans & from client timers. We pull all this stuff together into a variety of tools and slice and dice it into pretty graphs & ultimately turn it into knowledge about our service & our customers. You really cannot collect too much data about your service and we could talk all day about all the ways we watch things. But, since we don’t have all day, I wanted to focus this post on the foundation of much of our analysis – per-request logging

Why per-request logging?

There are a variety of ways to observe the performance of your systems & applications. Most folks are familiar with what we would call aggregate metrics. Aggregate metrics represent a group of measurements over time. This could be load average over the last 5 minutes, the number of transactions in the last minute, or the number of seconds spent waiting for the database over some period of time. Aggregate metrics are good because they are usually lightweight and easy to collect and they provide you with a view into changing trends. They are great for plugging into tools like Cacti or Nagios for trending and alerting.

However, once you identify a trend change, how do you get more detail about that change & understand why it occurred? This is where per-request metrics come in. Per-request metrics let you understand specifically what types of activity drove changes in a trend & allow you to get very specific about where you spend your time diagnosing your system. They also provide you with forensics data to use after the fact so that you can reproduce what led up to a problem, which is sometimes the only way to reproduce & fix it.

It all starts with your log format

There are endless different ways to log information about your system. We recommend you pick just one & standardize on that across the organization. Having a standard log format makes processing those logs a lot easier & makes sure that everyone is including similar metrics. The actual format of the logs isn’t super critical as long as it meets a few criteria:

  • Machine parseable. It should be trivial for your tools to extract fields out of your log data – ideally directly into a data structure you can reference or easily store for future use. JSON (if you are cool) & XML (if you enjoy pain) are reasonable candidates for this but there are plenty of other formats.
  • Precise timestamps. Transactions happen very fast and you want to be able to sequence events & measure the time taken. Your logs should at least have millisecond timing in the timestamps. If you can get microsecond timing that is even better.
  • Make it easy to extend. You don’t want to have to re-write your log processing tools every time you need to add a field to your log. Make this process easy to handle. This is partly to do with your log format and part to do with the tools you use.
  • Document it! Don’t complain that the guys building your new widget aren’t following an undocumented convention – document it, make it a standard.

At Rally we have a pretty verbose log format to track requests. We want to log any detail the application is aware of that is reasonable to log on a per-request basis. It’s important to remember that you will often want to go back and look at the data from a variety of perspectives, so just because you aren’t using the data today does not mean you shouldn’t log it. Defining a log

Important Note: Rally takes the security of customer data very seriously. One thing our logs do not contain is any information about the data in a customer’s account. We intentionally exclude anything which might be confidential from the logs so that we can safely allow our Operations & Engineering staff to view these logs for diagnostic purposes. Keep this in mind when deciding on a log implementation.

What details should you log?

In general there are a few categories of information you want to include in every log message. Within each of these categories you may choose to log more or less data – it depends a lot on what type of application you are creating and what data is relevant to you. This is based on what Rally has found useful.

Session information

You want to be able to find a series of requests from a single session. For this we include detail about the session such as a sessionID & timestamp. In this way we can understand all the requests that were made in sequence. You can include whatever details here are relevant but the most important part is that you are able to construct a series of events that belong to a single session or a group of sessions that have something in common.

Request Details

You want to understand what the user is interacting with in your system so you should be including any details relevant to the request. This could include the URI, GET/POST parameters (remember, keep confidential data out of logs) as well as possibly details about what hostname they made the request to, what protocol was used, etc. Add anything here that helps you identify what the user was trying to do when they made the request.

Response Detail

In the next post we’ll talk more about the details in a response but in general we want to include information about the success/failure of the request (did it generate a successful response or not). At Rally this is also where we include a lot of detail about timing, which we will expand on next time.

Gotchas – log the request ASAP

One of the problems we have run into is that if you wait for the completion of a request/response before you log data about the request there is a potential for data loss. What happens if your application fails after receiving a request but before sending a response? This hasn’t been a huge problem, the Rally application is pretty stable, but it has been a problem from time to time. We have chosen to address this by logging both at the start of a request as well as once the request is satisfied by a response. This assures that if something bad happens during the request we still have information about what request was in flight when that problem occurred and helps us to diagnose that issue.

More in the next post

In the next post we’ll talk more about the detail you can capture in the response and how you could use that – this is where using your log data for analytics becomes really interesting.