Author Archive

One of the things I really enjoy about being in Operations is the variety of work that comes your way. Some of this variety comes from projects in different areas of the system, but some of this variety comes from being exposed to interrupts that come along each day. While Agile methods are designed to allow for pivots, handling daily interrupts on the scale that you see in Operations can be very intrusive to the process.

Most organizations have an on-call schedule to handle after-hours issues on a rotating basis, but we have taken that one step further. What we wanted to do was try to offload much of that daily interrupt driven work away from those folks working through their normal project oriented stories. This has the effect of allowing the majority of the team to focus for 3 weeks at a time (we have a 4 person rotation) while one person handles the majority of these interrupt type issues. This lucky person is the “Op of the Week” or OW for short.

A similar idea exists on the Development side, we call them the “Dev of the Week” or DW. They are the first point of escalation for the OW on application related issues where we need Dev team assistance.

One of the areas we have struggled a little bit with our OW rotation is making sure everyone agrees about what falls under the responsibility of the OW. As a result we’ve created a document which defines our working agreements around the OW role. We review this as a team and make adjustments where necessary.

Here’s an overview of the types of stuff we focus the OW on for their given week & the role we expect them to hold:

Rotation: OW duties rotate every Monday at 12pm. This provides time for the outgoing OW to provide an update at SoS (Scrum of Scrums) and to provide any turnover to the new OW.

General Duties:

  • Respond to all critical pages (Failed hardware, software, high load events, etc.).
  • Communicate status to team when issues are being worked. Include DW & general Dev team as appropriate.
  • File defects when needed for application events or problems that require assistance from DW or Dev team.
  • Issues raised during OW are owned to completion, coordinating additional resources as necessary.
  • Attend Scrum of Scrums & provide status on any new Ops issues (customer problems, production outages, etc)
  • Respond to all incoming operations interrupt type requests (includes all Ops responsibilities, not just prod)
  • Remediation of critical security issues. Drive required security patches / workarounds to completion.
  • Review nightly jobs including Backups for errors / problems.

We also maintain a calendar of events in this same document which describes the routine daily activities which are part of the OW duties. For example, because we perform weekly deploys to production we have a pre-production push on Thursday evening of the Release Candidate that the OW performs. Saturday mornings when our production release rolls out, the OW is expected to be on high alert for problems.

This process has worked well with some adjustment over time. It is also not intended to be completely rigid. For example, we certainly have cases where something will come up around a certain part of our infrastructure where the OW is not the expert. When appropriate, and where both parties agree, other team members are welcome to take ownership of issues that come up if they are passionate about getting those issues resolved and/or have expertise in that area. We don’t expect team members to do this all the time though, as being OW also forces some cross-training, something we think is really important.

If you’ve found a good way to offload interrupts or have feedback on what we’ve done, let us know in the comments.

We have been using the PER process now for a while and as we have run through a number of these we’ve made some changes to the process. As with all things at Rally, we rely heavily on inspecting and adapting. Here are some areas we’ve found we needed to adjust.

Communicating with teams outside those directly involved in the event

We have started to bring our PER process into the Dev/Ops sync meetings to raise visibility in that meeting of events that occur. After a PER is completed (meaning we’ve discussed it & all the action related stories are created) we present that PER at the start of our Dev/Ops sync meeting so that the broader team can ask questions or provide feedback. A big part of the PER process is simply to raise awareness of the types of issues which impact our service and this has been an easy way to briefly review these events with the team.

For larger events, in particular those where the system is unavailable, we have distributed the PER document to the teams and even the Executive team as necessary to answer questions about the event. We try to be as transparent as possible about these events & what we learn from them.

Creating and tracking stories from the event

One of the most difficult aspects of establishing a process like this is managing the large number of new stories which are created out of each PER. We do not focus on a single root cause for an event, instead we focus on discovering and documenting all the factors that led to the event or could lead to future events. Doing this means that we can generate a pretty large number of action items out of any given event – usually it’s 5-10 new stories.

Moving those stories through the process to get prioritized & completed is a challenge. We have created a “PER View” in Rally where we can see all our parent stories and their children and we have started to work on reviewing and prioritizing those incomplete items. The good news is that some are already complete, the bad news is that many others are not. We are continuing to evolve our process for reviewing these within Operations and with other teams to raise awareness about incomplete stories.

When to run a PER

When we first started this new process, we indicated that we would only run a PER for events which caused downtime. We have since started to expand our use of the PER to other events. This is driven by the fact that many events which do not cause downtime still expose areas where we could significantly improve things. Sometimes we get lucky and catch something in manual testing, or the sequence of events is such that production downtime is avoided, but it doesn’t mean we can’t learn from the event.

More recently we have tried to look at an event from the perspective of how much we think we can learn, rather than the impact it had on our system or customers. We run a PER on any event that causes downtime without question, but we’re looking for opportunities to use this process to learn about events which help us improve before we have downtime.

Developing Muscle Memory

After having done this a few times, we’ve noticed an interesting side effect. We’re getting better at recognizing if an event can lead to a PER, and we start the data collection process during the event itself. This has been immensely helpful in capturing detail without relying on an individual’s memory of the event. It also helps make our PER meetings more efficient by creating a document with nearly complete timeline, deltas and action items. As an example, during a recent event the document was created, the timeline was started, and an email was sent with a link to the PER. Those involved in the event could then immediately begin to contribute to that document as the event occurred.

A big part of this is people getting used to the PER process and having a template ready to go which folks can quickly fill in with details. Making it easy for anyone to start the document & allow everyone to contribute to it is key. We use Google Docs for these items but you can use just about any collaborative document system.

Making change easy

When the PER process was created it was intended to be a flexible process that allows for easy change to the process. We didn’t invest a lot of effort in creating any complex software or forms – we just created a Google Document. Over time we have modified that document (the template) a number of times as changes made sense. This was trivial to do, anyone could do it, and you could see a history of what had been changed in the document.

Although the process has changed a bit since it was started, we didn’t have to have a lot of discussion about those changes. The cost of change was low, and the ability to reverse a change that doesn’t work means there is very little risk to trying new things. This means we didn’t have to think too much about what the best process would be when we started – we just built something that was good enough and started using it, adjusting as we go.

Wrapping up

We will continue to evolve this process and we’ll talk about it more as we do, but the process seems to be working pretty well for collecting data about an event. The most important aspect of this is to be open to change & to try different ways of approaching things. Collecting ideas about what went wrong is the easy part though, the hard part is turning those ideas into actual changes to the software, process or the organization. When we get that part all figured out, expect another blog post with the results.

I bet you’ve never once done this before right?

Screen Shot 2011-11-22 at 11.29.11 PM

Or how about this in prod when you thought it was dev?

Screen Shot 2011-11-22 at 11.31.48 PM

Mistakes happen, but they still suck. As time goes on you become pretty careful about how you do things in production environments. Most of us who have been burned have our own tricks to avoid this. Usually this comes down to visual indicators that tell you what environment you are working in. It never hurts to add in a little pause to think before hitting enter as well.

It’s a really simple idea, but one I don’t often see implemented except by folks who have been burned. It’s better if you actually implement it across the organization to help everybody out. Here are a few examples of things we use here at Rally and that I’ve used in the past to help make important information more obvious.

Shell prompt to make the environment obvious

If you login to a production system at Rally – no matter who you are – you’ll see something like this:

Screen Shot 2011-11-22 at 11.19.18 PM

And if you su to root you’ll see this:

Screen Shot 2011-11-22 at 11.12.13 PM

The Red prompt tells you:

  • You are on production, be careful
  • The name of the server you are logged into
  • If you su to root, we try to make that more evident

This is accomplished by creating a file called /etc/profile.d/prompt.sh with the following contents:

# Special prompt for production systems
if [ "$EUID" = "0" ]; then
  PS1='\[\033[01;31m\]| PROD - \H - ROOT | \[\033[00m\]\n[\w]# '
else
  PS1='\[\033[01;31m\]| PROD - \H - \u | \[\033[00m\]\n[\w]$ '
fi

Similarly – if you login to a non-production demo system you might see something like this:

Screen Shot 2011-11-22 at 11.20.44 PM

The yellow prompt tells you two things:

  • You are not on production
  • You are still on a demo system which might be in use right now, so still be careful.

If you login and see something like this:

Screen Shot 2011-11-23 at 9.48.30 AM

You know something is amiss, because Adam shouldn’t have production access.

Leverage your MOTD

A feature of Unix I rarely see used to it’s full extent is the motd. The contents of this file are dumped to the terminal each time you login and are an excellent place for messages you want users to see when they login. Look at the prompt below and then we can discuss:

Screen Shot 2011-11-26 at 6.56.33 AM

This message tells us a number of things immediately upon logging in:

  • Someone else is doing something with this system
  • What date they started work
  • Who that person is
  • What internal story (S123456) is associated with this work

This is pretty much all you need. You can look at the story if you want more detail, you can follow up with the individual to see what’s going on, but most importantly, if an alarm went off in the monitoring system for this system you know it’s not in service and you aren’t going to attempt to “fix” it without checking with the person listed.

I like to refer to this as “communicating close to the change”. Try to add an indicator somewhere that others will see it during the natural course of their work looking into a change. Having this information in a change log somewhere is great but if an alarm goes off at 3am, I’m just going to login to the system and take a look at it – having the motd in place assures I’ll see it.

Change the color of your screen/tmux status bar

I tend to use screen or tmux when I’m working on systems. This gives me one additional area where I can expose an indicator of my environment.

Screen Shot 2011-11-28 at 10.27.54 AM

So now, not only do I have my shell prompt to tell me I’m in production but I have the red bar at the bottom telling me I am in production. For contrast, here is what my tmux status bar looks like when I’m not in production:

Screen Shot 2011-11-28 at 10.35.04 AM

You can change this pretty easy in tmux.conf:

Here is production:

# Highlighting the active window in status bar
setw -g window-status-current-bg red
setw -g window-status-current-fg white

And here is non-production:

# Highlighting the active window in status bar
setw -g window-status-current-bg green
setw -g window-status-current-fg white

Wrapping up

The main point here is that you should make it as obvious as you can (without being intrusive) which environment you are working in. Use colors, blinky lights or whatever else you need to make sure you know where you are working at all times. This doesn’t prevent all mistakes, but it makes it easier to avoid some.

As we’ve mentioned in previous posts, we monitor a lot of different metrics. One of the challenges you have as your list of metrics grow is the impact metric collection has on your service. Typically monitoring methods expose a single metric per request so if I want to observe the current heap allocation I have to make a single request and if I want to observe the number of active threads I have to make another request. These requests all add up and when you query them every minute, like we do, there are concerns about being able to query all the metrics you want to in a 1 minute interval. There are also concerns about what impact all these requests have on the application itself.

Recently we have started trying to use a metrics library provided by Yammer to expose information about our application via JSON. Besides making it easy to access the metrics (they are exposed via HTTP), this library also makes it possible to collect those metrics in a single request which delivers a blast of JSON with everything in it. We’ve found adding new metrics to be pretty easy & you get a lot of metrics out of the box.

Having solved the problem of multiple queries nicely, the next problem was the overwhelming number of metrics that are now exposed. Being the lazy Ops guy who has to configure all of those into the monitoring system, I opted to automate the process so that when new metrics are added to the JSON we use our monitoring system’s API to add the new metrics & immediately start to collect data for them. This could be done for any respectable monitoring system but in our case we’re using Zabbix, so my example is based on that.

Moving from JSON to key/value pairs

The first challenge was that JSON presents data in a hierarchy and our monitoring system wants to assign a value to a key. We opted to just “flatten” the JSON structure so that the path to any given value is the key – thus we go from something like this:

"jvm" : {
  "memory" : {
    "memory_pool_usages" : {
      "Code Cache" : 0.029173533121744793,
    }
  }
}

To this:

jvm.memory.memory_pool_usages.Code_Cache = 0.029173533121744793

There are lots of examples of how to do this in whatever language you choose to use – here’s how we’re doing it in ruby:

def hash_flatten(input = {}, output = {}, options = {})
  input.each do |key, value|
    key = options[:prefix].nil? ? "#{key}" : "#{options[:prefix]}#{options[:delimiter]||"."}#{key}"
    if value.is_a? Hash
      hash_flatten(value, output, :prefix => key, :delimiter => ".")
    else
      clean_key = key.gsub(/\s/, "_")
      output[clean_key]  = value
    end
  end
  output
end

Once we have a list of key/value pairs we dump these metrics into a file and use the zabbix_sender tool to deliver all those metrics to Zabbix every minute.

The next issue is how to make sure all the metrics in the JSON show up in our monitoring system without Ops having to do any extra work. For this we leverage the Zabbix API.

Creating Metrics via the Zabbix API

We are using Templates to associate a list of monitored items with a host. This makes it easy for us to add a new host & apply a few templates to get all the metrics we want associated with that host. When we are adding new metrics via the API we want to make sure we add them to the Template, not to the individual host being monitored. The API has an authorization process that I’m not going to cover here, it’s covered pretty well in their documentation. We are going to focus on creating new metrics.

The Zabbix API offers a convenient method “item.exists” to query a particular key to see if it already exists. We use this to confirm a key is missing before we create it. Creating a key involves a fair amount of information about the key you want to create, so here is the code (ruby) and then we’ll talk about it a bit:

def create_item(key, apps, hostid, value_type)
  #appids = []
  # Get a list of application ID's
  appids = apps.collect { | app | get_app_id(app) }

  req_json = jinit("item.create")
  req_json[:params] = {
    :description => key,
    :key_ => key,
    :hostid => hostid,
    :applications => appids,
    :type => 2,
    :history => 90,
    :value_type => value_type
  }

  api_req(JSON.generate(req_json))
end

Ok, so when this is called we pass a number of parameters to create_item:

  • key – this is the name of the item (the “flattened” JSON path to the value listed above).
  • apps – this is an array of application names we want this key to be added to. Refer to the Zabbix manual for the relevance of these – just know you can associate multiple apps with a single key.
  • hostid – In our case, this is the template ID since we’re using templates, but this could be any host ID you want to associate this item with. The Zabbix API re-uses the hostid to pass the templateid for items associated with templates.
  • value_type – this defines the type of value this is: string, float, int, etc

We have to iterate over the list of apps provided and obtain their numeric IDs (apps.collect). To do this we have another bit of code called get_app_id() which resolves an app name to an app ID via the API. We setup the basic JSON request structure with a call to jinit() along with the API method we want to use, in this case ‘item.create’. Using jinit() takes care of making sure we have an authentication token, keeps track of request IDs and sets up the basic structure that’s common to all API requests.

This is what jinit looks like:

def jinit(method)
  # If we don't have a token and we aren't
  # being called to setup authentication then
  # we setup auth first
  if !$auth_token && method != "user.login"
    auth_setup
  end
  json = {
    :jsonrpc => "2.0",
    :method => method,
    :auth => $auth_token,
    :id => get_id(),
  }
end

Once we get back a hash from jinit we complete the params portion of the hash with all the values we were passed. The two static values, type & history, are common to all our requests. We keep 90 days of per-minute data on monitors & setting “type” to “2″ means that this is a “Zabbix Trapper” key type which is required to receive data sent to Zabbix via the zabbix_sender utility.

This generates a JSON request that looks something like this:

{
  "auth": "12345678901234567890",
  "method": "item.create",
  "id": 1,
  "params": {
    "key_": "test.db.query.composites.duration.p95",
    "history": 90,
    "hostid": "100100000010157",
    "applications": [
       "100100000001773"
     ],
     "type": 2,
     "description": "test.db.query.composites.duration.p95",
     "value_type": 0
   },
   "jsonrpc": "2.0"
}

Zabbix responds to this type of request with the itemid that was created as the result and keeps the “id” value the same as your request so you can associate the two:

 {
   "result": {
     "itemids": [
       "100100000039405"
     ]
   },
   "id": 1,
   "jsonrpc": "2.0"
 }

Now that the key is created zabbix will collect data about this value anytime it’s delivered via zabbix_sender.

How do you remove metrics that are no longer in use?

We don’t. Partly this is because I’m paranoid and wouldn’t want to clobber months or years of data because of a bug. Another reason is because the metrics are reasonably easy to remove via the Zabbix UI or API using a purpose built script and overall we rarely remove metrics from our systems. We also found that on application restart, if certain individual metrics had not been incremented they would not appear in the JSON. We’re working on fixing this for consistency but we wouldn’t want a problem like that in the application to cause us to remove metric data, we would rather just fail to collect.

Wrapping up

This method works well with a variety of monitoring systems but relies on the metrics being discoverable without knowing what they are. Typically this means they need to be exposed via JSON or XML and provided to the polling client with a single request. There is an additional side effect of doing this that may not be clear. If you make it automatic and easy to expose & graph metrics in your application then Developers can suddenly add metrics & observe those very easily without asking anyone. More metrics means better understanding & hopefully happier Dev & Ops folks.

In Part I of this post we talked about the general format you want to consider for logging as well as the broad categories of information you want to log (Session, Request & Response detail). I am going to focus this particular post on the response logging. Yeah, I know the blog post is titled “Per-request logging”, but for every request there usually should be a response and neglecting to log that response is a huge gap in understanding your system.

Why is the response so important?

With request and session detail you should be able to create a stream of requests from a particular session & have a fair amount of detail about timing of requests and a picture of what was being done during that session. What we want in response detail is an understanding of how the system… well… responded. This can go pretty deep when you consider how many back-end components might be involved in satisfying a single request to your application. Start with the basics and then expand over time.

Sure, there are a lot of ways to measure metrics but when they go into logs they are easy to preserve. This means that if we have a problem down the road and want to go back and look at our logs, the detail is already there in many cases. Often times we find that for scenarios we might otherwise have to reproduce, there is enough log data to point us pretty near where the problem lies. This saves time & frustration for everyone.

What do you log in the response?

Most of our system is contained in our main application, so besides the database there aren’t a lot of other components involved. As such, our metrics are very focused on understanding how the application & database responded to a request. Here are some key metrics you would find in the response log information for our application:

  • responseTime – Total response time. This is the time between us receiving the request and us sending a response to the client and includes any db interactions, etc.
  • cpuTime – How many CPU cycles were spent generating this response in the application.
  • heapAllocated – How much heap was allocated for this request/response.
  • executedStatements – How many separate SQL statements were executed to produce this response.
  • totalDbTime – How much time was spent interacting with the database

These are the big general values, the things that we can easily look at to get a general idea of where we are spending resources on a request, but we don’t stop there. For any given response we also detail a lot of information about the SQL interaction including:

  • tcp connect time
  • cache interaction time
  • sql prepare time
  • time spent fetching rows
  • time waiting for sql to execute

We break down as much detail as is reasonable so that for every request we have information about resources consumed & time spent. We focus a lot on the database because that tends to be a strong indicator for us and is often where we spend a lot of time optimizing our performance. Your application may have different pressure points & other areas where more detail is interesting and you should focus on instrumenting those areas as much as you can.

Now I’m logging all this data, what do I do with it?

Ultimately, logging all this data needs to produce useful knowledge about your service. You’re investing Engineering effort, Operations effort & infrastructure expense to create, store and process these logs – they better produce some results! The infrastructure that drives much of the analysis at Rally is possibly a topic for another post, but for now let me give just a few simple examples of how we can use the above data to produce something interesting.

One of the things we are constantly trying to optimize is heap usage. This ties directly to the performance of our application. You’ll notice one of the items we log above is heapAllocated. In this way, we can understand how much heap is being allocated per-request and whether that changes based on the customers account metrics, the type of request, etc. Do we allocate more heap for certain requests for customers who have more artifacts (stories, defects, etc)? What specific requests allocate the largest amount of heap? We’ve discovered some real monsters asking this question.

We can collect heapAllocated along with response time for each request in a single session and turn this into a request stream showing the events in sequence. I have taken a handful of requests from a testing session I was working with a while back and put them into a table below. The first 3 columns are just our timestamp, the request method & the URI that the user was interacting with. The first column after the URI is the response time in milliseconds, the last column is the heap allocated. When you look at this, one request quickly jumps out at you as having the highest response time as well as the largest amount of heap allocated. You can also start to see a correlation between response time & heap allocated. This isn’t surprising, but you can start to see how this sort of detail & correlation could be valuable.

Screen Shot 2011-11-15 at 5.30.40 PM

Looking at this data in a table can make it difficult to pick out trends which is where visualization comes in. We use Splunk for this to get quick histograms of data queries and Tableau for our more complex visualizations. Visualization is key to finding outliers in many cases. It’s also an important tool when trying to make a point to an impatient audience.

Another thing we can do is perform forensics on logs to understand what precisely occurred during an event that drove load in another area of our system. Take a look at the post on the value of service instrumentation for another detailed example (with graphs!) of how per-request logging can be used.

Wrapping up

Rally uses the data in our logs every single day to understand the behavior of our systems. The amount of knowledge that can be gained from logging and understanding per-request data is massive and can lead to real improvements in your applications and service.

Try to think of how you might augment logs you already have to give you more knowledge about your service & the behavior or your customers. Getting the application changes in place to collect this data is one thing, doing something valuable with the data once you have it is another. You can start small by adding a little detail at a time to the logs & then reporting on that information to show value. Once you can prove that this detail in the logs has business value it’ll be much easier to justify adding more.

Good Luck!

We love us some metrics. We pull data from all kinds of places including our monitoring system agent, from log data, from mbeans & from client timers. We pull all this stuff together into a variety of tools and slice and dice it into pretty graphs & ultimately turn it into knowledge. You really cannot collect too much data about your service and we could talk all day about all the ways we watch things today and how we want to watch things in the future. But, since we don’t have all day, I wanted to focus this post on per-request logging
Why per-request logging?
There are a variety of ways to observe the performance of your systems & applications. Most folks are familiar with what we would call aggregate metrics. Aggregate metrics represent a group of measurements over time. This could be load average over the last 5 minutes, the number of transactions in the last minute, or the number of seconds spent waiting for the database over some period of time. Aggregate metrics are good because they are usually lightweight and easy to collect and they provide you with a view into changing trends. They are great for plugging into tools like Cacti or Nagios for trending and alerting.
However, once you identify a trend change, how do you get more detail about that change & understand why it occurred? This is where per-request metrics come in. As you’ll see in this post, per-request metrics let you understand specifically what types of activity drove changes in a trend & allow you to get very specific about where you spend your time diagnosing your system. They also provide you with forensics data to use after the fact so that you can reproduce what led up to a problem, which is sometimes the only way to reproduce & fix it.
It all starts with your log format.
At Rally we have a pretty verbose log format to track requests. Below is a sample of what we log for every request that goes through our system. This isn’t 100% of the data we log by any means, this is just what is logged for each request/response transaction. Any errors, exceptions, or other data points are logged separately but are often identified in this message. For example, if a particular request triggers an exception, that information will be included in this log message so that we can tie the exception back to the request that generated it.
Note that we log this message at the completion of a request so we have detail about the entire transaction when this message is logged. Be aware that if your application has a problem during a transaction and fails to complete a request it may not generate a log message – logging some detail at the start of message processing helps identify those cases (but I haven’t included an example of that here).
Important Note: Rally takes the security of customer data very seriously. One thing our logs do not contain is any information about the data in a customers account. We purposefully exclude anything which might be confidential from the logs so that we can safely allow our Operations & Engineering staff to view these logs for diagnostic purposes. Keep this in mind when deciding on a log format.
The request section
<request timestamp=”2011-10-05 14:24:10.841″
sessionID=”host02103leaaiinaxh1ksh3lolqg9pq.host02″
httpStatus=”200″
requestID=”host02103leaaiinaxh1ksh3lolqg9pq.host0240537943″
subscriptionID=”100″
subscriptionOID=”400059
userOID=”3670880549″
workspaceOid=”41529001″
projectOid=”154487468″
user-agent=”Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.81 Safari/535.2″
method=”POST” server=”rally1.rallydev.com”
uri=”/slm/porthole/detail.sp”
queryString=”"
remoteHost=”192.168.1.1″
realRemoteHost=”192.168.1.1″
rallyProxy=”" build=”32″ version=”2011.09.24″
oid=”4324499112″
hash=”/detail/userstory/4324499112″
detail-view=”/ar/view.sp”>
The response section
<response
responseTime=”73″
cpuTime=”40.0″
blockedCount=”14″ blockedTime=”0″
waitedCount=”14″ waitedTime=”1″
heapAllocated=”7408456″
totalBytesWritten=”13646″/>
<queries
executedStatements=”18″
totalDbTime=”43.66″
objectsBuilt=”63″
cacheHits=”74″ cacheMiss=”1″
totalQueryCount=”92″/>
<profOps totalTime=”43.40″>
<profOp opName=”register” count=”2″ time=”0.16″/>
<profOp opName=”transactions” count=”4″ time=”0.44″/>
<profOp opName=”connect” count=”18″ time=”0.22″/>
<profOp opName=”UnitOfWorkCommits” count=”2″ time=”0.00″/>
<profOp opName=”cache” count=”72″ time=”0.13″/>
<profOp opName=”sql prepare” count=”18″ time=”1.72″/>
<profOp opName=”object building” count=”63″ time=”0.52″/>
<profOp opName=”row fetch” count=”18″ time=”1.87″/>
<profOp opName=”sql execute” count=”18″ time=”37.78″/>
</profOps>
</response>
</request>
Instead of focusing on what each value means, I’d like to focus on what you can do with this type of detail. You’ll notice in the request section we have “sessionID”, this allows us to tie all the requests in a single session together. Our customers will sometimes access our tool from multiple tabs or multiple browsers – this helps us differentiate between those different sessions. You’ll also notice we have values like the “uri”, “queryString”, etc – these help us identify what exactly the user was doing. We don’t have 100% of the picture without the users actual data but this is good enough most of the time to understand what was going on.
Creating a click-stream.
One of the first things we can do with logging like this is create a click stream like this:
2011-11-02 07:26:10.903 : GET : /slm/rally.sp
2011-11-02 07:26:12.684 : GET : /slm/quality/data.sp
2011-11-02 07:26:13.081 : GET : /slm/webservice/x/project/729766.js
2011-11-02 07:26:13.353 : GET : /slm/navigationModel.sp
2011-11-02 07:26:13.864 : POST : /slm/pjt/recentJson.sp
2011-11-02 07:26:14.029 : GET : /slm/pjt/build.sp
2011-11-02 07:26:13.899 : GET : /slm/webservice/1.27/TypeDefinition.js
2011-11-02 07:26:14.179 : POST : /slm/defaultDashboard.sp
2011-11-02 07:26:14.476 : GET : /slm/webservice/1.27/Workspace.js
This shows us the sequence of events a user followed from beginning to end. We can include a lot of detail about each request depending on what we’re trying to diagnose. If we’re trying to understand why a customer generated an exception, we can reproduce their activity. If we are trying to understand why a customer is complaining of slow performance we can look at the timing information on each request. If we are trying to perform forensics for a customer who doesn’t understand how their data ended up the way it did, we can look at the sequence of events and help them understand that.
There is all kinds of value in being able to reproduce the metrics around a specific users session and it’s impossible to do without this type of log detail.
Monitoring System impact
One of the things we are constantly trying to optimize is heap usage. This ties directly to the performance of our application. You’ll notice a section in the above log data where we log heapAllocated. In this way, we can understand how much heap is being allocated per-request and whether that changes based on the customers account metrics. Do we allocate more heap for certain requests for customers who have more artifacts (stories, defects, etc)? What specific requests allocate the largest amount of heap (we’ve discovered some real monsters asking this question)?
We can add heapAllocated to that same clickstream along with response time for each request. The first column after the URI is the response time in ms, the 2nd column is the heap allocated. You can start to see a correlation between response time & heap allocated. Not surprising, but you can see how this could be valuable.
2011-11-02 07:26:10.903 : GET : /slm/rally.sp : 40 : 1371256
2011-11-02 07:26:12.684 : GET : /slm/quality/data.sp : 144 : 9296648
2011-11-02 07:26:13.081 : GET : /slm/webservice/x/project/729766.js : 103 : 4343616
2011-11-02 07:26:13.353 : GET : /slm/navigationModel.sp : 167 : 7927536
2011-11-02 07:26:13.864 : POST : /slm/pjt/recentJson.sp : 8 : 1902968
2011-11-02 07:26:14.029 : GET : /slm/pjt/build.sp : 16 : 1881680
2011-11-02 07:26:13.899 : GET : /slm/webservice/1.27/TypeDefinition.js : 237 : 11227208
2011-11-02 07:26:14.179 : POST : /slm/defaultDashboard.sp : 12 : 1903088
2011-11-02 07:26:14.476 : GET : /slm/webservice/1.27/Workspace.js : 57 : 2757768
We can also understand precisely where the system is slow. A customer clicks on their Backlog page and it takes 5 seconds to load – why? Looking at the log data maybe we see that the total responseTime was 4.5 seconds – that’s pretty darn slow. We see that totalDbTime was 4.45 seconds – most of our time was spent waiting for the db, but why? We can look at executedStatements and see that we performed 100 SQL queries for this one request and our “sql execute” time was over 4.3 seconds. Time to look at some SQL tuning.
Identifying excessive transactions
Ever have an event where your request rate went way up and you needed to figure out who was driving that load and why? I suspect anyone who offers an API to their customers has this problem & Rally is no different.

We love us some metrics. We pull data from all kinds of places including our monitoring system agent, from log data, from mbeans & from client timers. We pull all this stuff together into a variety of tools and slice and dice it into pretty graphs & ultimately turn it into knowledge about our service & our customers. You really cannot collect too much data about your service and we could talk all day about all the ways we watch things. But, since we don’t have all day, I wanted to focus this post on the foundation of much of our analysis – per-request logging

Why per-request logging?

There are a variety of ways to observe the performance of your systems & applications. Most folks are familiar with what we would call aggregate metrics. Aggregate metrics represent a group of measurements over time. This could be load average over the last 5 minutes, the number of transactions in the last minute, or the number of seconds spent waiting for the database over some period of time. Aggregate metrics are good because they are usually lightweight and easy to collect and they provide you with a view into changing trends. They are great for plugging into tools like Cacti or Nagios for trending and alerting.

However, once you identify a trend change, how do you get more detail about that change & understand why it occurred? This is where per-request metrics come in. Per-request metrics let you understand specifically what types of activity drove changes in a trend & allow you to get very specific about where you spend your time diagnosing your system. They also provide you with forensics data to use after the fact so that you can reproduce what led up to a problem, which is sometimes the only way to reproduce & fix it.

It all starts with your log format

There are endless different ways to log information about your system. We recommend you pick just one & standardize on that across the organization. Having a standard log format makes processing those logs a lot easier & makes sure that everyone is including similar metrics. The actual format of the logs isn’t super critical as long as it meets a few criteria:

  • Machine parseable. It should be trivial for your tools to extract fields out of your log data – ideally directly into a data structure you can reference or easily store for future use. JSON (if you are cool) & XML (if you enjoy pain) are reasonable candidates for this but there are plenty of other formats.
  • Precise timestamps. Transactions happen very fast and you want to be able to sequence events & measure the time taken. Your logs should at least have millisecond timing in the timestamps. If you can get microsecond timing that is even better.
  • Make it easy to extend. You don’t want to have to re-write your log processing tools every time you need to add a field to your log. Make this process easy to handle. This is partly to do with your log format and part to do with the tools you use.
  • Document it! Don’t complain that the guys building your new widget aren’t following an undocumented convention – document it, make it a standard.

At Rally we have a pretty verbose log format to track requests. We want to log any detail the application is aware of that is reasonable to log on a per-request basis. It’s important to remember that you will often want to go back and look at the data from a variety of perspectives, so just because you aren’t using the data today does not mean you shouldn’t log it. Defining a log

Important Note: Rally takes the security of customer data very seriously. One thing our logs do not contain is any information about the data in a customer’s account. We intentionally exclude anything which might be confidential from the logs so that we can safely allow our Operations & Engineering staff to view these logs for diagnostic purposes. Keep this in mind when deciding on a log implementation.

What details should you log?

In general there are a few categories of information you want to include in every log message. Within each of these categories you may choose to log more or less data – it depends a lot on what type of application you are creating and what data is relevant to you. This is based on what Rally has found useful.

Session information

You want to be able to find a series of requests from a single session. For this we include detail about the session such as a sessionID & timestamp. In this way we can understand all the requests that were made in sequence. You can include whatever details here are relevant but the most important part is that you are able to construct a series of events that belong to a single session or a group of sessions that have something in common.

Request Details

You want to understand what the user is interacting with in your system so you should be including any details relevant to the request. This could include the URI, GET/POST parameters (remember, keep confidential data out of logs) as well as possibly details about what hostname they made the request to, what protocol was used, etc. Add anything here that helps you identify what the user was trying to do when they made the request.

Response Detail

In the next post we’ll talk more about the details in a response but in general we want to include information about the success/failure of the request (did it generate a successful response or not). At Rally this is also where we include a lot of detail about timing, which we will expand on next time.

Gotchas – log the request ASAP

One of the problems we have run into is that if you wait for the completion of a request/response before you log data about the request there is a potential for data loss. What happens if your application fails after receiving a request but before sending a response? This hasn’t been a huge problem, the Rally application is pretty stable, but it has been a problem from time to time. We have chosen to address this by logging both at the start of a request as well as once the request is satisfied by a response. This assures that if something bad happens during the request we still have information about what request was in flight when that problem occurred and helps us to diagnose that issue.

More in the next post

In the next post we’ll talk more about the detail you can capture in the response and how you could use that – this is where using your log data for analytics becomes really interesting.

Ever feel like your Ops team just doesn’t understand what’s going on in Development? Ever feel like your Development team is just “tossing stuff over the wall”? I’ve felt like that plenty in the past. There are all kinds of strategies to battle this, this post will talk about some of the things we do at Rally to keep things moving smoothly.

Some of these are dedicated meetings and some are just Ops being involved in the regular Development process. In general the idea is to try to have Operations and Engineering in the same room as much as possible when decisions are being made.

Stand Up Integration – Daily, 5-15 minutes

We have many different Dev teams working on a variety projects at the same time. That makes for a big challenge staying in touch with all those projects in Ops. One of the most effective ways that we have found to get visibility into what is going on in Dev is to attend their daily standups. This doesn’t require a ton of our time, usually 5-15 minutes per team, and has a huge benefit.

We try to have one person in Ops assigned to a particular stand-up and work with Development to stagger their stand-ups so that we can attend them. Ideally the same person in Ops goes to the stand-up each day.

A few things attending standup each day buys us:

  • Get to hear about new changes as they are being concieved & built.
  • Ability to gather feedback from Ops & Dev each day and keep both teams in sync as ideas evolve.
  • As issues come up in production for a given team they can be brought to the Dev team directly by someone from Ops who is familiar with their projects because they attend the stand-up each day.
  • Establish a relationship where the designated Ops person works as part of the Dev team. They aren’t necessarily responsible for all the tasks that Dev team needs, but they are an easy and available point of contact.

This doesn’t solve 100% of your problems, but it does help to open up communication & involvement between the teams and goes a long way toward getting things moving smoothly.

SoS (Scrum of Scrums) – Daily, 5-10 minutes

This typically happens following most of the individual team stand-ups. This is a very quick meeting for teams to raise any cross-team dependency issues. If Ops had a production issue the night before, we let everyone know at SoS. If there are problems that are blocking another team and they need help, they get raised at SoS. If there’s an important demo going on and we need to be extra vigilant, it gets raised at SoS.

SoS is a fantastic opportunity to make everyone aware of important current events. If there are issues that are raised which require additional discussion we typically have folks stay behind who want to discuss and that meeting happens immediately following SoS.

Dev/Ops Sync – Weekly, 1 hour

Once a week we pull the Operations & Engineering teams together for a meeting we call “Dev/Ops Sync”. This meeting has evolved over time but the basic premise & agenda is something like this:

  • Discuss changes in this weeks release, what features are being toggled on, what does Ops need to be aware of with the upcoming release.
  • Talk about any open issues we have for the Dev team. This could be defects we have created which are impacting production, new issues that have been discovered since the last meeting, or areas of concern where we want input from Developers to help steer our actions.
  • Make Development aware of any outages & share the associated PER document.
  • Talk about any issues Development needs to bring to Ops. Some of these issues are headed off by Ops attending stand-ups but sometimes there are larger issues that need to be discussed.

We try to have someone from each of the Engineering teams represented here which helps facilitate making sure we can cover any questions that come up for any of the teams.

Architecture Council – Weekly, 1 hour

This meeting isn’t strictly for keeping Dev and Ops in sync and more often than not it’s pretty Dev focused, but Operations involvement in this meeting is always significant. The Architecture Council at Rally is where our larger architectural issues are raised, discussed, and assigned to teams for prioritizing the work. This team is intended to make sure there is a forum for raising issues that could cause us problems down the road, for reviewing the design of new features and components being built, and to provide a way to keep all teams in sync around what important issues are being worked on.

Sometimes this team meeting is where decisions get made, sometimes it’s just to review what decisions are being made, but it’s one of the areas Dev & Ops come together to get consensus on how to move forward.

Retro & Demo – Bi-Weekly, 1 hour

Again, not strictly for the benefit of Ops, but these are two meetings where Dev and Ops come together to talk about how things are going. Every week on Friday we have either an Engineering Demo or a Retrospective. Both of these are opportunities to share what work is being done and what’s working or what isn’t.

The Engineering Demo

This is where our Development Team shows off what they’ve been working on. Sometimes it’s Will showing us some awesome new menu’s, other times it’s Jeff showing us an app that says “Hello World”. Either way, Dev & Ops get to ask questions and give feedback on work in progress, hang out and relax, and have fun in the process.

The Retrospective

Our Bi-Weekly retro’s are where each team reflects on what worked and what didn’t over the previous 2 week period and share those items that they think are important for the other teams. This is a great opportunity for Ops to raise examples of things that worked well and to call out areas where we need help. It is also a chance for any of the teams to raise items that aren’t working well and get those items assigned to an owner so they get fixed. It’s not perfect, some things take time to address, but if something continues to be a problem it continues to be raised here and usually gets fixed sooner than later.

Do what works for you

These are the things that are working for Rally today, but we change all the time and try new things. You may not need all of these meetings to have a successful Dev/Ops culture but hopefully some of these ideas help. The most important part of all of this is to foster communication and involvement by both teams – making team members accessible & constantly asking “what can we do better?”.

In Part I & Part II we talked about the overall structure of the PER process & the information we collect. In this post we will talk about the meeting we use to wrap all this up. Any time we have a service impacting event we have a PER meeting to discuss the event and gather all that information we discussed in Part II. Read on for details about that meeting.

A note about facilitating a PER meeting

At Rally we are big fans of meeting facilitators and we even have a group of people who have been trained to facilitate meetings for teams they aren’t on. Often meetings at Rally aren’t run by a manager or person on the team but rather by a facilitator who is asked to come in and facilitate. The facilitator can have a large impact on the outcome of the meeting, both in the quality of the result as well as whether or not you finish on time. We could write a whole post on meeting facilitation but here are a few things to keep in mind when you facilitate a PER meeting:

  • Avoid coming up with a solution to the problem. The PER meeting stops at identifying what the next action for a given delta is.
  • The Facilitator should represent the group. This is why a 3rd party facilitator is good – they don’t have an opinion about what is being discussed. As a member of the team, make sure you get the groups permission when you add something to the document & allow the group to drive the decisions. Note: this is hard.
  • Control cross-talk. Discussions like this can get pretty passionate and folks will start talking over each other – as the facilitator you need to make sure everyone has a chance to talk. Sometimes your most timid members will have great ideas – you want to hear those.
  • Keep it on time. These meetings are draining if they run too long – if you need to do some research & come back for a 2nd meeting do that, don’t hold everyone in a meeting trying to get answers you don’t have.

Scheduling the PER meeting

We have a rule that PER meetings must be held within 24 hours of an event to make sure things are fresh in folks minds. No, we don’t schedule them on weekends for end of the week events – we do them on Monday. For most events we will schedule 1 hour and we can typically wrap up a PER in that time. For larger teams you may need a longer meeting and for any large event where you have multiple teams you may have to extend the meeting.

Who should attend?

The PER meeting should include anyone who was involved in the event or needs to be present to come up with corrective actions. For problems where the component impacted is maintained by Operations, we will typically keep the meeting constrained to Ops, but for events where an application problem was involved we include a broader audience.

Folks at Rally who might get called into a PER meeting include

  • Operations – Familiar with the event and typically the ones who responded to it.
  • Development – Familiar with the applications & service as well as what might be tweaked to improve things.
  • Product Owner – Able to prioritize important changes within Development
  • Customer Support – Responsible for communication with customers during an event

For significant enough events you might even have folks from Marketing & Legal involved. Let’s hope that’s not the case – but don’t exclude anyone who would be able to help improve things.

Outcome of the PER Meeting

When we fire up a PER meeting we have a few specific goals for that meeting:

  • Make sure we have an accurate & complete timeline of the event.
  • Identify the most complete list of Pluses & Deltas possible.
  • Identify Corrective actions and assign ownership to someone.
  • Agree on the overall event metrics

That’s it. Now granted, getting those goals completed can be a long road sometimes. In the documents linked below we have included a detailed meeting agenda as well as the items that we cover for meeting preparation. I would encourage you to have a look at those documents to get an idea of what items we cover in the meeting & what the structure of the meeting is – there’s not a lot of reason to cover that here.

Wrapping Up

For folks who may not already be doing this we encourage you to find your own balance in this process. The best process is one that you actually use. Do not create a process that you avoid because it’s too much work or too heavyweight.

We’ve made our own documents around this process public to help you with some examples. You are welcome to copy as much of these as you would like & make them your own.

Post Event Retrospective Process

Post Event Retrospective Template

I appreciate you sticking with us to this point – if you have feedback on this process or ways that you’ve found which work in your environment please share them in the comments – we want to know what others are doing too.

In Part I of this post I talked about the overall structure and goals of our PER process. I know I didn’t get into a ton of detail in that post but this is a deep topic and I get sleepy when I read long blog posts. This post will go into a lot more detail & cover the specific areas of the PER document & process to help you understand better what we’re trying to accomplish.

Keep in mind that there is a specific order to these sections. The whole point of this process is to prompt questions. We want people asking “Why?” as much as possible – this is what leads to answers and ideally, eventually, leads to improvement. All of the items below lead up to a “PER Meeting” where we gather any missing detail, discuss the event & come up with corrective actions. We’ll discuss that meeting in detail in Part III but for now you should know that the PER meeting is where this is all headed and I’ll refer to it in a few places.

So lets jump into the sections starting with the Timeline and get going!

Timeline

For any event the first step is to start collecting the list of individual events/actions/observations surrounding what you are trying to understand. We gather this detail from a variety of sources including logs, our monitoring system, graphs, email, chat, etc. We have instrumented all of these things to help us with this collection process by making sure timestamps exist and are accurate. The goal is to have a complete timeline of everything that happened which led up to the event (even if you aren’t sure it contributed), during the event, and after the event.

We spend a lot of time trying to make sure we capture everything we can at this step because leaving something out means possibly missing an area of improvement. We assign one individual to perform the research to complete the timeline based on data we have in our systems. This person is called the “PER Owner” and can be anyone on the team capable of looking through systems to gather data about the event. Most often this starts with digging through our systems and collecting data and then leads into asking individuals for details about what they did. We want to collect as much information as possible before we bring everyone into a room for a meeting but we expect there will still be things that only come out at the PER meeting.

Before the PER meeting this timeline & other info is put into a shared document and distributed to the team for review. Other team members add items they think are missing & comment on the content of the timeline before the meeting. The goal is to collect as much data in advance of a PER meeting as possible so the meeting can focus on the “Pluses/Deltas” and “Corrective Actions” sections.

At the PER meeting, the second agenda item (after an overview to get everyone on the same page) is to have everyone look over the timeline and agree as a group that it appears complete. If there are things we need to fill in, we note that at this time. If we need to go back and collect more data we can do that – but we want everyone to agree the timeline is complete.

Pluses and Deltas

This is a typical retrospective concept where you review what went right and what did not in your process. For those things that didn’t go well (Deltas) we want to capture corrective actions and make sure we get those done. For things that went right (Pluses) we want to do more of that in the future.

Deltas

Deltas are anything that didn’t go according to plan or had the potential to cause problems. Deltas are areas where reality did not meet your expectations within the context of this event (sorry, no complaints about your personal objections to reality). What we ask people to do is review the timeline, think about the event and ask themselves some questions:

  • Where did things not go the way they should?
  • Where were there potential for problems, should this happen again?
  • What could have prevented this event in the first place?
  • Where did our response to the event not meet our expectations?
  • What could we do better?

We want to capture not only those obvious problems like “We wasted a bunch of time trying to fix xyz when the problem was actually abc” but also the less obvious ones like “We got lucky, but xyz could have happened and we would have been screwed”. We want to know anything and everything about this event that didn’t meet our expectations.

This part of the meeting usually consumes a large amount of time because this is when people are talking about the things that didn’t work and brainstorming a bit about what actually might address the issues. We try to avoid troubleshooting & solving the problem but if we come up with things that need to change we note those as a Corrective Action. We also review all the deltas when we talk about Corrective Actions & Owners. Deltas that are created are turned into stories or defects in Rally and are tracked in our backlog and prioritized during our weekly planning sessions.

Pluses

This is the opportunity to identify what worked well. Since we use the PER process to review successful maintenance as well as outages and unsuccessful maintenance it’s important to review both and talk about those things that went right so that you reinforce doing the right things. Sometimes you’ll be surprised at what works during maintenance. If you don’t call out those things that work well, others may not recognize them and repeat them in the future. Some of these may even turn into actions to make sure they get done next time.

Corrective Actions & Owners

These are the actionable items we want to do in response to our analysis. Typically for any Delta we want to identify what it is that contributed to that Delta & we want to fix it in some way. This can mean that we want to fix a specific technical problem, we want to modify a process, or we want to just investigate something further. The point is not to be perfect, but the point is to improve upon what we already have using what we have learned from this event.

Assigning owners is an important part of this step as well. Before your PER meeting closes you want to make sure you have owners identified who know that they own these tasks. These owners aren’t necessarily responsible for doing the work – but they do own the task of bringing this corrective action to completion. If you don’t assign owners, make sure you have some process that allows you to review these actions in the near future and get those owners assigned.

Tracking Corrective Actions

One of the things we wanted to avoid in this process is creating a new place to track tasks. We use the Rally tool to track all of our stories and defects. Once we identify corrective actions & owners as part of the PER process we create stories and defects for each action. The Rally tool allows you to create a parent story and then create child stories. We create a parent story for the overall event and then each action becomes a story or defect that is a child of the parent.

We use stories to track things that should be improved but didn’t directly contribute to an outage. We use defects to track those items which directly contributed to an outage and need to be fixed more urgently. We prioritize the stories & defects and they get woven into our normal Kanban process and are pulled in based on priority. During regular planning we review our backlog and re-prioritize what needs to be done and some things may hang around for a while but if you get repeated problems related to the same story not being completed you can raise the priority on that.

Overall Event Metrics

One of the success criteria for a PER meeting is that you should walk away with details about the duration of the event and some metrics around your response time. These metrics are really important because they help you see where you are or are not meeting the business’ expectations. Here are the things we expect to come out of a PER document in terms of metrics:

  • Event Severity – How severe was this issue? This directly relates to the impact on the service & customers.
  • Total Downtime – How long were customers unable to use the service to any degree?
  • Time to Detect – How long did it take for us or our systems to know there was a problem?
  • Time to Resolve – How long after we knew there was a problem did it take for us to restore service?

These metrics allow us to report back to the business metrics about our performance. Some of these metrics come from group consensus reviewing the timeline and others come from places like monitoring where it’s evident how long the outage occurred. The more automated your data around these metrics, the less discussion you have to have around them.

What’s missing? No root cause.

One of the things that many folks will call out as missing from this list is a “root cause”. This is a very common thing to have in post-mortems and is the generally accepted “outcome” of a post-mortem. In our case we prefer the view that, more often than not, there are multiple contributors to an issue and we want to evaluate & fix all of those – not just the one that we think is “most responsible” for an outage or event. We prioritize all the issue of course and the idea is that the strongest contributors get addressed first – but the root cause isn’t what we are after. Different strokes for different folks – if you want to identify a root cause then by all means do it, we just don’t.

Next up – the Meeting

In Part III we’ll talk about the format of the PER meeting & how we wrap up this process and get everything done. Hang in there, we’re almost done.

Next – Post Event Retrospective – Part III

One of the main objectives at Rally is continuous improvement – we strive to learn from what does and doesn’t work and to make things better each week. Things are always changing and you aren’t likely to ever get to perfect, but without striving for continuous improvement you can quickly fall behind.·
A tool we use for this is our “Post Event Retrospective” or PER for short. Many companies use a post-mortem process to capture & understand outages, this is nothing new. Our process isn’t revolutionary but the intent was to design something that we would use and allows us to either apply a lightweight tracking process to an event or to have a full blown meeting to understand all the details. We have some criteria for what process is used when, and we’ll talk about that a little here.
The reason we call ours a “Post Event Retrospective” instead of a “Post Mortem” is that we want to use this process even if nothing goes wrong. We perform planned maintenance all the time and learning from the events that go poorly is important, but so too is learning from those events which go well. When things go well it’s usually for a reason, and if it’s because of luck you need to identify and correct that. This process works as well for those events that go right as it does for events that go wrong.
Blameless – This is important!
An important characteristic of this process is that it is not focused on “who”. We do need to know who was responsible for specific actions so that we know who to ask for details from but the goal is never to associate a specific individual or group with an event’s success or failure. We focus on what process and automation we can put in place to prevent problems. We also ask what led a person to believe what they did was the right choice. Rarely does someone intend to do the wrong thing.
Blame only leads to fear and fear leads to inaction. We want to move fast, not throw anchors into the mud.
The PER Overview
Here are the basic elements of the PER process – I’ll talk about each in detail below
* Timeline
* Pluses & Deltas
* Corrective Actions & Owners
* Overall event metrics
* PER Meeting
Timeline
This is your starting point for understanding the event. For any event the first step is to start collecting the list of individual events surrounding what you are trying to understand. We gather this detail from a variety of sources and have instrumented things to help us with this. The goal is to have a complete timeline of everything that happened which led up to the event (even if you aren’t sure it contributed), during the event, and after the event.
We spend a lot of time trying to make sure we capture everything we can at this step because leaving something out means possibly missing an area of improvement. We assign one individual to perform the research to complete the timeline based on data we have in our systems. This person is called the “PER Owner” and can be anyone on the team capable of looking through systems to gather data. Data comes from all over the place including system log, chat logs, monitoring systems, email, etc. We collect data from all these sources into a single timeline.
Before the meeting to discuss an event this timeline & other info is put into a shared document and distributed to the team for review. Other team members add items they think are missing & comment on the content of the timeline before the meeting. The goal is to collect as much data in advance of a PER meeting as possible so the meeting can focus on the “Pluses/Deltas” and “Corrective Actions” sections as much as possible.
At the PER meeting, the second agenda item (after an overview to get everyone on the same page) is to have everyone look over the timeline and agree as a group that it appears complete. If there are things we need to fill in, we note that at this time. If we need to go back and collect more data we can do that – but we want everyone to agree the timeline is complete.
Pluses and Deltas
This is a typical retrospective concept where you review what went well and what did not in your process. For those things that didn’t go well (Deltas) we want to capture corrective actions and make sure we get those done. For things that went well (Pluses) we want to do more of that in the future.
Deltas
Deltas are anything that didn’t go according to plan or had the potential to cause problems. Deltas are areas where reality did not meet your expectations within the context of this event (sorry, no complaints about your personal objections to reality). What we ask people to do is review the timeline, think about the event and ask themselves some questions:
* Where did things no go the way they should?
* Where was there potential for problems should this happen again?
* What could have prevented this event in the first place?
* Where did our response to the event not meet our expectations?
* What can we do better?
We want to capture not only those obvious problems like “We wasted a bunch of time trying to fix xyz when the problem was actually abc” but also the less obvious ones like “We got lucky, but lmn could have happened and we would have been screwed”. We want to know anything and everything about this event that didn’t meet our expectations.
Deltas that are created are turned into stories or defects in Rally and are tracked in our backlog and prioritized during our weekly planning sessions. This part of the meeting usually consumes a large amount of time because this is when people are talking about the things that didn’t work and brainstorming a bit about what actually might address the issues. If we come up with things that need to change we note those as a Corrective Action. We also review all the gaps when we talk about Corrective Actions & Owners.
Pluses
This is the opportunity to identify what worked well. We use this process to review successful maintenance as well as outages and unsuccessful maintenance. It’s important to review both & talk about those things that work so that you reinforce doing the things that work. Sometimes you’ll be surprised, someone will try something new and it’ll work well. If you don’t call out those things that work well, others may not recognize them and repeat them in the future. Some of these may even turn into actions to make sure they get done next time.
Corrective Actions & Owners
These are the actionable items we want to do in response to our analysis. Typically for any Gap we want to identify what it is that contributed to that Gap & we want to fix it in some way. This can mean that we want to fix a specific technical problem, we want to modify a process, or we want to just investigate something further. The point is not to be perfect, but the point is to improve upon what we already have using what we have learned from this event.
Assigning owners is an important part of this step as well. Before your PER meeting closes you want to make sure you have owners identified who know that they own these tasks. These owners aren’t necessarily responsible for doing the work – but they do own the task of bringing this corrective action to completion.
Tracking Corrective Actions
One of the things we wanted to avoid in this process is creating a new place to track tasks. We use the Rally tool to track all of our stories and defects. Once we identify corrective actions & owners as part of the PER process we create stories and defects for each action. The Rally tool allows you to create a parent story and then create child stories. We create a parent story for the overall event and then each action becomes a story or defect that is a child of that.
We use stories to track things that should be improved but didn’t directly contribute to an outage. We use defects to track those items which directly contributed and need to be fixed more urgently. We prioritize the stories & defects and they get woven into our normal Kanban process and are pulled in based on priority. During regular planning we review our backlog and re-prioritize what needs to be done and some things may hang around for a while but if you get repeated problems related to the same story not being completed you can raise the priority on that.
Overall Event Metrics
One of the success criteria for a PER meeting is that you should walk away with details about the duration of the event and some metrics around your response time. These metrics are really important because they help you see where you are or are not meeting the business’ expectations. Here are the things we expect to come out of a PER document in terms of metrics:
* Event Severity – How severe was this issue? This directly relates to the impact on the service & customers.
* Total Downtime – How long were customers unable to use the service to any degree?
* Time to Detect – How long did it take for us or our systems to know there was a problem?
* Time to Resolve – How long after we knew there was a problem did it take for us to restore service?
These metrics allow us to report back to the business metrics about our performance. Some of these metrics come from group consensus reviewing the timeline and others come from places like monitoring where it’s evident how long the outage occurred. The more automated your data around these metrics, the less discussion you have to have around them.
PER Meeting
This all comes together at PER meetings. These are a requirement for any service impacting event – but are optional for any event that we feel warrants a discussion. This meeting is intended to accomplish a few specific goals:
* Make sure we have an accurate & complete timeline of the event.
* Identify the most complete list of Pluses & Deltas possible.
* Identify Corrective actions and assign ownership to someone.
* Agree on the overall event metrics
Meetings must be held within 24 hours of an event to make sure things are fresh in folks minds. No, we don’t schedule them on weekends for end of the week events – we do them on Monday.
Meeting Attendees
The PER meeting should include anyone who was involved in the event or needs to be present to come up with corrective actions. For problems where the component impacted is maintained by Operations only we typically will keep the meeting constrained to Ops, but for events where an application problem was involved we include a broader audience.
Folks at Rally who might get called into a PER meeting include
* Operations – Familiar with the event and typically the ones who responded to it.
* Development – Familiar with the applications & service as well as what might be tweaked to improve things.
* Product Owner – Able to prioritize important changes within Development
* Customer Support – Responsible for communication with customers during an event
For significant enough events you might even have folks from Marketing & Legal involved. Let’s hope that’s not the case – but don’t exclude anyone who would be able to help improve things.
How do you get started?
For folks who may not already be doing this we encourage you to find your own balance in this process. The best process is one that you use and this is no exception. Do not create a process that you avoid because it’s too much work or too heavyweight.
We’ve made our own documents around this process public to help you with some examples. You are welcome to copy as much of these as you would like & make them your own.
Post Event Retrospective Process: https://docs.google.com/a/rallydev.com/document/pub?id=1Q7zIJC99Q2BOK30ouS1Q0GZLHWbv4Juxga0YaxMKS_g
Post Event Retrospective Template: https://docs.google.com/a/rallydev.com/document/pub?id=17fuwzdI6pmMDhEY6yabEMPPFvsP2SWZjz74-sDJiXGE

One of the main objectives at Rally is continuous improvement – we strive to learn from what does and doesn’t work and to make things better each week. Things are always changing and you aren’t likely to ever get to perfect, but without striving for continuous improvement you can quickly fall behind.·

A tool our Operations team uses for this is our “Post Event Retrospective” or “PER” for short. Many companies use a post-mortem process to capture & understand outages, this is nothing new. Our process isn’t revolutionary but the intent was to design something that we would use and allows us to either apply a lightweight tracking process to an event or to have a full blown meeting to understand all the details. We have some criteria for what process is used when, and we’ll talk about that a little here.

We call this a “Post Event Retrospective” instead of a “Post Mortem” because we want to use this process even if there is no outage. We perform planned maintenance all the time and learning from the events that go poorly is important, but also important is learning from those events which go well. When things go well it’s usually for a reason, and if it’s because of luck you need to identify and correct that. This process works as well for those events that go right as it does for events that go wrong.

Blameless – This is important!

A fundamental principle of this process is that it is not focused on assigning blame. We do need to know who was responsible for specific actions so that we know who to ask for details, but the goal is never to associate a specific individual or group with an event’s success or failure. We focus on what process and automation we can put in place to prevent future problems. We also ask what led a person to believe that what they did was the right choice. Rarely does someone intend to do the wrong thing. By focusing away from blame you get candid feedback from everyone on the team as everyone is working toward the same goal – to make sure the service works better in the future.

Blame only leads to fear and fear leads to inaction. We want to move fast, not throw anchors into the mud.

The PER Overview

Here are the basic elements of the PER process – I’ll talk about these in detail in Part II of this post.

Timeline - This is where we capture all the events before, during and after an event. This includes observations as well as actions we took. An accurate timeline is crucial to identifying where things didn’t go as planned.

Pluses & Deltas - These are the things that went well (Pluses) and the things that didn’t (Deltas) related to this event. This can be anything from software crashing (Delta) to having a comfortable couch to sit on while we discuss the issue (Plus). No holds barred & no judgement at this stage – just collect data.

Corrective Actions & Owners - These are the actions we are taking to improve the way future events play out. These are ultimately the tasks which, once complete, should lead to improvement so the way we track these and getting them complete is critical.

Overall event metrics - These are the overall metrics for the event. This includes things like Time to Detect, Time to Repair and Total Downtime. These metrics feed back into the business to measure how well we do during events & how we improve over time.

PER Meeting

This is the meeting that brings all of this together into a team discussion to make sure we are all on the same page. We make sure the Timeline is accurate, make sure we have collected all our Pluses and Deltas and we define the Corrective Actions & Owners. We will talk about the PER Meeting in detail in Part III.

In Part II we will go over all the aspects of the data that we collect, why each step is important & how we turn that into action. In Part III we will cover the actual PER meeting which is where we pull all the members together & talk about what happened, collect details & come up with a plan.

Next – Post Event Retrospective – Part II