Fri 14 Oct 2011
Post Event Retrospective – Part II
In Part I of this post I talked about the overall structure and goals of our PER process. I know I didn’t get into a ton of detail in that post but this is a deep topic and I get sleepy when I read long blog posts. This post will go into a lot more detail & cover the specific areas of the PER document & process to help you understand better what we’re trying to accomplish.
Keep in mind that there is a specific order to these sections. The whole point of this process is to prompt questions. We want people asking “Why?” as much as possible – this is what leads to answers and ideally, eventually, leads to improvement. All of the items below lead up to a “PER Meeting” where we gather any missing detail, discuss the event & come up with corrective actions. We’ll discuss that meeting in detail in Part III but for now you should know that the PER meeting is where this is all headed and I’ll refer to it in a few places.
So let’s jump into the sections, starting with the Timeline!
Timeline
For any event the first step is to start collecting the list of individual events/actions/observations surrounding what you are trying to understand. We gather this detail from a variety of sources including logs, our monitoring system, graphs, email, chat, etc. We have instrumented all of these things to help us with this collection process by making sure timestamps exist and are accurate. The goal is to have a complete timeline of everything that happened which led up to the event (even if you aren’t sure it contributed), during the event, and after the event.
We spend a lot of time trying to make sure we capture everything we can at this step because leaving something out means possibly missing an area of improvement. We assign one individual to perform the research to complete the timeline based on data we have in our systems. This person is called the “PER Owner” and can be anyone on the team capable of looking through systems to gather data about the event. Most often this starts with digging through our systems and collecting data and then leads into asking individuals for details about what they did. We want to collect as much information as possible before we bring everyone into a room for a meeting but we expect there will still be things that only come out at the PER meeting.
Before the PER meeting this timeline & other info is put into a shared document and distributed to the team for review. Other team members add items they think are missing & comment on the content of the timeline before the meeting. The goal is to collect as much data in advance of a PER meeting as possible so the meeting can focus on the “Pluses/Deltas” and “Corrective Actions” sections.
At the PER meeting, the second agenda item (after an overview to get everyone on the same page) is to have everyone look over the timeline and agree as a group that it appears complete. If there are things we need to fill in, we note that at this time. If we need to go back and collect more data we can do that – but we want everyone to agree the timeline is complete.
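The collection step described in this section can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually use: the `TimelineEntry` type, the `build_timeline` function, the source labels and the sample entries are all made up for the example. The one idea it shows is the reason accurate timestamps matter: entries from different systems are normalized to UTC and merged into a single ordered list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    """One observation or action (hypothetical structure for illustration)."""
    when: datetime   # must be timezone-aware
    source: str      # e.g. "monitoring", "chat", "email"
    note: str


def build_timeline(*sources):
    """Merge entries from several sources into one UTC-ordered timeline."""
    merged = []
    for entries in sources:
        for entry in entries:
            if entry.when.tzinfo is None:
                # A naive timestamp cannot be ordered reliably against other systems.
                raise ValueError(f"naive timestamp from {entry.source}: {entry.note}")
            merged.append(
                TimelineEntry(entry.when.astimezone(timezone.utc), entry.source, entry.note)
            )
    return sorted(merged, key=lambda entry: entry.when)


# Made-up sample data: one source logs in UTC, the other in a UTC-7 local time.
local = timezone(timedelta(hours=-7))
monitoring = [
    TimelineEntry(datetime(2011, 10, 3, 14, 5, tzinfo=timezone.utc),
                  "monitoring", "API latency alert fired"),
]
chat = [
    TimelineEntry(datetime(2011, 10, 3, 7, 2, tzinfo=local), "chat", "Config change pushed"),
    TimelineEntry(datetime(2011, 10, 3, 7, 20, tzinfo=local), "chat", "Change rolled back"),
]

timeline = build_timeline(monitoring, chat)
for entry in timeline:
    print(entry.when.isoformat(), entry.source, entry.note)
```

Rejecting naive timestamps is a design choice for the sketch: it is better to find out during collection that a source has no timezone than to argue about ordering in the PER meeting.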
Pluses and Deltas
This is a typical retrospective concept where you review what went right and what did not in your process. For those things that didn’t go well (Deltas) we want to capture corrective actions and make sure we get those done. For things that went right (Pluses) we want to do more of that in the future.
Deltas
Deltas are anything that didn’t go according to plan or had the potential to cause problems. Deltas are areas where reality did not meet your expectations within the context of this event (sorry, no complaints about your personal objections to reality). What we ask people to do is review the timeline, think about the event and ask themselves some questions:
- Where did things not go the way they should?
- Where was there potential for problems, should this happen again?
- What could have prevented this event in the first place?
- Where did our response to the event not meet our expectations?
- What could we do better?
We want to capture not only those obvious problems like “We wasted a bunch of time trying to fix xyz when the problem was actually abc” but also the less obvious ones like “We got lucky, but xyz could have happened and we would have been screwed”. We want to know anything and everything about this event that didn’t meet our expectations.
This part of the meeting usually consumes a large amount of time because this is when people are talking about the things that didn’t work and brainstorming a bit about what actually might address the issues. We try to avoid troubleshooting & solving the problem in the meeting, but if we come up with things that need to change we note those as Corrective Actions. We also review all the deltas when we talk about Corrective Actions & Owners. The corrective actions that come out of these deltas are turned into stories or defects in Rally, tracked in our backlog, and prioritized during our weekly planning sessions.
Pluses
This is the opportunity to identify what worked well. We use the PER process to review successful maintenance as well as outages and unsuccessful maintenance, so it’s important to talk about the things that went right in each case and reinforce doing the right things. Sometimes you’ll be surprised at what works during maintenance. If you don’t call out the things that work well, others may not recognize them or repeat them in the future. Some of these may even turn into actions to make sure they get done next time.
Corrective Actions & Owners
These are the actionable items we want to do in response to our analysis. Typically for any Delta we want to identify what it is that contributed to that Delta & we want to fix it in some way. This can mean that we want to fix a specific technical problem, we want to modify a process, or we want to just investigate something further. The point is not to be perfect; it is to improve on what we already have using what we have learned from this event.
Assigning owners is an important part of this step as well. Before your PER meeting closes you want to make sure you have owners identified who know that they own these tasks. These owners aren’t necessarily responsible for doing the work – but they do own the task of bringing this corrective action to completion. If you don’t assign owners, make sure you have some process that allows you to review these actions in the near future and get those owners assigned.
Tracking Corrective Actions
One of the things we wanted to avoid in this process is creating a new place to track tasks. We use the Rally tool to track all of our stories and defects. Once we identify corrective actions & owners as part of the PER process we create stories and defects for each action. The Rally tool allows you to create a parent story and then create child stories. We create a parent story for the overall event and then each action becomes a story or defect that is a child of the parent.
We use stories to track things that should be improved but didn’t directly contribute to an outage. We use defects to track those items which directly contributed to an outage and need to be fixed more urgently. We prioritize the stories & defects, and they get woven into our normal Kanban process and pulled in based on priority. During regular planning we review our backlog and re-prioritize what needs to be done. Some things may hang around for a while, but if you get repeated problems related to the same story not being completed you can raise its priority.
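To make the parent/child structure concrete, here is a rough sketch in plain Python. It does not use or imitate Rally’s API; the `WorkItem` class, the helper functions, the names and the sample actions are hypothetical. It also assumes two conventions that are mine for the example: a lower priority number means more urgent, and defects sort ahead of stories.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkItem:
    """A story or defect (hypothetical model, not Rally's data model)."""
    title: str
    kind: str                      # "story" or "defect"
    priority: int                  # assumed convention: lower number = more urgent
    owner: Optional[str] = None
    children: List["WorkItem"] = field(default_factory=list)


def add_action(parent, title, contributed_to_outage, priority, owner=None):
    """Attach a corrective action to the parent story for the event."""
    # Items that directly contributed to an outage are defects; the rest are stories.
    kind = "defect" if contributed_to_outage else "story"
    item = WorkItem(title, kind, priority, owner)
    parent.children.append(item)
    return item


def backlog_order(parent):
    """Order child items for planning: defects first, then by priority."""
    return sorted(parent.children, key=lambda item: (item.kind != "defect", item.priority))


def unowned(parent):
    """Actions that still need an owner before the PER meeting closes."""
    return [item for item in parent.children if item.owner is None]


# Made-up example event and actions.
event = WorkItem("PER: example outage", "story", priority=1)
add_action(event, "Fix failover script bug", True, priority=2, owner="alice")
add_action(event, "Document rollback steps", False, priority=1, owner="bob")
add_action(event, "Investigate slow alert delivery", False, priority=3)

for item in backlog_order(event):
    print(item.kind, item.priority, item.title, item.owner or "(no owner)")
print("Needs an owner:", [item.title for item in unowned(event)])
```

The `unowned` check mirrors the rule from the Corrective Actions & Owners section: anything it returns at the end of the meeting needs a follow-up to get an owner assigned.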
Overall Event Metrics
One of the success criteria for a PER meeting is that you should walk away with details about the duration of the event and some metrics around your response time. These metrics are really important because they help you see where you are or are not meeting the business’ expectations. Here are the things we expect to come out of a PER document in terms of metrics:
- Event Severity – How severe was this issue? This directly relates to the impact on the service & customers.
- Total Downtime – How long were customers unable to use the service to any degree?
- Time to Detect – How long did it take for us or our systems to know there was a problem?
- Time to Resolve – How long after we knew there was a problem did it take for us to restore service?
These metrics let us report back to the business on our performance. Some of them come from group consensus while reviewing the timeline and others come from places like monitoring, where it’s evident how long the outage lasted. The more automated your data around these metrics, the less discussion you have to have around them.
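Once the timeline is agreed on, the time-based metrics are simple arithmetic on three timestamps. The sketch below is one way to compute them and rests on simplifying assumptions: the function name, parameter names and sample times are invented, customers are assumed to be affected from the start of the impact until service is restored, and detection is assumed to happen after the impact begins. Event Severity is a judgment call and is not computed here.

```python
from datetime import datetime, timezone


def event_metrics(impact_start, detected, service_restored):
    """Derive the time-based PER metrics from three agreed timeline points."""
    if not (impact_start <= detected <= service_restored):
        raise ValueError("expected impact_start <= detected <= service_restored")
    return {
        # How long it took us or our systems to know there was a problem.
        "time_to_detect": detected - impact_start,
        # How long after we knew about the problem it took to restore service.
        "time_to_resolve": service_restored - detected,
        # Assumes customers were affected for the whole window.
        "total_downtime": service_restored - impact_start,
    }


# Made-up timestamps for illustration.
metrics = event_metrics(
    impact_start=datetime(2011, 10, 3, 14, 2, tzinfo=timezone.utc),
    detected=datetime(2011, 10, 3, 14, 5, tzinfo=timezone.utc),
    service_restored=datetime(2011, 10, 3, 14, 20, tzinfo=timezone.utc),
)
for name, value in metrics.items():
    print(f"{name}: {value.total_seconds() / 60:.0f} minutes")
```

With these sample times the script reports 3 minutes to detect, 15 minutes to resolve and 18 minutes of total downtime.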
What’s missing? No root cause.
One of the things that many folks will call out as missing from this list is a “root cause”. This is a very common thing to have in post-mortems and is the generally accepted “outcome” of a post-mortem. In our case we prefer the view that, more often than not, there are multiple contributors to an issue and we want to evaluate & fix all of those – not just the one that we think is “most responsible” for an outage or event. We prioritize all the issues, of course, and the idea is that the strongest contributors get addressed first – but the root cause isn’t what we are after. Different strokes for different folks – if you want to identify a root cause then by all means do it, we just don’t.
Next up – the Meeting
In Part III we’ll talk about the format of the PER meeting & how we wrap up this process and get everything done. Hang in there, we’re almost done.
