One of the main objectives at Rally is continuous improvement – we strive to learn from what does and doesn’t work and to make things better each week. Things are always changing and you aren’t likely to ever get to perfect, but without striving for continuous improvement you can quickly fall behind.
A tool we use for this is our “Post Event Retrospective”, or PER for short. Many companies use a post-mortem process to capture & understand outages, so this is nothing new. Our process isn’t revolutionary, but the intent was to design something that we would actually use and that lets us either apply a lightweight tracking process to an event or hold a full-blown meeting to understand all the details. We have some criteria for what process is used when, and we’ll talk about that a little here.
The reason we call ours a “Post Event Retrospective” instead of a “Post Mortem” is that we want to use this process even if nothing goes wrong. We perform planned maintenance all the time and learning from the events that go poorly is important, but so too is learning from those events which go well. When things go well it’s usually for a reason, and if it’s because of luck you need to identify and correct that. This process works as well for those events that go right as it does for events that go wrong.
Blameless – This is important!
An important characteristic of this process is that it is not focused on “who”. We do need to know who was responsible for specific actions so that we know who to ask for details, but the goal is never to associate a specific individual or group with an event’s success or failure. We focus on what process and automation we can put in place to prevent problems. We also ask what led a person to believe that what they did was the right choice. Rarely does someone intend to do the wrong thing.
Blame only leads to fear and fear leads to inaction. We want to move fast, not throw anchors into the mud.
The PER Overview
Here are the basic elements of the PER process – I’ll talk about each in detail below:
* Timeline
* Pluses & Deltas
* Corrective Actions & Owners
* Overall event metrics
* PER Meeting
Timeline
This is your starting point for understanding the event. For any event the first step is to start collecting the list of individual events surrounding what you are trying to understand. We gather this detail from a variety of sources and have instrumented things to help us with this. The goal is to have a complete timeline of everything that happened which led up to the event (even if you aren’t sure it contributed), during the event, and after the event.
We spend a lot of time trying to make sure we capture everything we can at this step because leaving something out means possibly missing an area of improvement. We assign one individual to perform the research to complete the timeline based on data we have in our systems. This person is called the “PER Owner” and can be anyone on the team capable of looking through systems to gather data. Data comes from all over the place, including system logs, chat logs, monitoring systems, email, etc. We collect data from all these sources into a single timeline.
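Collecting entries from several sources into a single timeline is essentially a merge by timestamp. Here is a minimal sketch of that idea, not our actual tooling: the `TimelineEntry` type, the `build_timeline` helper, the source labels and the sample entries are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from heapq import merge


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    source: str        # hypothetical label such as "chat" or "monitoring"
    description: str


def build_timeline(*sources):
    """Merge per-source entry lists into one chronological timeline.

    Each source is sorted on its own first, so the inputs do not have
    to arrive in order.
    """
    ordered = [sorted(entries, key=lambda e: e.timestamp) for entries in sources]
    return list(merge(*ordered, key=lambda e: e.timestamp))


# Made-up sample data, for illustration only.
chat = [
    TimelineEntry(datetime(2024, 1, 8, 14, 7), "chat",
                  "On-call asks whether anyone else sees errors"),
]
monitoring = [
    TimelineEntry(datetime(2024, 1, 8, 14, 5), "monitoring", "Error-rate alert fires"),
    TimelineEntry(datetime(2024, 1, 8, 14, 40), "monitoring", "Error-rate alert clears"),
]

timeline = build_timeline(chat, monitoring)
for entry in timeline:
    print(f"{entry.timestamp:%H:%M} [{entry.source}] {entry.description}")
```

Keeping the source on every entry makes it easy for reviewers to see where a line came from when they comment on the shared document.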
Before the meeting to discuss an event this timeline & other info is put into a shared document and distributed to the team for review. Other team members add items they think are missing & comment on the content of the timeline before the meeting. The goal is to collect as much data in advance of a PER meeting as possible so the meeting can focus on the “Pluses/Deltas” and “Corrective Actions” sections as much as possible.
At the PER meeting, the second agenda item (after an overview to get everyone on the same page) is to have everyone look over the timeline and agree as a group that it appears complete. If there are things we need to fill in, we note that at this time. If we need to go back and collect more data we can do that – but we want everyone to agree the timeline is complete.
Pluses and Deltas
This is a typical retrospective concept where you review what went well and what did not in your process. For those things that didn’t go well (Deltas) we want to capture corrective actions and make sure we get those done. For things that went well (Pluses) we want to do more of that in the future.
Deltas
Deltas are anything that didn’t go according to plan or had the potential to cause problems. Deltas are areas where reality did not meet your expectations within the context of this event (sorry, no complaints about your personal objections to reality). What we ask people to do is review the timeline, think about the event and ask themselves some questions:
* Where did things not go the way they should?
* Where was there potential for problems should this happen again?
* What could have prevented this event in the first place?
* Where did our response to the event not meet our expectations?
* What can we do better?
We want to capture not only those obvious problems like “We wasted a bunch of time trying to fix xyz when the problem was actually abc” but also the less obvious ones like “We got lucky, but lmn could have happened and we would have been screwed”. We want to know anything and everything about this event that didn’t meet our expectations.
Deltas that are created are turned into stories or defects in Rally, tracked in our backlog and prioritized during our weekly planning sessions. This part of the meeting usually consumes a large amount of time because this is when people are talking about the things that didn’t work and brainstorming a bit about what actually might address the issues. If we come up with things that need to change, we note each one as a Corrective Action. We also review all the Deltas again when we talk about Corrective Actions & Owners.
Pluses
This is the opportunity to identify what worked well. We use this process to review successful maintenance as well as outages and unsuccessful maintenance. It’s important to review both & talk about those things that work so that you reinforce doing the things that work. Sometimes you’ll be surprised: someone will try something new and it’ll work well. If you don’t call out those things that work well, others may not recognize them and repeat them in the future. Some of these may even turn into actions to make sure they get done next time.
Corrective Actions & Owners
These are the actionable items we want to do in response to our analysis. Typically for any Delta we want to identify what contributed to it & fix that in some way. This can mean that we want to fix a specific technical problem, modify a process, or just investigate something further. The point is not to be perfect but to improve upon what we already have using what we have learned from this event.
Assigning owners is an important part of this step as well. Before your PER meeting closes you want to make sure you have owners identified who know that they own these tasks. These owners aren’t necessarily responsible for doing the work – but they do own the task of bringing this corrective action to completion.
Tracking Corrective Actions
One of the things we wanted to avoid in this process is creating a new place to track tasks. We use the Rally tool to track all of our stories and defects. Once we identify corrective actions & owners as part of the PER process we create stories and defects for each action. The Rally tool allows you to create a parent story and then create child stories. We create a parent story for the overall event and then each action becomes a story or defect that is a child of that.
We use stories to track things that should be improved but didn’t directly contribute to an outage. We use defects to track those items which directly contributed and need to be fixed more urgently. We prioritize the stories & defects, and they get woven into our normal Kanban process and pulled in based on priority. During regular planning we review our backlog and re-prioritize what needs to be done. Some things may hang around for a while, but if you get repeated problems related to the same story not being completed, you can raise its priority.
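To make the parent/child structure concrete, here is a small sketch in plain Python. It does not use the Rally tool’s API; the `EventStory`, `WorkItem` and `add_action` names, the owners and the sample actions are hypothetical and only illustrate the story-versus-defect rule described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class ItemType(Enum):
    STORY = "story"    # should be improved, did not directly contribute
    DEFECT = "defect"  # directly contributed, needs a more urgent fix


@dataclass
class WorkItem:
    title: str
    owner: str
    item_type: ItemType


@dataclass
class EventStory:
    """Parent story for the overall event; each action is a child."""
    title: str
    children: list = field(default_factory=list)

    def add_action(self, title, owner, contributed_to_outage):
        item_type = ItemType.DEFECT if contributed_to_outage else ItemType.STORY
        item = WorkItem(title, owner, item_type)
        self.children.append(item)
        return item


# Made-up event and actions, for illustration only.
event = EventStory("PER: database failover during maintenance")
event.add_action("Fix failover script timeout", "alice", contributed_to_outage=True)
event.add_action("Document the rollback steps", "bob", contributed_to_outage=False)

for child in event.children:
    print(f"{child.item_type.value:<6} {child.owner:<6} {child.title}")
```

Every child carries an owner from the start, which mirrors the rule that nobody leaves the PER meeting with an unowned action.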
Overall Event Metrics
One of the success criteria for a PER meeting is that you should walk away with details about the duration of the event and some metrics around your response time. These metrics are really important because they help you see where you are or are not meeting the business’ expectations. Here are the things we expect to come out of a PER document in terms of metrics:
* Event Severity – How severe was this issue? This directly relates to the impact on the service & customers.
* Total Downtime – How long were customers unable to use the service to any degree?
* Time to Detect – How long did it take for us or our systems to know there was a problem?
* Time to Resolve – How long after we knew there was a problem did it take for us to restore service?
These metrics allow us to report back to the business about our performance. Some of these metrics come from group consensus while reviewing the timeline, and others come from places like monitoring where it’s evident how long the outage lasted. The more automated your data around these metrics, the less discussion you have to have around them.
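Once the timeline is agreed on, the time-based metrics fall out of three timestamps. The sketch below assumes a single continuous outage, so that Total Downtime runs from the start of customer impact to the moment service is restored; the `EventMilestones` and `event_metrics` names and the sample times are hypothetical. Event Severity is a judgment call and is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EventMilestones:
    impact_started: datetime     # customers first affected
    detected: datetime           # we or our systems knew there was a problem
    service_restored: datetime   # service back for customers


def event_metrics(m):
    """Derive the time-based PER metrics from agreed timeline milestones."""
    if not (m.impact_started <= m.detected <= m.service_restored):
        raise ValueError("milestones must be in chronological order")
    return {
        "time_to_detect": m.detected - m.impact_started,
        "time_to_resolve": m.service_restored - m.detected,
        # Assumption: one continuous outage window.
        "total_downtime": m.service_restored - m.impact_started,
    }


# Made-up sample times, for illustration only.
milestones = EventMilestones(
    impact_started=datetime(2024, 1, 8, 14, 0),
    detected=datetime(2024, 1, 8, 14, 5),
    service_restored=datetime(2024, 1, 8, 14, 40),
)
metrics = event_metrics(milestones)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

If an event had several separate outage windows, or customers were only partially affected, Total Downtime would need its own measurement rather than this simple subtraction.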
PER Meeting
This all comes together at PER meetings. These are a requirement for any service-impacting event, and are optional for any other event that we feel warrants a discussion. This meeting is intended to accomplish a few specific goals:
* Make sure we have an accurate & complete timeline of the event.
* Identify the most complete list of Pluses & Deltas possible.
* Identify Corrective Actions and assign ownership to someone.
* Agree on the overall event metrics.
Meetings must be held within 24 hours of an event to make sure things are fresh in folks’ minds. No, we don’t schedule them on weekends for end-of-the-week events – we do them on Monday.
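The scheduling rule is simple enough to write down. This sketch assumes the deadline is 24 hours after the event ends and that a deadline landing on a Saturday or Sunday moves to the following Monday at the same time of day; the function name and that time-of-day choice are assumptions, not part of our written process.

```python
from datetime import datetime, timedelta


def per_meeting_deadline(event_end):
    """Latest time to hold the PER meeting for an event ending at event_end."""
    deadline = event_end + timedelta(hours=24)
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    if deadline.weekday() >= 5:
        deadline += timedelta(days=7 - deadline.weekday())
    return deadline


# A Tuesday event is due on Wednesday; a Friday event rolls to Monday.
print(per_meeting_deadline(datetime(2024, 1, 9, 14, 0)))
print(per_meeting_deadline(datetime(2024, 1, 12, 15, 0)))
```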
Meeting Attendees
The PER meeting should include anyone who was involved in the event or needs to be present to come up with corrective actions. For problems where the impacted component is maintained by Operations only, we typically keep the meeting constrained to Ops, but for events where an application problem was involved we include a broader audience.
Folks at Rally who might get called into a PER meeting include:
* Operations – Familiar with the event and typically the ones who responded to it.
* Development – Familiar with the applications & service as well as what might be tweaked to improve things.
* Product Owner – Able to prioritize important changes within Development.
* Customer Support – Responsible for communication with customers during an event.
For significant enough events you might even have folks from Marketing & Legal involved. Let’s hope that’s not the case – but don’t exclude anyone who would be able to help improve things.
How do you get started?
For folks who may not already be doing this we encourage you to find your own balance in this process. The best process is one that you use and this is no exception. Do not create a process that you avoid because it’s too much work or too heavyweight.
We’ve made our own documents around this process public to help you with some examples. You are welcome to copy as much of these as you would like & make them your own.
Post Event Retrospective Process: https://docs.google.com/a/rallydev.com/document/pub?id=1Q7zIJC99Q2BOK30ouS1Q0GZLHWbv4Juxga0YaxMKS_g
Post Event Retrospective Template: https://docs.google.com/a/rallydev.com/document/pub?id=17fuwzdI6pmMDhEY6yabEMPPFvsP2SWZjz74-sDJiXGE
One of the main objectives at Rally is continuous improvement – we strive to learn from what does and doesn’t work and to make things better each week. Things are always changing and you aren’t likely to ever get to perfect, but without striving for continuous improvement you can quickly fall behind.
A tool our Operations team uses for this is our “Post Event Retrospective”, or “PER” for short. Many companies use a post-mortem process to capture & understand outages, so this is nothing new. Our process isn’t revolutionary, but the intent was to design something that we would actually use and that lets us either apply a lightweight tracking process to an event or hold a full-blown meeting to understand all the details. We have some criteria for what process is used when, and we’ll talk about that a little here.
We call this a “Post Event Retrospective” instead of a “Post Mortem” because we want to use this process even if there is no outage. We perform planned maintenance all the time and learning from the events that go poorly is important, but also important is learning from those events which go well. When things go well it’s usually for a reason, and if it’s because of luck you need to identify and correct that. This process works as well for those events that go right as it does for events that go wrong.
Blameless – This is important!
A fundamental principle of this process is that it is not focused on assigning blame. We do need to know who was responsible for specific actions so that we know who to ask for details, but the goal is never to associate a specific individual or group with an event’s success or failure. We focus on what process and automation we can put in place to prevent future problems. We also ask what led a person to believe that what they did was the right choice. Rarely does someone intend to do the wrong thing. By focusing away from blame you get candid feedback from everyone on the team as everyone is working toward the same goal – to make sure the service works better in the future.
Blame only leads to fear and fear leads to inaction. We want to move fast, not throw anchors into the mud.
The PER Overview
Here are the basic elements of the PER process – I’ll talk about these in detail in Part II of this post.
Timeline - This is where we capture all the events before, during and after an event. This includes observations as well as actions we took. An accurate timeline is crucial to identifying where things didn’t go as planned.
Pluses & Deltas - These are the things that went well (Pluses) and the things that didn’t (Deltas) related to this event. This can be anything from software crashing (Delta) to having a comfortable couch to sit on while we discuss the issue (Plus). No holds barred & no judgement at this stage – just collect data.
Corrective Actions & Owners - These are the actions we are taking to improve the way future events play out. These are ultimately the tasks which, once complete, should lead to improvement, so how we track them and get them completed is critical.
Overall event metrics - These are the overall metrics for the event. This includes things like Time to Detect, Time to Resolve and Total Downtime. These metrics feed back into the business to measure how well we do during events & how we improve over time.
PER Meeting
This is the meeting that brings all of this together into a team discussion to make sure we are all on the same page. We make sure the Timeline is accurate, make sure we have collected all our Pluses and Deltas and we define the Corrective Actions & Owners. We will talk about the PER Meeting in detail in Part III.
In Part II we will go over all the aspects of the data that we collect, why each step is important & how we turn that into action. In Part III we will cover the actual PER meeting which is where we pull all the members together & talk about what happened, collect details & come up with a plan.
Interesting to read about how Operations (on the other side of the room) do internal retros on events. It is similar to how we (in dev) run retros but we don’t have as many discrete events to timeline on. Maybe we should try and see how it works out.
Blameless is one of the most important aspects.