We have been using the PER process for a while now, and after running through a number of these reviews we’ve made some changes to it. As with all things at Rally, we rely heavily on inspecting and adapting. Here are some areas where we’ve found we needed to adjust.
Communicating with teams outside those directly involved in the event
We have started bringing PERs into our Dev/Ops sync meetings to raise the visibility of events that occur. After a PER is completed (meaning we’ve discussed it and created all the action-related stories), we present it at the start of the Dev/Ops sync meeting so that the broader team can ask questions or provide feedback. A big part of the PER process is simply raising awareness of the types of issues that impact our service, and this has been an easy way to briefly review these events with the team.
For larger events, in particular those where the system is unavailable, we have distributed the PER document to the teams, and even to the Executive team when necessary, to answer questions about the event. We try to be as transparent as possible about these events and what we learn from them.
Creating and tracking stories from the event
One of the most difficult aspects of establishing a process like this is managing the large number of new stories created out of each PER. We do not focus on a single root cause for an event; instead, we focus on discovering and documenting all the factors that led to the event or could lead to future events. This means a single event can generate a fairly large number of action items, usually 5-10 new stories.
Moving those stories through the process so they get prioritized and completed is a challenge. We have created a “PER View” in Rally where we can see all our parent stories and their children, and we have started reviewing and prioritizing the incomplete items. The good news is that some are already complete; the bad news is that many others are not. We are continuing to evolve our process for reviewing these within Operations and with other teams to raise awareness of incomplete stories.
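To make the idea of reviewing parent stories and their incomplete children concrete, here is a minimal sketch in Python. It is only an illustration: the `Story` class, the `per_view` function and the example story names are hypothetical, and it does not use or represent Rally’s actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Story:
    """A hypothetical, simplified story record (not Rally's real data model)."""
    name: str
    done: bool = False
    children: List["Story"] = field(default_factory=list)


def incomplete_children(parent: Story) -> List[str]:
    """Return the names of a parent story's children that are not done."""
    return [child.name for child in parent.children if not child.done]


def per_view(parents: List[Story]) -> Dict[str, List[str]]:
    """Map each PER parent story to its incomplete children.

    Parents whose children are all complete are left out, so the result
    only lists the work that still needs to be prioritized.
    """
    view = {}
    for parent in parents:
        remaining = incomplete_children(parent)
        if remaining:
            view[parent.name] = remaining
    return view


# Made-up example data, purely for illustration.
outage = Story("PER: example outage", children=[
    Story("Add alerting", done=True),
    Story("Document failover steps"),
])
finished = Story("PER: example slowdown", children=[
    Story("Tune query", done=True),
])
print(per_view([outage, finished]))
```

A listing like this is one way to bring the open items to a review meeting without walking through every completed story.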
When to run a PER
When we first started this new process, we indicated that we would only run a PER for events which caused downtime. We have since expanded our use of the PER to other events, because many events which do not cause downtime still expose areas where we could significantly improve things. Sometimes we get lucky and catch something in manual testing, or the sequence of events is such that production downtime is avoided, but that doesn’t mean we can’t learn from the event.
More recently we have tried to look at an event from the perspective of how much we think we can learn from it, rather than the impact it had on our system or customers. We run a PER on any event that causes downtime without question, but we’re also looking for opportunities to use this process to learn from other events so that we can improve before we have downtime.
Developing muscle memory
After having done this a few times, we’ve noticed an interesting side effect. We’re getting better at recognizing whether an event is likely to lead to a PER, and we start the data collection process during the event itself. This has been immensely helpful in capturing detail without relying on an individual’s memory of the event. It also makes our PER meetings more efficient, because we arrive with a document that already has a nearly complete timeline, deltas and action items. As an example, during a recent event the document was created, the timeline was started, and an email was sent with a link to the PER. Those involved in the event could then immediately begin to contribute to that document as the event occurred.
A big part of this is people getting used to the PER process and having a template ready to go which folks can quickly fill in with details. Making it easy for anyone to start the document and for everyone to contribute to it is key. We use Google Docs for these items, but you can use just about any collaborative document system.
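As a rough illustration of what a ready-to-go template might hold, here is a minimal Python sketch. The three section names (timeline, deltas, action items) come from the description above; everything else, including the `PerDocument` class, its methods and the rendered layout, is an assumption made for the example and not our actual Google Docs template.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PerDocument:
    """A hypothetical in-memory stand-in for a shared PER document."""
    title: str
    timeline: List[str] = field(default_factory=list)
    deltas: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Append a timestamped entry to the timeline while the event is live."""
        self.timeline.append(f"{when:%H:%M} {note}")

    def render(self) -> str:
        """Render the document as plain text, one heading per section."""
        sections = [
            ("Timeline", self.timeline),
            ("Deltas", self.deltas),
            ("Action items", self.action_items),
        ]
        lines = [f"PER: {self.title}", ""]
        for heading, entries in sections:
            lines.append(heading)
            # Empty sections keep a placeholder so contributors see where to write.
            lines.extend(f"- {entry}" for entry in (entries or ["(to be filled in)"]))
            lines.append("")
        return "\n".join(lines)


# Made-up example: start the document as soon as the event is recognized.
doc = PerDocument("example event")
doc.log(datetime(2013, 1, 1, 9, 30), "Alert fired")
print(doc.render())
```

The point of the sketch is the shape, not the tooling: a blank structure that anyone can start and that others can add to while the event is still in progress.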
Making change easy
When the PER process was created, it was intended to be flexible and easy to change. We didn’t invest a lot of effort in creating any complex software or forms; we just created a Google Document. Over time we have modified that document (the template) a number of times as changes made sense. This was trivial to do, anyone could do it, and the document’s history showed what had been changed.
Although the process has changed a bit since it was started, we didn’t need much discussion about those changes. The cost of change was low, and the ability to reverse a change that doesn’t work means there is very little risk in trying new things. This means we didn’t have to think too much about what the best process would be when we started; we just built something that was good enough and started using it, adjusting as we went.
We will continue to evolve this process and we’ll talk about it more as we do, but it seems to be working pretty well for collecting data about an event. The most important aspect of this is to be open to change and to try different ways of approaching things. Collecting ideas about what went wrong is the easy part, though; the hard part is turning those ideas into actual changes to the software, the process or the organization. When we get that part all figured out, expect another blog post with the results.