Wed 1 Feb 2012
Handling Interrupts in an Agile Ops Team
One of the things I really enjoy about being in Operations is the variety of work that comes your way. Some of that variety comes from projects in different areas of the system, and some comes from the interrupts that arrive each day. While Agile methods are designed to allow for pivots, handling daily interrupts on the scale that you see in Operations can be very disruptive to the process.
Most organizations have an on-call schedule to handle after-hours issues on a rotating basis, but we have taken that one step further. We wanted to offload much of the daily interrupt-driven work from the folks working through their normal project-oriented stories. With our four-person rotation, this lets the majority of the team focus for 3 weeks at a time while one person handles most of the interrupt-type issues. This lucky person is the “Op of the Week” or OW for short.
A similar role exists on the Development side: the “Dev of the Week” or DW. The DW is the first point of escalation for the OW on application-related issues where we need Dev team assistance.
One area where we have struggled a bit with our OW rotation is making sure everyone agrees on what falls under the responsibility of the OW. As a result, we’ve created a document that defines our working agreements around the OW role. We review this as a team and make adjustments where necessary.
Here’s an overview of the kinds of work we focus the OW on for their given week, and the role we expect them to hold:
Rotation: OW duties rotate every Monday at 12pm (noon). This gives the outgoing OW time to provide an update at SoS (Scrum of Scrums) and to hand over any open items to the new OW.
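To make the rotation rule concrete, here is a minimal sketch of how the current OW could be computed from a fixed roster and the Monday-noon handoff. The roster names, the anchor date, and the function names are hypothetical and not taken from our actual tooling; the sketch also assumes naive local times with no time zone handling.

```python
from datetime import datetime, timedelta

# Hypothetical four-person roster (names are placeholders, not our real team).
ROSTER = ["alice", "bob", "carol", "dave"]

# Assumed anchor: a Monday at 12:00 on which ROSTER[0] started an OW shift.
ANCHOR = datetime(2012, 1, 2, 12, 0)

ONE_WEEK = timedelta(weeks=1)


def op_of_the_week(when, roster=ROSTER, anchor=ANCHOR):
    """Return the roster member holding the OW role at `when`."""
    # Whole weeks elapsed since the anchor handoff; floor division means
    # the role changes exactly at Monday 12:00, not at midnight.
    weeks_elapsed = (when - anchor) // ONE_WEEK
    return roster[weeks_elapsed % len(roster)]


def next_handoff(when, anchor=ANCHOR):
    """Return the datetime of the first handoff strictly after `when`."""
    weeks_elapsed = (when - anchor) // ONE_WEEK
    return anchor + (weeks_elapsed + 1) * ONE_WEEK


if __name__ == "__main__":
    now = datetime(2012, 2, 1, 9, 30)
    print("Current OW:", op_of_the_week(now))
    print("Next handoff:", next_handoff(now))
```

With four people on the roster, each person is OW one week in four, which matches the three weeks of focus time described above.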
General Duties:
- Respond to all critical pages (failed hardware or software, high-load events, etc.).
- Communicate status to team when issues are being worked. Include DW & general Dev team as appropriate.
- File defects when needed for application events or problems that require assistance from DW or Dev team.
- Own issues raised during the OW week through to completion, coordinating additional resources as necessary.
- Attend Scrum of Scrums & provide status on any new Ops issues (customer problems, production outages, etc.).
- Respond to all incoming operations interrupt-type requests (includes all Ops responsibilities, not just prod).
- Remediate critical security issues. Drive required security patches / workarounds to completion.
- Review nightly jobs, including backups, for errors / problems.
We also maintain a calendar of events in this same document that describes the routine daily activities that are part of the OW duties. For example, because we perform weekly deploys to production, the OW performs a pre-production push of the Release Candidate on Thursday evening. On Saturday mornings, when our production release rolls out, the OW is expected to be on high alert for problems.
This process has worked well, with some adjustment over time. It is also not intended to be completely rigid. For example, we certainly have cases where something comes up in a part of our infrastructure where the OW is not the expert. When appropriate, and where both parties agree, other team members are welcome to take ownership of issues if they are passionate about getting them resolved and/or have expertise in that area. We don’t expect team members to do this all the time, though, as being OW also forces some cross-training, which we think is really important.
If you’ve found a good way to offload interrupts or have feedback on what we’ve done, let us know in the comments.


