On the Rally Engineering blog we’ve written many articles on systems’ performance, monitoring, resilience, and recovery. An interesting event just happened at our Denver production facility that pushed our systems to their limits. It helped us answer the question:
How well does production survive in the face of all-out failures?
Maybe inspired by Netflix’s Chaos Monkey, some local wildlife found its way into our server room and literally tore apart our hardware.
Enter the “Rally Chaos Raccoon”
Last weekend the Chaos Raccoon, whom we now call “Cyril”*, chewed his way into our production server room. From security video footage we saw that at first he was cautious and quiet, just hiding behind the Dell 4210s, where it’s warm.
But after a full day of no food Cyril got angry.
He gnawed on the mounting brackets of our UPS and ripped open a SAN unit in use by app-01. When app-01 failed, our backup app-server spun into action and took over the production traffic from app-01. The process all had to happen automatically: when Aaron tried to enter the cage for a manual failover, Cyril attacked and bit his cheek (yes, that cheek).
As Aaron left the server room, seeking medical attention and rabies shots, the Chaos Raccoon ripped into another Dell 4210 and chewed its innards to shreds. Our system smoothly recovered, load-balancing network traffic whilst simultaneously paging the operations team with another warning. When Cyril finally chewed through the UPS, he electrocuted himself to death.
It smelled bad.
In the future we’ll probably test our processes with a “Bad-ass Bear”, “Crazy Coy Carp”, or “Janky Jackalope”. Opening up your production systems to wildlife attack demonstrates confidence in monitoring, recovery, and backup processes. It stretches your failover strategies to their limits. You may think your systems are ready for anything, but when a raccoon attacks there are no rules.
If you’re going to implement your own Chaos Raccoon, we recommend you first deploy an array of recovery tools and test with non-endangered creatures. It’s the organic and ecologically sound option.