Wed 17 Aug 2011
New Maps of Hell
Once upon a time, my team was engaged in a multi-month firefight. For most of that period, all of engineering was. Some of the decisions that made sense when Rally was a smaller company with smaller customers were coming back to haunt us.
When I say haunt, I’m not talking Casper the friendly ghost here. I’m talking Michael Myers meets that creepy chick from The Ring.
Crashes. Out of memory exceptions. Outages. Not a good scenario for a SaaS product.
That was a scary time. Every time something went down, which was often, we all wondered “for how long?” Would it be long enough for a customer to notice? Would it be long enough to start costing us money? Anxiety was high.
A common practice in agile is the retrospective. At that point, we held a retrospective unlike any I’d ever been a part of. We defined a hypothetical future hell, then mapped steps backward to when it started. At that time, we were on one of the steps on that road. We picked that as a branching point and mapped our way forward to an ideal “heavenly” scenario.
The steps to that ideal scenario helped shape the direction engineering would take for the next several months. Those steps were intended to dig us out of the pit in which we found ourselves. We scrambled and broke our four teams into several smaller teams, most of which were aimed at tackling various scalability-related issues.
Steve Neely wrote a post a while back about how implementing Solr lightened the load on our database. That was just one of several architectural projects taken on at the time, but it was probably the most visible change for customers. A similar effort was undertaken to reduce the number of non-search queries we executed in the database.
Another customer-facing project that fell out of that period was the retirement of Use-Case Mode. For lack of a better explanation, Use-Case Mode was the original version of Rally. By that point it was only used by a small fraction of our customers but it added significant complexity to our codebase and added friction to the development of new features.
Another team implemented session replication. Matt Novinger wrote about it yesterday. This was valuable at the time because if one of our JVMs failed, the users with sessions on that node would not be logged out of the system. They’d seamlessly fail over to another node and continue working.
My team at the time had a job that can at best be described as panning for turds. It’s kind of like panning for gold but you’re looking for, well, turds.
We’d spend hours sifting through heap dumps taken from our load tests, looking for the next memory leak. Just like panning for gold, once we thought we had found and fixed something, we had to have it validated. That meant running the load tests again until the server fell down, a process that took several hours.
Rinse and repeat.
Every time the load test failed, we’d sigh and start again. This was soul-crushing work – until it wasn’t. The first time we came close to running a 24-hour load test, the jmeter instance that was orchestrating the test comically ran out of heap space. Rally one, jmeter zero (actually it was more like Rally one, jmeter two hundred at that point). The next day, we upped the heap allocated to jmeter and we had a successful 24-hour run.
That period of several months was difficult, stressful and occasionally frightening. There were also some tense moments between engineering and the rest of the company. Engineering made a hard decision to significantly decrease our allocation to customer-facing feature work for almost a quarter. While it was a hard sell, our business trusted us enough to make it happen.
Perhaps paradoxically, that stretch was also very good for morale. No one was excited for the map-to-hell exercise. A common sentiment was “why are we doing this when we should be fighting fires?” Much to the surprise of many, myself included, the time was well spent. It gave us a clear path out of hell and made things a little less frightening. Following the steps from that exercise, we boosted the health of our system and of our teams. At each step, confidence in the system grew and the mood in engineering improved.
The individual projects from that time have had longer-reaching morale benefits as well.
Not only did Solr take a huge load off our database, but our customers also no longer faced multi-minute searches, which were not a feature we were proud of.
Deprecating and subsequently eliminating Use-Case Mode allowed us to remove tens of thousands of lines of code and eliminate a great deal of code complexity. We even filmed a little video set to Europe’s “The Final Countdown” as we got ready to make the commit that removed all (most) traces of Use-Case Mode.
As a result of the memory and query work, our system was more stable and, perhaps more importantly, able to fail safely for the most part. Crashes are embarrassing, and that work helped us as engineers to be more proud of our product.
Since then, we’ve shifted to a much healthier balance between feature work and infrastructure work. As we continue to grow, there will always be the need to devote time to scalability and infrastructure. The difference is that we can now be more proactive about exploring and expanding our limits rather than being reactive once we hit them.
(Kudos if you know where I stole the title of this post without Googling it. You have good taste in music.)
