Our resident browser testing nerd, Matt, started to write this but came down with writer’s block. Since I wrote the “hacktastic rake task” in question, I figured I’d finish. GTFO Matt. :)

Writing effective browser tests is hard. The problem space is significantly different from writing faster, lower-level tests in whatever language you use for testing in the small. From time to time, tests turn out to be what we call “flaky” for various reasons. There are a few categories of flakiness, but the most common problems when browser testing are timing issues and data interplay.

Timing issues crop up in browser tests when you write tests that don’t wait for the same visual cues that a user would. When this happens, the tests (which will presumably run as fast as the computer can make them run) click through the app faster than the app can respond. That is a great way to introduce non-deterministic (flaky) tests, especially when the tests move between machines of differing speeds.
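To make “wait for the same visual cues” concrete, here is a minimal sketch of a polling helper. The name `wait_until` and the fake `save_button_visible` check are made up for illustration; a real test would ask the browser driver whether the element is on screen.

```ruby
# Hypothetical helper: poll the block until it returns something truthy,
# or raise once the timeout has elapsed.
def wait_until(timeout: 5, interval: 0.05)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for an app that needs a moment to render.
# (Assumption: a real test would query the browser here instead.)
ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.2
save_button_visible = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) >= ready_at }

# Wait for the cue a user would wait for, then act.
wait_until(timeout: 2) { save_button_visible.call }
puts "Save button is visible, safe to click"
```

The point is that the test proceeds as soon as the cue appears and fails loudly if it never does, rather than relying on a fixed `sleep` that happens to be long enough on one machine.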

Data interaction between tests is another problem that can crop up when running full-stack tests. If the tests rely on ambiguous assertions, lingering data can cause the same types of intermittent problems when the test order is randomized. When running multiple threads of tests in parallel, simply adding more tests can change the order in which the tests are run.
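A toy illustration of an ambiguous assertion, using a plain array as a stand-in for rows on a page (the data and names are invented for the example):

```ruby
# Stand-in for rows shown on a page. An earlier test left one behind.
rows = ["Order for alice"]

# The record this test just created.
rows << "Order for bob"

# Ambiguous: only passes when no other test has left data lying around.
ambiguous = rows.size == 1

# Specific: checks for the record this test owns, whatever else is present.
specific = rows.include?("Order for bob")

puts "ambiguous assertion passes: #{ambiguous}"
puts "specific assertion passes:  #{specific}"
```

Whether the first assertion passes depends on which tests ran before it, which is exactly the kind of order dependence that shows up as intermittent failures.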

We could (and will) write several posts about what makes tests flaky. That’s not the point of this post. This post is about how we detect potential flakiness in browser tests and the safeguards we put in place against widespread flakiness.

A few months ago, we were feeling some pain from flaky browser tests. Our build lights were red more than they were green. Before then, we had made a lot of progress on our browser tests, but we had grown complacent. Flakies were committed. Morale suffered.

Our build machines are killer fast – even faster than the Mac Pros we use as workstations. As a result, timing issues crop up more often on the build servers than locally. To help debug flaky tests, I wrote a rake task we call “flaky finder” that runs the same test fifty times across eight threads, so the flakiness can be reproduced on a workstation and more easily fixed.
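The real rake task isn’t reproduced here, but the core idea can be sketched roughly like this. The `flaky_finder` method and its block interface are assumptions for illustration; in practice the block would shell out to whatever command runs a single test file.

```ruby
# Sketch only: run `job` `runs` times across `threads` worker threads and
# return the run numbers that failed. `job` is any block that returns
# true on pass and false on failure.
def flaky_finder(runs: 50, threads: 8, &job)
  queue = Queue.new
  runs.times { |i| queue << i }
  queue.close # once closed and drained, pop returns nil and workers stop

  failures = Queue.new
  workers = Array.new(threads) do
    Thread.new do
      while (i = queue.pop)
        failures << i unless job.call(i)
      end
    end
  end
  workers.each(&:join)

  Array.new(failures.size) { failures.pop }.sort
end

# Usage with a fake test that fails on a few of the runs.
# A real block might look like: { system("some_test_command", test_file) }
failed = flaky_finder(runs: 50, threads: 8) { |i| i % 17 != 0 }
puts "#{failed.size} of 50 runs failed: #{failed.inspect}"
```

Running the copies concurrently matters as much as running them repeatedly: the extra load slows the app down relative to the tests, which is what makes timing problems show up on an otherwise quiet workstation.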

The script worked great once we discovered a test file was flaky, but it was inherently reactive. We wanted to be more proactive, so we incorporated the script into our build process. We created a Ruby script that would run against every check-in and use the Jenkins XML API to discover which browser tests changed. It would then run flaky finder against all of those test files.
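As a rough sketch of the “which browser tests changed” step: the snippet below parses a build’s change set and keeps the paths that look like browser tests. The sample XML, the `browser_tests/` directory, and the `_test.rb` suffix are all assumptions for illustration; the actual response shape depends on your Jenkins version and SCM plugin, so check what `<build URL>/api/xml` returns before relying on these element names.

```ruby
require "rexml/document"

# Assumed shape of a build's XML, trimmed to the change set.
sample_xml = <<~XML
  <freeStyleBuild>
    <changeSet>
      <item>
        <affectedPath>app/models/user.rb</affectedPath>
        <affectedPath>browser_tests/login_test.rb</affectedPath>
      </item>
      <item>
        <affectedPath>browser_tests/checkout_test.rb</affectedPath>
        <affectedPath>browser_tests/login_test.rb</affectedPath>
      </item>
    </changeSet>
  </freeStyleBuild>
XML

# Hypothetical helper: return the unique changed paths under test_dir.
def changed_browser_tests(xml, test_dir: "browser_tests/")
  doc = REXML::Document.new(xml)
  paths = REXML::XPath.match(doc, "//changeSet/item/affectedPath").map(&:text)
  paths.select { |path| path.start_with?(test_dir) && path.end_with?("_test.rb") }
       .uniq
       .sort
end

puts changed_browser_tests(sample_xml)
```

Each path that comes back would then be handed to flaky finder. In a real script the XML would be fetched over HTTP from the Jenkins server rather than embedded as a string.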

Once we started running the script against every commit, we first learned that most browser tests are not flaky. It’s the cumulative effect of a small number of flakies that can cause problems once you build frequently enough.

Next, we learned that just analyzing the test files isn’t enough. We only run the test files that are changed. It’s entirely possible for us to change the dependencies for a test file and totally flakify that file without ever touching it. We have some ideas about how to do the dependency analysis we’re looking for in Ruby (that’s what we use for our browser tests), but we haven’t gotten to it yet. Adding that analysis is high on my list for our upcoming hackathon.

Finally, we learned that small visual indicators are nearly worthless when you have big visual indicators drowning them out. We live and die by our build lights. Any new Jenkins jobs we introduce are worthless if they don’t turn the light red when they fail. As of a few months ago, flaky finder will turn the light red if it fails. That is win.