In my last post in the series, I wrote about the huge improvements we made in the performance of our browser tests. The shorter feedback loop gave us far more confidence in making changes, as we would know within a half hour or so if a commit was bad. But our tests still needed a lot of love and attention before we could rely on them. Often we would see a false negative, where a test reported failure but manual verification showed that the code was good. Fast tests are no good if they’re flaky.
Our basic concept of test flakiness loosely translates as: “any test that works on my machine, but occasionally fails when run on the build machines”. For us there tended to be a few classes of flakiness. Sometimes it was caused by overlooked data dependencies or a loss of browser focus. Most often, though, the test simply did not take the appropriate visual cue into account. To us, visual cues are something that a human user would recognize as an ‘I’m ready’ from the web application. Examples include a button that’s only clickable once all page content has loaded, confirmation dialogs, the visible state of certain page elements and so on. While a human has learned to wait for these cues, Selenium has to be explicitly told to look for them, and we had plenty of race conditions due to missing that step, usually because the performance on our pairing stations was different to that of the build machines. Often we would repeatedly run and debug these failing tests locally and still struggle to replicate the fault. The simple act of stepping through the test, or running it in isolation, would allow the server to keep up with the test driver and “hey presto!” it works. That was when we recognized that we needed to get better at explicitly telling Selenium to wait for things.
One step we took was to wrap the standard Selenium click with our own version that coupled the click with a preceding wait_for_element, so that every click would first wait for the element to be present on the page. This is a choice we made with the knowledge that it could allow us to be less cognizant of unexpected application changes, but we needed to stabilise our test suite in order to reap any real benefit. Doing this removed a lot of the intermittent ‘element not present’ errors, and also exposed places where we’d missed steps in replicating user actions.
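As a rough illustration of that idea, here is a minimal sketch of a click wrapper that always waits first. It is not our actual helper code: the class name WaitingDriver, the WaitTimeout error, the timeout and interval parameters, and the assumption that the underlying driver responds to element_present? and click are all hypothetical.

```ruby
# Hypothetical wrapper. `driver` is assumed to be any object that responds to
# element_present?(locator) and click(locator); those method names are
# illustrative, not a statement about a particular Selenium client API.
class WaitingDriver
  class WaitTimeout < StandardError; end

  def initialize(driver, timeout: 10, interval: 0.1)
    @driver = driver
    @timeout = timeout    # seconds to keep polling before giving up
    @interval = interval  # seconds to sleep between polls
  end

  # Poll until the element is present, or raise once the timeout has passed.
  def wait_for_element(locator)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    until @driver.element_present?(locator)
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise WaitTimeout, "#{locator} not present after #{@timeout}s"
      end
      sleep @interval
    end
    true
  end

  # Every click first waits for its target to be present on the page.
  def click(locator)
    wait_for_element(locator)
    @driver.click(locator)
  end
end
```

The trade-off is the one mentioned above: a test that clicks through this wrapper will quietly tolerate a slow page, so it can hide a genuine slowdown until the timeout is exceeded.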
We cleaned up a bunch more race conditions simply by making better use of the Sizzle library that Selenium uses as its CSS locator engine. Tweaking our selectors to ensure that we were clicking the right element at the right time helped a lot. Now, waiting for a button to be enabled while a server request completes is a common action.
Pseudo-class selectors such as :not all have their place in the suite, though using visible text (i.e. :contains) is something I try to avoid whenever possible. It’s too easy for a simple content change to break tests; using classes and IDs should be more robust in the long term.
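To make the contrast concrete, here is a small sketch of two locator builders. The Locators module and its method names are hypothetical, and the button id is made up for illustration; the css= prefix and the :not and :contains selectors are the only parts taken from Selenium and Sizzle themselves.

```ruby
# Hypothetical helpers that build locator strings for Selenium's "css=" locator.
module Locators
  # Matches the element with the given id only while it has no `disabled`
  # attribute, so waiting on this locator waits for the button to be enabled.
  def self.enabled(id)
    format('css=#%s:not([disabled])', id)
  end

  # Matches by visible text. Brittle: a simple copy change breaks the locator.
  def self.with_text(tag, text)
    format("css=%s:contains('%s')", tag, text)
  end
end

puts Locators.enabled('save')             # css=#save:not([disabled])
puts Locators.with_text('button', 'Save') # css=button:contains('Save')
```

The first form keeps working if the button label changes from "Save" to "Save changes"; the second does not.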
Our Ajax actions needed their own
Interacting with our popup editors also had some flakiness. While Selenium will take care of switching focus (when we tell it to), we still occasionally needed to tell it to wait for the popup to render fully before interacting with it. We also needed to wait for the editor to close, and thereby update the main window, before moving on with the test. Often a test would pop an editor, make a change and save & close, but then not wait sufficiently before navigating to another page, which would result in the page being loaded within the soon-to-close popup instead of the main browser window. Refactoring the editor class with some explicit waits and exit methods cleaned that up a treat. Looking at our
editor_code
  yield e if block_given?
  close_editor
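The shape of that refactoring can be sketched as follows: wait for the popup to render, yield the editor to the test, then close it and wait for the popup to disappear before handing focus back. This is an illustration under assumptions, not our real editor class. The PopupEditor name, its constructor parameters, the :main window token, and the browser methods select_window, element_present?, click and window_open? are all hypothetical.

```ruby
# Hypothetical popup editor helper. `browser` is assumed to respond to
# select_window(name), element_present?(locator), click(locator) and
# window_open?(name); these names are illustrative only.
class PopupEditor
  class WaitTimeout < StandardError; end

  def initialize(browser, window:, ready_locator:, save_locator:,
                 timeout: 10, interval: 0.1)
    @browser = browser
    @window = window
    @ready_locator = ready_locator
    @save_locator = save_locator
    @timeout = timeout
    @interval = interval
  end

  # Focus the popup, wait for it to render, hand it to the block, then close it.
  def edit
    @browser.select_window(@window)
    wait_until('editor to render') { @browser.element_present?(@ready_locator) }
    yield self if block_given?
    close_editor
  end

  # Save and close, then wait for the popup to go away before returning focus.
  def close_editor
    @browser.click(@save_locator)
    wait_until('editor to close') { !@browser.window_open?(@window) }
    @browser.select_window(:main) # assumed token for the main browser window
  end

  private

  # Poll the given block until it returns true, or raise after the timeout.
  def wait_until(description)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    until yield
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise WaitTimeout, "timed out waiting for #{description}"
      end
      sleep @interval
    end
  end
end
```

Because the wait for the popup to close lives inside close_editor, a test cannot navigate away while the popup still has focus, which is exactly the failure described above.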
We probably won’t ever completely eliminate flakiness from the suite since we have a continuously changing application for which we’re continuously adding tests, but we have made huge progress in stamping it out in our existing tests. We have gained a good understanding of the causes of flakiness in our tests and what we can do to fix them. We have put some guards in place to help us recognize flakiness sooner. Our build light helps draw attention to failures, plus we’re tracking the test failures for repeat offenders which will get added to the refactoring backlog when they’re identified.