Back in May I posted on how we connected build lights to our continuous integration server to increase visibility into the health of our builds. Since that post we’ve grown from a single build light to six lights here in Boulder and two in our Raleigh office. The light control code has become a little more sophisticated and our working agreements have changed over time. We’ve just instrumented a couple of new features so I thought you may be interested in a retrospective post on our experiences.

Too much red: rush to push

The build lights worked well for a while, but a negative behavior began to emerge: they encouraged a “rush to push”. The cause was extended periods of red lights due to flaky tests that failed at random. When a test failed because of a timing issue, we would kick off a new build and wait for a green light. During these waits, local check-ins piled up on dev machines, and the moment the light went green everyone rushed to push code. Those check-ins spawned another build, and if that failed we had a dozen or so commits to weed through to figure out which one was the culprit. PITA.

After the build lights had been running for a couple of months, our fortnightly departmental retro turned into a long discussion of the problems caused by too many red build lights. In short, they were getting in the way of our work. The outcome was a team decision to focus on flaky tests and protect the build light, which in turn would keep the lights green most of the time. One agreement was that every time a flaky test failed, we’d assign it round-robin to a team to fix.

This strategy worked. Today the light is green for the majority of the time. Our test suite is far healthier and devs are free to push code pretty much any time during the day. But we don’t want our efforts to slowly erode away with the introduction of new flaky tests or bad practices. Here’s a selection of actions we took:

Flaky finder and metrics

One trick we pulled was to introduce a “flaky finder” job that attempts to spot flakiness in new browser tests. When someone commits a new test it must pass the multi-execution concurrent gauntlet that is flaky finder on the build system before the light will turn green.
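As a rough illustration of the idea (not the actual job), here is a minimal sketch of a multi-execution, concurrent check. The function name `find_flakiness`, the `runs` and `workers` parameters, and the assumption that a test is a zero-argument callable that raises on failure are all hypothetical. A real browser test would be launched through its test runner instead.

```python
from concurrent.futures import ThreadPoolExecutor


def find_flakiness(test, runs=20, workers=4):
    """Run `test` `runs` times across `workers` threads and count failures.

    `test` is assumed to be a zero-argument callable that raises on failure.
    A test that fails on some runs but not all is reported as flaky.
    """
    def attempt(_):
        try:
            test()
            return True
        except Exception:
            return False

    # Running attempts concurrently helps surface timing-related failures.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, range(runs)))

    failures = results.count(False)
    return {"runs": runs, "failures": failures, "flaky": 0 < failures < runs}
```

A gate built on this would refuse to go green when `flaky` is true. A test that fails on every run is not flaky, just broken, so it is reported separately through `failures`.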

At a later retro we talked about how it felt like our builds were doing mostly okay and we believed that things were getting better. But we really did not know. We had no metrics. An action item that came out of this meeting was to ingest our build logs into Splunk so we could measure the health of builds in fine-grained detail. We built a dashboard of graphs and tables backed by predefined queries over the build log data, and we monitor it regularly.

More build lights: who is doing what?

As we introduced more build lights we noticed that a red light could result in two or more people grabbing the build and starting to work on it, which meant duplicated effort. To fix this we implemented a “claim this build” feature. Now a red light means we have a broken build and no one has claimed it. A yellow light means the build broke but someone has claimed it and is working on the problem.

“Claim this build” works through the description field of a given build in Jenkins. At first we just wanted committers to put their names in the description field to claim the build. An interesting emergent behavior was that people began adding more details about why the build failed. This further improved visibility into what was going on in Jenkins. I didn’t expect this usage pattern but it is highly valuable.
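The mapping from build state to light color can be sketched as below. This is an illustration under assumptions, not the real light script: the function name `light_color` is hypothetical, the input is assumed to be the parsed JSON of the last completed build (the Jenkins JSON API exposes `result` and `description` fields on a build), and any result other than `SUCCESS` is assumed to count as broken.

```python
def light_color(build):
    """Map a Jenkins build record to a light color.

    `build` is a dict shaped like the JSON Jenkins returns for a completed
    build; only its `result` and `description` fields are used.
    """
    if build.get("result") == "SUCCESS":
        return "green"

    # Assumption: any non-empty description on a broken build is a claim.
    description = (build.get("description") or "").strip()
    return "yellow" if description else "red"
```

For example, `light_color({"result": "FAILURE", "description": "alice: looking at the login test"})` yields yellow, while the same build with an empty description yields red.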

Another enhancement we made is that the build lights now track weekend release jobs when they exist. We had a broken release build for a few hours one day and no one noticed, which suggested that the team was relying on the lights and not watching the Jenkins UI. The build light script now checks Jenkins to see if there are new jobs for an upcoming release, and the build lights automatically track them too.
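One way to picture the job discovery is the sketch below. It assumes release jobs can be recognized by a name prefix, which is a made-up convention for illustration. `jobs_to_track`, `always_tracked`, and the `release-` prefix are all hypothetical names. The `all_jobs` argument is assumed to be the `jobs` list from the Jenkins top-level JSON, where each entry has a `name`.

```python
def jobs_to_track(all_jobs, always_tracked, release_prefix="release-"):
    """Return the job names the lights should watch.

    `all_jobs` is a list of dicts with a `name` key. `always_tracked` lists
    the jobs that are watched regardless. Jobs whose names start with
    `release_prefix` (a hypothetical naming convention) are added when present.
    """
    names = [job["name"] for job in all_jobs]
    release_jobs = sorted(n for n in names if n.startswith(release_prefix))

    tracked = list(always_tracked)
    tracked.extend(n for n in release_jobs if n not in tracked)
    return tracked
```

Calling this on every polling cycle means new release jobs are picked up without anyone editing the light configuration, and they drop off again once the jobs are removed from Jenkins.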

Future work

An ongoing issue we have is that our full test suite can take up to 40 minutes to complete. This means our build light can stay green for up to 40 minutes even though a test early in the run has already failed. During this time devs could be checking in code on top of a broken build. We could use a fail-fast mechanism or faster test suites to combat this. We’ve been working on faster test turnaround with Selenium Grid as part of this.
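To make “fail-fast” concrete, here is a minimal sketch, assuming the suite can be split into named stages that each raise on failure. The function name `run_fail_fast` and the stage structure are hypothetical and not something the build system is known to do today.

```python
def run_fail_fast(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Each stage is assumed to be a zero-argument callable that raises on
    failure. Later stages are skipped once one fails, so the result is
    available as soon as the first failure occurs.
    """
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:
            return {"passed": False, "failed_stage": name, "error": str(exc)}
    return {"passed": True, "failed_stage": None, "error": None}
```

With quick stages placed first, a failure would be reported in the time it takes to reach the failing stage instead of after the full run.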

That’s more work for the future.