The Rally ALM software is backed by a beefy Oracle database, so when our product owners first requested search functionality it made sense to use the DBMS. Straightforward, right? We could use the SQL LIKE operator to scan the database for a specified pattern, perhaps with wildcards, and the job would be done. Something of the form:
SELECT * FROM table_name WHERE value LIKE '%search_term%';
So we did.
As Rally ALM became popular our customer base grew and search started to slooooowww down. LIKE doesn’t like huge datasets; full table scans with wildcards don’t scale.
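To make the scaling problem concrete, here is a minimal sketch that asks a query planner how it would run an exact match versus a wildcard LIKE. It uses SQLite from the Python standard library as a stand-in for Oracle, and the artifacts table, its columns, and the index name are all hypothetical. The two planners differ in detail, but the principle is the same: a pattern with a leading wildcard gives the index nothing to seek on.

```python
import sqlite3

# SQLite stands in for Oracle here; table, column, and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
conn.execute("CREATE INDEX idx_artifacts_name ON artifacts (name)")
conn.executemany(
    "INSERT INTO artifacts (name, description) VALUES (?, ?)",
    [(f"story {i}", f"description {i}") for i in range(1000)],
)


def plan(sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Exact match: the planner can seek straight into the index.
exact_plan = plan("SELECT * FROM artifacts WHERE name = ?", ("story 42",))

# Leading wildcard: no index seek is possible, so every row is examined.
like_plan = plan("SELECT * FROM artifacts WHERE name LIKE ?", ("%42%",))

print(exact_plan)
print(like_plan)
```

The first plan should report a SEARCH that uses the index, and the second a SCAN of the whole table. The exact wording varies between SQLite versions.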
Metrics gathered from production reported ~110K search requests a week, with response times of up to 90 seconds. Yes, that is a one-and-a-half-minute response time. These requests generated the longest-running queries in our Oracle instance, contributing an ever-increasing load. So the team began investigating alternatives.
After a period of research, idea spikes, discussion groups, and talking to colleagues in other companies, we decided to implement a Solr-based solution. The new architecture looks like this:
Oracle DB → extract indexes → Solr server ↔ user queries
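Solr is built on Lucene, whose core data structure is an inverted index: a map from each term to the documents that contain it. The toy version below is only a sketch of that idea, not of Solr's actual implementation, and the sample documents and function names are made up for illustration. It shows why a term lookup does not need to touch every row the way a wildcard scan does.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical rows extracted from the database.
documents = {
    1: "Login page throws error on submit",
    2: "Add search box to the login page",
    3: "Search results load slowly",
}
index = build_index(documents)

print(sorted(search(index, "login page")))  # [1, 2]
print(sorted(search(index, "search")))      # [2, 3]
```

The indexing cost is paid once, when documents are extracted, so each query is reduced to a few dictionary lookups and set intersections.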
We built Solr search for Rally through a series of “spikes” that evolved into stories for completion. The research-y nature of the work made it difficult to estimate a completion date. As we worked, we’d discover pitfalls that generated new project dependencies along the way. For example, we had a complete implementation using the DataImportHandler to build Solr documents from our data. This worked fine with test data sets, but on our production-sized test data the initial import took over 24 hours. This was way outside our tolerance, so we spent time building our own scalable messaging system.
When the new search was ready for release, we performed a staged rollout to our user base using feature toggles. We began with a handful of beta customers, moved on to a half dozen heavy search users, and then made percentage jumps up to a 100% general release. This allowed us to test the water whilst monitoring the performance of the app and the impact on the database. At each step we could project what would happen as we lit up Solr search for more customers. This gave us confidence in our work and kept the operations team comfortable. The feature toggle gave us the flexibility to flip the Solr switch on and off whenever we wanted.
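A staged rollout like the one described above can be sketched with a deterministic percentage toggle. This is a minimal illustration under assumptions, not Rally's actual toggle code: the function names, the customer ids, and the choice of SHA-256 bucketing are all hypothetical. Hashing each customer into a stable bucket means a customer who is enabled at 10% stays enabled at 50%, and setting the percentage back to 0 switches everyone except the allowlist off again.

```python
import hashlib


def bucket(customer_id):
    """Map a customer id to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def solr_search_enabled(customer_id, rollout_percent, allowlist=frozenset()):
    """Decide whether a customer gets the new search.

    Allowlisted customers (for example beta testers) are always enabled;
    everyone else is enabled once the rollout percentage passes their bucket.
    """
    if customer_id in allowlist:
        return True
    return bucket(customer_id) < rollout_percent


# Hypothetical beta customers and customer ids, for illustration only.
beta = frozenset({"acme", "globex"})
customers = [f"customer-{i}" for i in range(1000)]

for percent in (0, 10, 50, 100):
    enabled = sum(solr_search_enabled(c, percent, beta) for c in customers)
    print(f"{percent:>3}% rollout -> {enabled} of {len(customers)} customers enabled")
```

Because the bucket depends only on the customer id, the decision needs no stored per-customer state beyond the allowlist and the current percentage.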
Looking at performance graphs today, the average response time for search requests has dropped to around 85ms. Our stress tests show that search queries made directly against Solr can be handled at over 100 requests/sec with an average of 3ms/request, probably because the entire Solr index easily fits into memory.
During the week when we lit up Solr for 50% of our users, we had an extra 20K searches. User behavior changed very quickly, and the tests gave us confidence that the system could easily handle this.
So what about the production database? Our early results were astounding:
DB wait time dropped by over 30%
IO requests dropped by over 16%
DB sequential file reads dropped by almost 30%
Sizzle.
