Many modern enterprise applications rely on search to some extent. As of Nov 2016 the most popular search engine is Elasticsearch, an open source engine based on Apache Lucene. The need for search arose in my home project as well. I chose Elasticsearch as the engine and eagerly dived into the tutorials. My methodology for writing interactions with 3rd party systems is to create Facade APIs within a Test Driven Development process. The tests for indexing and retrieving documents worked flawlessly, but the results of the search query tests puzzled me. I have formal training in search engines from the Coursera Data Mining specialization, so I know concepts like TF-IDF. My hope was to get the relevance scores and match them precisely to the numbers computed by the formulas in the tutorials.
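
For reference, the kind of tutorial formulas I was trying to reproduce can be sketched in a few lines of Python. This is a minimal sketch, assuming the classic Lucene TF/IDF similarity (square-root term frequency, logarithmic IDF, inverse-square-root field-length norm). The function names are mine, chosen for illustration, and are not part of any Elasticsearch API.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the term's count in the field."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: shorter fields get a larger weight."""
    return 1.0 / math.sqrt(num_terms)


def field_weight(freq, num_docs, doc_freq, num_terms):
    """Weight of one term in one field, before any rounding or sharding effects."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


if __name__ == "__main__":
    # Hypothetical case: 4 documents, the term occurs once in a single 4-term field.
    print(field_weight(freq=1, num_docs=4, doc_freq=1, num_terms=4))
```

With these assumptions, a term that occurs once in a single 4-term document out of 4 gets a weight of about 0.85. Numbers of this kind are what I expected the engine to return.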
A basic index of 4 test documents returned numbers vastly different from my expectations… After some googling I turned on the “explain” functionality and was in for an even bigger shock: the returned scores didn’t match the scores in the explain section. I started suspecting the unthinkable: the relevance calculations are broken! The Elasticsearch tutorial confirmed my worst fears… well, rather, it showed me how little I knew about real search engines. After a couple more hours of comparing numbers, I decomposed the discrepancies into an optimization feature, a bug pretending to be a feature, and a bug. The optimization feature is that several shards are created for each index and documents are distributed between those shards effectively at random. With the DEFAULT search type, the relevance statistics are computed only within each shard, so the IDF of a term depends on which documents happen to share a shard. Setting the search type to DFS_QUERY_THEN_FETCH forces the shard statistics to be combined into a single IDF calculation, which leads to values closer to the expected numbers. However, the “explain” functionality always employs the DEFAULT search type, which leads to the mismatch, hence a bug. The bug pretending to be a feature is the really coarse-grained rounding of the relevance norm. The discrepancies reach 15%, which hurts testing.
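
Both effects can be imitated outside the engine. The sketch below is an illustration under stated assumptions, not Elasticsearch or Lucene code: the shard layout and documents are hypothetical, the IDF formula is the classic Lucene one, and `truncate_norm` only approximates my understanding of the one-byte norm encoding by keeping the two highest mantissa bits of a 32-bit float (it ignores the clamping at the ends of the encodable range).

```python
import math
import struct


def idf(num_docs, doc_freq):
    """Classic TF/IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical layout: 4 test documents spread unevenly over 2 shards.
SHARDS = [
    [["quick", "brown", "fox"], ["lazy", "dog"], ["brown", "cow"]],
    [["red", "fox"]],
]


def doc_freq(docs, term):
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)


def shard_idf(docs, term):
    """IDF computed from one shard's statistics only."""
    return idf(len(docs), doc_freq(docs, term))


def global_idf(shards, term):
    """IDF computed after combining the statistics of all shards."""
    all_docs = [doc for docs in shards for doc in docs]
    return idf(len(all_docs), doc_freq(all_docs, term))


def truncate_norm(value):
    """Keep sign, exponent and the top 2 mantissa bits of a float32 (rounds down)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    bits &= 0xFFFFFFFF & ~((1 << 21) - 1)
    (coarse,) = struct.unpack(">f", struct.pack(">I", bits))
    return coarse


if __name__ == "__main__":
    for number, docs in enumerate(SHARDS):
        print("shard", number, "idf(fox) =", shard_idf(docs, "fox"))
    print("global idf(fox) =", global_idf(SHARDS, "fox"))
    for num_terms in (2, 3, 4, 5):
        exact = 1.0 / math.sqrt(num_terms)
        print(num_terms, "terms:", exact, "->", truncate_norm(exact))
```

In this toy layout the per-shard IDF of "fox" differs noticeably between the two shards and from the combined value, which is the kind of mismatch that shows up on a tiny test index. Under the truncation assumption, the norm of a 3-term field, 1/sqrt(3) ≈ 0.577, is stored as 0.5, about 13% lower, which is in line with the size of the discrepancies I saw.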