First experience with Elasticsearch

Many modern enterprise applications rely on search to some extent. As of Nov 2016 the most popular search engine is Elasticsearch, an open-source engine built on top of Apache Lucene. The need for search arose in my home project as well, so I chose Elasticsearch and eagerly dived into the tutorials. My approach to integrating with third-party systems is to build Facade APIs through a Test-Driven Development process. The tests for indexing and retrieving documents worked flawlessly, but the results of the search-query tests left me puzzled. I have formal training in search engines from the Coursera Data Mining specialization, so I know concepts like TF-IDF. My hope was to take the relevance scores and match them precisely to the numbers computed by the formulas in the tutorials.
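For context, this is the kind of hand calculation I expected to reproduce: the textbook TF-IDF weight of a term in a document over a toy corpus. It is only a sketch of the textbook formula (Lucene's actual scoring adds length norms and other factors), and the four documents below are made up for illustration.

```python
import math

# Toy stand-in for my four test documents (contents are made up).
docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick foxes jump over lazy dogs",
    "an unrelated document",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term: str, doc_index: int) -> float:
    """Textbook TF-IDF: term frequency times log of inverse document frequency."""
    tf = tokenized[doc_index].count(term)
    df = sum(1 for doc in tokenized if term in doc)
    return tf * math.log(N / df) if df else 0.0

print(tf_idf("quick", 0))  # "quick" occurs in 2 of 4 docs: 1 * ln(4/2) ≈ 0.69
```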

A basic index of 4 test documents returned numbers vastly different from my expectations… After some googling I turned on the “explain” functionality and was in for an even bigger shock: the returned scores didn’t match the scores in the explain section. I started suspecting the unthinkable: the relevance calculations are broken! The Elasticsearch tutorial confirmed my worst fears… or rather, it explained how little I knew about real search engines. After a couple more hours of comparing numbers, the discrepancies decomposed into an optimization feature, a bug pretending to be a feature, and a bug. The optimization feature is that each index is split into several shards and documents are randomly distributed among them. With the DEFAULT search type, relevance calculations are performed only within each shard. Setting the search type to DFS_QUERY_THEN_FETCH forces the shard statistics to be combined into a single IDF calculation, which leads to values much closer to the expected numbers. However, the “explain” functionality always uses the DEFAULT search type, so its output does not match, hence a bug. The bug pretending to be a feature is the very coarse-grained rounding of the relevance norm: the resulting discrepancies reach 15%, which hurts testing.
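As a minimal sketch of the comparison, here is roughly how the two search types can be contrasted, assuming a local node at http://localhost:9200, the Python requests library, and a recent Elasticsearch (index and document URL paths vary across versions); the index name and documents are illustrative, not my actual test data.

```python
import requests

ES = "http://localhost:9200"   # assumed local test node
INDEX = "articles"             # hypothetical test index

# Index a handful of test documents (IDs and contents are illustrative).
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick foxes jump over lazy dogs",
    4: "an unrelated document",
}
for doc_id, text in docs.items():
    requests.put(f"{ES}/{INDEX}/_doc/{doc_id}", json={"body": text})
requests.post(f"{ES}/{INDEX}/_refresh")  # make the documents searchable

# The same query, with and without combined shard statistics.
query = {"query": {"match": {"body": "quick fox"}}, "explain": True}

default_scores = requests.post(f"{ES}/{INDEX}/_search", json=query).json()
dfs_scores = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"search_type": "dfs_query_then_fetch"},  # global IDF across shards
    json=query,
).json()

for hit in dfs_scores["hits"]["hits"]:
    print(hit["_id"], hit["_score"])  # closer to the hand-computed numbers
```

With the default search type, the per-shard document frequencies show up directly in the scores, which is why a tiny index spread over several shards produces such surprising numbers.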

Course Review – Big Data – Capstone Project (Ilkay Altintas, Amarnath Gupta)

Here is my review of the Big Data Capstone Project course offered on Coursera in Jul 2016. The course is the final project of the Big Data specialization and does not have a separate ranking; I passed it with a 98.2% score.
Technologies/Material: As a final project, the course does not have lectures, but rather brief descriptions of the relevant project parts each week. The project is about making suggestions on how to increase the revenue of a company promoting a fictional game, “Catch the Pink Flamingo”. A lot of simulated game data is made available to the learners. The part assigned each week represents a separate area of big data analytics: data exploration, classification, clustering, and graph analysis. The suggested technologies are Splunk, KNIME, Apache Spark, and Neo4j, respectively. As usual in the specialization, instead of free exploration a “correct” path is given, along with substantial help along the way. Each week’s assignment is peer graded, with the ability to submit multiple times and get regraded. Grading asks peers to compare a learner’s numbers with the correct ones, which means that almost everyone gets the correct answers on the second attempt. Unfortunately, many people slack off on their first attempt or simply submit an empty report. At the end of the course a final report and a PowerPoint presentation are submitted and also peer graded.
Instructor/lectures: the task instructions are given by Amarnath Gupta and Ilkay Altintas. The course offers a realistic view of the job of a Data Scientist: analyze all available data to increase the company’s revenue, improve retention rates, suggest directions for development, and, most importantly, make presentations to the management. The instructors emphasize each week that the company’s bottom line is of the utmost importance. Even though the specialization is called Big Data, there is no emphasis on especially large volumes of data or on distributed computations, so we are really in the Data Science realm.

Course Review – Graph Analytics for Big Data (Amarnath Gupta)

Here is my review of the Graph Analytics for Big Data course offered on Coursera in Feb 2016. The course is ranked 2.5 out of 5; I passed it with a 99.4% score.
Technologies/Material: The course provides an introduction to graph theory with practical examples of graph analytics. Most of the examples and homework are done in Neo4j, a leading graph database; the last assignment employs the GraphX API in Spark. Since graph databases are so different from relational databases, a special graph query language called Cypher was developed for writing Neo4j code. An extensive Cypher tutorial and executable code samples grouped by topic are provided. Graph analytics offers simple answers to many questions. The discussed techniques are Path Analytics, the Dijkstra algorithm and its variations, Connectivity Analytics, Community Analytics, and Centrality Analytics.
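To give a flavor of what Cypher looks like in practice, here is a minimal sketch of a shortest-path query run through the official Neo4j Python driver; the connection details, labels, and property names (Airport, FLIGHT_TO, code) are hypothetical and not taken from the course material.

```python
from neo4j import GraphDatabase

# Hypothetical local Neo4j instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher: shortest chain of FLIGHT_TO relationships between two airports.
CYPHER = """
MATCH (a:Airport {code: $src}), (b:Airport {code: $dst}),
      p = shortestPath((a)-[:FLIGHT_TO*]-(b))
RETURN [n IN nodes(p) | n.code] AS route
"""

with driver.session() as session:
    for record in session.run(CYPHER, src="SAN", dst="JFK"):
        print(record["route"])  # list of airport codes along the path

driver.close()
```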
Instructor/lectures: the course is taught by Amarnath Gupta, an Associate Director of the San Diego Supercomputer Center. Amarnath is an amazing instructor. The course is well taught, at just the right pace and with just the right amount of material. In my view, he succeeded in delivering an introduction to graphs without oversimplifying the concepts.

Course Review – Machine Learning with Big Data (Natasha Balac, Paul Rodriguez)

Here is my review of the Machine Learning with Big Data course offered on Coursera in Jan 2016. The course is ranked 2.0 out of 5; I passed it with a 100% score.
Technologies/Material: The course provides basic theory and some exercises on popular machine learning techniques after presenting the business justification and the ML pipeline. The presented techniques are decision trees, association rules, and clustering. The exercises are largely done in KNIME, with some parts in Apache Spark. Thankfully, the course has copyable code samples and provides basic information on how to get started with KNIME. The assignments require digging into non-trivial details of KNIME through its documentation, the Internet, and forums. For me the course provided valuable insights into and examples of decision trees and association rules, which not many other courses offer.
Instructor/lectures: The course is taught by Natasha Balac, who provides most of the business background, and Paul Rodriguez, who covers the technical side. The presentation is organized better than in the previous courses, though the depth of the material is often insufficient for solid learning. Some slides can be reused to present Big Data to managers.