Full-Text Search in Javascript (Part 1: Relevance Scoring) by Barak Kanber.
From the post:
Full-text search, unlike most of the topics in this machine learning series, is a problem that most web developers have encountered at some point in their daily work. A client asks you to put a search field somewhere, and you write some SQL along the lines of
WHERE title LIKE %:query%
. It’s convincing at first, but then a few days later the client calls you and claims that “search is broken!”Of course, your search isn’t broken, it’s just not doing what the client wants. Regular web users don’t really understand the concept of exact matches, so your search quality ends up being poor. You decide you need to use full-text search. With some MySQL fidgeting you’re able to set up a FULLTEXT index and use a more evolved syntax, the “MATCH() … AGAINST()” query.
Great! Problem solved. For smallish databases.
As you hit the hundreds of thousands of records, you notice that your database is sluggish. MySQL just isn’t great at full-text search. So you grab ElasticSearch, refactor your code a bit, and deploy a Lucene-driven full-text search cluster that works wonders. It’s fast and the quality of results is great.
Which leads you to ask: what the heck is Lucene doing so right?
This article (on TF-IDF, Okapi BM-25, and relevance scoring in general) and the next one (on inverted indices) describe the basic concepts behind full-text search.
Illustration of search engine concepts in Javascript with code for download. You can tinker to your heart’s delight.
Enjoy!
PS: Part 2 is promised in the next “several” weeks. Will be watching for it.