Writing a full-text search engine using Bloom filters by Stavros Korokithakis.
A few minutes ago I came across a Hacker News post that detailed a method of adding search to your static site. As you probably know, adding search to a static site is a bit tricky, because you can’t just send the query to a server and have the server process it and return the results. If you want full-text search, you have to implement something like an inverted index.
How an inverted index works
An inverted index is a data structure that basically maps every word in every document to the ID of the document it can be found in. For example, such an index might look like {"python": [1, 3, 6], "raspberry": [3, 7, 19]}
. To find the documents that mention both “python” and “raspberry”, you look those terms up in the index and find the common document IDs (in our example, that is only document with ID 3).
However, when you have very long documents with varied words, this can grow a lot. It’s a hefty data structure, and, when you want to implement a client-side search engine, every byte you transmit counts.
Client-side search engine caveats
The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there. What static site generators do is generate every required file when generating your site, then making those available for the client to download. Usually, search-engine plugins limit themselves to tags and titles, to cut down on the amount of information that needs to be transmitted. How do we reduce the size? Easy, use a Bloom filter!
An interesting alternative to indexing a topic map with an inverted index.
I mention it in part because of one of the “weaknesses” of Bloom filters for searching:
You can’t weight pages by relevance, since you don’t know how many times a word appears in a page, all you know is whether it appears or not. You may or may not care about this.
Unlike documents, which are more or less relevant due to work occurrences, topic maps cluster information about a subject into separate topics (or proxies if you prefer).
That being the case, it isn’t the case that one topic/proxy is more “relevant” than another. The question is whether this topic/proxy represents the subject you want?
Or to put it another way, topics/proxies have already been arranged by “relevance” by a topic map author.
If a topic map interface gives you hundreds or thousands of “relevant” topics/proxies, how are you any better off than a more traditional search engine?
If you need to look like you are working, go search any of the social media sites for useful content. It’s there, the difficulty is going to be finding it.