Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 9, 2013

Full-Text Indexing PDFs in Javascript

Filed under: Indexing,Javascript,PDF — Patrick Durusau @ 8:35 pm

Full-Text Indexing PDFs in Javascript by Gary Sieling.

From the post:

Mozilla Labs received a lot of attention lately for a project impressive in it’s ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! On a different vein, Oliver Nightingale is implementing a Javascript full-text indexer in the Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications the text, removing grammatical features which are irrelevant to search. E.g. it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set it is automatically considered less important, and it’s effect on resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places that were previously difficult such as Phonegap apps, end-user machines, or on user data that will be stored encrypted. There is a whole field of research to encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem. (Emphasis added.)

The need for a full-text indexer without using one of the major indexing packages had not occurred to me.

Access to the user’s machine might be limited by time, for example. You would not want to waste cycles spinning up a major indexer when you don’t know the installed software.

Something to add to your USB stick. 😉

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress