Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.
From the preface:
This work analyses the problems arising from the so-called Big Data scenario, which can be defined as the technological challenge of managing and administering quantities of information on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), growing at an exponential rate. We’ll explore a technological and algorithmic approach to handling and computing these amounts of data, which exceed the computational limits of a traditional architecture based on real-time request processing: in particular, we’ll analyze a single open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.
We’ll also describe how to distribute a cluster of commodity servers to create a virtual file system, and how to use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation is a web-based application that offers the user a unified search interface over a collection of technical papers. The goal is to demonstrate that a performant search system can be obtained by pre-processing the data using the Map and Reduce paradigm, yielding a real-time response that is independent of the underlying amount of data. Finally, we’ll compare this solution to different approaches based on clustering or NoSQL solutions, with the aim of describing the characteristics of concrete scenarios that suggest the adoption of those technologies.
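To make the pipeline in the preface concrete (MapReduce pre-processing feeding a centralized Lucene index), here is a minimal, hypothetical Java sketch of a Hadoop reducer that writes one Lucene index shard from pre-processed (paperId, body) pairs. This is not code from the report; the class, field, and shard-path names are my own illustrative assumptions.

```java
// Hypothetical sketch: a Hadoop reducer that builds a Lucene index shard.
// Names and paths are illustrative, not taken from the report.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    private IndexWriter writer;

    @Override
    protected void setup(Context context) throws IOException {
        // One index shard per reducer; shards can be merged or searched
        // together once the job completes.
        String shardDir = "/tmp/index-shard-"
                + context.getTaskAttemptID().getTaskID().getId();
        writer = new IndexWriter(
                FSDirectory.open(Paths.get(shardDir)),
                new IndexWriterConfig(new StandardAnalyzer()));
    }

    @Override
    protected void reduce(Text paperId, Iterable<Text> bodies, Context context)
            throws IOException {
        for (Text body : bodies) {
            Document doc = new Document();
            // Store the identifier for retrieval; analyze the text for search.
            doc.add(new StringField("id", paperId.toString(), Field.Store.YES));
            doc.add(new TextField("text", body.toString(), Field.Store.NO));
            writer.addDocument(doc);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        writer.close(); // commits the shard
    }
}
```

Because all the tokenizing and indexing work happens in the batch job, the web application only has to run queries against the finished index, which is what would make response time independent of the raw data volume.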
A fairly complete report (75 pages) on a project indexing academic papers with Lucene and Hadoop.
I would like to see a treatment of the voiced demand for “real-time processing” versus the actual need for it.
When I started using research tools, indexes like the Readers’ Guide to Periodical Literature were at a minimum two weeks behind popular journals.
Academic indexes ran that far behind, if not a good bit longer.
Indexing of journal articles is now nearly simultaneous with publication.
Has the quality of our research improved due to faster access?
I can imagine use cases, drug interactions for example, where discoveries should be streamed out as soon as practical.
But drug interactions are not the average case.
It would be very helpful to see research on which factors favor “real-time” solutions and which cases are served quite well by “non-real-time” solutions.