Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 28, 2013

Building a distributed search system

Filed under: Distributed Computing,Hadoop,Lucene,Search Engines — Patrick Durusau @ 2:13 pm

Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.

From the preface:

This work analyses the problem coming from the so called Big Data scenario, which can be defined as the technological challenge to manage and administer quantity of information with global dimension in the order of Terabyte (10^12 bytes) or Petabyte (10^15 bytes) and with an exponential growth rate. We’ll explore a technological and algorithmic approach to handle and calculate theses amounts of data that exceed the limit of computation of a traditional architecture based on real-time request processing: in particular we’ll analyze a singular open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.

We’ll describe also how to distribute a cluster of common server to create a Virtual File System and use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation will be a web based application which offers to the user a unified searching interface against a collection of technical papers. The scope is to demonstrate that a performant search system can be obtained pre-processing the data using the Map and Reduce paradigm, in order to obtain a real time response, which is independent to the underlying amount of data. Finally we’ll compare this solutions to different approaches based on clusterization or No SQL solutions, with the scope to describe the characteristics of concrete scenarios, which suggest the adoption of those technologies.

Fairly complete (75 pages) report on a project indexing academic papers with Lucene and Hadoop.
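
As a rough illustration of the pre-processing idea in the preface (this is not the report's code, which uses Hadoop and Lucene), a map step can emit (term, document id) pairs and a reduce step can group them into an inverted index, so that answering a query becomes a lookup instead of a scan over the documents. The function names and the toy documents below are hypothetical, chosen only for this sketch, and everything runs in a single process instead of on a cluster.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per token in a document."""
    for term in TOKEN.findall(text.lower()):
        yield term, doc_id


def reduce_phase(pairs):
    """Reduce step: group pairs by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def build_index(docs):
    """Run the map step over every document, then reduce into an index."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical documents standing in for the collection of technical papers.
docs = {
    "paper-1": "Distributed search with Hadoop",
    "paper-2": "Search indexes built with Lucene",
    "paper-3": "Hadoop and Lucene together",
}
index = build_index(docs)
```

With the index built ahead of time, `search(index, "hadoop lucene")` only touches the posting lists for those two terms, which is the sense in which query time depends on the index and not on re-reading the underlying data.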

I would like to see a treatment of the voiced demand for “real-time processing” versus the actual need for “real-time processing.”

When I started using research tools, indexes like the Readers' Guide to Periodical Literature were, at a minimum, two (2) weeks behind popular journals.

Academic indexes ran that far behind if not a good bit longer.

Indexing of journal articles is now nearly simultaneous with publication.

Has the quality of our research improved due to faster access?

I can imagine use cases, drug interactions for example, the discovery of which should be streamed out as soon as practical.

But drug interactions are not the average case.

It would be very helpful to see research on which factors favor “real-time” solutions and which use cases are adequately served by “non-real-time” solutions.

June 4, 2013

WeevilScout [Distributed Browser Computing]

Filed under: Distributed Computing,Javascript,Web Browser — Patrick Durusau @ 2:35 pm

WeevilScout

From this poster:

The proliferation of web browsers and the performance gain being achieved by current JavaScript virtual machines raises the question whether Internet browsers can become yet another middleware for distributed computing.

Will we need new HPC benchmarks when 10 million high end PCs link their web browser JavaScript engines together?

What about 20 million high end PCs?

But the ability to ask questions of large data sets is no guarantee that we will formulate good questions to ask.

Pointers to discussions on how to decide what questions to ask?

Or do we ask the old questions and just get the results more quickly?

I first saw this at Nat Torkington’s Four short links: 4 June 2013.

April 25, 2013

PODC and SPAA 2013 Accepted Papers

Filed under: Conferences,Distributed Computing,Parallel Programming,Parallelism — Patrick Durusau @ 2:03 pm

ACM Symposium on Principles of Distributed Computing [PODC] accepted papers. (Montréal, Québec, Canada, July 22-24, 2013) Main PODC page.

Symposium on Parallelism in Algorithms and Architectures [SPAA] accepted papers. (Montréal, Québec, Canada, July 23 – 25, 2013) Main SPAA page.

Just scanning the titles reveals a number of very interesting papers.

Suggest you schedule a couple of weeks of vacation in Canada following SPAA before attending the Balisage Conference, August 6-9, 2013.

The weather is quite temperate and the outdoor dining superb.

I first saw this at: PODC AND SPAA 2013 ACCEPTED PAPERS.
