A Data Parallel toolkit for Information Retrieval
From the website:
Many modern information retrieval data analyses need to operate on web-scale data collections. These collections are sufficiently large as to make single-computer implementations impractical, apparently necessitating custom distributed implementations.
Instead, we have implemented a collection of Information Retrieval analyses atop DryadLINQ, a research LINQ provider layer over Dryad, a reliable and scalable computational middleware. Our implementations are relatively simple data parallel adaptations of traditional algorithms, and, due entirely to the scalability of Dryad and DryadLINQ, scale up to very large data sets. The current version of the toolkit, available for download below, has been successfully tested against the ClueWeb corpus.
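To make "data parallel adaptations of traditional algorithms" concrete, here is a minimal sketch in plain Python of one traditional analysis, document frequency, written as per-partition counting followed by a merge. It is not DryadLINQ code and not taken from the toolkit; the function names, the partitioning scheme, and the sample documents are all assumptions made for illustration. In a real distributed setting each partition would live on a different machine.

```python
from collections import Counter
from functools import reduce

def partition(docs, n):
    """Split the document list into n partitions (hypothetical round-robin scheme)."""
    return [docs[i::n] for i in range(n)]

def local_document_frequency(docs):
    """Within one partition, count how many documents contain each term."""
    counts = Counter()
    for text in docs:
        # set() ensures a term is counted once per document
        counts.update(set(text.lower().split()))
    return counts

def document_frequency(docs, n_partitions=2):
    """Count per partition independently, then merge the partial counts."""
    partials = map(local_document_frequency, partition(docs, n_partitions))
    return reduce(lambda a, b: a + b, partials, Counter())

# Illustrative sample data, not drawn from any real corpus
docs = [
    "topic maps merge subjects",
    "large data sets",
    "topic maps and large corpora",
]
df = document_frequency(docs)
```

Because the per-partition step never looks outside its own partition, the result does not depend on how many partitions are used, which is the property that lets this style of computation scale out.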
Are you using large data sets in the construction of your topic maps?
Here "large" means data sets on the order of one billion documents, such as ClueWeb09 (http://boston.lti.cs.cmu.edu/Data/clueweb09/).
The authors of this work are attempting to make large data sets accessible to a wider audience.
Did they succeed?
Is their work useful for smaller data sets?
What tools would you add to assist more specifically with topic map construction?