Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 23, 2011

Lucene.net is back on track

Filed under: Lucene,Search Algorithms,Search Engines — Patrick Durusau @ 3:08 pm

Lucene.net is back on track by Simone Chiaretta

From the post:

More than 6 months ago I blogged about Lucene.net starting his path toward extinction. Soon after that, due to the “stubbornness” of the main committer, a few forks appeared, the biggest of which was Lucere.net by Troy Howard.

At the end of the year, despite the promises of the main committer of complying to the request of the Apache board by himself, nothing happened and Lucene.net went really close to be being shut down. But luckily, the same Troy Howard that forked Lucene.net a few months before, decided, together with a bunch of other volunteers, to resubmit the documents required by the Apache Board for starting a new project into the Apache Incubator; by the beginning of February the new proposal was voted for by the Board and the project re-entered the incubator.

If you are interested in search engines and have .Net skills (or want to acquire them), this would be a good place to start.

July 20, 2011

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs!

Filed under: Algorithms,Search Algorithms — Patrick Durusau @ 12:56 pm

K-sort: A new sorting algorithm that beats Heap sort for n <= 70 lakhs! by Kiran Kumar Sundararajan, Mita Pal, Soubhik Chakraborty, N.C. Mahanti.

From the description:

Sundararajan and Chakraborty (2007) introduced a new version of Quick sort removing the interchanges. Khreisat (2007) found this algorithm to be competing well with some other versions of Quick sort. However, it uses an auxiliary array thereby increasing the space complexity. Here, we provide a second version of our new sort where we have removed the auxiliary array. This second improved version of the algorithm, which we call K-sort, is found to sort elements faster than Heap sort for an appreciably large array size (n <= 70,00,000) for uniform U[0, 1] inputs.

OK, so some people have small data, <= 70 lakhs (7,000,000 items), to sort at one time. 😉 It also shows that there are still interesting things to say about sorting, an operation of interest to topic mappers and others.
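For readers who want a feel for "Quick sort removing the interchanges," here is a minimal sketch of one way to partition without swaps by copying into an auxiliary array. It is my own illustration of the idea in the description, not the authors' code and not K-sort itself (which drops the auxiliary array). The names LAKH, partition_with_aux and aux_quick_sort are made up for this example.

```python
import random

LAKH = 100_000  # one lakh in the Indian numbering system, so 70 lakhs = 7,000,000


def partition_with_aux(a, lo, hi, aux):
    """Partition a[lo..hi] around the pivot a[lo] without swapping elements.

    Smaller elements are copied to the left end of aux, the others to the
    right end; the pivot fills the single slot left over in the middle.
    """
    pivot = a[lo]
    left, right = lo, hi
    for i in range(lo + 1, hi + 1):
        if a[i] < pivot:
            aux[left] = a[i]
            left += 1
        else:
            aux[right] = a[i]
            right -= 1
    aux[left] = pivot  # left == right at this point
    a[lo:hi + 1] = aux[lo:hi + 1]  # copy the partitioned slice back
    return left


def aux_quick_sort(a):
    """Sort the list a in place and return it (explicit stack, no recursion)."""
    aux = [None] * len(a)
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        p = partition_with_aux(a, lo, hi, aux)
        stack.append((lo, p - 1))
        stack.append((p + 1, hi))
    return a


if __name__ == "__main__":
    # Uniform U[0, 1] inputs, as in the paper's experiments, at a small size.
    sample = [random.random() for _ in range(10)]
    print(aux_quick_sort(sample))
```

The auxiliary array is exactly the extra space the description says the second version removes.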

July 19, 2011

Build your own internet search engine

Filed under: Erlang,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 7:53 pm

Build your own internet search engine by Daniel Himmelein.

Uses Erlang but also surveys the Apache search stack.

Not that you have to roll your own search engine, but it will give you a different appreciation for the issues they face.


Update: Build your own internet search engine – Part 2

I ran across part 2 while cleaning up at year’s end. Enjoy!

July 1, 2011

Indexing The World Wide Web:…

Filed under: Indexing,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 2:57 pm

Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain.

Abstract:

In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concept in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.

A non-trivial survey of attempts at indexing the web and the issues involved. It is going to take a while to digest, but it looks like a very good starting place to uncover what to try next.
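To make the "phrases instead of terms" point concrete, here is a small sketch of a positional inverted index that can answer phrase queries. It is a related textbook technique, not necessarily what the chapter proposes, and it assumes whitespace tokenization and lower-casing. The function names and sample documents are mine.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids in which the words of phrase occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # Keep the start positions that survive alignment with each later term.
        starts = set(index[terms[0]][doc_id])
        for offset, t in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "web search engines index the web",
    2: "search the web with engines",
    3: "engines search",
}
index = build_positional_index(docs)
print(phrase_search(index, "search engines"))
```

A plain term index would return all three documents for "search engines"; the positions are what narrow the phrase query down.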

April 11, 2011

A Data Parallel toolkit for Information Retrieval

Filed under: Data Mining,Information Retrieval,Search Algorithms,Searching — Patrick Durusau @ 5:53 am

A Data Parallel toolkit for Information Retrieval

From the website:

Many modern information retrieval data analyses need to operate on web-scale data collections. These collections are sufficiently large as to make single-computer implementations impractical, apparently necessitating custom distributed implementations.

Instead, we have implemented a collection of Information Retrieval analyses atop DryadLINQ, a research LINQ provider layer over Dryad, a reliable and scalable computational middleware. Our implementations are relatively simple data parallel adaptations of traditional algorithms, and, due entirely to the scalability of Dryad and DryadLINQ, scale up to very large data sets. The current version of the toolkit, available for download below, has been successfully tested against the ClueWeb corpus.
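As a rough picture of what a "simple data parallel adaptation" of a traditional algorithm can look like, here is a sketch that computes document frequencies per partition and then merges the partial results. It is plain Python, not DryadLINQ, and the function names and partitioning scheme are my own assumptions for illustration.

```python
from collections import Counter
from functools import reduce


def partition(items, n_parts):
    """Split a list into n_parts roughly equal slices."""
    return [items[i::n_parts] for i in range(n_parts)]


def document_frequency(docs):
    """Count, for one partition, how many documents contain each term."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    return counts


def parallel_document_frequency(docs, n_parts=4, map_fn=map):
    """Apply document_frequency to each partition, then merge the results.

    map_fn defaults to the built-in map; a process pool's map could be
    passed instead to run the partitions on separate cores.
    """
    partials = map_fn(document_frequency, partition(docs, n_parts))
    return reduce(lambda a, b: a + b, partials, Counter())


sample_docs = ["the web", "the index", "web web search"]
print(parallel_document_frequency(sample_docs, n_parts=2))
```

Each partition is processed independently, so the only coordination needed is the final merge.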

Are you using large data sets in the construction of your topic maps?

Where large is taken to mean data sets in the range of one billion documents. (http://boston.lti.cs.cmu.edu/Data/clueweb09/)

The authors of this work are attempting to extend access to large data sets to a larger audience.

Did they succeed?

Is their work useful for smaller data sets?

What tools would you add to assist more specifically with topic map construction?

April 10, 2011

TCS: Call for papers on Graph Searching

Filed under: Graphs,Networks,Search Algorithms,Searching — Patrick Durusau @ 2:52 pm

TCS: Call for papers on Graph Searching

From the call:

Manuscripts are solicited for a special issue in the journal “Theoretical Computer Science” (TCS) on “Theory and Applications of Graph Searching Problems”. This special issue will be dedicated to the 60th birthday of Lefteris M. Kirousis.

….

  • Graph Searching and Logic
  • Graph Parameters Related to Graph Searching
  • Graph searching and Robotics
  • Conquest and Expansion Games
  • Database Theory and Robber and Marshals Games
  • Probabilistic Techniques in Graph Searching
  • Monotonicity and Connectivity in Graph Searching
  • New Variants of Graph Searching
  • Graph Searching and Distributed Computing
  • Graph Searching and Network Security

Deadline for submission is: 31 July 2011.

Interesting as a submission venue, or as an issue to watch for when it appears.
