Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 21, 2013

Search Rules using Mahout’s Association Rule Mining

Filed under: Machine Learning,Mahout,Searching — Patrick Durusau @ 2:05 pm

Search Rules using Mahout’s Association Rule Mining by Sujit Pal.

This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in the query – for example, if a query contained “cattle” and “neurological disorder”, then we should also serve results for “bovine spongiform encephalopathy”, also known as “mad cow disease”.

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client’s requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept came from one of my medium-sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won’t be able to handle this data volume. Mahout is probably a better fit, and it offers the FPGrowth algorithm running on Hadoop, so that’s what I used. This post describes the things I had to do to prepare my data for Mahout, run the job with Mahout on the Amazon Elastic MapReduce (EMR) platform, then post-process the data to get useful information out of it.
(…)
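
To make the idea in the quoted passage concrete, here is a minimal sketch in Python. It is not Sujit Pal's code and it does not use Mahout or FPGrowth: it is a plain in-memory count of concept pairs, which only works for toy data and only finds rules with a single concept on each side. The function name mine_pair_rules, the thresholds, and the four sample documents are all hypothetical. Each document is a "bag of concepts" (concept mapped to score), and a rule A -> B is kept when the pair appears in enough documents (support) and B appears in a large enough share of the documents containing A (confidence).

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(docs, min_support=2, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) tuples.

    docs is an iterable of concept collections, one per document.
    Only pairs of concepts are considered (an assumption made to keep
    the sketch short; FPGrowth mines larger itemsets as well).
    """
    item_counts = Counter()
    pair_counts = Counter()
    for concepts in docs:
        unique = sorted(set(concepts))  # one vote per concept per document
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Highest confidence first, then highest support, then alphabetical.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical bag-of-concepts documents: concept -> score.
docs = [
    {"cattle": 0.9, "neurological disorder": 0.7, "bse": 0.8},
    {"cattle": 0.5, "bse": 0.6},
    {"cattle": 0.4, "dairy farming": 0.9},
    {"neurological disorder": 0.8, "bse": 0.3},
]

# Scores are ignored here; only presence of a concept matters.
rules = mine_pair_rules([d.keys() for d in docs])
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support}  confidence={confidence:.2f}")
```

On a corpus of the size described in the quote, this kind of in-memory counting would not be practical, which is presumably why the original post turns to FPGrowth on Hadoop.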

I don’t know that I would call these “search rules” but they would certainly qualify as input into defining merging rules.

Particularly if I were mining domain literature, where co-occurring terms are likely to have the same semantics. Not always, but likely. The likelihood of semantic sameness is something you can sample for and develop confidence measures about.
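
One way to put a number on that sampling, offered as a sketch rather than a recommendation: draw a random sample of mined co-occurrences, have a domain expert judge each one as "same subject" or not, and compute a confidence interval for the true proportion. The snippet below uses the Wilson score interval, which is my choice and not something the quoted post mentions. The function name and the sample figures (42 of 50 judged the same) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: sampled co-occurrences judged semantically the same
    n:         total sampled co-occurrences
    z:         normal quantile (1.96 corresponds to roughly 95%)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical sample: an expert judged 42 of 50 sampled rules as "same".
low, high = wilson_interval(42, 50)
print(f"estimated sameness rate: {42 / 50:.2f} (interval {low:.2f} to {high:.2f})")
```

A wide interval is a signal to sample more before turning the mined rules into merging rules.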
