Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 28, 2011

Surrogate Learning

Filed under: Entity Resolution,Linguistics,Record Linkage — Patrick Durusau @ 7:04 pm

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.

Abstract:

We consider the task of learning a classifier from the feature space X to the set of classes $Y = {0, 1}$, when the features can be partitioned into class-conditionally independent feature sets $X1$ and $X2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X2$ to $X1$ (in the sense of estimating the probability $P(x1|x2))$ and 2) learning the class-conditional distribution of the feature set $X1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.

The two “real world” applications are ones you are likely to encounter:

First:

Our problem consisted of merging each of ≈ 20000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ 106 records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable, if not more so, than the techniques themselves. Techniques come and go. Data analysis and the skills it requires goes on and on.)

Second:

Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress