SEISA: Set Expansion by Iterative Similarity Aggregation, by Yeye He (University of Wisconsin-Madison, Madison, WI, USA) and Dong Xin (Microsoft Research, Redmond, WA, USA).
In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set. A typical example is to use “Canon” and “Nikon” as seed entities, and derive other entities (e.g., “Olympus”) in the same concept set of camera brands. In order to discover such relevant entities, we exploit several web data sources, including lists extracted from web pages and user queries from a web search engine. While these web data are highly diverse with rich information that usually cover a wide range of the domains of interest, they tend to be very noisy. We observe that previously proposed random walk based approaches do not perform very well on these noisy data sources. Accordingly, we propose a new general framework based on iterative similarity aggregation, and present detailed experimental results to show that, when using general-purpose web data for set expansion, our approach outperforms previous techniques in terms of both precision and recall.
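To make the seed-and-expand idea concrete, here is a toy sketch in Python. It is not the SEISA algorithm from the paper: the abstract does not give the details, so the Jaccard similarity over shared lists, the fixed result size, and the names `build_contexts`, `jaccard`, `expand`, and `web_lists` are all my own assumptions. It only illustrates the general shape of scoring candidates by their aggregated similarity to the current set and repeating until the set stops changing.

```python
from collections import defaultdict


def build_contexts(lists):
    """Map each entity to the set of list ids it appears in."""
    contexts = defaultdict(set)
    for list_id, items in enumerate(lists):
        for item in items:
            contexts[item].add(list_id)
    return dict(contexts)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(seeds, lists, size=4, max_iters=10):
    """Grow the seed set to `size` entities by iterative similarity aggregation.

    Each round scores every non-seed candidate by its average similarity
    to the current set, keeps the best ones, and stops once the set is stable.
    """
    contexts = build_contexts(lists)
    seeds = {s for s in seeds if s in contexts}
    if not seeds:
        raise ValueError("none of the seeds occur in the given lists")

    current = set(seeds)
    for _ in range(max_iters):
        scores = {
            cand: sum(jaccard(ctx, contexts[m]) for m in current) / len(current)
            for cand, ctx in contexts.items()
            if cand not in seeds
        }
        # Sort by descending score, then by name so ties break deterministically.
        ranked = sorted(scores, key=lambda e: (-scores[e], e))
        updated = seeds | set(ranked[: max(size - len(seeds), 0)])
        if updated == current:
            break
        current = updated
    return current


# Hypothetical, hand-made "web lists"; one of them is deliberately noisy.
web_lists = [
    ["Canon", "Nikon", "Olympus", "Sony"],
    ["Canon", "Nikon", "Olympus"],
    ["Nikon", "Olympus", "Pentax"],
    ["Canon", "Printer", "Scanner"],
    ["Apple", "Banana", "Cherry"],
]

if __name__ == "__main__":
    print(sorted(expand({"Canon", "Nikon"}, web_lists, size=4)))
```

On real web data the interesting part is exactly what this toy version glosses over: how to weight noisy lists and queries so that an entry like "Printer" does not ride in on a single co-occurrence with "Canon".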
To the uses of set expansion mentioned by the authors:
Set expansion systems are of practical importance and can be used in various applications. For instance, web search engines may use the set expansion tools to create a comprehensive entity repository (for, say, brand names of each product category), in order to deliver better results to entity-oriented queries. As another example, the task of named entity recognition can also leverage the results generated by set expansion tools [13].
I would add:
- augmented authoring of navigation tools for text corpora
- discovery of related entities (for associations)
While the authors concentrate on web-based documents, which for the most part are freely available, the techniques shown here could be just as easily applied to commercial texts or used to generate pay-for-view results.
The results would have to be a real step up before people would pay a premium for navigating free content, but given how noisy most information sites are, that is certainly possible.