Cross Domain Search by Exploiting Wikipedia by Chen Liu, Sai Wu, Shouxu Jiang, and Anthony K. H. Tung.
Abstract:
The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross domain resource search requirement, in which a query is a resource in one modal and the results are closely related resources in other modalities. With cross domain search, we can better exploit existing resources.
Intuitively, tags associated with Web 2.0 resources are a straightforward medium to link resources with different modality together. However, tagging is by nature an ad hoc activity. They often contain noises and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance.
By “cross domain,” the authors mean different types of resources: text vs. images, images vs. sound, or any of those three vs. some other type of resource. A single search can return “related” resources of several resource types.
Although “cross domain” searching is interesting, I am more interested in the mapping that was performed on Wikipedia. The authors define three semantic relationships:
- Link between Tag and Concept
- Correlation of Concepts
- Semantic Distance
It seems to me that the authors are attacking “big data,” which has unbounded semantics, from the “other” end. That is, they are mapping a finite universe of semantics (Wikipedia) and then using that finite mapping to mine a much larger, unbounded semantic universe.
Or perhaps they are creating a semantic lens through which to view “related resources” in a much larger semantic universe, and doing so without the overhead of Linked Data, which is mentioned under other work.
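To make the "finite mapping over an unbounded universe" idea concrete, here is a toy sketch in Python. It is not the authors' method: the tag-to-concept table, the sample resources, and the function names are all hypothetical, and plain Jaccard overlap stands in for whatever correlation and distance measures the paper actually defines.

```python
# Hypothetical tag -> concept table, standing in for a mapping mined from Wikipedia.
TAG_TO_CONCEPT = {
    "nyc": "New York City",
    "bigapple": "New York City",
    "newyork": "New York City",
    "jazz": "Jazz",
    "sax": "Saxophone",
    "saxophone": "Saxophone",
}

# Made-up resources of different media types, each carrying ad hoc tags.
RESOURCES = [
    {"id": "img1", "type": "image", "tags": ["nyc", "skyline"]},
    {"id": "snd1", "type": "audio", "tags": ["jazz", "sax", "bigapple"]},
    {"id": "doc1", "type": "text", "tags": ["saxophone", "lessons"]},
]


def concepts_for(tags):
    """Map free-form tags onto the finite concept set; unmapped tags are dropped."""
    return {TAG_TO_CONCEPT[t.lower()] for t in tags if t.lower() in TAG_TO_CONCEPT}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for a real semantic measure."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cross_domain_search(query, resources):
    """Return (id, type, score) for resources of a different type than the query."""
    query_concepts = concepts_for(query["tags"])
    results = []
    for resource in resources:
        if resource["type"] == query["type"]:
            continue  # only other media types count as cross domain here
        score = jaccard(query_concepts, concepts_for(resource["tags"]))
        if score > 0:
            results.append((resource["id"], resource["type"], score))
    # Highest score first; Python's sort is stable, so ties keep input order.
    return sorted(results, key=lambda item: -item[2])


if __name__ == "__main__":
    for hit in cross_domain_search(RESOURCES[0], RESOURCES):
        print(hit)
```

Note that the image tagged "nyc" and the audio clip tagged "bigapple" share no literal tag. They are linked only because both tags land on the same concept, which is the sense in which a finite concept set can act as a lens over an open-ended set of tags.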