Detecting Semantic Overlap and Discovering Precedents in the Biodiversity Research Literature by Graeme Hirst, Nadia Talent, and Sara Scharf.
Abstract:
Scientific literature on biodiversity is longevous, but even when legacy publications are available online, researchers often fail to search it adequately or effectively for prior publications; consequently, new research may replicate, or fail to adequately take into account, previously published research. The mechanisms of the Semantic Web and methods developed in contemporary research in natural language processing could be used, in the near-term future, as the basis for a precedent-finding system that would take the text of an author’s early draft (or a submitted manuscript) and find potentially related ideas in published work. Methods would include text-similarity metrics that take different terminologies, synonymy, paraphrase, discourse relations, and structure of argumentation into account.
Footnote one (1) of the paper gives an idea of the problem the authors face:
“Natural history scientists work in fragmented, highly distributed and parochial communities, each with domain specific requirements and methodologies [Scoble 2008]. Their output is heterogeneous, high volume and typically of low impact, but with a citation half-life that may run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is longer than in any other scientific discipline, and the decay rate is longer than in any scientific discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the basis for Moritz’s remark.
The paper explores in detail issues that have daunted various search techniques, when the material is available in electronic format at all.
The authors make a general proposal for addressing these issues. They mention the Semantic Web but explicitly omit one part of it from their plan:
The other omission is semantic interpretation into a logical form, represented in XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and Lassila (2001) proposal for the Semantic Web. The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here. This is not to say that logical forms would be useless. On the contrary, they are employed by some approaches to paraphrase and textual entailment (section 4.1 above) and hence might appear in the system if only for that reason; but even so, they would form only one component of a broader and somewhat looser kind of semantic representation.
That’s the problem with the Semantic Web in a nutshell:
The problem with logical-form representation is that it implies a degree of precision in meaning that is not appropriate for the kind of matching we are proposing here.
What if I want to be logically precise sometimes but not others?
What if I want to be more precise in some places and less precise in others?
What if I want to have different degrees or types of imprecision?
With topic maps the question is: How im/precise do you want to be?
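To make the idea of choosing a degree of precision concrete, here is a minimal sketch in Python. It is not the method from the paper and not a topic maps implementation. The synonym table, the function names, and the sample term lists are all invented for illustration. It only shows how the same comparison can be run strictly (exact terms, full agreement required) or loosely (synonyms merged, partial overlap accepted).

```python
# Hypothetical sketch: comparing two sets of terms with an adjustable
# degree of precision. All names and data here are illustrative assumptions.

# Toy synonym table mapping variant names to one canonical term.
SYNONYMS = {
    "puma": "cougar",
    "mountain lion": "cougar",
}


def normalize(term, use_synonyms):
    """Lower-case a term and, if requested, map it to its canonical form."""
    term = term.lower().strip()
    if use_synonyms:
        return SYNONYMS.get(term, term)
    return term


def jaccard(terms_a, terms_b, use_synonyms=False):
    """Jaccard overlap of two term lists: |A & B| / |A | B|."""
    a = {normalize(t, use_synonyms) for t in terms_a}
    b = {normalize(t, use_synonyms) for t in terms_b}
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def is_match(terms_a, terms_b, threshold=1.0, use_synonyms=False):
    """Declare a match when the overlap reaches the chosen threshold."""
    return jaccard(terms_a, terms_b, use_synonyms) >= threshold


draft = ["Puma", "habitat", "range"]
prior = ["cougar", "habitat", "diet"]

strict_score = jaccard(draft, prior)                     # exact terms only
loose_score = jaccard(draft, prior, use_synonyms=True)   # synonyms merged
```

With these sample lists the strict comparison shares only "habitat" (one of five distinct terms), while the synonym-aware comparison also treats "Puma" and "cougar" as the same term (two of four). The `threshold` and `use_synonyms` parameters are the knobs: the caller, not the representation, decides how precise the match has to be.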