C-Rank: A Link-based Similarity Measure for Scientific Literature Databases by Seok-Ho Yoon, Sang-Wook Kim, and Sunju Park.
As the number of people who use scientific literature databases grows, the demand for literature retrieval services has been steadily increased. One of the most popular retrieval services is to find a set of papers similar to the paper under consideration, which requires a measure that computes similarities between papers. Scientific literature databases exhibit two interesting characteristics that are different from general databases. First, the papers cited by old papers are often not included in the database due to technical and economic reasons. Second, since a paper references the papers published before it, few papers cite recently-published papers. These two characteristics cause all existing similarity measures to fail in at least one of the following cases: (1) measuring the similarity between old, but similar papers, (2) measuring the similarity between recent, but similar papers, and (3) measuring the similarity between two similar papers: one old, the other recent. In this paper, we propose a new link-based similarity measure called C-Rank, which uses both in-link and out-link by disregarding the direction of references. In addition, we discuss the most suitable normalization method for scientific literature databases and propose an evaluation method for measuring the accuracy of similarity measures. We have used a database with real-world papers from DBLP and their reference information crawled from Libra for experiments and compared the performance of C-Rank with those of existing similarity measures. Experimental results show that C-Rank achieves a higher accuracy than existing similarity measures.
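To make the three failure cases in the abstract concrete, here is a minimal sketch of what "disregarding the direction of references" buys you. It is not the authors' C-Rank computation; it is a plain one-step Jaccard overlap of neighbor sets, and the toy graph, paper names, and function names are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical toy citation graph: paper -> set of papers it cites.
# The old papers have no out-links because their references are
# assumed to fall outside the database.
cites: Dict[str, Set[str]] = {
    "old_a": set(),
    "old_b": set(),
    "mid_1": {"old_a", "old_b"},
    "mid_2": {"old_a", "old_b"},
    "new_a": {"mid_1", "mid_2"},
    "new_b": {"mid_1", "mid_2"},
}

def out_links(paper: str) -> Set[str]:
    """Papers that `paper` cites."""
    return cites.get(paper, set())

def in_links(paper: str) -> Set[str]:
    """Papers that cite `paper`."""
    return {src for src, refs in cites.items() if paper in refs}

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def coupling(p: str, q: str) -> float:
    """Out-link only: shared references."""
    return jaccard(out_links(p), out_links(q))

def cocitation(p: str, q: str) -> float:
    """In-link only: shared citing papers."""
    return jaccard(in_links(p), in_links(q))

def undirected(p: str, q: str) -> float:
    """In-links and out-links pooled, direction ignored."""
    return jaccard(in_links(p) | out_links(p), in_links(q) | out_links(q))

for pair in [("old_a", "old_b"), ("new_a", "new_b"), ("old_a", "new_a")]:
    print(pair, coupling(*pair), cocitation(*pair), undirected(*pair))
```

In this toy graph the out-link measure scores the two old papers at zero, the in-link measure scores the two recent papers at zero, and both score the old/recent pair at zero, while the undirected overlap gives a nonzero score in all three cases. Whether real databases behave like this toy graph is exactly the question raised in the comments that follow.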
Reviews other link-based similarity measures and compares them to the proposed C-Rank measure, both in theory and in actual experiments.
Interesting use of domain experts to create baseline similarity measures, against which the similarity measures were compared.
I am not quite convinced by the “difference” argument for scientific literature:
First, few papers exist which are referenced by old papers. This is because very old papers are often not included in the database due to technical and economic reasons. Second, since a paper can reference only the papers published before it (and never the papers published after it), there exist few papers which reference recently-published papers.
As for old papers not being included in the database, the authors should try philosophy articles, which cite a wide range of material that is very unlikely to be in a literature database. (Google Books may be changing that for “recent” literature.)
On scientific papers not citing recent papers, I suspect that simply isn’t true. David P. Hamilton (Science, 251:25, 1991), in Research Papers: Who’s Uncited Now?, comments on work by David Pendlebury of the Institute for Scientific Information that found “Atomic, molecular, and chemical physics, a field in which only 9.2% of articles go uncited…” within five years of publication. That sounds like recent papers being cited to me.
If you are interested in citation practices for monographs, see: Citation Characteristics and Intellectual Acceptance of Scholarly Monographs by Rong Tang (Coll. res. libr. July 2008 69:356-369).
If it isn’t already on your regular reading list, College & Research Libraries should be.
I mention all that to point out that exploring the characteristics of information collections may turn up surprising facts, facts that can influence the development of algorithms for search, similarity and ultimately for use by a user.