Not all citations are equal: identifying key citations automatically by Daniel Lemire.
From the post:
Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.
A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.
Is that good enough? No.
Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, if you have found a given research paper, you’d really want to also get a list of all papers building on it…
Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.
Great, right? So you just pick your research paper and review the papers citing it.
If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.
Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.
…
The paper of the same title is due to appear in JASIST.
The abstract:
The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation.
By asking authors to identify the key references in their own work, we created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this dataset using only four features.
The best features, among those we evaluated, were features based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
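To make the idea concrete, here is a minimal Python sketch contrasting the conventional h-index with a mention-weighted variant. The exact weighting used by the hip-index is defined in the paper; the sum-of-mentions weighting below is only one plausible reading of "weights citations by how many times a reference is mentioned," and the data is invented for illustration.

```python
def h_index(citation_counts):
    # Conventional h-index: the largest h such that h papers
    # have at least h citations each.
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(counts, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def hip_index(mentions_per_citing_paper):
    # Mention-weighted sketch: replace each paper's raw citation
    # count with the sum of in-text mention counts across its
    # citing papers, then apply the usual h-index cutoff.
    # NOTE: this weighting is an assumption; the paper's actual
    # hip-index definition may differ.
    weighted = [sum(mentions) for mentions in mentions_per_citing_paper]
    return h_index(weighted)

# Hypothetical data: one inner list per paper of ours; each entry is
# how many times one citing paper mentions it in its body text.
papers = [
    [3, 2, 2],   # three citing papers, each engaging with the work
    [4, 1],      # one deep citation, one shallow
    [1, 1, 1],   # three shallow, pro-forma citations
    [1],
]

print(h_index([len(m) for m in papers]))  # raw citation counts -> 2
print(hip_index(papers))                  # mention-weighted -> 3
```

Note how the paper cited three times in passing scores the same as the paper cited three times in depth under the conventional h-index, while the weighted variant rewards sustained engagement.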
What I find intriguing is the potential for this kind of research to enable a form of semantic triage when creating topic maps or other semantic resources.
If only three out of thirty citations in a paper are determined to be “influential,” why should I spend scarce resources capturing the other twenty-seven as completely as the influential ones?
The corollary to Daniel’s “not all citations are equal” is that “not all content is equal.”
We already make that sort of choice when we select some citations from the larger pool of possible citations.
I’m just suggesting that we make that decision explicit when creating semantic resources.
PS: I wonder how Daniel’s approach would work with opinions rendered in legal cases. Courts often cite an entire block of prior decisions without relying on any particular rule or fact from them. It could reduce the overhead of tracking influential prior case decisions.