Archive for the ‘Citation Analysis’ Category

Network Measures of the United States Code

Saturday, March 5th, 2016

Network Measures of the United States Code by Alexander Lyte, Dr. David Slater, Shaun Michel.


The U.S. Code represents the codification of the laws of the United States. While it is a well-organized and curated corpus of documents, the legal text remains nearly impenetrable for non-lawyers. In this paper, we treat the U.S. Code as a citation network and explore its complexity using traditional network metrics. We find interesting topical patterns emerge from the citation structure and begin to interpret network metrics in the context of the legal corpus. This approach has potential for determining policy dependency and robustness, as well as modeling of future policies.
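The abstract's approach can be sketched in a few lines: treat each section of the Code as a node and each cross-reference as a directed edge, then compute simple network metrics. The section identifiers and cross-references below are invented for illustration, not drawn from the paper.

```python
# A minimal sketch of the paper's approach: sections as nodes,
# cross-references as directed edges, then simple degree metrics.
from collections import defaultdict

# Hypothetical cross-references: section -> sections it cites.
cites = {
    "26 USC 61":   ["26 USC 101", "26 USC 102"],
    "26 USC 101":  ["26 USC 61"],
    "18 USC 1030": ["18 USC 2", "26 USC 61"],
    "18 USC 2":    [],
    "26 USC 102":  [],
}

def degree_metrics(graph):
    """Return (out_degree, in_degree) maps for a directed citation graph."""
    out_deg = {n: len(targets) for n, targets in graph.items()}
    in_deg = defaultdict(int)
    for n in graph:
        in_deg[n] += 0          # ensure every node appears, even if uncited
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    return out_deg, dict(in_deg)

out_deg, in_deg = degree_metrics(cites)
# The most-cited section is a rough proxy for "policy dependency":
most_cited = max(in_deg, key=in_deg.get)
print(most_cited)  # 26 USC 61 (cited by two other sections)
```

In-degree here stands in for "how many policies depend on this one"; a real analysis would add centrality and community measures over the full Code.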

The citation network is quite impressive:


I have inquired about an interactive version of the network, but have received no response as of yet.

Bluebook® vs. Baby Blue’s (Or, Bleak House “Lite”)

Friday, February 19th, 2016

The suspense over what objections The Bluebook® A Uniform System of Citation® could have to the publication of Baby Blue’s Manual of Legal Citation ended with a whimper, not a bang, upon the publication of Baby Blue’s.

You may recall that I wrote in favor of Baby Blue’s, sight unseen, in Bloggers! Help Defend The Public Domain – Prepare To Host/Repost “Baby Blue” and in Oxford Legal Citations Free, What About BlueBook?.

Of course, then Baby Blue’s Manual of Legal Citation was published.

I firmly remain of the opinion that legal citations are in the public domain. Moreover, the use of legal citations is the goal of any citation originator, so asserting copyright over them would be self-defeating, if not insane.

Having said that, Baby Blue’s Manual of Legal Citation is more of a Bleak House “Lite” than a useful re-imagining of legal citation in a modern context.

I don’t expect you to take my word for that judgment so I have prepared mappings from Bluebook® to Baby Blue’s and Baby Blue’s to Bluebook®.

Caveat 1: Baby Blue’s is still subject to revision and may tinker with its table numbering, for example, to further demonstrate its “originality,” so consider these mappings provisional and subject to change.

Caveat 2: The mappings are pointers to equivalent subject matter and not strictly equivalent content.

How closely the content of these two publications track each other is best resolved by automated comparison of the two.

As general assistance, pages 68-191 (out of 198) of Baby Blue’s are in substantial accordance with pages 233-305 and 491-523 of the Bluebook®. Foreign citations, covered by pages 307-490 in the Bluebook®, merit a scant two pages, 192-193, in Baby Blue’s.

The substantive content of Baby Blue’s doesn’t begin until page 10 and continues to page 67, with tables beginning on page 68. In terms of non-table content, there are only 57 pages of material for comparison to the Bluebook®. As you can see from the mappings, the ordering of rules has been altered from the Bluebook®, no doubt as a showing of “originality.”

The public does need greater access to primary legal resources, but treating the ability to cite Tucker and Clephane (District of Columbia, 1892-1893) [Baby Blue’s page 89] on a par with the Federal Reporter [Baby Blue’s page 67] is not a step in that direction.

PS: To explore the issues and possibilities at hand, you will need a copy of The Bluebook® A Uniform System of Citation®.

Some starter questions:

  1. What assumptions underlie the rules reported in the Bluebook®?
  2. How would you measure the impact of changing the rules it reports?
  3. What technologies drove its form and organization?
  4. What modern technologies could alter its form and organization?
  5. How could modern technologies display content that uses its citations differently?

A more specific question could be: Do we need 123 pages of abbreviations in Baby Blue’s, or 113 pages in the Bluebook®, when software can display expanded abbreviations to any user, even if the text was originally written with abbreviations?
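The software capability invoked above is genuinely trivial. Here is a toy display layer; the tiny abbreviation table is mine, not drawn from either manual:

```python
# Expand reporter abbreviations at display time, so the printed-page
# convention never has to reach the reader. Illustrative entries only.
EXPANSIONS = {
    "F.": "Federal Reporter",
    "F. Supp.": "Federal Supplement",
    "U.S.": "United States Reports",
}

def expand(citation):
    """Replace a known reporter abbreviation with its full name."""
    # Try longer abbreviations first so "F. Supp." wins over "F.".
    for abbr in sorted(EXPANSIONS, key=len, reverse=True):
        if abbr in citation:
            return citation.replace(abbr, EXPANSIONS[abbr])
    return citation

print(expand("410 U.S. 113"))  # 410 United States Reports 113
```

A real implementation would need proper tokenizing of citations, but the point stands: pages of abbreviation tables are a print-era constraint, not a necessity.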

Abbreviations are both a means of restricting access and understanding, and partly a limitation of the printed page, into which we sought to squeeze as much information as possible.

Should anyone raise the issue of “governance” with you in regard to the Bluebook®, they are asking for a seat at the citation rule table for themselves, not for you. My preference is to turn the table over in favor of modern citation mechanisms that result in access, not promises of access if you learn a secret code.

PS: I use Bleak House as a pejorative above but it is one of my favorite novels. Bear in mind that I also enjoy reading the Bluebook and the Chicago Manual of Style. 😉

Inheritance Patterns in Citation Networks Reveal Scientific Memes

Sunday, December 14th, 2014

Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)


Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

Popular Summary:

It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.
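The meme score described in the summary, a propagation term multiplied by frequency of occurrence, can be sketched on a toy corpus. The paper's actual propagation score is defined more carefully than this; the five-paper dataset and the simplified ratio below are mine, meant only to show the shape of the idea:

```python
# Simplified meme score: frequency of a term times a crude propagation
# ratio (how often a paper using the term cites a paper that also uses
# it). Invented five-paper corpus; paper -> (terms, cited papers).
papers = {
    "p1": ({"graphene"}, []),
    "p2": ({"graphene"}, ["p1"]),
    "p3": ({"graphene"}, ["p2"]),
    "p4": (set(), ["p1"]),
    "p5": ({"criticality"}, []),
}

def meme_score(term):
    users = [p for p, (terms, _) in papers.items() if term in terms]
    frequency = len(users) / len(papers)
    # Among term-using papers that cite anything, what fraction cite
    # at least one paper that also uses the term?
    citing = [p for p in users if papers[p][1]]
    if not citing:
        return 0.0
    inherited = sum(
        any(term in papers[c][0] for c in papers[p][1]) for p in citing
    )
    return frequency * inherited / len(citing)

print(meme_score("graphene"))     # 0.6: frequent and propagating
print(meme_score("criticality"))  # 0.0: never inherited via citation
```

“graphene” scores well because every paper using it cites another paper that uses it; “criticality” appears but never propagates along the citation graph, which is exactly the distinction the meme score is after.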

You definitely should grab the PDF version of this article for printing and a slow read.

From Section III Discussion:

We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.

Fair enough, but “black,” “inflation,” and “traffic flow” all appear in the top fifty memes in physics. I don’t know that I would consider any of them to be “memes.”

There is much left to discover about memes, such as: who is good at propagating them? It would not hurt if your research paper were the origin of a very popular meme.

I first saw this in a tweet by Max Fisher.

Open and transparent altmetrics for discovery

Sunday, February 9th, 2014

Open and transparent altmetrics for discovery by Peter Kraker.

From the post:

Altmetrics are a hot topic in the scientific community right now. Classic citation-based indicators such as the impact factor are amended by alternative metrics generated from online platforms. Usage statistics (downloads, readership) are often employed, but links, likes and shares on the web and in social media are considered as well. The altmetrics promise, as laid out in the excellent manifesto, is that they assess impact quicker and on a broader scale.

The main focus of altmetrics at the moment is evaluation of scientific output. Examples are the article-level metrics in PLOS journals, and the Altmetric donut. ImpactStory has a slightly different focus, as it aims to evaluate the oeuvre of an author rather than an individual paper.

This is all good and well, but in my opinion, altmetrics have a huge potential for discovery that goes beyond rankings of top papers and researchers. A potential that is largely untapped so far.

How so? To answer this question, it is helpful to shed a little light on the history of citation indices.

Peter observes that co-citation is a measure of subject similarity, without the need to use the same terminology (Science Citation Index). Peter discovered in his PhD research that co-readership is also an indicator of subject similarity.

But more research is needed on co-readership to make it into a reproducible and well understood measure.

Peter is appealing for data sets suitable for this research.

Co-readership measures subject similarity at the document level, but if it proves as useful as co-citation analysis has been, it will be well worth the effort.

Help out if you are able.

I first saw this in a tweet by Jason Priem.

Is Link Rot Destroying Stare Decisis…

Monday, December 30th, 2013

Is Link Rot Destroying Stare Decisis as We Know It? The Internet-Citation Practice of the Texas Appellate Courts by Arturo Torres (Journal of Appellate Practice and Process, Vol 13, No. 2, Fall 2012 )


In 1995 the first Internet-based citation was used in a federal court opinion. In 1996, a state appellate court followed suit; one month later, a member of the United States Supreme Court cited to the Internet; finally, in 1998 a Texas appellate court cited to the Internet in one of its opinions. In less than twenty years, it has become common to find appellate courts citing to Internet-based resources in opinions. Because the current extent of Internet-citation practice varies by court across jurisdictions, this paper will examine the Internet-citation practice of the Texas appellate courts since 1998. Specifically, this study surveys the 1998 to 2011 published opinions of the Texas appellate courts and describes their Internet-citation practice.

A study that confirms what was found in …Link and Reference Rot in Legal Citations for the Harvard Law Review and the U.S. Supreme Court.

Curious that West Key Numbers remain viable after more than a century of use (manual or electronic resolution), whereas Internet citations expire over the course of a few years.

What do you think is the difference in those citations, West Key Numbers versus URLs, that accounts for one being viable and the other only ephemerally so?

Not all citations are equal:… [Semantic Triage?]

Sunday, November 24th, 2013

Not all citations are equal: identifying key citations automatically by Daniel Lemire.

From the post:

Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.

A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.

Is that good enough? No.

Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, if you have found a given research paper, you’d really want to also get a list of all papers building on it…

Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.

Great, right? So you just pick your research paper and review the papers citing them.

If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.

Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.

The paper of the same title is due to appear in JASIST.

The abstract:

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation.

By asking authors to identify the key references in their own work, we created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this dataset using only four features.

The best features, among those we evaluated, were features based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.
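The contrast between the two indices is easy to see with invented data. For the h-index each citation counts once; for the hip-index, as the abstract describes it, each citation is weighted by how many times the citing paper mentions the reference in its body. The exact weighting in the paper may differ; this is only the shape of the idea:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
    return h

# Each inner list is one paper's citations, recorded as the number of
# times each citing paper mentions it in the body text (made-up data).
papers = [
    [1, 1, 1, 1],  # cited 4 times, each a single shallow mention
    [5, 4],        # cited twice, but discussed at length both times
    [3],           # cited once, by a paper that engages with it deeply
]

plain = [len(mentions) for mentions in papers]     # [4, 2, 1]
weighted = [sum(mentions) for mentions in papers]  # [4, 9, 3]

print(h_index(plain))     # 2: shallow citations dominate
print(h_index(weighted))  # 3: deep engagement lifts the hip-index
```

The second and third papers gain under the weighted scheme because their citations are substantive rather than shallow, which is exactly the distinction the authors are trying to reward.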

What I find intriguing is the potential for this type of research to enable a type of semantic triage when creating topic maps or other semantic resources.

If only three out of thirty citations in a paper are determined to be “influential,” why should I use scarce resources to capture the rest as completely as the influential ones?

The corollary to Daniel’s “not all citations are equal,” is that “not all content is equal.”

We already make those sort of choices when we select some citations from the larger pool of possible citations.

I’m just suggesting that we make that decision explicit when creating semantic resources.

PS: I wonder how Daniel’s approach would work with opinions rendered in legal cases. Courts often cite an entire block of prior decisions without drawing any particular rule or fact from them. It could reduce the overhead of tracking influential prior case decisions.

…Link and Reference Rot in Legal Citations

Tuesday, September 24th, 2013

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations by Jonathan Zittrain, Kendra Albert, Lawrence Lessig.


We document a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within U.S. Supreme Court opinions do not link to the originally cited information.

Given that, we propose a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Imagine trying to use a phone book where 70% of the addresses were wrong.

Or imagine looking for your property deed and learning that only 50% of the references are correct.

Do those sound like acceptable situations?

Considering the Harvard Law Review and the U.S. Supreme Court put a good deal of effort into correct citations, the fate of the rest of the web must be far worse.
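Measuring rot in your own citation list is straightforward. The responses below are simulated so the sketch is self-contained; in practice you would fetch each URL (e.g., with urllib.request) and record the outcome. Note that the study counts redirects away from the originally cited content as rot, which is the convention followed here:

```python
def rot_rate(urls, status_of):
    """Fraction of URLs that no longer resolve directly to the cited content."""
    dead = sum(1 for u in urls if status_of(u) != 200)
    return dead / len(urls)

# Simulated check results for a hypothetical bibliography.
simulated = {
    "http://example.com/alive": 200,
    "http://example.com/moved": 301,  # redirected: no longer the cited page
    "http://example.com/gone": 404,
    "http://example.com/alive2": 200,
}

rate = rot_rate(list(simulated), simulated.get)
print(f"{rate:.0%} of cited URLs are rotten")  # 50% of cited URLs are rotten
```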

The about page for Perma reports:

Any author can go to the Perma.cc website and input a URL. Perma.cc downloads the material at that URL and gives back a new URL (a “Perma.cc link”) that can then be inserted in a paper.

After the paper has been submitted to a journal, the journal staff checks that the provided link actually represents the cited material. If it does, the staff “vests” the link and it is forever preserved. Links that are not “vested” will be preserved for two years, at which point the author will have the option to renew the link for another two years.

Readers who encounter Perma.cc links can click on them like ordinary URLs. This takes them to the Perma.cc site, where they are presented with a page that has links both to the original web source (along with some information, including the date of the link’s creation) and to the archived version stored by Perma.cc.

I would caution that “forever” is a very long time.

What happens to the binding between an identifier and a URL when URLs are replaced by another network protocol?

After all the change over the history of the Internet, you don’t believe the current protocols will last “forever,” do you?

A more robust solution would divorce identifiers/citations from any particular network protocol, whether you think it will last forever or not.

That separation of identifier from network protocol preserves the possibility of an online database such as Perma.cc, but also of databases that hold local caches of citations and associated content, databases that point to multiple locations for associated content, and databases that support currently unknown protocols for accessing content associated with an identifier.

Just as a database of citations from the Codex Justinianus could point to the latest printed Latin text, to online versions, or to future versions.

Citations can become permanent identifiers if they don’t rely on a particular network addressing system.
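The separation argued for above can be sketched as a small resolver: a citation identifier maps to multiple locations, each tagged with a protocol, so the citation itself never encodes how content is fetched. The identifier, protocols, and locations below are all illustrative:

```python
# Citation identifier -> (protocol, location) pairs. The id is stable;
# the locations and protocols behind it can change freely.
REGISTRY = {
    "hlr.v127.perma-study": [
        ("https", "archive.example.org/hlr/127/perma"),
        ("local", "/var/cache/citations/hlr-127-perma.pdf"),
        ("future-protocol", "opaque-locator-tbd"),
    ],
}

def resolve(citation_id, supported_protocols):
    """Return the first location reachable with a protocol we speak."""
    for proto, loc in REGISTRY.get(citation_id, []):
        if proto in supported_protocols:
            return proto, loc
    return None

# A client that speaks only HTTPS resolves the citation today:
print(resolve("hlr.v127.perma-study", {"https"}))
# An offline client with a local cache resolves the very same identifier:
print(resolve("hlr.v127.perma-study", {"local"}))
```

When HTTPS is eventually replaced, only the registry entries change; every citation in every published document keeps working unchanged.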

Pondering Bibliographic Coupling…

Sunday, June 2nd, 2013

Pondering Bibliographic Coupling and Co-citation Analyses in the Context of Company Directorships by Tony Hirst.

From the post:

Over the last month or so, I’ve made a start reading through Mark Newman’s Networks: An Introduction, trying (though I’m not sure how successfully!) to bring an element of discipline to my otherwise osmotically acquired understanding of the techniques employed by various network analysis tools.

One distinction that made a lot of sense to me came from the domain of bibliometrics, specifically between the notions of bibliographic coupling and co-citation.


The idea of co-citation will be familiar to many – when one article cites a set of other articles, those other articles are “co-cited” by the first. When the same articles are co-cited by lots of other articles, we may have reason to believe that they are somehow related in a meaningful way.


Bibliographic coupling

Bibliographic coupling is actually an earlier notion, describing the extent to which two works are related by virtue of them both referencing the same other work.
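The two measures Tony contrasts can be computed from the same toy citation data: papers are bibliographically coupled when they cite the same works, and co-cited when the same later papers cite them both. The data below is invented:

```python
# paper -> set of works it cites (invented data)
refs = {
    "A": {"x", "y", "z"},
    "B": {"y", "z"},
    "C": {"A", "B"},
    "D": {"A", "B", "x"},
}

def coupling(p, q):
    """Bibliographic coupling strength: number of shared references."""
    return len(refs[p] & refs[q])

def cocitation(p, q):
    """Co-citation strength: number of papers citing both p and q."""
    return sum(1 for cited in refs.values() if p in cited and q in cited)

print(coupling("A", "B"))    # 2: both cite y and z
print(cocitation("A", "B"))  # 2: C and D each cite both
```

Note the direction of time: coupling is fixed the moment two papers are published, while co-citation strength keeps growing as later papers cite the pair, which is why co-citation is usually preferred as a similarity signal.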

Interesting musings about applying well known views of bibliographic graphs to graphs composed of company directorships.

Tony’s suggestion of watching for patterns in directors moving together between companies is a good one but I would broaden the net a bit.

Why not track school, club, religious affiliations, etc.? All of those form networks as well.

Want some hackathon friendly altmetrics data?…

Wednesday, December 26th, 2012

Want some hackathon friendly altmetrics data? arXiv tweets dataset now up on figshare by Euan Adie.

From the post:

The dataset contains details of approximately 57k tweets linking to arXiv papers, found between 1st January and 1st October this year. You’ll need to supplement it with data from the arXiv API if you need metadata about the preprints linked to. The dataset does contain follower counts and lat/lng pairs for users where possible, which could be interesting to plot.

Euan has some suggested research directions and more details on the data set.

Something to play with during the holiday “down time.” 😉

I first saw this in a tweet by Jason Priem.

Is Your Information System “Sticky?”

Wednesday, December 19th, 2012

In “Put This On My List…” Michael Mitzenmacher writes:

Put this on my list of papers I wish I had written: Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. I think the title is sufficiently descriptive of the content, but the idea was they created a fake researcher and posted fake papers on a real university web site to inflate citation counts for some papers. (Apparently, Google scholar is pretty “sticky”; even after the papers came down, the citation counts stayed up…)

The traditional way to boost citations is to re-arrange the order of the authors on the same paper, then re-publish it.

Gaming citation systems isn’t news, although the Google Scholar Citations paper demonstrates that it has become easier.

For me the “news” part was the “sticky” behavior of Google’s information system, retaining the citation counts even after the fake documents were removed.

Is your information system “sticky?” That is, does it store information as “static” data that isn’t dependent on other data?

If it does, you and anyone who uses your data is running the risk of using stale or even incorrect data. The potential cost of that risk depends on your industry.

For legal, medical, banking and similar industries, the potential liability argues against assuming recorded data is current and valid data.

Representing critical data as a topic with constrained (TMCL) occurrences that must be present is one way to address this problem with a topic map.

If a constrained occurrence is absent, the topic in question fails the TMCL constraint and can be reported as an error.

I suspect you could duplicate that behavior in a graph database.

When you query for a particular node (read “fact”), check to see whether all the required links are present. Not as elegant as invalidation by constraint, but it should work.
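The check described above can be sketched against a simple property-graph layout. The schema, node types, and link names here are my own invention, standing in for TMCL-style required occurrences: each node type declares the link types it must have, and a fact is flagged as suspect when one is missing.

```python
# node type -> required outgoing link types (illustrative schema)
SCHEMA = {
    "citation": {"cites", "verified_by"},
}

nodes = {
    "n1": {"type": "citation",
           "links": {"cites": "case-A", "verified_by": "src-1"}},
    "n2": {"type": "citation",
           "links": {"cites": "case-B"}},  # no supporting source recorded
}

def validate(node_id):
    """Return the required link types missing from a node, if any."""
    node = nodes[node_id]
    required = SCHEMA.get(node["type"], set())
    return required - set(node["links"])

print(validate("n1"))  # set(): all required links present
print(validate("n2"))  # {'verified_by'}: report as a possibly stale fact
```

Run at query time, this turns “sticky” static data into data that must continuously justify itself against its dependencies.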

Hunting Trolls with Neo4j!

Wednesday, October 3rd, 2012

Hunting Trolls with Neo4j! by Max De Marzi.

Max quotes from a video found by Alison Sparrow:

What we tried to do with it, is bypass any sort of keyword processing in order to find similar patents. The reason we’ve done this is to avoid the problems encountered by other systems that rely on natural language processing or semantic analysis simply because patents are built to avoid detection by similar keywords…we use network topology (specifically citation network topology) to mine the US patent database in order to predict similar documents.

The “note pad” in the demonstration would be more useful if it produced a topic map that merged results from other searchers.

Auto-magically creating associations based on data extracted from the patent database would be a nice feature as well.

Maybe I should get some sticky pads printed up with the logo: “You could be using a topic map!” 😉

(Let me know how many sticky pads you would like and I will get a quote for them.)

Discussion of scholarly information in research blogs

Sunday, June 3rd, 2012

Discussion of scholarly information in research blogs by Hadas Shema.

From the post:

As some of you know, Mike Thelwall, Judit Bar-Ilan (both are my dissertation advisors) and I published an article called “Research Blogs and the Discussion of Scholarly Information” in PLoS One. Many people showed interest in the article, and I thought I’d write a “director’s commentary” post. Naturally, I’m saving all your tweets and blog posts for later research.

The Sample

We characterized 126 blogs with 135 authors from Researchblogging.Org (RB), an aggregator of blog posts dealing with peer-review research. Two over-achievers had two blogs each, and 11 blogs had two authors.

While our interest in research blogs started before we ever heard of RB, it was reading an article using RB that really kick-started the project. Groth & Gurney (2010) wrote an article titled “Studying scientific discourse on the Web using bibliometrics: A chemistry blogging case study.” The article made for a fascinating read, because it applied bibliometric methods to blogs. Just like it says in the title, Groth & Gurney took the references from 295 blog posts about Chemistry and analyzed them the way one would analyze citations from peer-reviewed articles. They managed that because they used RB, which aggregates only posts by bloggers who take the time to formally cite their sources. Major drooling ensued at that point. People citing in a scholarly manner out of their free will? It’s Christmas!

Questions that stand out for me on blogs:

Will our indexing/searching of blogs have the same all-or-nothing granularity as that of scholarly articles?

If not, why not?