Archive for the ‘Citation Practices’ Category

Bluebook® vs. Baby Blue’s (Or, Bleak House “Lite”)

Friday, February 19th, 2016

The suspense over what objections The Bluebook® A Uniform System of Citation® could have to the publication of Baby Blue’s Manual of Legal Citation ended with a whimper and not a bang on the publication of Baby Blue’s.

You may recall I have written in favor of Baby Blue’s, sight unseen, in Bloggers! Help Defend The Public Domain – Prepare To Host/Repost “Baby Blue” and in Oxford Legal Citations Free, What About BlueBook?

Of course, then Baby Blue’s Manual of Legal Citation was published.

I firmly remain of the opinion that legal citations are in the public domain. Moreover, the use of legal citations is the goal of any citation originator so assertion of copyright on the same would be self-defeating, if not insane.

Having said that, Baby Blue’s Manual of Legal Citation is more of a Bleak House “Lite” than a useful re-imagining of legal citation in a modern context.

I don’t expect you to take my word for that judgment so I have prepared mappings from Bluebook® to Baby Blue’s and Baby Blue’s to Bluebook®.

Caveat 1: Baby Blue’s is still subject to revision and may tinker with its table numbering to further demonstrate its “originality” for example, so consider these mappings as provisional and subject to change.

Caveat 2: The mappings are pointers to equivalent subject matter and not strictly equivalent content.

How closely the content of these two publications tracks each other is best resolved by automated comparison of the two.

As general assistance, pages 68-191 (out of 198) of Baby Blue’s are in substantial accordance with pages 233-305 and 491-523 of the Bluebook®. Foreign citations, covered by pages 307-490 in the Bluebook®, merit a scant two pages, 192-193, in Baby Blue’s.

The substantive content of Baby Blue’s doesn’t begin until page 10 and continues to page 67, with tables beginning on page 68. In terms of non-table content, there are only 57 pages of material for comparison to the Bluebook®. As you can see from the mappings, the ordering of rules has been altered from the Bluebook®, no doubt as a showing of “originality.”

The public does need greater access to primary legal resources, but treating the ability to cite Tucker and Clephane (District of Columbia, 1892-1893) [Baby Blue’s page 89] on a par with Federal Reporter [Baby Blue’s page 67] is not a step in that direction.

PS: To explore the issues and possibilities at hand, you will need a copy of The Bluebook® A Uniform System of Citation®.

Some starter questions:

  1. What assumptions underlie the rules reported in the Bluebook®?
  2. How would you measure the impact of changing the rules it reports?
  3. What technologies drove its form and organization?
  4. What modern technologies could alter its form and organization?
  5. How can modern technologies change the display of content that used its citations?

A more specific question could be: Do we need 123 pages of abbreviations (Baby Blue’s) or 113 pages of abbreviations (Bluebook®) when software can display expanded abbreviations to any user, even if the text was originally written with an abbreviation?

Abbreviations are both a means of restricting access/understanding and partially a limitation of the printed page, into which we sought to squeeze as much information as possible.
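
As a rough illustration of software doing the expanding, here is a minimal Python sketch. The four-entry table is only a small sample of well-known reporter abbreviations, and the function name and the bracketed display format are assumptions made for this example, not anything prescribed by either manual.

```python
import re

# Small illustrative sample; a real table would hold the full set of
# abbreviations from whichever citation manual is in use.
ABBREVIATIONS = {
    "F.": "Federal Reporter",
    "F.2d": "Federal Reporter, Second Series",
    "U.S.": "United States Reports",
    "S. Ct.": "Supreme Court Reporter",
}


def expand_citation(text, table=ABBREVIATIONS):
    """Append the expanded form in brackets after each known abbreviation."""
    # Longer abbreviations come first so "F.2d" is not matched as "F.".
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: f"{m.group(1)} [{table[m.group(1)]}]", text)


print(expand_citation("410 U.S. 113"))  # 410 U.S. [United States Reports] 113
```

The citation as written stays untouched in the source text; the expansion is a display decision that could be switched on or off per reader.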

Should anyone raise the issue of “governance” with you in regard to the Bluebook®, they are asking for a seat at the citation rule table for themselves, not you. My preference is to turn the table over in favor of modern mechanisms for citations that result in access, not promises of access if you learn a secret code.

PS: I use Bleak House as a pejorative above but it is one of my favorite novels. Bear in mind that I also enjoy reading the Bluebook and the Chicago Manual of Style. 😉

What Should Remain After Retraction?

Thursday, April 30th, 2015

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this?

Here is what you get instead of the front page:


A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then current scientific context. Prior to retraction such an article is part of the overall scientific context in a field. Editing that context post-publication is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places authors who cited the retracted publication in an untenable position. Their citations of a retracted work are suspect with no opportunity to defend their citations.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

Rich Citations: Open Data about the Network of Research

Thursday, October 23rd, 2014

Rich Citations: Open Data about the Network of Research by Adam Becker.

From the post:

Why are citations just binary links? There’s a huge difference between the article you cite once in the introduction alongside 15 others, and the data set that you cite eight times in the methods and results sections, and once more in the conclusions for good measure. Yet both appear in the list of references with a single chunk of undifferentiated plain text, and they’re indistinguishable in citation databases — databases that are nearly all behind paywalls. So literature searches are needlessly difficult, and maps of that literature are incomplete.

To address this problem, we need a better form of academic reference. We need citations that carry detailed information about the citing paper, the cited object, and the relationship between the two. And these citations need to be in a format that both humans and computers can read, available under an open license for anyone to use.

This is exactly what we’ve done here at PLOS. We’ve developed an enriched format for citations, called, appropriately enough, rich citations. Rich citations carry a host of information about the citing and cited entities (A and B, respectively), including:

  • Bibliographic information about A and B, including the full list of authors, titles, dates of publication, journal and publisher information, and unique identifiers (e.g. DOIs) for both;
  • The sections and locations in A where a citation to B appears;
  • The license under which B appears;
  • The CrossMark status of B (updated, retracted, etc);
  • How many times B is cited within A, and the context in which it is cited;
  • Whether A and B share any authors (self-citation);
  • Any additional works cited by A at the same location as B (i.e. citation groupings);
  • The data types of A and B (e.g. journal article, book, code, etc.).

As a demonstration of the power of this new citation format, we’ve built a new overlay for PLOS papers, which displays much more information about the references in our papers, and also makes it easier to navigate and search through them. Try it yourself here:

The suite of open-source tools we’ve built make it easy to extract and display rich citations for any PLOS paper. The rich citation API is available now for interested developers at

If you look at one of the test articles such as: Jealousy in Dogs, the potential of rich citations becomes immediately obvious.

Perhaps I was reading “… the relationship between the two…” a bit too much like an association between two topics. It’s great to know how many times a particular cite occurs in a paper, when it is a self-citation, etc., but that is a long way from attaching properties to an association between two papers.
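
To make the distinction concrete, here is a minimal Python sketch of a rich citation record. The class and field names are hypothetical, chosen to mirror the list quoted above, and are not the PLOS API schema. The count and the self-citation flag are computed from the two papers; the `association` dictionary is an assumed placeholder for properties attached to the relationship itself.

```python
from dataclasses import dataclass, field


@dataclass
class Work:
    doi: str
    title: str
    authors: list
    data_type: str = "journal article"


@dataclass
class RichCitation:
    citing: Work  # paper A
    cited: Work   # object B
    # One entry per in-text mention: (section name, DOIs cited at the same spot).
    mentions: list = field(default_factory=list)
    license: str = ""
    crossmark_status: str = ""
    # Hypothetical slot for properties of the A-to-B relationship itself.
    association: dict = field(default_factory=dict)

    @property
    def times_cited(self):
        return len(self.mentions)

    @property
    def is_self_citation(self):
        return bool(set(self.citing.authors) & set(self.cited.authors))


# Made-up works and DOIs, for illustration only.
a = Work("10.0000/a", "Paper A", ["Ann Lee", "Bo Chen"])
b = Work("10.0000/b", "Dataset B", ["Bo Chen"], data_type="dataset")
rc = RichCitation(a, b, mentions=[("Methods", []), ("Results", ["10.0000/c"])])
print(rc.times_cited, rc.is_self_citation)  # 2 True
```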

On the up side, however, PLOS already has 10,000 papers with “smart cites” with more on the way.

A project to watch!

Data Citation Implementation Group

Thursday, January 23rd, 2014

Data Citation Implementation Group

I try to capture “new” citation groups as they arise, mostly so if I encounter the need to integrate two or more “new” citations I will know where to start.

I thought you might be amused at this latest addition to the seething welter of citation groups:

You must be a member of this group to view and participate in it. Membership is by invitation only.

This group is invite only, so you may not apply for membership.

So, not only will we have:

Traditional citations

New web?-based citations

but also:

Unknown citations.


Data integration, like grave digging, is an occupation with a lot of job security.

Is Link Rot Destroying Stare Decisis…

Monday, December 30th, 2013

Is Link Rot Destroying Stare Decisis as We Know It? The Internet-Citation Practice of the Texas Appellate Courts by Arturo Torres (Journal of Appellate Practice and Process, Vol. 13, No. 2, Fall 2012)


In 1995 the first Internet-based citation was used in a federal court opinion. In 1996, a state appellate court followed suit; one month later, a member of the United States Supreme Court cited to the Internet; finally, in 1998 a Texas appellate court cited to the Internet in one of its opinions. In less than twenty years, it has become common to find appellate courts citing to Internet-based resources in opinions. Because the current extent of Internet-citation practice varies by courts across jurisdictions, this paper will examine the Internet-citation practice of the Texas Appellate courts since 1998. Specifically, this study surveys the 1998 to 2011 published opinions of the Texas appellate courts and describes their Internet-citation practice.

A study that confirms what was found in …Link and Reference Rot in Legal Citations for the Harvard Law Review and the U.S. Supreme Court.

Curious that West Key Numbers remain viable after more than a century of use (manual or electronic resolution) whereas Internet citations expire over the course of a few years.

What do you think is the difference in those citations, West Key Numbers versus URLs, that accounts for one being viable and the other only ephemerally so?

Not all citations are equal:… [Semantic Triage?]

Sunday, November 24th, 2013

Not all citations are equal: identifying key citations automatically by Daniel Lemire.

From the post:

Suppose that you are researching a given issue. Maybe you have a medical condition or you are looking for the best algorithm to solve your current problem.

A good heuristic is to enter reasonable keywords in Google Scholar. This will return a list of related research papers. If you are lucky, you may even have access to the full text of these research papers.

Is that good enough? No.

Scholarship, on the whole, tends to improve with time. More recent papers incorporate the best ideas from past work and correct mistakes. So, if you have found a given research paper, you’d really want to also get a list of all papers building on it…

Thankfully, a tool like Google Scholar allows you to quickly access a list of papers citing a given paper.

Great, right? So you just pick your research paper and review the papers citing them.

If you have ever done this work, you know that most of your effort will be wasted. Why? Because most citations are shallow. Almost none of the citing papers will build on the paper you picked. In fact, many researchers barely even read the papers that they cite.

Ideally, you’d want Google Scholar to automatically tell apart the shallow citations from the real ones.

The paper of the same title is due to appear in JASIST.

The abstract:

The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation.

By asking authors to identify the key references in their own work, we created a dataset in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this dataset using only four features.

The best features, among those we evaluated, were features based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

What I find intriguing is the potential for this type of research to enable a type of semantic triage when creating topic maps or other semantic resources.

If only three out of thirty citations in a paper are determined to be “influential,” why should I use scarce resources to capture the other twenty-seven as completely as the influential three?

The corollary to Daniel’s “not all citations are equal,” is that “not all content is equal.”

We already make those sorts of choices when we select some citations from the larger pool of possible citations.

I’m just suggesting that we make that decision explicit when creating semantic resources.
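
A minimal Python sketch of making that decision explicit follows. It leans on the one signal the abstract singles out, the number of times a reference is mentioned in the body of the citing paper. The fixed threshold, the function name and the sample reference keys are all assumptions for illustration; the paper itself reports a learned model with four features, not a simple cutoff.

```python
from collections import Counter


def triage(mentions, min_mentions=2):
    """Split references into (influential, shallow) by in-body mention count.

    mentions: iterable of reference keys, one entry per in-text mention.
    min_mentions: arbitrary cutoff, assumed for this sketch.
    """
    counts = Counter(mentions)
    influential = sorted(
        (ref for ref, n in counts.items() if n >= min_mentions),
        key=lambda ref: (-counts[ref], ref),
    )
    shallow = sorted(ref for ref, n in counts.items() if n < min_mentions)
    return influential, shallow


# Made-up reference keys in the order they are mentioned in a paper.
body_mentions = [
    "smith2010", "jones2008", "smith2010", "lee2012",
    "smith2010", "lee2012", "kim2001",
]
influential, shallow = triage(body_mentions)
print(influential)  # ['smith2010', 'lee2012']
print(shallow)      # ['jones2008', 'kim2001']
```

The influential list would get full treatment in a topic map; the shallow list could be recorded as bare identifiers and revisited only if needed.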

PS: I wonder how Daniel’s approach would work with opinions rendered in legal cases. Courts often cite an entire block of prior decisions but no particular rule or fact from any of them. It could reduce the overhead of tracking influential prior case decisions.

Current RFCs and Their Citations

Sunday, November 17th, 2013

Current RFCs and Their Citations

A resource I created to give authors and editors a cut-n-paste way to use correct citations to current RFCs.

I won’t spread bad data by repeating some of the more imaginative citations of RFCs that I have seen.

Being careless about citations has the same impact as being careless about URLs. The end result is at best added work for your reader and at worst, no communication at all.

I will be updating this resource on a weekly basis but remember the canonical source of information on RFCs is the RFC-Editor’s page.

From a topic map perspective, the URLs you see in this resource are subject locators for the subjects, which are the RFCs.

…Link and Reference Rot in Legal Citations

Tuesday, September 24th, 2013

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations by Jonathan Zittrain, Kendra Albert, Lawrence Lessig.


We document a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within U.S. Supreme Court opinions do not link to the originally cited information.

Given that, we propose a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Imagine trying to use a phone book where 70% of the addresses were wrong.

Or you are looking for your property deed and learn that only 50% of the references are correct.

Do those sound like acceptable situations?

Considering the Harvard Law Review and the U.S. Supreme Court put a good deal of effort into correct citations, the fate of the rest of the web must be far worse.

The about page for Perma reports:

Any author can go to the Perma website and input a URL. Perma downloads the material at that URL and gives back a new URL (a “Perma link”) that can then be inserted in a paper.

After the paper has been submitted to a journal, the journal staff checks that the provided link actually represents the cited material. If it does, the staff “vests” the link and it is forever preserved. Links that are not “vested” will be preserved for two years, at which point the author will have the option to renew the link for another two years.

Readers who encounter Perma links can click on them like ordinary URLs. This takes them to the Perma site where they are presented with a page that has links both to the original web source (along with some information, including the date of the Perma link’s creation) and to the archived version stored by Perma.

I would caution that “forever” is a very long time.

What happens to the binding between an identifier and a URL when URLs are replaced by another network protocol?

After all the change over the history of the Internet, you don’t believe the current protocols will last “forever,” do you?

A more robust solution would divorce identifiers/citations from any particular network protocol, whether you think it will last forever or not.

That separation of identifier from network protocol preserves the possibility of an online database such as Perma, but also databases that have local caches of the citations and associated content, databases that point to multiple locations for associated content, and databases that support currently unknown protocols to access content associated with an identifier.

Just as a database of citations from Codex Justinianus could point to the latest printed Latin text, online versions or future versions.

Citations can become permanent identifiers if they don’t rely on a particular network addressing system.
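
A minimal Python sketch of that separation: the citation is a protocol-neutral string, and locations are a replaceable list kept beside it. The class names, the `resolve` function, the scheme labels and the sample addresses are all hypothetical, assumed only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Locator:
    scheme: str   # "https", "cache", or a protocol that does not exist yet
    address: str


@dataclass
class CitationRecord:
    identifier: str  # the citation itself, independent of any protocol
    locators: list = field(default_factory=list)


def resolve(record, fetchers):
    """Return content from the first locator that a known fetcher can serve."""
    for loc in record.locators:
        fetch = fetchers.get(loc.scheme)
        if fetch is None:
            continue  # no handler for this scheme today; try the next locator
        content = fetch(loc.address)
        if content is not None:
            return content
    return None


# Stand-in fetchers: the web copy has rotted, the local cache still answers.
local_cache = {"cj-1.1.1": "archived text"}
fetchers = {"https": lambda address: None, "cache": local_cache.get}

record = CitationRecord(
    "Codex Justinianus 1.1.1",
    [Locator("https", "https://example.org/cj/1.1.1"), Locator("cache", "cj-1.1.1")],
)
print(resolve(record, fetchers))  # archived text
```

Supporting a new protocol means adding one entry to `fetchers` and one `Locator` to the record; the identifier never changes.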

Citing data (without tearing your hair out)

Saturday, August 24th, 2013

Citing data (without tearing your hair out) by Bonnie Swoger

From the post:

The changing nature of how and where scientists share raw data has sparked a growing need for guidelines on how to cite these increasingly available datasets.

Scientists are producing more data than ever before due to the (relative) ease of collecting and storing this data. Often, scientists are collecting more than they can analyze. Instead of allowing this un-analyzed data to die when the hard drive crashes, they are releasing the data in its raw form as a dataset. As a result, datasets are increasingly available as separate, stand-alone packages. In the past, any data available for other scientists to use would have been associated with some other kind of publication – printed as table in a journal article, included as an image in a book, etc. – and cited as such.

Now that we can find datasets “living on their own,” scientists need to be able to cite these sources.

Unfortunately, the traditional citation manuals do a poor job of helping a scientist figure out what elements to include in the reference list, either ignoring data or over-complicating things.

If you are building a topic map that relies upon data sets you didn’t create, get ready to cite data sets.

Citations, assuming they are correct, can give your users confidence in the data you present.

Bonnie does a good job providing basic rules that you should follow when citing data.

You can always do more than she suggests but you should never do any less.

Bad Data Report

Saturday, May 4th, 2013

The accuracy of references in PhD theses: a case study by Fereydoon Azadeh and Reyhaneh Vaez.



Inaccurate references and citations cause confusion, distrust in the accuracy of a report, waste of time and unnecessary financial charges for libraries, information centres and researchers.


The aim of the study was to establish the accuracy of article references in PhD theses from the Tehran and Tabriz Universities of Medical Sciences and their compliance with the Vancouver style.


We analysed 357 article references in the Tehran and 347 in the Tabriz. Six bibliographic elements were assessed: authors’ names, article title, journal title, publication year, volume and page range. Referencing errors were divided into major and minor.


Sixty two percent of references in the Tehran and 53% of those in the Tabriz were erroneous. In total, 164 references in the Tehran and 136 in the Tabriz were complete without error. Of 357 reference articles in the Tehran, 34 (9.8%) were in complete accordance with the Vancouver style, compared with none in the Tabriz. Accuracy of referencing did not differ significantly between the two groups, but compliance with the Vancouver style was significantly better in the Tehran.


The accuracy of referencing was not satisfactory in both groups, and students need to gain adequate instruction in appropriate referencing methods.

Now that’s bad data!

I have noticed errors in CS paper citations, but not at rates as high as those reported here.

The ACM Digital Library could report for a given paper or conference the number of unknown citations, with a list, for checking.

Is Your Information System “Sticky?”

Wednesday, December 19th, 2012

In “Put This On My List…” Michael Mitzenmacher writes:

Put this on my list of papers I wish I had written: Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. I think the title is sufficiently descriptive of the content, but the idea was they created a fake researcher and posted fake papers on a real university web site to inflate citation counts for some papers. (Apparently, Google scholar is pretty “sticky”; even after the papers came down, the citation counts stayed up…)

The traditional way to boost citations is to re-arrange the order of the authors on the same paper, then re-publish it.

Gaming citation systems isn’t news, although the Google Scholar Citations paper demonstrates that it has become easier.

For me the “news” part was the “sticky” behavior of Google’s information system, retaining the citation counts even after the fake documents were removed.

Is your information system “sticky?” That is, does it store information as “static” data that isn’t dependent on other data?

If it does, you and anyone who uses your data are running the risk of using stale or even incorrect data. The potential cost of that risk depends on your industry.

For legal, medical, banking and similar industries, the potential liability argues against assuming recorded data is current and valid data.

Representing critical data as a topic with constrained (TMCL) occurrences that must be present is one way to address this problem with a topic map.

If a constrained occurrence is absent, the topic in question fails the TMCL constraint and so can be reported as an error.

I suspect you could duplicate that behavior in a graph database.

When you query for a particular node (read “fact”), check to see if all the required links are present. Not as elegant as invalidation by constraint but should work.
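
The query-time check can be sketched in a few lines of Python. This is a plain dictionary standing in for a graph database, not any particular product’s API, and the node names, link type and `REQUIRED_LINKS` table are assumptions made for the example.

```python
# Hypothetical rule: a citation count is only valid while it is still
# linked to at least one existing citing document.
REQUIRED_LINKS = {"citation_count": {"counted_from"}}


def missing_links(graph, node, node_type, required=REQUIRED_LINKS):
    """Return the required link types that `node` lacks.

    graph: dict mapping node -> list of (link_type, target) pairs.
    A link only counts if its target is still present in the graph.
    """
    present = {
        link_type
        for link_type, target in graph.get(node, [])
        if target in graph
    }
    return sorted(required.get(node_type, set()) - present)


graph = {
    "count:paper1": [("counted_from", "doc:fake1")],
    "doc:fake1": [],
}
before = missing_links(graph, "count:paper1", "citation_count")

del graph["doc:fake1"]  # the fake paper comes down
after = missing_links(graph, "count:paper1", "citation_count")

print(before)  # []
print(after)   # ['counted_from']
```

A non-empty result is the signal to treat the “fact” as stale rather than report it.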

For Attribution… [If One Identifier/URL isn’t enough]

Tuesday, November 27th, 2012

For Attribution — Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop by Paul F. Uhlir.

From the preface:

The growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. The growth in online datasets presents related, yet more complex challenges. It depends upon the ability to reliably identify, locate, access, interpret and verify the version, integrity, and provenance of digital datasets.

Data citation standards and good practices can form the basis for increased incentives, recognition, and rewards for scientific data activities that in many cases are currently lacking in many fields of research. The rapidly-expanding universe of online digital data holds the promise of allowing peer-examination and review of conclusions or analysis based on experimental or observational data, the integration of data into new forms of scholarly publishing, and the ability for subsequent users to make new and unforeseen uses and analyses of the same data – either in isolation, or in combination with other datasets.

The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. As funding sources for scientific research have begun to require data management plans as part of their selection and approval processes, it is important that the necessary standards, incentives, and conventions to support data citation, preservation, and accessibility be put into place.

Of particular interest are the four questions that shaped this workshop:

1. What is the status of data attribution and citation practices in the natural and social (economic and political) sciences in United States and internationally?

2. Why is the attribution and citation of scientific data important and for what types of data? Is there substantial variation among disciplines?

3. What are the major scientific, technical, institutional, economic, legal, and socio-cultural issues that need to be considered in developing and implementing scientific data citation standards and practices? Which ones are universal for all types of research and which ones are field or context specific?

4. What are some of the options for the successful development and implementation of scientific data citation practices and standards, both across the natural and social sciences and in major contexts of research?

The workshop did not presume a solution (is that a URL in your pocket?) but explores the complex nature of attribution and citation.

Michael Sperberg-McQueen remarks:

Longevity: Finally, there is the question of longevity. It is well known that the half-life of citations is much higher in humanities than in the natural sciences. We have been cultivating a culture of citation of referencing for about 2,000 years in the West since the Alexandrian era. Our current citation practice may be 400 years old. The http scheme, by comparison, is about 19 years old. It is a long reach to assume, as some do, that http URLs are an adequate mechanism for all citations of digital (and non-digital!) objects. It is not unreasonable for scholars to be skeptical of the use of URLs to cite data of any long-term significance, even if they are interested in citing the data resources they use. [pp. 63-64]

What I find the most attractive about topic maps is you can have:

  • A single URL as a citation/identifier.
  • Multiple URLs as citations/identifiers (for the same data resource).
  • Multiple URLs and/or other forms of citations/identifiers as they develop(ed) over time for the same data resource.

Why the concept of multiple citations/identifiers (quite common in biblical studies) for a single resource is so difficult I cannot explain.
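
For what it is worth, the bookkeeping is not hard. Here is a minimal Python sketch, assuming each record is simply a set of identifier strings and that any shared identifier means the same resource. The function name and the sample identifiers are made up for illustration.

```python
def merge_by_identifier(records):
    """Merge identifier sets that share at least one identifier."""
    merged = []
    for ids in records:
        ids = set(ids)
        overlapping = [group for group in merged if group & ids]
        for group in overlapping:
            ids |= group          # absorb every group that shares an identifier
            merged.remove(group)
        merged.append(ids)
    return merged


# Made-up identifiers: two resources, each known under several names.
records = [
    {"Gen 1:1", "urn:example:a"},
    {"urn:example:b"},
    {"urn:example:a", "https://example.org/x"},
    {"urn:example:b", "doi:10.0000/hypothetical"},
]
resources = merge_by_identifier(records)
print(len(resources))  # 2
```

Each resulting set holds every citation/identifier known for one resource, old and new, URL or not.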