Archive for the ‘Citation Indexing’ Category

Network Measures of the United States Code

Saturday, March 5th, 2016

Network Measures of the United States Code by Alexander Lyte, Dr. David Slater, Shaun Michel.


The U.S. Code represents the codification of the laws of the United States. While it is a well-organized and curated corpus of documents, the legal text remains nearly impenetrable for non-lawyers. In this paper, we treat the U.S. Code as a citation network and explore its complexity using traditional network metrics. We find interesting topical patterns emerge from the citation structure and begin to interpret network metrics in the context of the legal corpus. This approach has potential for determining policy dependency and robustness, as well as modeling of future policies.
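The paper's basic move, treating cross-references between sections as directed edges and computing traditional network metrics over them, can be sketched in a few lines. The section labels and citation edges below are invented for illustration; they are not actual U.S. Code cross-references.

```python
from collections import Counter

# Toy citation edges among U.S. Code sections, as (citing, cited) pairs.
# Section labels here are invented, not real cross-references.
edges = [
    ("26 USC 1", "26 USC 61"),
    ("26 USC 61", "26 USC 63"),
    ("42 USC 1983", "28 USC 1331"),
    ("26 USC 63", "26 USC 61"),
]

# In-degree: how often each section is cited by other sections.
in_degree = Counter(cited for _, cited in edges)
# Out-degree: how many other sections each section depends on.
out_degree = Counter(citing for citing, _ in edges)

most_cited, count = in_degree.most_common(1)[0]
print(most_cited, count)  # -> 26 USC 61 2
```

On a real extraction of the Code you would hand the edge list to a graph library such as networkx, which supplies PageRank, betweenness, and the other centrality measures this kind of paper typically reports.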

The citation network is quite impressive:


I have inquired about an interactive version of the network but have received no response as of yet.

Is Your Information System “Sticky?”

Wednesday, December 19th, 2012

In “Put This On My List…” Michael Mitzenmacher writes:

Put this on my list of papers I wish I had written: Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. I think the title is sufficiently descriptive of the content, but the idea was they created a fake researcher and posted fake papers on a real university web site to inflate citation counts for some papers. (Apparently, Google scholar is pretty “sticky”; even after the papers came down, the citation counts stayed up…)

The traditional way to boost citations is to re-arrange the order of the authors on the same paper, then re-publish it.

Gaming citation systems isn’t news, although the Google Scholar Citations paper demonstrates that it has become easier.

For me the “news” part was the “sticky” behavior of Google’s information system, retaining the citation counts even after the fake documents were removed.

Is your information system “sticky?” That is, does it store information as “static” data that isn’t dependent on other data?

If it does, you and anyone who uses your data are running the risk of using stale or even incorrect data. The potential cost of that risk depends on your industry.

For legal, medical, banking and similar industries, the potential liability argues against assuming recorded data is current and valid data.

Representing critical data as a topic with constrained (TMCL) occurrences that must be present is one way to address this problem with a topic map.

If a constrained occurrence is absent, the topic in question fails the TMCL constraint and so can be reported as an error.

I suspect you could duplicate that behavior in a graph database.

When you query for a particular node (read “fact”), check to see if all the required links are present. Not as elegant as invalidation by constraint, but it should work.
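A minimal sketch of that check, using plain Python dicts in place of a real graph database. The node kinds and required link names are invented for illustration; the point is only that a fact is invalid the moment a required link is missing.

```python
# "Invalidate stale facts by required links": a node (fact) of a given
# kind must carry every relationship type listed as required for it.
# Node kinds and link names below are invented for illustration.
REQUIRED_LINKS = {
    "citation_count": {"source_document", "last_verified"},
}

def is_valid(node):
    """A fact is valid only if all its required link types are present."""
    required = REQUIRED_LINKS.get(node["kind"], set())
    return required <= set(node["links"])

fresh = {"kind": "citation_count",
         "links": {"source_document": "paper-42", "last_verified": "2012-12-01"}}
stale = {"kind": "citation_count",
         "links": {"last_verified": "2012-12-01"}}  # source document removed

print(is_valid(fresh), is_valid(stale))  # -> True False
```

In Google Scholar's case, the "source_document" link to the fake paper was gone but the citation count fact survived; this check would have flagged it as stale on the next query.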

Hunting Trolls with Neo4j!

Wednesday, October 3rd, 2012

Hunting Trolls with Neo4j! by Max De Marzi.

Max quotes from a video found by Alison Sparrow:

What we tried to do with it, is bypass any sort of keyword processing in order to find similar patents. The reason we’ve done this is to avoid the problems encountered by other systems that rely on natural language processing or semantic analysis simply because patents are built to avoid detection by similar keywords…we use network topology (specifically citation network topology) to mine the US patent database in order to predict similar documents.

The “note pad” in the demonstration would be more useful if it produced a topic map that merged results from other searchers.

Auto-magically creating associations based on data extracted from the patent database would be a nice feature as well.
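The citation-topology approach described in the quote can be sketched with bibliographic coupling: two patents are candidates for similarity when they cite overlapping sets of prior patents, no keywords required. The patent identifiers below are invented for illustration, and this is only one of several topology measures such a system might use.

```python
# Bibliographic coupling: similarity from overlapping citation sets.
# Patent identifiers are invented for illustration.
citations = {
    "P1": {"A", "B", "C"},
    "P2": {"B", "C", "D"},
    "P3": {"X", "Y"},
}

def jaccard(a, b):
    """Overlap of two citation sets, 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b)

sim_12 = jaccard(citations["P1"], citations["P2"])  # 2 shared of 4 total
sim_13 = jaccard(citations["P1"], citations["P3"])  # no overlap
print(sim_12, sim_13)  # -> 0.5 0.0
```

Co-citation (two patents cited together by later patents) is the mirror-image measure and works the same way on the transposed edge list.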

Maybe I should get some sticky pads printed up with the logo: “You could be using a topic map!” 😉

(Let me know how many sticky pads you would like and I will get a quote for them.)

Discussion of scholarly information in research blogs

Sunday, June 3rd, 2012

Discussion of scholarly information in research blogs by Hadas Shema.

From the post:

As some of you know, Mike Thelwall, Judit Bar-Ilan (both are my dissertation advisors) and myself published an article called “Research Blogs and the Discussion of Scholarly Information” in PLoS One. Many people showed interest in the article, and I thought I’d write a “director’s commentary” post. Naturally, I’m saving all your tweets and blog posts for later research.

The Sample

We characterized 126 blogs with 135 authors from Researchblogging.Org (RB), an aggregator of blog posts dealing with peer-review research. Two over-achievers had two blogs each, and 11 blogs had two authors.

While our interest in research blogs started before we ever heard of RB, it was reading an article using RB that really kick-started the project. Groth & Gurney (2010) wrote an article titled “Studying scientific discourse on the Web using bibliometrics: A chemistry blogging case study.” The article made for a fascinating read, because it applied bibliometric methods to blogs. Just like it says in the title, Groth & Gurney took the references from 295 blog posts about Chemistry and analyzed them the way one would analyze citations from peer-reviewed articles. They managed that because they used RB, which aggregates only posts by bloggers who take the time to formally cite their sources. Major drooling ensued at that point. People citing in a scholarly manner out of their free will? It’s Christmas!

Questions that stand out for me on blogs:

Will our indexing/searching of blogs have the same all-or-nothing granularity as scholarly articles?

If not, why not?

From counting citations to measuring usage (help needed!)

Tuesday, March 20th, 2012

From counting citations to measuring usage (help needed!)

Daniel Lemire writes:

We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality — people like Einstein tend to publish a lot — it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about two dozens papers in Scopus. Even if you don’t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.

A better measure is the number of citations a researcher has received. Google Scholar profiles display the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven’t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected.

But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant: for example it can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much following papers build on your results. Why isn’t it done?
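One crude proxy for the "usage" Daniel is after, and a heuristic I am sketching here rather than anything from his post, is how often the citing paper actually mentions the reference in its body: a paper that builds on a result tends to invoke it repeatedly, while a shallow citation appears once in the related-work section.

```python
# Crude "usage" proxy: count how many times the citing paper's body
# actually mentions the reference marker. The texts are invented.
def mention_count(body_text, ref_label):
    """Number of times ref_label appears in the citing paper's text."""
    return body_text.count(ref_label)

shallow = "Prior work [12] exists. Our method differs."
deep = "We extend [12]. Building on [12]'s lemma, the [12] bounds become tighter."
print(mention_count(shallow, "[12]"), mention_count(deep, "[12]"))  # -> 1 3
```

A real classifier would combine this with where the mentions occur (introduction vs. methods) and the surrounding language, which is where the machine learning Daniel mentions comes in.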

Daniel wants to distinguish important papers that cite his papers from ho-hum papers that cite him. (my characterization, not his)

That isn’t happening now so Daniel has teamed up with Peter Turney and Andre Vellino to gather data from published authors (that would be you), to use in investigating this problem.

Topic maps of scholarly and other work face the same problem. How do you distinguish the important from the less so? For that matter, what criteria do you use? If an author who cites you wins the Nobel Prize for work that doesn’t cite you, does the importance of your paper go up? Stay the same? Go down? 😉

It is an important issue so if you are a published author, see Daniel’s post and contribute to the data gathering.

Clickstream Data Yields High-Resolution Maps of Science

Monday, January 2nd, 2012

Clickstream Data Yields High-Resolution Maps of Science Citation: Bollen J, Van de Sompel H, Hagberg A, Bettencourt L, Chute R, et al. (2009) Clickstream Data Yields High-Resolution Maps of Science. PLoS ONE 4(3): e4803. doi:10.1371/journal.pone.0004803.

A bit dated but interesting nonetheless:



Intricate maps of science have been created from citation data to visualize the structure of scientific activity. However, most scientific publications are now accessed online. Scholarly web portals record detailed log data at a scale that exceeds the number of all existing citations combined. Such log data is recorded immediately upon publication and keeps track of the sequences of user requests (clickstreams) that are issued by a variety of users across many different domains. Given these advantages of log datasets over citation data, we investigate whether they can produce high-resolution, more current maps of science.


Over the course of 2007 and 2008, we collected nearly 1 billion user interactions recorded by the scholarly web portals of some of the most significant publishers, aggregators and institutional consortia. The resulting reference data set covers a significant part of world-wide use of scholarly web portals in 2006, and provides a balanced coverage of the humanities, social sciences, and natural sciences. A journal clickstream model, i.e. a first-order Markov chain, was extracted from the sequences of user interactions in the logs. The clickstream model was validated by comparing it to the Getty Research Institute’s Architecture and Art Thesaurus. The resulting model was visualized as a journal network that outlines the relationships between various scientific domains and clarifies the connection of the social sciences and humanities to the natural sciences.


Maps of science resulting from large-scale clickstream data provide a detailed, contemporary view of scientific activity and correct the underrepresentation of the social sciences and humanities that is commonly found in citation data.
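The first-order Markov chain described in the methodology can be sketched directly: from each user session, count journal-to-journal click transitions, then normalize the counts out of each source journal into probabilities. The journal names and sessions below are invented; the paper's actual model was estimated from roughly a billion such interactions.

```python
from collections import Counter, defaultdict

# Sketch of a journal clickstream model: a first-order Markov chain
# estimated from user sessions. Journals and sessions are invented.
sessions = [
    ["Nature", "Science", "Cell"],
    ["Nature", "Science", "Nature"],
    ["PLoS ONE", "Nature", "Science"],
]

# Count each consecutive journal-to-journal transition.
transitions = Counter()
for session in sessions:
    for src, dst in zip(session, session[1:]):
        transitions[(src, dst)] += 1

# P(dst | src): normalize counts over each source journal's total.
totals = defaultdict(int)
for (src, _), n in transitions.items():
    totals[src] += n
P = {(src, dst): n / totals[src] for (src, dst), n in transitions.items()}

print(P[("Nature", "Science")])  # -> 1.0 (every click out of Nature went to Science)
```

The resulting transition matrix is what gets visualized as a journal network: strong P(dst | src) values become the edges between scientific domains.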

An improvement over traditional citation analysis but it seems to be on the coarse side to me.

That is to say, users don’t request, nor do authors cite, papers as a whole. In other words, there are any number of ideas in a particular paper which may merit citation, and a user or author may be interested in only one.

Tracing the lineage of an idea should be getting easier, yet I have the uneasy feeling that it is becoming more difficult.



Citeology: Visualizing the Relationships between Research Publications

Sunday, December 25th, 2011

From the post:

Justin Matejka at Autodesk Research has recently released the sophisticated visualization “Citeology: Visualizing Paper Genealogy”. The visualization shows the 3,502 unique academic research papers that were published at CHI and UIST, two of the most renowned human-computer interaction (HCI) conferences, between the years 1982 and 2010.

All the articles are listed by year and sorted with the most cited papers in the middle, whereas the 11,699 citations that connect the articles to one another are represented by curved lines. Selecting a single paper colors the papers from the past that it referenced in blue, and the future articles which referenced it in brownish-red. The resulting graphs can be explored as a low-res interactive screen, or as a high-res, static PDF graph.

Interesting visualization but what does it mean for one paper to cite another?

I was spoiled by the granularity of legal decision indexing, at least for United States decisions, which broke cases down by issues, so that you could separate out a case being cited for a jurisdictional issue from the same case being cited on a damages issue. I realize it took a large number of very clever editors (now, I assume, assisted by computers) to create such an index, but it made the use of the vast archives of legal decisions possible.

I suppose my question is: Why does one paper cite another? To agree with some fact finding or to disagree? If either, which fact(s)? To extend, support or correct some technique? If so, which one? For example, so that I could trace papers that extend the Patricia trie as opposed to those that cite it in passing. It would certainly make research in any number of areas much easier and possibly more effective.
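Setting aside the *why* of a citation, the genealogy Citeology draws, past references in blue, future citers in red, is just transitive reachability over the citation edges. A minimal sketch, with invented paper labels:

```python
# Citation genealogy: given edges (citing -> set of cited papers),
# recover a paper's past (everything it transitively references) and
# its future (everything that transitively references it).
# Paper labels are invented for illustration.
cites = {
    "E": {"C", "D"},
    "D": {"B"},
    "C": {"A"},
    "B": {"A"},
}

def ancestors(paper):
    """All papers reachable by following references into the past."""
    seen, stack = set(), list(cites.get(paper, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(cites.get(p, ()))
    return seen

def descendants(paper):
    """All papers whose reference chains lead back to this one."""
    return {p for p in cites if paper in ancestors(p)}

print(sorted(ancestors("E")))    # -> ['A', 'B', 'C', 'D']
print(sorted(descendants("B")))  # -> ['D', 'E']
```

What this sketch cannot do, and what I am wishing for above, is label each edge with *which* idea in the cited paper is being used; that would need per-issue indexing like the legal indexes had.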

An R function to analyze your Google Scholar Citations page

Thursday, November 24th, 2011

An R function to analyze your Google Scholar Citations page

From the post:

Google Scholar has now made Google Scholar Citations profiles available to anyone. You can read about these profiles and set one up for yourself here.

I asked John Muschelli and Andrew Jaffe to write me a function that would download my Google Scholar Citations data so I could play with it. Then they got all crazy on it and wrote a couple of really neat functions. All cool/interesting components of these functions are their ideas and any bugs were introduced by me when I was trying to fiddle with the code at the end.

Features include:

The function will download all of Rafa’s citation data and put it in the matrix out. It will also make wordclouds of (a) the co-authors on his papers and (b) the titles of his papers and save them in the pdf file specified (There is an option to turn off plotting if you want).

It can also calculate citation indices.
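The best-known such citation index is the h-index: the largest h such that h of your papers have at least h citations each. A minimal sketch of the computation (the citation counts are invented; the blog post's actual function is in R and scrapes the profile page first):

```python
# h-index: largest h such that h papers each have >= h citations.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:  # the rank-th most-cited paper still has >= rank citations
            h = rank
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4 (four papers with >= 4 citations)
print(h_index([25, 8, 5, 3, 3]))  # -> 3 (one blockbuster doesn't raise h)
```

The second example shows why the h-index is popular: a single heavily cited paper cannot inflate it.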

Scholars are fairly peripatetic these days and so have webpages, projects, courses, not to mention social media postings using various university identities. A topic map would be a nice complement to this function to gather up the “grey” literature that underlies final publications.

“I say toh-mah-toh, you say toh-may-toh”

Friday, July 9th, 2010

Rough Fuzzies, and Beyond? I thought it was a cute title.

But just scratching the surface of the area of rough sets, I find:

  • generalized fuzzy belief functions
  • generalized fuzzy rough approximation operators
  • fuzzy coverings
  • granular computing
  • training fuzzy systems
  • fuzzy generalization of rough sets
  • generalized fuzzy rough sets
  • fuzzy concept lattices
  • fuzzy implication operators
  • intuitionistic fuzzy implicators

How many of those would you think to search for?

These are the same semantic issues topic maps are designed to help resolve. But resolving them means someone (err, that would be those of us interested in the area) has to undertake the real work of resolving them.

The obvious answer is some robust system that allows tweets, instant messages, email (properly formatted), as well as updating protocols to update a topic map in real time. That is also an unlikely solution.


What about an easier to reach solution? Lutz Maicher’s bibmap is a likely starting point.

We would have to ask Lutz about merging in additional data but I suspect he would be amenable to the suggestion.

Building a robust bibliography of topic map relevant materials would occupy us while waiting on more futuristic solutions.

The Value of Indexing

Monday, June 7th, 2010

The Value of Indexing (2001) by Jan Sykes is a promotion piece for Factiva, a Dow Jones and Reuters Company, but is also a good overview of the value of indexing.

I find it interesting in its description of the use of a taxonomy for indexing purposes. You may remember from reading a print index the use of the term “see also.” This paper appears to argue that the indexing process consists of mapping one or more terms to a single term in the controlled vocabulary.

A single entry from the controlled vocabulary represents a particular concept no matter how it was referred to in the original article. (page 5)

I assume the mapping between the terms in the article and the term in the controlled vocabulary is documented. That mapping may be of more interest to the professionals who create the indexes and power users than to the typical user.
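The indexing process the paper describes, many article terms collapsing onto one controlled-vocabulary concept, with the mapping itself preserved, can be sketched as follows. The terms and concepts are invented for illustration:

```python
# Indexing with a controlled vocabulary: every article term maps to one
# concept, and the mapping record is kept, so the "see also" trail is
# auditable. Terms and concepts are invented for illustration.
controlled = {
    "heart attack": "Myocardial Infarction",
    "MI": "Myocardial Infarction",
    "cardiac arrest": "Cardiac Arrest",
}

def index_article(article_id, terms):
    """Return (concepts, mapping record) so the mapping stays documented."""
    record = {t: controlled[t] for t in terms if t in controlled}
    return set(record.values()), {"article": article_id, "mapped": record}

concepts, audit = index_article("doc-17", ["heart attack", "MI"])
print(sorted(concepts))  # -> ['Myocardial Infarction']
```

Note that the typical user only ever sees `concepts`; the `audit` record is the part of interest to the indexers and power users.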

Perhaps that is a lesson in terms of what is presented to users of topic maps.

Delivery of the information a user wants/needs in their context is more important than demonstrating our cleverness.

That was one of the mistakes in promoting markup: too much emphasis on the cool, new, paradigm-shifting and too little emphasis on the benefit to users. With office products that use markup in a manner non-visible to the average user, markup usage has spread rapidly around the world.

Suggestions on how to make that happen for topic maps?

PS: Obviously this is an old piece so in fairness I am contacting Factiva to advise them of this post and to ask if they have an updated paper, etc. that they might want me to post. I will take the opportunity to plug topic maps as well. 😉

Citation Indexing – Semantic Diversity – Exercise

Sunday, June 6th, 2010

In A Conceptual View of Citation Indexing, which is chapter 1 of Citation Indexing — Its Theory and Application in Science, Technology, and Humanities (1979), Garfield says of the problem of changing terminology and semantics:

Citations, used as indexing statements, provide these lost measures of search simplicity, productivity, and efficiency by avoiding the semantics problems. For example, suppose you want information on the physics of simple fluids. The simple citation “Fisher, M.E., Math. Phys., 5,944, 1964” would lead the searcher directly to a list of papers that have cited this important paper on the subject. Experience has shown that a significant percentage of the citing papers are likely to be relevant. There is no need for the searcher to decide which subject terms an indexer would be most likely to use to describe the relevant papers. The language habits of the searcher would not affect the search results, nor would any changes in scientific terminology that took place since the Fisher paper was published.

In other words, the citation is a precise, unambiguous representation of a subject that requires no interpretation and is immune to changes in terminology. In addition, the citation will retain its precision over time. It also can be used in documents written in different languages. The importance of this semantic stability and precision to the search process is best demonstrated by a series of examples.

Question: What subject does a citation represent?

Question: What “precision” does the citation retain over time?

Exercise: Select any article that interests you with more than twenty (20) non-self citations. Identify ten (10) ideas in the article and examine at least twenty (20) citing articles. Why was your article cited? Was your article cited for an idea you identified? Was your article cited for an idea you did not identify? (Either one is correct. This is not a test of guessing why an article will be cited. It is exploration of a problem space. Your fact finding is important.)

Extra credit: Did you notice any evidence to support or contradict the notion that citation indexing avoids the issue of semantic diversity? If your article has been cited for more than ten (10) years, try one or two citations per year for every year it is cited. Again, your factual observations are important.

Citation Indexing

Sunday, June 6th, 2010

Eugene Garfield’s homepage may not be familiar to topic map fans but it should be.

Garfield invented citation indexing in the late 1950s/early 1960s.

Among the treasures you will find here: