ADS: The Next Generation Search Platform

Monday, March 16th, 2015

ADS: The Next Generation Search Platform by Alberto Accomazzi et al.


Four years after the last LISA meeting, the NASA Astrophysics Data System (ADS) finds itself in the middle of major changes to the infrastructure and contents of its database. In this paper we highlight a number of features of great importance to librarians and discuss the additional functionality that we are currently developing. Starting in 2011, the ADS started to systematically collect, parse and index full-text documents for all the major publications in Physics and Astronomy as well as many smaller Astronomy journals and arXiv e-prints, for a total of over 3.5 million papers. Our citation coverage has doubled since 2010 and now consists of over 70 million citations. We are normalizing the affiliation information in our records and, in collaboration with the CfA library and NASA, we have started collecting and linking funding sources with papers in our system. At the same time, we are undergoing major technology changes in the ADS platform which affect all aspects of the system and its operations. We have rolled out and are now enhancing a new high-performance search engine capable of performing full-text as well as metadata searches using an intuitive query language which supports fielded, unfielded and functional searches. We are currently able to index acknowledgments, affiliations, citations, funding sources, and to the extent that these metadata are available to us they are now searchable under our new platform. The ADS private library system is being enhanced to support reading groups, collaborative editing of lists of papers, tagging, and a variety of privacy settings when managing one’s paper collection. While this effort is still ongoing, some of its benefits are already available through the ADS Labs user interface and API at this http URL

Now for a word from the people who were using “big data” before it was a buzzword!

The focus here is on smaller data, publications, but it makes a good read.

I have been following the work on Solr proper and am interested in learning more about the extensions created to Solr by ADS.


I first saw this in a tweet by Kirk Borne.

Every time you cite a paper w/o reading it,

Saturday, December 13th, 2014

Every time you cite a paper w/o reading it, b/c someone else cited it, a science fairy dies. (A tweet by realscientists.)

The tweet points to the paper, Mother’s Milk, Literature Sleuths, and Science Fairies by Katie Hinde.

Katie encountered an article that offered a model that was right on point for a chapter she was writing. But rather than simply citing that article, Katie started backtracking from that article to the articles it cited. After quite a bit of due diligence, Katie discovered that the cited articles did not make the claims for which they were cited. Not no way, not no how.

Some of the comments to Katie’s post suggest that students in biological sciences should learn from her example.

I would go further and say that all students, whether in the biological sciences, physical sciences, computer science, the humanities, etc., should learn from Katie’s example.

If you can’t or don’t verify cited work, don’t cite it. (full stop)

I haven’t kept statistics on it, but it isn’t uncommon to find citations in computer science work that don’t exist, are cited incorrectly, and/or don’t support the claims made for them. Most of the “don’t exist” class appear to be conference papers that weren’t accepted or were never completed but were cited as “going to appear…”

Someday soon linking of articles will make verification of references much easier than it is today. How will your publications fare on that day?

CERMINE: Content ExtRactor and MINEr

Wednesday, September 24th, 2014

CERMINE: Content ExtRactor and MINEr

From the webpage:

CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form. The system analyses the content of a PDF file and attempts to extract information such as:

  • Title of the article
  • Journal information (title, etc.)
  • Bibliographic information (volume, issue, page numbers, etc.)
  • Authors and affiliations
  • Keywords
  • Abstract
  • Bibliographic references

CERMINE at Github

I used the following three files for a very subjective test of the online interface:

I am mostly interested in extraction of bibliographic entries and can report that while CERMINE made some mistakes, it is quite useful.

I first saw this in a tweet by Docear.


Tuesday, September 10th, 2013


Paperscape

A mapping of papers from arXiv.

I had to “zoom in” a fair amount to get a useful view of the map. Choosing any paper displays its bibliographic information with links to that paper.

Quite clever but I can’t help but think of what a more granular map might offer.

More “granular” in the sense of going below the document level to terms/concepts in each paper and locating them in a stream of discussion by different authors.

Akin to the typical “review” article that traces particular ideas through a series of publications.
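
To make “granular” a little more concrete, here is a minimal sketch in Python of indexing terms rather than whole documents, so that each term carries its own chronological stream of mentions. The records, author names, and the term_streams helper are all made up for illustration; nothing here reflects how Paperscape itself works.

```python
from collections import defaultdict

# Hypothetical records: (year, author, paper id, terms found in the paper).
papers = [
    (2011, "Author A", "paper-1", {"dark matter", "lensing"}),
    (2012, "Author B", "paper-2", {"lensing"}),
    (2013, "Author C", "paper-3", {"dark matter"}),
]

def term_streams(records):
    """Map each term to the time-ordered list of (year, author, paper id) mentions."""
    streams = defaultdict(list)
    for year, author, paper_id, terms in records:
        for term in terms:
            streams[term].append((year, author, paper_id))
    # Sort each stream so the discussion of a term reads in chronological order.
    for mentions in streams.values():
        mentions.sort()
    return dict(streams)

streams = term_streams(papers)
```

Looking up streams["lensing"] then gives the mentions of that term across authors in date order, which is roughly what a review article does by hand.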

But in any event, I commend Paperscape to you as a very clever bit of work.

I first saw this in Nat Torkington’s Four short links: 9 September 2013.

BASE indexed 50 million OAI-records

Wednesday, August 28th, 2013

BASE indexed 50 million OAI-records by Sarah Dister.

From the post:

BASE, a search engine for academic open access web resources, has indexed more than 50,000,000 OAI-records. The records are provided by about 2,700 repositories among which many are related to agriculture.

BASE is a multi-disciplinary search engine for academically relevant OAI-Sources worldwide, which was created and developed by Bielefeld University Library.

Take a few minutes (or longer) to explore BASE.

It is a remarkable resource. For example, users can invoke the Eurovoc Thesaurus as part of their search query.

DBLP feeds

Sunday, August 4th, 2013

DBLP feeds

An RSS feed for conferences and journals that appear in the DBLP Computer Science Bibliography.

I count 1448 conference and 977 journal RSS feeds.

A great resource that merits your attention.
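
If you want to do more than read the feeds in a feed reader, a few lines of Python will pull titles and links out of one. This is only a sketch: it assumes the feeds are plain RSS 2.0 with item, title, and link elements, and the sample feed below is invented rather than copied from DBLP.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for one feed; not real DBLP output.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Conference</title>
    <item>
      <title>First example paper</title>
      <link>http://example.org/paper-1</link>
    </item>
    <item>
      <title>Second example paper</title>
      <link>http://example.org/paper-2</link>
    </item>
  </channel>
</rss>"""

def feed_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

items = feed_items(SAMPLE_FEED)
```

To use it on a live feed you would fetch the feed text first and pass it to feed_items in place of SAMPLE_FEED.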

Bad Data Report

Saturday, May 4th, 2013

The accuracy of references in PhD theses: a case study by Fereydoon Azadeh and Reyhaneh Vaez.



Inaccurate references and citations cause confusion, distrust in the accuracy of a report, waste of time and unnecessary financial charges for libraries, information centres and researchers.


The aim of the study was to establish the accuracy of article references in PhD theses from the Tehran and Tabriz Universities of Medical Sciences and their compliance with the Vancouver style.


We analysed 357 article references in the Tehran and 347 in the Tabriz. Six bibliographic elements were assessed: authors’ names, article title, journal title, publication year, volume and page range. Referencing errors were divided into major and minor.


Sixty two percent of references in the Tehran and 53% of those in the Tabriz were erroneous. In total, 164 references in the Tehran and 136 in the Tabriz were complete without error. Of 357 reference articles in the Tehran, 34 (9.8%) were in complete accordance with the Vancouver style, compared with none in the Tabriz. Accuracy of referencing did not differ significantly between the two groups, but compliance with the Vancouver style was significantly better in the Tehran.


The accuracy of referencing was not satisfactory in both groups, and students need to gain adequate instruction in appropriate referencing methods.

Now that’s bad data!

I have noticed errors in CS paper citations, but not at rates as high as those reported here.

The ACM Digital Library could report for a given paper or conference the number of unknown citations, with a list, for checking.
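
A rough sketch of what such a report might look like, in Python. Everything here is an assumption for illustration: the records are invented, the index is a simple dictionary keyed by normalized title, and the six fields are the ones assessed in the study quoted above. It has nothing to do with any actual ACM Digital Library interface.

```python
import re

# The six bibliographic elements assessed in the study quoted above.
FIELDS = ("authors", "title", "journal", "year", "volume", "pages")

def normalize(text):
    """Lowercase and collapse punctuation so trivial differences do not count."""
    return re.sub(r"[^a-z0-9]+", " ", str(text).lower()).strip()

def check_references(references, index):
    """Split references into unknown ones and ones with mismatched fields.

    `index` maps a normalized title to an authoritative record (hypothetical data).
    Because the title is the lookup key, a wrong title shows up as "unknown".
    """
    unknown, mismatched = [], []
    for ref in references:
        record = index.get(normalize(ref["title"]))
        if record is None:
            unknown.append(ref)
            continue
        bad = [f for f in FIELDS
               if normalize(ref.get(f, "")) != normalize(record.get(f, ""))]
        if bad:
            mismatched.append((ref, bad))
    return unknown, mismatched

# Invented example data, for illustration only.
known_records = [
    {"authors": "Doe J", "title": "An Example Article", "journal": "Journal of Examples",
     "year": "2010", "volume": "12", "pages": "1-10"},
]
index = {normalize(r["title"]): r for r in known_records}

references = [
    {"authors": "Doe J", "title": "An example article.", "journal": "Journal of Examples",
     "year": "2011", "volume": "12", "pages": "1-10"},
    {"authors": "Roe R", "title": "A paper that was going to appear", "journal": "",
     "year": "2012", "volume": "", "pages": ""},
]

unknown, mismatched = check_references(references, index)
```

The unknown list is the part I would want published per paper or per conference: a count, plus the entries themselves, so someone can check them.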

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Tuesday, March 19th, 2013

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.

NoSQL Bibliographic Records:…

Tuesday, October 30th, 2012

NoSQL Bibliographic Records: Implementing a Native FRBR Datastore with Redis by Jeremy Nelson.

From the background:

Using the Library of Congress Bibliographic Framework for the Digital Age as the starting point for software development requirements; the FRBR-Redis-Datastore project is a proof-of-concept for a next-generation bibliographic NoSQL system within the context of improving upon the current MARC catalog and digital repository of a small academic library at a top-tier liberal arts college.

The FRBR-Redis-Datastore project starts with a basic understanding of the MARC, MODS, and FRBR implemented using a NoSQL technology called Redis.

This presentation guides you through the theories and technologies behind one such proof-of-concept bibliographic framework for the 21st century.

I found the answer to “Well, Why Not Hadoop?”

Hadoop was just too complicated compared to the simple three-step Redis server set-up.


A technology’s popularity doesn’t mean it meets your requirements, such as administration by people who are not full-time technical experts.

An Oracle database supports applications that could manage garden club finances but that’s a poor choice under most circumstances.

The Redis part of the presentation is apparently not working (I get Python errors) as of today and I have sent a note with the error messages.

A “proof-of-concept” that merits your attention!

Procedia Computer Science

Friday, September 28th, 2012

Procedia Computer Science. Elsevier.

From about this journal:

Launched in 2009, Procedia Computer Science is an electronic product focusing entirely on publishing high quality conference proceedings. Procedia Computer Science enables fast dissemination so conference delegates can publish their papers in a dedicated online issue on ScienceDirect, which is then made freely available worldwide.

Only ten (10) volumes but open access.

The Proceedings of the International Conference on Computational Science, 2010, 2011, 2012, are all 2,000+ pages. With two hundred and twenty-five (225) articles in the 2012 volume, I am sure you will find something interesting.

Don’t neglect the other volumes but that’s where I am starting.

Author Identifiers (At Least for CS)

Tuesday, September 4th, 2012

I enhanced the VLDB 2012 program with author queries to the DBLP Computer Science Bibliography for my own purposes.

After using that listing myself for a few days, it occurred to me that I should be using DBLP entries as author identifiers throughout my posts, at least when such entries exist.

For several reasons, but mostly:

  • DBLP maintains the publication listings (not by me!)
  • DBLP maintains pointers to other databases and resources (also not by me!)
  • DBLP maintains advanced search capabilities beyond authors (again, not by me!)

If you noticed “not by me” forming a pattern, you would be correct. There is a pattern.

The pattern?

Using DBLP author pages as identifiers, I leverage (not duplicate) the work of the DBLP project.

To the benefit of my readers. (Not to mention myself.)

The DBLP link brings an author’s publication history, their co-authors, and additional bibliographic resources. (That’s a triple I like.)

It takes a moment to insert the link but the payoff is substantial.

When you cite a CS author in your blog, include their DBLP link. We will all thank you for it.
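
If inserting links by hand is the obstacle, a small helper can generate them. The sketch below, in Python, builds a DBLP search link for an author name as a first step toward finding their author page. The URL pattern is my assumption and should be checked against the current DBLP site; a search link is also weaker than the author page itself, which is the real identifier. The function names are hypothetical.

```python
from html import escape
from urllib.parse import quote_plus

# Assumed URL pattern for a DBLP search; verify against the current DBLP site.
DBLP_SEARCH = "https://dblp.org/search?q={query}"

def dblp_author_link(name):
    """Build a DBLP search URL for an author name (assumed pattern)."""
    return DBLP_SEARCH.format(query=quote_plus(name))

def link_author(name):
    """Return an HTML anchor wrapping the author's name with the DBLP link."""
    url = escape(dblp_author_link(name), quote=True)
    return '<a href="{}">{}</a>'.format(url, escape(name))

anchor = link_author("Niklas Elmqvist")
```

Once you have found the author’s actual DBLP page, swap its address in for the search link so the identifier stays stable.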

(I did that once upon a time but lapsed. Will be cleaning up older entries and trying to do better in the future.)

PS: Similar sources of identifiers for other disciplines?

– Product Requirement Document released!

Monday, March 12th, 2012

– Product Requirement Document released! by René Pickhardt.

From the post:

Recently I visited my friend Heinrich Hartmann in Oxford. We talked about various issues how research is done in these days and how the web could theoretically help to spread information faster and more efficiently connect people interested in the same paper / topics.

The idea of was born. A scientific platform which is open source and open data and tries to solve those problems.

But we did not want to reinvent the wheel. So we did some research on existing online solutions and also asked people from various disciplines to name their problems. Find below our product requirement document! If you like our approach you can contact us or contribute on the source code find some starting documentation!

So the plan is to fork an open source question answer system and enrich it with the features fulfilling the needs of scientists and some social aspects (hopefully using neo4j as a supporting data base technology) which will eventually help to rank related work of a paper.

Feel free to provide us with feedback and wishes and join our effort!

More of a “first cut” at requirements than a requirements document but it is an interesting starting point.

What requirements would you add?

CiteWiz: A Tool for the Visualization of Scientific Citation Networks (2007)

Thursday, December 1st, 2011

CiteWiz: A Tool for the Visualization of Scientific Citation Networks (2007) by Niklas Elmqvist and Philippas Tsigas.


We present CiteWiz, an extensible framework for visualization of scientific citation networks. The system is based on a taxonomy of citation database usage for researchers, and provides a timeline visualization for overviews and an influence visualization for detailed views. The timeline displays the general chronology and importance of authors and articles in a citation database, whereas the influence visualization is implemented using the Growing Polygons technique, suitably modified to the context of browsing citation data. Using the latter technique, hierarchies of articles with potentially very long citation chains can be graphically represented. The visualization is augmented with mechanisms for parent-child visualization and suitable interaction techniques for interacting with the view hierarchy and the individual articles in the dataset. We also provide an interactive concept map for keywords and co-authorship using a basic force-directed graph layout scheme. A formal user study indicates that CiteWiz is significantly more efficient than traditional database interfaces for high-level analysis tasks relating to influence and overviews, and equally efficient for low-level tasks such as finding a paper and correlating bibliographical data.

The interactive concept map is particularly interesting although the entire article will be useful for anyone experimenting with network or topic map visualization.

SciVerse Applications Beta

Thursday, November 17th, 2011

SciVerse Applications Beta

From the webpage:

SciVerse Applications Beta lets you integrate search and discovery applications into SciVerse, to help you be more productive in your research. Login or register, find an application and get started – there is nothing to download or install, the applications you’ve selected will appear immediately within SciVerse.

Developers can create applications for over 15 million SciVerse users worldwide. SciVerse Applications Beta lets you integrate your application directly into the core SciVerse user experience on article, record and search results pages. To learn more, please visit the Developer Network.

SciVerse Applications Beta has just launched and we continue to make improvements. We welcome your feedback on all aspects of this service.

Not a lot of folks but every application has to start somewhere. 😉

There was a contest recently for new apps. I will cover the winners in a separate post.


Thursday, October 27th, 2011

AnalyticBridge: A Social Network for Analytic Professionals

Some interesting resources, possibly useful groups.

Anyone with experience with this site?


Monday, October 17th, 2011


Adopt-A-Doc

Please pass this along to your friends! An innovative way to preserve technical literature that will otherwise be difficult to access.

From the website:

Help make important research available online by adopting a U.S. Department of Energy (DOE) technical report. There are more than 200,000 DOE technical reports in need of digitization. In fact, most DOE technical reports from the 1940s to 1991 are still only available in hard copy or microfiche. This means that important research is not easily accessible by researchers and the public.

Why would I want to Adopt-A-Doc?

You may find a technical report that you want to share with others or you think worthy of making broadly available on the Web to support the advancement of science. When you search for important science information in your area of interest, you can choose to sponsor the digitization of any adoptable technical report. The cost is $85 (approximately the same cost as ordering a hard copy). Discounts for larger scale projects may be available. For additional information contact Susan Tackett at 865-576-5699 or

Biological and Environmental Research (BER) Abstracts Database

Monday, October 17th, 2011

Biological and Environmental Research (BER) Abstracts Database

From the webpage:

Since 1995, OSTI has provided assistance and support to the Office of Biological and Environmental Research (BER) by developing and maintaining a database of BER research project information. Called the BER Abstracts Database, it contains summaries of research projects supported by the program. Made up of two divisions, Biological Systems Science Division and Climate and Environmental Sciences Division, BER is responsible for world-class biological and environmental research programs and scientific user facilities. BER’s research program is closely aligned with DOE’s mission goals and focuses on two main areas: the Nation’s Energy Security (developing cost-effective cellulosic biofuels) and the Nation’s Environmental Future (improving the ability to understand, predict, and mitigate the impacts of energy production and use on climate change).

The BER Abstracts Database is publicly available to scientists, researchers, and interested citizens. Each BER research project is represented in the database, including both current/active projects and historical projects dating back to 1995. The information available on each research project includes: project title, abstract, principal investigator, research institution, research area, project term, and funding. Users may conduct basic or advanced searches, and various sorting and downloading options are available.

The BER Abstracts Database serves as a tool for BER program managers and a valuable resource for the public. The database also meets the Department’s strategic goals to disseminate research information and results. Over the past 16 years, over 6,000 project records have been created for the database, offering a fascinating look into the BER research program and how it has evolved. BER played a major role in the development of genomics-based systems biology and in the biotechnology revolution occurring over this period, while also supporting ground-breaking research on the impacts of energy production and use on the environment. The BER Abstracts Database, made available through the collaborative partnership between BER and OSTI, highlights these scientific advancements and maximizes the public value of BER’s research.

Particularly if this is an area of interest for you, take some time to become familiar with the interface.

  1. What do you think about the basic vs. advanced search?
  2. Does the advanced search offer any substantial advantages or do you have to start off with more complete information?
  3. What advantages (if any) does the use of abstracts offer over full text searching?

Science Conference Proceedings Portal

Monday, October 17th, 2011

Science Conference Proceedings Portal

From the website:

Welcome to the DOE Office of Scientific and Technical Information’s (OSTI) Science Conference Proceedings Portal. This distributed portal provides access to science and technology conference proceedings and conference papers from a number of authoritative sites (professional societies and national labs, largely) whose areas of interest in the physical sciences and technology intersect those of the Department of Energy. Proceedings and papers from scientific meetings can be found in these fields, among others: particle physics, nuclear physics, chemistry, petroleum, aeronautics and astronautics, meteorology, engineering, computer science, electric power, fossil fuels. From here you can simultaneously query any or all of the listed organizations and collections for scientific and technical conference proceedings or papers. Simply enter your search term(s) in the “Search” box, check one or more of the listed sites (or check “Select All”), and click the “Search” button.

One of the conference organizations listed is the Association for Computing Machinery (ACM).

No doubt a very good site but I wonder about conferences that only appear as Springer publications, for example? Or that are concerned with computers but only appear as publications of other publishers or organizations?

Question: In a week, how many indexes that include computer science conferences can you find? How do they differ in terms of coverage?

Monday, October 17th, 2011

The Global Science Gateway

From the webpage:

… is a global science gateway, accelerating scientific discovery and progress through a multilateral partnership to enable federated searching of national and international scientific databases and portals.

You have to pick “Advanced Search” to get an idea of the range of coverage offered by this gateway.

Note that the service offers multilingual searching powered by Microsoft Translator.

I did a search for “partially observable Markov processes” (thinking to avoid a real flood of “hits”) and was quickly shown six (6) “hits.” Then a popup appeared advising that a full search was complete and asking if it should add another four hundred forty-seven (447) results. The criteria for the “quick” results aren’t clear, but the speed is impressive. Now the interface advises: 453 results from at least 3266 found.

Odd to see SpringerLink listed first under the Author facet on the left-hand side of the screen.

The search “hits” re-ordered themselves and since I had used an exact match string, the first item was Technical rept. no. 4 from MIT, Corporate Author: MASSACHUSETTS INST OF TECH CAMBRIDGE OPERATIONS RESEARCH CENTER, Personal Author: Kramer,J. David R. ,Jr., Report Date: April 1964.

You get “alerts” of later results but only if you have a registered account. But you have to search before you see a link to the login page, where you can create an account. For your convenience, the login page.

It is a very interesting “federation” of search results but I am troubled by not knowing the limitations of the underlying search engines.

Bibliographic Wilderness

Tuesday, October 11th, 2011

Bibliographic Wilderness

An interesting bibliographic/library blog that I encountered. Posts on URLs, microdata, etc.

Joy of Clojure – Bibliography

Sunday, July 10th, 2011

Joy of Clojure – Bibliography

OK, I’m an academic. I like bibliographies. 😉

I noticed it repeats what you will find in The Joy of Clojure. That’s useful if you are at the library without your copy of Joy and can’t remember a particular citation. But not very useful otherwise.

Suggestion: Make the bibliography a dynamic one that accepts suggested annotated references from readers that point to particular sections or discussions in the text. Posting subject to the judgment of the authors.

Could prove to be useful in the event of a second or following edition.

Multiple Criteria Decision Aid Bibliography

Wednesday, July 6th, 2011

Multiple Criteria Decision Aid Bibliography

I stumbled over this site while looking for a free copy of Amos Tversky’s “Features of Similarity” paper to cite for my readers. (I never was able to find a copy that wasn’t behind a pay-per-view wall. Sorry.)

It is maintained by the LAMSADE laboratory as a collection of materials on decision making, a category into which identification of a subject certainly falls.

The LAMSADE laboratory has been established in 1974 as a joint laboratory of the Université Paris-Dauphine and the CNRS. Its central research activity lies at the interface of two fundamental scientific areas: Computer Science and Decision Making (and, more generally, Operations Research).

LAMSADE’s research themes are both theoretical and applied and cover decision making, decision theory, social choice, operations research, combinatorial optimization, computational complexity, mathematical programming, interactions between decision and artificial intelligence, massive data computation, and information systems.

And yes, it is no mistake, the first entry in the bibliography is from 1736.


The Review of Metaphysics

Sunday, April 24th, 2011

The Review of Metaphysics is a journal I first encountered as an undergraduate.

The CURRENT PERIODICAL ARTICLES section offers abstracts of current articles from a large number of philosophical journals. Far more than I could acquire or review personally.

This is where I discovered the Music, Essential Metaphor, and Private Language paper.

If you are interested in the theory side of knowledge/topic maps and/or something harder than representing spreadsheets as topic maps, this is a good source of starting points.

Zotero – Software

Monday, January 3rd, 2011


I don’t remember now how I stumbled across this interesting project.

Looks like fertile ground for the discussion of subject identity.

Particularly since shared bibliographies are nice but merged bibliographies would be better.

Drop in, introduce yourself and topic map thinking about subject identity.

Open Bibliographic Working Group

Tuesday, November 23rd, 2010

Open Bibliographic Working Group

The group responsible for processing the British National Bibliography.

That was under the JISC OpenBibliography project.

They have several other projects I need to mention here.

If you are interested in bibliographic data, this is one group to follow and, if you are able, please contribute to their efforts.