Archive for the ‘PubMed’ Category

PubMed comments & their continuing conversations

Monday, November 21st, 2016

PubMed comments & their continuing conversations

From the post:

We have many options for communication. We can choose platforms that fit our style, approach, and time constraints. From pop culture to current events, information and opinions are shared and discussed across multiple channels. And scientific publications are no exception.

PubMed Commons was established to enable commenting in PubMed, the largest biomedical literature database. In the past year, commenters posted to more than 1,400 publications. Of those publications, 80% have a single comment today, and 12% have comments from multiple members. The conversation carries forward in other venues.

Sometimes comments pull in discussion from other locations or spark exchanges elsewhere. Here are a few examples where social media prompted PubMed Commons posts or continued the commentary on publications.

An encouraging review of examples of sane discussion through the use of comments.

Compare the abandonment of comments by some media outlets, NPR for example: see NPR Website To Get Rid Of Comments by Elizabeth Jensen.

My takeaway from Jensen’s account was that NPR likes its own free speech but is not so interested in the free speech of others.

See also: Have Comment Sections on News Media Websites Failed?, for op-ed pieces at the New York Times from a variety of perspectives.

Perhaps comments on news sites are examples of casting pearls before swine? (Matthew 7:6)

PMID-PMCID-DOI Mappings (monthly update)

Wednesday, July 8th, 2015

PMID-PMCID-DOI Mappings (monthly update)

Dario Taraborelli tweets:

All PMID-PMCID-DOI mappings known by @EuropePMC_news, refreshed monthly

The file lists at 150MB but be aware that it decompresses to 909MB+. Approximately 25.6 million lines.

In case you are unfamiliar with PMID/PMCID:

PMID and PMCID are not the same thing.

PMID is the unique identifier number used in PubMed. They are assigned to each article record when it enters the PubMed system, so an in press publication will not have one unless it is issued as an electronic pre-pub. The PMID# is always found at the end of a PubMed citation.

Example of PMID#: Diehl SJ. Incorporating health literacy into adult basic education: from life skills to life saving. N C Med J. 2007 Sep-Oct;68(5):336-9. Review. PubMed PMID: 18183754.

PMCID is the unique identifier number used in PubMed Central. People are usually looking for this number in order to comply with the NIH Public Access Regulations. We have a webpage that gathers information to guide compliance. You can find it here: (broken link) [updated link:]

A PMCID# is assigned after an author manuscript is deposited into PubMed Central. Some journals will deposit for you. Is this your publication? What is the journal?

PMCID#s can be found at the bottom of an article citation in PubMed, but only for articles that have been deposited in PubMed Central.

Example of a PMCID#: Ishikawa H, Kiuchi T. Health literacy and health communication. Biopsychosoc Med. 2010 Nov 5;4:18. PubMed PMID: 21054840; PubMed Central PMCID: PMC2990724.

From: how do I find the PMID (is that the same as the PMCID?) for in press publications?

If I were converting this into a topic map, I would use the PMID, PMCID, and DOI entries as subject identifiers. (PMIDs and PMCIDs can be expressed as hrefs.)
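As a rough sketch of that idea, the following Python streams the gzipped mapping file instead of expanding all 909MB+ to disk, and turns each row into candidate subject identifiers. The column names (PMID, PMCID, DOI) and the URL prefixes used for the hrefs are assumptions on my part, not something the tweet specifies, so check them against the header of the file you actually download.

```python
import csv
import gzip
import io

# Assumed URL prefixes for turning identifiers into hrefs.
PREFIXES = {
    "PMID": "https://pubmed.ncbi.nlm.nih.gov/",
    "PMCID": "https://www.ncbi.nlm.nih.gov/pmc/articles/",
    "DOI": "https://doi.org/",
}

def subject_identifiers(row):
    """Return the hrefs for whichever identifiers a row carries."""
    return [PREFIXES[key] + (row.get(key) or "").strip()
            for key in PREFIXES
            if (row.get(key) or "").strip()]

def iter_mappings(binary_stream):
    """Yield one list of subject identifiers per row of a gzipped CSV stream."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text):
            ids = subject_identifiers(row)
            if ids:
                yield ids

# Tiny in-memory stand-in for the real file, using the identifiers quoted above.
sample = gzip.compress(
    b"PMID,PMCID,DOI\n"
    b"21054840,PMC2990724,\n"
    b"18183754,,\n"
)
mappings = list(iter_mappings(io.BytesIO(sample)))
```

For the real download, open the .gz file in binary mode and pass the file object to iter_mappings; because it is a generator, only one row is held in memory at a time.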

Understanding UMLS

Sunday, February 23rd, 2014

Understanding UMLS by Sujit Pal.

From the post:

I’ve been looking at Unified Medical Language System (UMLS) data this last week. The medical taxonomy we use at work is partly populated from UMLS, so I am familiar with the data, but only after it has been processed by our Informatics team. The reason I was looking at it is because I am trying to understand Apache cTakes, an open source NLP pipeline for the medical domain, which uses UMLS as one of its inputs.

UMLS is provided by the National Library of Medicine (NLM), and consists of 3 major parts: the Metathesaurus, consisting of over 1M medical concepts, a Semantic Network to categorize concepts by semantic type, and a Specialist Lexicon containing data to help do NLP on medical text. In addition, I also downloaded the RxNorm database that contains drug/medication information. I found that the biggest challenge was accessing the data, so I will describe that here, and point you to other web resources for the data descriptions.

Before getting the data, you have to sign up for a license with UMLS Terminology Services (UTS) – this is a manual process and can take a few days over email (I did this couple of years ago so details are hazy). UMLS data is distributed as .nlm files which can (as far as I can tell) be opened and expanded only by the Metamorphosis (mmsys) downloader, available on the UMLS download page. You need to run the following sequence of steps to capture the UMLS data into a local MySQL database. You can use other databases as well, but you would have to do a bit more work.


The table and column names are quite cryptic and the relationships are not evident from the tables. You will need to refer to the data dictionaries for each system to understand it before you do anything interesting with the data. Here are the links to the online references that describe the tables and their relationships for each system better than I can.

I have only captured the highlights from Sujit’s post so see his post for additional details.

There has been no small amount of time and effort invested in UMLS. That names are cryptic and relationships not specified is more typical than any other state of data.

Take the opportunity to learn about UMLS and to ponder what solutions you would offer.

Analyzing PubMed Entries with Python and NLTK

Wednesday, February 19th, 2014

Analyzing PubMed Entries with Python and NLTK by Themos Kalafatis.

From the post:

I decided to take my first steps of learning Python with the following task : Retrieve all entries from PubMed and then analyze those entries using Python and the Text Mining library NLTK.

We assume that we are interested in learning more about a condition called Sudden Hearing Loss. Sudden Hearing Loss is considered a medical emergency and has several causes although usually it is idiopathic (a disease or condition the cause of which is not known or that arises spontaneously according to Wikipedia).

At the moment of writing, the PubMed Query for sudden hearing loss returns 2919 entries :

A great illustration not only of using NLTK but also of the iterative nature of successful querying.

Some queries, quite simple ones, can and do succeed on the first attempt.

Themos demonstrates how to use NLTK to explore a data set where the first response isn’t all that helpful.

This is a starting idea for weekly exercises with NLTK, exercises which emphasize different aspects of NLTK.

Analysis of PubMed search results using R

Friday, December 6th, 2013

Analysis of PubMed search results using R by Pilar Cacheiro.

From the post:

Looking for information about meta-analysis in R (subject for an upcoming post as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across this tutorial from The R User Conference 2013 – I couldn’t make it this time, even when it was held so close, maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!). Then, you might use the information to look for trends on a topic of interest or extracting specific information from abstracts, getting descriptives,…

Pilar does a great job introducing RISmed and pointing to additional sources for more examples and discussion of the package.
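RISmed is an R package, so what follows is not its API. As a loose Python analogue of the "define the query, then download" step, here is a sketch that builds request URLs for NCBI's E-utilities esearch endpoint and spaces the requests out. The helper names, page size, and default delay are my own choices for illustration; NCBI's usage guidelines ask for no more than three requests per second without an API key, which is the reason for the pause.

```python
import time
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retstart=0, retmax=100):
    """Build an esearch URL for one page of PubMed IDs matching term."""
    params = {"db": "pubmed", "term": term,
              "retstart": retstart, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

def paged_urls(term, total, page_size=100):
    """URLs covering results 0..total-1 in pages of page_size."""
    return [esearch_url(term, start, page_size)
            for start in range(0, total, page_size)]

def throttled(urls, delay=0.5, sleep=time.sleep):
    """Yield URLs one at a time, pausing between them (hypothetical helper)."""
    for i, url in enumerate(urls):
        if i:
            sleep(delay)
        yield url

# Three pages are enough to cover 250 results at 100 per page.
urls = paged_urls("meta-analysis", total=250)
```

Fetching is deliberately left out: loop over throttled(urls) and hand each URL to whatever HTTP client you prefer, so the pause sits between requests rather than inside them.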

Meta-analysis is great but you could also be selling the results of your queries to PubMed.

After all, they would be logging your IP address, not that of your client.

Some people prefer more anonymity than others and are willing to pay for that privilege.

PubMed Watcher (beta)

Thursday, April 25th, 2013

PubMed Watcher (beta)

After logging in with a Google account:

Welcome on PubMed Watcher!

Thanks for registering, here is what you need to know to get quickly started:

Step 1 – Add a Key Article

Define your research topic by setting up to four Key Articles. For instance you can use your own work as input or the papers of the lab you are working in at the moment. Key Articles describe the science you care about. The articles must be referenced on PubMed.

Step 2 – Read relevant stuff

PubMed Watcher will provide you with a feed of related articles, sorted by relevance and similarity in regards to the Key Articles content. The more Key Articles you have, the more tailored the list will be. PubMed Watcher helps to abstract away from journals, impact factors and date of publishing. Spend time reading, not searching! Come back every now and then to monitor your field and to get relevant literature to read.

Ready? Add your first Key Article or learn more about PubMed Watcher machinery.

OK, so I picked four seed articles and then read the “about,” where a “pinch of heuristics” says:

Now the idea behind PubMed Watcher is to pool the feeds coming from each one of your Key Articles. If an article is present in more than one feed, it means that this article seems to be even more interesting to you, that’s the heuristic. The redundant article then gets a new higher score which is the sum of all its individual scores. Example, let’s say you have two Key Articles named A and B. A has two similar articles F and G with respective similarity scores of 4 and 2. The Key Article B has two similar articles too: M and G with scores 7 and 6. The feed presented to you by PubMed Watcher will then be: G first (score of 6+2=8), M (score of 7) and finally F (4). This score is standardised in percentages (relative relatedness, the blue bars in the application), so here we would get: G (100%), M (88%) and F (50%). This metric is not perfect yet it’s intuitive and gives good enough results; plus it’s fast to compute.
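The heuristic is simple enough to sketch in a few lines of Python. This is my reading of the description above, not the project's actual code: the function name and data layout are made up, and I am assuming the percentage is each total divided by the highest total.

```python
from collections import defaultdict

def pooled_feed(feeds):
    """Sum each article's similarity scores across all Key Article feeds,
    then express every total as a percentage of the highest total."""
    totals = defaultdict(float)
    for related in feeds.values():
        for article, score in related.items():
            totals[article] += score
    top = max(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(article, total, round(100 * total / top))
            for article, total in ranked]

# The worked example from the "about" page: Key Articles A and B.
feeds = {
    "A": {"F": 4, "G": 2},
    "B": {"M": 7, "G": 6},
}
feed = pooled_feed(feeds)
# [('G', 8.0, 100), ('M', 7.0, 88), ('F', 4.0, 50)]
```

Summing rewards articles that show up in several feeds, which is the whole point of the heuristic; an article related to only one Key Article has to score very highly on its own to outrank them.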

Paper on the technique:

PubMed related articles: a probabilistic topic-based model for content similarity by Jimmy Lin and W John Wilbur.

Code on Github.

The interface is fairly “lite” and you can change your four articles easily.

One thing I like from the start is that all I need do is pick one to four articles and I’m set up.

Hard to imagine an easier setup process that comes close to matching your interests.

Visualizing the Topical Structure of the Medical Sciences:…

Thursday, March 14th, 2013

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)



We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.


Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.


Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
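To make the dimensionality-reduction step concrete, here is a toy Python sketch of keeping only the most frequent terms and turning documents into vectors over that reduced vocabulary. It is not the authors' code: counting by document frequency, the tie-breaking rule, the binary vectors, and the four made-up documents are all my assumptions, and the SOM training itself is not shown.

```python
from collections import Counter
import math

def top_terms(docs, fraction=0.10):
    """Return the most frequent `fraction` of terms, by document frequency."""
    counts = Counter(term for terms in docs for term in set(terms))
    keep = max(1, math.ceil(len(counts) * fraction))
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:keep]

def to_vector(terms, vocabulary):
    """Binary vector: 1 where the document carries the vocabulary term."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]

# Made-up documents, each a list of indexer-assigned terms.
docs = [
    ["Humans", "Hearing Loss, Sudden", "Adult"],
    ["Humans", "Health Literacy"],
    ["Humans", "Adult", "Neoplasms"],
    ["Animals", "Mice"],
]
# 25% rather than 10% only because the toy vocabulary has just seven terms.
vocabulary = top_terms(docs, fraction=0.25)
vectors = [to_vector(d, vocabulary) for d in docs]
```

Note that the last document ends up as an all-zero vector, a small reminder that trimming the vocabulary this way can leave some documents with nothing to say.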

Impressive work to say the least!

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary (MeSH terms in this study), raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

In the red corner – PubMed and in the blue corner – Google Scholar

Monday, June 25th, 2012

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:


Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.


We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.


Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.


Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.

I have several concerns that may or may not be allayed by further investigation:

  • Four queries seem like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but I am more concerned with what led to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one with a lesser impact factor. I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts as others had read before them.


Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

Sunday, March 4th, 2012

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? by Casey Bergman.

From the post:

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet to catalyze similar success on the second goal of the open-access movement: unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, less than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or service provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich archive of the combined output of thousands of scientists worldwide.

Good question.

Suggestions for answers? (post to the original posting)

BTW, Casey includes a listing of the articles based on mining of the open-access contents of PubMed Central.

What other open access data sets suffer from a lack of use? Comments on why?

Topical Classification of Biomedical Research Papers – Details

Tuesday, January 3rd, 2012

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With a use of our automatic tagging algorithm, which we will describe in details after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical value. Intuitively, they can be interpreted as values of a rough membership function that measures a degree in which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have little (or even none) non-zero values (corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.
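Before thinking about classifiers, it helps to see what reading that format looks like. The Python sketch below parses rows in the layout the task describes, keeps only the non-zero entries (the data is sparse), and finds the columns that carry any signal. The sample lines are made up by me and are not competition data, and the function names are hypothetical.

```python
def parse_row(line):
    """Turn one tab-separated row into a sparse {column_index: strength} dict."""
    values = line.rstrip("\n").split("\t")
    return {i: int(v) for i, v in enumerate(values) if int(v) != 0}

def parse_labels(line):
    """Turn one comma-separated label row into a set of topic identifiers."""
    return {int(x) for x in line.strip().split(",") if x.strip()}

def useful_columns(rows):
    """Column indices with at least one non-zero value in the given rows."""
    return sorted({i for row in rows for i in row})

# Made-up sample rows in the described layout (not real competition data).
data_lines = ["0\t250\t0\t0\t1000\n", "0\t0\t0\t40\t0\n"]
label_lines = ["3,17\n", "42\n"]

rows = [parse_row(line) for line in data_lines]
labels = [parse_labels(line) for line in label_lines]
columns = useful_columns(rows)
```

Since each document can carry several topic identifiers, a set per row is a natural fit for the labels; whether the rarely used columns are worth keeping is, as the task says, up to the participant.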

I am looking at it as an opportunity to learn a good bit about automatic text classification and what, if any, role that topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!

Topical Classification of Biomedical Research Papers

Monday, January 2nd, 2012

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers

From the webpage:

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers, is a special event of Joint Rough Sets Symposium (JRS 2012), that will take place in Chengdu, China, August 17-20, 2012. The task is related to the problem of predicting topical classification of scientific publications in a field of biomedicine. Money prizes worth 1,500 USD will be awarded to the most successful teams. The contest is funded by the organizers of the JRS 2012 conference, Southwest Jiaotong University, with support from University of Warsaw, SYNAT project and TunedIT.

Introduction: Development of freely available biomedical databases allows users to search for documents containing highly specialized biomedical knowledge. Rapidly increasing size of scientific article meta-data and text repositories, such as MEDLINE [1] or PubMed Central (PMC) [2], emphasizes the growing need for accurate and scalable methods for automatic tagging and classification of textual data. For example, medical doctors often search through biomedical documents for information regarding diagnostics, drugs dosage and effect or possible complications resulting from specific treatments. In the queries, they use highly sophisticated terminology, that can be properly interpreted only with a use of a domain ontology, such as Medical Subject Headings (MeSH) [3]. In order to facilitate the searching process, documents in a database should be indexed with concepts from the ontology. Additionally, the search results could be grouped into clusters of documents, that correspond to meaningful topics matching different information needs. Such clusters should not necessarily be disjoint since one document may contain information related to several topics. In this data mining competition, we would like to raise both of the above mentioned problems, i.e. we are interested in identification of efficient algorithms for topical classification of biomedical research papers based on information about concepts from the MeSH ontology, that were automatically assigned by our tagging algorithm. In our opinion, this challenge may be appealing to all members of the Rough Set Community, as well as other data mining practitioners, due to its strong relations to well-founded subjects, such as generalized decision rules induction [4], feature extraction [5], soft and rough computing [6], semantic text mining [7], and scalable classification methods [8]. 
In order to ensure scientific value of this challenge, each of participating teams will be required to prepare a short report describing their approach. Those reports can be used for further validation of the results. Apart from prizes for top three teams, authors of selected solutions will be invited to prepare a paper for presentation at JRS 2012 special session devoted to the competition. Chosen papers will be published in the conference proceedings.

Data sets became available today.

This is one of those “praxis” opportunities for topic maps.