Archive for the ‘PubMed’ Category

PubMed Watcher (beta)

Thursday, April 25th, 2013

PubMed Watcher (beta)

After logging it with a Google account:

Welcome on PubMed Watcher!

Thanks for registering, here is what you need to know to get quickly started:

Step 1 – Add a Key Article

Define your research topic by setting up to four Key Articles. For instance you can use your own work as input or the papers of the lab you are working in at the moment. Key Articles describe the science you care about. The articles must be referenced on PubMed.

Step 2 – Read relevant stuff

PubMed Watcher will provide you with a feed of related articles, sorted by relevance and similarity in regards to the Key Articles content. The more Key Articles you have, the more tailored the list will be. PubMed Watcher helps to abstract away from journals, impact factors and date of publishing. Spend time reading, not searching! Come back every now and then to monitor your field and to get relevant literature to read.

Ready? Add your first Key Article or learn more about PubMed Watcher machinery.

OK, so I picked four seed articles and then read the “about,” where a “pinch of heuristics” says:

Now the idea behind PubMed Watcher is to pool the feeds coming from each one of your Key Article. If an article is present in more than one feed, it means that this article seems to be even more interesting to you, that’s the heuristic. The redundant article then gets a new higher score which is the sum of all its indivual scores. Example, let’s say you have two Key Articles named A and B. A has two similar articles F and G with respective similarity scores of 4 and 2. The Key Article B has two similar articles too: M and G with scores 7 and 6. The feed presented to you by PubMed Watcher will then be: G first (score of 6+2=8), M (score of 7) and finally F (4). This score is standardised in percentages (relative relatedness, the blue bars in the application), so here we would get: G (100%), M (88%) and F (50%). This metrics is not perfect yet it’s intuitive and gives good enough results; plus it’s fast to compute.

Paper on the technique:

PubMed related articles: a probabilistic topic-based model for content similarity by Jimmy Lin and W John Wilbur.

Code on Github.

The interface is fairly “lite” and you can change your four articles easily.

One thing I like from the start is that all I need do it pick one to four articles and I’m setup.

Hard to imagine an easier setup process that comes close to matching your interests.

Visualizing the Topical Structure of the Medical Sciences:…

Thursday, March 14th, 2013

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)

Abstract:

Background

We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.

Methodology

Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.

Conclusions

Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.

Impressive work to say the least!

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary-MeSH terms in this study-raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

In the red corner – PubMed and in the blue corner – Google Scholar

Monday, June 25th, 2012

Medical literature searches: a comparison of PubMed and Google Scholar by Eva Nourbakhsh, Rebecca Nugent, Helen Wang, Cihan Cevik and Kenneth Nugent. (Health Information & Libraries Journal, Article first published online: 19 JUN 2012)

From the abstract:

Background

Medical literature searches provide critical information for clinicians. However, the best strategy for identifying relevant high-quality literature is unknown.

Objectives

We compared search results using PubMed and Google Scholar on four clinical questions and analysed these results with respect to article relevance and quality.

Methods

Abstracts from the first 20 citations for each search were classified into three relevance categories. We used the weighted kappa statistic to analyse reviewer agreement and nonparametric rank tests to compare the number of citations for each article and the corresponding journals’ impact factors.

Results

Reviewers ranked 67.6% of PubMed articles and 80% of Google Scholar articles as at least possibly relevant (P = 0.116) with high agreement (all kappa P-values < 0.01). Google Scholar articles had a higher median number of citations (34 vs. 1.5, P < 0.0001) and came from higher impact factor journals (5.17 vs. 3.55, P = 0.036).

Conclusions

PubMed searches and Google Scholar searches often identify different articles. In this study, Google Scholar articles were more likely to be classified as relevant, had higher numbers of citations and were published in higher impact factor journals. The identification of frequently cited articles using Google Scholar for searches probably has value for initial literature searches.

I have several concerns that may or may not be allied by further investigation:

  • Four queries seems like an inadequate basis for evaluation. Not that I expect to see one “winner” and one “loser,” but am more concerned with what lead to the differences in results.
  • It is unclear why a citation from a journal with a higher impact factor is superior to one with a lesser impact factor? I assume the point of the query is to obtain a useful result (in the sense of medical treatment, not tenure).
  • Neither system enabled users to build upon the query experience of prior users with a similar query.
  • Neither system enabled users to avoid re-reading the same texts as other had read before them.

Thoughts?

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central?

Sunday, March 4th, 2012

Why Are There So Few Efforts to Text Mine the Open Access Subset of PubMed Central? by Casey Bergman.

From the post:

The open access movement in scientific publishing has two broad aims: (i) to make scientific articles more broadly accessible and (ii) to permit unrestricted re-use of published scientific content. From its humble beginnings in 2001 with only two journals, PubMed Central (PMC) has grown to become the world’s largest repository of full-text open-access biomedical articles, containing nearly 2.4 million biomedical articles that can be freely downloaded by anyone around the world. Thus, while holding only ~11% of the total published biomedical literature, PMC can be viewed clearly as a major success in terms of making the biomedical literature more broadly accessible.

However, I argue that PMC has yet catalyze similar success on the second goal of the open-access movement — unrestricted re-use of published scientific content. This point became clear to me when writing the discussions for two papers that my lab published last year. In digging around for references to cite, I was struck by how difficult it was to find examples of projects that applied text-mining tools to the entire set of open-access articles from PubMed Central. Unsure if this was a reflection of my ignorance or the actual state of the art in the field, I canvassed the biological text mining community, the bioinformatics community and two major open-access publishers for additional examples of text-mining on the the entire open-access subset of PMC.

Surprisingly, I found that after a decade of existence only ~15 articles* have ever been published that have used the entire open-access subset of PMC for text-mining research. In other words, less than 2 research articles per year are being published that actually use the open-access contents of PubMed Central for large-scale data mining or sevice provision. I find the lack of uptake of PMC by text-mining researchers to be rather astonishing, considering it is an incredibly rich achive of the combined output of thousands of scientists worldwide.

Good question.

Suggestions for answers? (post to the original posting)

BTW, Casey includes a listing of the articles based on mining of the open-access contents of PubMed Central.

What other open access data sets suffer from a lack of use? Comments on why?

Topical Classification of Biomedical Research Papers – Details

Tuesday, January 3rd, 2012

OK, I registered both on the site and for the contest.

From the Task:

Our team has invested a significant amount of time and effort to gather a corpus of documents containing 20,000 journal articles from the PubMed Central open-access subset. Each of those documents was labeled by biomedical experts from PubMed with several MeSH subheadings that can be viewed as different contexts or topics discussed in the text. With a use of our automatic tagging algorithm, which we will describe in details after completion of the contest, we associated all the documents with the most related MeSH terms (headings). The competition data consists of information about strengths of those bonds, expressed as numerical value. Intuitively, they can be interpreted as values of a rough membership function that measures a degree in which a term is present in a given text. The task for the participants is to devise algorithms capable of accurately predicting MeSH subheadings (topics) assigned by the experts, based on the association strengths of the automatically generated tags. Each document can be labeled with several subheadings and this number is not fixed. In order to ensure that participants who are not familiar with biomedicine, and with the MeSH ontology in particular, have equal chances as domain experts, the names of concepts and topical classifications are removed from data. Those names and relations between data columns, as well as a dictionary translating decision class identifiers into MeSH subheadings, can be provided on request after completion of the challenge.

Data format: The data set is provided in a tabular form as two tab-separated values files, namely trainingData.csv (the training set) and testData.csv (the test set). They can be downloaded only after a successful registration to the competition. Each row of those data files represents a single document and, in the consecutive columns, it contains integers ranging from 0 to 1000, expressing association strengths to corresponding MeSH terms. Additionally, there is a trainingLables.txt file, whose consecutive rows correspond to entries in the training set (trainingData.csv). Each row of that file is a list of topic identifiers (integers ranging from 1 to 83), separated by commas, which can be regarded as a generalized classification of a journal article. This information is not available for the test set and has to be predicted by participants.

It is worth noting that, due to nature of the considered problem, the data sets are highly dimensional – the number of columns roughly corresponds to the MeSH ontology size. The data sets are also sparse, since usually only a small fraction of the MeSH terms is assigned to a particular document by our tagging algorithm. Finally, a large number of data columns have little (or even none) non-zero values (corresponding concepts are rarely assigned to documents). It is up to participants to decide which of them are still useful for the task.

I am looking at it as an opportunity to learn a good bit about automatic text classification and what, if any, role that topic maps can play in such a scenario.

Suggestions as well as team members are most welcome!

Topical Classification of Biomedical Research Papers

Monday, January 2nd, 2012

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers

From the webpage:

JRS 2012 Data Mining Competition: Topical Classification of Biomedical Research Papers, is a special event of Joint Rough Sets Symposium (JRS 2012, http://sist.swjtu.edu.cn/JRS2012/) that will take place in Chengdu, China, August 17-20, 2012. The task is related to the problem of predicting topical classification of scientific publications in a field of biomedicine. Money prizes worth 1,500 USD will be awarded to the most successful teams. The contest is funded by the organizers of the JRS 2012 conference, Southwest Jiaotong University, with support from University of Warsaw, SYNAT project and TunedIT.

Introduction: Development of freely available biomedical databases allows users to search for documents containing highly specialized biomedical knowledge. Rapidly increasing size of scientific article meta-data and text repositories, such as MEDLINE [1] or PubMed Central (PMC) [2], emphasizes the growing need for accurate and scalable methods for automatic tagging and classification of textual data. For example, medical doctors often search through biomedical documents for information regarding diagnostics, drugs dosage and effect or possible complications resulting from specific treatments. In the queries, they use highly sophisticated terminology, that can be properly interpreted only with a use of a domain ontology, such as Medical Subject Headings (MeSH) [3]. In order to facilitate the searching process, documents in a database should be indexed with concepts from the ontology. Additionally, the search results could be grouped into clusters of documents, that correspond to meaningful topics matching different information needs. Such clusters should not necessarily be disjoint since one document may contain information related to several topics. In this data mining competition, we would like to raise both of the above mentioned problems, i.e. we are interested in identification of efficient algorithms for topical classification of biomedical research papers based on information about concepts from the MeSH ontology, that were automatically assigned by our tagging algorithm. In our opinion, this challenge may be appealing to all members of the Rough Set Community, as well as other data mining practitioners, due to its strong relations to well-founded subjects, such as generalized decision rules induction [4], feature extraction [5], soft and rough computing [6], semantic text mining [7], and scalable classification methods [8]. In order to ensure scientific value of this challenge, each of participating teams will be required to prepare a short report describing their approach. Those reports can be used for further validation of the results. Apart from prizes for top three teams, authors of selected solutions will be invited to prepare a paper for presentation at JRS 2012 special session devoted to the competition. Chosen papers will be published in the conference proceedings.

Data sets became available today.

This is one of those “praxis” opportunities for topic maps.