Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 30, 2014

Expanded 19th-century Medical Collection

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:06 am

Wellcome Library and Jisc announce partners in 19th-century medical books digitisation project

From the post:

The libraries of six universities have joined the partnership – UCL (University College London), the University of Leeds, the University of Glasgow, the London School of Hygiene & Tropical Medicine, King’s College London and the University of Bristol – along with the libraries of the Royal College of Physicians of London, the Royal College of Physicians of Edinburgh and the Royal College of Surgeons of England.

Approximately 15 million pages of printed books and pamphlets from all ten partners will be digitised over a period of two years and will be made freely available to researchers and the public under an open licence. By pooling their collections the partners will create a comprehensive online library. The content will be available on multiple platforms to broaden access, including the Internet Archive, the Wellcome Library and Jisc Historic Books.

The project’s focus is on books and pamphlets from the 19th century that are on the subject of medicine or its related disciplines. This will include works relating to the medical sciences, consumer health, sport and fitness, as well as different kinds of medical practice, from phrenology to hydrotherapy. Works on food and nutrition will also feature: around 1400 cookery books from the University of Leeds are among those lined up for digitisation. They, along with works from the other partner institutions, will be transported to the Wellcome Library in London where a team from the Internet Archive will undertake the digitisation work. The project will build on the success of the US-based Medical Heritage Library consortium, of which the Wellcome Library is a part, which has already digitised over 50 000 books and pamphlets.

Digital coverage of the 19th century is taking another leap forward!

Given the changes in medical terminology (and practices!) since the 19th century, this should be a gold mine for topic map applications.

July 6, 2014

Medical Vocabulary

Filed under: Medical Informatics,Vocabularies — Patrick Durusau @ 4:00 pm

Medical Vocabulary by John D. Cook.

A new twitter account that tweets medical terms with definitions.

Would a twitter account that focuses on semantic terminology be useful?

No promises, just curious.

Semantic searching/integration would look promising if there were some evidence that semanticists are aware of the vast and varied terminology in their own field.

PS: John D. Cook has seventeen (17) Twitter accounts as of today.

I subscribe to several of them and they are very much worth the time to follow.

For the current list of John D. Cook twitter accounts, see: http://www.johndcook.com/twitter/

June 21, 2014

Egas:…

Filed under: Bioinformatics,Biomedical,Medical Informatics,Text Mining — Patrick Durusau @ 7:42 pm

Egas: a collaborative and interactive document curation platform by David Campos, et al.

Abstract:

With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships into structured resources. As manual annotation is a demanding and expensive task, computerized solutions were developed to perform such tasks automatically. However, high-end information extraction techniques are still not widely used by biomedical research communities, mainly because of the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This article presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic in-line annotation of concepts and relations. A comprehensive set of de facto standard knowledge bases are integrated and indexed to provide straightforward concept normalization features. Real-time collaboration and conversation functionalities allow discussing details of the annotation task as well as providing instant feedback of curator’s interactions. Egas also provides interfaces for on-demand management of the annotation task settings and guidelines, and supports standard formats and literature services to import and export documents. By taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein–protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, it obtained positive scores in terms of usability, reliability and performance. These results, together with the provided innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases and annotated resources.

Database URL: http://bioinformatics.ua.pt/egas

Read this article and/or visit the webpage and tell me this doesn’t have topic map editor written all over it!

Domain specific to be sure but any decent interface for authoring topic maps is going to be domain specific.

Very, very impressive!

I am following up with the team to check on the availability of the software.

A controlled vocabulary for pathway entities and events

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 7:24 pm

A controlled vocabulary for pathway entities and events by Steve Jupe, et al.

Abstract:

Entities involved in pathways and the events they participate in require descriptive and unambiguous names that are often not available in the literature or elsewhere. Reactome is a manually curated open-source resource of human pathways. It is accessible via a website, available as downloads in standard reusable formats and via Representational State Transfer (REST)-ful and Simple Object Access Protocol (SOAP) application programming interfaces (APIs). We have devised a controlled vocabulary (CV) that creates concise, unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve. The CV could be reapplied in any situation where names are used for pathway entities and events. Adoption of this CV would significantly improve naming consistency and readability, with consequent benefits for searching and data mining within and between databases.

Database URL: http://www.reactome.org

There is no doubt that “unambiguous and unique names for reactions (pathway events) and all the molecular entities they involve” would have all the benefits listed by the authors.

Unfortunately, the experience of the HUGO Gene Nomenclature Committee, for example, has been that “other” names for genes come into use before the HUGO designation is created, making the HUGO designation only one of several names a gene may have.

Another phrase for “universal name” is “an additional name.”

It is an impressive effort and should be useful in disambiguating the additional names for pathway entities and events.

FYI, from the homepage of the database:

Reactome is a free, open-source, curated and peer reviewed pathway database. Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modeling, systems biology and education.
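
Since the abstract points to the REST API, here is a minimal sketch of pulling one entry and printing the name assigned to it. The ContentService-style endpoint, the JSON field names and the example identifier are my assumptions, not taken from the article, so treat this as a starting point only.

    # Minimal sketch: fetch a Reactome entry and print its assigned name.
    # ASSUMPTIONS: the ContentService endpoint, the JSON fields (displayName,
    # schemaClass) and the example identifier are guesses, not from the article.
    import json
    import urllib.request

    def fetch_entry(stable_id):
        url = "https://reactome.org/ContentService/data/query/" + stable_id
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    if __name__ == "__main__":
        entry = fetch_entry("R-HSA-109582")  # illustrative identifier
        print(entry.get("displayName"), "|", entry.get("schemaClass"))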

June 2, 2014

openFDA

Filed under: Government,Government Data,Medical Informatics,Open Access,Open Data — Patrick Durusau @ 4:30 pm

openFDA

Not all the news out of government is bad.

Consider openFDA which is putting

More than 3 million adverse drug event reports at your fingertips.

From the “about” page:

OpenFDA is an exciting new initiative in the Food and Drug Administration’s Office of Informatics and Technology Innovation spearheaded by FDA’s Chief Health Informatics Officer. OpenFDA offers easy access to FDA public data and highlight projects using these data in both the public and private sector to further regulatory or scientific missions, educate the public, and save lives.

What does it do?

OpenFDA provides API and raw download access to a number of high-value structured datasets. The platform is currently in public beta with one featured dataset, FDA’s publically available drug adverse event reports.

In the future, openFDA will provide a platform for public challenges issued by the FDA and a place for the community to interact with each other and FDA domain experts with the goal of spurring innovation around FDA data.

We’re currently focused on working on datasets in the following areas:

  • Adverse Events: FDA’s publically available drug adverse event reports, a database that contains millions of adverse event and medication error reports submitted to FDA covering all regulated drugs.
  • Recalls (coming soon): Enforcement Report and Product Recalls Data, containing information gathered from public notices about certain recalls of FDA-regulated products
  • Documentation (coming soon): Structured Product Labeling Data, containing detailed product label information on many FDA-regulated product

We’ll be releasing a number of updates and additional datasets throughout the upcoming months.

OK, I’m Twitter follower #522 @openFDA.

What’s your @openFDA number?

A good experience, i.e., people making good use of released data, asking for more data, etc., is what will drive more open data. Make every useful government data project count.
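
Part of making it count is actually using the API. A minimal sketch of querying the adverse event endpoint follows; the endpoint is the one openFDA documents, but the search field I use (patient.drug.medicinalproduct) is my assumption about the report schema, so verify it against the API reference.

    # Minimal sketch: count adverse event reports mentioning a drug via openFDA.
    # The search field (patient.drug.medicinalproduct) is my assumption about
    # the report schema -- check it against the openFDA API reference.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://api.fda.gov/drug/event.json"

    def adverse_events(drug_name, limit=5):
        params = {
            "search": 'patient.drug.medicinalproduct:"%s"' % drug_name,
            "limit": str(limit),
        }
        url = BASE + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    if __name__ == "__main__":
        reports = adverse_events("aspirin")
        print(reports["meta"]["results"]["total"], "matching reports")
        print(len(reports["results"]), "returned in this page")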

May 10, 2014

Self organising hypothesis networks

Filed under: Medical Informatics,Networks,Self-Organizing — Patrick Durusau @ 3:51 pm

Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge by Thierry Hanser, et al. (Journal of Cheminformatics 2014, 6:21)

Abstract:

Background

Combining different sources of knowledge to build improved structure activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework to interoperate between learning techniques. Most of the current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility to directly combine these sources at the knowledge level, with the aim to harvest potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models.

Results

To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into hierarchical network. This unification permits to combine different sources of knowledge into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity.

Conclusion

It is possible to represent knowledge in the unified form of a hypothesis network allowing interpretable predictions with performances comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.

One interesting feature of this publication is a graphic abstract:

[Graphic abstract image from the paper]

Assuming one could control the length of the graphic abstracts, that would be an interesting feature for conference papers.

What should be the icon for repeating old news before getting to the new stuff? 😉

Among a number of good points in this paper, see in particular:

  • Distinction between SOHN and “a Galois lattice used in Formal Concept
    Analysis [19] (FCA)” (at page 10).
  • Discussion of the transparency of this approach at page 21.

In a very real sense, announcing an answer to a medical question may be welcome, but it isn’t very informative. Nor will it enable others to advance the medical arts.

What other domains are there where answers are important, but how you arrived at the answer is equally important, if not more so?

May 3, 2014

Facts vs. Expert Opinion

Filed under: Measurement,Medical Informatics — Patrick Durusau @ 4:31 pm

In a recent story about randomized medical trials:

“I should leave the final word to Archie Cochrane. In his trial of coronary care units, run in the teeth of vehement opposition, early results suggested that home care was at the time safer than hospital care. Mischievously, Cochrane swapped the results round, giving the cardiologists the (false) message that their hospitals were best all along.

“They were vociferous in their abuse,” he later wrote, and demanded that the “unethical” trial stop immediately. He then revealed the truth and challenged the cardiologists to close down their own hospital units without delay. “There was dead silence.”

Followed by Harford’s closing line: “The world often surprises even the experts. When considering an intervention that might profoundly affect people’s lives, if there is one thing more unethical than running a randomised trial, it’s not running the trial”

One of the persistent dangers of randomized trials is that the results can contradict what is “known” to be true by experts.

Another reason for user rather than c-suite “testing” of product interfaces, assuming the c-suite types are willing to hear “bad” news.

And a good illustration that claims of “ethics” can be hiding less pure concerns.

I first saw this in A brilliant anecdote on how scientists react to science against their interests by Chris Blattman, which led me to: Weekly Links May 2: Mobile phones, working with messy data, funding, working with children, and more… and thence to the original post: The random risks of randomised trials by Tim Harford.

February 23, 2014

Understanding UMLS

Filed under: Bioinformatics,Medical Informatics,PubMed,UMLS — Patrick Durusau @ 6:02 pm

Understanding UMLS by Sujit Pal.

From the post:

I’ve been looking at Unified Medical Language System (UMLS) data this last week. The medical taxonomy we use at work is partly populated from UMLS, so I am familiar with the data, but only after it has been processed by our Informatics team. The reason I was looking at it is because I am trying to understand Apache cTakes, an open source NLP pipeline for the medical domain, which uses UMLS as one of its inputs.

UMLS is provided by the National Library of Medicine (NLM), and consists of 3 major parts: the Metathesaurus, consisting of over 1M medical concepts, a Semantic Network to categorize concepts by semantic type, and a Specialist Lexicon containing data to help do NLP on medical text. In addition, I also downloaded the RxNorm database that contains drug/medication information. I found that the biggest challenge was accessing the data, so I will describe that here, and point you to other web resources for the data descriptions.

Before getting the data, you have to sign up for a license with UMLS Terminology Services (UTS) – this is a manual process and can take a few days over email (I did this couple of years ago so details are hazy). UMLS data is distributed as .nlm files which can (as far as I can tell) be opened and expanded only by the Metamorphosis (mmsys) downloader, available on the UMLS download page. You need to run the following sequence of steps to capture the UMLS data into a local MySQL database. You can use other databases as well, but you would have to do a bit more work.

….

The table and column names are quite cryptic and the relationships are not evident from the tables. You will need to refer to the data dictionaries for each system to understand it before you do anything interesting with the data. Here are the links to the online references that describe the tables and their relationships for each system better than I can.

I have only captured the highlights from Sujit’s post so see his post for additional details.

There has been no small amount of time and effort invested in UMLS. That names are cryptic and relationships are left unspecified is more typical of data than any other state.

Take the opportunity to learn about UMLS and to ponder what solutions you would offer.
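
If you want a feel for just how cryptic the tables are, a minimal sketch of pulling the English names for a single concept out of the pipe-delimited MRCONSO.RRF file is below. The column positions follow my reading of the UMLS documentation and should be checked against the data dictionaries Sujit links to.

    # Minimal sketch: collect the English names for one UMLS concept from
    # MRCONSO.RRF, grouped by source vocabulary. The file is pipe-delimited;
    # the column positions used here (0 = CUI, 1 = language, 11 = source
    # vocabulary, 14 = name string) follow my reading of the UMLS documentation,
    # so verify them against the data dictionaries.
    from collections import defaultdict

    def names_by_source(path, cui):
        names = defaultdict(set)
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("|")
                if fields[0] == cui and fields[1] == "ENG":
                    names[fields[11]].add(fields[14])
        return names

    if __name__ == "__main__":
        # C0011849 is the sort of CUI assigned to "diabetes mellitus";
        # treat it as an illustrative value.
        for source, strings in sorted(names_by_source("MRCONSO.RRF", "C0011849").items()):
            print(source, sorted(strings)[:3])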

February 9, 2014

Medical research—still a scandal

Filed under: Medical Informatics,Open Access,Open Data,Research Methods — Patrick Durusau @ 5:45 pm

Medical research—still a scandal by Richard Smith.

From the post:

Twenty years ago this week the statistician Doug Altman published an editorial in the BMJ arguing that much medical research was of poor quality and misleading. In his editorial entitled, “The Scandal of Poor Medical Research,” Altman wrote that much research was “seriously flawed through the use of inappropriate designs, unrepresentative samples, small samples, incorrect methods of analysis, and faulty interpretation.” Twenty years later I fear that things are not better but worse.

Most editorials like most of everything, including people, disappear into obscurity very fast, but Altman’s editorial is one that has lasted. I was the editor of the BMJ when we published the editorial, and I have cited Altman’s editorial many times, including recently. The editorial was published in the dawn of evidence based medicine as an increasing number of people realised how much of medical practice lacked evidence of effectiveness and how much research was poor. Altman’s editorial with its concise argument and blunt, provocative title crystallised the scandal.

Why, asked Altman, is so much research poor? Because “researchers feel compelled for career reasons to carry out research that they are ill equipped to perform, and nobody stops them.” In other words, too much medical research was conducted by amateurs who were required to do some research in order to progress in their medical careers.

Ethics committees, who had to approve research, were ill equipped to detect scientific flaws, and the flaws were eventually detected by statisticians, like Altman, working as firefighters. Quality assurance should be built in at the beginning of research not the end, particularly as many journals lacked statistical skills and simply went ahead and published misleading research.

If you are thinking things are better today, consider a further comment from Richard:

The Lancet has this month published an important collection of articles on waste in medical research. The collection has grown from an article by Iain Chalmers and Paul Glasziou in which they argued that 85% of expenditure on medical research ($240 billion in 2010) is wasted. In a very powerful talk at last year’s peer review congress John Ioannidis showed that almost none of thousands of research reports linking foods to conditions are correct and how around only 1% of thousands of studies linking genes with diseases are reporting linkages that are real. His famous paper “Why most published research findings are false” continues to be the most cited paper of PLoS Medicine.

Not that I think open access would be a panacea for poor research quality but at least it would provide the opportunity for discovery.

All this talk about medical research reminds me of DARPA’s Big Mechanism program. Assuming the research data on pathways is no better or worse than that mapping genes to diseases, DARPA will be spending $42 million to mine data with 1% accuracy.

A better use of those “Big Mechanism” dollars would be to test solutions to produce better medical research for mining.

1% sounds like low-grade ore to me.

January 31, 2014

Open Science Leaps Forward! (Johnson & Johnson)

Filed under: Bioinformatics,Biomedical,Data,Medical Informatics,Open Data,Open Science — Patrick Durusau @ 11:15 am

In Stunning Win For Open Science, Johnson & Johnson Decides To Release Its Clinical Trial Data To Researchers by Matthew Herper.

From the post:

Drug companies tend to be secretive, to say the least, about studies of their medicines. For years, negative trials would not even be published. Except for the U.S. Food and Drug Administration, nobody got to look at the raw information behind those studies. The medical data behind important drugs, devices, and other products was kept shrouded.

Today, Johnson & Johnson is taking a major step toward changing that, not only for drugs like the blood thinner Xarelto or prostate cancer pill Zytiga but also for the artificial hips and knees made for its orthopedics division or even consumer products. “You want to know about Listerine trials? They’ll have it,” says Harlan Krumholz of Yale University, who is overseeing the group that will release the data to researchers.

….

Here’s how the process will work: J&J has enlisted The Yale School of Medicine’s Open Data Access Project (YODA) to review requests from physicians to obtain data from J&J products. Initially, this will only include products from the drug division, but it will expand to include devices and consumer products. If YODA approves a request, raw, anonymized data will be provided to the physician. That includes not just the results of a study, but the results collected for each patient who volunteered for it with identifying information removed. That will allow researchers to re-analyze or combine that data in ways that would not have been previously possible.

….

Scientists can make a request for data on J&J drugs by going to www.clinicaltrialstudytransparency.com.

The ability to “…re-analyze or combine that data in ways that would not have been previously possible…” is the public benefit of Johnson & Johnson’s sharing of data.

With any luck, this will be the start of a general trend among drug companies.

Mappings of the semantics of such data sets should be contributed back to the Yale School of Medicine’s Open Data Access Project (YODA), to further enhance re-use of these data sets.

December 24, 2013

Resource Identification Initiative

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 4:30 pm

Resource Identification Initiative

From the webpage:

We are starting a pilot project, sponsored by the Neuroscience Information Framework and the International Neuroinformatics Coordinating Facility, to address the issue of proper resource identification within the neuroscience (actually biomedical) literature. We have now christened this project the Resource Identification Initiative (hastag #RII) and expanded the scope beyond neuroscience. This project is designed to make it easier for researchers to identify the key resources (materials, data, tools) used to produce the scientific findings within a published study and to find other studies that used the same resources. It is also designed to make it easier for resource providers to track usage of their resources and for funders to measure impacts of resource funding. The requirements are that key resources are identified in such a manner that they are identified uniquely and are:

1) Machine readable;

2) Are available outside the paywall;

3) Are uniform across publishers and journals. We are seeking broad input from the FORCE11 community to ensure that we come up with a solution that represents the best thinking available on these topics.

The pilot project was an outcome of a meeting held at the NIH on Jun 26th. A draft report from the June 26th Resource Identification meeting at the NIH is now available. As the report indicates, we have preliminary agreements from journals and publishers to implement a pilot project. We hope to extend this project well beyond the neuroscience literature, so please join this group if you are interested in participating.

….

Yes, another “unique identifier” project.

Don’t get me wrong, to the extent that a unique vocabulary can be developed and used, that’s great.

But it does not address:

  • tools/techniques/data that existed before the unique vocabulary came into existence
  • future tools/techniques/data that isn’t covered by the unique vocabulary
  • mappings between old, current and future tools/techniques/data

The project is trying to address a real need in neuroscience journals (lack of robust identification of organisms or antibodies).

If you have the time and interest, it is a worthwhile project that needs to consider the requirements for “robust” identification.
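
The third bullet, mappings between old, current and future identifiers, is where topic-map-style thinking pays off. Here is a minimal sketch of a stand-off mapping from legacy resource names to a unified identifier; every identifier and name below is invented for illustration, not drawn from the RII pilot.

    # Minimal sketch: a stand-off mapping from legacy resource names to one
    # unified identifier, so that papers using different names can still be
    # matched up. All identifiers and names are invented for illustration.
    LEGACY_TO_ID = {
        "anti-GFAP antibody (lot 1234)": "RES:0001",
        "GFAP antibody, Acme Biologics": "RES:0001",
        "C57BL/6 mouse": "RES:0002",
        "C57BL/6J": "RES:0002",
    }

    def resolve(name):
        """Return the unified identifier for a legacy name, or None if unmapped."""
        return LEGACY_TO_ID.get(name)

    def shared_resources(paper_a_names, paper_b_names):
        """Which resources do two papers share, once their names are mapped?"""
        ids_a = {resolve(n) for n in paper_a_names} - {None}
        ids_b = {resolve(n) for n in paper_b_names} - {None}
        return ids_a & ids_b

    if __name__ == "__main__":
        print(shared_resources(["anti-GFAP antibody (lot 1234)", "C57BL/6J"],
                               ["GFAP antibody, Acme Biologics"]))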

December 3, 2013

Project Tycho:… [125 Years of Disease Records]

Filed under: Health care,Medical Informatics — Patrick Durusau @ 4:33 pm

Project Tycho: Data for Health

From the webpage:

After four years of data digitization and processing, the Project Tycho™ Web site provides open access to newly digitized and integrated data from the entire 125 years history of United States weekly nationally notifiable disease surveillance data since 1888. These data can now be used by scientists, decision makers, investors, and the general public for any purpose. The Project Tycho™ aim is to advance the availability and use of public health data for science and decision making in public health, leading to better programs and more efficient control of diseases.

Three levels of data have been made available: Level 1 data include data that have been standardized for specific analyses, Level 2 data include standardized data that can be used immediately for analysis, and Level 3 data are raw data that cannot be used for analysis without extensive data management. See the video tutorial.

An interesting factoid concerning disease reporting in the United States, circa 1917: influenza was not a reportable disease. (The Great Influenza by John Barry.)

I am curious about the Level 3 data.

Mostly in terms of how much “data management” would be needed to make it useful?

Could be a window into the data management required to unify medical records in the United States.

Or simply a way to practice your data management skills.
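
For practice, a minimal sketch of the kind of normalization Level 3 data is likely to need before analysis. The column names, disease spellings and file name are hypothetical; the real Tycho files have their own layout.

    # Minimal sketch: the sort of cleanup raw (Level 3) surveillance data needs
    # before analysis. The column names, disease spellings and file name are
    # hypothetical -- the real Tycho files have their own layout.
    import pandas as pd

    DISEASE_ALIASES = {
        "SMALL POX": "SMALLPOX",
        "SMALL-POX": "SMALLPOX",
        "INFLUENZA (GRIPPE)": "INFLUENZA",
    }

    def clean(path):
        raw = pd.read_csv(path)
        raw["disease"] = (raw["disease"].str.strip().str.upper()
                          .replace(DISEASE_ALIASES))
        raw["cases"] = pd.to_numeric(raw["cases"], errors="coerce")
        return raw.dropna(subset=["cases"])

    if __name__ == "__main__":
        tidy = clean("tycho_level3_sample.csv")
        print(tidy.groupby("disease")["cases"].sum())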

December 2, 2013

NIH deposits first batch of genomic data for Alzheimer’s disease

Filed under: Bioinformatics,Genomics,Medical Informatics — Patrick Durusau @ 5:44 pm

NIH deposits first batch of genomic data for Alzheimer’s disease

From the post:

Researchers can now freely access the first batch of genome sequence data from the Alzheimer’s Disease Sequencing Project (ADSP), the National Institutes of Health (NIH) announced today. The ADSP is one of the first projects undertaken under an intensified national program of research to prevent or effectively treat Alzheimer’s disease.

The first data release includes data from 410 individuals in 89 families. Researchers deposited completed WGS data on 61 families and have deposited WGS data on parts of the remaining 28 families, which will be completed soon. WGS determines the order of all 3 billion letters in an individual’s genome. Researchers can access the sequence data at dbGaP or the National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS), https://www.niagads.org.

“Providing raw DNA sequence data to a wide range of researchers proves a powerful crowd-sourced way to find genomic changes that put us at increased risk for this devastating disease,” said NIH Director, Francis S. Collins, M.D., Ph.D., who announced the start of the project in February 2012. “The ADSP is designed to identify genetic risks for late-onset of Alzheimer’s disease, but it could also discover versions of genes that protect us. These insights could lead to a new era in prevention and treatment.”

As many as 5 million Americans 65 and older are estimated to have Alzheimer’s disease, and that number is expected to grow significantly with the aging of the baby boom generation. The National Alzheimer’s Project Act became law in 2011 in recognition of the need to do more to combat the disease. The law called for upgrading research efforts by the public and private sectors, as well as expanding access to and improving clinical and long term care. One of the first actions taken by NIH under Alzheimer’s Act was the allocation of additional funding in fiscal 2012 for a series of studies, including this genome sequencing effort. Today’s announcement marks the first data release from that project.

You will need to join with or enlist in an open project with bioinformatics and genomics expertise to make a contribution, but the data is “out there.”

Not to mention the need to integrate existing medical literature, legacy data from prior patients, drug trials, etc., despite the usual semantic confusion among them.

August 4, 2013

Building Smaller Data

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:41 am

Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining by Jay Jarman, MS and Donald J. Berndt, PhD.

Abstract:

The purpose of this research is to answer the question, can medically-relevant terms be extracted from text notes and text mined for the purpose of classification and obtain equal or better results than text mining the original note? A novel method is used to extract medically-relevant terms for the purpose of text mining. A dataset of 5,009 EMR text notes (1,151 related to falls) was obtained from a Veterans Administration Medical Center. The dataset was processed with a natural language processing (NLP) application which extracted concepts based on SNOMED-CT terms from the Unified Medical Language System (UMLS) Metathesaurus. SAS Enterprise Miner was used to text mine both the set of complete text notes and the set represented by the extracted concepts. Logistic regression models were built from the results, with the extracted concept model performing slightly better than the complete note model.

The researchers created two datasets: one composed of the original text medical notes and a second composed of named entities extracted using NLP and medical vocabularies.

The named entity only dataset was found to perform better than the full text mining approach.

A smaller data set that had a higher performance than the larger data set of notes.

Wait! Isn’t that backwards? I thought “big data” was always better than “smaller data?”

Maybe not?

Maybe having the “right” dataset is better than having a “big data” set.
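
The shape of the comparison is easy to reproduce. A minimal sketch follows, with scikit-learn standing in for SAS Enterprise Miner and a toy keyword list standing in for the SNOMED-CT/UMLS concept extraction, so this shows the structure of the experiment rather than the authors’ pipeline.

    # Minimal sketch of the experiment's shape: classify notes as fall-related
    # using (a) the full text and (b) only "medically-relevant" terms.
    # scikit-learn replaces SAS Enterprise Miner and a toy keyword list stands
    # in for the SNOMED-CT/UMLS concept extraction used in the paper.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    MEDICAL_TERMS = {"fall", "fracture", "dizziness", "gait", "syncope"}  # toy list

    def concepts_only(note):
        return " ".join(w for w in note.lower().split() if w in MEDICAL_TERMS)

    def score(texts, labels):
        features = CountVectorizer().fit_transform(texts)
        return cross_val_score(LogisticRegression(max_iter=1000),
                               features, labels, cv=3).mean()

    if __name__ == "__main__":
        notes = ["patient had a fall with hip fracture",
                 "routine visit no complaints noted",
                 "reports dizziness and unsteady gait",
                 "medication refill requested today",
                 "syncope episode followed by a fall",
                 "follow up for blood pressure check"]
        labels = [1, 0, 1, 0, 1, 0]
        print("full text:    ", score(notes, labels))
        print("concepts only:", score([concepts_only(n) for n in notes], labels))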

July 3, 2013

CHD@ZJU…

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:37 am

CHD@ZJU: a knowledgebase providing network-based research platform on coronary heart disease by Leihong Wu, Xiang Li, Jihong Yang, Yufeng Liu, Xiaohui Fan and Yiyu Cheng. (Database (2013) 2013 : bat047 doi: 10.1093/database/bat047)

From the webpage:

Abstract:

Coronary heart disease (CHD), the leading cause of global morbidity and mortality in adults, has been reported to be associated with hundreds of genes. A comprehensive understanding of the CHD-related genes and their corresponding interactions is essential to advance the translational research on CHD. Accordingly, we construct this knowledgebase, CHD@ZJU, which records CHD-related information (genes, pathways, drugs and references) collected from different resources and through text-mining method followed by manual confirmation. In current release, CHD@ZJU contains 660 CHD-related genes, 45 common pathways and 1405 drugs accompanied with >8000 supporting references. Almost half of the genes collected in CHD@ZJU were novel to other publicly available CHD databases. Additionally, CHD@ZJU incorporated the protein–protein interactions to investigate the cross-talk within the pathways from a multi-layer network view. These functions offered by CHD@ZJU would allow researchers to dissect the molecular mechanism of CHD in a systematic manner and therefore facilitate the research on CHD-related multi-target therapeutic discovery.

Database URL: http://tcm.zju.edu.cn/chd/

The article outlines the construction of CHD@ZJU as follows:

Figure 1.
Procedure for CHD@ZJU construction. CHD-related genes were extracted with text-mining technique and manual confirmation. PPI, pathway and drugs information were then collected from public resources such as KEGG and HPRD. Interactome network of every pathway was constructed based on their corresponding genes and related PPIs, and the whole CHD diseasome network was then constructed with all CHD-related genes. With CHD@ZJU, users could find information related to CHD from gene, pathway and the whole biological network level.

While the process is assisted by computer technology, there is a manual confirmation step that binds all the information together.
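
The construction step in Figure 1 is straightforward to mock up. A minimal sketch with networkx of building a small disease network from gene lists plus protein–protein interactions; the gene names, interactions and pathway groupings are placeholders, not data from CHD@ZJU.

    # Minimal sketch: build a pathway/disease network from gene lists plus
    # protein-protein interactions, in the spirit of Figure 1. The gene names
    # and PPI pairs are placeholders, not data taken from CHD@ZJU.
    import networkx as nx

    chd_genes = {"APOE", "LDLR", "PCSK9", "IL6", "TNF"}        # placeholder genes
    ppi_edges = [("APOE", "LDLR"), ("LDLR", "PCSK9"),
                 ("IL6", "TNF"), ("TNF", "APOE")]              # placeholder PPIs
    pathways = {"lipid metabolism": {"APOE", "LDLR", "PCSK9"},
                "inflammation": {"IL6", "TNF"}}

    def build_network(genes, edges):
        graph = nx.Graph()
        graph.add_nodes_from(genes)
        graph.add_edges_from((a, b) for a, b in edges if a in genes and b in genes)
        return graph

    if __name__ == "__main__":
        diseasome = build_network(chd_genes, ppi_edges)
        for name, members in pathways.items():
            sub = diseasome.subgraph(members)
            print(name, "-", sub.number_of_nodes(), "genes,",
                  sub.number_of_edges(), "interactions")
        # Cross-talk: edges connecting genes that sit in different pathways.
        cross = [(a, b) for a, b in diseasome.edges()
                 if not any(a in m and b in m for m in pathways.values())]
        print("cross-talk edges:", cross)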

May 30, 2013

Medicare Provider Charge Data

Filed under: Dataset,Health care,Medical Informatics — Patrick Durusau @ 2:47 pm

Medicare Provider Charge Data

From the webpage:

As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.

The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deducible amount and DRG outlier amount.

For these DRGs, average charges and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amount charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.

Data are being made available in Microsoft Excel (.xlsx) format and comma separated values (.csv) format.

Inpatient Charge Data, FY2011, Microsoft Excel version
Inpatient Charge Data, FY2011, Comma Separated Values (CSV) version

A nice start towards a useful data set.
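
A minimal sketch of the kind of comparison the data invites, using pandas. The column names and file name are my guesses at the CSV layout; check them against the actual download before relying on this.

    # Minimal sketch: compare average hospital charges for one DRG across
    # providers. The column names and file name below are guesses at the CSV
    # layout -- verify them against the actual FY2011 file.
    import pandas as pd

    def charge_spread(path, drg_substring):
        charges = pd.read_csv(path)
        one_drg = charges[charges["DRG Definition"].str.contains(drg_substring,
                                                                 case=False)]
        by_provider = one_drg.groupby("Provider Name")["Average Covered Charges"]
        return by_provider.mean().sort_values()

    if __name__ == "__main__":
        spread = charge_spread("Medicare_Provider_Charge_Inpatient_FY2011.csv",
                               "HEART FAILURE")
        print("cheapest:", spread.head(3), sep="\n")
        print("most expensive:", spread.tail(3), sep="\n")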

The next step would be tying identifiable physicians to the medical procedures and tests they order.

The only times I have arrived at a hospital by ambulance, I never thought to ask for a comparison of their prices with other local hospitals. Nor did I see any signs advertising discounts on particular procedures.

Have you?

Let’s not pretend medical care is a consumer market, where “consumers” are penalized for not being good shoppers.

I first saw this at Nathan Yau’s Medicare provider charge data released.

May 17, 2013

A self-updating road map of The Cancer Genome Atlas

Filed under: Bioinformatics,Biology,Biomedical,Medical Informatics,RDF,Semantic Web,SPARQL — Patrick Durusau @ 4:33 pm

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)

Abstract:

Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?
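
To make “query the index with SPARQL” concrete, here is a minimal sketch with rdflib over a tiny made-up graph of file metadata; the namespace, predicates and values are mine for illustration, not the Roadmap’s vocabulary.

    # Minimal sketch: describe a few data files as RDF, then select a subset
    # with SPARQL, in the spirit of the TCGA Roadmap engine. The namespace,
    # predicates and values are invented -- not the Roadmap's vocabulary.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/tcga/")

    graph = Graph()
    for name, cancer, platform in [("file1", "BRCA", "RNASeq"),
                                   ("file2", "BRCA", "Methylation"),
                                   ("file3", "GBM", "RNASeq")]:
        graph.add((EX[name], EX.cancerType, Literal(cancer)))
        graph.add((EX[name], EX.platform, Literal(platform)))

    QUERY = """
    PREFIX ex: <http://example.org/tcga/>
    SELECT ?file WHERE {
      ?file ex:cancerType "BRCA" .
      ?file ex:platform "RNASeq" .
    }
    """

    for row in graph.query(QUERY):
        print(row.file)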

May 2, 2013

FindZebra

Filed under: Medical Informatics,Search Algorithms,Search Engines,Searching — Patrick Durusau @ 4:48 am

FindZebra

From the about page:

FindZebra is a specialised search engine supporting medical professionals in diagnosing difficult patient cases. Rare diseases are especially difficult to diagnose and this online medical search engines comes in support of medical personnel looking for diagnostic hypotheses. With a simple and consistent interface across all devices, it can be easily used as an aid tool at the time and place where medical decisions are made. The retrieved information is collected from reputable sources across the internet storing public medical articles on rare and genetic diseases.

A search engine with: WARNING! This is a research project to be used only by medical professionals.

To avoid overwhelming researchers with search result “noise,” FindZebra deliberately restricts the content it indexes.

It is an illustration of the crudeness of current search algorithms that altering the inputs is the easiest way to improve outcomes for particular types of searches.

That seems to be an argument in favor of smaller-than-enterprise search engines, which could roll up into broader search applications.

Of course, with a topic map you could retain the division between departments even as you roll up the content into broader search applications.

April 28, 2013

Scientific Lenses over Linked Data… [Operational Equivalence]

Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. by Christian Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J G Gray, Paul Groth, Steve Pettifer, Robert Stevens, Antony J Williams, and Egon L Willighagen.

Abstract:

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which govern the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired affects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

Obvious questions:

Does your topic map software support multiple operational equivalences?

Does your topic map interface enable users to choose “lenses” (I like lenses better than roles) to view equivalence?

Does your topic map software support declaring the nature of equivalence?

I first saw this in the slide deck: Scientific Lenses: Supporting Alternative Views of the Data by Alasdair J G Gray at: 4th Open PHACTS Community Workshop.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
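
Peter’s suggestion is easy to sketch: model the links as typed edges and let a “lens” activate only the types a given task cares about. A minimal example with networkx; the datasets, identifiers and link types are invented for illustration.

    # Minimal sketch: model equivalence links as typed edges and apply a "lens"
    # that activates only the link types a given task cares about. The datasets,
    # identifiers and link types are invented for illustration.
    import networkx as nx

    links = nx.Graph()
    links.add_edge("chembl:123", "drugbank:DB0001", kind="same_compound")
    links.add_edge("drugbank:DB0001", "uniprot:P999", kind="binds_target")
    links.add_edge("chembl:123", "chembl:124", kind="same_parent_molecule")

    def lens(graph, active_kinds):
        """Return the subgraph containing only the activated link types."""
        keep = [(a, b) for a, b, d in graph.edges(data=True)
                if d["kind"] in active_kinds]
        return graph.edge_subgraph(keep)

    if __name__ == "__main__":
        chemistry_lens = lens(links, {"same_compound", "same_parent_molecule"})
        pharmacology_lens = lens(links, {"same_compound", "binds_target"})
        print(sorted(chemistry_lens.edges()))
        print(sorted(pharmacology_lens.edges()))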

4th Open PHACTS Community Workshop (slides) [Operational Equivalence]

Filed under: Bioinformatics,Biomedical,Drug Discovery,Linked Data,Medical Informatics — Patrick Durusau @ 12:24 pm

4th Open PHACTS Community Workshop : Using the power of Open PHACTS

From the post:

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

April 10, 2013

Apache cTAKES

Apache cTAKES

From the webpage:

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS) – medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).

Apache cTAKES was built using the Apache UIMA Unstructured Information Management Architecture engineering framework and Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain out of diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.

Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:

  1. Sentence boundary detection
  2. Tokenization (rule-based)
  3. Morphologic normalization
  4. POS tagging
  5. Shallow parsing
  6. Named Entity Recognition
    • Dictionary mapping
    • Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
  7. Assertion module
  8. Dependency parser
  9. Constituency parser
  10. Semantic Role Labeler
  11. Coreference resolver
  12. Relation extractor
  13. Drug Profile module
  14. Smoking status classifier

The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level.
The cTAKES community is committed to best practices and R&D (research and development) by using cutting edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.

Processing a text with cTAKES is a process of adding semantic information to the text.

As you can imagine, the better the semantics that are added, the better searching and other functions become.
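
The attribute list above maps naturally onto a simple data structure. A minimal sketch in Python of the kind of annotation cTAKES attaches to a note; this is a mock-up for discussion, not cTAKES’ actual (UIMA/Java) type system, and the example spans and CUIs are illustrative.

    # Minimal sketch: the kind of named-entity annotation cTAKES attaches to
    # clinical text -- text span, ontology code, subject and context. A mock-up
    # for discussion, not cTAKES' actual UIMA/Java type system; CUIs illustrative.
    from dataclasses import dataclass

    @dataclass
    class ClinicalAnnotation:
        begin: int
        end: int
        covered_text: str
        ontology_code: str   # e.g. a UMLS CUI
        subject: str         # patient, family_member, ...
        negated: bool
        uncertain: bool

    note = "Patient denies chest pain. Mother had diabetes."
    annotations = [
        ClinicalAnnotation(15, 25, "chest pain", "C0008031", "patient",
                           negated=True, uncertain=False),
        ClinicalAnnotation(38, 46, "diabetes", "C0011849", "family_member",
                           negated=False, uncertain=False),
    ]

    for ann in annotations:
        status = "negated" if ann.negated else "asserted"
        print(ann.covered_text, ann.ontology_code, ann.subject, status)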

In order to make added semantic information interoperable, well, that’s a topic map question.

I first saw this in a tweet by Tim O’Reilly.

April 7, 2013

Open PHACTS

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. 😉

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

March 23, 2013

Using Bayesian networks to discover relations…

Filed under: Bayesian Data Analysis,Bayesian Models,Bioinformatics,Medical Informatics — Patrick Durusau @ 3:33 pm

Using Bayesian networks to discover relations between genes, environment, and disease by Chengwei Su, Angeline Andrew, Margaret R Karagas and Mark E Borsuk. (BioData Mining 2013, 6:6 doi:10.1186/1756-0381-6-6)

Abstract:

We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactical purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.

From the introduction:

BNs have been applied in a variety of settings for the purposes of causal study and probabilistic prediction, including medical diagnosis, crime and terrorism risk, forensic science, and ecological conservation (see [7]). In bioinformatics, they have been used to analyze gene expression data [8,9], derive protein signaling networks [10-12], predict protein-protein interactions [13], perform pedigree analysis [14], conduct genetic epidemiological studies [5], and assess the performance of microsatellite markers on cancer recurrence [15].

Not to mention criminal investigations: Bayesian Network – [Crime Investigation] (Youtube). 😉

Once relations are discovered, you are free to decorate them with roles, properties, etc., in other words, associations.
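
For a feel of what “learning the structure of a network from observational data” involves, here is a minimal sketch that scores two candidate structures over simulated gene/environment/disease data with a hand-rolled BIC. The data are simulated and the scoring deliberately bare-bones; a real analysis would use a dedicated Bayesian network package.

    # Minimal sketch: compare two candidate Bayesian network structures over
    # simulated gene/environment/disease data using a hand-rolled BIC score.
    # Simulated data and bare-bones scoring, for illustration only.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    size = 2000
    gene = rng.integers(0, 2, size)
    env = rng.integers(0, 2, size)
    disease = rng.binomial(1, 0.1 + 0.3 * gene + 0.3 * env)  # depends on both
    data = pd.DataFrame({"gene": gene, "env": env, "disease": disease})

    def bic(data, structure):
        """structure maps each node to its list of parent nodes."""
        n = len(data)
        score = 0.0
        for node, parents in structure.items():
            if parents:
                joint = data.groupby(list(parents) + [node]).size()
                parent_tot = data.groupby(list(parents)).size()
                for idx, count in joint.items():
                    pa = idx[:-1] if len(parents) > 1 else idx[0]
                    score += count * np.log(count / parent_tot[pa])
            else:
                for count in data.groupby(node).size():
                    score += count * np.log(count / n)
            params = (data[node].nunique() - 1) * int(
                np.prod([data[p].nunique() for p in parents]) if parents else 1)
            score -= 0.5 * np.log(n) * params
        return score

    no_edges = {"gene": [], "env": [], "disease": []}
    gene_env_to_disease = {"gene": [], "env": [], "disease": ["gene", "env"]}
    print("no edges:            ", round(bic(data, no_edges), 1))
    print("gene, env -> disease:", round(bic(data, gene_env_to_disease), 1))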

March 14, 2013

Visualizing the Topical Structure of the Medical Sciences:…

Filed under: Medical Informatics,PubMed,Self Organizing Maps (SOMs),Text Mining — Patrick Durusau @ 2:48 pm

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach by André Skupin, Joseph R. Biberstine, Katy Börner. (Skupin A, Biberstine JR, Börner K (2013) Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach. PLoS ONE 8(3): e58779. doi:10.1371/journal.pone.0058779)

Abstract:

Background

We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues.

Methodology

Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains.

Conclusions

Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.

Impressive work to say the least!

But I was just as impressed by the future avenues for research:

Controlled Vocabularies

It appears that the use of indexer-chosen keywords, including in the case of a large controlled vocabulary-MeSH terms in this study-raises interesting questions. The rank transition diagram in particular helped to highlight the fact that different vocabulary items play different roles in indexers’ attempts to characterize the content of specific publications. The complex interplay of hierarchical relationships and functional roles of MeSH terms deserves further investigation, which may inform future efforts of how specific terms are handled in computational analysis. For example, models constructed from terms occurring at intermediate levels of the MeSH hierarchy might look and function quite different from the top-level model presented here.

User-centered Studies

Future user studies will include term differentiation tasks to help us understand whether/how users can differentiate senses of terms on the self-organizing map. When a term appears prominently in multiple places, that indicates multiple senses or contexts for that term. One study might involve subjects being shown two regions within which a particular label term appears and the abstracts of several papers containing that term. Subjects would then be asked to rate each abstract along a continuum between two extremes formed by the two senses/contexts. Studies like that will help us evaluate how understandable the local structure of the map is.

There are other, equally interesting future research questions but those are the two of most interest to me.

I take this research as evidence that managing semantic diversity is going to require human effort, augmented by automated means.
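
Augmented, for instance, by methods that are quite approachable at small scale. A minimal sketch of training a (tiny) self-organizing map over toy term-count vectors with the minisom package; a 6×6 grid and four fake documents stand in for 75,000 neurons and two million papers.

    # Minimal sketch: train a small self-organizing map over toy document
    # vectors and see which neuron each document lands on. A 6x6 grid and four
    # fake documents stand in for 75,000 neurons and two million papers.
    import numpy as np
    from minisom import MiniSom

    docs = {
        "cardiology A": [5, 0, 1, 0],   # toy term counts
        "cardiology B": [4, 1, 0, 0],
        "oncology A":   [0, 5, 0, 2],
        "oncology B":   [1, 4, 1, 1],
    }
    vectors = np.array(list(docs.values()), dtype=float)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    som = MiniSom(6, 6, input_len=vectors.shape[1], sigma=1.0,
                  learning_rate=0.5, random_seed=42)
    som.random_weights_init(vectors)
    som.train_random(vectors, 500)

    for name, vec in zip(docs, vectors):
        print(name, "->", som.winner(vec))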

I first saw this in Nat Torkington’s Four short links: 13 March 2013.

March 13, 2013

SURAAK – When Search Is Not Enough [A “google” of search results, new metric]

Filed under: Health care,Medical Informatics,Searching — Patrick Durusau @ 2:21 pm

SURAAK – When Search Is Not Enough (video)

A new way to do research. SURAAK is a web application that uses natural language processing techniques to analyze big data of published healthcare articles in the area of geriatrics and senior care. See how SURAAK uses text causality to find and analyze word relationships in this and other areas of interest.

SURAAK = Semantic Understanding Research in the Automatic Acquisition of Knowledge.

NLP based system that extracts “causal” sentences.

Differences from Google (according to the video)

  • Extracts text from PDFs
  • Links concepts together building relationships found in extracted text
  • Links articles together based on shared concepts
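
SURAAK's pipeline is not public, so purely as an illustration of the general idea, causal sentences can be pulled out with a sentence splitter and a list of causal cue phrases, and concepts co-occurring in those sentences can then be linked; everything in the sketch below (cues, concepts, abstract text) is invented:

```python
# Illustrative only: pull out sentences containing causal cue phrases and
# link the concept terms they mention. SURAAK's real NLP pipeline is more
# sophisticated; the cues, concepts and abstract text here are invented.
import re
from collections import defaultdict
from itertools import combinations

CAUSAL_CUES = ("causes", "caused by", "leads to", "results in", "due to")
CONCEPTS = ("falls", "hip fracture", "osteoporosis", "immobility")

def causal_sentences(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if any(cue in s.lower() for cue in CAUSAL_CUES)]

def link_concepts(sentences):
    """Record which concept pairs co-occur in the extracted 'notes'."""
    links = defaultdict(list)
    for s in sentences:
        present = [c for c in CONCEPTS if c in s.lower()]
        for a, b in combinations(sorted(present), 2):
            links[(a, b)].append(s)
    return links

abstract = ("Osteoporosis leads to hip fracture in elderly patients. "
            "Hip fracture often results in prolonged immobility. "
            "Exercise programs were evaluated in three trials.")

notes = causal_sentences(abstract)
for pair, evidence in link_concepts(notes).items():
    print(pair, "->", len(evidence), "supporting sentence(s)")
```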

The search demo was better than using Google, but that’s not hard to do.

The “notes” that are extracted from texts are sentences.

I am uneasy about the use of sentences in isolation from the surrounding text as a “note.”

It’s clearly “doable,” but whether it is a good idea, remains to be seen. Particularly since users are rating sentences/notes in isolation from the text in which they occur.

BTW, funded with tax dollars from the National Institutes of Health and the National Institute on Aging, to the tune of $844K.

I am still trying to track down the resulting software.

I take this as an illustration that anything over a “google” of search results (a new metric) is of interest and fundable.

March 11, 2013

The Annotation-enriched non-redundant patent sequence databases [Curation vs. Search]

Filed under: Bioinformatics,Biomedical,Marketing,Medical Informatics,Patents,Topic Maps — Patrick Durusau @ 2:01 pm

The Annotation-enriched non-redundant patent sequence databases by Weizhong Li, Bartosz Kondratowicz, Hamish McWilliam, Stephane Nauche and Rodrigo Lopez.

Not a really promising title, is it? 😉 The reason I cite it here is that, thanks to curation, the database is “non-redundant.”

Try searching for some of these sequences at the USPTO and compare the results.

The power of curation will be immediately obvious.

Abstract:

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases.

Database URL: http://www.ebi.ac.uk/patentdata/nr/
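
As a toy illustration of what “non-redundant” means here, level 1 collapses identical sequences into one cluster and level 2 groups clusters by patent family; the EBI pipeline of course does far more (similarity-based clustering, annotation merging, kind-code corrections):

```python
# Toy sketch of two-level "non-redundant" clustering: level 1 collapses
# identical sequences, level 2 groups level-1 clusters by patent family.
# Records and identifiers are invented; the EBI databases use real identity
# clustering and merge annotation from the patent literature.
from collections import defaultdict

records = [
    {"id": "A1", "patent": "EP100", "family": "F1", "seq": "MKTAYIAKQR"},
    {"id": "A2", "patent": "US200", "family": "F1", "seq": "MKTAYIAKQR"},  # duplicate sequence
    {"id": "B1", "patent": "US300", "family": "F2", "seq": "MSTNPKPQRK"},
]

# Level 1: one cluster per distinct sequence.
level1 = defaultdict(list)
for r in records:
    level1[r["seq"]].append(r)

# Level 2: group level-1 clusters by the patent families of their members.
level2 = defaultdict(set)
for seq, members in level1.items():
    for fam in {m["family"] for m in members}:
        level2[fam].add(seq)

print(f"{len(records)} records -> {len(level1)} level-1 clusters "
      f"-> {len(level2)} patent families")
```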

Topic maps are curated data. Which one do you prefer?

March 10, 2013

Using and abusing evidence

Filed under: Government,Medical Informatics,Transparency — Patrick Durusau @ 3:14 pm

New thematic series: Using and abusing evidence by Adrian Aldcroft.

From the post:

Scientific evidence plays an important role in guiding medical laws and policies, but how evidence is represented, and often misrepresented, warrants careful consideration. A new cross-journal thematic series headed by Genome Medicine, Using and abusing evidence in science and health policy, explores the application of evidence in healthcare law and policy in an attempt to uncover how evidence from research is translated into the public sphere. Other journals involved in the series include BMC Medical Ethics, BMC Public Health, BMC Medical Genomics, BMC Psychiatry, and BMC Medicine.

Articles already published include an argument for reframing the obesity epidemic through the use of the term caloric overconsumption, an examination of bioethics in popular science literature, and a look at the gap between reality and public perception when discussing the potential of stem cell therapies. Other published articles look at the quality of informed consent in pediatric research and evidence for genetic discrimination in the life insurance industry. More articles will be added to the series as they are published.

Articles published in this series were invited from delegates at the meeting “Using and Abusing Evidence in Science and Health Policy” held in Banff, Alberta, on May 30th-June 1st, 2012. We hope the publication of the article collection will contribute to the understanding of the ethical and political implications associated with the application of evidence in research and politics.

A useful series, but I wonder how effective the identification of “abuse” of evidence will be without identifying its abusers.

And making the case for “abuse” of evidence in a compelling manner?

For example, changing “obesity” to “caloric overconsumption” (Addressing the policy cacophony does not require more evidence: an argument for reframing obesity as caloric overconsumption) carries the day if and only if one presumes a regulatory environment with the goal of improving public health.

The near toxic levels of high fructose corn syrup in the average American diet demonstrate the goals of food regulation in the United States have little to do with public health and welfare.

Identification of who makes such policies on obesity, who benefits and who is harmed, could go a long way towards creating a different regulatory environment.

February 15, 2013

New Query Tool Searches EHR Unstructured Data

Filed under: Biomedical,Medical Informatics,Searching,Unstructured Data — Patrick Durusau @ 1:32 pm

New Query Tool Searches EHR Unstructured Data by Ken Terry.

From the post:

A new electronic health record “intelligence platform” developed at Massachusetts General Hospital (MGH) and its parent organization, Partners Healthcare, is being touted as a solution to the problem of searching structured and unstructured data in EHRs for clinically useful information.

QPID Inc., a new firm spun off from Partners and backed by venture capital funds, is now selling its Web-based search engine to other healthcare organizations. Known as the Queriable Patient Inference Dossier (QPID), the tool is designed to allow clinicians to make ad hoc queries about particular patients and receive the desired information within seconds.

Today, 80% of stored health information is believed to be unstructured. It is trapped in free text such as physician notes and reports, discharge summaries, scanned documents and e-mail messages. One reason for the prevalence of unstructured data is that the standard methods for entering structured data, such as drop-down menus and check boxes, don’t fit into traditional physician workflow. Many doctors still dictate their notes, and the transcription goes into the EHR as free text.

and,

QPID, which was first used in the radiology department of MGH in 2005, incorporates an EHR search engine, a library of search queries based on clinical concepts, and a programming system for application and query development. When a clinician submits a query, QPID presents the desired data in a “dashboard” format that includes abnormal results, contraindications and other alerts, Doyle said.

The core of the system is a form of natural language processing (NLP) based on a library encompassing “thousands and thousands” of clinical concepts, he said. Because it was developed collaboratively by physicians and scientists, QPID identifies medical concepts imbedded in unstructured data more effectively than do other NLP systems from IBM, Nuance and M*Modal, Doyle maintained.
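
The core operation here is spotting clinical concepts in free-text notes. A bare-bones, dictionary-based sketch of that step is below; QPID's actual NLP and concept library are proprietary, so the concepts, synonyms and note text are invented for illustration:

```python
# Bare-bones, dictionary-based concept spotting over a free-text clinical
# note. QPID's NLP and concept library are proprietary; the concepts,
# synonyms and note below are invented for illustration.
import re

CONCEPTS = {
    "myocardial infarction": ["myocardial infarction", "heart attack", "mi"],
    "anticoagulant therapy": ["warfarin", "heparin", "anticoagulant"],
}

def spot_concepts(note):
    found = set()
    lowered = note.lower()
    for concept, synonyms in CONCEPTS.items():
        if any(re.search(rf"\b{re.escape(s)}\b", lowered) for s in synonyms):
            found.add(concept)
    return found

note = "Pt with prior heart attack, currently on warfarin; no chest pain today."
print(spot_concepts(note))  # {'myocardial infarction', 'anticoagulant therapy'}
```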

Takeaway points for data search/integration solutions:

  1. 80% of stored health information (need)
  2. traditional methods for data entry….don’t fit into traditional physician workflow (user requirement)
  3. developed collaboratively by physicians and scientists (semantics originate with users, not top down)

I am interested in how QPID conforms (or not) to local medical terminology practices.

To duplicate their earlier success, conforming to local terminology practices is critical.

If for no other reason it will give physicians and other health professionals “ownership” of the vocabulary and hence faith in the system.

February 11, 2013

A Tale of Five Languages

Filed under: Biomedical,Medical Informatics,SNOMED — Patrick Durusau @ 10:58 am

Evaluating standard terminologies for encoding allergy information by Foster R Goss, Li Zhou, Joseph M Plasek, Carol Broverman, George Robinson, Blackford Middleton, Roberto A Rocha. (J Am Med Inform Assoc doi:10.1136/amiajnl-2012-000816)

Abstract:

Objective Allergy documentation and exchange are vital to ensuring patient safety. This study aims to analyze and compare various existing standard terminologies for representing allergy information.

Methods Five terminologies were identified, including the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), National Drug File–Reference Terminology (NDF-RT), Medical Dictionary for Regulatory Activities (MedDRA), Unique Ingredient Identifier (UNII), and RxNorm. A qualitative analysis was conducted to compare desirable characteristics of each terminology, including content coverage, concept orientation, formal definitions, multiple granularities, vocabulary structure, subset capability, and maintainability. A quantitative analysis was also performed to compare the content coverage of each terminology for (1) common food, drug, and environmental allergens and (2) descriptive concepts for common drug allergies, adverse reactions (AR), and no known allergies.

Results Our qualitative results show that SNOMED CT fulfilled the greatest number of desirable characteristics, followed by NDF-RT, RxNorm, UNII, and MedDRA. Our quantitative results demonstrate that RxNorm had the highest concept coverage for representing drug allergens, followed by UNII, SNOMED CT, NDF-RT, and MedDRA. For food and environmental allergens, UNII demonstrated the highest concept coverage, followed by SNOMED CT. For representing descriptive allergy concepts and adverse reactions, SNOMED CT and NDF-RT showed the highest coverage. Only SNOMED CT was capable of representing unique concepts for encoding no known allergies.

Conclusions The proper terminology for encoding a patient’s allergy is complex, as multiple elements need to be captured to form a fully structured clinical finding. Our results suggest that while gaps still exist, a combination of SNOMED CT and RxNorm can satisfy most criteria for encoding common allergies and provide sufficient content coverage.

Interesting article but some things that may not be apparent to the casual reader:

MedDRA:

The Medical Dictionary for Regulatory Activities (MedDRA) was developed by the International Conference on Harmonisation (ICH) and is owned by the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA) acting as trustee for the ICH steering committee. The Maintenance and Support Services Organization (MSSO) serves as the repository, maintainer, and distributor of MedDRA as well as the source for the most up-to-date information regarding MedDRA and its application within the biopharmaceutical industry and regulators. (source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/MDR/index.html)

MedDRA has a metathesaurus with translations into: Czech, Dutch, French, German, Hungarian, Italian, Japanese, Portuguese, and Spanish.

Unique Ingredient Identifier (UNII)

The overall purpose of the joint FDA/USP Substance Registration System (SRS) is to support health information technology initiatives by generating unique ingredient identifiers (UNIIs) for substances in drugs, biologics, foods, and devices. The UNII is a non-proprietary, free, unique, unambiguous, non-semantic, alphanumeric identifier based on a substance’s molecular structure and/or descriptive information.

The UNII may be found in:

  • NLM’s Unified Medical Language System (UMLS)
  • National Cancer Institute's Enterprise Vocabulary Services
  • USP Dictionary of USAN and International Drug Names (future)
  • FDA Data Standards Council website
  • VA National Drug File Reference Terminology (NDF-RT)
  • FDA Inactive Ingredient Query Application

(source: http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/)

National Drug File – Reference Terminology (NDF-RT)

The National Drug File – Reference Terminology (NDF-RT) is produced by the U.S. Department of Veterans Affairs, Veterans Health Administration (VHA).

NDF-RT combines the NDF hierarchical drug classification with a multi-category reference model. The categories are:

  1. Cellular or Molecular Interactions [MoA]
  2. Chemical Ingredients [Chemical/Ingredient]
  3. Clinical Kinetics [PK]
  4. Diseases, Manifestations or Physiologic States [Disease/Finding]
  5. Dose Forms [Dose Form]
  6. Pharmaceutical Preparations
  7. Physiological Effects [PE]
  8. Therapeutic Categories [TC]
  9. VA Drug Interactions [VA Drug Interaction]

(source: http://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/NDFRT/)

MedDRA, UNII, and NDF-RT have been in use for years, MedDRA internationally in multiple languages. An uncounted number of medical records, histories and no doubt publications rely upon these vocabularies.

Assume the conclusion: SNOMED CT with RxNorm (links between drug vocabularies) provides the best coverage for “encoding common allergies.”

A critical question remains:

How to access medical records using other terminologies?

Recall from the adventures of owl:sameAs (The Semantic Web Is Failing — But Why? (Part 5)) that any single string identifier is subject to multiple interpretations, interpretations that can only be disambiguated by additional information.

You might present a search engine with string-to-string mappings, but those are inherently less robust and harder to maintain than richer mappings.

The sort of richer mappings that are supported by topic maps.
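
As a small illustration of the difference, compare a flat string-to-string crosswalk with a mapping that carries the context needed to disambiguate later; all terminology codes below are placeholders, not real SNOMED CT or MedDRA identifiers:

```python
# Purely illustrative: a flat string-to-string crosswalk versus a richer
# mapping that keeps source terminology, relationship type, scope and
# provenance. All codes are placeholders, not real SNOMED CT or MedDRA
# identifiers.
flat_crosswalk = {
    "PENICILLIN ALLERGY": "SCT-0001",  # bare string -> bare code, no context
}

rich_mappings = [
    {
        "source": {"terminology": "MedDRA", "code": "MDR-0001",
                   "label": "Penicillin allergy"},
        "target": {"terminology": "SNOMED CT", "code": "SCT-0001",
                   "label": "Allergy to penicillin"},
        "relation": "close-match",  # deliberately weaker than owl:sameAs
        "scope": "allergy documentation",
        "provenance": "curated 2013-02 by reviewer JD",
    },
]

def candidate_targets(label):
    """Return candidate mappings with enough context to judge the match."""
    return [m for m in rich_mappings
            if m["source"]["label"].lower() == label.lower()]

for m in candidate_targets("penicillin allergy"):
    print(m["target"]["label"], "|", m["relation"], "|", m["provenance"])
```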

February 3, 2013

ToxPi GUI [Data Recycling]

Filed under: Bioinformatics,Biomedical,Integration,Medical Informatics,Subject Identity — Patrick Durusau @ 6:57 pm

ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence by David M. Reif, Myroslav Sypa, Eric F. Lock, Fred A. Wright, Ander Wilson, Tommy Cathey, Richard R. Judson and Ivan Rusyn. (Bioinformatics (2013) 29 (3): 402-403. doi: 10.1093/bioinformatics/bts686)

Abstract:

Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.

Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.

Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.

Contact: reif.david@gmail.com

Very cool!
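
For readers new to the scheme, a ToxPi-style score is essentially a weighted sum of per-chemical evidence “slices” that have been scaled to a common range. A toy version of that arithmetic (not the ToxPi GUI's exact algorithm; all numbers and weights invented) looks like this:

```python
# Toy version of a ToxPi-style score: each chemical has several evidence
# "slices", each slice is scaled to [0, 1] across chemicals, and the score
# is a weighted sum. Data and weights are invented; this is not the ToxPi
# GUI's exact algorithm.
import csv
import io

csv_data = """chemical,exposure,in_vitro_activity,persistence
ChemA,0.9,12.0,3.1
ChemB,0.2,48.0,0.4
ChemC,0.5,5.0,7.8
"""

weights = {"exposure": 0.5, "in_vitro_activity": 0.3, "persistence": 0.2}

rows = list(csv.DictReader(io.StringIO(csv_data)))
slices = list(weights)

# Min-max scale each slice across all chemicals.
lo = {s: min(float(r[s]) for r in rows) for s in slices}
hi = {s: max(float(r[s]) for r in rows) for s in slices}

def scaled(row, s):
    return (float(row[s]) - lo[s]) / (hi[s] - lo[s]) if hi[s] > lo[s] else 0.0

for r in rows:
    score = sum(weights[s] * scaled(r, s) for s in slices)
    print(f"{r['chemical']}: {score:.2f}")
```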

Although, like having a Ford automobile in any color so long as the color was black, you can integrate any data source so long as the format is CSV, the values are numbers, and a few other restrictions are met.

That’s an observation, not a criticism.

The application serves a purpose within a domain and does not “integrate” information in the sense of a topic map.

But a topic map could recycle its data to add other identifications and properties, without having to re-write this application or its data.

Once curated, data should be re-used, not re-created/curated.

Topic maps give you more bang for your data buck.
