MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge
Abstract:
Background
Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge.
Methods
We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships.
Results
We applied our system on 5000 abstracts downloaded from PubMed database. We performed the performance evaluation as a gold standard is not yet available. Our system performed with a good precision and recall and we generated 24 hypotheses.
Conclusions
Our experiments show that MKEM is a powerful tool to discover hidden relationships residing in extracted entities that were represented by our Substance-Effect-Process-Disease-Body Part (SEPDB) model.
From the article:
Swanson defined UPK as knowledge that is public and yet undiscovered: two complementary and non-interactive literature sets of articles (independently created fragments of knowledge), when they are considered together, can reveal useful information of scientific interest not apparent in either of the two sets alone [cites omitted].
Basis of UPK:
The underlying discovery method is based on the following principle: some links between two complementary passages of natural language texts can be largely a matter of form “A causes B” (association AB) and “B causes C” (association BC) (See Figure 1). From this, it can be seen that they are linked by B irrespective of the meaning of A, B, or C. However, perhaps nothing at all has been published concerning a possible connection between A and C, even though such a link, if validated, would be of scientific interest. This allowed for the generation of several hypotheses such as “Fish’s oil can be used for treatment of Raynaud’s Disease” [cite omitted].
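To make the A-B-C principle concrete, here is a minimal sketch in Python. It is my own illustration, not the authors' method: the function name abc_hypotheses and the toy associations are assumptions, loosely modeled on the fish oil example quoted above. It treats each published association as a directed pair and reports every (A, B, C) chain for which no direct A-C pair is on record.

```python
from collections import defaultdict


def abc_hypotheses(associations):
    """Return (A, B, C) triples where A-B and B-C are known but A-C is not.

    `associations` is an iterable of (source, target) pairs. This is a
    hypothetical helper for illustration, not part of MKEM.
    """
    known = set(associations)

    # Index every known target by its source term.
    successors = defaultdict(set)
    for source, target in known:
        successors[source].add(target)

    hypotheses = []
    for a, b in sorted(known):
        for c in sorted(successors.get(b, ())):
            # Keep the chain only if nothing links A to C directly.
            if c != a and (a, c) not in known:
                hypotheses.append((a, b, c))
    return hypotheses


# Toy associations invented for illustration; the intermediate terms
# are assumptions, not data extracted from the paper.
associations = [
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's disease"),
]

hypotheses = abc_hypotheses(associations)
for a, b, c in hypotheses:
    print(f"{a} -> {c} (via {b})")
```

Note that the join happens purely on the shared term B, which is exactly the "irrespective of the meaning" point in the quote. It also hints at why earlier approaches needed manual pruning: on a real corpus, the number of candidate chains grows quickly.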
Fairly easy reading and interesting as well.
If you recognize TF*IDF, the basis of Lucene’s classic relevance scoring, you will be interested to learn that it has some weaknesses for UPK. If I understand the authors correctly, ranking terms statistically is insufficient for mining implied relationships: related terms aren’t ranked high enough. I don’t think “boosting” would help, because the terms are not known ahead of time. That said, I suppose you could “boost” on the basis of implied relationships. Will have to think about that some more.
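Here is a small sketch of one way that weakness can show up, assuming the textbook form of TF*IDF (term frequency times log(N/df)), which is not Lucene's exact formula. The function tf_idf and the three toy documents are my own assumptions. The linking term appears in every document, so its inverse document frequency, and therefore its score, drops to zero.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document with textbook TF*IDF.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy documents invented for illustration; "viscosity" is the linking term.
docs = [
    "fish oil lowers viscosity",
    "raynaud patients show raised viscosity",
    "viscosity affects circulation",
]

scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    print(doc, "->", ranked)
```

In this toy corpus the one term that connects all three documents is ranked last in each of them. Whether that is the specific weakness the authors have in mind is my reading, not something the sketch proves.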
You will find “non-interactive literature sets of articles” in computer science, library science, mathematics, law, just about any field you can name. Although you can mine across those “literature sets,” it would be interesting to identify those sets, perhaps with a view towards refining UPK mining. Can you suggest ways to distinguish such “literature sets”?
Oh, link to the software: MKEM (Note to authors: Always include a link to your software, assuming it is available. Make it easy for readers to find and use your hard work!)