Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 4, 2013

Building Smaller Data

Filed under: Bioinformatics,Biomedical,Medical Informatics — Patrick Durusau @ 9:41 am

Throw the Bath Water Out, Keep the Baby: Keeping Medically-Relevant Terms for Text Mining by Jay Jarman, MS and Donald J. Berndt, PhD.

Abstract:

The purpose of this research is to answer the question, can medically-relevant terms be extracted from text notes and text mined for the purpose of classification and obtain equal or better results than text mining the original note? A novel method is used to extract medically-relevant terms for the purpose of text mining. A dataset of 5,009 EMR text notes (1,151 related to falls) was obtained from a Veterans Administration Medical Center. The dataset was processed with a natural language processing (NLP) application which extracted concepts based on SNOMED-CT terms from the Unified Medical Language System (UMLS) Metathesaurus. SAS Enterprise Miner was used to text mine both the set of complete text notes and the set represented by the extracted concepts. Logistic regression models were built from the results, with the extracted concept model performing slightly better than the complete note model.

The researchers created two datasets: one composed of the original text of the medical notes, and a second composed of named entities extracted from those notes using NLP and medical vocabularies.
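To make the "second dataset" idea concrete, here is a minimal sketch of reducing notes to vocabulary terms. It is not the authors' pipeline: the `MEDICAL_TERMS` set is a hypothetical, hand-made stand-in for SNOMED-CT concepts from the UMLS Metathesaurus, the example notes are invented, and a simple token filter takes the place of a real NLP concept extractor.

```python
import re

# Hypothetical stand-in for a controlled vocabulary such as SNOMED-CT.
# A real system would use an NLP concept extractor, not a hand-made set.
MEDICAL_TERMS = {"fall", "fracture", "hip", "dizziness", "pain"}


def tokenize(note):
    """Lowercase a note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", note.lower())


def extract_terms(note, vocabulary=MEDICAL_TERMS):
    """Keep only the tokens that appear in the vocabulary."""
    return [tok for tok in tokenize(note) if tok in vocabulary]


def vocabulary_size(token_lists):
    """Count distinct tokens across a collection of token lists."""
    return len({tok for toks in token_lists for tok in toks})


# Invented example notes, not taken from the study's dataset.
notes = [
    "Patient reports a fall at home, now with hip pain.",
    "Follow-up visit; patient denies dizziness and has no pain.",
]

full = [tokenize(n) for n in notes]         # "complete note" representation
reduced = [extract_terms(n) for n in notes]  # "extracted terms" representation

print("distinct tokens, full notes:", vocabulary_size(full))
print("distinct tokens, extracted terms:", vocabulary_size(reduced))
```

Either representation could then be fed to a classifier. Note that a filter this naive ignores negation ("denies dizziness" still yields "dizziness"), which is one reason a real concept extractor does more than match words.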

The model built on the named-entity-only dataset performed slightly better than the model built on the full-text notes.

In other words, the smaller dataset outperformed the larger dataset of complete notes.

Wait! Isn’t that backwards? I thought “big data” was always better than “smaller data”?

Maybe not?

Maybe having the “right” dataset is better than having a “big” dataset.
