Disease named entity recognition using semisupervised learning and conditional random fields.
Suakkaphong, N., Zhang, Z., & Chen, H. (2011). Disease named entity recognition using semisupervised learning and conditional random fields. Journal of the American Society for Information Science and Technology, 62(4), 727-737.
Abstract:
Information extraction is an important text-mining task that aims at extracting prespecified types of information from large text collections and making them available in structured representations such as databases. In the biomedical domain, information extraction can be applied to help biologists make the most use of their digital-literature archives. Currently, there are large amounts of biomedical literature that contain rich information about biomedical substances. Extracting such knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names from biomedical literature. Two data-processing strategies for each technique also were analyzed: one sequentially processing unlabeled data partitions and another one processing unlabeled data partitions in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques given limited labeled training data. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition.
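For readers who have not met bootstrapping (self-training) before, here is a rough sketch of the loop the abstract describes, with the unlabeled partitions processed sequentially. It is not the authors' implementation: the CRF is swapped out for a toy per-token voting model that labels each token independently, and the function names, the suffix feature, the 0.9 threshold, and the sample tokens are all my own assumptions for illustration.

```python
from collections import Counter, defaultdict

def features(token):
    """Two crude features: the lowercased token and its last four characters."""
    t = token.lower()
    return [("word", t), ("suffix", t[-4:])]

def train(labeled):
    """Count labels per feature. A toy stand-in for CRF training (assumption)."""
    counts = defaultdict(Counter)
    for token, label in labeled:
        for f in features(token):
            counts[f][label] += 1
    return counts

def predict(model, token):
    """Return (label, confidence), or (None, 0.0) if no feature was ever seen."""
    votes = Counter()
    for f in features(token):
        votes.update(model.get(f, {}))
    if not votes:
        return None, 0.0
    label, n = votes.most_common(1)[0]
    return label, n / sum(votes.values())

def bootstrap(labeled, partitions, threshold=0.9):
    """Self-train over unlabeled partitions, one after another."""
    labeled = list(labeled)
    model = train(labeled)
    for part in partitions:
        confident = []
        for token in part:
            label, conf = predict(model, token)
            # Keep only predictions the current model is confident about.
            if label is not None and conf >= threshold:
                confident.append((token, label))
        labeled.extend(confident)
        model = train(labeled)  # retrain with the newly self-labeled tokens
    return model

# Hypothetical seed labels and unlabeled partitions.
seed = [("arthritis", "DISEASE"), ("hepatitis", "DISEASE"),
        ("protein", "O"), ("patients", "O")]
partitions = [["gastritis", "proteins"], ["Gastritis", "colitis"]]
model = bootstrap(seed, partitions)
```

The round-robin strategy mentioned in the abstract would revisit the partitions in turn rather than consuming each one once; I have not tried to sketch that here.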
I don't want to take anything away from this sort of technique, which would stand anyone in good stead for topic map construction, but I am left feeling that it stops short of the mark.
In other words, say that I am happy with the results of its recognition. How do I share them with someone else who has another set of identified subjects, perhaps from the same data?
Or, for that matter, how do I combine them with subjects that I myself have extracted from the same data?
Can’t very well ask the software why it “recognized” one name or another, can I?
I think I would have to add what seems to me to be useful information to each name in order to reuse it with other data.
Starting to sound like a topic map, isn’t it?
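To make that last thought a little more concrete, here is a minimal sketch of what "adding useful information to the name" might look like: each recognized name carries one or more identifiers, and two results are merged when they share one. This is only my illustration, not a topic map engine. The `Subject` class, the `merge` function, the source labels, and the `example.org` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Subject:
    names: set = field(default_factory=set)
    identifiers: set = field(default_factory=set)  # what the name is about
    sources: set = field(default_factory=set)      # who or what recognized it

def merge(subjects):
    """Merge subjects that share at least one identifier."""
    merged = []
    for s in subjects:
        combined = Subject(set(s.names), set(s.identifiers), set(s.sources))
        untouched = []
        for m in merged:
            if m.identifiers & s.identifiers:
                # Same subject under another name: fold it in.
                combined.names |= m.names
                combined.identifiers |= m.identifiers
                combined.sources |= m.sources
            else:
                untouched.append(m)
        merged = untouched + [combined]
    return merged

# Hypothetical results from two different extraction runs.
mine = Subject({"hepatitis B"},
               {"http://example.org/disease/hepatitis-b"},
               {"my CRF run"})
theirs = Subject({"HBV infection"},
                 {"http://example.org/disease/hepatitis-b"},
                 {"a colleague's tagger"})
other = Subject({"colitis"},
                {"http://example.org/disease/colitis"},
                {"my CRF run"})

result = merge([mine, theirs, other])
```

With the identifiers attached, the two names for the same disease end up on one subject, along with a record of where each came from. The bare name strings alone would never have told me they belonged together.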