Archive for the ‘Subject Recognition’ Category

Construction of Controlled Vocabularies

Tuesday, April 2nd, 2013

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
  salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
  Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples for vocabulary control, but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case, that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
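As a minimal sketch of what "repeating that comparison" might look like, here is one way to record what you know about each use of a term and compare the records. The property names ("kind", "measures"), their values, and the rule itself are my assumptions for illustration, not anything from the primer or from a topic map standard.

```python
# Hypothetical records: each use of a term plus what we think we know about it.
SALINITY = {"term": "salinity", "measures": "dissolved salt content"}
SALTINESS = {"term": "saltiness", "measures": "dissolved salt content"}
MERCURY_PLANET = {"term": "Mercury", "kind": "planet"}
MERCURY_METAL = {"term": "Mercury", "kind": "metal"}

# Assumption: only these properties count as evidence of subject identity.
# The spelling of the term is deliberately left out.
IDENTIFYING_KEYS = {"kind", "measures"}


def same_subject(a, b, identifying_keys):
    """Return True when the two records share at least one identifying
    property and agree on every identifying property they share."""
    shared = identifying_keys & set(a) & set(b)
    return bool(shared) and all(a[key] == b[key] for key in shared)


print(same_subject(SALINITY, SALTINESS, IDENTIFYING_KEYS))
print(same_subject(MERCURY_PLANET, MERCURY_METAL, IDENTIFYING_KEYS))
```

Different spellings with the same recorded property compare as the same subject, while identical spellings with conflicting properties do not. The interesting work is in deciding which properties belong in the identifying set.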

Mary Jane pointed out this resource in a recent comment.

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

Sunday, August 5th, 2012

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required for users to introduce the new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.
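A rough sketch of the idea, assuming a simple lookup from the phrases people already use to a single subject identifier. The phrases and identifiers below are illustrative only and are not taken from the study's data. The sketch also ignores negation, which the abstract suggests is the harder part of the problem.

```python
import re

# Illustrative variants only (not from the study): several existing ways of
# writing a diagnosis, each mapped to one hypothetical subject identifier.
VARIANTS = {
    "invasive ductal carcinoma": "invasive-ductal-carcinoma",
    "infiltrating ductal carcinoma": "invasive-ductal-carcinoma",
    "invasive lobular carcinoma": "invasive-lobular-carcinoma",
    "infiltrating lobular carcinoma": "invasive-lobular-carcinoma",
}


def normalize(phrase):
    """Lowercase the phrase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", phrase.strip().lower())


def subject_for(phrase):
    """Return the subject identifier for a known phrase, or None."""
    return VARIANTS.get(normalize(phrase))


print(subject_for("  Infiltrating   Ductal Carcinoma "))
print(subject_for("no such phrase"))
```

Nobody has to learn a new term. The map simply records that the phrases already in use point at the same subject.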

Subject Recognition: Discrete or Continuous

Monday, October 24th, 2011

While creating the entry for Fast Deep/Recurrent Nets for AGI Vision, I took particular note of the unbroken handwriting competitions. That task, for computer vision, is more difficult than “segmented” handwriting with breaks between the letters.

Are there parallels to subject recognition as performed by our computers versus ourselves?

That is, we record and use “discrete” values in computers for subject recognition.

We as human observers report “discrete” values when asked about subject recognition, but in fact recognize subjects along a non-discrete continuum of values.

I am interested in the application of techniques similar to continuous handwriting recognition applied to subject recognition.
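To make the contrast concrete, here is a small sketch that keeps a graded score and only produces a yes/no answer at the very end. The Jaccard overlap and the 0.5 threshold are my assumptions, chosen for simplicity. This is not a handwriting recognition technique, only an illustration of discrete versus continuous reporting.

```python
def similarity(a, b):
    """Jaccard overlap of two property sets, a value between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_same(a, b, threshold=0.5):
    """The discrete answer: collapse the graded score at a chosen threshold."""
    return similarity(a, b) >= threshold


# Hypothetical property sets observed for two uses of a name.
first = {"planet", "orbits the sun", "rocky"}
second = {"planet", "orbits the sun", "gas giant"}

print(similarity(first, second))  # the continuous value
print(is_same(first, second))     # the discrete report
```

The score carries information that the yes/no answer throws away, and moving the threshold changes the "discrete" result without changing anything about the subjects.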

Comments?