### WikiSynonyms: Find synonyms using Wikipedia redirects

Tuesday, February 26th, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects by Panos Ipeirotis.

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaces for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam’s idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

This rocks!

Talk about an easy path to populating variant names for a topic map!
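A minimal sketch of the redirect idea, assuming the public MediaWiki `action=query` API with `prop=redirects`. This is not the WikiSynonyms service itself. The helper names, the User-Agent string and the hand-written sample response (including its page id) are placeholders for illustration, and result continuation is not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def redirects_url(title):
    """Build a MediaWiki API query listing the pages that redirect to `title`."""
    params = {
        "action": "query",
        "prop": "redirects",
        "titles": title,
        "rdlimit": "max",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def redirect_titles(response):
    """Pull the redirect titles out of a decoded API response."""
    titles = []
    for page in response.get("query", {}).get("pages", {}).values():
        for redirect in page.get("redirects", []):
            titles.append(redirect["title"])
    return titles


def fetch_synonyms(title):
    """Network call (not exercised here): redirect titles for one page."""
    request = Request(
        redirects_url(title),
        headers={"User-Agent": "synonym-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request) as resp:
        return redirect_titles(json.load(resp))


# Hand-written sample in the general shape of an API response;
# the page id "1" is a placeholder, not the real id.
sample = {
    "query": {
        "pages": {
            "1": {
                "title": "C++",
                "redirects": [
                    {"ns": 0, "title": "ISO/IEC 14882:2003"},
                    {"ns": 0, "title": "X3J16"},
                ],
            }
        }
    }
}

print(redirect_titles(sample))
```

Each redirect title becomes a candidate variant name for the topic whose subject is the target page. Redirects also include misspellings and loosely related terms, so “mostly synonymous” is the right expectation.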

### A new Lucene highlighter is born [The final inch problem]

Monday, January 7th, 2013

A new Lucene highlighter is born by Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.

An interesting addition to the highlighters in Lucene.
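To make the role of start and end offsets concrete, here is a minimal sketch in plain Python. It is not any of the Lucene highlighters: the regex tokenizer and the `highlight` helper are assumptions for illustration. It only shows why each token needs character offsets into the original content.

```python
import re


def tokenize(text):
    """Return (term, start, end) per token; offsets are character positions in `text`."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def highlight(text, query_terms, pre="<b>", post="</b>"):
    """Wrap every token whose lowercased term is in `query_terms`, using its offsets."""
    terms = {t.lower() for t in query_terms}
    pieces = []
    last = 0
    for term, start, end in tokenize(text):
        if term in terms:
            pieces.append(text[last:start])               # untouched text before the hit
            pieces.append(pre + text[start:end] + post)   # original surface form, wrapped
            last = end
    pieces.append(text[last:])
    return "".join(pieces)


print(highlight("Lucene has a new highlighter", ["lucene", "highlighter"]))
```

If an analyzer reports the wrong `start` or `end` for a token, the slices above cut the original text in the wrong place, which is the kind of incorrect highlight the post warns about.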

### Lucene’s new analyzing suggester [Can You Say Synonym?]

Saturday, September 29th, 2012

Lucene’s new analyzing suggester by Mike McCandless.

From the post:

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I’m describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you’d process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what’s suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:

One of the use cases that Mike mentions is use of the AnalyzingSuggester to suggest synonyms of terms entered by a user.

That presumes that you know the target of the search and likely synonyms that occur in it.

Use standard synonym sets and you will get standard synonym results.

Develop custom synonym sets and you can deliver time/resource saving results.
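The lookup described in the quoted post (match on the analyzed form, suggest the original target) can be sketched as below. This is not Lucene’s implementation: a sorted list with binary search stands in for the FST, and the `analyze` function, the stopword list, the class name and the weights are all invented for illustration.

```python
import bisect

STOPWORDS = {"a", "an", "the", "of"}  # illustrative stopword list


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return " ".join(t for t in text.lower().split() if t not in STOPWORDS)


class AnalyzingSuggesterSketch:
    def __init__(self, targets):
        # targets: iterable of (original string, weight)
        self._entries = sorted((analyze(s), s, w) for s, w in targets)
        self._keys = [entry[0] for entry in self._entries]

    def lookup(self, query, n=5):
        """Return up to n original targets whose analyzed form starts with the analyzed query."""
        prefix = analyze(query)
        i = bisect.bisect_left(self._keys, prefix)
        hits = []
        while i < len(self._entries) and self._keys[i].startswith(prefix):
            _, surface, weight = self._entries[i]
            hits.append((weight, surface))
            i += 1
        hits.sort(key=lambda hit: -hit[0])  # highest weight first
        return [surface for _, surface in hits[:n]]


suggester = AnalyzingSuggesterSketch([
    ("The Ghost of Christmas Past", 50),  # weights are made up
    ("Ghostbusters", 80),
    ("Ghost in the Shell", 60),
])
print(suggester.lookup("the gho"))
print(suggester.lookup("ghost of c"))
```

Swapping the toy analyzer for one that folds custom synonyms into a single analyzed form is where a custom synonym set would pay off: the user types one variant and is offered targets indexed under another.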

### > 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

Sunday, August 5th, 2012

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

• No new vocabulary needs to be developed.
• No adoption curve for a new vocabulary.
• No training required for users to introduce the new vocabulary.
• Works with historical data.

and I am sure there are others.

### If you are in Kolkata/Pune, India…a request.

Tuesday, July 17th, 2012

No emails are given for the authors of: Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

Meaning of Web-page content plays a big role while produced a search result from a search engine. Most of the cases Web-page meaning stored in title or meta-tag area but those meanings do not always match with Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In such cases, where Webpage content holds dual meaning words that time it is really difficult to identify the meaning of the Web-page. In this paper, we are introducing a new design and development mechanism of identifying the Web-page content meaning which holds dual meaning words in their Web-page content.

### Lucene-1622

Wednesday, May 16th, 2012

Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622

From the description:

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):

• if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
• there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
• if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.

I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.

This remains an open issue as of 16 May 2012.

It is also an important open issue.
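Two of the problems listed in the issue (partial synonym matches and broken phrase queries) can be reproduced with a toy positional index. This is a sketch under simplifying assumptions, not the Lucene patch: the function names are mine, and the synonym tokens are simply stacked onto the positions of the original phrase.

```python
from collections import defaultdict


def index_with_synonym(tokens, phrase, synonym):
    """Build term -> positions, adding `synonym` tokens at the positions where `phrase` occurs."""
    postings = defaultdict(set)
    for pos, tok in enumerate(tokens):
        postings[tok].add(pos)
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            for offset, syn_tok in enumerate(synonym):
                postings[syn_tok].add(start + offset)  # overlapping positions
    return postings


def term_match(postings, term):
    """True if the term occurs anywhere in the indexed document."""
    return bool(postings.get(term))


def phrase_match(postings, terms):
    """True if the terms occur at consecutive positions."""
    return any(
        all(start + i in postings.get(t, ()) for i, t in enumerate(terms))
        for start in postings.get(terms[0], ())
    )


doc = "new york city restaurants".split()
postings = index_with_synonym(doc, ["new", "york", "city"], ["big", "apple"])

# Problem 1: a query for part of the synonym matches the document.
print(term_match(postings, "big"))
# Problem 3: the synonym is shorter than the original, so the phrase breaks at the boundary.
print(phrase_match(postings, ["big", "apple", "restaurants"]))
print(phrase_match(postings, ["new", "york", "city", "restaurants"]))
```

In this sketch “apple” sits at position 1 while “restaurants” sits at position 3, so no run of consecutive positions spells out “big apple restaurants”.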

As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.

Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.

The treatment of synonyms (synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms for both indexing and search, synonym expansion, a whole range of synonym features and capabilities) needs to “man up” to take on “big data.”

### Synonyms in the TMDM Legend

Sunday, May 13th, 2012

I was going over some notes on synonyms this weekend when it occurred to me to ask:

How many synonyms does a topic item have in the TMDM legend?

A synonym being when one term can be freely substituted for another.

Not wanting to trust my memory, I quote from the TMDM legend (ISO/IEC 13250-2):

Two topic items are equal if they have:

• at least one equal string in their [subject identifiers] properties,
• at least one equal string in their [item identifiers] properties,
• at least one equal string in their [subject locators] properties,
• an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
• the same information item in their [reified] properties.
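The equality rule quoted above can be sketched as a predicate. This is my reading of the five conditions, not a conforming TMDM implementation: the `TopicItem` class, its field names and the example.org identifiers are assumptions for illustration, and “the same information item” is modeled as object identity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TopicItem:
    subject_identifiers: set = field(default_factory=set)
    item_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    reified: Optional[object] = None  # the information item this topic reifies, if any


def topics_equal(a, b):
    """True if any one of the five quoted conditions holds."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.item_identifiers & b.item_identifiers
        or a.subject_locators & b.subject_locators
        # fourth condition, checked in both directions
        or a.subject_identifiers & b.item_identifiers
        or a.item_identifiers & b.subject_identifiers
        or (a.reified is not None and a.reified is b.reified)
    )


t1 = TopicItem(subject_identifiers={"http://example.org/id/student"})
t2 = TopicItem(item_identifiers={"http://example.org/id/student"})
t3 = TopicItem(subject_locators={"http://example.org/doc/pupil"})

print(topics_equal(t1, t2))  # equal string across [subject identifiers] and [item identifiers]
print(topics_equal(t1, t3))  # no shared string, nothing reified
```

Written this way, the conditions are alternatives joined by “or”: any one of them is enough to trigger equality.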

The wording is a bit awkward for my point about synonyms but I take it that if two topic items had

at least one equal string in their [subject identifiers] properties,

I could substitute:

at least one equal string in their [item identifiers] properties, (in all relevant places)

and have the same effect.

I am going to be exploring the use of synonym based processing for TMDM governed topic maps.

Any thoughts or insights would be greatly appreciated.

### bibleQuran: Comparing the Word Frequency between Bible and Quran

Friday, November 4th, 2011

bibleQuran: Comparing the Word Frequency between Bible and Quran

From the post:

bibleQuran [pitchinteractive.com] by datavis design firm Pitch Interactive reveals the frequency of word usage between two of the most important holy books: the Bible and the Quran.

The densely populated interactive visualization allows people to search for any word (and similar variations of that word) to explore its frequency in both texts. As each verse is always visible, one is able to compare the relative density of ideas and topics between both passages. For instance, one could select verbs that represent acts of ‘terror’ or ‘love’, and investigate which book discusses the topics more. The appropriate little rectangles, each representing an according verse, which include such this chosen word, are then highlighted, and can be read in detail by hovering the mouse over them.

In addition to being a great graphic presentation of information, with my background and appreciation for both texts, you know why I had to include this post.

I like the synonym feature, although I reserve judgment on what is considered a synonym. I would have to read the original. Translations of both texts are, well, translations. Not really the same text in a very real sense of the word.

Just as a suggestion, I would do the word count statistics separately for the Old/New Testament.

Word of warning: Loads great with Firefox (7.1) on Windows XP, doesn’t load with IE 8 on Windows XP, doesn’t load with Firefox (3.6) on Ubuntu 10.04. So, your experience may vary.

Comments from users with other browser/OS combinations?

### UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Thursday, August 11th, 2011

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in a RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is an UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me, partner!

Depends on where you put your emphasis.

### Recognizing Synonyms

Sunday, October 24th, 2010

I saw a synonym that I recognized the other day and started wondering how I recognized it.

The word I had in mind was “student” and the synonym was “pupil.”

Attempts to recognize synonyms:

• spelling: student, pupil – No.
• length: student 7 letters, pupil 5 letters – No.
• origin: student – late 14c., from O.Fr. estudient , pupil – from O.Fr. pupille (14c.) – No. [1]
• numerology: student (a = 1, b = 2 …) student = 19 + 20 + 21 + 4 + 5 + 14 + 20 = 103; pupil = 16 + 21 + 16 + 9 + 12 = 74 – No [2].
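The letter-sum arithmetic in the last bullet is easy to check mechanically. A small sketch, with `letter_sum` as a made-up helper name and the a = 1, b = 2 … scheme applied to ASCII letters only:

```python
def letter_sum(word):
    """Sum of letter positions (a = 1, b = 2, ...), ignoring anything that is not a-z."""
    return sum(ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z")


for word in ("student", "pupil"):
    print(word, letter_sum(word))
```

Whatever the totals, two different numbers say nothing about whether the words are synonyms, which is the point of the list.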

But I know “student” and “pupil” to be synonyms.[3]

I could just declare them to be synonyms.

But then how do I answer questions like:

• Why did I think “student” and “pupil” were synonyms?
• What would make some other term a synonym of either “student” or “pupil?”
• How can an automated system match my finding of more synonyms?