Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 21, 2015

Memantic: A Medical Knowledge Discovery Engine

Filed under: Bioinformatics,Knowledge Discovery,Synonymy,Visualization — Patrick Durusau @ 2:37 pm

Memantic: A Medical Knowledge Discovery Engine by Alexei Yavlinsky.

Abstract:

We present a system that constructs and maintains an up-to-date co-occurrence network of medical concepts based on continuously mining the latest biomedical literature. Users can explore this network visually via a concise online interface to quickly discover important and novel relationships between medical entities. This enables users to rapidly gain contextual understanding of their medical topics of interest, and we believe this constitutes a significant user experience improvement over contemporary search engines operating in the biomedical literature domain.

Alexei takes advantage of prior work on medical literature to index and display searches in an “economical” way that can enable researchers to discover new relationships in the literature without being overwhelmed by bibliographic detail.

You will need to check my summary against the article but here is how I would describe Memantic:

Memantic indexes medical literature and records the co-occurrences of terms in every text. Those terms are mapped into a standard medical ontology (which reduces screen clutter). When a search is performed, the results are displayed as nodes based on the medical ontology and include the relationships established by the co-occurrences found during indexing. This enables users to find relationships without the necessity of searching through multiple articles or deduping their search results manually.
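The indexing step, as I read it, is simple enough to sketch. The following is a minimal illustration of the idea only; the class and method names are mine, not the paper's, and the concept strings stand in for ontology-mapped terms. It just counts pairwise co-occurrences of concepts within each document, from which the edges of the network could later be drawn.

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;
  import java.util.TreeSet;

  public class CooccurrenceIndex {
    // Undirected co-occurrence counts, keyed by a canonical "conceptA|conceptB" pair.
    private final Map<String, Integer> counts = new HashMap<>();

    // Record every pairwise co-occurrence of ontology concepts found in one document.
    public void addDocument(Set<String> conceptsInDocument) {
      List<String> concepts = new ArrayList<>(new TreeSet<>(conceptsInDocument)); // sorted, de-duplicated
      for (int i = 0; i < concepts.size(); i++) {
        for (int j = i + 1; j < concepts.size(); j++) {
          counts.merge(concepts.get(i) + "|" + concepts.get(j), 1, Integer::sum);
        }
      }
    }

    public int cooccurrences(String a, String b) {
      String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
      return counts.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
      CooccurrenceIndex index = new CooccurrenceIndex();
      // Each set stands in for the ontology concepts extracted from one abstract.
      index.addDocument(Set.of("myocardial infarction", "aspirin", "troponin"));
      index.addDocument(Set.of("myocardial infarction", "troponin"));
      System.out.println(index.cooccurrences("troponin", "myocardial infarction")); // 2
    }
  }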

As I understand it, Memantic is as much an effort at efficient visualization as it is an improvement in search technique.

Very much worth a slow read over the weekend!

I first saw this in a tweet by Sami Ghazali.

PS: I tried viewing the videos listed in the paper but wasn’t able to get any sound. Maybe you will have better luck.

July 30, 2014

Multi-Term Synonyms [Bags of Properties?]

Filed under: Lucene,Search Engines,Synonymy — Patrick Durusau @ 12:34 pm

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter by Ted Sullivan.

From the post:

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun-phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s start position but does not attend to its position length increment. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state-graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622), which is still marked as “Unresolved.”

Posts like this one are a temptation to sign off Twitter and read the ticket feeds for Lucene/Solr instead. Seriously.

Ted proposes a workaround to the multi-term synonym problem using the auto phrasing tokenfilter. Equally important is his conclusion:

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. Simultaneously, we can improve another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than documents containing words that match the query. We are moving from searching with a “bag of words” to searching a “bag of things”.

Or more precisely:

…their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”)…

Ambiguity at the token level remains, even if for particular cases phrases can be treated as semantic entities.

Rather than Ted’s “bag of things,” may I suggest indexing “bags of properties,” where the lowliest token or a higher semantic unit can be indexed as a bag of properties?

Imagine indexing these properties* for a single token:

  • string: value
  • pubYear: value
  • author: value
  • journal: value
  • keywords: value

Would that suffice to distinguish a term in a medical journal from Vanity Fair?
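To make the idea concrete, here is a minimal sketch in Lucene terms of what indexing two occurrences of the same string as distinct bags of properties might look like. The field names mirror the list above and the values are illustrative; this is my sketch of the idea, not a proposal for how Lucene should implement it.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.ByteBuffersDirectory;

  public class BagOfPropertiesDemo {
    public static void main(String[] args) throws Exception {
      try (ByteBuffersDirectory dir = new ByteBuffersDirectory(); // Lucene 8+; older releases used RAMDirectory
           IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

        // "infarction" as it occurs in a medical journal.
        Document medical = new Document();
        medical.add(new TextField("string", "infarction", Field.Store.YES));
        medical.add(new StringField("pubYear", "2014", Field.Store.YES));
        medical.add(new StringField("author", "Smith, J.", Field.Store.YES));
        medical.add(new StringField("journal", "The Lancet", Field.Store.YES));
        medical.add(new StringField("keywords", "cardiology", Field.Store.YES));
        writer.addDocument(medical);

        // The same string as it might occur in Vanity Fair: same token, different bag.
        Document magazine = new Document();
        magazine.add(new TextField("string", "infarction", Field.Store.YES));
        magazine.add(new StringField("pubYear", "2014", Field.Store.YES));
        magazine.add(new StringField("journal", "Vanity Fair", Field.Store.YES));
        magazine.add(new StringField("keywords", "celebrity profile", Field.Store.YES));
        writer.addDocument(magazine);
      }
    }
  }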

Ambiguity is predicated upon a lack of information.

That should be suggestive of a potential cure.

*(I’m not suggesting that all of those properties or even most of them would literally appear in a bag. Most, if not all, could be defaulted from an indexed source.)

I first saw this in a tweet by SolrLucene.

November 24, 2013

Multi-term Synonym Mapping in Solr

Filed under: Solr,Synonymy — Patrick Durusau @ 2:08 pm

Why is Multi-term synonym mapping so hard in Solr? by John Berryman.

From the post:

There is a very common need for multi-term synonyms. We’ve actually run across several use cases among our recent clients. Consider the following examples:

  • Ecommerce: If a customer searches for “weed whacker”, but the more canonical name is “string trimmer”, then you need synonyms, otherwise you’re going to lose a sale.
  • Law: Consider a layperson attempting to find a section of legal code pertaining to their “truck”. If the law only talks about “motor vehicles”, then, without synonyms, this individual will go away uninformed.
  • Medicine: When a doctor is looking up recent publications on “heart attack”, synonyms make sure that he also finds documents that happen to only mention “myocardial infarction”.

One would hope that working with synonyms should be as simple as tossing a set of synonyms into the synonyms.txt file and just having Solr “do the right thing.”™ And when we’re talking about simple, single-term synonyms (e.g. TV = televisions), synonyms really are just that straightforward. Unfortunately, especially as you get into more complex uses of synonyms, such as multi-term synonyms, there are several gotchas. Sometimes, there are workarounds. And sometimes, for now at least, you’ll just have to make do with what you can currently achieve using Solr! In this post we’ll provide a quick intro to synonyms in Solr, we’ll walk through some of the pain points, and then we’ll propose possible resolutions.

John does a great review of basic synonym mapping in Solr as a prelude to illustrating the difficulty with multi-term synonyms.

His example case is the mapping:

spider man ==> spiderman

“Obvious” solutions fail but John does conclude with a pointer to one solution to the issue.

Recommended for a deeper understanding of Solr’s handling of synonymy.
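To see what John’s spider man ==> spiderman mapping actually produces in a token stream, here is a minimal sketch of my own using current Lucene classes. One hedge: SynonymGraphFilter (Lucene 6.4+) did not exist when John wrote; his post and its workarounds target the older SynonymFilter. The field name and sample text are purely illustrative.

  import java.io.IOException;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
  import org.apache.lucene.analysis.synonym.SynonymMap;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
  import org.apache.lucene.util.CharsRef;
  import org.apache.lucene.util.CharsRefBuilder;

  public class MultiTermSynonymDemo {
    public static void main(String[] args) throws IOException {
      // Register the two-token phrase "spider man" as a synonym of the single token "spiderman".
      SynonymMap.Builder builder = new SynonymMap.Builder(true);
      CharsRef phrase = SynonymMap.Builder.join(new String[] {"spider", "man"}, new CharsRefBuilder());
      builder.add(phrase, new CharsRef("spiderman"), true); // true = keep the original tokens too
      SynonymMap map = builder.build();

      Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
          WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
          return new TokenStreamComponents(tokenizer, new SynonymGraphFilter(tokenizer, map, true));
        }
      };

      // Dump each term with its position increment and position length.
      try (TokenStream ts = analyzer.tokenStream("body", "spider man comics")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        PositionLengthAttribute posLen = ts.addAttribute(PositionLengthAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.printf("%s (posInc=%d, posLen=%d)%n",
              term, posInc.getPositionIncrement(), posLen.getPositionLength());
        }
        ts.end();
      }
    }
  }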

While reading John’s post it occurred to me to check with Wikipedia on disambiguation of the term “spider.”

  • Comics – 17
  • Other publications – 5
  • Culinary – 3
  • Film and television – 10
  • Games and sports – 10
  • Land vehicles – 4
  • Mathematics – 1
  • Music – 16
  • People – 7
  • Technology – 14
  • Other uses – 7

I count eighty-eight (88) distinct “spiders” (counting spider as “an air-breathing eight-legged animal”, of which there are 44,032 species identified as of June 23, 2013).

John suggests a parsing solution for the multi-term synonym problem in Solr, but however “spider” is parsed, there remains ambiguity.

An 88-fold ambiguity (at minimum).

At least for Solr and other search engines.

Not so much for us as human readers.

A human reader is not limited to “spider” in deciding which of 88 possible spiders is the correct one and/or the appropriate synonyms to use.

Each “spider” is seen in a “context” and a human reader will attribute (perhaps not consciously) characteristics to a particular “spider” in order to identify it.

If we record characteristics for each “spider,” then distinguishing and matching spiders to synonyms (also with characteristics) becomes a task of:

  1. Deciding which characteristic(s) to require for identification/synonymy.
  2. Fashioning rules for identification/synonymy.

Much can be said about those two tasks but for now, I will leave you with a practical example of their application.

Assume that you are indexing some portion of web space and you encounter The World Spider Catalog, Version 14.0.

We know that every instance of “spider” (136) at that site has the characteristic of order Araneae. How you wish to associate that with every instance of “spider” or other names from the spider database is an implementation issue.

However, knowing “order Araneae” allows us to reliably distinguish all the instances of “spider” at this resource from other instances of “spider” that lack that characteristic.

Just as importantly, we only have to perform that task once, rather than relying on our users to perform it over and over again.
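A minimal sketch of the two tasks listed above, with property names of my own invention: pick the characteristic(s) required for identity, then apply a rule that two occurrences denote the same subject only when those characteristics are present and equal.

  import java.util.Map;
  import java.util.Set;
  import java.util.function.BiPredicate;

  public class CharacteristicMatch {
    public static void main(String[] args) {
      // Task 1: the characteristic(s) we require for identification/synonymy.
      Set<String> requiredCharacteristics = Set.of("taxon:order");

      // Task 2: the rule — two occurrences denote the same "spider" only if
      // every required characteristic is present and equal in both.
      BiPredicate<Map<String, String>, Map<String, String>> sameSubject =
          (a, b) -> requiredCharacteristics.stream()
              .allMatch(c -> a.containsKey(c) && a.get(c).equals(b.get(c)));

      Map<String, String> worldSpiderCatalog = Map.of("string", "spider", "taxon:order", "Araneae");
      Map<String, String> comicsSpider = Map.of("string", "spider", "series", "The Spider");

      System.out.println(sameSubject.test(worldSpiderCatalog,
          Map.of("string", "araignée", "taxon:order", "Araneae"))); // true: characteristic matches
      System.out.println(sameSubject.test(worldSpiderCatalog, comicsSpider)); // false: characteristic missing
    }
  }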

The weakness of current indexing is that it harvests only the surface text and not the rich semantic soil in which it grows.

August 25, 2013

Better synonym handling in Solr

Filed under: Solr,Synonymy — Patrick Durusau @ 7:05 pm

Better synonym handling in Solr by Nolan Lawson.

A very deep dive into synonym handling in Solr, along with a proposed fix.

The problems Nolan uncovers are now in a JIRA issue, SOLR-4381.

And Nolan has a Github repository with his proposed fix.

The Solr JIRA lists the issue as still “open.”

Start with the post and then go onward to the JIRA issue and Github repository. I say that because Nolan does a great job detailing the issue he discovered and his proposed solution.

I can think of several other improvements to synonym handling in Solr.

Such as allowing specification of tokens and required values in other fields for synonyms. (An indexing analog to scope.)

Or even allowing Solr queries in a synonym table.

Not to mention making Solr synonym tables by default indexed.

Just to name a few.
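The first of those improvements, synonyms conditioned on values in other fields, might look something like the following sketch. The class and field names are mine and nothing like this exists in Solr today; it only illustrates the shape of a “scoped” synonym rule.

  import java.util.List;
  import java.util.Map;

  public class ScopedSynonymRule {
    final String term;
    final List<String> synonyms;
    final Map<String, String> requiredFieldValues; // the "scope": other fields must carry these values

    ScopedSynonymRule(String term, List<String> synonyms, Map<String, String> requiredFieldValues) {
      this.term = term;
      this.synonyms = synonyms;
      this.requiredFieldValues = requiredFieldValues;
    }

    // Expand only if the document's other fields satisfy the scope.
    List<String> expansionsFor(Map<String, String> documentFields) {
      boolean inScope = requiredFieldValues.entrySet().stream()
          .allMatch(e -> e.getValue().equals(documentFields.get(e.getKey())));
      return inScope ? synonyms : List.of();
    }

    public static void main(String[] args) {
      ScopedSynonymRule rule = new ScopedSynonymRule(
          "MI", List.of("myocardial infarction", "heart attack"), Map.of("domain", "medicine"));
      System.out.println(rule.expansionsFor(Map.of("domain", "medicine"))); // expands
      System.out.println(rule.expansionsFor(Map.of("domain", "finance")));  // no expansion
    }
  }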

August 24, 2013

On-demand Synonym Extraction Using Suffix Arrays

Filed under: Authoring Topic Maps,Suffix Array,Synonymy — Patrick Durusau @ 3:19 pm

On-demand Synonym Extraction Using Suffix Arrays by Minoru Yoshida, Hiroshi Nakagawa, and Akira Terada. (Yoshida, M., Nakagawa, H. & Terada, A. (2013). On-demand Synonym Extraction Using Suffix Arrays. Information Extraction from the Internet. ISBN: 978-1463743994. iConcept Press. Retrieved from http://www.iconceptpress.com/books//information-extraction-from-the-internet/)

From the introduction:

The amount of electronic documents available on the World Wide Web (WWW) is continuously growing. The situation is the same in a limited part of the WWW, e.g., Web documents from specific web sites such as ones of some specific companies or universities, or some special-purpose web sites such as www.wikipedia.org, etc. This chapter mainly focuses on such a limited-size corpus. Automatic analysis of this large amount of data by text-mining techniques can produce useful knowledge that is not found by human efforts only.

We can use the power of on-memory text mining for such a limited-size corpus. Fast search for required strings or words available by putting whole documents on memory contributes to not only speeding up of basic search operations like word counting, but also making possible more complicated tasks that require a number of search operations. For such advanced text-mining tasks, this chapter considers the problem of extracting synonymous strings for a query given by users. Synonyms, or paraphrases, are words or phrases that have the same meaning but different surface strings. “HDD” and “hard drive” in documents related to computers and “BBS” and “message boards” in Web pages are examples of synonyms. They appear ubiquitously in different types of documents because the same concept can often be described by two or more expressions, and different writers may select different words or phrases to describe the same concept. In such cases, the documents that include the string “hard drive” might not be found if the query “HDD” is used, which results in a drop in the coverage of the search system. This could become a serious problem, especially for searches of limited-size corpora. Therefore, being able to find such synonyms significantly improves the usability of various systems. Our goal is to develop an algorithm that can find strings synonymous with the user input. The applications of such an algorithm include augmenting queries with synonyms in information retrieval or text-mining systems, and assisting input systems by suggesting expressions similar to the user input.

The authors concede the results of their method are inferior to the best results of other synonym extraction methods but go on to say:

However, note that the main advantage of our method is not its accuracy, but its ability to extract synonyms of any query without a priori construction of thesauri or preprocessing using other linguistic tools like POS taggers or dependency parsers, which are indispensable for previous methods.
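The machinery the authors lean on is fast substring lookup over an in-memory corpus via a suffix array. As a reminder of how little is needed for the lookup itself, here is a minimal and deliberately naive sketch of my own; the quadratic construction is nothing like the authors’ implementation, but the binary-search lookup is the same idea.

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  public class SuffixArraySearch {
    // Naive suffix array construction: O(n^2 log n), fine for a sketch, not for a real corpus.
    static Integer[] build(String text) {
      Integer[] sa = new Integer[text.length()];
      for (int i = 0; i < sa.length; i++) sa[i] = i;
      Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
      return sa;
    }

    // All start offsets of query in text, via binary search over the sorted suffixes.
    static List<Integer> find(String text, Integer[] sa, String query) {
      int lo = 0, hi = sa.length;
      while (lo < hi) { // lower bound: first suffix >= query
        int mid = (lo + hi) / 2;
        if (text.substring(sa[mid]).compareTo(query) >= 0) hi = mid;
        else lo = mid + 1;
      }
      List<Integer> hits = new ArrayList<>();
      for (int i = lo; i < sa.length && text.startsWith(query, sa[i]); i++) hits.add(sa[i]);
      return hits;
    }

    public static void main(String[] args) {
      String corpus = "replace the hard drive; the HDD failed; hard drive prices fell";
      Integer[] sa = build(corpus);
      System.out.println(find(corpus, sa, "hard drive")); // start offsets of both occurrences
    }
  }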

An important point to remember about all semantic technologies: how appropriate a technique is for your project depends on your requirements, not on the qualities of a technique in the abstract.

Technique N may not support machine reasoning but sending coupons to mobile phones “near” a restaurant doesn’t require that overhead. (Neither does standing outside the restaurant with flyers.)

Choose semantic techniques based on their suitability for your purposes.

July 8, 2013

Better synonym handling in Solr

Filed under: Search Engines,Solr,Synonymy — Patrick Durusau @ 6:45 pm

Better synonym handling in Solr by Nolan Lawson.

From the post:

It’s a pretty common scenario when working with a Solr-powered search engine: you have a list of synonyms, and you want user queries to match documents with synonymous terms. Sounds easy, right? Why shouldn’t queries for “dog” also match documents containing “hound” and “pooch”? Or even “Rover” and “canis familiaris”?

(image omitted)

As it turns out, though, Solr doesn’t make synonym expansion as easy as you might like. And there are lots of good ways to shoot yourself in the foot.

Deep review of the handling of synonyms in Solr and a patch to improve its handling of synonyms.

The issue is now SOLR-4381 and is set for SOLR 4.4.

Interesting discussion continues under the SOLR issue.

May 27, 2013

Automatically Acquiring Synonym Knowledge from Wikipedia

Filed under: Lucene,Solr,Synonymy,Wikipedia — Patrick Durusau @ 7:36 pm

Automatically Acquiring Synonym Knowledge from Wikipedia by Koji Sekiguchi.

From the post:

Synonym search sure is convenient. However, in order for an administrator to allow users to use these convenient search functions, he or she has to provide them with a synonym dictionary (CSV file) described above. New words are created every day and so are new synonyms. A synonym dictionary might have been prepared by a person in charge with huge effort but sometimes will be left unmaintained as time goes by or his/her position is taken over.

That is a reason people start longing for an automatic creation of synonym dictionary. That request has driven me to write the system I will explain below. This system learns synonym knowledge from “dictionary corpus” and outputs “original word – synonym” combinations of high similarity to a CSV file, which in turn can be applied to the SynonymFilter of Lucene/Solr as is.

This “dictionary corpus” is a corpus that contains entries consisting of “keywords” and their “descriptions”. An electronic dictionary exactly is a dictionary corpus and so is Wikipedia, which you are familiar with and is easily accessible.

Let’s look at a method to use the Japanese version of Wikipedia to automatically get synonym knowledge.

A more complex representation of synonyms, one that includes domain or scope, would be more robust.

On the other hand, some automatic generation of synonyms is better than no synonyms at all.

Take this as a good place to start but not as a destination for synonym generation.

February 26, 2013

WikiSynonyms: Find synonyms using Wikipedia redirects

Filed under: Synonymy,Wikipedia — Patrick Durusau @ 1:53 pm

WikiSynonyms: Find synonyms using Wikipedia redirects by Panos Ipeirotis.

Many many years back, I worked with Wisam Dakka on a paper to create faceted interfaces for text collections. One of the requirements for that project was to discover synonyms for named entities. While we explored a variety of directions, the one that I liked most was Wisam’s idea to use the Wikipedia redirects to discover terms that are mostly synonymous.

Did you know, for example, that ISO/IEC 14882:2003 and X3J16 are synonyms of C++? Yes, me neither. However, Wikipedia reveals that through its redirect structure.

This rocks!

Talk about an easy path to populating variant names for a topic map!

Complete with examples, code, suggestions on hacking Wikipedia data sets (downloaded).
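If you want to try the redirect trick without downloading the dumps first, the live MediaWiki API exposes the same information through prop=redirects. A minimal sketch, using crude string extraction instead of a JSON parser; the query title is just an example.

  import java.net.URI;
  import java.net.URLEncoder;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.charset.StandardCharsets;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class WikipediaRedirectSynonyms {
    public static void main(String[] args) throws Exception {
      String title = "C++";
      String url = "https://en.wikipedia.org/w/api.php?action=query&prop=redirects&rdlimit=max&format=json&titles="
          + URLEncoder.encode(title, StandardCharsets.UTF_8);

      HttpResponse<String> response = HttpClient.newHttpClient().send(
          HttpRequest.newBuilder(URI.create(url)).header("User-Agent", "synonym-demo").build(),
          HttpResponse.BodyHandlers.ofString());

      // Crude extraction: every "title" value in the response, which includes the page's
      // own title plus the redirect titles, i.e. candidate synonyms / variant names.
      Matcher m = Pattern.compile("\"title\":\"(.*?)\"").matcher(response.body());
      while (m.find()) {
        System.out.println(m.group(1));
      }
    }
  }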

January 7, 2013

A new Lucene highlighter is born [The final inch problem]

Filed under: Indexing,Lucene,Searching,Synonymy — Patrick Durusau @ 10:27 am

A new Lucene highlighter is born by Mike McCandless.

From the post:

Robert has created an exciting new highlighter for Lucene, PostingsHighlighter, our third highlighter implementation (Highlighter and FastVectorHighlighter are the existing ones). It will be available starting in the upcoming 4.1 release.

Highlighting is crucial functionality in most search applications since it’s the first step of the hard-to-solve final inch problem, i.e. of getting the user not only to the best matching documents but getting her to the best spot(s) within each document. The larger your documents are, the more crucial it is that you address the final inch. Ideally, your user interface would let the user click on each highlight snippet to jump to where it occurs in the full document, or at least scroll to the first snippet when the user clicks on the document link. This is in general hard to solve: which application renders the content is dependent on its mime-type (i.e., the browser will render HTML, but will embed Acrobat Reader to render PDF, etc.).

Google’s Chrome browser has an ingenious solution to the final inch problem, when you use “Find…” to search the current web page: it highlights the vertical scroll bar showing you where the matches are on the page. You can then scroll to those locations, or, click on the highlights in the scroll bar to jump there. Wonderful!

All Lucene highlighters require search-time access to the start and end offsets per token, which are character offsets indicating where in the original content that token started and ended. Analyzers set these two integers per-token via the OffsetAttribute, though some analyzers and token filters are known to mess up offsets which will lead to incorrect highlights or exceptions during highlighting. Highlighting while using SynonymFilter is also problematic in certain cases, for example when a rule maps multiple input tokens to multiple output tokens, because the Lucene index doesn’t store the full token graph.

An interesting addition to the highlighters in Lucene.

Be sure to follow the link to Mike’s comments about the limitations on SynonymFilter and the difficulty of correction.
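For a concrete look at the per-token offsets the highlighters depend on, here is a minimal sketch that dumps what StandardAnalyzer records in the OffsetAttribute. The field name and sample text are arbitrary.

  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

  public class OffsetDemo {
    public static void main(String[] args) throws Exception {
      try (StandardAnalyzer analyzer = new StandardAnalyzer();
           TokenStream ts = analyzer.tokenStream("body", "Highlighting needs character offsets.")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // start/end are character offsets into the original text, which is what highlighters consume.
          System.out.printf("%s [%d,%d)%n", term, offsets.startOffset(), offsets.endOffset());
        }
        ts.end();
      }
    }
  }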

September 29, 2012

Lucene’s new analyzing suggester [Can You Say Synonym?]

Filed under: Lucene,Synonymy — Patrick Durusau @ 3:49 pm

Lucene’s new analyzing suggester by Mike McCandless.

From the post:

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I’m describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you’d process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what’s suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:

One of the use cases that Mike mentions is use of the AnalyzingSuggester to suggest synonyms of terms entered by a user.

That presumes that you know the target of the search and likely synonyms that occur in it.

Use standard synonym sets and you will get standard synonym results.

Develop custom synonym sets and you can deliver time/resource saving results.

August 5, 2012

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required for users to introduce the new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.

July 17, 2012

If you are in Kolkata/Pune, India…a request.

Filed under: Search Engines,Synonymy,Word Meaning,XML — Patrick Durusau @ 1:55 pm

No email addresses are given for the authors of Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words, but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

Meaning of Web-page content plays a big role while produced a search result from a search engine. Most of the cases Web-page meaning stored in title or meta-tag area but those meanings do not always match with Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In such cases, where Webpage content holds dual meaning words that time it is really difficult to identify the meaning of the Web-page. In this paper, we are introducing a new design and development mechanism of identifying the Web-page content meaning which holds dual meaning words in their Web-page content.

May 16, 2012

Lucene-1622

Filed under: Indexing,Lucene,Synonymy — Patrick Durusau @ 9:32 am

Multi-word synonym filter (synonym expansion at indexing time) Lucene-1622

From the description:

It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).

The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):

  • if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match;
  • there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
  • if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won’t be found. Example “big apple” synonym for “new york city”. A phrase query “big apple restaurants” won’t match “new york city restaurants”.

I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.

This remains an open issue as of 16 May 2012.

It is also an important open issue.

Think about it.

As “big data” gets larger and larger, at some point traditional ETL isn’t going to be practical. Due to storage, performance, selective granularity or other issues, ETL is going to fade into the sunset.

Indexing, on the other hand, which treats data “in situ” (“in position” for you non-archaeologists in the audience), avoids many of the issues with ETL.

The treatment of synonyms, that is synonyms across data sets, multi-word synonyms, specifying the ranges of synonyms (both for indexing and search), synonym expansion, a whole range of synonyms features and capabilities, needs to “man up” to take on “big data.”

May 13, 2012

Synonyms in the TMDM Legend

Filed under: Synonymy,TMDM — Patrick Durusau @ 10:10 pm

I was going over some notes on synonyms this weekend when it occurred to me to ask:

How many synonyms does a topic item have in the TMDM legend?

A synonym being when one term can be freely substituted for another.

Not wanting to trust my memory, I quote from the TMDM legend (ISO/IEC 13250-2):

Two topic items are equal if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

The wording is a bit awkward for my point about synonyms but I take it that if two topic items had

at least one equal string in their [subject identifiers] properties,

I could substitute:

at least one equal string in their [item identifiers] properties, (in all relevant places)

and have the same effect.
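A minimal sketch of those five equality rules as a boolean test may make the substitution point clearer. The class and member names are mine, not the TMDM’s, and the [reified] comparison is reduced to plain object identity for illustration.

  import java.util.Collections;
  import java.util.Set;

  public class TopicItem {
    final Set<String> subjectIdentifiers;
    final Set<String> itemIdentifiers;
    final Set<String> subjectLocators;
    final Object reified; // the reified information item, if any

    TopicItem(Set<String> subjectIdentifiers, Set<String> itemIdentifiers,
              Set<String> subjectLocators, Object reified) {
      this.subjectIdentifiers = subjectIdentifiers;
      this.itemIdentifiers = itemIdentifiers;
      this.subjectLocators = subjectLocators;
      this.reified = reified;
    }

    // The five TMDM equality rules for topic items, as quoted above.
    static boolean equalTopics(TopicItem a, TopicItem b) {
      return !Collections.disjoint(a.subjectIdentifiers, b.subjectIdentifiers)
          || !Collections.disjoint(a.itemIdentifiers, b.itemIdentifiers)
          || !Collections.disjoint(a.subjectLocators, b.subjectLocators)
          || !Collections.disjoint(a.subjectIdentifiers, b.itemIdentifiers)
          || !Collections.disjoint(a.itemIdentifiers, b.subjectIdentifiers)
          || (a.reified != null && a.reified == b.reified);
    }

    public static void main(String[] args) {
      TopicItem t1 = new TopicItem(Set.of("http://example.org/psi/student"), Set.of(), Set.of(), null);
      TopicItem t2 = new TopicItem(Set.of(), Set.of("http://example.org/psi/student"), Set.of(), null);
      System.out.println(equalTopics(t1, t2)); // true: subject identifier equals item identifier
    }
  }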

I am going to be exploring the use of synonym based processing for TMDM governed topic maps.

Any thoughts or insights would be greatly appreciated.

November 4, 2011

bibleQuran: Comparing the Word Frequency between Bible and Quran

Filed under: Bible,Quran,Synonymy,Visualization — Patrick Durusau @ 6:10 pm

bibleQuran: Comparing the Word Frequency between Bible and Quran

From the post:

bibleQuran [pitchinteractive.com] by datavis design firm Pitch Interactive reveals the frequency of word usage between two of the most important holy books: the Bible and the Quran.

The densely populated interactive visualization allows people to search for any word (and similar variations of that word) to explore its frequency in both texts. As each verse is always visible, one is able to compare the relative density of ideas and topics between both passages. For instance, one could select verbs that represent acts of ‘terror’ or ‘love’, and investigate which book discusses the topics more. The appropriate little rectangles, each representing an according verse, which include this chosen word, are then highlighted, and can be read in detail by hovering the mouse over them.

In addition to being a great graphic presentation of information, with my background and appreciation for both texts, you know why I had to include this post.

I like the synonym feature, although I reserve judgment on what is considered a synonym. 😉 I would have to read the original. Translations of both texts are, well, translations. Not really the same text in a very real sense of the word.

Just as a suggestion, I would do the word count statistics separately for the Old/New Testament.

Word of warning: Loads great with Firefox (7.1) on Windows XP, doesn’t load with IE 8 on Windows XP, doesn’t load with Firefox (3.6) on Ubuntu 10.04. So, your experience may vary.

Comments from users with other browser/OS combinations?

August 11, 2011

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Filed under: Concept Detection,Synonymy,UIMA — Patrick Durusau @ 6:34 pm

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in a RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is an UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me partner!

Depends on where you put your emphasis.

October 24, 2010

Recognizing Synonyms

Filed under: Marketing,Subject Identity,Synonymy — Patrick Durusau @ 11:04 am

I saw a synonym that I recognized the other day and started wondering how I recognized it.

The word I had in mind was “student” and the synonym was “pupil.”

Attempts to recognize synonyms:

  • spelling: student, pupil – No.
  • length: student 7 letters, pupil 5 letters – No.
  • origin: student – late 14c., from O.Fr. estudient; pupil – from O.Fr. pupille (14c.) – No. [1]
  • numerology: student (a = 1, b = 2 …) student = 19 + 20 + 21 + 4 + 5 + 14 + 20 = 103; pupil = 16 + 21 + 16 + 9 + 12 = 74 – No [2]. (See the sketch after this list.)
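For the numerology bullet, a trivial sketch of the letter-sum computation (a = 1 through z = 26):

  public class LetterSum {
    // a = 1, b = 2, ... z = 26; non-letters are ignored.
    static int sum(String word) {
      int total = 0;
      for (char c : word.toLowerCase().toCharArray()) {
        if (c >= 'a' && c <= 'z') total += c - 'a' + 1;
      }
      return total;
    }

    public static void main(String[] args) {
      System.out.println(sum("student")); // 103
      System.out.println(sum("pupil"));   // 74
    }
  }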

But I know “student” and “pupil” to be synonyms.[3]

I could just declare them to be synonyms.

But then how do I answer questions like:

  • Why did I think “student” and “pupil” were synonyms?
  • What would make some other term a synonym of either “student” or “pupil?”
  • How can an automated system match my finding of more synonyms?

Provisional thoughts on answers to follow this week.

Questions:

Without reviewing my answers in this series, pick a pair of synonyms and answer those three questions for that pair. (There are different answers than mine.)

*****

[1] Synonym origins from: Online Etymology Dictionary

[2] There may be some Bible code type operation that can discover synonyms but I am unaware of it.

[3] They are synonyms now, that wasn’t always the case.
