Synonym search is certainly convenient. However, before users can benefit from it, an administrator has to provide the synonym dictionary (a CSV file) described above. New words are coined every day, and so are new synonyms. A synonym dictionary may have been prepared with great effort by the person in charge, but it is often left unmaintained as time goes by or when that person hands the job over to someone else.

This is why people start wishing for a way to create a synonym dictionary automatically. That demand drove me to write the system explained below. The system learns synonym knowledge from a “dictionary corpus” and writes “original word – synonym” pairs of high similarity to a CSV file, which can then be applied to the SynonymFilter of Lucene/Solr as is.
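
The output step can be sketched roughly as follows. This is only an illustration in Python, not the system's actual code: the function name `write_synonym_csv`, the `threshold` value of 0.8, and the toy scores are all assumptions made for the example. It keeps the pairs whose similarity is high enough and writes one “original word,synonym” row per pair.

```python
import csv
import io


def write_synonym_csv(candidates, out, threshold=0.8):
    """Write (original, synonym) pairs scoring at or above threshold as CSV rows.

    candidates: iterable of (original, synonym, score) tuples.
    The threshold of 0.8 is an arbitrary illustrative default.
    """
    writer = csv.writer(out, lineterminator="\n")
    written = 0
    for original, synonym, score in candidates:
        # Skip low-similarity pairs and trivial self-pairs.
        if score >= threshold and original != synonym:
            writer.writerow([original, synonym])
            written += 1
    return written


# Toy candidates with made-up scores, for illustration only.
candidates = [
    ("television", "tv", 0.93),
    ("television", "radio", 0.41),
    ("notebook", "laptop", 0.85),
]

buf = io.StringIO()  # stands in for an open file in this sketch
count = write_synonym_csv(candidates, buf)
print(buf.getvalue(), end="")
```

In real use, `buf` would be a file opened for writing, and the resulting file would be the one handed to the synonym filter.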

A “dictionary corpus” is a corpus whose entries consist of “keywords” and their “descriptions”. An electronic dictionary is exactly such a corpus, and so is Wikipedia, which is familiar to everyone and easily accessible.
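
To make the shape of such a corpus concrete, here is a minimal sketch in Python. The `Entry` class, its field names, and the sample entries are hypothetical and chosen only for illustration; the system's real data model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One dictionary-corpus entry: a keyword and the text describing it."""
    keyword: str
    description: str


# Hand-written sample entries, for illustration only.
corpus = [
    Entry("television", "A device, often abbreviated TV, that receives broadcast video and sound."),
    Entry("laptop", "A portable personal computer, also called a notebook."),
]


def descriptions_by_keyword(entries):
    """Index the corpus so a description can be looked up by its keyword."""
    return {entry.keyword: entry.description for entry in entries}


index = descriptions_by_keyword(corpus)
print(index["laptop"])
```

In this toy data the synonym candidates (“TV”, “notebook”) happen to appear inside the descriptions of their keywords.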

Let’s look at a method that uses the Japanese version of Wikipedia to acquire synonym knowledge automatically.

A more complex representation of synonyms, one that includes domain or scope, would be more robust.

On the other hand, some automatic generation of synonyms is better than no synonyms at all.

Take this as a good place to start, not as the final destination, for synonym generation.

