On-demand Synonym Extraction Using Suffix Arrays

On-demand Synonym Extraction Using Suffix Arrays by Minoru Yoshida, Hiroshi Nakagawa, and Akira Terada. (Yoshida, M., Nakagawa, H. & Terada, A. (2013). On-demand Synonym Extraction Using Suffix Arrays. Information Extraction from the Internet. ISBN: 978-1463743994. iConcept Press. Retrieved from http://www.iconceptpress.com/books//information-extraction-from-the-internet/)

From the introduction:

The amount of electronic documents available on the World Wide Web (WWW) is continuously growing. The situation is the same in a limited part of the WWW, e.g., Web documents from specific web sites such as ones of some specific companies or universities, or some special-purpose web sites such as www.wikipedia.org, etc. This chapter mainly focuses on such a limited-size corpus. Automatic analysis of this large amount of data by text-mining techniques can produce useful knowledge that is not found by human efforts only.

We can use the power of on-memory text mining for such a limited-size corpus. Fast search for required strings or words available by putting whole documents on memory contributes to not only speeding up of basic search operations like word counting, but also making possible more complicated tasks that require a number of search operations. For such advanced text-mining tasks, this chapter considers the problem of extracting synonymous strings for a query given by users. Synonyms, or paraphrases, are words or phrases that have the same meaning but different surface strings. “HDD” and “hard drive” in documents related to computers and “BBS” and “message boards” in Web pages are examples of synonyms. They appear ubiquitously in different types of documents because the same concept can often be described by two or more expressions, and different writers may select different words or phrases to describe the same concept. In such cases, the documents that include the string “hard drive” might not be found by if the query “HDD” is used, which results in a drop in the coverage of the search system. This could become a serious problem, especially for searches of limited-size corpora. Therefore, being able to find such synonyms significantly improves the usability of various systems. Our goal is to develop an algorithm that can find strings synonymous with the user input. The applications of such an algorithm include augmenting queries with synonyms in information retrieval or text-mining systems, and assisting input systems by suggesting expressions similar to the user input.

The authors concede the results of their method are inferior to the best results of other synonym extraction methods but go on to say:

However, note that the main advantage of our method is not its accuracy, but its ability to extract synonyms of any query without a priori construction of thesauri or preprocessing using other linguistic tools like POS taggers or dependency parsers, which are indispensable for previous methods.

An important point to remember about all semantic technologies. How appropriate a technique is for your project depends on your requirements, not qualities of a technique in the abstract.

Technique N may not support machine reasoning but sending coupons to mobile phones “near” a restaurant doesn’t require that overhead. (Neither does standing outside the restaurant with flyers.)

Choose semantic techniques based on their suitability for your purposes.

One Response to “On-demand Synonym Extraction Using Suffix Arrays”

  1. […] Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity « On-demand Synonym Extraction Using Suffix Arrays […]