Index and Search Multilingual Documents in Hadoop by Justin Kestelyn.
From the post:
Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.
Cloudera Search brings full-text, interactive search, and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS and Apache HBase, and other projects in CDH. Because it’s integrated with CDH, Cloudera Search brings the same fault tolerance, scale, visibility, and flexibility of your other Hadoop workloads to search, and allows for a number of indexing, access control, and manageability options.
In this post, you’ll learn how to use Cloudera Search and RBL-JE to index and search documents. Since Cloudera takes care of the plumbing for distributed search and indexing, the only work needed to incorporate Basis Technology’s linguistics is loading the software and configuring your Solr collections.
…
You may have guessed from the wording of the introduction that Rosette Base Linguistics isn’t free. I checked the website but found no pricing information. Not to mention that the coverage looks spotty:
- Arabic
- Chinese (simplified)
- Chinese (traditional)
- English
- Japanese
- Korean
If your multilingual needs fall in one or more of those languages, this may work for you.
On the other hand, for indexing and searching multilingual text, you should compare Solr, which has factories for the following languages:
- Arabic
- Brazilian Portuguese
- Bulgarian
- Catalan
- Chinese
- Simplified Chinese
- CJK
- Czech
- Danish
- Dutch
- Finnish
- French
- Galician
- German
- Greek
- Hebrew, Lao, Myanmar, Khmer
- Hindi
- Indonesian
- Italian
- Irish
- Kuromoji (Japanese)
- Latvian
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
- Thai
- Turkish
Source: Solr Wiki.
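For a sense of what "factories" means in practice, here is a minimal Python sketch that builds the JSON body of a Solr Schema API `add-field-type` command for a few of the languages listed. It assumes a Solr version with a managed schema. The field-type names (`text_ar` and so on) and the particular filter chains are my own illustrative choices, not something taken from the Solr Wiki page or the Cloudera post, so check them against the Solr release you actually run.

```python
import json

# Filter chains per language. The factory class names ship with Solr's
# analysis modules; which filters to combine, and in what order, is an
# assumption made for illustration only.
LANGUAGE_FILTERS = {
    "ar": [
        "solr.LowerCaseFilterFactory",
        "solr.ArabicNormalizationFilterFactory",
        "solr.ArabicStemFilterFactory",
    ],
    "de": [
        "solr.LowerCaseFilterFactory",
        "solr.GermanNormalizationFilterFactory",
        "solr.GermanLightStemFilterFactory",
    ],
    "fr": [
        "solr.ElisionFilterFactory",
        "solr.LowerCaseFilterFactory",
        "solr.FrenchLightStemFilterFactory",
    ],
}


def field_type_for(lang):
    """Return an add-field-type command (as a dict) for one language code."""
    filters = LANGUAGE_FILTERS[lang]
    return {
        "add-field-type": {
            # Hypothetical naming convention: text_<language code>.
            "name": "text_" + lang,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [{"class": name} for name in filters],
            },
        }
    }


if __name__ == "__main__":
    # Print the German definition as JSON, ready to adapt and review.
    print(json.dumps(field_type_for("de"), indent=2))
```

The same definitions can be written directly as `<fieldType>` elements in `schema.xml`; the point is only that each language gets its own analysis chain assembled from the stock factories.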