Saturday, November 17th, 2012

Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval by Zheng Ye, Jimmy Xiangji Huang, Ben He, Hongfei Lin.


Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality.
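To make the two applications in the abstract concrete, here is a minimal sketch in Python. It is not the authors' method: the dictionary entries, the weights, and the names `CLAD`, `expand_query`, and `filter_candidates` are all invented for illustration, and a plain association weight stands in for the translation probability the paper uses.

```python
from collections import defaultdict

# Toy cross-language association dictionary (hypothetical data, not from the paper):
# English term -> {German term: association weight}
CLAD = {
    "bank": {"Bank": 0.62, "Geldinstitut": 0.21, "Ufer": 0.09, "Sitzbank": 0.02},
    "loan": {"Kredit": 0.55, "Darlehen": 0.30, "Leihgabe": 0.04},
}


def expand_query(terms, clad, top_k=2):
    """Cross-language query expansion: for each source term, add its
    top_k most strongly associated target-language terms."""
    weights = defaultdict(float)
    for term in terms:
        ranked = sorted(clad.get(term, {}).items(),
                        key=lambda kv: kv[1], reverse=True)
        for target, weight in ranked[:top_k]:
            weights[target] += weight
    return dict(weights)


def filter_candidates(candidates, term, clad, min_weight=0.05):
    """Drop translation candidates whose weight for `term` is below
    min_weight (an assumed threshold, chosen only for this example)."""
    scores = clad.get(term, {})
    return [c for c in candidates if scores.get(c, 0.0) >= min_weight]


if __name__ == "__main__":
    print(expand_query(["bank", "loan"], CLAD))
    print(filter_candidates(["Bank", "Ufer", "Sitzbank"], "bank", CLAD))
```

The same lookup table serves both uses: expansion reads the strongest associations, filtering discards the weakest.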

Is there a lesson here about using Wikipedia as a starter set of topics across languages?

Not the final product, but a starting place other than ground zero for creating a multilingual topic map.

Sunday, May 13th, 2012

Multilingual Natural Language Processing Applications: From Theory to Practice by Daniel Bikel and Imed Zitouni.

From the description:

Multilingual Natural Language Processing Applications is the first comprehensive single-source guide to building robust and accurate multilingual NLP systems. Edited by two leading experts, it integrates cutting-edge advances with practical solutions drawn from extensive field experience.

Part I introduces the core concepts and theoretical foundations of modern multilingual natural language processing, presenting today’s best practices for understanding word and document structure, analyzing syntax, modeling language, recognizing entailment, and detecting redundancy.

Part II thoroughly addresses the practical considerations associated with building real-world applications, including information extraction, machine translation, information retrieval/search, summarization, question answering, distillation, processing pipelines, and more.

This book contains important new contributions from leading researchers at IBM, Google, Microsoft, Thomson Reuters, BBN, CMU, University of Edinburgh, University of Washington, University of North Texas, and others.

Coverage includes

  • Core NLP problems, and today’s best algorithms for attacking them
  • Processing the diverse morphologies present in the world’s languages
  • Uncovering syntactical structure, parsing semantics, using semantic role labeling, and scoring grammaticality
  • Recognizing inferences, subjectivity, and opinion polarity
  • Managing key algorithmic and design tradeoffs in real-world applications
  • Extracting information via mention detection, coreference resolution, and events
  • Building large-scale systems for machine translation, information retrieval, and summarization
  • Answering complex questions through distillation and other advanced techniques
  • Creating dialog systems that leverage advances in speech recognition, synthesis, and dialog management
  • Constructing common infrastructure for multiple multilingual text processing applications

This book will be invaluable for all engineers, software developers, researchers, and graduate students who want to process large quantities of text in multiple languages, in any environment: government, corporate, or academic.

Monday, June 6th, 2011

Collocated with the 10th International Semantic Web Conference (ISWC2011) in Bonn, Germany.

Important Dates

August 15th – submission deadline
September 5th – notification
September 10th – camera-ready deadline
October 23rd or 24th – workshop


Given the substantial growth of Web users that create and update knowledge all over the world in languages other than English, multilingualism has become an issue of major interest for the Semantic Web community. This process has been accelerated due to initiatives such as the Linked Data project, which encourages not only governments and public institutes to make their data available to the public, but also private organizations in domains such as medicine, geography, music etc. These actors often publish their data sources in their respective languages, and as such, in order to make this information interoperable and accessible to members of other linguistic communities, multilingual knowledge representation, access and translation are an impending need.

Items of special focus:

  • representation of multilingual information and language resources in Semantic Web and Linked Data formats
  • cross-lingual discovery and representation of mappings between multilingual Linked Data vocabularies and datasets
  • cross-lingual querying of knowledge repositories and Linked Data
  • machine translation and localization strategies for the Semantic Web

The first three are classic topic map fare and the last one isn’t that much of a reach.