Mining Wikipedia with Hadoop and Pig for Natural Language Processing
One problem with after-the-fact assignment of semantics to text is that the volume of text involved is usually too great for manual annotation.
This post walks you through the alternative of using automated annotation based upon Wikipedia content.
From the post:
Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.
Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles refer to the entity classes we are interested in (e.g. persons, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the wiki markup formatting syntax).
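To make the idea concrete, here is a minimal sketch (not the post's actual Pig/Hadoop pipeline) of that conversion step: given a hypothetical mapping from Wikipedia article titles to entity classes, it strips `[[Target|label]]` wikilinks out of a sentence and records the linked spans as entity annotations over the resulting plain text.

```python
import re

# Hypothetical title-to-class mapping; in the real pipeline this would be
# derived from Wikipedia itself (e.g. categories or infoboxes).
ENTITY_CLASSES = {"Paris": "CITY", "France": "COUNTRY", "Victor Hugo": "PERSON"}

# Matches [[Target]] or [[Target|display label]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def annotate(sentence):
    """Return (plain_text, annotations) where each annotation is a
    (surface, start, end, entity_class) tuple over the plain text."""
    annotations = []
    out = []
    pos = 0        # cursor in the wiki-markup input
    plain_len = 0  # length of plain text emitted so far
    for m in WIKILINK.finditer(sentence):
        out.append(sentence[pos:m.start()])
        plain_len += m.start() - pos
        target = m.group(1)
        surface = m.group(2) or target  # display label, else article title
        if target in ENTITY_CLASSES:
            annotations.append(
                (surface, plain_len, plain_len + len(surface),
                 ENTITY_CLASSES[target]))
        out.append(surface)
        plain_len += len(surface)
        pos = m.end()
    out.append(sentence[pos:])
    return "".join(out), annotations

text, anns = annotate(
    "[[Victor Hugo]] lived in [[Paris]], the capital of [[France]].")
# text -> "Victor Hugo lived in Paris, the capital of France."
# anns -> spans tagged PERSON, CITY, COUNTRY
```

Running each sentence of the Wikipedia dump through a function like this is exactly the kind of embarrassingly parallel job that Hadoop and Pig are suited for.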
This is also an opportunity to try out cloud-based computing if you are so inclined.
Pingback by Introducing fise, the Open Source RESTful Semantic Engine « Another Word For It — October 22, 2011 @ 3:17 pm