Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Ryan Rosario.
From the post:
Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include:
- article content and template pages
- article content with revision history (huge files)
- article content including user pages and talk pages
- redirect graph
- page-to-page link lists: redirects, categories, image links, page links, interwiki etc.
- image metadata
- site statistics
All of that is available, but the markup lacks any consistent use of syntax. Ryan stumbles upon Wikipedia Extractor, which has pluses and minuses, an example of the latter being that it is really slow. Things look up for Ryan when he is reminded of Cloud9, which is designed for a MapReduce environment.
Read the post to see how things turned out for Ryan using Cloud9.
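For a sense of what working with the XML dump involves before any markup cleanup, here is a minimal sketch that streams pages out of a MediaWiki export using only the Python standard library. It is not how Wikipedia Extractor or Cloud9 works. The function names `local_name` and `iter_pages` and the tiny `SAMPLE` document are made up for illustration, and the sketch assumes a dump with one revision per page (as in the current-articles dumps).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Hypothetical helper: assumes one <revision> per <page>; with full
    revision history only the last <text> seen in each page is returned.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


# Illustrative two-page document, not real dump content.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Topic map</title>
    <revision><text>A '''topic map''' is a [[standard]].</text></revision>
  </page>
  <page>
    <title>MapReduce</title>
    <revision><text>'''MapReduce''' is a programming model.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, "|", wikitext)
```

Note that the text comes back as raw wiki markup (the bold quotes and double-bracket links are still there), which is exactly the inconsistent syntax the tools in the post try to strip away.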
Depending on your needs, Wikipedia URLs are a start on subject identifiers, although you will probably need to create some for your particular domain.
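As a rough illustration of treating Wikipedia URLs as subject identifiers, the sketch below turns an article title into a URL. The helper `wikipedia_url` is hypothetical, and the normalization it applies (spaces to underscores, first letter capitalized, percent-encoding of the rest) is an assumption based on how English Wikipedia titles commonly appear in URLs, so check the results against the live site before relying on them.

```python
from urllib.parse import quote


def wikipedia_url(title, lang="en"):
    """Build a candidate article URL from a page title (hypothetical helper)."""
    # Assumed normalization: trim, use underscores, capitalize first letter.
    normalized = title.strip().replace(" ", "_")
    normalized = normalized[:1].upper() + normalized[1:]
    # Percent-encode anything outside the URL-safe set (UTF-8 by default).
    return "https://" + lang + ".wikipedia.org/wiki/" + quote(normalized)


print(wikipedia_url("topic map"))
print(wikipedia_url("Café", lang="fr"))
```

Subjects in your own domain that have no Wikipedia article would still need identifiers minted some other way.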