Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 28, 2011

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9

Filed under: Data Mining, Dataset, Extraction — Patrick Durusau @ 7:05 pm

Parsing Wikipedia Articles: Wikipedia Extractor and Cloud9 by Ryan Rosario.

From the post:

Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties. Some of the data freely available from the Wikimedia Foundation include:

  • article content and template pages
  • article content with revision history (huge files)
  • article content including user pages and talk pages
  • redirect graph
  • page-to-page link lists: redirects, categories, image links, page links, interwiki links, etc.
  • image metadata
  • site statistics

The above resources are available not only for Wikipedia, but also for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.
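
The dumps themselves are just very large streams of MediaWiki XML, so you can start exploring one before reaching for any special tooling. A minimal sketch in Python, assuming a local pages-articles dump (the file name and the export-schema version in the XML namespace below are assumptions; check the <mediawiki> root element of your own dump):

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"  # hypothetical local path
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # version varies by dump

    # Stream pages incrementally instead of loading the whole dump into memory.
    with bz2.open(DUMP, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                print(title, len(text))
                elem.clear()  # drop the parsed subtree so memory stays flat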

All of that data is available, but it lacks any consistent syntax. Ryan stumbles upon Wikipedia Extractor, which has pluses and minuses, an example of the latter being that it is really slow. Things look up for Ryan when he is reminded of Cloud9, which is designed for a MapReduce environment.
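
For readers who do try Wikipedia Extractor: it typically writes its results as plain-text files of <doc ...> ... </doc> blocks under an output directory. A minimal sketch for walking that output, assuming the usual layout (the "extracted" directory and wiki_* file names here are assumptions about a default run, and the attribute order in the header line may differ across versions):

    import pathlib
    import re

    # Matches the header line Wikipedia Extractor writes for each article.
    DOC_RE = re.compile(r'<doc id="(\d+)" url="([^"]*)" title="([^"]*)">')

    def iter_docs(out_dir="extracted"):  # assumed output directory
        for path in sorted(pathlib.Path(out_dir).rglob("wiki_*")):
            doc_id = title = None
            body = []
            for line in path.open(encoding="utf-8"):
                m = DOC_RE.match(line)
                if m:
                    doc_id, _url, title = m.groups()
                    body = []
                elif line.startswith("</doc>"):
                    yield doc_id, title, "".join(body)
                else:
                    body.append(line)

    for doc_id, title, text in iter_docs():
        print(doc_id, title, len(text))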

Read the post to see how things turned out for Ryan using Cloud9.

Depending on your needs, Wikipedia URLs are a start on subject identifiers, although you will probably need to create some for your particular domain.
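
On that last point, a toy sketch of deriving a subject identifier from an article title, using the MediaWiki convention that spaces become underscores in page URLs (the choice of the English Wikipedia host is an assumption):

    from urllib.parse import quote

    def subject_identifier(title: str) -> str:
        # MediaWiki convention: spaces become underscores in page URLs.
        return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

    print(subject_identifier("Topic map"))  # .../wiki/Topic_map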
