Extracting SVO Triples from Wikipedia by Sujit Pal.
From the post:
I recently came across this discussion (login required) on LinkedIn about extracting (subject, verb, object) (SVO) triples from text. Jack Park, owner of the SolrSherlock project, suggested using ReVerb to do this. I remembered an entertaining Programming Assignment from when I did the Natural Language Processing Course on Coursera, that involved finding spouse names from a small subset of Wikipedia, so I figured it would be interesting to try using ReVerb against this data.
This post describes that work. As before, given the difference between this and the “preferred” approach that the automatic grader expects, results are likely to be wildly off the mark. BTW, I highly recommend taking the course if you haven’t already, there are lots of great ideas in there. One of the ideas deals with generating “raw” triples, then filtering them using known (subject, object) pairs to find candidate verbs, then turning around and using the verbs to find unknown (subject, object) pairs.
So in order to find the known (subject, object) pairs, I decided to parse the Infobox content (the “semi-structured” part of Wikipedia pages). Wikipedia markup is a mini programming language in itself, so I went looking for some pointers on how to parse it (third party parsers or just ideas) on StackOverflow. Someone suggested using DBPedia instead, since they have already done the Infobox extraction for you. I tried both, and somewhat surprisingly, manually parsing Infobox gave me better results in some cases, so I describe both approaches below.
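To make the filtering idea from the quoted post concrete, here is a minimal Python sketch of the two passes it describes: use known (subject, object) pairs to pick out candidate verbs from the raw triples, then use those verbs to surface pairs that were not already known. It is not Sujit's code. The function names, the `min_count` threshold, and the placeholder triples are all assumptions made up for illustration, and the raw triples are assumed to already be in (subject, verb, object) form, as an extractor such as ReVerb would supply after some post-processing.

```python
from collections import Counter


def candidate_verbs(triples, known_pairs, min_count=1):
    """Return verbs that connect at least `min_count` known pairs.

    Pairs are compared without regard to order, on the assumption that
    the relation of interest (e.g. spouse) is symmetric.
    """
    known = {frozenset(pair) for pair in known_pairs}
    counts = Counter(
        verb for subj, verb, obj in triples
        if frozenset((subj, obj)) in known
    )
    return {verb for verb, n in counts.items() if n >= min_count}


def discover_pairs(triples, verbs, known_pairs):
    """Return (subject, object) pairs linked by a candidate verb
    that are not already in the known set."""
    known = {frozenset(pair) for pair in known_pairs}
    return {
        (subj, obj) for subj, verb, obj in triples
        if verb in verbs and frozenset((subj, obj)) not in known
    }


# Placeholder data, invented for this example only.
raw_triples = [
    ("Person A", "married", "Person B"),
    ("Person A", "was born in", "City X"),
    ("Person C", "married", "Person D"),
    ("Person E", "wrote", "Book Y"),
]
known_pairs = [("Person A", "Person B")]  # e.g. taken from an Infobox

verbs = candidate_verbs(raw_triples, known_pairs)
new_pairs = discover_pairs(raw_triples, verbs, known_pairs)
print(verbs)
print(new_pairs)
```

On real data you would probably raise `min_count` so that a verb has to link several known pairs before it is trusted, since a single coincidental match would otherwise pull in unrelated triples.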
As Sujit points out, you will want to go beyond Wikipedia with this technique but it is a good place to start!
If somebody does leak the Senate Report on CIA Torture, that would be a great text (hopefully the full version) to mine with such techniques.
Remembering that anonymity = no accountability.