From the post:
In this tutorial we are going to create a PageRanking for Wikipedia with the use of Hadoop. This was a good hands-on exercise to get started with Hadoop. The page ranking is not a new thing, but a suitable use case and way cooler than a word counter! The Wikipedia (en) has 3.7M articles at the moment and is still growing. Each article has many links to other articles. With those incoming and outgoing links we can determine which page is more important than others, which basically is what PageRanking does.
Excellent tutorial! A non-trivial data set, and it gets your hands dirty with Hadoop, one of the rising stars in data processing. What’s not to like?
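For readers who want the idea behind the quoted description before opening the tutorial, here is a minimal in-memory sketch of PageRank in Python. It is not the tutorial's Hadoop code. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are all assumptions chosen for illustration. The two commented phases only loosely mirror what a map step and a reduce step would do in a Hadoop job.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=20):
    """Iteratively rank pages from a dict of page -> list of linked pages.

    Illustrative sketch only: damping and iterations are assumed defaults.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # "Map" phase: each page hands an equal share of its rank
        # to every page it links to.
        contributions = defaultdict(float)
        dangling = 0.0  # rank held by pages with no outgoing links
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = ranks[page] / len(targets)
                for target in targets:
                    contributions[target] += share
            else:
                dangling += ranks[page]

        # "Reduce" phase: sum what each page received, spread the
        # dangling rank evenly, and apply the damping factor.
        ranks = {
            page: (1.0 - damping) / n
            + damping * (contributions[page] + dangling / n)
            for page in pages
        }
    return ranks
```

Calling `pagerank({"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A", "B"]})` on this made-up four-page graph returns a dict of scores that sum to 1, with pages that collect more incoming links scoring higher. At Wikipedia scale the link graph would not fit in a dict like this, which is where Hadoop comes in.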
Question: What other processing looks interesting for the Wiki pages?
The running time of some jobs would be short enough to plan a job at the start of class from live suggestions, run it during the presentation/lecture, and then present the results (or a post-mortem of the mistakes) after the break.
Now that would make an interesting class. Suggestions?