Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.
From the post:
I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).
My focus in this analysis is on the “what” and the “who”, i.e., what are the important ideas in this corpus and who are the principal players. For that I did the following:
- Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using these, I constructed a frequency distribution of words in the corpus; the most frequent words give us an idea of what is being discussed (a minimal sketch follows this list).
- Extracted the email (from, {to, cc, bcc}) pairs from MongoDB. Using these, I piggybacked on Scalding’s PageRank implementation to produce a list of email addresses ranked by PageRank. This gives us an idea of who the “important” players are (second sketch below).
- Using the triples extracted from Lucene, constructed (docID, term vector) tuples, then clustered the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally the plan was to use Mahout for the clustering, but I ended up using Weka instead (third sketch below).
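The post does not reproduce the aggregation code, but the frequency-distribution step amounts to summing the per-document frequencies for each term. A minimal plain-Scala sketch under stated assumptions (the file name and the tab-separated (term, docID, freq) layout are hypothetical):

```scala
import scala.io.Source

// Build a corpus-wide frequency distribution from per-document
// (term, docID, freq) triples and print the most frequent terms.
object TermFreqDist extends App {

  // Tab-separated triples, one per line: term \t docID \t freq (assumed layout)
  val triples = Source.fromFile("enron_triples.tsv").getLines()
    .map(_.split("\t"))
    .collect { case Array(term, _, freq) => (term, freq.toLong) }

  // Sum the per-document frequencies for each term
  val freqDist = triples.foldLeft(Map.empty[String, Long]) {
    case (acc, (term, freq)) => acc + (term -> (acc.getOrElse(term, 0L) + freq))
  }

  // The top terms hint at what the corpus is "about"
  freqDist.toSeq.sortBy(-_._2).take(50).foreach {
    case (term, total) => println(f"$term%-25s $total%d")
  }
}
```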
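For the PageRank step the post leans on Scalding’s bundled implementation; the power iteration it performs can be sketched in plain Scala as below. The edge file name, damping factor, and iteration count are assumptions, and dangling-node mass is ignored to keep the sketch short:

```scala
import scala.io.Source

// Power-iteration PageRank over the directed sender -> recipient graph.
// Assumes a tab-separated edge file of (from, to) email address pairs.
object EmailPageRank extends App {

  val edges = Source.fromFile("enron_edges.tsv").getLines()
    .map(_.split("\t"))
    .collect { case Array(from, to) => (from, to) }
    .toSeq

  val nodes = (edges.map(_._1) ++ edges.map(_._2)).distinct
  val outLinks: Map[String, Seq[String]] =
    edges.groupBy(_._1).map { case (from, tos) => from -> tos.map(_._2) }

  val damping = 0.85
  val n = nodes.size

  // Start from a uniform distribution and run a fixed number of iterations
  var ranks = nodes.map(_ -> 1.0 / n).toMap
  for (_ <- 1 to 20) {
    val contrib = scala.collection.mutable.Map.empty[String, Double].withDefaultValue(0.0)
    for (node <- nodes; targets <- outLinks.get(node); target <- targets)
      contrib(target) += ranks(node) / targets.size
    ranks = nodes.map(node => node -> ((1 - damping) / n + damping * contrib(node))).toMap
  }

  // The highest-ranked addresses approximate the "important" players
  ranks.toSeq.sortBy(-_._2).take(20).foreach {
    case (addr, rank) => println(f"$addr%-40s $rank%.6f")
  }
}
```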
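For the clustering step, a hedged sketch of driving Weka’s SimpleKMeans from Scala; the ARFF file name and the cluster count are assumptions, and the post’s actual Weka code may differ:

```scala
import weka.clusterers.SimpleKMeans
import weka.core.converters.ConverterUtils.DataSource

// Cluster document term vectors with Weka's SimpleKMeans.
// Assumes an ARFF file where each row is one document's term vector
// (one numeric attribute per term), exported from the Lucene triples.
object DocumentKMeans extends App {

  val data = new DataSource("enron_termvectors.arff").getDataSet()

  val kmeans = new SimpleKMeans()
  kmeans.setNumClusters(40)              // cluster count is an assumption
  kmeans.setPreserveInstancesOrder(true) // needed to read back assignments
  kmeans.buildClusterer(data)

  // Cluster index for each document, in input order
  val assignments = kmeans.getAssignments()
  assignments.zipWithIndex.take(10).foreach {
    case (cluster, docIdx) => println(s"doc #$docIdx -> cluster $cluster")
  }

  // toString reports cluster sizes and centroids, which summarize
  // the dominant terms in each cluster
  println(kmeans)
}
```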
I also wanted to get more familiar with Scalding beyond the basics I had covered before, so I used it where I would previously have used Hadoop directly. The rest of the code is in Scala, as usual.
Good practice for discovering the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.
The relationships between the players, and their self-serving accounts measured against the data set, will make an interesting topic map.