Using Latent Dirichlet Allocation to Categorize My Twitter Feed by Joseph Misiti.
From the post:
Over the past 3 years, I have tweeted about 4100 times, mostly URLS, and mostly about machine learning, statistics, big data, etc. I spent some time this past weekend seeing if I could categorize the tweets using Latent Dirichlet Allocation. For a great introduction to Latent Dirichlet Allocation (LDA), you can read the following link here. For the more mathematically inclined, you can read through this excellent paper which explains LDA in a lot more detail.
The first step to categorizing my tweets was pulling the data. I initially downloaded and installed Twython and tried to pull all of my tweets using the Twitter API, but then quickly realized there was an archive button under settings. So I stopped writing code and just double clicked the archive button. Apparently 4100 tweets is fairly easy to archive, because I received an email from Twitter within 15 seconds with a download link.
…
When you read Joseph’s post, note that he doesn’t use the content of his tweets but rather the content of the URLs he tweeted as the subject of the LDA analysis.
That is still a valid corpus for LDA analysis, but I would not characterize it as “categorizing” his tweet feed, meaning the tweets themselves. It is “categorizing” the content he tweeted about. Not the same thing.
It is a useful exercise because it applies LDA to a corpus with which you should be familiar: the materials you tweeted about.
Contrast that with running LDA on a corpus that is less well known to you, where you are reduced to running sanity checks with no real feel for the data.
It would be an interesting exercise to discover the top topics for the corpus you tweeted about (Joseph’s post) and also for the corpus of hashtags that you used in your tweets. Are they the same or different?
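As a starting point for that comparison, here is a minimal Python sketch of my own (it is not from Joseph's post). It assumes you already have your tweet texts as plain strings, builds a hashtag-only "document" per tweet, and offers a simple Jaccard overlap for comparing the top terms of two topics. The sample tweets, the `hashtag_docs` and `jaccard` helpers, and the choice of Jaccard overlap as the similarity measure are all assumptions for illustration. The LDA step itself is left out; the hashtag documents would be fed to whatever LDA implementation you prefer.

```python
import re
from collections import Counter

# Hypothetical sample tweets; a real run would load them from the Twitter archive.
tweets = [
    "Nice LDA intro http://example.com/lda #machinelearning #nlp",
    "Topic models at scale http://example.com/tm #NLP #bigdata",
    "No tags here, just a link http://example.com/x",
]

# Naive pattern: a '#' followed by word characters. It would also match
# URL fragments such as 'page#section', so strip URLs first if that matters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_docs(texts):
    """Return one list of lowercased hashtags per tweet, skipping tweets with none."""
    docs = ([tag.lower() for tag in HASHTAG.findall(text)] for text in texts)
    return [doc for doc in docs if doc]


def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term lists: 1.0 means identical sets, 0.0 disjoint."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


docs = hashtag_docs(tweets)
counts = Counter(tag for doc in docs for tag in doc)
```

Once you have top terms for a topic from each corpus, something like `jaccard(url_topic_terms, hashtag_topic_terms)` gives a rough answer to "same or different?", though hashtags are short and sparse, so expect noisier topics than from the linked pages.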
I first saw this in a tweet by Christophe Lalanne.