PLoS Biology Bigrams by Georg.
From the post:
Here I will use the Natural Language Toolkit and a recipe from Python Text Processing with NLTK 2.0 Cookbook to work out the most frequent bigrams in the PLoS Biology articles that I downloaded last year and have described in previous posts here and here.
The amusing twist in this blog post is that the most frequent bigram, after filtering out stopwords, is unpublished data.
Not a trivial data set: some 1,754 articles.
Do you see the flaw in concluding that most PLoS Biology articles use “unpublished data”?
First, without looking at the data, I would ask for the occurrence count of each of the top six bigrams. I suspect that “gene expression” is used frequently relative to the number of articles, but I can’t make that judgment with the information given.
Second, you would need to ask why an article used the bigram “unpublished data.”
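To make the first point concrete, here is a minimal sketch that reports two numbers per bigram: total occurrences and the number of articles containing it. It uses only the Python standard library as a stand-in, not the NLTK cookbook recipe from the original post. The `STOPWORDS` set, the function names, and the three-article corpus are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "in", "is", "was", "we", "on", "to"}

def bigrams(text):
    """Return adjacent word pairs, dropping any pair that contains a stopword."""
    # Note: this simple tokenizer ignores sentence boundaries.
    words = re.findall(r"[a-z]+", text.lower())
    return [(a, b) for a, b in zip(words, words[1:])
            if a not in STOPWORDS and b not in STOPWORDS]

def bigram_stats(articles):
    """Map each bigram to (total occurrences, number of articles containing it)."""
    total = Counter()
    doc_freq = Counter()
    for text in articles.values():
        pairs = bigrams(text)
        total.update(pairs)          # every occurrence counts
        doc_freq.update(set(pairs))  # each article counts at most once
    return {bg: (total[bg], doc_freq[bg]) for bg in total}

# Toy stand-in corpus (invented text, not the PLoS Biology articles).
articles = {
    "a1": "Unpublished data show gene expression. Unpublished data confirm it. "
          "Unpublished data again.",
    "a2": "Gene expression was measured.",
    "a3": "We profiled gene expression in yeast.",
}

stats = bigram_stats(articles)
for bg in [("unpublished", "data"), ("gene", "expression")]:
    occurrences, n_articles = stats[bg]
    print(" ".join(bg), occurrences, n_articles)
```

In this toy corpus both bigrams occur three times, but one of them appears in a single article. A ranking by raw frequency alone cannot show that difference.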
If I were writing a paper about papers that used “unpublished data” or more generally about “unpublished data,” I would use the bigram a lot. That would not mean my article was based on “unpublished data.”
NLTK can point you to the articles, but deeper analysis is going to require you.
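As a rough illustration of the "point you to the articles" step, the sketch below lists each article that contains a phrase, along with some surrounding text for a human to read. It is plain Python rather than NLTK, and the function name `phrase_in_context`, the `width` parameter, and the sample articles are assumptions made up for this example.

```python
import re

def phrase_in_context(articles, phrase, width=40):
    """Yield (article_id, snippet) for each case-insensitive match of phrase."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for article_id, text in articles.items():
        for match in pattern.finditer(text):
            # Keep up to `width` characters on either side of the match.
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            yield article_id, text[start:end]

# Invented sample text, not taken from the PLoS Biology articles.
articles = {
    "a1": "The mutant grows slowly (J. Doe, unpublished data).",
    "a2": "We survey how often authors cite unpublished data in biology papers.",
    "a3": "Gene expression was measured by microarray.",
}

hits = list(phrase_in_context(articles, "unpublished data"))
for article_id, snippet in hits:
    print(article_id, "...", snippet)
```

The first sample article cites unpublished data as a source, while the second is about the practice of citing it. The code finds both; telling them apart is the part that still needs a reader.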