Calculating Word and N-Gram Statistics from the Gutenberg Corpus by Richard Marsden.
From the post:
Following on from the previous article about scanning text files for word statistics, I shall extend this to real, large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.
A “get your feet wet” sort of exercise with the script included.
The Gutenberg project isn’t “big data,” but it is more than your usual inbox.
Think of it as a chance to learn the data set before applying more sophisticated algorithms.
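If you want to get a feel for the idea before grabbing Marsden’s script, here is a minimal sketch in Python that tallies word and bigram counts across a directory of plain-text Gutenberg files. The directory name, tokenizer, and output format are my own assumptions for illustration, not the code from the post.

# Hedged sketch: count word and bigram frequencies over a directory of
# plain-text Gutenberg files. The path, tokenization rule, and output
# are assumptions for illustration, not the post's actual script.
import re
from collections import Counter
from pathlib import Path

CORPUS_DIR = Path("gutenberg_txt")   # hypothetical location of the .txt files
WORD_RE = re.compile(r"[a-z']+")     # crude lowercase-word tokenizer

word_counts = Counter()
bigram_counts = Counter()

for txt_file in CORPUS_DIR.rglob("*.txt"):
    text = txt_file.read_text(encoding="utf-8", errors="ignore").lower()
    tokens = WORD_RE.findall(text)
    word_counts.update(tokens)                       # unigram counts
    bigram_counts.update(zip(tokens, tokens[1:]))    # adjacent word pairs

# Show the 20 most frequent words and bigrams
for word, n in word_counts.most_common(20):
    print(f"{word}\t{n}")
for (w1, w2), n in bigram_counts.most_common(20):
    print(f"{w1} {w2}\t{n}")

Even at Gutenberg scale this fits comfortably in memory on a modern machine, which is part of why it makes a good warm-up corpus.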