Download Google n gram data set and neo4j source code for storing it by René Pickhardt.
From the post:
At the end of September I discovered an amazing data set which is provided by Google! It is called the Google n gram data set. Even though the English Wikipedia article about ngrams needs some cleanup, it explains nicely what an ngram is. http://en.wikipedia.org/wiki/N-gram
The data set is available in several languages and I am sure it is very useful for many tasks in web retrieval, data mining, information retrieval and natural language processing.
I forwarded this data set to two high school students whom I was teaching last summer at the dsa. Now they are working on a project for a German student competition. They are using the n-grams and neo4j to predict sentences and help people to improve typing.
The idea is that once a user has started to type a sentence, statistics about the n-grams can be used to predict a semantically and syntactically plausible next word, and in this way increase the speed of typing by making suggestions to the user. This will be particularly useful with all these mobile devices where typing is really annoying.
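To make the prediction idea concrete, here is a minimal sketch in Python. It is not the students' code and it does not use neo4j or the Google data set: it assumes a tiny hand-written corpus, counts bigrams (2-grams) in memory, and suggests the most frequent followers of the last typed word. The function names build_bigram_counts and suggest_next are made up for this illustration.

```python
from collections import Counter, defaultdict


def build_bigram_counts(sentences):
    """Count how often each word follows another (bigram statistics)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with its successor: (w0, w1), (w1, w2), ...
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts


def suggest_next(counts, prev_word, k=3):
    """Return up to k words that most often followed prev_word."""
    followers = counts.get(prev_word.lower())
    if not followers:
        return []  # word never seen, so there is nothing to suggest
    return [word for word, _ in followers.most_common(k)]


# Toy corpus standing in for real n-gram counts (an assumption).
corpus = [
    "I am going to the store",
    "I am going to the park",
    "I am going home",
    "we are going to the store",
]

counts = build_bigram_counts(corpus)
print(suggest_next(counts, "going"))
print(suggest_next(counts, "the"))
```

With real data the counts would come from the published n-gram files rather than from raw sentences, and longer n-grams (three or more words of context) would presumably give better suggestions than the single word of context used here.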
Now, imagine having users mark subjects in texts (highlighting?) and using a sufficient number of such texts to automate the recognition of subjects and their relationships to document authors and other subjects. Does that sound like an easy way to author an ongoing topic map based on the output of an office? Or project?
Feel free to mention at least one prior project that used a very similar technique with texts. If no one does, I will post its name and links to the papers tomorrow.