Using BerkeleyDB to Create a Large N-gram Table by Richard Marsden.
From the post:
Previously, I showed you how to create N-Gram frequency tables from large text datasets. Unfortunately, when used on very large datasets such as the English language Wikipedia and Gutenberg corpora, memory limitations limited these scripts to unigrams. Here, I show you how to use the BerkeleyDB database to create N-gram tables of these large datasets.
Large datasets such as the Wikipedia and Gutenberg English language corpora cannot be used to create N-gram frequency tables using the previous script due to the script’s large in-memory requirements. The solution is to create the frequency table as a disk-based dataset. For this, the BerkeleyDB database in key-value mode is ideal. This is an open source “NoSQL” library which supports a disk-based database and in-memory caching. BerkeleyDB can be downloaded from the Oracle website, and also ships with a number of Linux distributions, including Ubuntu. To use BerkeleyDB from Python, you will need the bsddb3 package. Python 2.* ships with the older bsddb module in its standard library, but bsddb3 is an additional download for Python 3 installations.
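To make the disk-based approach concrete, here is a minimal sketch of counting N-grams into an on-disk key-value store. It is not Richard's script: it uses Python's standard-library dbm module as a stand-in for BerkeleyDB so that it runs without extra installs, and the function names (ngrams, count_ngrams, read_counts) are made up for illustration. A BerkeleyDB handle opened in key-value mode should allow the same read, increment, and write-back pattern.

```python
import dbm
import os
import tempfile


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_ngrams(lines, db_path, n=2):
    """Accumulate N-gram counts in an on-disk key-value store.

    Assumption: dbm stands in for BerkeleyDB here. Keys and values
    are stored as bytes, so counts are kept as ASCII digits.
    """
    with dbm.open(db_path, "c") as db:
        for line in lines:
            # Naive tokenizer (lowercase, split on whitespace); N-grams
            # are not formed across line boundaries.
            tokens = line.lower().split()
            for gram in ngrams(tokens, n):
                key = gram.encode("utf-8")
                current = int(db[key]) if key in db else 0
                db[key] = str(current + 1).encode("ascii")


def read_counts(db_path):
    """Load the whole table into a dict (only sensible for small tables)."""
    with dbm.open(db_path, "r") as db:
        return {k.decode("utf-8"): int(db[k]) for k in db.keys()}


# Small demonstration in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "bigrams")
count_ngrams(["the cat sat on the mat", "the cat ran"], demo_path, n=2)
for gram, count in sorted(read_counts(demo_path).items()):
    print(gram, count)
```

Only the entry being updated has to be in memory at any moment, which is the point of moving the table to disk. The cost is one lookup and one write per N-gram, so on a real corpus the database's in-memory cache matters a great deal.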
Richard promises to make the resulting data sets available as an Azure service. Sample code, etc., will be posted to his blog.
Another Wikipedia-based analysis.