FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 9, 2015

Filed under: Corpora,Corpus Linguistics,Linguistics — Patrick Durusau @ 3:16 pm

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts by Mark Davies.

From the post:

This announcement is for those who are interested in historical corpora and who may want a large dataset to work with on their own machine. This is a real corpus, rather than just n-grams (as with the Google Books n-grams; see a comparison at http://googlebooks.byu.edu/compare-googleBooks.asp).

We are pleased to announce that the Corpus of Historical American English (COHA; http://corpus.byu.edu/coha/) is now available in downloadable full-text format, for use on your own computer.
http://corpus.byu.edu/full-text/

COHA joins COCA and GloWbE, which have been available in downloadable full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more than 115,000 separate texts, covering fiction, popular magazines, newspaper articles, and non-fiction books from the 1810s to the 2000s (see http://corpus.byu.edu/full-text/coha_full_text.asp).

At 385 million words in size, the downloadable COHA corpus is much larger than any other structured historical corpus of English. With this large amount of data, you can carry out many types of research that would not be possible with much smaller 5-10 million word historical corpora of English (see http://corpus.byu.edu/compare-smallCorpora.asp).

The corpus is available in several formats: sentence/paragraph, PoS-tagged and lemmatized (one word per line), and for input into a relational database. Samples of each format (3.6 million words each) are available at the full-text website.

We hope that this new resource is of value to you in your research and teaching.

Mark Davies
Brigham Young University
http://davies-linguistics.byu.edu/
http://corpus.byu.edu/

I haven’t ever attempted a systematic ranking of American universities but in terms of contributions to the public domain in the humanities, Brigham Young is surely in the top ten (10), however you might rank the members of that group individually.

Correction: A comment pointed out that this data set is for sale and not in the public domain. My bad: I read the announcement and not the website. Still, given the amount of work required to create such a corpus, I don’t find the fees offensive.

Take the data set being formatted for input into a relational database as a reason for inputting it into a non-relational database.
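As a rough illustration of that suggestion, here is a minimal Python sketch that regroups flat, relational-style rows into one nested document per text, the shape a document store would typically accept. The four-column tab-separated layout (text ID, word, lemma, PoS tag), the function name rows_to_documents, and the sample rows are all assumptions made for the example; the actual COHA file layout is described on the full-text website and may well differ.

```python
import csv
import io
import json
from collections import defaultdict


def rows_to_documents(lines):
    """Group (text_id, word, lemma, pos) rows into one nested document per text.

    The four-column layout is a hypothetical stand-in, not the real COHA schema.
    """
    docs = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        text_id, word, lemma, pos = row
        docs[text_id].append({"word": word, "lemma": lemma, "pos": pos})
    # One document per text, in order of first appearance.
    return [{"text_id": tid, "tokens": toks} for tid, toks in docs.items()]


# Made-up sample rows for illustration only; not actual COHA data.
SAMPLE = "1001\tThe\tthe\tat\n1001\tships\tship\tnn2\n1002\tSailed\tsail\tvvd\n"

documents = rows_to_documents(io.StringIO(SAMPLE))
for doc in documents:
    print(json.dumps(doc))
```

Each printed line is a self-contained JSON document, which most non-relational stores can ingest directly. For the full corpus you would stream the file and write documents out in batches rather than hold everything in memory.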

Enjoy!

I first saw this in a tweet by the Linguist List (https://twitter.com/linguistlist).

2 Comments

They can’t be given any credit for putting it in the public domain. They are charging $250 for academic use and $800 for non-academic use.

Comment by dkincaid — March 24, 2015 @ 8:25 pm

I have corrected the post. I read the announcement and not the website, sorry. Good catch! Thanks!

Comment by Patrick Durusau — March 27, 2015 @ 9:46 am
