ETS Corpus of Non-Native Written English by Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. (Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.)
From the webpage:
ETS Corpus of Non-Native Written English was developed by Educational Testing Service and is comprised of 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.
The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.
This is a data set for detecting the native language of authors writing in English. That is not unlike the subject of the post earlier today on LDA, which attempts to detect the topics that are (allegedly) behind the words in a text.
I mention that because some CS techniques start with the premise that words are indirect representatives of something hidden, while other parts of CS, search for example, presume that words have no depth, only surface. The Google Books Ngram Viewer makes the latter assumption.
The Ngram Viewer makes no distinction among the different uses of these words:
- awful
- backlog
- bad
- cell
- fantastic
- gay
- rubbers
- tool
Some have changed meaning recently; others, not quite so recently.
This is a partial list from a common resource: "These 12 Everyday Words Used To Have Completely Different Meanings." Imagine doing the historical research needed to place words in their particular social context.
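To make the surface-only assumption concrete, here is a minimal sketch in Python. It is not how the Ngram Viewer is implemented. The sentences, years, and sense labels are all invented for illustration, and the names `surface_counts` and `sense_counts` are hypothetical. It only shows what is lost when a count is keyed on the word form alone.

```python
from collections import Counter
import re

# Hypothetical example sentences (invented, not drawn from any corpus).
sentences = [
    ("1890", "the prisoner paced his cell"),
    ("1890", "an awful majesty filled the hall"),
    ("2010", "she left her cell on the table"),
    ("2010", "the traffic was awful"),
]

def surface_counts(texts):
    """Count lowercase word tokens, ignoring sense and context."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

# Surface view: one bucket per word form, whatever the word meant.
counts = surface_counts(text for _, text in sentences)

# Hand-assigned sense labels (an assumption for this sketch): the
# hidden layer that a surface count throws away.
labeled = [
    ("cell", "prison room"),
    ("awful", "awe-inspiring"),
    ("cell", "mobile phone"),
    ("awful", "very bad"),
]
sense_counts = Counter(labeled)

print(counts["cell"])  # both uses land in a single bucket
print(sorted(sense for word, sense in sense_counts if word == "cell"))
```

The surface counter reports two occurrences of "cell" and has no way to say they are different things. The labeled counter keeps them apart, but only because someone did the work of assigning senses.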
It may be necessary for some purposes to presume words are shallow, but always remember that is a presumption and not a truth.
I first saw this in a tweet by Christopher Phipps.