Archive for the ‘Corpus Linguistics’ Category

Corpus of American Tract Society Publications

Friday, September 11th, 2015

Corpus of American Tract Society Publications by Lincoln Mullen.

From the post:

I’ve created a small to mid-sized corpus of publications by the American Tract Society up to the year 1900 in plain text. This corpus has been gathered from the Internet Archive. It includes 641 documents containing just under sixty million words, along with a CSV file containing metadata for each of the files. I don’t make any claims that this includes all of the ATS publications from that time period, and it is pretty obvious that the metadata from the Internet Archive is not much good. The titles are mostly correct; the dates are pretty far off in cases.

This corpus was created for the purpose of testing document similarity and text reuse algorithms. I need a corpus for testing the textreuse, which is in very early stages of development. From reading many, many of these tracts, I already know the patterns of text reuse. (And of course, the documents are historically interesting in their own right, and might be a good candidate for text mining.) The ATS frequently republished tracts under the same title. Furthermore, they published volumes containing the entire series of tracts that they had published individually. So there are examples of entire documents which are reprinted, but also some documents which are reprinted inside others. Then as an extra wrinkle, the corpus contains the editions of the Bible published by the ATS, plus their edition of Cruden’s concordance and a Bible dictionary. Likely all of the tracts quote the Bible, some at great length, so there are many examples of borrowing there.

Here is the corpus and its repository:

With the described repetition, the corpus must compress well. 😉

Makes me wonder how much near-repetition occurs in CS papers?

Graph papers that repeat graph fundamentals, in nearly the same order, in paper after paper.

At what level would you measure re-use? Sentence? Paragraph? Larger divisions?
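To make the question concrete, here is a minimal sketch (plain Python, not Mullen's textreuse code) that scores overlap between two passages using word n-gram shingles and Jaccard similarity. The function names and the default of five-word shingles are assumptions chosen for illustration; the same functions can be pointed at sentences, paragraphs, or whole documents to compare levels of re-use.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < n:
        # Too short for a full shingle: treat the whole unit as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_score(unit_a, unit_b, n=5):
    """Score re-use between two text units (sentence, paragraph, document)."""
    return jaccard(shingles(unit_a, n), shingles(unit_b, n))
```

Very short units produce only a handful of shingles, so scores at the sentence level can swing widely; that may be an argument for measuring at the paragraph level or above.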

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

Monday, March 9th, 2015

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts by Mark Davies.

From the post:

This announcement is for those who are interested in historical corpora and who may want a large dataset to work with on their own machine. This is a real corpus, rather than just n-grams (as with the Google Books n-grams; see a comparison at

We are pleased to announce that the Corpus of Historical American English (COHA; is now available in downloadable full-text format, for use on your own computer.

COHA joins COCA and GloWbE, which have been available in downloadable full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more than 115,000 separate texts, covering fiction, popular magazines, newspaper articles, and non-fiction books from the 1810s to the 2000s (see

At 385 million words in size, the downloadable COHA corpus is much larger than any other structured historical corpus of English. With this large amount of data, you can carry out many types of research that would not be possible with much smaller 5-10 million word historical corpora of English (see

The corpus is available in several formats: sentence/paragraph, PoS-tagged and lemmatized (one word per line), and for input into a relational database. Samples of each format (3.6 million words each) are available at the full-text website.

We hope that this new resource is of value to you in your research and teaching.

Mark Davies
Brigham Young University

I haven’t ever attempted a systematic ranking of American universities but in terms of contributions to the public domain in the humanities, Brigham Young is surely in the top ten (10), however you might rank the members of that group individually.

Correction: A comment pointed out that this data set is for sale and not in the public domain. My bad, I read the announcement and not the website. Still, given the amount of work required to create such a corpus, I don’t find the fees offensive.

Take the data set being formatted for input into a relational database as a reason for inputting it into a non-relational database.


I first saw this in a tweet by the List.

WebCorp Linguist’s Search Engine

Saturday, November 22nd, 2014

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

Synchronic English Web Corpus 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification. About

Diachronic English Web Corpus 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About

Birmingham Blog Corpus 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments. About

Anglo-Norman Correspondence Corpus A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation. About

Novels of Charles Dickens A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss subject around on this blog you would think it has only one meaning. Not so as shown by the first twenty “hits” on subject in the Synchronic English Web Corpus:

 1  Service agencies.  'Merit' is subject to various interpretations depending
 2             amount of oxygen a subject breathes in," he says, "
 3                 to work on the subject again next month "to
 4         of Durham degrees were subject to a religion test
 5         London, which were not subject to any religious test,
 6     cited researchers in broad subject categories in life sciences,
 7  Losing Weight.  Broaching the subject of weight can be
 8  by survey respondents include subject and curriculum, assessment, pastoral,
 9     knowledge in teachers' own subject area, the use of
10    each addressing a different subject and how citizenship and
11          and school staff, but subject to that they dismissed
12            expressed but it is subject to the qualifications set
13             last piece on this subject was widely criticised and
14   saw themselves as foreigners subject to oppression by the
15      to suggest that, although subject to similar experiences, other
16            since you raise the subject, it's notable that very
17     position of the privileged subject with their disorderly emotions
18      Jimmy may include radical subject matter in his scripts,
19        more than sufficient as subject matter and as an
20           the NATO script were subject to personal attacks from

There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.
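For readers who have not met the KWIC (keyword in context) format used in the hits above, the idea can be sketched in a few lines of Python. This is only an illustration of the display format, not how WebCorp works internally; the function names and the character-based context width are my assumptions.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, match, right) tuples for each whole-word hit of keyword.

    Context is cut by character count, so words at the edges may be truncated.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


def format_kwic(hits, width=30):
    """Right-align the left context so the keyword lines up in one column."""
    return [f"{left:>{width}}{kw}{right}" for left, kw, right in hits]
```

Calling `format_kwic(kwic(text, "subject"))` on any plain-text string gives lines with the keyword aligned in a single column, which is what makes the different senses easy to scan.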

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

Free for personal/academic work, commercial use requires a license.

I first saw this in a tweet by Claire Hardaker.

Non-Native Written English

Wednesday, June 18th, 2014

ETS Corpus of Non-Native Written English by Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. (Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.)

From the webpage:

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and is comprised of 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.

The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.

A data set for detecting the native language of authors writing in English. Not unlike the post earlier today on LDA, which attempts to detect topics that are (allegedly) behind words in a text.

I mention that because some CS techniques start with the premise that words are indirect representatives of something hidden, while other parts of CS, search for example, presume that words have no depth, only surface. The Google books N-Gram Viewer makes that assumption.

The N-Gram Viewer makes no distinction between any use of these words:

  • awful
  • backlog
  • bad
  • cell
  • fantastic
  • gay
  • rubbers
  • tool

Some have changed meaning recently, others, not quite so recently.

This is a partial list from a common resource: These 12 Everyday Words Used To Have Completely Different Meanings. Imagine if you did the historical research to place words in their particular social context.

It may be necessary for some purposes to presume words are shallow, but always remember that is a presumption and not a truth.

I first saw this in a tweet by Christopher Phipps.

The IMS Open Corpus Workbench (CWB)

Monday, October 7th, 2013

The IMS Open Corpus Workbench (CWB)

From the webpage:

The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

The first official open-source release of the Corpus Workbench (Version 3.0) is now available from this website. While many pages are still under construction, you can download release versions of the CWB, associated software and sample corpora. You will also find some documentation and other information in the different sections of this site.

If you are investigating large amounts of text, this may be the tool for you.

BTW, don’t miss: Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium by Stefan Evert and Andrew Hardie.


Corpus Workbench (CWB) is a widely-used architecture for corpus analysis, originally designed at the IMS, University of Stuttgart (Christ 1994). It consists of a set of tools for indexing, managing and querying very large corpora with multiple layers of word-level annotation. CWB’s central component is the Corpus Query Processor (CQP), an extremely powerful and efficient concordance system implementing a flexible two-level search language that allows complex query patterns to be specified both at the level of an individual word or annotation, and at the level of a fully- or partially-specified pattern of tokens. CWB and CQP are commonly used as the back-end for web-based corpus interfaces, for example, in the popular BNCweb interface to the British National Corpus (Hoffmann et al. 2008). CWB has influenced other tools, such as the Manatee software used in SketchEngine, which implements the same query language (Kilgarriff et al. 2004).

This paper details recent work to update CWB for the new century. Perhaps the most significant development is that CWB version 3 is now an open source project, licensed under the GNU General Public Licence. This change has substantially enlarged the community of developers and users and has enabled us to leverage existing open-source libraries in extending CWB’s capabilities. As a result, several key improvements were made to the CWB core: (i) support for multiple character sets, most especially Unicode (in the form of UTF-8), allowing all the world’s writing systems to be utilised within a CWB-indexed corpus; (ii) support for powerful Perl-style regular expressions in CQP queries, based on the open-source PCRE library; (iii) support for a wider range of OS platforms including Mac OS X, Linux, and Windows; and (iv) support for larger corpus sizes of up to 2 billion words on 64-bit platforms.

Outside the CWB core, a key concern is the user-friendliness of the interface. CQP itself can be daunting for beginners. However, it is common for access to CQP queries to be provided via a web-interface, supported in CWB version 3 by several Perl modules that give easy access to different facets of CWB/CQP functionality. The CQPweb front-end (Hardie forthcoming) has now been adopted as an integral component of CWB. CQPweb provides analysis options beyond concordancing (such as collocations, frequency lists, and keywords) by using a MySQL database alongside CQP. Available in both the Perl interface and CQPweb is the Common Elementary Query Language (CEQL), a simple-syntax set of search patterns and wildcards which puts much of the power of CQP in a form accessible to beginning students and non-corpus-linguists.

The paper concludes with a roadmap for future development of the CWB (version 4 and above), with a focus on even larger corpora, full support for XML and dependency annotation, new types of query languages, and improved efficiency of complex CQP queries. All interested users are invited to help us shape the future of CWB by discussing requirements and contributing to the implementation of these features.
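The "two-level" idea in the abstract (constraints on a single token, plus a pattern over a sequence of tokens) can be sketched in Python. This is a toy model and not CQP or its implementation: the token dictionaries, attribute names, and function names are all assumptions for illustration. The example pattern is in the spirit of a CQP query such as `[pos="JJ.*"] [word="subject"]`.

```python
import re


def token_matches(token, constraints):
    """Level 1: every attribute regex must fully match the token's value."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraints.items()
    )


def find_sequences(tokens, pattern):
    """Level 2: return start positions where consecutive tokens match pattern.

    tokens  -- list of dicts, one per token, e.g. {"word": "subject", "pos": "NN"}
    pattern -- list of dicts mapping attribute name to a regular expression
    """
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(start)
    return hits


# Hypothetical annotated tokens for "radical subject matter"
tokens = [
    {"word": "radical", "pos": "JJ"},
    {"word": "subject", "pos": "NN"},
    {"word": "matter", "pos": "NN"},
]
# An adjective followed by the word "subject"
pattern = [{"pos": "JJ.*"}, {"word": "subject"}]
```

A linear scan like this would not scale to billions of words; the point of CWB's indexing is to avoid exactly that.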

I have been using some commercial concordance software recently on standards drafts.

I need to give the IMS Open Corpus Workbench (CWB) a spin.

I would not worry about the 2 billion word corpus limitation.

That’s approximately 3,333.33 times the number of words in War and Peace by Leo Tolstoy. (I rounded the English translation word count up to 600,000 for an even number.)

GroningenMeaningBank (GMB)

Thursday, April 11th, 2013

GroningenMeaningBank (GMB)

From the “about” page:

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB, texts are annotated rather than isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.


The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE’s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise?

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises?

…Wikilinks Corpus With 40M Mentions And 3M Entities

Saturday, March 9th, 2013

Google Research Releases Wikilinks Corpus With 40M Mentions And 3M Entities by Frederic Lardinois.

From the post:

Google Research just launched its Wikilinks corpus, a massive new data set for developers and researchers that could make it easier to add smart disambiguation and cross-referencing to their applications. The data could, for example, make it easier to find out if two web sites are talking about the same person or concept, Google says. In total, the corpus features 40 million disambiguated mentions found within 10 million web pages. This, Google notes, makes it “over 100 times bigger than the next largest corpus,” which features fewer than 100,000 mentions.

For Google, of course, disambiguation is something that is a core feature of the Knowledge Graph project, which allows you to tell Google whether you are looking for links related to the planet, car or chemical element when you search for ‘mercury,’ for example. It takes a large corpus like this one and the ability to understand what each web page is really about to make this happen.

Details follow on how to create this data set.

Very cool!

The only caution is that your entities, those specific to your enterprise, are unlikely to appear, even in 40M mentions.

But the Wikilinks Corpus + your entities, now that is something with immediate ROI for your enterprise.

Using information retrieval technology for a corpus analysis platform

Wednesday, September 26th, 2012

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.


This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Being mindful that for topic maps, the annotation layers and annotations themselves may be subjects for some purposes.
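To show what "multiple annotation layers" can mean for an index, here is a minimal Python sketch of a positional inverted index keyed by layer and value. It is not Lucene's API and not KorAP's design, which the abstract does not spell out; the class name, method names, and layer names are assumptions for illustration only.

```python
from collections import defaultdict


class LayeredIndex:
    """Toy positional index: (layer, value) -> list of (doc_id, position)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, position, annotations):
        """Index one token position under every annotation layer it carries."""
        for layer, value in annotations.items():
            self.postings[(layer, value)].append((doc_id, position))

    def lookup(self, layer, value):
        """All (doc_id, position) pairs carrying this value on this layer."""
        return set(self.postings.get((layer, value), []))

    def lookup_all(self, **constraints):
        """Positions where every layer=value constraint holds at once."""
        sets = [self.lookup(layer, value) for layer, value in constraints.items()]
        return set.intersection(*sets) if sets else set()
```

Because each layer is just another key, a new annotation layer (one reader's "subjects", say) can be added later without touching what is already indexed.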

Gold Standard (or Bronze, Tin?)

Sunday, August 19th, 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.



We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.


Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.


The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article I discovered first and then worked my way back to from BioNLP.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.
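Since the abstract leans on "high inter-annotator agreement," it may help to see how such agreement is commonly quantified. The sketch below computes Cohen's kappa for two annotators in Python. I am not claiming this is the measure the CRAFT authors used; it is one standard choice, and the function name and sample labels are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with
    # their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Even a high kappa only says the humans agreed with each other, which is the point of the reminder above.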

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pick up the other resources as appropriate.

GATE Teamware: Collaborative Annotation Factories (HOT!)

Wednesday, May 9th, 2012

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but…, I have been a sysadmin before. Maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from

Not saying it will or won’t meet your particular needs, but it is certainly worth a “look see.”

Let me know if you take the plunge!

The communicative function of ambiguity in language

Friday, January 20th, 2012

The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily and Edward Gibson. (Cognition, 2011) (PDF file)


We present a general information-theoretic argument that all efficient communication systems will be ambiguous, assuming that context is informative about meaning. We also argue that ambiguity allows for greater ease of processing by permitting efficient linguistic units to be re-used. We test predictions of this theory in English, German, and Dutch. Our results and theoretical analysis suggest that ambiguity is a functional property of language that allows for greater communicative efficiency. This provides theoretical and empirical arguments against recent suggestions that core features of linguistic systems are not designed for communication.

This is a must read paper if you are interested in ambiguity and similar issues.

At page 289, the authors report:

These findings suggest that ambiguity is not enough of a problem to real-world communication that speakers would make much effort to avoid it. This may well be because actual language in context provides other information that resolves the ambiguities most of the time.

I don’t know if our communication systems are efficient or not but I think the phrase “in context” is covering up a very important point.

Our communication systems came about in very high-bandwidth circumstances. We were in the immediate presence of a person speaking. With all the context that provides.

Even if we accept an origin of language of say 200,000 years ago, written language, which provides the basis for communication without the presence of another person, emerges only in the last five or six thousand years. Just to keep it simple, 5 thousand years would be 2.5% of the entire history of language.

So for 97.5% of the history of language, it has been used in a high bandwidth situation. No wonder it has yet to adapt to narrow bandwidth situations.

If writing puts us into a narrow bandwidth situation, and into ambiguity, where does that leave our computers?

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus

Wednesday, December 14th, 2011

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus by Tao Chen, Min-Yen Kan.


Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data has not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors’ concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their sender as possible, so as to enable different types of analyses. To date, we have collected about 60,000 messages, focusing on English and Mandarin Chinese.

A unique and publicly available corpus of material.

Your average marketing company might not have an SMS corpus for you to work with but I can think of some other organizations that do. 😉 Train on this one to win your spurs.

The web of topics: discovering the topology of topic evolution in a corpus

Thursday, March 31st, 2011

The web of topics: discovering the topology of topic evolution in a corpus by Yookyung Jo, John E. Hopcroft, and Carl Lagoze, Cornell University, Ithaca, NY, USA.


In this paper we study how to discover the evolution of topics over time in a time-stamped document collection. Our approach is uniquely designed to capture the rich topology of topic evolution inherent in the corpus. Instead of characterizing the evolving topics at fixed time points, we conceptually define a topic as a quantized unit of evolutionary change in content and discover topics with the time of their appearance in the corpus. Discovered topics are then connected to form a topic evolution graph using a measure derived from the underlying document network. Our approach allows inhomogeneous distribution of topics over time and does not impose any topological restriction in topic evolution graphs. We evaluate our algorithm on the ACM corpus. The topic evolution graphs obtained from the ACM corpus provide an effective and concrete summary of the corpus with remarkably rich topology that are congruent to our background knowledge. In a finer resolution, the graphs reveal concrete information about the corpus that were previously unknown to us, suggesting the utility of our approach as a navigational tool for the corpus.

The term topic is being used in this paper to mean a subject in topic map parlance.

From the paper:

Our work is built on the premise that the words relevant to a topic are distributed over documents such that the distribution is correlated with the underlying document network such as a citation network. Specifically, in our topic discovery methodology, in order to test if a multinomial word distribution derived from a document constitutes a new topic, the following heuristic is used. We check that the distribution is exclusively correlated to the document network by requiring it to be significantly present in other documents that are network neighbors of the given document while suppressing the nondiscriminative words using the background model.

Navigation of a corpus on the basis of such a process would indeed be rich, but it would be even richer were multiple ways to represent the same subjects mapped together.

It would also be interesting to see how the resulting graphs, which included only the document titles and abstracts, compared to graphs constructed using the entire documents.

Trending Terms in Google’s Book Corpus – Post

Tuesday, December 28th, 2010

Trending Terms in Google’s Book Corpus

Matthew Hurst covers an interesting new tool at Google Book Corpus that allows tracking of terms over time.


  1. Pick at least 3 pairs of terms to track through this interface. (3-5 pages, no citations)
  2. Track only one term over 300 years of publication with one example of usage for every 30 years. (3-5 pages, citations)
  3. What similarity measures would you use to detect variation in the semantics of that term in a corpus covering 300 years? (3-5 pages, citations)

Building Concept Structures/Concept Trails

Thursday, December 2nd, 2010

Automatically Building Concept Structures and Displaying Concept Trails for the Use in Brainstorming Sessions and Content Management Systems Authors: Christian Biemann, Karsten Böhm, Gerhard Heyer and Ronny Melz


The automated creation and the visualization of concept structures become more important as the number of relevant information continues to grow dramatically. Especially information and knowledge intensive tasks are relying heavily on accessing the relevant information or knowledge at the right time. Moreover the capturing of relevant facts and good ideas should be focused on as early as possible in the knowledge creation process.

In this paper we introduce a technology to support knowledge structuring processes already at the time of their creation by building up concept structures in real time. Our focus was set on the design of a minimal invasive system, which ideally requires no human interaction and thus gives the maximum freedom to the participants of a knowledge creation or exchange processes. The initial prototype concentrates on the capturing of spoken language to support meetings of human experts, but can be easily adapted for the use in Internet communities that have to rely on knowledge exchange using electronic communication channel.

I don’t share the authors’ confidence that corpus linguistics is going to provide the level of accuracy expected.

But, I find the notion of a dynamic semantic map that grows, changes and evolves during a discussion to be intriguing.

This article was published in 2006 so I will follow up to see what later results have been reported.