Archive for the ‘Corpora’ Category

Corpus of American Tract Society Publications

Friday, September 11th, 2015

Corpus of American Tract Society Publications by Lincoln Mullen.

From the post:

I’ve created a small to mid-sized corpus of publications by the American Tract Society up to the year 1900 in plain text. This corpus has been gathered from the Internet Archive. It includes 641 documents containing just under sixty million words, along with a CSV file containing metadata for each of the files. I don’t make any claims that this includes all of the ATS publications from that time period, and it is pretty obvious that the metadata from the Internet Archive is not much good. The titles are mostly correct; the dates are pretty far off in cases.

This corpus was created for the purpose of testing document similarity and text reuse algorithms. I need a corpus for testing the textreuse, which is in very early stages of development. From reading many, many of these tracts, I already know the patterns of text reuse. (And of course, the documents are historically interesting in their own right, and might be a good candidate for text mining.) The ATS frequently republished tracts under the same title. Furthermore, they published volumes containing the entire series of tracts that they had published individually. So there are examples of entire documents which are reprinted, but also some documents which are reprinted inside others. Then as a extra wrinkle, the corpus contains the editions of the Bible published by the ATS, plus their edition of Cruden’s concordance and a Bible dictionary. Likely all of the tracts quote the Bible, some at great length, so there are many examples of borrowing there.

Here is the corpus and its repository:

With the described repetition, the corpus must compress well. 😉

Makes me wonder how much near-repetition occurs in CS papers?

Graph papers than repeat graph fundamentals, in nearly the same order, in paper after paper.

At what level would you measure re-use? Sentence? Paragraph? Larger divisions?

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

Monday, March 9th, 2015

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts by Mark Davies.

From the post:

This announcement is for those who are interested in historical corpora and who may want a large dataset to work with on their own machine. This is a real corpus, rather than just n-grams (as with the Google Books n-grams; see a comparison at

We are pleased to announce that the Corpus of Historical American English (COHA; is now available in downloadable full-text format, for use on your own computer.

COHA joins COCA and GloWbE, which have been available in downloadable full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more than 115,000 separate texts, covering fiction, popular magazines, newspaper articles, and non-fiction books from the 1810s to the 2000s (see

At 385 million words in size, the downloadable COHA corpus is much larger than any other structured historical corpus of English. With this large amount of data, you can carry out many types of research that would not be possible with much smaller 5-10 million word historical corpora of English (see

The corpus is available in several formats: sentence/paragraph, PoS-tagged and lemmatized (one word per line), and for input into a relational database. Samples of each format (3.6 million words each) are available at the full-text website.

We hope that this new resource is of value to you in your research and teaching.

Mark Davies
Brigham Young University

I haven’t ever attempted a systematic ranking of American universities but in terms of contributions to the public domain in the humanities, Brigham Young is surely in the top ten (10), however you might rank the members of that group individually.

Correction: A comment pointed out that this data set is for sale and not in the public domain. My bad, I read the announcement and not the website. Still, given the amount of work required to create such a corpus, I don’t find the fees offensive.

Take the data set being formatted for input into a relational database as a reason for inputting it into a non-relational database.


I first saw this in a tweet by the List.

WebCorp Linguist’s Search Engine

Saturday, November 22nd, 2014

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

Synchronic English Web Corpus 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification. About

Diachronic English Web Corpus 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About

Birmingham Blog Corpus 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments. About

Anglo-Norman Correspondence Corpus A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation. About

Novels of Charles Dickens A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss subject around on this blog you would think it has only one meaning. Not so as shown by the first twenty “hits” on subject in the Synchronic English Web Corpus:

1    Service agencies.  'Merit' is subject to various interpretations depending 
2		amount of oxygen a subject breathes in," he says, "
3		    to work on the subject again next month "to 
4	    of Durham degrees were subject to a religion test 
5	    London, which were not subject to any religious test, 
6	cited researchers in broad subject categories in life sciences, 
7    Losing Weight.  Broaching the subject of weight can be 
8    by survey respondents include subject and curriculum, assessment, pastoral, 
9       knowledge in teachers' own subject area, the use of 
10     each addressing a different subject and how citizenship and 
11	     and school staff, but subject to that they dismissed 
12	       expressed but it is subject to the qualifications set 
13	        last piece on this subject was widely criticised and 
14    saw themselves as foreigners subject to oppression by the 
15	 to suggest that, although subject to similar experiences, other 
16	       since you raise the subject, it's notable that very 
17	position of the privileged subject with their disorderly emotions 
18	 Jimmy may include radical subject matter in his scripts, 
19	   more than sufficient as subject matter and as an 
20	      the NATO script were subject to personal attacks from 

There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

Free for personal/academic work, commercial use requires a license.

I first saw this in a tweet by Claire Hardaker

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

Friday, October 10th, 2014

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)


Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at

Highly recommended both as a “gold standard” for chemical patent text mining but also as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

Tuesday, September 9th, 2014

BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

From the webpage:

Despite certain obvious drawbacks (e.g. lack of control, sampling, documentation etc.), there is no doubt that the World Wide Web is a mine of language data of unprecedented richness and ease of access.

It is also the only viable source of “disposable” corpora built ad hoc for a specific purpose (e.g. a translation or interpreting task, the compilation of a terminological database, domain-specific machine learning tasks). These corpora are essential resources for language professionals who routinely work with specialized languages, often in areas where neologisms and new terms are introduced at a fast pace and where standard reference corpora have to be complemented by easy-to-construct, focused, up-to-date text collections.

While it is possible to construct a web-based corpus through manual queries and downloads, this process is extremely time-consuming. The time investment is particularly unjustified if the final result is meant to be a single-use corpus.

The command-line scripts included in the BootCaT toolkit implement an iterative procedure to bootstrap specialized corpora and terms from the web, requiring only a list of “seeds” (terms that are expected to be typical of the domain of interest) as input.

In implementing the algorithm, we followed the old UNIX adage that each program should do only one thing, but do it well. Thus, we developed a small, independent tool for each separate subtask of the algorithm.

As a result, BootCaT is extremely modular: one can easily run a subset of the programs, look at intermediate output files, add new tools to the suite, or change one program without having to worry about the others.

Any application following “the old UNIX adage that each program should do only one thing, but do it well” merits serious consideration.

Occurs to me that BootCaT would also be useful for creating small text collections for comparison to each other.


I first saw this in a tweet by Alyona Medelyan.

Non-Native Written English

Wednesday, June 18th, 2014

ETS Corpus of Non-Native Written English by Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. (Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.)

From the webpage:

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and is comprised of 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.

The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.

A data set for detecting the native language of authors writing in English. Not unlike the post earlier today on LDA, which attempts to detect topics that are (allegedly) behind words in a text.

I mention that because some CS techniques start with the premise that words are indirect representatives of something hidden, while other parts of CS, search for example, presume that words have no depth, only surface. The Google books N-Gram Viewer makes that assumption.

The N-Gram Viewer makes no distinction between any use of these words:

  • awful
  • backlog
  • bad
  • cell
  • fantastic
  • gay
  • rubbers
  • tool

Some have changed meaning recently, others, not quite so recently.

This is a partial list from a common resource: These 12 Everyday Words Used To Have Completely Different Meanings. Imagine if you did the historical research to place words in their particular social context.

It may be necessary for some purposes to presume words are shallow, but always remember that is a presumption and not a truth.

I first saw this in a tweet by Christopher Phipps.

Corpus-based Empirical Software Engineering

Wednesday, May 21st, 2014

Corpus-based Empirical Software Engineering – Ekaterina Pek by Felienne Hermans.

Felienne was live blogging Ekaterina’s presentation and defense (Defense of Ekaterina Pek May 21, 2014) today.

From the presentation notes:

The motivation for Kate’s work, she tells us, is the work of Knuth who empirically studied punchcards with FORTRAN code, in order to discover ‘what programmers really do’, as opposed to ‘what programmers should do’

Kate has the same goal: she wants to measure use of languages:

  • frequency counts -> How often are parts of the language used?
  • coverage -> What parts of the language are used?
  • footprint -> How much of each language part is used?

In order to be able to perform such analyses, we need a ‘corpus’ a big set of language data to work on. Knuth even collected punch cards from garbage bins, because it was so important for him to get more data.

And it is not just code she looked at, also libraries, bugs, emails and commits are taken into account. But some have to be sanitized in order to be usable for the corpus.

Now there is an interesting sea of subjects.

Imagine exploring such a corpus for patterns of bugs and merging in patterns found in bug reports.

After all, bugs are introduced with programmers program as they do in real life, not as they would in theory.

Comparison of Corpora through Narrative Structure

Friday, May 16th, 2014

Comparison of Corpora through Narrative Structure by Dan Simonson.

A very interesting slide deck from a presentation on how news coverage of police activity may have changed from before and after September 11th.

An early slide that caught my attention:

As a computational linguist, I can study 106 —instead of 100.6 —documents.

The sort of claim that clients might look upon with favor.

I first saw this in a tweet by Dominique Mariko.

The IMS Open Corpus Workbench (CWB)

Monday, October 7th, 2013

The IMS Open Corpus Workbench (CWB)

From the webpage:

The IMS Open Corpus Workbench (CWB) is a collection of open-source tools for managing and querying large text corpora (ranging from 10 million to 2 billion words) with linguistic annotations. Its central component is the flexible and efficient query processor CQP.

The first official open-source release of the Corpus Workbench (Version 3.0) is now available from this website. While many pages are still under construction, you can download release versions of the CWB, associated software and sample corpora. You will also find some documentation and other information in the different sections of this site.

If you are investigating large amounts of text, this may be the tool for you.

BTW, don’t miss: Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium by Stefan Evert and Andrew Hardie.


Corpus Workbench (CWB) is a widely-used architecture for corpus analysis, originally designed at the IMS, University of Stuttgart (Christ 1994). It consists of a set of tools for indexing, managing and querying very large corpora with multiple layers of word-level annotation. CWB’s central component is the Corpus Query Processor (CQP), an extremely powerful and efficient concordance system implementing a flexible two-level search language that allows complex query patterns to be specified both at the level of an individual word or annotation, and at the level of a fully- or partially-specified pattern of tokens. CWB and CQP are commonly used as the back-end for web-based corpus interfaces, for example, in the popular BNCweb interface to the British National Corpus (Hoffmann et al. 2008). CWB has influenced other tools, such as the Manatee software used in SketchEngine, which implements the same query language (Kilgarriff et al. 2004).

This paper details recent work to update CWB for the new century. Perhaps the most significant development is that CWB version 3 is now an open source project, licensed under the GNU General Public Licence. This change has substantially enlarged the community of developers and users and has enabled us to leverage existing open-source libraries in extending CWB’s capabilities. As a result, several key improvements were made to the CWB core: (i) support for multiple character sets, most especially Unicode (in the form of UTF-8), allowing all the world’s writing systems to be utilised within a CWB-indexed corpus; (ii) support for powerful Perl-style regular expressions in CQP queries, based on the open-source PCRE library; (iii) support for a wider range of OS platforms including Mac OS X, Linux, and Windows; and (iv) support for larger corpus sizes of up to 2 billion words on 64-bit platforms.

Outside the CWB core, a key concern is the user-friendliness of the interface. CQP itself can be daunting for beginners. However, it is common for access to CQP queries to be provided via a web-interface, supported in CWB version 3 by several Perl modules that give easy access to different facets of CWB/CQP functionality. The CQPweb front-end (Hardie forthcoming) has now been adopted as an integral component of CWB. CQPweb provides analysis options beyond concordancing (such as collocations, frequency lists, and keywords) by using a MySQL database alongside CQP. Available in both the Perl interface and CQPweb is the Common Elementary Query Language (CEQL), a simple-syntax set of search patterns and wildcards which puts much of
the power of CQP in a form accessible to beginning students and non-corpus-linguists.

The paper concludes with a roadmap for future development of the CWB (version 4 and above), with a focus on even larger corpora, full support for XML and dependency annotation, new types of query languages, and improved efficiency of complex CQP queries. All interested users are invited to help us shape the future of CWB by discussing requirements and contributing to the implementation of these features.

I have been using some commercial concordance software recently on standards drafts.

I need to give the IMS Open Corpus Workbench (CBW) a spin.

I would not worry about the 2 billion word corpus limitation.

That’s approximately 3,333.33 times the number of words in War and Peace by Leo Tolstoy. (I rounded the English translation word count up to 600,000 for an even number.)

Quantifying the Language of British Politics, 1880-1914

Friday, September 27th, 2013

Quantifying the Language of British Politics, 1880-1914


This paper explores the power, potential, and challenges of studying historical political speeches using a specially constructed multi-million word corpus via quantitative computer software. The techniques used – inspired particularly by Corpus Linguists – are almost entirely novel in the field of political history, an area where research into language is conducted nearly exclusively qualitatively. The paper argues that a corpus gives us the crucial ability to investigate matters of historical interest (e.g. the political rhetoric of imperialism, Ireland, and class) in a more empirical and systematic manner, giving us the capacity to measure scope, typicality, and power in a massive text like a national general election campaign which it would be impossible to read in entirety.

The paper also discusses some of the main arguments against this approach which are commonly presented by critics, and reflects on the challenges faced by quantitative language analysis in gaining more widespread acceptance and recognition within the field.

Points to a podcast by Luke Blaxill presenting the results of his Ph.D research.

Luke Blaxill’s dissertation: The Language of British Electoral Politics 1880-1910.

Important work that strikes a balance between a “close reading” of the relevant texts and using a one million word corpus (two corpora actually) to trace language usage.

Think of it as the opposite of tools that flatten the meaning of words across centuries.

in the dark heart of a language model lies madness….

Saturday, August 3rd, 2013

in the dark heart of a language model lies madness…. by Chris.

From the post:

This is the second in a series of post detailing experiments with the Java Graphical Authorship Attribution Program. The first post is here.


In my first run (seen above), I asked JGAAP to normalize for white space, strip punctuation, turn everything into lowercase. Then I had it run a Naive Bayes classifier on the top 50 tri-grams from the three known authors (Shakespeare, Marlowe, Bacon) and one unknown author (Shakespeare’s sonnets).

Based on that sample, JGAAP came to the conclusion that Francis Bacon wrote the sonnets. We know that because it lists its guesses in order from best to worst in the left window in the above image. Bacon is on top. This alone is cause to start tinkering with the model, but the results didn’t look as flat weird until I looked at the image again today. It lists the probability that the sonnets were written by Bacon as 1. A probability of 1 typically means absolute certainty. So this model, given the top 50 trigrams, is absolutely certain that Francis Bacon wrote those sonnets … Bullshit. A probabilistic model is never absolutely certain of anything. That’s what makes it probabilistic, right?

So where’s the bug? Turns out, it might have been poor data management on my part. I didn’t bother to sample in any kind of fair and reasonable way. Here are my corpora:


You may not be a stakeholder in the Shakespeare vs. Bacon debate, but you are likely to encounter questions about the authorship of data. Particularly text data.

The tool that Chris describes is a great introduction to that type of analysis.

Metaphor Identification in Large Texts Corpora

Tuesday, May 21st, 2013

Metaphor Identification in Large Texts Corpora by Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder. (Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, et al. (2013) Metaphor Identification in Large Texts Corpora. PLoS ONE 8(4): e62343. doi:10.1371/journal.pone.0062343)


Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.

A deep review of current work and promising new algorithms on metaphor identification.

I first saw this in Nat Torkinton’s Four short links: 14 May 2013.


Thursday, May 9th, 2013

AntConc by Laurence Anthony.

From the help file:


The Concordance tool generates KWIC (key word in context) concordance lines from one or more target texts chosen by the user.

Concordance Plot

The Concordance Plot tool generates an alternative view of search term hits in a corpus compared with the Concordance tool. Here the relative position of each hit in a file is displayed as a line in bar chart. (Search terms can be inputted in an identical way to that in the Concordance Tool.)

File View

The File View tool is used to display the original files of the corpus. It can also be used to search for terms within individual files in a similar way to searches using the Concordance and Concordance Plot tools.

Word Clusters

The Word Clusters tool is used to generate an ordered list of clusters that appear around a search term in the target files listed in the left frame of the main window.


The N-grams tool is used to generate an ordered list of N-grams that appear in the target files listed in the left frame of the main window. N-grams are word N-grams, and therefore, large files will create huge numbers of N-grams. For example, N-grams of size 2 for the sentence “this is a pen”, are ‘this is’, ‘is a’ and ‘a pen’.


The Collocates tool is used to generate an ordered list of collocates that appear near a search term in the target files listed in the left frame of the main window.

Word List

The Word List feature is used to generate a list of ordered words that appear in the target files listed in the left frame of the main window.

Keyword List

In addition to generating word lists using the Word List tool, AntConc can compare the words that appear in the target files with the words that appear in a ‘reference corpus’ to generate a list of “Keywords”, that are unusually frequent (or infrequent) in the target files.

The 1.0 version appeared in 2002 and the current beta version is 3.3.5.

Great for exploring texts!

Did I mention it is freeware?

50,000 Lessons on How to Read:…

Friday, April 12th, 2013

50,000 Lessons on How to Read: a Relation Extraction Corpus by Dave Orr, Product Manager, Google Research.

From the post:

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities. For instance, you could say that Jim Henson was in a spouse relation with Jane Henson (and in a creator relation with many beloved characters and shows).

The goal of relation extraction is to learn relations from unstructured natural language text. The relations can be used to answer questions (“Who created Kermit?”), learn which proteins interact in the biomedical literature, or to build a database of hundreds of millions of entities and billions of relations to try and help people explore the world’s information.

To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.

Another step in the “right” direction.

This is a human-curated set of relation semantics.

Rather than trying to apply this as a universal “standard,” what if you were to create a similar data set for your domain/enterprise?

Using human curators to create and maintain a set of relation semantics?

Being a topic mappish sort of person, I suggest the basis for their identification of the relationship be explicit, for robust re-use.

But you can repeat the same analysis over and over again if you prefer.

GroningenMeaningBank (GMB)

Thursday, April 11th, 2013

GroningenMeaningBank (GMB)

From the “about” page:

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scrope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB texts are annotated rather than
isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.


The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE‘s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise?

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises?

European Parliament Proceedings Parallel Corpus 1996-2011

Sunday, November 18th, 2012

European Parliament Proceedings Parallel Corpus 1996-2011

From the webpage:

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite the paper, if you use this corpus in your work. See also the extended (but earlier) version of the report (ps, pdf).

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romanic (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavik (Bulgarian, Czech, Polish, Slovak, Slovene), Finni-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor we identified sentence boundaries. We sentence aligned the data using a tool based on the Church and Gale algorithm.

Version 7, released in May of 2012, has around 60 million words per language.

Just in case you need a corpus for the EU.

I would be mindful of its parlimentary context. Semantic equivalent or similarity there may not hold true for other contexts.

Using information retrieval technology for a corpus analysis platform

Wednesday, September 26th, 2012

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.


This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Being mindful that for topic maps, the annotation layers and annotations themselves may be subjects for some purposes.

Concept Annotation in the CRAFT corpus

Sunday, August 19th, 2012

Concept Annotation in the CRAFT corpus by Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin Verspoor, Judith A. Blake and Lawrence E. Hunter by BMC Bioinformatics 2012, 13:161 doi:10.1186/1471-2105-13-161.



Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.


This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.


As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at

Lessons on what it takes to create a “gold standard” corpus to advance NLP application development.

What do you think the odds are of “high inter[author] agreement” in the absence of such planning and effort?

Sorry, I meant “high interannotator agreement.”

Guess we have to plan for “low inter[author] agreement.”


Gold Standard (or Bronze, Tin?)

Sunday, August 19th, 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.



We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.


Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.


The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article that I discovered and then worked my way to it from BioNLP.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pickup the other resources as appropriate.


Sunday, August 19th, 2012


From the Quick Facts:

  • 67 full text articles
  • >560,000 Tokens
  • >21,000 Sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest (ChEBI)
    • Cell Type Ontology (CL)
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available

Let’s see: 67 articles resulted in 100,000 concept annotations, or about 1,493 per article for seven (7) ontologies/terminologies.

Ready to test this mapping out in your topic map application?

GATE Teamware: Collaborative Annotation Factories (HOT!)

Wednesday, May 9th, 2012

GATE Teamware: Collaborative Annotation Factories

From the webpage:

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

  • Loading document collections (a “corpus” or “corpora”)
  • Creating re-usable project templates
  • Initiating projects based on templates
  • Assigning project roles to specific users
  • Monitoring progress and various project statistics in real time
  • Reporting of project status, annotator activity and statistics
  • Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but…, I have been a sysadmin before. Maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Then I read:

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from

Not saying it will or won’t meet your particular needs, but, certainly is worth a “look see.”

Let me know if you take the plunge!

Parallel Language Corpus Hunting?

Friday, April 27th, 2012

Parallel language corpus hunters, particularly in legal informatics can rejoice!

[A] parallel corpus of all European Union legislation, called the Acquis Communautaire, translated into all 22 languages of the EU nations — has been expanded to include EU legislation from 2004-2010…

If you think semantic impedance in one language is tough, step up and try that across twenty-two (22) languages.

Of course, these countries share something of a common historical context. Imagine the gulf when you move up to languages from other historical contexts.

See: DGT-TM-2011, Parallel Corpus of All EU Legislation in Translation, Expanded to Include Data from 2004-2010 for links and other details.

Corpus-Wide Association Studies

Sunday, March 11th, 2012

Corpus-Wide Association Studies by Mark Liberman.

From the post:

I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.

If you think about it, the social/behavioral sciences are being applied to the results of data mining of user behavior now. Perhaps you can “catch the wave” early on this cycle of research.

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora

Saturday, December 31st, 2011

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora (Special topic: Language Resources for Machine Translation in Less-Resourced Languages and Domains


DEADLINE FOR PAPERS: 15 February 2012
Workshop Saturday, 26 May 2012
Lütfi Kirdar Istanbul Exhibition and Congress Centre
Istanbul, Turkey

Some of the information is from: Call for papers. the main conference site does not (yet) have the call for papers posted. Suggest that you verify dates with conference organizers before making travel arrangements.

From the call for papers:

In the language engineering and the linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves by making possible inter-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form in various degrees and dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.

The scarcity of parallel corpora has motivated research concerning the use of comparable corpora: pairs of monolingual corpora selected according to the same set of criteria, but in different languages or language varieties. Non-parallel yet comparable corpora overcome the two limitations of parallel corpora, since sources for original, monolingual texts are much more abundant than translated texts. However, because of their nature, mining translations in comparable corpora is much more challenging than in parallel corpora. What constitutes a good comparable corpus, for a given task or per se, also requires specific attention: while the definition of a parallel corpus is fairly straightforward, building a non-parallel corpus requires control over the selection of source texts in both languages.

Parallel corpora are a key resource as training data for statistical machine translation, and for building or extending bilingual lexicons and terminologies. However, beyond a few language pairs such as English-French or English-Chinese and a few contexts such as parliamentary debates or legal texts, they remain a scarce resource, despite the creation of automated methods to collect parallel corpora from the Web. To exemplify such issues in a practical setting, this year’s special focus will be on

Language Resources for Machine Translation in Less-Resourced Languages and Domains

with the aim of overcoming the shortage of parallel resources when building MT systems for less-resourced languages and domains, particularly by usage of comparable corpora for finding parallel data within and by reaching out for “hidden” parallel data. Lack of sufficient language resources for many language pairs and domains is currently one of the major obstacles in further advancement of machine translation.

Curious about the use of topic maps in the creation of comparable corpora? Seems like the use of language/domain scopes on linguistic data could result in easier construction of comparable corpora.