Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 22, 2014

WebCorp Linguist’s Search Engine

Filed under: Corpora,Corpus Linguistics,Linguistics — Patrick Durusau @ 2:58 pm

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

Synchronic English Web Corpus 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification. About

Diachronic English Web Corpus 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About

Birmingham Blog Corpus 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments. About

Anglo-Norman Correspondence Corpus A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation. About

Novels of Charles Dickens A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss subject around on this blog you would think it has only one meaning. Not so as shown by the first twenty “hits” on subject in the Synchronic English Web Corpus:

1    Service agencies.  'Merit' is subject to various interpretations depending 
2		amount of oxygen a subject breathes in," he says, "
3		    to work on the subject again next month "to 
4	    of Durham degrees were subject to a religion test 
5	    London, which were not subject to any religious test, 
6	cited researchers in broad subject categories in life sciences, 
7    Losing Weight.  Broaching the subject of weight can be 
8    by survey respondents include subject and curriculum, assessment, pastoral, 
9       knowledge in teachers' own subject area, the use of 
10     each addressing a different subject and how citizenship and 
11	     and school staff, but subject to that they dismissed 
12	       expressed but it is subject to the qualifications set 
13	        last piece on this subject was widely criticised and 
14    saw themselves as foreigners subject to oppression by the 
15	 to suggest that, although subject to similar experiences, other 
16	       since you raise the subject, it's notable that very 
17	position of the privileged subject with their disorderly emotions 
18	 Jimmy may include radical subject matter in his scripts, 
19	   more than sufficient as subject matter and as an 
20	      the NATO script were subject to personal attacks from 

There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

Free for personal/academic work, commercial use requires a license.

I first saw this in a tweet by Claire Hardaker

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress