WebCorp Linguist’s Search Engine
From the homepage:
The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.
Synchronic English Web Corpus: 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification.
Diachronic English Web Corpus: 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month.
Birmingham Blog Corpus: 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments.
Anglo-Norman Correspondence Corpus: A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation.
Novels of Charles Dickens: A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels.
You have to register to use the service but registration is free.
The way I toss subject around on this blog, you would think it has only one meaning. Not so, as shown by the first twenty “hits” on subject in the Synchronic English Web Corpus:
1 Service agencies. 'Merit' is subject to various interpretations depending
2 amount of oxygen a subject breathes in," he says, "
3 to work on the subject again next month "to
4 of Durham degrees were subject to a religion test
5 London, which were not subject to any religious test,
6 cited researchers in broad subject categories in life sciences,
7 Losing Weight. Broaching the subject of weight can be
8 by survey respondents include subject and curriculum, assessment, pastoral,
9 knowledge in teachers' own subject area, the use of
10 each addressing a different subject and how citizenship and
11 and school staff, but subject to that they dismissed
12 expressed but it is subject to the qualifications set
13 last piece on this subject was widely criticised and
14 saw themselves as foreigners subject to oppression by the
15 to suggest that, although subject to similar experiences, other
16 since you raise the subject, it's notable that very
17 position of the privileged subject with their disorderly emotions
18 Jimmy may include radical subject matter in his scripts,
19 more than sufficient as subject matter and as an
20 the NATO script were subject to personal attacks from
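
For anyone who wants to experiment with the same idea on their own texts, here is a minimal Python sketch of how KWIC (keyword in context) lines like those above can be produced. It is not WebCorp's implementation and does not call the service; the function name kwic, the width parameter, and the sample sentence are all assumptions made up for illustration.

```python
import re


def kwic(text, keyword, width=4):
    """Return KWIC lines with `width` words of context on each side.

    Hypothetical helper for illustration only; not part of WebCorp.
    """
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare on the bare word so "subject," still matches "subject".
        if re.sub(r"\W+", "", tok).lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Made-up sample text, loosely modelled on the hits above.
sample = ("Since you raise the subject, it is notable that degrees "
          "were subject to a religion test.")

for n, line in enumerate(kwic(sample, "subject"), start=1):
    print(n, line)
```

The sketch splits on whitespace and matches whole words only, so it would miss inflected forms such as "subjects"; a real corpus tool handles tokenisation, wildcards and part-of-speech far more carefully.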
There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.
A great tool not only for linguists but for anyone who wants to explore English as a language with professional-grade tools.
If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.
Free for personal/academic work; commercial use requires a license.
I first saw this in a tweet by Claire Hardaker.