Archive for the ‘Linguistics’ Category

Nine Kinds of Ancient Greek Treebanks

Thursday, December 21st, 2017

Nine Kinds of Ancient Greek Treebanks by Jonathan Robie.

When I blog or speak about Greek treebanks, I frequently refer to one or more of the treebanks that are currently available. Few people realize how many treebanks exist for ancient Greek, and even fewer have ever seriously looked at more than one. I do not know of a web page that lists all of the ones I know of, so I thought it would be helpful to list them in one blog post, providing basic information about each.

So here is a catalog of treebanks for ancient Greek.

Most readers of this blog know Jonathan Robie from his work on XQuery and XPath, two of the XML projects that have benefited from his leadership.

What readers may not know is that Jonathan originated both b-greek (Biblical Greek Forum, est. 1992) and b-hebrew (Biblical Hebrew Forum, est. 1997). Those are not typos: b-greek began in 1992 and b-hebrew in 1997. (I checked the archives before posting.)

Not content to be the origin and maintainer of two of the standard discussion forums for biblical languages, Jonathan has undertaken to produce high-quality open data for serious Bible students and professional scholars.

Texts in multiple treebanks, such as the Greek NT, make a great use case for display and analysis of overlapping trees.

Syntacticus – Early Indo-European Languages

Saturday, September 23rd, 2017


From the about page:

Syntacticus provides easy access to around a million morphosyntactically annotated sentences from a range of early Indo-European languages.

Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 80,138 sentences or 936,874 tokens in 10 languages.

We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of texts we are working on at the moment, which you can find on the PROIEL Treebank and the TOROT Treebank pages, but this is extremely time-consuming work so please be patient!

The focus for Syntacticus at the moment is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the current documentation is inadequate! But new features and better integration with our development toolchain are also on the horizon in the near future.

Language Size
Ancient Greek 250,449 tokens
Latin 202,140 tokens
Classical Armenian 23,513 tokens
Gothic 57,211 tokens
Portuguese 36,595 tokens
Spanish 54,661 tokens
Old English 29,406 tokens
Old French 2,340 tokens
Old Russian 209,334 tokens
Old Church Slavonic 71,225 tokens
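
As a quick check on the quoted figures, the per-language sizes in the table can be summed and compared with the 936,874-token total given in the excerpt. This is only a sketch: the dictionary is transcribed by hand from the table, and the variable names are illustrative, not anything Syntacticus publishes.

```python
# Token counts transcribed by hand from the Syntacticus table above.
TOKENS = {
    "Ancient Greek": 250_449,
    "Latin": 202_140,
    "Classical Armenian": 23_513,
    "Gothic": 57_211,
    "Portuguese": 36_595,
    "Spanish": 54_661,
    "Old English": 29_406,
    "Old French": 2_340,
    "Old Russian": 209_334,
    "Old Church Slavonic": 71_225,
}

# Total quoted on the Syntacticus about page.
QUOTED_TOTAL = 936_874

total = sum(TOKENS.values())
print(f"{len(TOKENS)} languages, {total:,} tokens")
print("matches quoted total:", total == QUOTED_TOTAL)

# Share of the collection per language, largest first.
for language, count in sorted(TOKENS.items(), key=lambda kv: -kv[1]):
    print(f"{language:20s} {count / total:6.1%}")
```

The ten rows do add up to the quoted total, so the table and the summary sentence agree.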

The mention of Old Russian should attract attention, given the media frenzy over Russia these days. However, the data at Syntacticus is meaningful, unlike news reports that reflect Western ignorance more often than news.

You may have noticed US reports have moved from guilt by association to guilt by nationality (anyone who is Russian = Putin confidant) and are approaching guilt by proximity (citizen of any country near Russia = Putin puppet).

It’s hard to imagine a political campaign without crimes being committed by someone, but traditionally, in law courts anyway, proof precedes a decision of guilt.

Looking forward to competent evidence (that’s legal terminology with a specific meaning), tested in an open proceeding against the elements of defined offenses. That’s a far cry from current discussions.

Graphing the distribution of English letters towards…

Tuesday, July 11th, 2017

Graphing the distribution of English letters towards the beginning, middle or end of words by David Taylor.

From the post:

(partial image)

Some data visualizations tell you something you never knew. Others tell you things you knew, but didn’t know you knew. This was the case for this visualization.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I’ll just repeat the most crucial here:

The counts here were generated from the Brown corpus, which is composed of texts printed in 1961.
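
One way to handle the 3- versus 13-letter problem is to normalize each letter's position by the length of its word and then bucket it. The sketch below does that with three bins (beginning, middle, end). It is an assumption about the general approach, not Taylor's actual code; the function name `position_bins` and the sample words are illustrative only, and a real run would read words from a corpus such as Brown.

```python
from collections import Counter, defaultdict


def position_bins(words, n_bins=3):
    """For each letter, count how often it falls in each relative-position bin.

    A letter at index i in a word of n letters is assigned the relative
    position (i + 0.5) / n, so words of different lengths are comparable.
    """
    counts = defaultdict(Counter)
    for word in words:
        letters = [c for c in word.lower() if c.isalpha()]
        n = len(letters)
        for i, ch in enumerate(letters):
            rel = (i + 0.5) / n                      # always in (0, 1)
            bin_index = min(int(rel * n_bins), n_bins - 1)
            counts[ch][bin_index] += 1
    return counts


# Hypothetical sample; substitute a real word list for meaningful results.
sample = ["the", "cat", "quickly", "jumped"]
bins = position_bins(sample)
labels = ["beginning", "middle", "end"]
for letter in sorted(bins):
    row = ", ".join(f"{labels[b]}={bins[letter][b]}" for b in range(3))
    print(letter, row)
```

Using the midpoint of each letter's slot rather than its raw index keeps the first and last letters of short words out of the middle bin.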

Take Taylor’s post as an inducement to read both Prooffreader Plus and Prooffreader on a regular basis.

New MorphGNT Releases and Accentuation Analysis

Thursday, February 16th, 2017

New MorphGNT Releases and Accentuation Analysis by James Tauber.

From the post:

Back in 2015, I talked about Annotating the Normalization Column in MorphGNT. This post could almost be considered Part 2.

I recently went back to that work and made a fresh start on a new repo gnt-accentuation intended to explain the accentuation of each word in the GNT (and eventually other Greek texts). There’s two parts to that: explaining why the normalized form is accented the way it is but then explaining why the word-in-context might be accented differently (clitics, etc). The repo is eventually going to do both but I started with the latter.

My goal with that repo is to be part of the larger vision of an “executable grammar” I’ve talked about for years where rules about, say, enclitics, are formally written up in a way that can be tested against the data. This means:

  • students reading a rule can immediately jump to real examples (or exceptions)
  • students confused by something in a text can immediately jump to rules explaining it
  • the correctness of the rules can be tested
  • errors in the text can be found

It is the fourth point that meant that my recent work uncovered some accentuation issues in the SBLGNT, normalization and lemmatization. Some of that has been corrected in a series of new releases of the MorphGNT: 6.08, 6.09, and 6.10. See for details of specifics. The reason for so many releases was I wanted to get corrections out as soon as I made them but then I found more issues!

There are some issues in the text itself which need to be resolved. See the Github issue for details. I’d very much appreciate people’s input.

In the meantime, stay tuned for more progress on gnt-accentuation.

Was it random chance that I saw this announcement from James and Getting your hands dirty with the Digital Manuscripts Toolkit on the same day?


I should mention that Codex Sinaiticus (second oldest witness to the Greek New Testament) and numerous other Greek NT manuscripts have been digitized by the British Library.

Pairing these resources together offers a great opportunity to discover the Greek NT text as choices made by others. (Same holds true for the Hebrew Bible as well.)

Modelling Stems and Principal Part Lists (Attic Greek)

Friday, June 17th, 2016

Modelling Stems and Principal Part Lists by James Tauber.

From the post:

This is part 0 of a series of blog posts about modelling stems and principal part lists, particularly for Attic Greek but hopefully more generally applicable. This is largely writing up work already done but I’m doing cleanup as I go along as well.

A core part of the handling of verbs in the Morphological Lexicon is the set of terminations and sandhi rules that can generate paradigms attested in grammars like Louise Pratt’s The Essentials of Greek Grammar. Another core part is the stem information for a broader range of verbs usually conveyed in works like Pratt’s in the form of lists of principal parts.

A rough outline of future posts is:

  • the sources of principal part lists for this work
  • lemmas in the Pratt principal parts
  • lemma differences across lists
  • what information is captured in each of the lists individually
  • how to model a merge of the lists
  • inferring stems from principal parts
  • stems, terminations and sandhi
  • relationships between stems
  • ???

I’ll update this outline with links as posts are published.

(emphasis in original)

A welcome reminder of projects that transcend the ephemera that is social media.

Or should I say “modern” social media?

The texts we parse so carefully were originally spoken, recorded and copied, repeatedly, without the benefit of modern reference grammars and/or dictionaries.


For Linguists on Your Holiday List

Saturday, December 12th, 2015

Hey Linguists!—Get Them to Get You a Copy of The Speculative Grammarian Essential Guide to Linguistics.

From the website:

Hey Linguists! Do you know why it is better to give than to receive? Because giving requires a lot more work! You have to know what someone likes, what someone wants, who someone is, to get them a proper, thoughtful gift. That sounds like a lot of work.

No, wait. That’s not right. It’s actually more work to be the recipient—if you are going to do it right. You can’t just trust people to know what you like, what you want, who you are.

You could try to help your loved ones understand a linguist’s needs and wants and desires—but you’d have to give them a mini course on historical, computational, and forensic linguistics first. Instead, you can assure them that SpecGram has the right gift for you—a gift you, their favorite linguist, will treasure for years to come: The Speculative Grammarian Essential Guide to Linguistics.

So drop some subtle or not-so-subtle hints and help your loved ones do the right thing this holiday season: gift you with this hilarious compendium of linguistic sense and nonsense.

If you need to convince your friends and family that they can’t find you a proper gift on their own, send them one of the images below, and try to explain to them why it amuses you. That’ll show ’em! (More will be added through the rest of 2015, just in case your friends and family are a little thick.)

• If guilt is more your style, check out 2013’s Sad Holiday Linguists.

• If semi-positive reinforcement is your thing, check out 2014’s Because You Can’t Do Everything You Want for Your Favorite Linguist.

Disclaimer: I haven’t proofed the diagrams against the sources cited. Rely on them at your own risk. 😉

There are others but the Hey Semioticians! reminded me of John Sowa (sorry John):


The greatest mistake across all disciplines is taking ourselves (and our positions) far too seriously.


Deep Learning and Parsing

Sunday, November 22nd, 2015

Jason Baldridge tweets that the work of James Henderson (Google Scholar) should get more cites for deep learning and parsing.

Jason points to the following two works (early 1990s) in particular:

Description Based Parsing in a Connectionist Network by James B. Henderson.


Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. This dissertation investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose computational constraints on parsing in this architecture. This dissertation argues that, despite these constraints, the architecture is computationally adequate for syntactic parsing, and that these constraints make significant linguistic predictions. To make these arguments, the nature of the architecture’s limitations are first characterized as a set of constraints on symbolic computation. This allows the investigation of the feasibility and implications of parsing in the architecture to be investigated at the same level of abstraction as virtually all other investigations of syntactic parsing. Then a specific parsing model is developed and implemented in the architecture. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints. Finally, this parsing model is tested on those phenomena which are of particular concern given the constraints, and on an approximately unbiased sample of sentences to check for unforeseen difficulties. The results show that this connectionist architecture is powerful enough for syntactic parsing. They also show that some linguistic phenomena are predicted by the limitations of this architecture. In particular, explanations are given for many cases of unacceptable center embedding, and for several significant constraints on long distance dependencies. These results give evidence for the cognitive significance of this computational architecture and parsing model. This work also shows how the advantages of both connectionist and symbolic techniques can be unified in natural language processing applications. By analyzing how low level biological and computational considerations influence higher level processing, this work has furthered our understanding of the nature of language and how it can be efficiently and effectively processed.

Connectionist Syntactic Parsing Using Temporal Variable Binding by James Henderson.


Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. The work discussed here investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose constraints on parsing in this architecture. Despite these constraints, the architecture is computationally adequate for syntactic parsing. In addition, the constraints make some significant linguistic predictions. These arguments are made using a specific parsing model. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic structure of sentences within the constraints imposed by the architecture.


King – Man + Woman = Queen:…

Tuesday, September 22nd, 2015

King – Man + Woman = Queen: The Marvelous Mathematics of Computational Linguistics.

From the post:

Computational linguistics has dramatically changed the way researchers study and understand language. The ability to number-crunch huge amounts of words for the first time has led to entirely new ways of thinking about words and their relationship to one another.

This number-crunching shows exactly how often a word appears close to other words, an important factor in how they are used. So the word Olympics might appear close to words like running, jumping, and throwing but less often next to words like electron or stegosaurus. This set of relationships can be thought of as a multidimensional vector that describes how the word Olympics is used within a language, which itself can be thought of as a vector space.

And therein lies this massive change. This new approach allows languages to be treated like vector spaces with precise mathematical properties. Now the study of language is becoming a problem of vector space mathematics.

Today, Timothy Baldwin at the University of Melbourne in Australia and a few pals explore one of the curious mathematical properties of this vector space: that adding and subtracting vectors produces another vector in the same space.

The question they address is this: what do these composite vectors mean? And in exploring this question they find that the difference between vectors is a powerful tool for studying language and the relationship between words.

A great lay introduction to:

Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning by Ekaterina Vylomova, Laura Rimell, Trevor Cohn, Timothy Baldwin.


Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.
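
The vector arithmetic in the title can be sketched in a few lines. The embeddings below are made-up four-dimensional toy vectors (roughly: royalty, male, female, fruit), not values from any trained model or from the paper; real systems use pre-trained embeddings with hundreds of dimensions. The helper names `cosine` and `analogy` are illustrative.

```python
import math

# Hypothetical toy embeddings: [royalty, male, female, fruit].
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


print(analogy("king", "man", "woman"))  # queen
```

With these toy vectors the difference king minus man isolates the "royalty" direction, which is the kind of lexical relation the paper asks whether vector differences capture in general.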

The authors readily admit, much to their credit, that this isn’t a one-size-fits-all solution.

But it is a line of research that merits your attention.

Open Review: Grammatical theory:…

Thursday, June 4th, 2015

Open Review: Grammatical theory: From transformational grammar to constraint-based approaches by Stefan Müller (Author).

From the webpage:

This book is currently at the Open Review stage. You can help the author by making comments on the preliminary version: Part 1, Part 2. Read our user guide to get acquainted with the software.

This book introduces formal grammar theories that play a role in current linguistics or contributed tools that are relevant for current linguistic theorizing (Phrase Structure Grammar, Transformational Grammar/Government & Binding, Minimalism, Generalized Phrase Structure Grammar, Lexical Functional Grammar, Categorial Grammar, Head-Driven Phrase Structure Grammar, Construction Grammar, Tree Adjoining Grammar, Dependency Grammar). The key assumptions are explained and it is shown how each theory treats arguments and adjuncts, the active/passive alternation, local reorderings, verb placement, and fronting of constituents over long distances. The analyses are explained with German as the object language.

In a final part of the book the approaches are compared with respect to their predictions regarding language acquisition and psycholinguistic plausibility. The nativism hypothesis that claims that humans possess genetically determined innate language-specific knowledge is examined critically and alternative models of language acquisition are discussed. In addition this more general part addresses issues that are discussed controversially in current theory building such as the question whether flat or binary branching structures are more appropriate, the question whether constructions should be treated on the phrasal or the lexical level, and the question whether abstract, non-visible entities should play a role in syntactic analyses. It is shown that the analyses that are suggested in the various frameworks are often translatable into each other. The book closes with a section that shows how properties that are common to all languages or to certain language classes can be captured.

(emphasis in the original)

Part of walking the walk of open access means participating in open reviews as your time and expertise permit.

Even if grammar theory isn’t your field, professionally speaking, it will be good mental exercise to see another view of the world of language.

I am intrigued by the suggestion “It is shown that the analyses that are suggested in the various frameworks are often translatable into each other.” Shades of the application of category theory to linguistics? Mappings of identifications?

Glossary of linguistic terms

Wednesday, May 6th, 2015

Glossary of linguistic terms by Eugene E. Loos (general editor), Susan Anderson (editor), Dwight H. Day, Jr. (editor), Paul C. Jordan (editor), J. Douglas Wingate (editor).

An excellent source for linguistic terminology.

If you have any interest in languages or linguistics you should give SIL International a visit.

BTW, the last update on the glossary page was in 2004 so if you can suggest some updates or additions, I am sure they would be appreciated.


Unker Non-Linear Writing System

Thursday, April 23rd, 2015

Unker Non-Linear Writing System by Alex Fink & Sai.

From the webpage:


“I understood from my parents, as they did from their parents, etc., that they became happier as they more fully grokked and were grokked by their cat.”[3]

Here is another snippet from the text:

Binding points, lines and relations

Every glyph includes a number of binding points, one for each of its arguments, the semantic roles involved in its meaning. For instance, the glyph glossed as eat has two binding points—one for the thing consumed and one for the consumer. The glyph glossed as (be) fish has only one, the fish. Often we give glosses more like “X eat Y”, so as to give names for the binding points (X is eater, Y is eaten).

A basic utterance in UNLWS is put together by writing out a number of glyphs (without overlaps) and joining up their binding points with lines. When two binding points are connected, this means the entities filling those semantic roles of the glyphs involved coincide. Thus when the ‘consumed’ binding point of eat is connected to the only binding point of fish, the connection refers to an eaten fish.

This is the main mechanism by which UNLWS clauses are assembled. To take a worked example, here are four glyphs:


If you are interested in graphical representations for design or presentation, this may be of interest.

Sam Hunting forwarded this while we were exploring TeX graphics.

PS: The “cat” people on Twitter may appreciate the first graphic. 😉

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

Monday, March 9th, 2015

FYI: COHA Full-Text Data: 385 Million Words, 116k Texts by Mark Davies.

From the post:

This announcement is for those who are interested in historical corpora and who may want a large dataset to work with on their own machine. This is a real corpus, rather than just n-grams (as with the Google Books n-grams; see a comparison at

We are pleased to announce that the Corpus of Historical American English (COHA; is now available in downloadable full-text format, for use on your own computer.

COHA joins COCA and GloWbE, which have been available in downloadable full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more than 115,000 separate texts, covering fiction, popular magazines, newspaper articles, and non-fiction books from the 1810s to the 2000s (see

At 385 million words in size, the downloadable COHA corpus is much larger than any other structured historical corpus of English. With this large amount of data, you can carry out many types of research that would not be possible with much smaller 5-10 million word historical corpora of English (see

The corpus is available in several formats: sentence/paragraph, PoS-tagged and lemmatized (one word per line), and for input into a relational database. Samples of each format (3.6 million words each) are available at the full-text website.

We hope that this new resource is of value to you in your research and teaching.

Mark Davies
Brigham Young University

I haven’t ever attempted a systematic ranking of American universities but in terms of contributions to the public domain in the humanities, Brigham Young is surely in the top ten (10), however you might rank the members of that group individually.

Correction: A comment pointed out that this data set is for sale and not in the public domain. My bad, I read the announcement and not the website. Still, given the amount of work required to create such a corpus, I don’t find the fees offensive.

Take the data set being formatted for input into a relational database as a reason for inputting it into a non-relational database.


I first saw this in a tweet by the List.

GOLD (General Ontology for Linguistic Description) Standard

Thursday, February 19th, 2015

GOLD (General Ontology for Linguistic Description) Standard

From the homepage:

The purpose of the GOLD Community is to bring together scholars interested in best-practice encoding of linguistic data. We promote best practice as suggested by E-MELD, encourage data interoperability through the use of the GOLD Standard, facilitate search across disparate data sets and provide a platform for sharing existing data and tools from related research projects. The development and refinement of the GOLD Standard will be the basis for and the product of the combined efforts of the GOLD Community. This standard encompasses linguistic concepts, definitions of these concepts and relationships between them in a freely available ontology.

The GOLD standard is dated 2010 and I didn’t see any updates for it.

If you are interested in capturing the subject identity properties before new nomenclatures replace the ones found here, now would be a good time.

I first saw this in a tweet by the Linguist List.

Inheritance Patterns in Citation Networks Reveal Scientific Memes

Sunday, December 14th, 2014

Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)


Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

Popular Summary:

It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.
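
To make the "propagation score multiplied by frequency" idea concrete, here is a minimal sketch on a toy citation graph. The summary above does not spell out how the propagation score is computed, so the ratio used here is an assumption: it compares how often a term appears in papers that cite a term-carrying paper with how often it appears in papers that do not, with a small smoothing constant `delta`. The data, the function name `meme_score`, and the smoothing choice are all illustrative; consult the paper for the authors' exact definition.

```python
def meme_score(papers, meme, delta=1.0):
    """Toy meme score: frequency times an assumed propagation ratio.

    papers maps a paper id to {"terms": set of terms, "cites": list of ids}.
    """
    carries = {pid for pid, p in papers.items() if meme in p["terms"]}
    frequency = len(carries) / len(papers)

    # Papers that cite at least one paper carrying the meme, and the rest.
    cites_meme = {pid for pid, p in papers.items()
                  if any(cited in carries for cited in p["cites"])}
    cites_no_meme = set(papers) - cites_meme

    # Assumed propagation ratio, smoothed by delta to avoid division by zero.
    sticking = len(carries & cites_meme) / (len(cites_meme) + delta)
    sparking = (len(carries & cites_no_meme) + delta) / (len(cites_no_meme) + delta)
    return frequency * sticking / sparking


# Hypothetical five-paper citation graph.
PAPERS = {
    "p1": {"terms": {"graphene"}, "cites": []},
    "p2": {"terms": {"graphene"}, "cites": ["p1"]},
    "p3": {"terms": {"graphene"}, "cites": ["p2"]},
    "p4": {"terms": {"lattice"}, "cites": ["p1"]},
    "p5": {"terms": {"lattice"}, "cites": []},
}

for term in ("graphene", "lattice"):
    print(term, round(meme_score(PAPERS, term), 3))
```

In this toy graph "graphene" travels along citations while "lattice" never appears in a paper that cites another "lattice" paper, so only the former gets a non-zero score.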

You definitely should grab the PDF version of this article for printing and a slow read.

From Section III Discussion:

We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.

Fair enough, but “black,” “inflation,” and “traffic flow” all appear in the top fifty memes in physics. I don’t know that I would consider any of them to be “memes.”

There is much left to be discovered about memes, such as who is good at propagating them. It would not hurt if your research paper were the origin of a very popular meme.

I first saw this in a tweet by Max Fisher.

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

Thursday, December 11th, 2014

EMNLP 2014: Conference on Empirical Methods in Natural Language Processing

I rather quickly sorted these tutorials into order by the first author’s last name:

The links will take you to the conference site and descriptions with links to videos and other materials.

You can download the complete conference proceedings: EMNLP 2014 The 2014 Conference on Empirical Methods In Natural Language Processing Proceedings of the Conference, which at two thousand one hundred and ninety-one (2191) pages, should keep you busy through the holiday season. 😉

Or if you are interested in a particular paper, see the Main Conference Program, which has links to individual papers and videos of the presentations in many cases.

A real wealth of materials here! I must say the conference servers are the most responsive I have ever seen.

I first saw this in a tweet by Jason Baldridge.

Semantic Parsing with Combinatory Categorial Grammars (Videos)

Thursday, December 11th, 2014

Semantic Parsing with Combinatory Categorial Grammars by Yoav Artzi, Nicholas FitzGerald and Luke Zettlemoyer. (Tutorial)


Semantic parsers map natural language sentences to formal representations of their underlying meaning. Building accurate semantic parsers without prohibitive engineering costs is a long-standing, open research problem.

The tutorial will describe general principles for building semantic parsers. The presentation will be divided into two main parts: modeling and learning. The modeling section will include best practices for grammar design and choice of semantic representation. The discussion will be guided by examples from several domains. To illustrate the choices to be made and show how they can be approached within a real-life representation language, we will use λ-calculus meaning representations. In the learning part, we will describe a unified approach for learning Combinatory Categorial Grammar (CCG) semantic parsers, that induces both a CCG lexicon and the parameters of a parsing model. The approach learns from data with labeled meaning representations, as well as from more easily gathered weak supervision. It also enables grounded learning where the semantic parser is used in an interactive environment, for example to read and execute instructions.

The ideas we will discuss are widely applicable. The semantic modeling approach, while implemented in λ-calculus, could be applied to many other formal languages. Similarly, the algorithms for inducing CCGs focus on tasks that are formalism independent, learning the meaning of words and estimating parsing parameters. No prior knowledge of CCGs is required. The tutorial will be backed by implementation and experiments in the University of Washington Semantic Parsing Framework (UW SPF).

I previously linked to the complete slide set for this tutorial.

This page offers short videos (twelve (12) currently) and links into the slide set. More videos are forthcoming.

The goal of the project is to “recover complete meaning representation,” where complete meaning is defined as: “Complete meaning is sufficient to complete the task.” (from video 1).

That definition of “complete meaning” dodges a lot of philosophical as well as practical issues with semantic parsing.

Take the time to watch the videos. Yoav is a good presenter.


Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

Saturday, December 6th, 2014

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

From the post:

A dialect is a particular form of language that is limited to a specific location or population group. Linguists are fascinated by these variations because they are determined both by geography and by demographics. So studying them can produce important insights into the nature of society and how different groups within it interact.

That’s why linguists are keen to understand how new words, abbreviations and usages spread on new forms of electronic communication, such as social media platforms. It is easy to imagine that the rapid spread of neologisms could one day lead to a single unified dialect of netspeak. An interesting question is whether there is any evidence that this is actually happening.

Today, we get a fascinating insight into this problem thanks to the work of Jacob Eisenstein at the Georgia Institute of Technology in Atlanta and a few pals. These guys have measured the spread of neologisms on Twitter and say they have clear evidence that online language is not converging at all. Indeed, they say that electronic dialects are just as common as ordinary ones and seem to reflect same fault lines in society.

Disappointment for those who thought the Net would help people overcome the curse of Babel.

When we move into new languages or means of communication, we simply take our linguistic diversity with us, like well traveled but familiar luggage.

If you think about it, the difficulties of multiple semantics for owl:sameAs are another instance of the same phenomenon. Semantically distinct groups assigned the same token, owl:sameAs, different semantics. That should not have been a surprise. But it was, and it will be every time one community privileges itself to be the giver of meaning for any term.

If you want to see the background for the post in full:

Diffusion of Lexical Change in Social Media by Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing.


Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter’s sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity — especially with regard to race — plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.

Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight

Sunday, November 23rd, 2014

Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight. Data mining the way we use words is revealing the linguistic earthquakes that constantly change our language.

From the post:

In October 2012, Hurricane Sandy approached the eastern coast of the United States. At the same time, the English language was undergoing a small earthquake of its own. Just months before, the word “sandy” was an adjective meaning “covered in or consisting mostly of sand” or “having light yellowish brown colour”. Almost overnight, this word gained an additional meaning as a proper noun for one of the costliest storms in US history.

A similar change occurred to the word “mouse” in the early 1970s when it gained the new meaning of “computer input device”. In the 1980s, the word “apple” became a proper noun synonymous with the computer company. And later, the word “windows” followed a similar course after the release of the Microsoft operating system.

All this serves to show how language constantly evolves, often slowly but at other times almost overnight. Keeping track of these new senses and meanings has always been hard. But not anymore.

Today, Vivek Kulkarni at Stony Brook University in New York and a few pals show how they have tracked these linguistic changes by mining the corpus of words stored in databases such as Google Books, movie reviews from Amazon and of course the microblogging site Twitter.

These guys have developed three ways to spot changes in the language. The first is a simple count of how often words are used, using tools such as Google Trends. For example, in October 2012, the frequency of the words “Sandy” and “hurricane” both spiked in the run-up to the storm. However, only one of these words changed its meaning, something that a frequency count cannot spot.

A very good overview of:

Statistically Significant Detection of Linguistic Change by Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.


We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts.

We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track it’s linguistic displacement over time.

We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book-ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
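The last step in the abstract, finding where a word's time series shifts, can be illustrated with a toy example. This is a minimal sketch, assuming the per-period displacement values have already been computed; the function name `mean_shift_change_point` and the sample numbers are invented for illustration, and the sketch does not include the significance testing the authors describe.

```python
def mean_shift_change_point(series):
    """Return (index, shift) for the split that maximizes
    mean(series[index:]) - mean(series[:index])."""
    best_index, best_shift = None, float("-inf")
    for i in range(1, len(series)):
        before = series[:i]
        after = series[i:]
        shift = sum(after) / len(after) - sum(before) / len(before)
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index, best_shift

# Invented displacement values for one word across seven time periods.
distances = [0.10, 0.12, 0.11, 0.13, 0.60, 0.62, 0.61]
index, shift = mean_shift_change_point(distances)
print(index, round(shift, 3))
```

A frequency spike and a meaning change can both produce a jump like this, so the choice of which property goes into the series matters more than the detector itself.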

While the authors are concerned with scaling, I would think detecting cracks, crevasses, and minor tremors in the meaning and usage of words, say between a bank and its regulators, or stock traders and the SEC, would be equally important.

Even if auto-detection of the “new” or “changed” meaning is too much to expect, simply detecting dissonance in the usage of terms would be a step in the right direction.

Detecting earthquakes in meaning is a worthy endeavor but there is more tripping on cracks than falling from earthquakes, linguistically speaking.

WebCorp Linguist’s Search Engine

Saturday, November 22nd, 2014

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

Synchronic English Web Corpus 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification. About

Diachronic English Web Corpus 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About

Birmingham Blog Corpus 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments. About

Anglo-Norman Correspondence Corpus A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation. About

Novels of Charles Dickens A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss “subject” around on this blog, you would think it has only one meaning. Not so, as shown by the first twenty “hits” on “subject” in the Synchronic English Web Corpus:

1   Service agencies.  'Merit' is subject to various interpretations depending
2              amount of oxygen a subject breathes in," he says, "
3                  to work on the subject again next month "to
4          of Durham degrees were subject to a religion test
5          London, which were not subject to any religious test,
6      cited researchers in broad subject categories in life sciences,
7   Losing Weight.  Broaching the subject of weight can be
8   by survey respondents include subject and curriculum, assessment, pastoral,
9      knowledge in teachers' own subject area, the use of
10    each addressing a different subject and how citizenship and
11          and school staff, but subject to that they dismissed
12            expressed but it is subject to the qualifications set
13             last piece on this subject was widely criticised and
14   saw themselves as foreigners subject to oppression by the
15      to suggest that, although subject to similar experiences, other
16            since you raise the subject, it's notable that very
17     position of the privileged subject with their disorderly emotions
18      Jimmy may include radical subject matter in his scripts,
19        more than sufficient as subject matter and as an
20            the NATO script were subject to personal attacks from
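If you want to produce keyword-in-context (KWIC) lines like these from your own text, the idea fits in a few lines. This is a minimal sketch, not WebCorp's implementation; the function name `kwic`, the `width` parameter and the sample text are all invented for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = r"\b%s\b" % re.escape(keyword)
    for match in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so every keyword starts at column `width`.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

# Invented sample text, for illustration only.
sample = (
    "Degrees were subject to a religion test. "
    "Since you raise the subject, it is notable. "
    "Radical subject matter appears in his scripts."
)
for line in kwic(sample, "subject", width=25):
    print(line)
```

Aligning the keyword in a fixed column is what makes the different senses easy to scan by eye.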

There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

Free for personal/academic work, commercial use requires a license.

I first saw this in a tweet by Claire Hardaker.


Sunday, September 7th, 2014

Lgram: A memory-efficient ngram builder

From the webpage:

Lgram is a cross–platform tool for calculating ngrams in a memory–efficient manner. The current crop of n-gram tools have non–constant memory usage such that ngrams cannot be computed for large input texts. Given the prevalence of large texts in computational and corpus linguistics, this deficit is problematic. Lgram has constant memory usage so it can compute ngrams on arbitrarily sized input texts. Lgram achieves constant memory usages by periodically syncing the computed ngrams to an sqlite database stored on disk.

Lgram was written by Edward J. L. Bell at Lancaster University and funded by UCREL. The project was initiated by Dr Paul Rayson.
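The technique described, bounding memory by periodically syncing counts to an SQLite database, can be sketched in a few lines. This is not Lgram's code; the function name `count_ngrams`, the `ngrams` table layout and the `flush_every` parameter are assumptions made for illustration.

```python
import sqlite3
from collections import Counter

def count_ngrams(tokens, n, conn, flush_every=10000):
    """Count n-grams from an iterable of tokens, holding at most
    `flush_every` distinct n-grams in memory before syncing to SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams "
        "(gram TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    buffer = Counter()
    window = []

    def flush():
        # Create a zero row for unseen n-grams, then add the buffered counts.
        for gram, count in buffer.items():
            conn.execute(
                "INSERT OR IGNORE INTO ngrams (gram, count) VALUES (?, 0)", (gram,)
            )
            conn.execute(
                "UPDATE ngrams SET count = count + ? WHERE gram = ?", (count, gram)
            )
        conn.commit()
        buffer.clear()

    for token in tokens:
        window.append(token)
        if len(window) > n:
            window.pop(0)  # keep only the last n tokens
        if len(window) == n:
            buffer[" ".join(window)] += 1
            if len(buffer) >= flush_every:
                flush()
    flush()  # sync whatever is left

# In-memory database for the demo; a file path would keep the counts on disk.
conn = sqlite3.connect(":memory:")
count_ngrams("the cat sat on the cat sat".split(), 2, conn, flush_every=2)
rows = dict(conn.execute("SELECT gram, count FROM ngrams"))
print(rows)
```

Because `tokens` can be any iterable, the input text can be streamed from a file rather than loaded whole, which is the other half of keeping memory use flat.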

Not recent (2011) but new to me. Enjoy!

I first saw this in a tweet by Christopher Phipps.

Computational Linguistics [09-2014]

Sunday, September 7th, 2014

Chris Callison-Burch tweets that Volume 40, Issue 3 – September 2014, ACL Anthology is now available!

In the September issue:

J14-3001: Montserrat Marimon; Núria Bel; Lluís Padró
Squibs: Automatic Selection of HPSG-Parsed Sentences for Treebank Construction

J14-3002: Jürgen Wedekind
Squibs: On the Universal Generation Problem for Unification Grammars

J14-3003: Ahmed Hassan; Amjad Abu-Jbara; Wanchen Lu; Dragomir Radev
A Random Walk–Based Model for Identifying Semantic Orientation

J14-3004: Xu Sun; Wenjie Li; Houfeng Wang; Qin Lu
Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing

J14-3005: Diarmuid Ó Séaghdha; Anna Korhonen
Probabilistic Distributional Semantics with Latent Variable Models

J14-3006: Joel Lang; Mirella Lapata
Similarity-Driven Semantic Role Induction via Graph Partitioning

J14-3007: Linlin Li; Ivan Titov; Caroline Sporleder
Improved Estimation of Entropy for Evaluation of Word Sense Induction

J14-3008: Cyril Allauzen; Bill Byrne; Adrià de Gispert; Gonzalo Iglesias; Michael Riley
Pushdown Automata in Statistical Machine Translation

J14-3009: Dan Jurafsky
Obituary: Charles J. Fillmore

All issues of 2014.

New York Times Annotated Corpus Add-On

Wednesday, August 27th, 2014

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

  • entity index
  • automatically inferred salience {0,1}
  • mention count (from our coreference system)
  • first mention’s text
  • byte offset start position for the first mention
  • byte offset end position for the first mention
  • MID (from our entity resolution system)
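Reading that layout takes only a short parser. This is a minimal sketch, assuming the document id and title on the header line are separated by a tab (the description does not say); the names `Entity` and `parse_annotations` and the sample records are invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Iterable, Iterator, List, Tuple

@dataclass
class Entity:
    index: int           # entity index
    salient: bool        # automatically inferred salience, 0 or 1
    mention_count: int   # from the coreference system
    first_mention: str   # text of the first mention
    start: int           # byte offset where the first mention starts
    end: int             # byte offset where the first mention ends
    mid: str             # MID from the entity resolution system

def parse_annotations(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[Entity]]]:
    """Yield (doc_id, title, entities) for each blank-line-separated document."""
    block: List[str] = []
    # The trailing "" is a sentinel so the last document is flushed too.
    for raw in chain(lines, [""]):
        line = raw.rstrip("\n")
        if line.strip():
            block.append(line)
            continue
        if not block:
            continue
        # ASSUMPTION: document id and title are separated by a tab.
        doc_id, _, title = block[0].partition("\t")
        entities = []
        for row in block[1:]:
            idx, sal, count, text, start, end, mid = row.split("\t")
            entities.append(
                Entity(int(idx), sal == "1", int(count), text, int(start), int(end), mid)
            )
        yield doc_id, title, entities
        block = []

# Invented sample records, for illustration only.
sample = (
    "doc-1\tFirst Example Title\n"
    "0\t1\t3\tAlice\t10\t15\t/m/aaa\n"
    "1\t0\t1\tBoston\t40\t46\t/m/bbb\n"
    "\n"
    "doc-2\tSecond Example Title\n"
    "0\t1\t2\tCarol\t0\t5\t/m/ccc\n"
)
docs = list(parse_annotations(sample.splitlines()))
for doc_id, title, entities in docs:
    print(doc_id, title, [e.first_mention for e in entities if e.salient])
```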

The background in Teaching machines to read between the lines (and a new corpus with entity salience annotations) by Dan Gillick and Dave Orr, will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real word which we already already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is: if we “recognize” an entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see enough divided posts and other tricks that force readers to endure more ads that I consciously avoid buying anything for which I see a web ad. Suggest you do the same. (If possible.) I buy books, for example, because someone known to me recommends them, not because some marketeer pushes them at me across many domains.

Improving sparse word similarity models…

Tuesday, August 26th, 2014

Improving sparse word similarity models with asymmetric measures by Jean Mark Gawron.


We show that asymmetric models based on Tversky (1977) improve correlations with human similarity judgments and nearest neighbor discovery for both frequent and middle-rank words. In accord with Tversky’s discovery that asymmetric similarity judgments arise when comparing sparse and rich representations, improvement on our two tasks can be traced to heavily weighting the feature bias toward the rarer word when comparing high- and mid- frequency words.

From the introduction:

A key assumption of most models of similarity is that a similarity relation is symmetric. This assumption is foundational for some conceptions, such as the idea of a similarity space, in which similarity is the inverse of distance; and it is deeply embedded into many of the algorithms that build on a similarity relation among objects, such as clustering algorithms. The symmetry assumption is not, however, universal, and it is not essential to all applications of similarity, especially when it comes to modeling human similarity judgments.
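To see how a similarity measure can be asymmetric, here is the set-based form of Tversky's ratio model. This is a minimal sketch over plain feature sets, not the weighted formulation used in the paper; the function name `tversky` and the feature sets are invented for illustration.

```python
def tversky(a, b, alpha, beta):
    """Tversky ratio model on feature sets:
    |A & B| / (|A & B| + alpha * |A - B| + beta * |B - A|)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

# Invented features: a sparsely described word and a richly described one.
rare = {"x", "y"}
frequent = {"x", "y", "z", "w"}

print(tversky(rare, frequent, alpha=0.9, beta=0.1))  # rare word compared to frequent word
print(tversky(frequent, rare, alpha=0.9, beta=0.1))  # same pair, opposite direction
print(tversky(rare, frequent, alpha=1.0, beta=1.0))  # alpha == beta gives a symmetric measure
```

With alpha equal to beta the measure is symmetric (both set to 1 gives the Jaccard index), so symmetry is a parameter choice here rather than a built-in property.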

What assumptions underlie your “similarity” measures?

Not that we can get away from “assumptions” but are your assumptions based on evidence or are they unexamined assumptions?

Do you know of any general techniques for discovering assumptions in algorithms?

SAMUELS [English Historical Semantic Tagger]

Wednesday, July 9th, 2014

SAMUELS (Semantic Annotation and Mark-Up for Enhancing Lexical Searches)

From the webpage:

The SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches) is funded by the Arts and Humanities Research Council in conjunction with the Economic and Social Research Council (grant reference AH/L010062/1) from January 2014 to April 2015. It will deliver a system for automatically annotating words in texts with their precise meanings, disambiguating between possible meanings of the same word, ultimately enabling a step-change in the way we deal with large textual data. It uses the Historical Thesaurus of English as its core dataset, and will provide for each word in a text the Historical Thesaurus reference code for that concept. Textual data tagged in this way can then be accurately searched and precisely investigated, producing results which can be automatically aggregated at a range of levels of precision. The project also draws on a series of research sub-projects which will employ the software thus developed, testing and validating the utility of the SAMUELS tagger as a tool for wide-ranging further research.

To really appreciate this project, visit SAMUELS English Semantic Tagger Test Site.

There you can enter up to 2000 English words and select lower/upper year boundaries!

Just picking a text at random, ;-), I chose:

Greenpeace flew its 135-foot-long thermal airship over the Bluffdale, UT, data center early Friday morning, carrying the message: “NSA Illegal Spying Below” along with a link steering people to a new web site,, which the three groups launched with the support of a separate, diverse coalition of over 20 grassroots advocacy groups and Internet companies. The site grades members of Congress on what they have done, or often not done, to rein in the NSA.

Some terms and Semtag3 by time period:


  • congress: C09d01 [Sexual intercourse]; E07e16 [Inclination]; E08e12 [Movement towards a thing/person/position]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; E07e16 [Inclination];
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: H55a [Attestation, witness, evidence];
  • thermal: A04b02 [Spring]; C09a [Sexual desire]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06a07 [Disorders of eye/vision]; B06d01 [Deformities of specific parts]; B25d [Tools and implements];


  • congress: S06k17a [Diplomacy]; C09d01 [Sexual intercourse]; O07 [Conversation];
  • data: F04v04 [Data]; H55a [Attestation, witness, evidence]; W05 [Information];
  • thermal: A04b02 [Spring]; B28b [Types/styles of clothing]; D03c02 [Heat];
  • UT: 04.10[Unrecognised]
  • web: B06d01 [Deformities of specific parts]; B22h08 [Class Arachnida (spiders, scorpions)]; B10 [Biological Substance];


  • congress: 04.10[Unrecognised]
  • data: 04.10[Unrecognised]
  • thermal: 04.10[Unrecognised]
  • UT: 04.10[Unrecognised]
  • web: 04.10[Unrecognised]

I am assuming that the “04.10[Unrecognised]” for all terms in 2000-2014 means there is no usage data for that time period.

I have never heard anyone deny that meanings of words change over time and domain.

What remains a mystery is why the value-add of documenting the meanings of words isn’t obvious.

I say “words,” but I should be saying “data.” Remember the loss of the $125 million Mars Climate Orbiter: one system read a value as “pounds of force” and another read the same data as “newtons.” In that scenario, ET doesn’t get to call home.

So let’s rephrase the question to: Why isn’t the value-add of documenting the semantics of data obvious?


Frege in Space:…

Wednesday, July 2nd, 2014

Frege in Space: A Program of Compositional Distributional Semantics by Marco Baroni, Raffaela Bernardi, Roberto Zamparelli.


The lexicon of any natural language encodes a huge number of distinct word meanings. Just to understand this article, you will need to know what thousands of words mean. The space of possible sentential meanings is infinite: In this article alone, you will encounter many sentences that express ideas you have never heard before, we hope. Statistical semantics has addressed the issue of the vastness of word meaning by proposing methods to harvest meaning automatically from large collections of text (corpora). Formal semantics in the Fregean tradition has developed methods to account for the infinity of sentential meaning based on the crucial insight of compositionality, the idea that meaning of sentences is built incrementally by combining the meanings of their constituents. This article sketches a new approach to semantics that brings together ideas from statistical and formal semantics to account, in parallel, for the richness of lexical meaning and the combinatorial power of sentential semantics. We adopt, in particular, the idea that word meaning can be approximated by the patterns of co-occurrence of words in corpora from statistical semantics, and the idea that compositionality can be captured in terms of a syntax-driven calculus of function application from formal semantics.

At one hundred and ten (110) pages, this is going to take a while to read and even longer to digest. What I have read so far is both informative and, surprisingly for the subject area, quite pleasant reading.

Thoughts about building up a subject identification by composition?


I first saw this in a tweet by Stefano Bertolo.

Non-Native Written English

Wednesday, June 18th, 2014

ETS Corpus of Non-Native Written English by Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. (Blanchard, Daniel, et al. ETS Corpus of Non-Native Written English LDC2014T06. Web Download. Philadelphia: Linguistic Data Consortium, 2014.)

From the webpage:

ETS Corpus of Non-Native Written English was developed by Educational Testing Service and is comprised of 12,100 English essays written by speakers of 11 non-English native languages as part of an international test of academic English proficiency, TOEFL (Test of English as a Foreign Language). The test includes reading, writing, listening, and speaking sections and is delivered by computer in a secure test center. This release contains 1,100 essays for each of the 11 native languages sampled from eight topics with information about the score level (low/medium/high) for each essay.

The corpus was developed with the specific task of native language identification in mind, but is likely to support tasks and studies in the educational domain, including grammatical error detection and correction and automatic essay scoring, in addition to a broad range of research studies in the fields of natural language processing and corpus linguistics. For the task of native language identification, the following division is recommended: 82% as training data, 9% as development data and 9% as test data, split according to the file IDs accompanying the data set.

A data set for detecting the native language of authors writing in English. Not unlike the post earlier today on LDA, which attempts to detect topics that are (allegedly) behind words in a text.

I mention that because some CS techniques start with the premise that words are indirect representatives of something hidden, while other parts of CS, search for example, presume that words have no depth, only surface. The Google Books N-Gram Viewer makes that assumption.

The N-Gram Viewer makes no distinction between any use of these words:

  • awful
  • backlog
  • bad
  • cell
  • fantastic
  • gay
  • rubbers
  • tool

Some have changed meaning recently, others, not quite so recently.

This is a partial list from a common resource: These 12 Everyday Words Used To Have Completely Different Meanings. Imagine if you did the historical research to place words in their particular social context.

It may be necessary for some purposes to presume words are shallow, but always remember that is a presumption and not a truth.

I first saw this in a tweet by Christopher Phipps.

Master Metaphor List

Friday, June 13th, 2014

Master Metaphor List (2nd edition) by George Lakoff, Jane Espenson, and Alan Schwartz.

From the cover page:

This is the second attempt to compile in one place the results of metaphor research since the publication of Reddy’s ‘The Conduit Metaphor’ and Lakoff and Johnson’s Metaphors We Live By. This list is a compilation taken from published books and papers, student papers at Berkeley and elsewhere, and research seminars. This represents perhaps 20 percent (a very rough estimate) of the material we have that needs to be compiled.

‘Compiling’ includes reanalyzing the metaphors and fitting them into something resembling a uniform format. The present list is anything but a finished product. This catalog is not intended to be definitive in any way. It is simply what happens to have been catalogued by volunteer labor by the date of distribution. We are making it available to students and colleagues in the hope that they can improve upon it and use it as a place to begin further research.

We expect to have subsequent drafts appearing at regular intervals. Readers are encouraged to submit ideas for additions and revisions.

Because of the size and complexity of the list, we have included a number of features to make using it easier. The Table of Contents at the beginning of the catalog lists the files in each of the four sections in the order in which they appear. At the beginning of each section is a brief description of the metaphors contained within. Finally, an alphabetized index of metaphor names has been provided.

What I haven’t seen at George Lakoff’s website are “subsequent drafts” of this document. You?

Nowadays I would expect the bibliography entries to be pointers to specific documents.

It was in looking for later resources that I discovered:

The EuroWordNet project was completed in the summer of 1999. The design of the database, the defined relations, the top-ontology and the Inter-Lingual-Index are now frozen. (EuroWordNet)

I wasn’t aware that new words and metaphors had stopped entering Dutch, Italian, Spanish, German, French, Czech and Estonian in 1999. You see, it is true, you can learn something new every day!

Of course, in this case, what I learned is false. Dutch, Italian, Spanish, German, French, Czech and Estonian continue to enrich themselves and create new metaphors.

Unlike first order logic (FOL) in the views of some.

Maybe that is why Dutch, Italian, Spanish, German, French, Czech and Estonian are all more popular than FOL by varying orders of magnitude.

I first saw this in a tweet by Francis Storr.


Thursday, June 5th, 2014

UniArab: An RRG Arabic-to-English Machine Translation Software by Dr. Brian Nolan and Yasser Salem.

A slide deck introducing UniArab.

I first saw this mentioned in a tweet by Christopher Phipps.

Which was enough to make me curious about the software and perhaps the original paper.

UniArab: An RRG Arabic-to-English Machine Translation Software (paper) by Brian Nolan and Yasser Salem.


This paper presents a machine translation system (Hutchins 2003) called UniArab (Salem, Hensman and Nolan 2008). It is a proof-of-concept system supporting the fundamental aspects of Arabic, such as the parts of speech, agreement and tenses. UniArab is based on the linking algorithm of RRG (syntax to semantics and vice versa). UniArab takes MSA Arabic as input in the native orthography, parses the sentence(s) into a logical meta-representation based on the fully expanded RRG logical structures and, using this, generates perfectly grammatical English output with full agreement and morphological resolution. UniArab utilizes an XML-based implementation of elements of the Role and Reference Grammar theory in software. In order to analyse Arabic by computer we first extract the lexical properties of the Arabic words (Al-Sughaiyer and Al-Kharashi 2004). From the parse, it then creates a computer-based representation for the logical structure of the Arabic sentence(s). We use the RRG theory to motivate the computational implementation of the architecture of the lexicon in software. We also implement in software the RRG bidirectional linking system to build the parse and generate functions between the syntax-semantic interfaces. Through seven input phases, including the morphological and syntactic unpacking, UniArab extracts the logical structure of an Arabic sentence. Using the XML-based metadata representing the RRG logical structure, UniArab then accurately generates an equivalent grammatical sentence in the target language through four output phases. We discuss the technologies used to support its development and also the user interface that allows for the addition of lexical items directly to the lexicon in real time. The UniArab system has been tested and evaluated generating equivalent grammatical sentences, in English, via the logical structure of Arabic sentences, based on MSA Arabic input with very significant and accurate results (Izwaini 2006). At present we are working to greatly extend the coverage by the addition of more verbs to the lexicon. We have demonstrated in this research that RRG is a viable linguistic model for building accurate rule-based semantically oriented machine translation software. Role and Reference Grammar (RRG) is a functional theory of grammar that posits a direct mapping between the semantic representation of a sentence and its syntactic representation. The theory allows a sentence in a specific language to be described in terms of its logical structure and grammatical procedures. RRG creates a linking relationship between syntax and semantics, and can account for how semantic representations are mapped into syntactic representations. We claim that RRG is very suitable for machine translation of Arabic, notwithstanding well-documented difficulties found within Arabic MT (Izwaini, S. 2006), and that RRG can be implemented in software as the rule-based kernel of an Interlingua bridge MT engine. The version of Arabic (Ryding 2005, Alosh 2005, Schulz 2005), we consider in this paper is Modern Standard Arabic (MSA), which is distinct from classical Arabic. In the Arabic linguistic tradition there is not a clear-cut, well defined analysis of the inventory of parts of speech in Arabic.

At least as of today, the link times out. Other pointers?

Interesting work on Arabic translation. Makes me curious about adaptation of these techniques to map between semantic domains.
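
To make the interlingua idea in the abstract concrete, here is a toy sketch in Python. It is not UniArab's code: the three-word lexicon, the `LogicalStructure` class and the function names are all hypothetical, the notation is only loosely modeled on RRG logical structures, and the seven input and four output phases are collapsed into one `parse` and one `generate` step. It handles a single verb-subject-object sentence.

```python
from dataclasses import dataclass

# Hypothetical lexicon for illustration only. A real system would perform
# morphological analysis rather than whole-word lookup.
LEXICON = {
    "قرأ": {"pos": "verb", "pred": "read", "tense": "PAST"},
    "الولد": {"pos": "noun", "gloss": "the boy"},
    "الكتاب": {"pos": "noun", "gloss": "the book"},
}

# English surface forms keyed by (predicate, tense); also an assumption.
ENGLISH_VERBS = {("read", "PAST"): "read"}


@dataclass
class LogicalStructure:
    """Language-neutral middle layer between source and target sentence."""
    tense: str
    predicate: str
    actor: str       # English gloss of the first argument
    undergoer: str   # English gloss of the second argument

    def notation(self):
        # Loosely RRG-style rendering of an activity predicate.
        return (f"<TNS:{self.tense} [do'({self.actor}, "
                f"[{self.predicate}'({self.actor}, {self.undergoer})])]>")


def parse(sentence):
    """Map a verb-subject-object Arabic sentence to a logical structure."""
    tokens = sentence.split()
    if len(tokens) != 3 or any(tok not in LEXICON for tok in tokens):
        raise ValueError("only three-word sentences from the toy lexicon are handled")
    verb, subject, obj = (LEXICON[tok] for tok in tokens)
    if (verb["pos"], subject["pos"], obj["pos"]) != ("verb", "noun", "noun"):
        raise ValueError("expected verb-subject-object order")
    return LogicalStructure(verb["tense"], verb["pred"],
                            subject["gloss"], obj["gloss"])


def generate(ls):
    """Produce an English subject-verb-object sentence from the structure."""
    verb = ENGLISH_VERBS[(ls.predicate, ls.tense)]
    text = f"{ls.actor} {verb} {ls.undergoer}"
    return text[0].upper() + text[1:] + "."


def translate(sentence):
    return generate(parse(sentence))
```

Calling `translate("قرأ الولد الكتاب")` goes through the intermediate structure rather than mapping words directly, which is the point of the design: a second target language would only need its own `generate` step.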

I first saw this in a tweet by Christopher Phipps.

Online Language Taggers

Tuesday, May 13th, 2014

UCREL Semantic Analysis System (USAS)

From the homepage:

The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects and this page collects together various pointers to those projects and publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur’s Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (shown here on the right), subdivided, and with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.
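
For readers who want to work with the tagger output, the multi-tier structure can be unpacked with a few lines of Python. This is a minimal sketch under assumptions: it takes a tag code to be one capital letter for the major discourse field, dotted numbers for the subdivisions, and an optional run of `+` or `-` markers (for example `E4.1+`). That shape and the handful of field names below are from memory of the published tagset, so check them against the USAS category introduction before relying on them. `SemanticTag`, `parse_tag` and `describe` are names made up for this example.

```python
import re
from dataclasses import dataclass

# A few major discourse fields only, not the full set of 21.
MAJOR_FIELDS = {
    "A": "general and abstract terms",
    "E": "emotion",
    "F": "food and farming",
    "T": "time",
    "Z": "names and grammar",
}

# Assumed code shape: letter, dotted numeric tiers, optional +/- markers.
TAG_PATTERN = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)([+-]*)$")


@dataclass
class SemanticTag:
    major: str           # top-tier letter, e.g. "E"
    subdivisions: tuple  # lower tiers as integers, e.g. (4, 1)
    polarity: str        # "", "+", "-", "++", ...


def parse_tag(code):
    """Split a tag code into its tiers; raise ValueError if it does not fit."""
    match = TAG_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised tag code: {code!r}")
    letter, digits, polarity = match.groups()
    subdivisions = tuple(int(part) for part in digits.split("."))
    return SemanticTag(letter, subdivisions, polarity)


def describe(tag):
    """Human-readable summary using the partial field table above."""
    name = MAJOR_FIELDS.get(tag.major, "field not in this partial table")
    return f"{tag.major} ({name}), subdivisions {tag.subdivisions}"
```

Grouping tagged words by `tag.major` gives a coarse topical profile of a text; grouping by `(tag.major, tag.subdivisions[0])` gives the next tier down.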

There are four online taggers available:

English: 100,000 word limit

Italian: 2,000 word limit

Dutch: 2,000 word limit

Chinese: 3,000 character limit
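
If your text is longer than these limits, you will need to split it before pasting it in. A rough Python sketch follows. The limits come from the list above; how the service itself counts words or characters is not stated, so whitespace-separated tokens and non-whitespace characters are assumptions here, and `fits_limit` and `split_to_fit` are hypothetical helper names.

```python
# Limits as listed above; note the unit differs for Chinese.
LIMITS = {
    "english": (100_000, "words"),
    "italian": (2_000, "words"),
    "dutch": (2_000, "words"),
    "chinese": (3_000, "characters"),
}


def text_size(text, unit):
    """Count whitespace-separated words, or non-whitespace characters."""
    if unit == "words":
        return len(text.split())
    return sum(1 for ch in text if not ch.isspace())


def fits_limit(text, language):
    limit, unit = LIMITS[language.lower()]
    return text_size(text, unit) <= limit


def split_to_fit(text, language):
    """Cut text into chunks that each stay within the limit.

    Chunks are cut purely by count, so a sentence may be split in two,
    and whitespace is normalised (words) or dropped (characters).
    """
    limit, unit = LIMITS[language.lower()]
    if unit == "words":
        words = text.split()
        return [" ".join(words[i:i + limit])
                for i in range(0, len(words), limit)]
    chars = [ch for ch in text if not ch.isspace()]
    return ["".join(chars[i:i + limit])
            for i in range(0, len(chars), limit)]
```

Cutting at sentence boundaries instead would likely give the tagger better context, at the cost of slightly smaller chunks.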


I first saw this in a tweet by Paul Rayson.

GATE 8.0

Monday, May 12th, 2014

GATE (general architecture for text engineering) 8.0

From the download page:

Release 8.0 (May 11th 2014)

Most users should download the installer package (~450MB):

If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

Four major changes in this release:

  1. Requires Java 7 or later to run.
  2. Tools for Twitter.
  3. ANNIE (named entity annotation pipeline) refreshed.
  4. Tools for Crowd Sourcing.

Not bad for a project that will turn twenty (20) next year!

More resources:


Nightly Snapshots

Mastering a substantial portion of GATE should keep you in nearly constant demand.