Archive for the ‘Computational Linguistics’ Category

King – Man + Woman = Queen:…

Tuesday, September 22nd, 2015

King – Man + Woman = Queen: The Marvelous Mathematics of Computational Linguistics.

From the post:

Computational linguistics has dramatically changed the way researchers study and understand language. The ability to number-crunch huge amounts of words for the first time has led to entirely new ways of thinking about words and their relationship to one another.

This number-crunching shows exactly how often a word appears close to other words, an important factor in how they are used. So the word Olympics might appear close to words like running, jumping, and throwing but less often next to words like electron or stegosaurus. This set of relationships can be thought of as a multidimensional vector that describes how the word Olympics is used within a language, which itself can be thought of as a vector space.

And therein lies this massive change. This new approach allows languages to be treated like vector spaces with precise mathematical properties. Now the study of language is becoming a problem of vector space mathematics.

Today, Timothy Baldwin at the University of Melbourne in Australia and a few pals explore one of the curious mathematical properties of this vector space: that adding and subtracting vectors produces another vector in the same space.

The question they address is this: what do these composite vectors mean? And in exploring this question they find that the difference between vectors is a powerful tool for studying language and the relationship between words.
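The offset trick in the post's title can be seen in a toy example. The vectors below are made-up 3-dimensional stand-ins, not trained embeddings; real systems use hundreds of dimensions learned from a corpus, but the arithmetic is the same:

```python
import numpy as np

# Toy "embeddings" -- hypothetical values, not from any trained model.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.0]),
}

def nearest(target, exclude):
    """Return the vocabulary word whose vector is closest (by cosine) to target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# king - man + woman lands nearest to queen in this toy space.
target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # prints "queen"
```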

A great lay introduction to:

Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning by Ekaterina Vylomova, Laura Rimell, Trevor Cohn, Timothy Baldwin.


Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.
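The paper's second setting, classifying vector differences into relation types, can be sketched with a nearest-centroid classifier over toy vectors. Everything here (the vectors, the relation labels, the word pairs) is invented for illustration; the authors work with real pre-trained embeddings and a trained classifier:

```python
import numpy as np

# Toy word vectors -- hypothetical stand-ins for trained embeddings.
vecs = {
    "take": np.array([1.0, 0.0]), "took": np.array([1.0, -1.0]),
    "give": np.array([0.0, 1.0]), "gave": np.array([0.0, 0.0]),
    "goose": np.array([2.0, 2.0]), "gaggle": np.array([3.0, 2.0]),
    "wolf": np.array([2.0, 0.0]), "pack": np.array([3.0, 0.0]),
}

# Training pairs labelled with a lexical relation.
train = [(("take", "took"), "past-tense"), (("goose", "gaggle"), "collective")]

# Build one centroid per relation from the difference vectors b - a.
by_rel = {}
for (a, b), rel in train:
    by_rel.setdefault(rel, []).append(vecs[b] - vecs[a])
centroids = {rel: np.mean(ds, axis=0) for rel, ds in by_rel.items()}

def classify(a, b):
    """Label an unseen pair by its nearest relation centroid."""
    diff = vecs[b] - vecs[a]
    return min(centroids, key=lambda rel: np.linalg.norm(diff - centroids[rel]))

print(classify("give", "gave"))   # generalises to an unseen verb pair
print(classify("wolf", "pack"))   # and to an unseen collective pair
```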

The authors readily admit, much to their credit, that this isn't a one-size-fits-all solution.

But it is a line of research that merits your attention.

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

Saturday, December 6th, 2014

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

From the post:

A dialect is a particular form of language that is limited to a specific location or population group. Linguists are fascinated by these variations because they are determined both by geography and by demographics. So studying them can produce important insights into the nature of society and how different groups within it interact.

That’s why linguists are keen to understand how new words, abbreviations and usages spread on new forms of electronic communication, such as social media platforms. It is easy to imagine that the rapid spread of neologisms could one day lead to a single unified dialect of netspeak. An interesting question is whether there is any evidence that this is actually happening.

Today, we get a fascinating insight into this problem thanks to the work of Jacob Eisenstein at the Georgia Institute of Technology in Atlanta and a few pals. These guys have measured the spread of neologisms on Twitter and say they have clear evidence that online language is not converging at all. Indeed, they say that electronic dialects are just as common as ordinary ones and seem to reflect the same fault lines in society.

Disappointment for those who thought the Net would help people overcome the curse of Babel.

When we move into new languages or means of communication, we simply take our linguistic diversity with us, like well traveled but familiar luggage.

If you think about it, the difficulty of multiple semantics for owl:sameAs is another instance of the same phenomenon. Semantically distinct groups assigned the same token, owl:sameAs, different semantics. That should not have been a surprise. But it was, and it will be every time one community privileges itself to be the giver of meaning for any term.

If you want to see the background for the post in full:

Diffusion of Lexical Change in Social Media by Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing.


Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter’s sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity — especially with regard to race — plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.
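As a rough illustration of the modelling idea (not the authors' latent model, which additionally handles Twitter's changing sampling rate), a plain first-order vector autoregression can be fit by ordinary least squares. The two "cities" and the influence matrix here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-city word-frequency series x_t = A x_{t-1} + noise;
# the off-diagonal entries of A play the role of cross-city
# "linguistic influence" in the paper's model.
A_true = np.array([[0.8, 0.1],
                   [0.3, 0.6]])
x = np.zeros((200, 2))
for t in range(1, 200):
    x[t] = A_true @ x[t - 1] + rng.normal(scale=0.1, size=2)

# Regress x_t on x_{t-1}; lstsq solves X_prev @ B = X_next, with B = A^T.
X_prev, X_next = x[:-1], x[1:]
A_hat = np.linalg.lstsq(X_prev, X_next, rcond=None)[0].T

print(np.round(A_hat, 2))  # close to A_true
```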

Computational Linguistics [09-2014]

Sunday, September 7th, 2014

Chris Callison-Burch tweets that Volume 40, Issue 3 – September 2014, ACL Anthology is now available!

In the September issue:

J14-3001: Montserrat Marimon; Núria Bel; Lluís Padró
Squibs: Automatic Selection of HPSG-Parsed Sentences for Treebank Construction

J14-3002: Jürgen Wedekind
Squibs: On the Universal Generation Problem for Unification Grammars

J14-3003: Ahmed Hassan; Amjad Abu-Jbara; Wanchen Lu; Dragomir Radev
A Random Walk–Based Model for Identifying Semantic Orientation

J14-3004: Xu Sun; Wenjie Li; Houfeng Wang; Qin Lu
Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing

J14-3005: Diarmuid Ó Séaghdha; Anna Korhonen
Probabilistic Distributional Semantics with Latent Variable Models

J14-3006: Joel Lang; Mirella Lapata
Similarity-Driven Semantic Role Induction via Graph Partitioning

J14-3007: Linlin Li; Ivan Titov; Caroline Sporleder
Improved Estimation of Entropy for Evaluation of Word Sense Induction

J14-3008: Cyril Allauzen; Bill Byrne; Adrià de Gispert; Gonzalo Iglesias; Michael Riley
Pushdown Automata in Statistical Machine Translation

J14-3009: Dan Jurafsky
Obituary: Charles J. Fillmore

All issues of 2014.

Comparison of Corpora through Narrative Structure

Friday, May 16th, 2014

Comparison of Corpora through Narrative Structure by Dan Simonson.

A very interesting slide deck from a presentation on how news coverage of police activity may have changed before and after September 11th.

An early slide that caught my attention:

As a computational linguist, I can study 10^6 — instead of 10^0.6 — documents.

The sort of claim that clients might look upon with favor.

I first saw this in a tweet by Dominique Mariko.

European Computational Linguistics

Tuesday, April 29th, 2014

From the ACL Anthology:

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

A snapshot of the current state of computational linguistics and perhaps inspiration for the next advance.


Papers: ACL 2014

Friday, March 14th, 2014

Papers: ACL 2014

The list of accepted papers for the Association for Computational Linguistics conference, June 22-27 in Baltimore, Maryland, has been posted.

I am sure that out of the one hundred and forty-six (146) accepted papers you will find at least a few of interest. 😉

I first saw this in a tweet by Shane Bergsma.

Introduction to Computational Linguistics (Scala too!)

Saturday, February 1st, 2014

Introduction to Computational Linguistics by Jason Baldridge.

From the webpage:

Advances in computational linguistics have not only led to industrial applications of language technology; they can also provide useful tools for linguistic investigations of large online collections of text and speech, or for the validation of linguistic theories.

Introduction to Computational Linguistics introduces the most important data structures and algorithmic techniques underlying computational linguistics: regular expressions and finite-state methods, categorial grammars and parsing, feature structures and unification, meaning representations and compositional semantics. The linguistic levels covered are morphology, syntax, and semantics. While the focus is on the symbolic basis underlying computational linguistics, a high-level overview of statistical techniques in computational linguistics will also be given. We will apply the techniques in actual programming exercises, using the programming language Scala. Practical programming techniques, tips and tricks, including version control systems, will also be discussed.
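The course's programming exercises are in Scala, but one of the listed techniques, unification of feature structures, can be sketched in a few lines of Python. This toy version treats feature structures as nested dicts and ignores reentrancy (shared substructures), which a real implementation would need:

```python
def unify(a, b):
    """Unify two feature structures (nested dicts); return None on failure."""
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None  # atomic values must match exactly
    out = dict(a)
    for key, bval in b.items():
        if key in out:
            merged = unify(out[key], bval)
            if merged is None:
                return None  # feature clash propagates upward
            out[key] = merged
        else:
            out[key] = bval  # compatible information accumulates
    return out

np_fs = {"cat": "NP", "agr": {"num": "sg"}}
verb_subj = {"agr": {"num": "sg", "per": "3"}}
print(unify(np_fs, verb_subj))  # merges the agreement features

# A number clash makes unification fail:
print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))  # None
```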

Jason has created a page of links, which includes a twelve-part tutorial on Scala.

If you want to walk through the course on your own, see the schedule.


EACL 2014 – Gothenburg, Sweden – Call for Papers

Wednesday, August 7th, 2013

EACL 2014 – 26-30 April, Gothenburg, Sweden


Long papers:

  • Long paper submissions due: 18 October 2013
  • Long paper reviews due: 19 November 2013
  • Long paper author responses due: 29 November 2013
  • Long paper notification to authors: 20 December 2013
  • Long paper camera-ready due: 14 February 2014

Short papers:

  • Short paper submissions due: 6 January 2014
  • Short paper reviews due: 3 February 2014
  • Short paper notification to authors: 24 February 2014
  • Short paper camera-ready due: 3 March 2014

EACL conference: 26–30 April 2014

From the call:

The 14th Conference of the European Chapter of the Association for Computational Linguistics invites the submission of long and short papers on substantial, original, and unpublished research in all aspects of automated natural language processing, including but not limited to the following areas:

  • computational and cognitive models of language acquisition and language processing
  • information retrieval and question answering
  • generation and summarization
  • language resources and evaluation
  • machine learning methods and algorithms for natural language processing
  • machine translation and multilingual systems
  • phonetics, phonology, morphology, word segmentation, tagging, and chunking
  • pragmatics, discourse, and dialogue
  • semantics, textual entailment
  • social media, sentiment analysis and opinion mining
  • spoken language processing and language modeling
  • syntax, parsing, grammar formalisms, and grammar induction
  • text mining, information extraction, and natural language processing applications

Papers accepted to TACL by 30 November 2013 will also be eligible for presentation at EACL 2014; please see the TACL website for details.

It’s not too early to begin making plans for next Spring!

PPDB: The Paraphrase Database

Monday, July 22nd, 2013

PPDB: The Paraphrase Database by Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch.


We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.
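A PPDB rule is a single |||-delimited line. The parser below is a sketch: the five-field layout matches the release's format, but the example rule and its feature names and values are invented for illustration:

```python
def parse_ppdb_line(line):
    """Split one |||-delimited PPDB rule into its five fields."""
    lhs, phrase, paraphrase, features, alignment = \
        [f.strip() for f in line.split("|||")]
    feats = dict(kv.split("=", 1) for kv in features.split())
    return {"lhs": lhs, "phrase": phrase, "paraphrase": paraphrase,
            "features": {k: float(v) for k, v in feats.items()},
            "alignment": alignment}

# An illustrative rule -- field values are made up, not from the release:
line = "[NP] ||| the planet ||| the earth ||| p(e|f)=2.5 p(f|e)=3.1 ||| 0-0 1-1"
rule = parse_ppdb_line(line)
print(rule["paraphrase"], rule["features"]["p(e|f)"])
```

The pruning tools mentioned in the abstract work by thresholding exactly these per-rule scores.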

A resource that should improve your subject identification!

PPDB data sets range from 424 MB (6.8M rules) to 5.7 GB (86.4M rules). Download PPDB data sets.

NAACL 2013 – Videos!

Monday, July 22nd, 2013

NAACL 2013

Videos of the presentations at the 2013 Conference of the North American Chapter of the Association for Computational Linguistics.

Along with the papers, you should not lack for something to do over the summer!


Friday, June 14th, 2013

2013 Conference of the North American Chapter of the Association for Computational Linguistics

The NAACL conference wraps up tomorrow in Atlanta but in case you are running low on summer reading materials:

Proceedings for the 2013 NAACL and *SEM conferences. Not quite 180MB but close.

Scanning the accepted papers will give you an inkling of what awaits.


UMD CMSC 723: Computational Linguistics I

Tuesday, February 21st, 2012

UMD CMSC 723: Computational Linguistics I

Twenty-five (25) posts by Hal Daume III as part of his course on computational linguistics. References, pointers, examples, explanations.

I haven’t read these in detail. As always, I welcome your comments/suggestions.

It would be interesting to take the major university computational linguistics courses and create a topic map of the topics covered and recommended resources. Could be useful for students with different learning styles to find an approach that works for them.

Anyone care to hazard a list of, say, the top twenty (20) schools in computational linguistics? (Without ranking, just the top 20.)

PS: The course homepage.

50th Annual Meeting of the Association for Computational Linguistics

Saturday, December 3rd, 2011

50th Annual Meeting of the Association for Computational Linguistics

Important dates:

January 15, 2012 (11:59pm PST) : Long Submission Deadline
March 11, 2012 : Long Notification
April 30, 2012 : Long Camera Ready Deadline
March 18, 2012 (11:59pm PST) : Short Submission Deadline
April 23, 2012 : Short Notification
May 7, 2012 : Short Camera Ready Deadline
July 9, 2012 : Conference Starts
July 14, 2012 : Conference Ends

From the call for papers:

The 50th Annual Meeting of the Association for Computational Linguistics and the Human Language Technologies conference will be organized as a single event to be held at the International Convention Center Jeju, Jeju, Korea, on July 8-14, 2012. The conference will cover a broad spectrum of technical areas related to natural language and computation. ACL 2012 will include full papers, short papers, oral presentations, poster presentations, demonstrations, tutorials, and workshops. The conference is organized by the Association for Computational Linguistics.

The conference invites the submission of papers on original and unpublished research on all aspects of computational linguistics, including but not limited to:

1. Discourse, Dialogue, and Pragmatics
2. Information Extraction
3. Information Retrieval
4. Language Resources
5. Lexical Semantics
6. Lexicon and ontology development
7. Machine Translation
8. Multilinguality
9. Multimodal representations and processing
10. Social Media
11. Natural Language Processing Applications
12. Phonology/Morphology, Tagging and Chunking, Word Segmentation
13. Question Answering
14. Sentiment Analysis and Opinion Mining
15. Spoken Language Processing
16. Statistical and Machine Learning Methods
17. Summarization and Generation
18. Syntax and Parsing
19. Text Classification
20. Text Mining
21. User Studies and Evaluation Methods

Detecting Structure in Scholarly Discourse

Saturday, December 3rd, 2011

Detecting Structure in Scholarly Discourse (DSSD2012)

Important Dates:

March 11, 2012 Submission Deadline
April 15, 2012 Notification of acceptance
April 30, 2012 Camera-ready papers due
July 12 or 13, 2012 Workshop

From the Call for Papers:

The detection of discourse structure in scientific documents is important for a number of tasks, including biocuration efforts, text summarization, error correction, information extraction and the creation of enriched formats for scientific publishing. Currently, many parallel efforts exist to detect a range of discourse elements at different levels of granularity and for different purposes. Discourse elements detected include the statement of facts, claims and hypotheses, the identification of methods and protocols, as well as the differentiation between new and existing work. In medical texts, efforts are underway to automatically identify prescription and treatment guidelines, patient characteristics, and to annotate research data. Ambitious long-term goals include the modeling of argumentation and rhetorical structure and more recently narrative structure, by recognizing ‘motifs’ inspired by folktale analysis.

A rich variety of feature classes is used to identify discourse elements, including verb tense/mood/voice, semantic verb class, speculative language or negation, various classes of stance markers, text-structural components, or the location of references. These features are motivated by linguistic inquiry into the detection of subjectivity, opinion, entailment, inference, but also author stance and author disagreement, motif and focus.

Several workshops have been focused on the detection of some of these features in scientific text, such as speculation and negation in the 2010 workshop on Negation and Speculation in Natural Language Processing and the BioNLP’09 Shared Task, and hedging in the CoNLL-2010 Shared Task Learning to detect hedges and their scope in natural language text. Other efforts that have included a clear focus on scientific discourse annotation include STIL2011 and Force11, the Future of Research Communications and e-Science. There have been several efforts to produce large-scale corpora in this field, such as BioScope, where negation and speculation information were annotated, and the GENIA Event corpus.

The goal of the 2012 workshop Detecting Structure in Scholarly Discourse is to discuss and compare the techniques and principles applied in these various approaches, to consider ways in which they can complement each other, and to initiate collaborations to develop standards for annotating appropriate levels of discourse, with enhanced accuracy and usefulness.
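Many of the cue-based features the call mentions (speculative language, negation, stance markers) reduce in practice to lexicon and pattern matching. A deliberately tiny sketch, with made-up cue lists standing in for curated lexicons like BioScope's:

```python
import re

# Tiny illustrative cue lists; real systems use curated lexicons.
HEDGES = {"may", "might", "suggest", "suggests", "possibly", "appear"}
NEGATIONS = {"no", "not", "never", "cannot", "without"}

def discourse_features(sentence):
    """Extract a few binary discourse cues from one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {
        "has_hedge": any(t in HEDGES for t in tokens),
        "has_negation": any(t in NEGATIONS for t in tokens),
        # crude citation pattern, e.g. "(Smith, 2010)"
        "cites": bool(re.search(r"\(\s*[A-Z][a-z]+,?\s*\d{4}\s*\)", sentence)),
    }

print(discourse_features("These results suggest the protein may not bind."))
```

Features like these would then feed a classifier that labels sentences as fact, claim, hypothesis, and so on.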

This conference is being held in conjunction with ACL 2012.

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

Monday, October 3rd, 2011

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

I mention this not only because it looks like a good Scala tutorial series but also because it is being developed in connection with a course on computational linguistics at UT Austin (sorry, University of Texas at Austin, USA).

The cross-over between computer programming and computational linguistics illustrates the artificial nature of the divisions we make between disciplines and professions.

Graph-based Clustering for Computational Linguistics: A Survey

Thursday, March 17th, 2011

Graph-based Clustering for Computational Linguistics: A Survey

Slides by Zheng Chen and Heng Ji, City University of New York, July 2010.

A very concise summary of graph methods with citations to the literature.

You won’t be able to run off and become a hairy-chested graph warrior with these slides but you will have a better idea of why graphs are important.

Association for Computational Linguistics: Human Language Technologies (2011 Portland)

Monday, March 14th, 2011

Association for Computational Linguistics: Human Language Technologies (49th annual meeting)

The time for submitting papers is past but a quick look at the list of accepted papers gives plenty of reasons to attend.

To be held at the Portland Marriott Downtown Waterfront in Portland, Oregon, USA, June 19-24, 2011.

So you don’t miss 2012, it will be held on Jeju Island, Republic of Korea. I have been to Jeju Island. It is awesome!