Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 12, 2014

GATE 8.0

Filed under: Annotation,Linguistics,Text Analytics,Text Corpus,Text Mining — Patrick Durusau @ 2:34 pm

GATE (General Architecture for Text Engineering) 8.0

From the download page:

Release 8.0 (May 11th 2014)

Most users should download the installer package (~450MB):

If the installer does not work for you, you can download one of the following packages instead. See the user guide for installation instructions:

The BIN, SRC and ALL packages all include the full set of GATE plugins and all the libraries GATE requires to run, including sample trained models for the LingPipe and OpenNLP plugins.

Version 8.0 requires Java 7 or 8, and Mac users must install the full JDK, not just the JRE.

Four major changes in this release:

  1. Requires Java 7 or later to run.
  2. Tools for Twitter.
  3. A refreshed ANNIE (named entity annotation pipeline).
  4. Tools for crowdsourcing.

Not bad for a project that will turn twenty (20) next year!

More resources:

User Guide

Nightly Snapshots

Mastering a substantial portion of GATE should keep you in nearly constant demand.

April 29, 2014

European Computational Linguistics

Filed under: Computational Linguistics,Linguistics — Patrick Durusau @ 2:00 pm

From the ACL Anthology:

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

A snapshot of the current state of computational linguistics and perhaps inspiration for the next advance.

Enjoy!

April 16, 2014

…Generalized Language Models…

Filed under: Language,Linguistics,Modeling — Patrick Durusau @ 1:19 pm

How Generalized Language Models outperform Modified Kneser Ney Smoothing by a Perplexity drop of up to 25% by René Pickhardt.

René reports on the core of his dissertation work.

From the post:

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and interpolating longer order models (models with longer word sequences) with lower order language models. While this idea is strong and helpful it is usually applied in the same way. In order to use a shorter model the first word of the sequence is omitted. This will be iterated. The problem occurs if one of the last words of the sequence is the really rare word. In this case omitting words at the front will not help.

So the simple trick of Generalized Language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.

Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.
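The skipped-history idea is easy to picture in code. A minimal sketch (my illustration, not René's implementation, which combines these contexts with Modified Kneser-Ney smoothing):

    def skipped_contexts(words):
        # For an n-word sequence, return the n-1 generalized contexts,
        # each obtained by skipping one word at positions 1 to n-1.
        # The last word (the one being predicted) is never skipped.
        n = len(words)
        return [tuple(w if i != skip else "_" for i, w in enumerate(words))
                for skip in range(n - 1)]

    print(skipped_contexts(["the", "quick", "brown", "fox"]))
    # [('_', 'quick', 'brown', 'fox'),
    #  ('the', '_', 'brown', 'fox'),
    #  ('the', 'quick', '_', 'fox')]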

Unlike some white papers, webinars and demos, you don’t have to register, list your email and phone number, etc. to see both the test data and code that implements René’s ideas.

Data, Source.

Please send René useful feedback as a way to say thank you for sharing both data and code.

March 24, 2014

The GATE Crowdsourcing Plugin:…

Filed under: Annotation,Crowd Sourcing,Linguistics — Patrick Durusau @ 4:27 pm

The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy by Kalina Bontcheva, Ian Roberts, Leon Derczynski, and Dominic Rout.

Abstract:

Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowdsourcing interfaces for NLP classification and selection tasks. The entire workflow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.

From the introduction:

A big outstanding challenge for crowdsourcing projects is that the cost to define a single annotation task remains quite substantial. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units, as well as automatically generated, reusable user interfaces [1] for NLP classification and selection tasks. Their use will be demonstrated on annotating named entities (selection task), disambiguating words and named entities with respect to DBpedia URIs (classification task), annotation of opinion holders and targets (selection task), as well as sentiment (classification task).

Interesting.

Are the difficulties associated with annotation UIs a matter of creating the UI or the choices that underlie the UI?

This plugin may shed light on possible answers to that question.

March 15, 2014

Words as Tags?

Filed under: Linguistics,Text Mining,Texts,Word Meaning — Patrick Durusau @ 8:46 pm

Wordcounts are amazing. by Ted Underwood.

From the post:

People new to text mining are often disillusioned when they figure out how it’s actually done — which is still, in large part, by counting words. They’re willing to believe that computers have developed some clever strategy for finding patterns in language — but think “surely it’s something better than that?”

Uneasiness with mere word-counting remains strong even in researchers familiar with statistical methods, and makes us search restlessly for something better than “words” on which to apply them. Maybe if we stemmed words to make them more like concepts? Or parsed sentences? In my case, this impulse made me spend a lot of time mining two- and three-word phrases. Nothing wrong with any of that. These are all good ideas, but they may not be quite as essential as we imagine.

Working with text is like working with a video where every element of every frame has already been tagged, not only with nouns but with attributes and actions. If we actually had those tags on an actual video collection, I think we’d recognize it as an enormously valuable archive. The opportunities for statistical analysis are obvious! We have trouble recognizing the same opportunities when they present themselves in text, because we take the strengths of text for granted and only notice what gets lost in the analysis. So we ignore all those free tags on every page and ask ourselves, “How will we know which tags are connected? And how will we know which clauses are subjunctive?”
….

What a delightful insight!

When we say text is “unstructured” what we really mean is that something as dumb as a computer sees no structure in the text.

A human reader, even a 5- or 6-year-old, sees lots of structure in a text, and meaning too.

Rather than trying to “teach” computers to read, perhaps we should use computers to facilitate reading by those who already can.

Yes?
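As a reminder of just how simple the core operation is, here is a minimal word- and phrase-count sketch (my example, not from Ted's post):

    from collections import Counter

    text = "to be or not to be that is the question"
    tokens = text.split()

    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # two-word phrases

    print(word_counts.most_common(3))    # [('to', 2), ('be', 2), ...]
    print(bigram_counts.most_common(1))  # [(('to', 'be'), 2)]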

I first saw this in a tweet by Matthew Brook O’Donnell.

March 14, 2014

Papers: ACL 2014

Filed under: Computational Linguistics,Conferences,Linguistics — Patrick Durusau @ 7:21 pm

Papers: ACL 2014

The list of accepted papers for the Association for Computational Linguistics (ACL) has been posted for the June 22-27 conference in Baltimore, Maryland.

I am sure that out of the one hundred and forty-six (146) accepted papers you will find at least a few of interest. 😉

I first saw this in a tweet by Shane Bergsma.

March 7, 2014

Language: Vol 89, Issue 1 (March 2013)

Filed under: Linguistics — Patrick Durusau @ 3:05 pm

Language: Vol 89, Issue 1 (March 2013)

Language is a publication of the Linguistic Society of America:

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. As such, the LSA plays a critical role in supporting and disseminating linguistic scholarship, as well as facilitating the application of current research to scientific, educational, and social issues concerning language.

Language is a defining characteristic of the human species and impacts virtually all aspects of human experience. For this reason linguists seek not only to discover properties of language in general and of languages in particular but also strive to understand the interface of the phenomenon of language with culture, cognition, history, literature, and so forth.

With over 5,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied. As part of its outreach activities, the LSA attempts to provide information and educate both officials and the public about language.

You might want to note that access to all of Language is subject to a one year embargo.

Quite reasonable when compared to embargoes calculated to give those with institutional subscriptions an advantage. I guess if you can’t get published without such advantages that sounds reasonable as well.

Enjoy!

March 4, 2014

Part-of-Speech Tagging from 97% to 100%:…

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 8:21 pm

Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? by Christopher D. Manning.

Abstract:

I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
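The gap between 97.3% token accuracy and 56% sentence accuracy is roughly what you would expect if tagging errors were independent. A back-of-the-envelope check (my sketch, assuming an average of about 21 tokens per sentence):

    token_accuracy = 0.973
    tokens_per_sentence = 21  # assumed average sentence length
    sentence_accuracy = token_accuracy ** tokens_per_sentence
    print(round(sentence_accuracy, 2))  # ~0.56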

I was struck by Christopher’s observation:

The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.

which comes up again in his final sentence:

But in such cases, we must accept that we are assigning parts of speech by convention for engineering convenience rather than achieving taxonomic truth, and there are still very interesting issues for linguistics to continue to investigate, along the lines of [27].

I suppose the observation stood out for me because on what basis, other than “convenience,” would we assign properties?

When I construct a topic, I assign properties that I hope are useful to others when they view that particular topic. I don’t assign it properties unknown to me. I don’t necessarily assign it all the properties I may know for a given topic.

I may even assign it properties that I know will cause a topic to merge with other topics.

BTW, footnote [27] refers to:

Aarts, B.: Syntactic gradience: the nature of grammatical indeterminacy. Oxford University Press, Oxford (2007)

Sounds like an interesting work. I did search for “semantic indeterminacy” while at Amazon but it marked out “semantic” and returned results for indeterminacy. 😉

I first saw this in a tweet by the Stanford NLP Group.

November 19, 2013

Bridging Semantic Gaps

Filed under: Language,Lexicon,Linguistics,Sentiment Analysis — Patrick Durusau @ 4:50 pm

OK, the real title is: Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation by Zheng Lin, Songbo Tan, Yue Liu, Xueqi Cheng, Xueke Xu. (Lin Z, Tan S, Liu Y, Cheng X, Xu X (2013) Cross-Language Opinion Lexicon Extraction Using Mutual-Reinforcement Label Propagation. PLoS ONE 8(11): e79294. doi:10.1371/journal.pone.0079294)

Abstract:

There is a growing interest in automatically building opinion lexicon from sources such as product reviews. Most of these methods depend on abundant external resources such as WordNet, which limits the applicability of these methods. Unsupervised or semi-supervised learning provides an optional solution to multilingual opinion lexicon extraction. However, the datasets are imbalanced in different languages. For some languages, the high-quality corpora are scarce or hard to obtain, which limits the research progress. To solve the above problems, we explore a mutual-reinforcement label propagation framework. First, for each language, a label propagation algorithm is applied to a word relation graph, and then a bilingual dictionary is used as a bridge to transfer information between two languages. A key advantage of this model is its ability to make two languages learn from each other and boost each other. The experimental results show that the proposed approach outperforms baseline significantly.

I have always wondered when someone would notice the WordNet database is limited to the English language. 😉

The authors are seeking to develop “…a language-independent approach for resource-poor language,” saying:

Our approach differs from existing approaches in the following three points: first, it does not depend on rich external resources and it is language-independent. Second, our method is domain-specific since the polarity of an opinion word is domain-aware. We aim to extract the domain-dependent opinion lexicon (i.e. an opinion lexicon per domain) instead of a universal opinion lexicon. Third, and most importantly, our approach can mine opinion lexicon for a target language by leveraging data and knowledge available in another language…

Our approach propagates information back and forth between source language and target language, which is called mutual-reinforcement label propagation. The mutual-reinforcement label propagation model follows a two-stage framework. At the first stage, for each language, a label propagation algorithm is applied to a large word relation graph to produce a polarity estimate for any given word. This stage solves the problem of external resource dependency, and can be easily transferred to almost any language because all we need are unlabeled data and a couple of seed words. At the second stage, a bilingual dictionary is introduced as a bridge between source and target languages to start a bootstrapping process. Initially, information about the source language can be utilized to improve the polarity assignment in target language. In turn, the updated information of target language can be utilized to improve the polarity assignment in source language as well.
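To make the first stage concrete, here is a toy sketch of label propagation over a word relation graph, with seed polarities clamped (my illustration under assumed details; it omits the bilingual bridging of the second stage):

    from collections import defaultdict

    def propagate(edges, seeds, iterations=20):
        # edges: word -> {neighbor: association weight}
        # seeds: word -> +1.0 (positive) or -1.0 (negative), kept fixed
        scores = defaultdict(float, seeds)
        for _ in range(iterations):
            updated = {}
            for word, neighbors in edges.items():
                if word in seeds:
                    updated[word] = seeds[word]
                    continue
                total = sum(neighbors.values())
                updated[word] = (sum(w * scores[n] for n, w in neighbors.items()) / total
                                 if total else 0.0)
            scores = defaultdict(float, updated)
        return dict(scores)

    graph = {
        "good":     {"great": 1.0, "decent": 0.5},
        "great":    {"good": 1.0},
        "decent":   {"good": 0.5, "mediocre": 0.3},
        "mediocre": {"decent": 0.3, "bad": 0.8},
        "bad":      {"mediocre": 0.8, "awful": 1.0},
        "awful":    {"bad": 1.0},
    }
    print(propagate(graph, {"good": 1.0, "bad": -1.0}))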

Two points of particular interest:

  1. The authors focus on creating domain specific lexicons and don’t attempt to boil the ocean. Useful semantic results will arrive sooner if you avoid attempts at universal solutions.
  2. English speakers are a large market, but the target of this exercise is the #1 language of the world, Mandarin Chinese.

    Taking the numbers for English speakers at face value, approximately 0.8 billion speakers, with a world population of 7.125 billion, that leaves 6.3 billion potential customers.

You’ve heard what they say: A billion potential customers here and a billion potential customers there, pretty soon you are talking about a real market opportunity. (A play on a quote often misattributed to Sen. Everett Dirksen.)

November 18, 2013

jLemmaGen

Filed under: Lexicon,Linguistics — Patrick Durusau @ 7:17 pm

jLemmaGen by Michal Hlaváč.

From the webpage:

JLemmaGen is a Java implementation of the LemmaGen project. It’s an open source lemmatizer with 15 prebuilt European lexicons. Of course you can build your own lexicon.

The LemmaGen project aims at providing a standardized open source multilingual platform for lemmatisation.

The project contains 2 libraries:

  • lemmagen.jar – implementation of the lemmatizer and an API for building your own lemmatizers
  • lemmagen-lang.jar – prebuilt lemmatizers from the MULTEXT-East dictionaries

Whether you want to expand your market or just to avoid officious U.S. officials for the next decade or so, multilingual resources are the key to making that happen.

Enjoy!

October 26, 2013

Center for Language and Speech Processing Archives

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 4:52 pm

Center for Language and Speech Processing Archives

Archived seminars from the Center for Language and Speech Processing (CLSP) at Johns Hopkins University.

I mentioned recently that Chris Callison-Burch is digitizing these videos and posting them to Vimeo. (Say Good-Bye to iTunes: > 400 NLP Videos)

Unfortunately, Vimeo offers only primitive sorting (by upload date).

Works if you are a Kim Kardashian fan. One tweet, photo or video is as meaningful (sic) as another.

Works less well if you are looking for specific and useful content.

CLSP offers searching “by speaker, year, or keyword from title, abstract, bio.”

Enjoy!

October 20, 2013

Say Good-Bye to iTunes: > 400 NLP Videos

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 2:59 pm

Chris Callison-Burch’s Videos

Chris tweeted today that he is less than twenty-five videos away from digitizing the entire CLSP video archive.

Currently there are four hundred and twenty-five NLP videos at Chris’ Vimeo page.

Way to go Chris!

Spread the word about this remarkable resource!

October 18, 2013

Enhancing Linguistic Search with…

Filed under: Linguistics,Ngram Viewer — Patrick Durusau @ 3:27 pm

Enhancing Linguistic Search with the Google Books Ngram Viewer by Slav Petrov and Dipanjan Das.

From the post:


With our interns Jason Mann, Lu Yang, and David Zhang, we’ve added three new features. The first is wildcards: by putting an asterisk as a placeholder in your query, you can retrieve the ten most popular replacements. For instance, what noun most often follows “Queen” in English fiction? The answer is “Elizabeth”.

Another feature we’ve added is the ability to search for inflections: different grammatical forms of the same word. (Inflections of the verb “eat” include “ate”, “eating”, “eats”, and “eaten”.) Here, we can see that the phrase “changing roles” has recently surged in popularity in English fiction, besting “change roles”, which earlier dethroned “changed roles”.

Finally, we’ve implemented the most common feature request from our users: the ability to search for multiple capitalization styles simultaneously. Until now, searching for common capitalizations of “Mother Earth” required using a plus sign to combine ngrams (e.g., “Mother Earth + mother Earth + mother earth”), but now the case-insensitive checkbox makes it easier.

The ngram data sets are available for download.

As of the date of this post, the data sets go up to 5-grams in multiple languages.
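If you download them, tracking one ngram over time is straightforward. A sketch, assuming a tab-separated layout of ngram, year, match count, and volume count per line (check the notes that accompany the files you actually download):

    import csv, gzip
    from collections import defaultdict

    def yearly_counts(path, target):
        # Sum match counts per year for one ngram in a downloaded file.
        counts = defaultdict(int)
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if row and row[0] == target:
                    counts[int(row[1])] += int(row[2])
        return dict(sorted(counts.items()))

    # usage (hypothetical file name):
    # print(yearly_counts("eng-all-2gram.gz", "Mother Earth"))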

Be mindful of semantic drift, the changing of the meaning of words over centuries or decades, and even across social and economic strata and work domains at the same time.

October 16, 2013

The LaTeX for Linguists Home Page

Filed under: Linguistics,TeX/LaTeX — Patrick Durusau @ 6:28 pm

The LaTeX for Linguists Home Page

From the webpage:

These pages provide information on how to use LaTeX for writing Linguistics papers (articles, books, etc.). In particular, they provide instructions and advice on creating the things Linguists standardly need, like trees, numbered examples, and so on, as well as advice on some things that most people need (like bibliographies), but with an eye to standard Linguistic practice.
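For a flavor of what that looks like, here is a small fragment using the gb4e and qtree packages, two common choices for numbered examples and trees (my sketch; see the pages above for fuller guidance):

    \documentclass{article}
    \usepackage{gb4e}   % numbered linguistic examples
    \usepackage{qtree}  % simple bracket-notation trees
    \begin{document}

    \begin{exe}
      \ex\label{ex:colorless} Colorless green ideas sleep furiously.
      \ex\begin{xlist}
           \ex This sentence is fine.
           \ex[*] {Sentence this fine is.}
         \end{xlist}
    \end{exe}

    As (\ref{ex:colorless}) shows \ldots

    \Tree [.S [.NP Colorless green ideas ] [.VP [.V sleep ] [.AdvP furiously ] ] ]

    \end{document}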

Topic maps being a methodology to reconcile divergent uses of language, tools for the study of language seem like a close fit.

October 9, 2013

Logical and Computational Structures for Linguistic Modeling

Filed under: Language,Linguistics,Logic,Modeling — Patrick Durusau @ 7:25 pm

Logical and Computational Structures for Linguistic Modeling

From the webpage:

Computational linguistics employs mathematical models to represent morphological, syntactic, and semantic structures in natural languages. The course introduces several such models while insisting on their underlying logical structure and algorithmics. Quite often these models will be related to mathematical objects studied in other MPRI courses, for which this course provides an original set of applications and problems.

The course is not a substitute for a full cursus in computational linguistics; it rather aims at providing students with a rigorous formal background in the spirit of MPRI. Most of the emphasis is put on the symbolic treatment of words, sentences, and discourse. Several fields within computational linguistics are not covered, prominently speech processing and pragmatics. Machine learning techniques are only very sparsely treated; for instance we focus on the mathematical objects obtained through statistical and corpus-based methods (i.e. weighted automata and grammars) and the associated algorithms, rather than on automated learning techniques (which is the subject of course 1.30).

Abundant supplemental materials, slides, notes, further references.

In particular you may like Notes on Computational Aspects of Syntax by Sylvain Schmitz, that cover the first part of Logical and Computational Structures for Linguistic Modeling.

As with any model, there are trade-offs and assumptions built into nearly every choice.

Knowing where to look for those trade-offs and assumptions will give you a response to: “Well, but the model shows that….”

August 7, 2013

EACL 2014 – Gothenburg, Sweden – Call for Papers

Filed under: Computational Linguistics,Conferences,Linguistics — Patrick Durusau @ 6:02 pm

EACL 2014 – 26-30 April, Gothenburg, Sweden

IMPORTANT DATES

Long papers:

  • Long paper submissions due: 18 October 2013
  • Long paper reviews due: 19 November 2013
  • Long paper author responses due: 29 November 2013
  • Long paper notification to authors: 20 December 2013
  • Long paper camera-ready due: 14 February 2014

Short papers:

  • Short paper submissions due: 6 January 2014
  • Short paper reviews due: 3 February 2014
  • Short paper notification to authors: 24 February 2014
  • Short paper camera-ready due: 3 March 2014

EACL conference: 26–30 April 2014

From the call:

The 14th Conference of the European Chapter of the Association for Computational Linguistics invites the submission of long and short papers on substantial, original, and unpublished research in all aspects of automated natural language processing, including but not limited to the following areas:

  • computational and cognitive models of language acquisition and language processing
  • information retrieval and question answering
  • generation and summarization
  • language resources and evaluation
  • machine learning methods and algorithms for natural language processing
  • machine translation and multilingual systems
  • phonetics, phonology, morphology, word segmentation, tagging, and chunking
  • pragmatics, discourse, and dialogue
  • semantics, textual entailment
  • social media, sentiment analysis and opinion mining
  • spoken language processing and language modeling
  • syntax, parsing, grammar formalisms, and grammar induction
  • text mining, information extraction, and natural language processing applications

Papers accepted to TACL by 30 November 2013 will also be eligible for presentation at EACL 2014; please see the TACL website (http://www.transacl.org) for details.

It’s not too early to begin making plans for next Spring!

July 23, 2013

FreeLing 3.0 – Demo

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 2:29 pm

FreeLing 3.0 – Demo

I have posted about FreeLing before but this web-based demo merits separate mention.

If you are not familiar with natural language processing (NLP), visit the FreeLing 3.0 demo and type in some sample text.

Not suitable for making judgements on proposed NLP solutions but it will give you a rough idea of what is or is not possible.

July 22, 2013

PPDB: The Paraphrase Database

Filed under: Computational Linguistics,Linguistics,Natural Language Processing — Patrick Durusau @ 2:46 pm

PPDB: The Paraphrase Database by Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch.

Abstract:

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our release includes pruning tools that allow users to determine their own precision/recall tradeoff.

A resource that should improve your subject identification!

PPDB data sets range from 424 MB (6.8M rules) to 5.7 GB (86.4M rules). Download the PPDB data sets.
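A quick way to get a feel for the data is to stream one of the lexical files and pull out paraphrase pairs. A sketch assuming a ' ||| '-delimited line layout (category, phrase, paraphrase, features, alignment) — confirm the field order against the release documentation:

    def read_ppdb(path):
        # Yield (phrase, paraphrase, features) tuples from a PPDB file.
        # Assumed layout per line: CATEGORY ||| phrase ||| paraphrase ||| features ||| alignment
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = [part.strip() for part in line.split("|||")]
                if len(fields) < 4:
                    continue
                category, phrase, paraphrase, feature_str = fields[:4]
                features = dict(kv.split("=", 1) for kv in feature_str.split() if "=" in kv)
                yield phrase, paraphrase, features

    # usage (hypothetical file name):
    # for phrase, paraphrase, feats in read_ppdb("ppdb-1.0-s-lexical"):
    #     print(phrase, "<->", paraphrase)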

NAACL 2013 – Videos!

NAACL 2013

Videos of the presentations at the 2013 Conference of the North American Chapter of the Association for Computational Linguistics.

Along with the papers, you should not lack for something to do over the summer!

June 14, 2013

NAACL ATL 2013

2013 Conference of the North American Chapter of the Association for Computational Linguistics

The NAACL conference wraps up tomorrow in Atlanta but in case you are running low on summer reading materials:

Proceedings for the 2013 NAACL and *SEM conferences. Not quite 180MB but close.

Scanning the accepted papers will give you an inkling of what awaits.

Enjoy!

June 4, 2013

Speech Recognition vs. Language Processing [noise-burst classification]

Filed under: Linguistics,Natural Language Processing — Patrick Durusau @ 12:06 pm

Speech Recognition vs. Language Processing by Geoffrey Pullum.

From the post:

I have stressed that we are still waiting for natural language processing (NLP). One thing that might lead you to believe otherwise is that some companies run systems that enable you to hold a conversation with a machine. But that doesn’t involve NLP, i.e. syntactic and semantic analysis of sentences. It involves automatic speech recognition (ASR), which is very different.

ASR systems deal with words and phrases rather as the song “Rawhide” recommends for cattle: “Don’t try to understand ’em; just rope and throw and brand ’em.”

Labeling noise bursts is the goal, not linguistically based understanding.

(…)

Prompting a bank customer with “Do you want to pay a bill or transfer funds between accounts?” considerably improves the chances of getting something with either “pay a bill” or “transfer funds” in it; and they sound very different.

In the latter case, no use is made by the system of the verb + object structure of the two phrases. Only the fact that the customer appears to have uttered one of them rather than the other is significant. What’s relevant about pay is not that it means “pay” but that it doesn’t sound like tran-. As I said, this isn’t about language processing; it’s about noise-burst classification.
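A caricature of what Pullum is describing: pick whichever canned option the recognized string most resembles, with no syntax or semantics involved (my toy example, not how any particular bank system works):

    from difflib import SequenceMatcher

    OPTIONS = ["pay a bill", "transfer funds between accounts"]

    def classify_utterance(asr_output, options=OPTIONS):
        # No verb + object structure, no meaning -- just surface similarity
        # between the recognized string and each canned prompt option.
        return max(options,
                   key=lambda o: SequenceMatcher(None, asr_output.lower(), o).ratio())

    print(classify_utterance("uh I want to pay a bill please"))     # 'pay a bill'
    print(classify_utterance("transfer some funds to my savings"))  # 'transfer funds between accounts'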

I can see why the NLP engineers dislike Pullum so intensely.

Characterizing “speech recognition” as “noise-burst classification,” while entirely accurate, is also offensive.

😉

“Speech recognition” fools a layperson into thinking NLP is more sophisticated than it is in fact.

The question for NLP engineers is: Why the pretense at sophistication?

May 16, 2013

Linguists Circle the Wagons, or Disagreement != Danger

Filed under: Artificial Intelligence,Linguistics,Natural Language Processing — Patrick Durusau @ 2:47 pm

Pullum’s NLP Lament: More Sleight of Hand Than Fact by Christopher Phipps.

From the post:

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) have not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term “natural language processing” and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like “computer systems that mimic human language processing.” This is fundamentally unfair.

In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.

I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

If there is unfairness in this discussion, it is the insistence by Christopher Phipps (and others) that Pullum has invented “…a straw-man NLP world where NLP techniques are only admirable if they mimic human processing.”

On the contrary, it was 1949 when Warren Weaver first proposed computers as the solution to world-wide translation problems. Weaver’s was not the only optimistic projection of language processing by computers. Those have continued up to and including the Semantic Web.

Yes, NLP practitioners such as Christopher Phipps use NLP in a more precise sense than Pullum. And NLP as defined by Phipps has too many achievements to easily list.

Neither one of those statements takes anything away from Pullum’s point that Google found a “sweet spot” between machine processing and human intelligence for search purposes.

What other insights Pullum has to offer may be obscured by the “…circle the wagons…” attitude from linguists.

Disagreement != Danger.

April 29, 2013

scalingpipe – …

Filed under: LingPipe,Linguistics,Scala — Patrick Durusau @ 2:07 pm

scalingpipe – porting LingPipe tutorial examples to Scala by Sujit Pal.

From the post:

Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub where Alexy Khrabov had started with porting the Interesting Phrases tutorial example.

Now there’s a clever idea!

It yields a deeper understanding of the LingPipe API along with Scala experience.

Not to mention having useful results for other users.

April 11, 2013

Groningen Meaning Bank (GMB)

Filed under: Corpora,Corpus Linguistics,Linguistics,Semantics — Patrick Durusau @ 2:19 pm

Groningen Meaning Bank (GMB)

From the “about” page:

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB, texts are annotated rather than isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.

Method

The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE’s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.
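To give a flavor of the DRT backbone, here is a toy discourse representation structure in plain Python (my illustration of the idea only; the GMB's actual representations come from Boxer and are far richer):

    from dataclasses import dataclass, field

    @dataclass
    class DRS:
        # A toy DRS: discourse referents plus conditions over them.
        referents: list = field(default_factory=list)
        conditions: list = field(default_factory=list)

        def merge(self, other):
            # DRS merge: union the referents and the conditions.
            return DRS(self.referents + other.referents,
                       self.conditions + other.conditions)

    # "A farmer owns a donkey."
    s1 = DRS(["x", "y"], ["farmer(x)", "donkey(y)", "owns(x,y)"])
    # "He beats it."  (pronouns resolved to the existing referents)
    s2 = DRS([], ["beats(x,y)"])
    print(s1.merge(s2))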

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise?

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises?

March 29, 2013

Learning Grounded Models of Meaning

Filed under: Linguistics,Meaning,Modeling,Semantics — Patrick Durusau @ 2:16 pm

Learning Grounded Models of Meaning

Schedule and readings for seminar by Katrin Erk and Jason Baldridge:

Natural language processing applications typically need large amounts of information at the lexical level: words that are similar in meaning, idioms and collocations, typical relations between entities, lexical patterns that can be used to draw inferences, and so on. Today such information is mostly collected automatically from large amounts of data, making use of regularities in the co-occurrence of words. But documents often contain more than just co-occurring words, for example illustrations, geographic tags, or a link to a date. Just like co-occurrences between words, these co-occurrences of words and extra-linguistic data can be used to automatically collect information about meaning. The resulting grounded models of meaning link words to visual, geographic, or temporal information. Such models can be used in many ways: to associate documents with geographic locations or points in time, or to automatically find an appropriate image for a given document, or to generate text to accompany a given image.

In this seminar, we discuss different types of extra-linguistic data, and their use for the induction of grounded models of meaning.

Very interesting reading that should keep you busy for a while! 😉

February 8, 2013

PyPLN: a Distributed Platform for Natural Language Processing

Filed under: Linguistics,Natural Language Processing,Python — Patrick Durusau @ 5:16 pm

PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

Abstract:

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of Linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other software for specific tasks as long as a Linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support for other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

Demo: http://demo.pypln.org

Source code: http://pypln.org.

Have you noticed that tools for analysis are getting easier, not harder to use?

Is there a lesson there for tools to create topic map content?

January 7, 2013

English Letter Frequency Counts: Mayzner Revisited…

Filed under: Language,Linguistics — Patrick Durusau @ 6:27 am

English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU by Peter Norvig.

From the post:

On December 17th 2012, I got a nice letter from Mark Mayzner, a retired 85-year-old researcher who studied the frequency of letter combinations in English words in the early 1960s. His 1965 publication has been cited in hundreds of articles. Mayzner describes his work:

I culled a corpus of 20,000 words from a variety of sources, e.g., newspapers, magazines, books, etc. For each source selected, a starting place was chosen at random. In proceeding forward from this point, all three, four, five, six, and seven-letter words were recorded until a total of 200 words had been selected. This procedure was duplicated 100 times, each time with a different source, thus yielding a grand total of 20,000 words. This sample broke down as follows: three-letter words, 6,807 tokens, 187 types; four-letter words, 5,456 tokens, 641 types; five-letter words, 3,422 tokens, 856 types; six-letter words, 2,264 tokens, 868 types; seven-letter words, 2,051 tokens, 924 types. I then proceeded to construct tables that showed the frequency counts for three, four, five, six, and seven-letter words, but most importantly, broken down by word length and letter position, which had never been done before to my knowledge.

and he wonders if:

perhaps your group at Google might be interested in using the computing power that is now available to significantly expand and produce such tables as I constructed some 50 years ago, but now using the Google Corpus Data, not the tiny 20,000 word sample that I used.

The answer is: yes indeed, I am interested! And it will be a lot easier for me than it was for Mayzner. Working 60s-style, Mayzner had to gather his collection of text sources, then go through them and select individual words, punch them on Hollerith cards, and use a card-sorting machine.

Peter rises to the occasion, using thirty-seven (37) times as much data as Mayzner. Not to mention detailing his analysis and posting the resulting data sets for more analysis.
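The core bookkeeping is simple enough to sketch: letter counts broken down by word length and position, in the spirit of Mayzner's tables (my sketch, assuming a plain lowercase word list rather than the Google corpus):

    from collections import Counter, defaultdict

    def position_counts(words):
        # (word length, letter position) -> Counter of letters at that slot
        tables = defaultdict(Counter)
        for word in words:
            for i, letter in enumerate(word, start=1):
                tables[(len(word), i)][letter] += 1
        return tables

    sample = ["the", "and", "for", "that", "with", "have", "this", "from"]
    tables = position_counts(sample)
    print(tables[(4, 1)].most_common(3))  # most common first letters of 4-letter words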

November 16, 2012

Phrase Detectives

Filed under: Annotation,Games,Interface Research/Design,Linguistics — Patrick Durusau @ 5:21 am

Phrase Detectives

This annotation game was also mentioned in Bob Carpenter’s Another Linguistic Corpus Collection Game, but it merits separate mention.

From the description:

Welcome to Phrase Detectives

Lovers of literature, grammar and language, this is the place where you can work together to improve future generations of technology. By indicating relationships between words and phrases you will help to create a resource that is rich in linguistic information.

It is easy to see how this could be adapted to identification of subjects, roles and associations in texts.

And in a particular context, the interest would be in capturing usage in that context, not the wider world.

Definitely has potential as a topic map authoring interface.
