### Linguists Circle the Wagons, or Disagreement != Danger

Thursday, May 16th, 2013

Pullum’s NLP Lament: More Sleight of Hand Than Fact by Christopher Phipps.

My first reading of both of Pullum’s recent NLP posts (one and two) interpreted them to be hostile, an attack on a whole field (see my first response here). Upon closer reading, I see Pullum chooses his words carefully and it is less of an attack and more of a lament. He laments that the high-minded goals of early NLP (to create machines that process language like humans do) have not been reached, and more to the point, that commercial pressures have distracted the field from pursuing those original goals, hence they are now neglected. And he’s right about this to some extent.

But, he’s also taking the commonly used term “natural language processing” and insisting that it NOT refer to what 99% of people who use the term use it for, but rather only a very narrow interpretation consisting of something like “computer systems that mimic human language processing.” This is fundamentally unfair.

In the 1980s I was convinced that computers would soon be able to simulate the basics of what (I hope) you are doing right now: processing sentences and determining their meanings.

I feel Pullum is moving the goal posts on us when he says “there is, to my knowledge, no available system for unaided machine answering of free-form questions via general syntactic and semantic analysis” [my emphasis]. Pullum’s agenda appears to be to create a straw-man NLP world where NLP techniques are only admirable if they mimic human processing. And this is unfair for two reasons.

If there is unfairness in this discussion, it is the insistence by Christopher Phipps (and others) that Pullum has invented “…a straw-man NLP world where NLP techniques are only admirable if they mimic human processing.”

On the contrary, it was 1949 when Warren Weaver first proposed computers as the solution to world-wide translation problems. Weaver’s was not the only optimistic projection of language processing by computers. Those have continued up to and including the Semantic Web.

Yes, NLP practitioners such as Christopher Phipps use NLP in a more precise sense than Pullum. And NLP as defined by Phipps has too many achievements to easily list.

Neither one of those statements takes anything away from Pullum’s point that Google found a “sweet spot” between machine processing and human intelligence for search purposes.

What other insights Pullum has to offer may be obscured by the “…circle the wagons…” attitude from linguists.

Disagreement != Danger.

### scalingpipe – …

Monday, April 29th, 2013

Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub where Alexy Khrabov had started with porting the Interesting Phrases tutorial example.

Now there’s a clever idea!

Achieves a deep understanding of the LingPipe API and Scala experience.

Not to mention having useful results for other users.

### GroningenMeaningBank (GMB)

Thursday, April 11th, 2013

GroningenMeaningBank (GMB)

The Groningen Meaning Bank consists of public domain English texts with corresponding syntactic and semantic representations.

Key features

The GMB supports deep semantics, opening the way to theoretically grounded, data-driven approaches to computational semantics. It integrates phenomena instead of covering single phenomena in isolation. This provides a better handle on explaining dependencies between various ambiguous linguistic phenomena, including word senses, thematic roles, quantifier scope, tense and aspect, anaphora, presupposition, and rhetorical relations. In the GMB, texts are annotated rather than isolated sentences, which provides a means to deal with ambiguities on the sentence level that require discourse context for resolving them.

Method

The GMB is being built using a bootstrapping approach. We employ state-of-the-art NLP tools (notably the C&C tools and Boxer) to produce a reasonable approximation to gold-standard annotations. From release to release, the annotations are corrected and refined using human annotations coming from two main sources: experts who directly edit the annotations in the GMB via the Explorer, and non-experts who play a game with a purpose called Wordrobe.

Theoretical background

The theoretical backbone for the semantic annotations in the GMB is established by Discourse Representation Theory (DRT), a formal theory of meaning developed by the philosopher of language Hans Kamp (Kamp, 1981; Kamp and Reyle, 1993). Extensions of the theory bridge the gap between theory and practice. In particular, we use VerbNet for thematic roles, a variation on ACE‘s named entity classification, WordNet for word senses and Segmented DRT for rhetorical relations (Asher and Lascarides, 2003). Thanks to the DRT backbone, all these linguistic phenomena can be expressed in a first-order language, enabling the practical use of first-order theorem provers and model builders.

Step back towards the source of semantics (that would be us).

One practical question is how to capture semantics for a particular domain or enterprise?

Another is what to capture to enable the mapping of those semantics to those of other domains or enterprises?

### Learning Grounded Models of Meaning

Friday, March 29th, 2013

Learning Grounded Models of Meaning

Schedule and readings for seminar by Katrin Erk and Jason Baldridge:

Natural language processing applications typically need large amounts of information at the lexical level: words that are similar in meaning, idioms and collocations, typical relations between entities, lexical patterns that can be used to draw inferences, and so on. Today such information is mostly collected automatically from large amounts of data, making use of regularities in the co-occurrence of words. But documents often contain more than just co-occurring words, for example illustrations, geographic tags, or a link to a date. Just like co-occurrences between words, these co-occurrences of words and extra-linguistic data can be used to automatically collect information about meaning. The resulting grounded models of meaning link words to visual, geographic, or temporal information. Such models can be used in many ways: to associate documents with geographic locations or points in time, or to automatically find an appropriate image for a given document, or to generate text to accompany a given image.

In this seminar, we discuss different types of extra-linguistic data, and their use for the induction of grounded models of meaning.

Very interesting reading that should keep you busy for a while!

### PyPLN: a Distributed Platform for Natural Language Processing

Friday, February 8th, 2013

PyPLN: a Distributed Platform for Natural Language Processing by Flávio Codeço Coelho, Renato Rocha Souza, Álvaro Justen, Flávio Amieiro, Heliana Mello.

This paper presents a distributed platform for Natural Language Processing called PyPLN. PyPLN leverages a vast array of NLP and text processing open source tools, managing the distribution of the workload on a variety of configurations: from a single server to a cluster of linux servers. PyPLN is developed using Python 2.7.3 but makes it very easy to incorporate other softwares for specific tasks as long as a linux version is available. PyPLN facilitates analyses both at document and corpus level, simplifying management and publication of corpora and analytical results through an easy to use web interface. In the current (beta) release, it supports English and Portuguese languages with support to other languages planned for future releases. To support the Portuguese language PyPLN uses the PALAVRAS parser (Bick, 2000). Currently PyPLN offers the following features: Text extraction with encoding normalization (to UTF-8), part-of-speech tagging, token frequency, semantic annotation, n-gram extraction, word and sentence repertoire, and full-text search across corpora. The platform is licensed as GPL-v3.

Source code: http://pypln.org.

Have you noticed that tools for analysis are getting easier, not harder to use?

Is there a lesson there for tools to create topic map content?

### English Letter Frequency Counts: Mayzner Revisited…

Monday, January 7th, 2013

On December 17th 2012, I got a nice letter from Mark Mayzner, a retired 85-year-old researcher who studied the frequency of letter combinations in English words in the early 1960s. His 1965 publication has been cited in hundreds of articles. Mayzner describes his work:

I culled a corpus of 20,000 words from a variety of sources, e.g., newspapers, magazines, books, etc. For each source selected, a starting place was chosen at random. In proceeding forward from this point, all three, four, five, six, and seven-letter words were recorded until a total of 200 words had been selected. This procedure was duplicated 100 times, each time with a different source, thus yielding a grand total of 20,000 words. This sample broke down as follows: three-letter words, 6,807 tokens, 187 types; four-letter words, 5,456 tokens, 641 types; five-letter words, 3,422 tokens, 856 types; six-letter words, 2,264 tokens, 868 types; seven-letter words, 2,051 tokens, 924 types. I then proceeded to construct tables that showed the frequency counts for three, four, five, six, and seven-letter words, but most importantly, broken down by word length and letter position, which had never been done before to my knowledge.

and he wonders if:

perhaps your group at Google might be interested in using the computing power that is now available to significantly expand and produce such tables as I constructed some 50 years ago, but now using the Google Corpus Data, not the tiny 20,000 word sample that I used.

The answer is: yes indeed, I am interested! And it will be a lot easier for me than it was for Mayzner. Working 60s-style, Mayzner had to gather his collection of text sources, then go through them and select individual words, punch them on Hollerith cards, and use a card-sorting machine.

Peter Norvig rises to the occasion, using roughly thirty-seven million times as much data as Mayzner. Not to mention detailing his analysis and posting the resulting data sets for more analysis.
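To make Mayzner's tables concrete, here is a minimal Python sketch of counting letters by word length and letter position. It is my own illustration, not the code from the post: the function name `positional_letter_counts` and the sample sentence are invented, and a real run would read from a large corpus rather than one line of text.

```python
from collections import Counter


def positional_letter_counts(words, min_len=3, max_len=7):
    """Count letters by (word length, letter position).

    Returns a dict mapping each word length to a list of Counters,
    one Counter per letter position in words of that length.
    """
    tables = {n: [Counter() for _ in range(n)]
              for n in range(min_len, max_len + 1)}
    for word in words:
        word = word.lower()
        # Skip tokens with non-letters and words outside the length range.
        if not word.isalpha() or len(word) not in tables:
            continue
        for pos, letter in enumerate(word):
            tables[len(word)][pos][letter] += 1
    return tables


# Invented sample text, purely for illustration.
sample = "the cat sat on the mat while three other cats watched".split()
tables = positional_letter_counts(sample)

# Most common first letters of three-letter words in the sample.
print(tables[3][0].most_common(3))
```

The 3 to 7 letter range mirrors the word lengths Mayzner recorded; widening it is a matter of changing `min_len` and `max_len`.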

### Phrase Detectives

Friday, November 16th, 2012

Phrase Detectives

This annotation game was also mentioned in Bob Carpenter’s Another Linguistic Corpus Collection Game, but it merits separate mention.

Welcome to Phrase Detectives

Lovers of literature, grammar and language, this is the place where you can work together to improve future generations of technology. By indicating relationships between words and phrases you will help to create a resource that is rich in linguistic information.

It is easy to see how this could be adapted to identification of subjects, roles and associations in texts.

And in a particular context, the interest would be in capturing usage in that context, not the wider world.

Definitely has potential as a topic map authoring interface.

### Linguistic Society of America (LSA)

Tuesday, October 23rd, 2012

Linguistic Society of America (LSA)

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. With nearly 4,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied.

Language and linguistics are important in the description of numeric data but even more so for non-numeric data.

Another avenue to sharpen your skills at both.

PS: I welcome your suggestions of other language/linguistic institutions and organizations. Even if our machines don’t understand natural language, our users do.

### Natural Language Processing | Hub

Saturday, July 7th, 2012

Natural Language Processing | Hub

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

Definitely going on my short list of sites to check!

### GATE Teamware: Collaborative Annotation Factories (HOT!)

Wednesday, May 9th, 2012

GATE Teamware: Collaborative Annotation Factories

Teamware is a web-based management platform for collaborative annotation & curation. It is a cost-effective environment for annotation and curation projects, enabling you to harness a broadly distributed workforce and monitor progress & results remotely in real time.

It’s also very easy to use. A new project can be up and running in less than five minutes. (As far as we know, there is nothing else like it in this field.)

GATE Teamware delivers a multi-function user interface over the Internet for viewing, adding and editing text annotations. The web-based management interface allows for project set-up, tracking, and management:

• Creating re-usable project templates
• Initiating projects based on templates
• Assigning project roles to specific users
• Monitoring progress and various project statistics in real time
• Reporting of project status, annotator activity and statistics
• Applying GATE-based processing routines (automatic annotations or post-annotation processing)

I have known about the GATE project in general for years and came to this site after reading: Crowdsourced Legal Case Annotation.

Could be the basis for annotations that are converted into a topic map, but…, I have been a sysadmin before. Maintaining servers, websites, software, etc. Great work, interesting work, but not what I want to be doing now.

Where to get it? The easiest way to get started is to buy a ready-to-run Teamware virtual server from GATECloud.net.

Not saying it will or won’t meet your particular needs, but, certainly is worth a “look see.”

Let me know if you take the plunge!

### Parallel Language Corpus Hunting?

Friday, April 27th, 2012

Parallel language corpus hunters, particularly in legal informatics can rejoice!

[A] parallel corpus of all European Union legislation, called the Acquis Communautaire, translated into all 22 languages of the EU nations — has been expanded to include EU legislation from 2004-2010…

If you think semantic impedance in one language is tough, step up and try that across twenty-two (22) languages.

Of course, these countries share something of a common historical context. Imagine the gulf when you move up to languages from other historical contexts.

See: DGT-TM-2011, Parallel Corpus of All EU Legislation in Translation, Expanded to Include Data from 2004-2010 for links and other details.

### The Anachronism Machine: The Language of Downton Abbey

Saturday, March 17th, 2012

The Anachronism Machine: The Language of Downton Abbey

David Smith writes:

I’ve recently become hooked on the TV series Downton Abbey. I’m not usually one for costume dramas, but the mix of fine acting, the intriguing social relationships, and the larger WW1-era story make for compelling viewing. (Also: Maggie Smith is a treasure.)

Despite the widespread critical acclaim, Downton has met with criticism for some period-inappropriate uses of language. For example, at one point Lady Mary laments “losing the high ground”, a phrase that didn’t come into use until the 1960s. But is this just a random slip, or are such anachronistic phrases par for the course on Downton? And how does it compare to other period productions in its use of language?

To answer these questions, Ben Schmidt (a graduate student in history at Princeton University and Visiting Graduate Fellow at the Cultural Observatory at Harvard) created an “Anachronism Machine“. Using the R statistical programming language and Google n-grams, it analyzes all of the two-word phrases in a Downton Abbey script, and compares their frequency of use with that in books written around the WW1 era (when Downton is set). For example, Schmidt finds that Downton characters, if they were to follow societal norms of the 1910′s (as reflected in books from that period), would rarely use the phrase “guest bedroom”, but in fact it’s regularly uttered during the series. Schmidt charts the frequency these phrases appear in the show versus the frequency they appear in contemporaneous books below:

Good post on the use of R for linguistic analysis!

As a topic map person, I am more curious what should be substituted for “guest bedroom” in a 1910′s series? Thinking it would be interesting to have a mapping between the “normal” speech patterns for various time periods.
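The comparison described in the quoted post can be sketched in a few lines. Schmidt worked in R against Google n-grams; what follows is only a rough Python illustration of the idea, with a tiny invented table (`period_rate`) standing in for period book frequencies and a hypothetical helper, `overuse_ratios`. None of the numbers are real.

```python
from collections import Counter
import re


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))


def overuse_ratios(script_text, period_rate):
    """Ratio of each bigram's rate in the script to its rate in period books.

    Bigrams missing from the reference table (or with a zero rate) are
    skipped, since this sketch has no basis for judging them.
    """
    counts = Counter(bigrams(script_text))
    total = sum(counts.values())
    ratios = {}
    for gram, n in counts.items():
        reference = period_rate.get(gram)
        if not reference:
            continue
        ratios[gram] = (n / total) / reference
    return ratios


# Invented reference rates, for illustration only.
period_rate = {
    ("in", "the"): 5e-3,
    ("the", "guest"): 1e-4,
    ("guest", "bedroom"): 1e-8,
}
script = "Put her in the guest bedroom"
ratios = overuse_ratios(script, period_rate)

# Bigrams ordered from most to least over-represented in the script.
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)
```

A high ratio only flags a candidate anachronism. A person still has to decide whether the phrase is out of period or merely rare in books.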

### Segmenting Words and Sentences

Wednesday, March 14th, 2012

Segmenting Words and Sentences by Richard Marsden.

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. It probably is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words. Punctuation (full stop, question mark, exclamation mark) could be used to divide sentences. This quickly runs into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?

A good introduction to segmentation but I would test the segmentation with a sample text before trusting it too far. Writing habits vary even within languages.
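The abbreviation problem is easy to reproduce. The sketch below is not the classification approach from the post; it is a small rule-based illustration in Python, with an assumed (and far too short) abbreviation list and invented function names, showing how a punctuation-only splitter goes wrong and how even a crude check changes the result.

```python
import re

# Assumed, deliberately tiny abbreviation list for illustration.
ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "dr.", "mr.", "mrs."}


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


def is_abbreviation(token):
    """True for listed abbreviations and dotted initialisms like U.N.E.S.C.O."""
    lower = token.lower()
    if lower in ABBREVIATIONS:
        return True
    return re.fullmatch(r"(?:[a-z]\.){2,}", lower) is not None


def careful_split(text):
    """End a sentence only at terminal punctuation that is not an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and not is_abbreviation(token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("U.N.E.S.C.O. funds schools, libraries, etc. in many countries. "
        "Does it help?")
print(naive_split(text))
print(careful_split(text))
```

The rule-based version has its own blind spot: an abbreviation that really does end a sentence is never treated as a boundary. That is one reason to test against sample text from your own sources.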

### Cross Validation vs. Inter-Annotator Agreement

Monday, March 12th, 2012

Cross Validation vs. Inter-Annotator Agreement by Bob Carpenter.

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Lessons in the use of LingPipe tools!

If you are annotating texts or anticipate annotating texts, read this post.

### Corpus-Wide Association Studies

Sunday, March 11th, 2012

Corpus-Wide Association Studies by Mark Liberman.

I’ve spent the past couple of days at GURT 2012, and one of the interesting talks that I’ve heard was Julian Brooke and Sali Tagliamonte, “Hunting the linguistic variable: using computational techniques for data exploration and analysis”. Their abstract (all that’s available of the work so far) explains that:

The selection of an appropriate linguistic variable is typically the first step of a variationist analysis whose ultimate goal is to identify and explain social patterns. In this work, we invert the usual approach, starting with the sociolinguistic metadata associated with a large scale socially stratified corpus, and then testing the utility of computational tools for finding good variables to study. In particular, we use the ‘information gain’ metric included in data mining software to automatically filter a huge set of potential variables, and then apply our own corpus reader software to facilitate further human inspection. Finally, we subject a small set of particularly interesting features to a more traditional variationist analysis.

This type of data-mining for interesting patterns is likely to become a trend in sociolinguistics, as it is in other areas of the social and behavioral sciences, and so it’s worth giving some thought to potential problems as well as opportunities.

If you think about it, the social/behavioral sciences are being applied to the results of data mining of user behavior now. Perhaps you can “catch the wave” early on this cycle of research.
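For readers who have not met the "information gain" metric mentioned in the abstract, here is a minimal Python sketch of the standard entropy-based definition. The data is invented (two made-up binary features against a made-up age label) and has nothing to do with the authors' corpus or software.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy from knowing the feature's value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Invented toy data: one row per speaker.
age_group = ["young", "young", "young", "old", "old", "old"]
uses_like = [True, True, True, False, False, False]
says_um = [True, False, True, False, True, False]

print(information_gain(uses_like, age_group))
print(information_gain(says_um, age_group))
```

In the toy data the first feature separates the two age groups perfectly and the second barely at all, so ranking candidate variables by this score would put the first at the top for human inspection.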

### 50 years of linguistics at MIT

Monday, January 23rd, 2012

50 years of linguistics at MIT by Arnold Zwicky.

The videos are now out — from the 50th-anniversary celebrations (“a scientific reunion”) of the linguistics program at MIT, December 9-11, 2011. The schedule of the talks (with links to slides for them) is available here, with links to other material: a list of attendees, a list of the many poster presentations, videos of the main presentations, personal essays by MIT alumni, photographs from the event, a list of MIT dissertations from 1965 to the present, and a 1974 history of linguistics at MIT (particularly interesting for the years before the first officially registered graduate students entered the program, in 1961).

The eleven YouTube videos (of the introduction and the main presentations) can be accessed directly here.

See Arnold’s post for the links.

### The communicative function of ambiguity in language

Friday, January 20th, 2012

The communicative function of ambiguity in language by Steven T. Piantadosi, Harry Tily and Edward Gibson. (Cognition, 2011) (PDF file)

We present a general information-theoretic argument that all efficient communication systems will be ambiguous, assuming that context is informative about meaning. We also argue that ambiguity allows for greater ease of processing by permitting efficient linguistic units to be re-used. We test predictions of this theory in English, German, and Dutch. Our results and theoretical analysis suggest that ambiguity is a functional property of language that allows for greater communicative efficiency. This provides theoretical and empirical arguments against recent suggestions that core features of linguistic systems are not designed for communication.

This is a must read paper if you are interested in ambiguity and similar issues.

At page 289, the authors report:

These findings suggest that ambiguity is not enough of a problem to real-world communication that speakers would make much effort to avoid it. This may well be because actual language in context provides other information that resolves the ambiguities most of the time.

I don’t know if our communication systems are efficient or not but I think the phrase “in context” is covering up a very important point.

Our communication systems came about in very high-bandwidth circumstances. We were in the immediate presence of a person speaking. With all the context that provides.

Even if we accept an origin of language of say 200,000 years ago, written language, which provides the basis for communication without the presence of another person, emerges only in the last five or six thousand years. Just to keep it simple, 5 thousand years would be 2.5% of the entire history of language.

So for 97.5% of the history of language, it has been used in a high bandwidth situation. No wonder it has yet to adapt to narrow bandwidth situations.

If writing puts us into a narrow bandwidth situation and ambiguity, where does that leave our computers?

### Algorithm estimates who’s in control

Wednesday, January 4th, 2012

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. developed an algorithm that ranks people, based on how they speak to each other.

This on the heels of the Big Brother’s Name is… has to have you wondering if you even want Internet access at all.

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData

### News Cracking 1 : Meet the Editors

Saturday, December 31st, 2011

News Cracking 1 : Meet the Editors

I posted recently about visualizing the relationships between editors and the countries appearing in articles they edit for news articles published by Reuters. I’ve since updated my experimental news aggregation site (now it is intended to eventually be more of a meta-news analysis site) to display only Reuters articles and to extract the names of contributors, including the editors. The overall list of editors is maintained (in the right column) and each editor is displayed with the number of articles observed for which they have attribution. Currently Cynthia Johnston and Tim Dobbyn are at the top of the list.

What do you think about Matthew’s plans for future tracking? Thoughts on how subject identities might/might not be helpful? Comment at Matthew’s blog.

I don’t know if CNN is still this way because it has been a long time since I have seen it but it used to repeat the same news over and over again every 24 hour cycle. It might be amusing to see how short a summary could be created for some declared 24 hour news cycle. I suppose the only problem would be that if CNN “knew” it was being watched, they would introduce artificial diversity into the newscast.

Still, I suppose one could capture the audio track and using voice recognition software collapse all the repetitive statements, excluding the commercials (or including commercials as well). Maybe I do need a cable TV connection in my home office.

### BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora

Saturday, December 31st, 2011

BUCC 2012: The Fifth Workshop on Building and Using Comparable Corpora (Special topic: Language Resources for Machine Translation in Less-Resourced Languages and Domains)

Dates:

DEADLINE FOR PAPERS: 15 February 2012
Workshop Saturday, 26 May 2012
Lütfi Kirdar Istanbul Exhibition and Congress Centre
Istanbul, Turkey

Some of the information is from: Call for papers. The main conference site does not (yet) have the call for papers posted. Suggest that you verify dates with conference organizers before making travel arrangements.

In the language engineering and the linguistics communities, research in comparable corpora has been motivated by two main reasons. In language engineering, it is chiefly motivated by the need to use comparable corpora as training data for statistical NLP applications such as statistical machine translation or cross-lingual retrieval. In linguistics, on the other hand, comparable corpora are of interest in themselves by making possible inter-linguistic discoveries and comparisons. It is generally accepted in both communities that comparable corpora are documents in one or several languages that are comparable in content and form in various degrees and dimensions. We believe that the linguistic definitions and observations related to comparable corpora can improve methods to mine such corpora for applications of statistical NLP. As such, it is of great interest to bring together builders and users of such corpora.

The scarcity of parallel corpora has motivated research concerning the use of comparable corpora: pairs of monolingual corpora selected according to the same set of criteria, but in different languages or language varieties. Non-parallel yet comparable corpora overcome the two limitations of parallel corpora, since sources for original, monolingual texts are much more abundant than translated texts. However, because of their nature, mining translations in comparable corpora is much more challenging than in parallel corpora. What constitutes a good comparable corpus, for a given task or per se, also requires specific attention: while the definition of a parallel corpus is fairly straightforward, building a non-parallel corpus requires control over the selection of source texts in both languages.

Parallel corpora are a key resource as training data for statistical machine translation, and for building or extending bilingual lexicons and terminologies. However, beyond a few language pairs such as English-French or English-Chinese and a few contexts such as parliamentary debates or legal texts, they remain a scarce resource, despite the creation of automated methods to collect parallel corpora from the Web. To exemplify such issues in a practical setting, this year’s special focus will be on

Language Resources for Machine Translation in Less-Resourced Languages and Domains

with the aim of overcoming the shortage of parallel resources when building MT systems for less-resourced languages and domains, particularly by usage of comparable corpora for finding parallel data within and by reaching out for “hidden” parallel data. Lack of sufficient language resources for many language pairs and domains is currently one of the major obstacles in further advancement of machine translation.

Curious about the use of topic maps in the creation of comparable corpora? Seems like the use of language/domain scopes on linguistic data could result in easier construction of comparable corpora.

### Sexual Accommodation

Friday, December 30th, 2011

Sexual Accommodation by Mark Liberman.

You’ve probably noticed that how people talk depends on who they’re talking with. And for 40 years or so, linguists and psychologists and sociologists have referred to this process as “speech accommodation” or “communication accommodation” — or, for short, just plain “accommodation”. This morning’s Breakfast Experiment™ explores a version of the speech accommodation effect as applied to groups rather than individuals — some ways that men and women talk differently in same-sex vs. mixed-sex conversations.

I got the idea of doing this a couple of days ago, as I was indexing some conversational transcripts in order to find material for an experiment on a completely different topic. The transcripts in question come from a large collection of telephone conversations known as the “Fisher English” corpus, collected at the LDC in 2003 and published in 2004 and 2005. These two publications together comprise 11,699 two-person conversations, involving a diverse collection of speakers. While the sample is not demographically balanced in a strict sense, there is a good representation of speakers from all over the United States, across a wide range of ages, educational levels, occupations, and so forth.

I mention this because if usage varies by gender, doesn’t it also stand to reason that usage (read identification of subjects) varies by position in an organization?

Anyone who has been in an IT position can attest that conversations inside the IT department use a completely different vocabulary than when addressing people outside the department. For one, the term “idiot” is probably not used with reference to the CEO outside of the IT department.

Capturing the differences in vocabularies could be as useful as any result for an actual topic map, in terms of communication across levels of an organization.
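One rough way to capture such differences is to rank terms by how much more likely they are in one group's text than in the other's. The sketch below is only an illustration: the function name `distinctive_terms`, the whitespace tokenization, and the add-one smoothing are all my assumptions, not anything taken from the Liberman post.

```python
import math
from collections import Counter

def distinctive_terms(texts_a, texts_b, top=5):
    """Rank words by smoothed log-ratio of relative frequency in A versus B.

    Hypothetical helper: whitespace tokenization and add-one smoothing
    are simplifying assumptions for illustration.
    """
    counts_a = Counter(w for t in texts_a for w in t.lower().split())
    counts_b = Counter(w for t in texts_b for w in t.lower().split())
    vocab = set(counts_a) | set(counts_b)
    # Add-one smoothing so words unseen in one group do not divide by zero.
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {
        w: math.log((counts_a[w] + 1) / total_a) - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }
    # Sort by score (descending), breaking ties alphabetically for stable output.
    ranked = sorted(vocab, key=lambda w: (-scores[w], w))
    return ranked[:top], ranked[::-1][:top]

inside = ["reboot the server", "the server is down"]
outside = ["the system is unavailable", "please restart the system"]
inside_terms, outside_terms = distinctive_terms(inside, outside, top=3)
```

With real data you would want proper tokenization and far more text per group before trusting the ranking.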

Suggestions for text archives where that sort of difference could be investigated?

### Journal of Computing Science and Engineering

Monday, December 19th, 2011

Journal of Computing Science and Engineering

Journal of Computing Science and Engineering (JCSE) is a peer-reviewed quarterly journal that publishes high-quality papers on all aspects of computing science and engineering. The primary objective of JCSE is to be an authoritative international forum for delivering both theoretical and innovative applied researches in the field. JCSE publishes original research contributions, surveys, and experimental studies with scientific advances.

The scope of JCSE covers all topics related to computing science and engineering, with a special emphasis on the following areas: embedded computing, ubiquitous computing, convergence computing, green computing, smart and intelligent computing, and human computing.

I got here from following a sponsor link at a bioinformatics conference.

Then just picking at random from the current issue I see:

A Fast Algorithm for Korean Text Extraction and Segmentation from Subway Signboard Images Utilizing Smartphone Sensors by Igor Milevskiy, Jin-Young Ha.

Abstract:

We present a fast algorithm for Korean text extraction and segmentation from subway signboards using smart phone sensors in order to minimize computational time and memory usage. The algorithm can be used as preprocessing steps for optical character recognition (OCR): binarization, text location, and segmentation. An image of a signboard captured by smart phone camera while holding smart phone by an arbitrary angle is rotated by the detected angle, as if the image was taken by holding a smart phone horizontally. Binarization is only performed once on the subset of connected components instead of the whole image area, resulting in a large reduction in computational time. Text location is guided by user’s marker-line placed over the region of interest in binarized image via smart phone touch screen. Then, text segmentation utilizes the data of connected components received in the binarization step, and cuts the string into individual images for designated characters. The resulting data could be used as OCR input, hence solving the most difficult part of OCR on text area included in natural scene images. The experimental results showed that the binarization algorithm of our method is 3.5 and 3.7 times faster than Niblack and Sauvola adaptive-thresholding algorithms, respectively. In addition, our method achieved better quality than other methods.

Secure Blocking + Secure Matching = Secure Record Linkage by Alexandros Karakasidis, Vassilios S. Verykios.

Abstract:

Performing approximate data matching has always been an intriguing problem for both industry and academia. This task becomes even more challenging when the requirement of data privacy rises. In this paper, we propose a novel technique to address the problem of efficient privacy-preserving approximate record linkage. The secure framework we propose consists of two basic components. First, we utilize a secure blocking component based on phonetic algorithms statistically enhanced to improve security. Second, we use a secure matching component where actual approximate matching is performed using a novel private approach of the Levenshtein Distance algorithm. Our goal is to combine the speed of private blocking with the increased accuracy of approximate secure matching.
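To make the blocking-then-matching idea concrete, here is a plain, non-private sketch. It is not the authors' secure protocol: `crude_phonetic_key` is a hypothetical stand-in for a real phonetic algorithm, and the distance threshold is an arbitrary assumption. It only shows how blocking cuts down the number of Levenshtein comparisons.

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def crude_phonetic_key(name: str) -> str:
    """Hypothetical blocking key: first letter plus following consonants, cut to 4 chars."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    tail = [c for c in letters[1:] if c not in "aeiouyhw"]
    return (letters[0] + "".join(tail))[:4]

def link(left, right, max_distance=2):
    """Compare only records that share a blocking key, then match by edit distance."""
    blocks = defaultdict(list)
    for r in right:
        blocks[crude_phonetic_key(r)].append(r)
    matches = []
    for l in left:
        for r in blocks.get(crude_phonetic_key(l), []):
            if levenshtein(l.lower(), r.lower()) <= max_distance:
                matches.append((l, r))
    return matches

pairs = link(["Jonson"], ["Johnson", "Smith"])
```

The privacy-preserving part, which is the paper's actual contribution, is deliberately left out here.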

A Survey of Transfer and Multitask Learning in Bioinformatics by Qian Xu, Qiang Yang.

Abstract:

Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. However, in practice, high quality labeled data is scarce, and to label new data incurs high costs. Transfer and multitask learning offer an attractive alternative, by allowing useful knowledge to be extracted and transferred from data in auxiliary domains helps counter the lack of data problem in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.

And the ones I didn’t list from the current issue are just as interesting and relevant to identity/mapping issues.

This journal is a good example of people who have deliberately reached further across disciplinary boundaries than most.

About the only excuse left for not doing so is the discomfort of being the newbie in a field not your own.

Is that a good enough reason to miss possible opportunities to make critical advances in your home field? (Only you can answer that for yourself. No one can answer it for you.)

### Surrogate Learning

Monday, November 28th, 2011

Surrogate Learning – From Feature Independence to Semi-Supervised Classification by Sriharsha Veeramachaneni and Ravi Kumar Kondadadi.

We consider the task of learning a classifier from the feature space $X$ to the set of classes $Y = \{0, 1\}$, when the features can be partitioned into class-conditionally independent feature sets $X_1$ and $X_2$. We show that the class-conditional independence can be used to represent the original learning task in terms of 1) learning a classifier from $X_2$ to $X_1$ (in the sense of estimating the probability $P(x_1|x_2)$) and 2) learning the class-conditional distribution of the feature set $X_1$. This fact can be exploited for semi-supervised learning because the former task can be accomplished purely from unlabeled samples. We present experimental evaluation of the idea in two real world applications.
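A short derivation, worked out from the independence assumption stated in the abstract rather than copied from the paper, shows why the two sub-tasks suffice. Writing $x_1$, $x_2$ for values of the two feature sets and $y$ for the class, class-conditional independence gives $P(x_1 \mid y, x_2) = P(x_1 \mid y)$, so:

$$P(x_1 \mid x_2) = \sum_{y \in \{0,1\}} P(x_1 \mid y, x_2)\, P(y \mid x_2) = \sum_{y \in \{0,1\}} P(x_1 \mid y)\, P(y \mid x_2)$$

Substituting $P(y=0 \mid x_2) = 1 - P(y=1 \mid x_2)$ and solving:

$$P(y=1 \mid x_2) = \frac{P(x_1 \mid x_2) - P(x_1 \mid y=0)}{P(x_1 \mid y=1) - P(x_1 \mid y=0)}$$

This holds whenever the denominator is nonzero. The numerator's first term needs only unlabeled samples, and the remaining terms are the class-conditional distribution of $x_1$.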

The two “real world” applications are ones you are likely to encounter:

First:

Our problem consisted of merging each of ≈ 20000 physician records, which we call the update database, to the record of the same physician in a master database of ≈ $10^6$ records.

Our old friends record linkage and entity resolution. The solution depends upon a clever choice of features for application of the technique. (The thought occurs to me that a repository of data analysis snippets for particular techniques would be as valuable as, if not more valuable than, the techniques themselves. Techniques come and go. Data analysis and the skills it requires go on and on.)

Second:

Sentence classification is often a preprocessing step for event or relation extraction from text. One of the challenges posed by sentence classification is the diversity in the language for expressing the same event or relationship. We present a surrogate learning approach to generating paraphrases for expressing the merger-acquisition (MA) event between two organizations in financial news. Our goal is to find paraphrase sentences for the MA event from an unlabeled corpus of news articles, that might eventually be used to train a sentence classifier that discriminates between MA and non-MA sentences. (Emphasis added. This is one of the issues in the legal track at TREC.)

This test was against 700000 financial news records.

Both tests were quite successful.

Surrogate learning looks interesting for a range of NLP applications.

### Template-Based Information Extraction without the Templates

Monday, November 28th, 2011

Template-Based Information Extraction without the Templates by Nathanael Chambers and Dan Jurafsky.

Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.

Can you say association?

Definitely points towards a pipeline approach to topic map authoring. To abuse the term, perhaps a “dashboard” that allows selection of data sources followed by the construction of workflows with preliminary analysis being displayed at “breakpoints” in the processing. No particular reason why stages have to be wired together other than tradition.

Just looking a little bit into the future, imagine that some entities weren’t being recognized at a high enough rate. So you shift that part of the data to several thousand human entity processors, take the average of their results (which is higher than what you were getting), and feed that back into the system. You could have knowledge workers who work full time but shift from job to job, performing tasks too difficult to program effectively.

### collocations in wikipedia – parts 2 and 3

Friday, November 18th, 2011

collocations in wikipedia, part 2

In part 2 Matt discusses alternatives to “magic” frequency cut-offs for collocation analysis.

I rather like the idea of looking for alternatives to “it’s just that way” methodologies. Accepting traditional cut-offs, etc., may be the right thing to do in some cases, but only with experience and an understanding of the alternatives.

finding phrases with mutual information [collocations, part 3]

In part 3 Matt discusses taking collocations beyond just two terms that occur together and techniques for that analysis.
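For readers who have not met mutual information as a collocation score, here is a minimal sketch of pointwise mutual information (PMI) over adjacent word pairs. It is not Matt's code: the function name `bigram_pmi`, the maximum-likelihood probability estimates, and the `min_count` parameter (exactly the kind of "magic" cut-off part 2 questions) are my assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Score each adjacent word pair by PMI = log2( P(x, y) / (P(x) * P(y)) ).

    Probabilities are plain relative frequencies; min_count is a
    hypothetical frequency cut-off for dropping rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

tokens = "new york is a city and new york is big".split()
scores = bigram_pmi(tokens, min_count=2)
```

PMI tends to reward rare pairs, since a pair seen once between two words seen once scores very high, which is one reason frequency cut-offs get bolted on in the first place.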

Matt is also posting todo thoughts for further investigation.

If you have the time and interest, drop by Matt’s blog to leave suggestions or comments.

(See collocations in wikipedia, part 1 for our coverage of the first post.)

### Twitter POS Tagging with LingPipe and ARK Tweet Data

Monday, November 14th, 2011

Twitter POS Tagging with LingPipe and ARK Tweet Data by Bob Carpenter.

From the post:

We will train and test on anything that’s easy to parse.

How’s that for a motto!

Social media may be more important than I thought it was several years ago. It may just be the serialization in digital form of all the banter in bars, at block parties and around the water cooler. If that is true, then governments would be well advised to encourage and assist with access to social media. To give them an even chance of leaving ahead of the widow maker.

Think of mining Twitter data like the NSA and phone traffic, but you aren’t doing anything illegal.

### Active learning: far from solved

Wednesday, October 12th, 2011

Active learning: far from solved

As Daniel Hsu and John Langford pointed out recently, there has been a lot of recent progress in active learning. This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it’s very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don’t tend to make much use of active learning and (b) why the problem is far from solved. (a will lead to b.)

This is a deeply interesting article that could give rise to mini and major projects. I particularly like his point about not throwing away training data. No, you have to read the post for yourself. It’s not that long.
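As a reference point for the baseline the post mentions, query-by-uncertainty (QBU) for a binary classifier can be sketched in a few lines: ask for labels on the items the current model is least sure about. The function name and the `predict_proba` callable are hypothetical placeholders for whatever model you have. This is the "usual" baseline, not the newer algorithms the post discusses.

```python
import math

def query_by_uncertainty(unlabeled, predict_proba, batch_size=1):
    """Pick the unlabeled items whose predicted P(positive) is closest to 0.5.

    predict_proba is an assumed callable mapping an item to the model's
    probability of the positive class.
    """
    ranked = sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))
    return ranked[:batch_size]

# Toy stand-in model: a logistic curve over a single numeric feature.
def toy_model(x):
    return 1.0 / (1.0 + math.exp(-x))

pool = [-3.0, -0.2, 0.1, 2.5]
to_label = query_by_uncertainty(pool, toy_model, batch_size=2)
```

In a real loop you would label the selected items, retrain, and repeat.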

### Scala Tutorial – Tuples, Lists, methods on Lists and Strings

Monday, October 3rd, 2011

Scala Tutorial – Tuples, Lists, methods on Lists and Strings

I mention this not only because it looks like a good Scala tutorial series but also because it is being developed in connection with a course on computational linguistics at UT Austin (sorry, University of Texas at Austin, USA).

The cross-over between computer programming and computational linguistics illustrates the artificial nature of the divisions we make between disciplines and professions.

### Modeling Item Difficulty for Annotations of Multinomial Classifications

Sunday, September 25th, 2011

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

For your convenience, links for the “…tutorial for LREC with Massimo Poesio” can be found at: LREC 2010 Tutorial: Modeling Data Annotation.