Archive for the ‘Translation’ Category

When will my computer understand me?

Monday, June 10th, 2013

When will my computer understand me?

From the post:

It’s not hard to tell the difference between the “charge” of a battery and criminal “charges.” But for computers, distinguishing between the various meanings of a word is difficult.

For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven initially by efforts to translate Russian scientific texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the continuing difficulty of the problem.

Our ability to easily distinguish between multiple word meanings is rooted in a lifetime of experience. Using the context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker’s intention, we intuit what another person is telling us.

“In the past, people have tried to hand-code all of this knowledge,” explained Katrin Erk, a professor of linguistics at The University of Texas at Austin focusing on lexical semantics. “I think it’s fair to say that this hasn’t been successful. There are just too many little things that humans know.”

Other efforts have tried to use dictionary meanings to train computers to better understand language, but these attempts have also faced obstacles. Dictionaries have their own sense distinctions, which are crystal clear to the dictionary-maker but murky to the dictionary reader. Moreover, no two dictionaries provide the same set of meanings — frustrating, right?

Watching annotators struggle to make sense of conflicting definitions led Erk to try a different tactic. Instead of hard-coding human logic or deciphering dictionaries, why not mine a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a weighted map of relationships — a dictionary without a dictionary?

“An intuition for me was that you could visualize the different meanings of a word as points in space,” she said. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.’”

Before you jump to the post looking for the code, Erk is working with a 10,000 dimension space to analyze her data.

The most recent paper: Montague Meets Markov: Deep Semantics with Probabilistic Logical Form (2013)

Abstract:

We combine logical and distributional representations of natural language meaning by transforming distributional similarity judgments into weighted inference rules using Markov Logic Networks (MLNs). We show that this framework supports both judging sentence similarity and recognizing textual entailment by appropriately adapting the MLN implementation of logical connectives. We also show that distributional phrase similarity, used as textual inference rules created on the fly, improves its performance.

NLP Weather: High Pressure or Low?

Tuesday, June 4th, 2013

Machine Translation Without the Translation by Geoffrey Pullum.

From the post:

I have been ruminating this month on why natural language processing (NLP) still hasn’t arrived, and I have pointed to three developments elsewhere that seem to be discouraging its development. First, enhanced keyword search via Google’s influentiality-ranking of results. Second, the dramatic enhancement in applicability of speech recognition that dialog design facilitates. I now turn to a third, which has to do with the sheer power of number-crunching.

Machine translation is the unclimbed Everest of computational linguistics. It calls for syntactic and semantic analysis of the source language, mapping source-language meanings to target-language meanings, and generating acceptable output from the latter. If computational linguists could do all those things, they could hang up the “mission accomplished” banner.

What has emerged instead, courtesy of Google Translate, is something utterly different: pseudotranslation without analysis of grammar or meaning, developed by people who do not know (or need to know) either source or target language.

The trick: huge quantities of parallel texts combined with massive amounts of rapid statistical computation. The catch: low quality, and output inevitably peppered with howlers.

Of course, if I may purloin Dr Johnson’s remark about a dog walking on his hind legs, although it is not done well you are surprised to find it done at all. For Google Translate’s pseudotranslation is based on zero linguistic understanding. Not even word meanings are looked up: The program couldn’t care less about the meaning of anything. Here, roughly, is how it works.

(…)

My conjecture is that it is useful enough to constitute one more reason for not investing much in trying to get real NLP industrially developed and deployed.

NLP will come, I think; but when you take into account the ready availability of (1) Google search, and (2) speech-driven applications aided by dialog design, and (3) the statistical pseudotranslation briefly discussed above, the cumulative effect is enough to reduce the pressure to develop NLP, and will probably delay its arrival for another decade or so.

Surprised to find that Geoffrey thinks more pressure will result in “real NLP,” albeit delayed by a decade or so for the reasons outlined in his post.

If you recall, machine translation of texts was the hot topic at the end of the 1950′s and early 1960′s.

With an emphasis on automatic translation of Russian. Height of the cold war so there was lots of pressure for a solution.

Lots of pressure then did not result in a solution.

There’s a rather practical reason for not investing in “real NLP.”

There is no evidence that how humans “understand” language is known well enough to program a computer to mimic that “understanding.”

If Geoffrey has evidence to the contrary, I am sure everyone would be glad to hear about it.

Massively Parallel Suffix Array Queries…

Saturday, May 11th, 2013

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.

Abstract:

Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.

Translation Memory

Tuesday, December 6th, 2011

Translation Memory

As we mentioned in Teaching Etsy to Speak a Second Language, developers need to tag English content so it can be extracted and then translated. Since we are a company with a continuous deployment development process, we do this on a daily basis and as an result get a significant number of new messages to be translated along with changes or deletions of existing ones that have already been translated. Therefore we needed some kind of recollection system to easily reuse or follow the style of existing translations.

A translation memory is an organized collection of text extracted from a source language with one or more matching translations. A translation memory system stores this data and makes it easily accessible to human translators in order to assist with their tasks. There’s a variety of translation memory systems and related standards in the language industry. Yet, the nature of our extracted messages (containing relevant PHP, Smarty, and JavaScript placeholders) and our desire to maintain a translation style curated by a human language manager made us develop an in-house solution.

Go ahead, read the rest of the post, I’ll wait.

Interesting yes?

What if the title of my post were identification memory?

Not really that much difference between translation language to language and identification to identification, where we are talking about the same subject.

Hardly any difference at all when you think about it.

I am sure your current vendors will assure you their methods of identification are the best and they may be right. But on the other hand, they may also be wrong.

And there always is the issues of other data sources that have chosen to identify the same subjects differently. Like your company down the road, say five years from now. Preparing now for that “translation” project in the not too distant future, may save you from losing critical information down the road.

Preserving access to critical data is a form of translation memory. Yes?