Archive for the ‘Translation’ Category

Machine Translation and Automated Analysis of Cuneiform Languages

Monday, October 2nd, 2017

Machine Translation and Automated Analysis of Cuneiform Languages

From the webpage:

The MTAAC project develops and applies new computerized methods to translate and analyze the contents of some 67,000 highly standardized administrative documents from southern Mesopotamia (ancient Iraq) from the 21st century BC. Our methodology, which combines machine learning with statistical and neural machine translation technologies, can then be applied to other ancient languages. This methodology, the translations, and the historical, social and economic data extracted from them, will be offered to the public in open access.

A recently funded (March 2017) project that strikes a number of resonances with me!

“Open access” and cuneiform isn’t an unheard-of combination, but many remember when access to cuneiform primary materials was a matter of whim and caprice. There are dark pockets where such practices continue, but projects like MTAAC are hard on their heels.

The use of machine learning and automated analysis has the potential, once all extant cuneiform texts are available (multiple projects such as this one are working toward that), to provide a firm basis for grammars, lexicons, and translations.

Do read: Machine Translation and Automated Analysis of the Sumerian Language by Émilie Pagé-Perron, Maria Sukhareva, Ilya Khait, Christian Chiarcos, for more details about the project.

There’s more to data science than taking advantage of sex-starved neurotics with under five second attention spans and twitchy mouse fingers.

Math Translator Wanted/Topic Map Needed: Mochizuki and the ABC Conjecture

Monday, January 4th, 2016

From the post:

The conjecture is fairly easy to state. Suppose we have three positive integers a,b,c satisfying a+b=c and having no prime factors in common. Let d denote the product of the distinct prime factors of the product abc. Then the conjecture asserts roughly there are only finitely many such triples with c > d. Or, put another way, if a and b are built up from small prime factors then c is usually divisible only by large primes.

Here’s a simple example. Take a=16, b=21, and c=37. In this case, d = 2x3x7x37 = 1554, which is greater than c. The ABC conjecture says that this happens almost all the time. There is plenty of numerical evidence to support the conjecture, and most experts in the field believe it to be true. But it hasn’t been mathematically proven — yet.
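The arithmetic in the example is easy to check for yourself. A few lines of Python (my sketch, not from the quoted post) compute the radical d of abc and confirm the numbers:

```python
def radical(n):
    """Product of the distinct prime factors of n."""
    d, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            d *= p
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:  # whatever remains is a prime factor
        d *= n
    return d

a, b, c = 16, 21, 37
assert a + b == c
d = radical(a * b * c)
print(d)      # 2 * 3 * 7 * 37 = 1554
print(d > c)  # True: the "usual" case the conjecture describes
```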

Enter Mochizuki. His papers develop a subject he calls Inter-Universal Teichmüller Theory, and in this setting he proves a vast collection of results that culminate in a putative proof of the ABC conjecture. Full of definitions and new terminology invented by Mochizuki (there’s something called a Frobenioid, for example), almost everyone who has attempted to read and understand it has given up in despair. Add to that Mochizuki’s odd refusal to speak to the press or to travel to discuss his work and you would think the mathematical community would have given up on the papers by now, dismissing them as unlikely to be correct. And yet, his previous work is so careful and clever that the experts aren’t quite ready to give up.

It’s not clear what the future holds for Mochizuki’s proof. A small handful of mathematicians claim to have read, understood and verified the argument; a much larger group remains completely baffled. The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts. Until that happens, the status of the ABC conjecture will remain unclear.

It’s hard to imagine a more classic topic map problem.

At some point, Shinichi Mochizuki shared a common vocabulary with his colleagues in number theory and arithmetic geometry, but he no longer does.

As Kevin points out:

The December workshop reinforced the community’s desperate need for a translator, someone who can explain Mochizuki’s strange new universe of ideas and provide concrete examples to illustrate the concepts.

Taking Mochizuki’s present vocabulary and working backwards to where he shared a common vocabulary with his colleagues is simple enough to say.

The crux of the problem is that the relevant discussions are going to be fragmented, distributed across a variety of formal and informal venues.

Combining those discussions to construct a path back to where most number theorists reside today would require a system with as few starting assumptions as possible, one where you could describe as much or as little about new subjects and their relations to other subjects as is necessary for an expert audience to fill in any gaps.

I’m not qualified to venture an opinion on the conjecture or Mochizuki’s proof but the problem of mapping from new terminology that has its own context back to “standard” terminology is a problem uniquely suited to topic maps.

Saturday, December 26th, 2015

From the post:

It was early 1954 when computer scientists, for the first time, publicly revealed a machine that could translate between human languages. It became known as the Georgetown-IBM experiment: an “electronic brain” that translated sentences from Russian into English.

The scientists believed a universal translator, once developed, would not only give Americans a security edge over the Soviets but also promote world peace by eliminating language barriers.

They also believed this kind of progress was just around the corner: Leon Dostert, the Georgetown language scholar who initiated the collaboration with IBM founder Thomas Watson, suggested that people might be able to use electronic translators to bridge several languages within five years, or even less.

The process proved far slower. (So slow, in fact, that about a decade later, funders of the research launched an investigation into its lack of progress.) And more than 60 years later, a true real-time universal translator — a la C-3PO from Star Wars or the Babel Fish from The Hitchhiker’s Guide to the Galaxy — is still the stuff of science fiction.

How far are we from one, really? Expert opinions vary. As with so many other areas of machine learning, it depends on how quickly computers can be trained to emulate human thinking.

The Star Trek: The Next Generation episode “Darmok” was set during a five-year mission that began in 2364, some 349 years in our future. Faster-than-light travel, teleportation, etc. are day-to-day realities. One expects machine translation to have improved at least as much.

As Li reports, exciting progress is being made with neural networks for translation, but transposing words from one language to another, as illustrated in “Darmok,” is no guarantee of “universal understanding.”

In fact, the transposition may be as opaque as the statement in its original language. A phrase such as “Darmok and Jalad at Tanagra” leaves the hearer to wonder what happened at Tanagra, what the relationship between Darmok and Jalad was, and so on.

In the early lines of The Story of the Shipwrecked Sailor, a Middle Kingdom story (Egypt, 2000 BCE – 1700 BCE), there is a line describing the sailor’s return home with words to the effect of “…we struck….” Then the next sentence simply picks up.

The words necessary to complete that statement don’t occur in the text. You have to know that mooring a boat on the Nile did not involve piers: you simply banked your boat and then drove a mooring post (the unstated object of “we struck”) to secure the vessel.

Transposition from Middle Egyptian to English leaves you without a clue as to the meaning of that passage.

To be sure, neural networks may clear away some of the rote work of transposition between languages but that is a far cry from “universal understanding.”

Both now and likely to continue into the 24th century.

Praise For Conservative Bible Translations

Wednesday, September 9th, 2015

I don’t often read praise for conservative Bible translations but conservative Bible translations can have unexpected uses:

Anders Søgaard and his colleagues from the project LOWLANDS: Parsing Low-Resource Languages and Domains are utilising the texts which were annotated for big languages to develop language technology for smaller languages, the key to which is to find translated texts so that the researchers can transfer knowledge of one language’s grammar onto another language:

“The Bible has been translated into more than 1,500 languages, even the smallest and most ‘exotic’ ones, and the translations are extremely conservative; the verses have a completely uniform structure across the many different languages which means that we can make suitable computer models of even very small languages where we only have a couple of hundred pages of biblical text,” Anders Søgaard says and elaborates:

“We teach the machines to register what is translated with what in the different translations of biblical texts, which makes it possible to find so many similarities between the annotated and unannotated texts that we can produce exact computer models of 100 different languages — languages such as Swahili, Wolof and Xhosa that are spoken in Nigeria. And we have made these models available for other developers and researchers. This means that we will be able to develop language technology resources for these languages similar to those which speakers of languages such as English and French have.”

Anders Søgaard and his colleagues recently presented their results in the article “If all you have is a bit of the Bible” at the Annual Meeting of the Association for Computational Linguistics.

The abstract for the paper: If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages reads:

We present a simple method for learning part-of-speech taggers for languages like Akawaio, Aukan, or Cakchiquel – languages for which nothing but a translation of parts of the Bible exists. By aggregating over the tags from a few annotated languages and spreading them via word-alignment on the verses, we learn POS taggers for 100 languages, using the languages to bootstrap each other. We evaluate our cross-lingual models on the 25 languages where test sets exist, as well as on another 10 for which we have tag dictionaries. Our approach performs much better (20-30%) than state-of-the-art unsupervised POS taggers induced from Bible translations, and is often competitive with weakly supervised approaches that assume high-quality parallel corpora, representative monolingual corpora with perfect tokenization, and/or tag dictionaries. We make models for all 100 languages available.
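The aggregate-and-spread idea can be made concrete with a toy sketch in Python. Everything in it, the target tokens, the alignments, and the tag inventory, is invented for illustration; the real system works over Bible verses in many languages with automatic word alignments:

```python
from collections import Counter

# Toy "verse": tagged tokens in two annotated source languages,
# each aligned (by index pairs) to tokens of an unannotated target language.
target = ["ena", "apoka", "kumili"]  # invented target-language tokens
sources = [
    # (tagged source tokens, alignment as (source_idx, target_idx) pairs)
    ([("the", "DET"), ("man", "NOUN"), ("drinks", "VERB")],
     [(0, 0), (1, 1), (2, 2)]),
    ([("homme", "NOUN"), ("boit", "VERB")],
     [(0, 1), (1, 2)]),
]

# Spread tags through the alignments and take a majority vote per token.
votes = [Counter() for _ in target]
for tokens, alignment in sources:
    for s, t in alignment:
        votes[t][tokens[s][1]] += 1

projected = [v.most_common(1)[0][0] for v in votes]
print(projected)  # ['DET', 'NOUN', 'VERB']
```

The projected tags then serve as (noisy) training data for a POS tagger in the target language.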

All of the resources used in this project, along with their models, can be found at: https://bitbucket.org/lowlands/

Don’t forget conservative Bible translations if you are doing linguistic models.

Increase Multi-Language Productivity with Document Translator

Wednesday, July 15th, 2015

Increase Multi-Language Productivity with Document Translator

From the post:

The Document Translator app and the associated source code demonstrate how Microsoft Translator can be integrated into enterprise and business workflows. The app allows you to rapidly translate documents, individually or in batches, with full fidelity—keeping formatting such as headers and fonts intact, and allowing you to continue editing if necessary. Using the Document Translator code and documentation, developers can learn how to incorporate the functionality of the Microsoft Translator cloud service into a custom workflow, or add extensions and modifications to the batch translation app experience. Document Translator is a showcase for use of the Microsoft Translator API to increase productivity in a multi-language environment, released as an open source project on GitHub.

Whether you are writing in Word, pulling together the latest numbers into Excel, or creating presentations in PowerPoint, documents are at the center of many of your everyday activities. When your team speaks multiple languages, quick and efficient translation is essential to your organization’s communication and productivity. Microsoft Translator already brings the speed and efficiency of automatic translation to Office, Yammer, as well as a number of other apps, websites and workflows. Document Translator uses the power of the Translator API to accelerate the translation of large numbers of Word, PDF*, PowerPoint, or Excel documents into all the languages supported by Microsoft Translator.
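If you want to wire translation into your own document workflow, the batch pattern is straightforward. In this sketch, `translate_text` is a stand-in of my own invention for a real Microsoft Translator client call; swap in the official client and your credentials there:

```python
from pathlib import Path

def translate_text(text, target_lang):
    # Placeholder for a real Microsoft Translator API call.
    return f"[{target_lang}] {text}"

def translate_batch(src_dir, out_dir, target_lang):
    """Translate every .txt document in src_dir into target_lang,
    writing results under the same file names in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc in sorted(Path(src_dir).glob("*.txt")):
        translated = translate_text(doc.read_text(encoding="utf-8"), target_lang)
        (out / doc.name).write_text(translated, encoding="utf-8")
```

The Document Translator app adds the hard part this sketch omits: preserving formatting inside Word, PowerPoint, and Excel files.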

How many languages does your topic map offer?

That many?

The Translator FAQ lists these languages for the Document Translator:

Microsoft Translator supports languages that cover more than 95% of worldwide gross domestic product (GDP)…and one language that is truly out of this world: Klingon.

Arabic, Bosnian (Latin), Bulgarian, Catalan, Chinese Simplified, Chinese Traditional, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Haitian Creole, Hebrew, Hindi, Hmong Daw, Hungarian, Indonesian, Italian, Japanese, Klingon, Klingon (pIqaD), Korean, Latvian, Lithuanian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Queretaro Otomi, Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yucatec Maya

I have never looked for a topic map in Klingon but a translation could be handy at DragonCon.

Fifty-one languages by my count. What did you say your count was? 😉

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Wednesday, February 18th, 2015

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars by Hua He, Jimmy Lin, Adam Lopez. (Transactions of the Association for Computational Linguistics, vol. 3, pp. 87–100, 2015.)

Abstract:

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.
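For a sense of what “matching patterns that contain gaps” involves, here is a naive word-level matcher. This is my own toy formulation for illustration, not the paper's GPU algorithm: the fixed chunks of a pattern such as “the X of” must appear in order, with arbitrary (possibly empty) material in the gaps:

```python
def gappy_match(pattern, sentence):
    """Greedily find pattern (a list of word-sequences separated by
    gaps) as in-order, non-overlapping spans; None if no match."""
    pos = 0
    spans = []
    for chunk in pattern:
        n = len(chunk)
        for i in range(pos, len(sentence) - n + 1):
            if sentence[i:i + n] == chunk:
                spans.append((i, i + n))
                pos = i + n
                break
        else:
            return None
    return spans

sentence = "the house of the rising sun".split()
pattern = [["the"], ["of"]]            # "the X of" with a gap for X
print(gappy_match(pattern, sentence))  # [(0, 1), (2, 3)]
```

Doing this over millions of suffix-array lookups is the bottleneck the GPU algorithm attacks.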

If you are interested in cross-language search, DNA sequence alignment or other pattern matching problems, you need to watch the progress of this work.

This article and other important research is freely accessible at: Transactions of the Association for Computational Linguistics

WorldWideScience.org (Update)

Wednesday, January 28th, 2015

I first wrote about WorldWideScience.org in a post dated October 17, 2011.

A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

A current description of the system from the customer story:

In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with WorldWideScience.org, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, WorldWideScience.org instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

Automated searching and translation can’t give you the semantic nuances possible with human authoring, but they certainly can provide the source materials to build a specialized information resource with such semantics.

Very much a site to bookmark and use on a regular basis.

Links for subjects not linked above:

Deep Web Technologies

Microsoft Translator

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page’s topic?), recommendation systems (identifying a user’s likes to recommend more relevant content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group?) and much more.

Most applications of these are based on only one language, usually English. However, it would be better to be able to process documents in any language. For example, consider a recommender system and a user who speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for the next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would be able to merge them easily into the same keyword to build a language-independent user profile to be used for accurate French and English recommendations.

This article shows one way to achieve good results using a simple strategy. It is obvious that we can achieve better results using more complex algorithms.
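The merging step Lemoine describes amounts to a lookup through interlanguage links: map each surface keyword to a language-independent concept identifier and aggregate there. A toy sketch (the link table and concept IDs below are hand-written for illustration; the real mapping comes from a Wikipedia dump via the Wikipedia Miner):

```python
# Map (language, surface keyword) pairs to a single concept identifier.
interlanguage = {
    ("en", "Airplane"): "Q197",  # Wikidata-style IDs, used here only
    ("fr", "Avion"): "Q197",     # as arbitrary concept labels
    ("en", "Paris"): "Q90",
    ("fr", "Paris"): "Q90",
}

def concept(lang, keyword):
    return interlanguage.get((lang, keyword))

# A user's positive ratings in two languages collapse into one profile.
ratings = [("en", "Airplane"), ("fr", "Avion"), ("fr", "Paris")]
profile = {}
for lang, kw in ratings:
    cid = concept(lang, kw)
    if cid:
        profile[cid] = profile.get(cid, 0) + 1
print(profile)  # {'Q197': 2, 'Q90': 1}
```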

The NSA can hire translators, so I would not bother sharing with them this technique for harnessing the thousands of expert hours invested in Wikipedia.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.

UniArab:…

Thursday, June 5th, 2014

UniArab: An RRG Arabic-to-English Machine Translation Software by Dr. Brian Nolan and Yasser Salem.

A slide deck introducing UniArab.

I first saw this mentioned in a tweet by Christopher Phipps.

Which was enough to make me curious about the software and perhaps the original paper.

UniArab: An RRG Arabic-to-English Machine Translation Software (paper) by Brian Nolan and Yasser Salem.

Abstract:

This paper presents a machine translation system (Hutchins 2003) called UniArab (Salem, Hensman and Nolan 2008). It is a proof-of-concept system supporting the fundamental aspects of Arabic, such as the parts of speech, agreement and tenses. UniArab is based on the linking algorithm of RRG (syntax to semantics and vice versa). UniArab takes MSA Arabic as input in the native orthography, parses the sentence(s) into a logical meta-representation based on the fully expanded RRG logical structures and, using this, generates perfectly grammatical English output with full agreement and morphological resolution. UniArab utilizes an XML-based implementation of elements of the Role and Reference Grammar theory in software. In order to analyse Arabic by computer we first extract the lexical properties of the Arabic words (Al-Sughaiyer and Al-Kharashi 2004). From the parse, it then creates a computer-based representation for the logical structure of the Arabic sentence(s). We use the RRG theory to motivate the computational implementation of the architecture of the lexicon in software. We also implement in software the RRG bidirectional linking system to build the parse and generate functions between the syntax-semantic interfaces. Through seven input phases, including the morphological and syntactic unpacking, UniArab extracts the logical structure of an Arabic sentence. Using the XML-based metadata representing the RRG logical structure, UniArab then accurately generates an equivalent grammatical sentence in the target language through four output phases. We discuss the technologies used to support its development and also the user interface that allows for the addition of lexical items directly to the lexicon in real time. The UniArab system has been tested and evaluated generating equivalent grammatical sentences, in English, via the logical structure of Arabic sentences, based on MSA Arabic input with very significant and accurate results (Izwaini 2006). At present we are working to greatly extend the coverage by the addition of more verbs to the lexicon. We have demonstrated in this research that RRG is a viable linguistic model for building accurate rule-based, semantically oriented machine translation software. Role and Reference Grammar (RRG) is a functional theory of grammar that posits a direct mapping between the semantic representation of a sentence and its syntactic representation. The theory allows a sentence in a specific language to be described in terms of its logical structure and grammatical procedures. RRG creates a linking relationship between syntax and semantics, and can account for how semantic representations are mapped into syntactic representations. We claim that RRG is very suitable for machine translation of Arabic, notwithstanding well-documented difficulties found within Arabic MT (Izwaini, S. 2006), and that RRG can be implemented in software as the rule-based kernel of an Interlingua bridge MT engine. The version of Arabic we consider in this paper (Ryding 2005, Alosh 2005, Schulz 2005) is Modern Standard Arabic (MSA), which is distinct from classical Arabic. In the Arabic linguistic tradition there is not a clear-cut, well defined analysis of the inventory of parts of speech in Arabic.
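A drastically simplified caricature of the parse → generate pipeline may help: Arabic tokens are mapped through a lexicon into a predicate-argument logical structure, and English is generated from that structure. The two-entry lexicon and feature names below are my inventions, nothing like UniArab's XML-based RRG representation:

```python
# Toy transfer through a logical meta-representation.
lexicon = {
    "يشرب": {"pred": "drink", "pos": "V", "tense": "pres"},
    "الرجل": {"pred": "man", "pos": "N", "def": True},
}

def parse(tokens):
    """Map (VSO) Arabic tokens to a predicate-argument logical structure."""
    verb = next(lexicon[t] for t in tokens if lexicon[t]["pos"] == "V")
    arg = next(lexicon[t] for t in tokens if lexicon[t]["pos"] == "N")
    return {"pred": verb["pred"], "tense": verb["tense"], "actor": arg}

def generate(ls):
    """Generate an English clause from the logical structure."""
    det = "the " if ls["actor"].get("def") else ""
    verb = ls["pred"] + ("s" if ls["tense"] == "pres" else "")
    return f"{det}{ls['actor']['pred']} {verb}".capitalize()

print(generate(parse(["يشرب", "الرجل"])))  # The man drinks
```

The real system runs seven input phases and four output phases where this sketch has one of each.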

At least as of today, http://informatics.itbresearch.id/~ysalem/ times out. Other pointers?

Interesting work on Arabic translation. Makes me curious about adaptation of these techniques to map between semantic domains.
To index is to translate

Tuesday, July 30th, 2013

To index is to translate by Fran Alexander.

From the post:

Living in Montreal means I am trying to improve my very limited French and in trying to communicate with my Francophone neighbours I have become aware of a process of attempting to simplify my thoughts and express them using the limited vocabulary and grammar that I have available. I only have a few nouns, fewer verbs, and a couple of conjunctions that I can use so far and so trying to talk to people is not so much a process of thinking in English and translating that into French, as considering the basic core concepts that I need to convey and finding the simplest ways of expressing relationships. So I will say something like “The sun shone. It was big. People were happy” because I can’t properly translate “We all loved the great weather today”.

This made me realise how similar this is to the process of breaking down content into key concepts for indexing. My limited vocabulary is much like the controlled vocabulary of an indexing system, forcing me to analyse and decompose my ideas into simple components and basic relationships. This means I am doing quite well at fact-based communication, but my storytelling has suffered as I have only one very simple emotional register to work with. The best I can offer is a rather laconic style with some simple metaphors: “It was like a horror movie.”

It is regularly noted that ontology work in the sciences has forged ahead of that in the humanities, and the parallel with my ability to express facts but not tell stories struck me. When I tell my simplified stories I rely on shared understanding of a broad cultural context that provides the emotional aspect – I can use the simple expression “horror movie” because the concept has rich emotional associations, connotations, and resonances for people. The concept itself is rather vague, broad, and open to interpretation, so the shared understanding is rather thin. The opposite is true of scientific concepts, which are honed into precision and a very constrained definitive shared understanding. So, I wonder how much of my sense that I can express facts well is actually an illusion, and it is just that those factual concepts have few emotional resonances.

Is mapping a process of translation?

Are translations always less rich than the source?

Or are translations as rich but differently rich?

When will my computer understand me?

Monday, June 10th, 2013

When will my computer understand me?

From the post:

It’s not hard to tell the difference between the “charge” of a battery and criminal “charges.” But for computers, distinguishing between the various meanings of a word is difficult.

For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven initially by efforts to translate Russian scientific texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the continuing difficulty of the problem.

Our ability to easily distinguish between multiple word meanings is rooted in a lifetime of experience. Using the context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker’s intention, we intuit what another person is telling us.

“In the past, people have tried to hand-code all of this knowledge,” explained Katrin Erk, a professor of linguistics at The University of Texas at Austin focusing on lexical semantics. “I think it’s fair to say that this hasn’t been successful. There are just too many little things that humans know.”

Other efforts have tried to use dictionary meanings to train computers to better understand language, but these attempts have also faced obstacles. Dictionaries have their own sense distinctions, which are crystal clear to the dictionary-maker but murky to the dictionary reader. Moreover, no two dictionaries provide the same set of meanings — frustrating, right?

Watching annotators struggle to make sense of conflicting definitions led Erk to try a different tactic. Instead of hard-coding human logic or deciphering dictionaries, why not mine a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a weighted map of relationships — a dictionary without a dictionary?

“An intuition for me was that you could visualize the different meanings of a word as points in space,” she said. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.'”

Before you jump to the post looking for the code, note that Erk is working with a 10,000-dimension space to analyze her data.
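Erk's picture of word uses as points in space can be illustrated with tiny context vectors and cosine similarity. The four dimensions and values below are invented for illustration; her models live in a space of roughly 10,000 dimensions built from corpus co-occurrence counts:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented 4-d context vectors: (electricity, crime, court, energy)
usages = {
    "battery charge":  (0.9, 0.0, 0.1, 0.8),
    "criminal charge": (0.0, 0.9, 0.8, 0.1),
    "accusation":      (0.1, 0.8, 0.9, 0.0),
}

print(cosine(usages["criminal charge"], usages["accusation"]))      # high: close in space
print(cosine(usages["criminal charge"], usages["battery charge"]))  # low: far apart
```

No fixed sense inventory is needed: each use of a word is simply nearer some uses and farther from others.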

The most recent paper: Montague Meets Markov: Deep Semantics with Probabilistic Logical Form (2013)

Abstract:

We combine logical and distributional representations of natural language meaning by transforming distributional similarity judgments into weighted inference rules using Markov Logic Networks (MLNs). We show that this framework supports both judging sentence similarity and recognizing textual entailment by appropriately adapting the MLN implementation of logical connectives. We also show that distributional phrase similarity, used as textual inference rules created on the fly, improves its performance.

NLP Weather: High Pressure or Low?

Tuesday, June 4th, 2013

Machine Translation Without the Translation by Geoffrey Pullum.

From the post:

I have been ruminating this month on why natural language processing (NLP) still hasn’t arrived, and I have pointed to three developments elsewhere that seem to be discouraging its development. First, enhanced keyword search via Google’s influentiality-ranking of results. Second, the dramatic enhancement in applicability of speech recognition that dialog design facilitates. I now turn to a third, which has to do with the sheer power of number-crunching.

Machine translation is the unclimbed Everest of computational linguistics. It calls for syntactic and semantic analysis of the source language, mapping source-language meanings to target-language meanings, and generating acceptable output from the latter. If computational linguists could do all those things, they could hang up the “mission accomplished” banner.

What has emerged instead, courtesy of Google Translate, is something utterly different: pseudotranslation without analysis of grammar or meaning, developed by people who do not know (or need to know) either source or target language.

The trick: huge quantities of parallel texts combined with massive amounts of rapid statistical computation. The catch: low quality, and output inevitably peppered with howlers.

Of course, if I may purloin Dr Johnson’s remark about a dog walking on his hind legs, although it is not done well you are surprised to find it done at all. For Google Translate’s pseudotranslation is based on zero linguistic understanding. Not even word meanings are looked up: The program couldn’t care less about the meaning of anything. Here, roughly, is how it works.

(…)

My conjecture is that it is useful enough to constitute one more reason for not investing much in trying to get real NLP industrially developed and deployed.

NLP will come, I think; but when you take into account the ready availability of (1) Google search, and (2) speech-driven applications aided by dialog design, and (3) the statistical pseudotranslation briefly discussed above, the cumulative effect is enough to reduce the pressure to develop NLP, and will probably delay its arrival for another decade or so.
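Pullum's “huge quantities of parallel texts combined with massive amounts of rapid statistical computation” reduces, in caricature, to relative-frequency counts over aligned phrase pairs. A toy sketch of my own (single-word phrases, hand-made counts):

```python
from collections import Counter, defaultdict

# Toy aligned phrase pairs "harvested" from a parallel corpus.
pairs = [("chat", "cat"), ("chat", "cat"), ("chat", "chat room"),
         ("chien", "dog"), ("chien", "dog"), ("chien", "hound")]

counts = defaultdict(Counter)
for src, tgt in pairs:
    counts[src][tgt] += 1

def best_translation(src):
    """Pick the target phrase with the highest relative frequency;
    no grammar, no meaning, just counting."""
    tgt, n = counts[src].most_common(1)[0]
    return tgt, n / sum(counts[src].values())

print(best_translation("chat"))  # ('cat', 2/3)
```

Scale the counts up by many orders of magnitude, add a language model to smooth the output, and you have the rough shape of statistical pseudotranslation, howlers included.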

I’m surprised to find that Geoffrey thinks more pressure would result in “real NLP,” albeit delayed by a decade or so for the reasons outlined in his post.

If you recall, machine translation of texts was the hot topic of the late 1950s and early 1960s.

The emphasis was on automatic translation of Russian. It was the height of the Cold War, so there was lots of pressure for a solution.

Lots of pressure then did not result in a solution.

There’s a rather practical reason for not investing in “real NLP.”

There is no evidence that how humans “understand” language is known well enough to program a computer to mimic that “understanding.”

If Geoffrey has evidence to the contrary, I am sure everyone would be glad to hear about it.
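The "trick" Pullum describes, parallel texts plus brute statistics, can be sketched in miniature. The toy below is my own illustration, not Google's system: it counts source/target word co-occurrences in aligned sentence pairs and emits, for each source word, the target word it co-occurs with most distinctively. No grammar, no meaning, exactly as advertised. The French-English pairs are invented sample data.

```python
from collections import Counter, defaultdict

def build_model(pairs):
    """Count source/target word co-occurrences in aligned sentence pairs."""
    cooc = defaultdict(Counter)   # source word -> target word counts
    totals = Counter()            # overall target word counts
    for src, tgt in pairs:
        for t in tgt.split():
            totals[t] += 1
            for s in src.split():
                cooc[s][t] += 1
    return cooc, totals

def pseudotranslate(sentence, cooc, totals):
    """Pick, for each source word, the target word it co-occurs with most
    distinctively. No syntax, no semantics -- howlers guaranteed at scale."""
    out = []
    for word in sentence.split():
        if word in cooc:
            out.append(max(cooc[word], key=lambda t: cooc[word][t] / totals[t]))
        else:
            out.append(word)  # unknown words pass through untranslated
    return " ".join(out)

# Invented toy parallel corpus; real systems use billions of words.
pairs = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
]
cooc, totals = build_model(pairs)
print(pseudotranslate("le chien mange", cooc, totals))  # -> the dog eats
```

On three sentences this looks clever; Pullum's point is that at corpus scale the same zero-understanding procedure produces usable, howler-peppered output.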

Massively Parallel Suffix Array Queries…

Saturday, May 11th, 2013

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs by Hua He, Jimmy Lin, Adam Lopez.

Abstract:

Translation models can be scaled to large corpora and arbitrarily-long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general purpose graphics processing units (GPUs), which enable suffix array queries for phrase lookup and phrase extractions to be massively parallelized. Our open-source implementation improves the speed of a highly-optimized, state-of-the-art serial CPU-based implementation by at least an order of magnitude. In a Chinese-English translation task, our GPU implementation extracts translation tables from approximately 100 million words of parallel text in less than 30 milliseconds.

If you think about topic maps as mapping the identification of a subject in multiple languages to a single representative, then the value of translation software becomes obvious.

You may or may not, depending upon project requirements, want to rely solely on automated mappings of phrases.

Whether you use automated mapping of phrases as an “assist” to or as a sanity check on human curation, this work looks very interesting.
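The core data structure in the paper, a suffix array queried for source phrases, can be sketched without the GPU machinery. The sketch below is my own CPU-only illustration with an invented toy corpus; the paper's contribution is massively parallelizing exactly these lookups.

```python
def build_suffix_array(tokens):
    """All suffix start positions, sorted by the token sequence they begin.
    (O(n^2 log n) toy construction; production systems use faster algorithms.)"""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_phrase(tokens, sa, phrase):
    """Binary-search the suffix array for every occurrence of `phrase`."""
    n = len(phrase)
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound: first suffix >= phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # upper bound: first suffix > phrase
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= phrase:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])         # start positions in corpus order

corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
print(find_phrase(corpus, sa, "the cat".split()))  # -> [0, 6]
```

Each lookup is two binary searches, which is why the authors can afford to extract phrase tables on demand rather than precomputing them.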

Translation Memory

Tuesday, December 6th, 2011

Translation Memory

As we mentioned in Teaching Etsy to Speak a Second Language, developers need to tag English content so it can be extracted and then translated. Since we are a company with a continuous deployment development process, we do this on a daily basis and as a result get a significant number of new messages to be translated along with changes or deletions of existing ones that have already been translated. Therefore we needed some kind of recollection system to easily reuse or follow the style of existing translations.

A translation memory is an organized collection of text extracted from a source language with one or more matching translations. A translation memory system stores this data and makes it easily accessible to human translators in order to assist with their tasks. There’s a variety of translation memory systems and related standards in the language industry. Yet, the nature of our extracted messages (containing relevant PHP, Smarty, and JavaScript placeholders) and our desire to maintain a translation style curated by a human language manager made us develop an in-house solution.

Interesting yes?
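The behavior the Etsy post describes, storing extracted strings with their translations and recalling them for human translators, can be sketched in a few lines. This is my own minimal illustration, not Etsy's in-house system, and the strings are invented: exact lookup first, then a fuzzy match so a lightly edited message still surfaces its old translation.

```python
import difflib

class TranslationMemory:
    """Minimal translation memory: exact lookup, then fuzzy match."""

    def __init__(self):
        self.entries = {}  # source string -> stored translation

    def add(self, source, translation):
        self.entries[source] = translation

    def lookup(self, source, cutoff=0.6):
        """Return (translation, similarity) or (None, 0.0) if nothing is close."""
        if source in self.entries:
            return self.entries[source], 1.0
        close = difflib.get_close_matches(source, list(self.entries),
                                          n=1, cutoff=cutoff)
        if close:
            ratio = difflib.SequenceMatcher(None, source, close[0]).ratio()
            return self.entries[close[0]], ratio
        return None, 0.0

tm = TranslationMemory()
tm.add("Add to cart", "Ajouter au panier")
tm.add("Sign in", "Se connecter")
print(tm.lookup("Add to cart"))       # exact hit, similarity 1.0
print(tm.lookup("Add to your cart"))  # fuzzy hit against the stored entry
```

The fuzzy hit is the "recollection" part: a translator sees the earlier translation and its similarity score, and keeps the house style without retranslating from scratch.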

What if the title of my post were identification memory?

There is not much difference between translating language to language and mapping identification to identification, when we are talking about the same subject.

Hardly any difference at all when you think about it.
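The analogy can be made concrete. The sketch below is my own toy "identification memory": it maps the identifiers different sources use onto a single subject, so that two records can be recognized as being about the same thing. All identifiers shown are invented examples.

```python
class IdentificationMemory:
    """Map the varied identifiers sources use for a subject onto one proxy."""

    def __init__(self):
        self.to_subject = {}   # identifier -> canonical subject id
        self.identifiers = {}  # canonical subject id -> set of identifiers

    def record(self, subject, *ids):
        """Register one or more identifiers as naming the same subject."""
        for ident in ids:
            self.to_subject[ident] = subject
            self.identifiers.setdefault(subject, set()).add(ident)

    def same_subject(self, a, b):
        """True if both identifiers are known and name the same subject."""
        return (a in self.to_subject and b in self.to_subject
                and self.to_subject[a] == self.to_subject[b])

# Invented identifiers: three systems naming the same customer.
im = IdentificationMemory()
im.record("subject:acme-customer-42", "CUST-42", "acct_00042", "42@crm")
print(im.same_subject("CUST-42", "42@crm"))  # same subject, differently named
```

Like a translation memory, the value grows over time: every mapping recorded now is one you will not have to rediscover when the next data source arrives.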

I am sure your current vendors will assure you their methods of identification are the best, and they may be right. But on the other hand, they may also be wrong.

And there is always the issue of other data sources that have chosen to identify the same subjects differently. Like your own company, say, five years from now. Preparing now for that "translation" project in the not-too-distant future may save you from losing critical information down the road.

Preserving access to critical data is a form of translation memory. Yes?