I don’t often read praise for conservative Bible translations, but they can have unexpected uses:
Linguists use the Bible to develop language technology for small languages reports:
…
Anders Søgaard and his colleagues from the project LOWLANDS: Parsing Low-Resource Languages and Domains are utilising the texts which were annotated for big languages to develop language technology for smaller languages, the key to which is to find translated texts so that the researchers can transfer knowledge of one language’s grammar onto another language: “The Bible has been translated into more than 1,500 languages, even the smallest and most ‘exotic’ ones, and the translations are extremely conservative; the verses have a completely uniform structure across the many different languages which means that we can make suitable computer models of even very small languages where we only have a couple of hundred pages of biblical text,” Anders Søgaard says and elaborates:
“We teach the machines to register what is translated with what in the different translations of biblical texts, which makes it possible to find so many similarities between the annotated and unannotated texts that we can produce exact computer models of 100 different languages — languages such as Swahili, Wolof and Xhosa that are spoken in Nigeria. And we have made these models available for other developers and researchers. This means that we will be able to develop language technology resources for these languages similar to those which speakers of languages such as English and French have.”
Anders Søgaard and his colleagues have recently presented their results in the article ‘If all you have is a bit of the Bible’ at the conference Annual Meeting of the Association for Computational Linguistics.
…
The abstract for the paper: If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages reads:
We present a simple method for learning part-of-speech taggers for languages like Akawaio, Aukan, or Cakchiquel – languages for which nothing but a translation of parts of the Bible exists. By aggregating over the tags from a few annotated languages and spreading them via word-alignment on the verses, we learn POS taggers for 100 languages, using the languages to bootstrap each other. We evaluate our cross-lingual models on the 25 languages where test sets exist, as well as on another 10 for which we have tag dictionaries. Our approach performs much better (20-30%) than state-of-the-art unsupervised POS taggers induced from Bible translations, and is often competitive with weakly supervised approaches that assume high-quality parallel corpora, representative monolingual corpora with perfect tokenization, and/or tag dictionaries. We make models for all 100 languages available.
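The core idea in the abstract, spreading tags from annotated languages onto an unannotated one through word alignments and then aggregating, can be illustrated with a toy sketch. This is not the authors' implementation: the word alignments are assumed to be given already (in practice they would come from a word aligner run over the parallel verses), the function names `project_tags` and `aggregate_votes` are hypothetical, and the tags and alignments are invented for illustration.

```python
from collections import Counter

def project_tags(source_tags, alignment, target_len):
    """Turn one annotated source verse into per-token tag votes for the target verse.

    alignment is a list of (source_index, target_index) pairs (assumed given).
    """
    votes = [Counter() for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    return votes

def aggregate_votes(vote_lists):
    """Sum the votes from several source languages and keep the majority tag.

    Target tokens that no source word aligned to get None.
    """
    target_len = len(vote_lists[0])
    merged = [Counter() for _ in range(target_len)]
    for votes in vote_lists:
        for i, counter in enumerate(votes):
            merged[i].update(counter)
    return [c.most_common(1)[0][0] if c else None for c in merged]

# One target verse with four tokens in an unannotated language (toy data).
target_len = 4

# Three annotated source translations of the same verse: (tags, alignment).
sources = [
    (["NOUN", "VERB", "NOUN"], [(0, 0), (1, 1), (2, 2)]),
    (["NOUN", "VERB", "ADJ"],  [(0, 0), (1, 1), (2, 2)]),  # one noisy tag
    (["VERB", "NOUN", "NOUN"], [(0, 1), (1, 0), (2, 2)]),  # different word order
]

vote_lists = [project_tags(tags, align, target_len) for tags, align in sources]
predicted = aggregate_votes(vote_lists)
print(predicted)
```

Voting across several source languages lets a single noisy projection (the `ADJ` above) be outvoted, and the last target token stays untagged because nothing aligned to it. The projected tags would then serve as training data for a tagger in the target language.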
All of the resources used in this project, along with their models, can be found at: https://bitbucket.org/lowlands/
Don’t forget conservative Bible translations if you are building linguistic models.