Archive for the ‘Greek’ Category

Tired of Chasing Ephemera? Open Greek and Latin Design Sprint (bids in August, 2017)

Thursday, July 27th, 2017

Tired of reading/chasing the ephemera explosion in American politics?

I’ve got an opportunity for you to contribute to a project with texts preserved by hand for thousands of years!

Design Sprint for Perseus 5.0/Open Greek and Latin

From the webpage:

We announced in June that Center for Hellenic Studies had signed a contract with Intrepid.io to conduct a design sprint that would support Perseus 5.0 and the Open Greek and Latin collection that it will include. Our goal was to provide a sample model for a new interface that would support searching and reading of Greek, Latin, and other historical languages. The report from that sprint was handed over to CHS yesterday and we, in turn, have made these materials available, including both the summary presentation and associated materials. The goal is to solicit comment and to provide potential applicants to the planned RFP with access to this work as soon as possible.

The sprint took just over two weeks and was an intensive effort. An evolving Google Doc with commentary on the Intrepid Wrap-up slides for the Center for Hellenic studies should now be visible. Readers of the report will see that questions remain to be answered. How will we represent Perseus, Open Greek and Latin, Open Philology, and other efforts? One thing that we have added and that will not change will be the name of the system that this planned implementation phase will begin: whether it is Perseus, Open Philology or some other name, it will be powered by the Scaife Digital Library Viewer, a name that commemorates Ross Scaife, pioneer of Digital Classics and a friend whom many of us will always miss.

The Intrepid report also includes elements that we will wish to develop further — students of Greco-Roman culture may not find “relevance” a helpful way to sort search reports. The Intrepid Sprint greatly advanced our own thinking and provided us with a new starting point. Anyone may build upon the work presented here — but they can also suggest alternate approaches.

The core deliverables form an impressive list:

At the moment we would summarize core deliverables as:

  1. A new reading environment that captures the basic functionality of the Perseus 4.0 reading environment but that is more customizable and that can be localized efficiently into multiple modern languages, with Arabic, Persian, German and English as the initial target languages. The overall Open Greek and Latin team is, of course, responsible for providing the non-English content. The Scaife DL Viewer should make it possible for us to localize into multiple languages as efficiently as possible.
  2. The reading environment should be designed to support any CTS-compliant collection and should be easily configured with a look and feel for different collections.
  3. The reading environment should contain a lightweight treebank viewer — we don’t need to support editing of treebanks in the reading environment. The functionality that the Alpheios Project provided for the first book of the Odyssey would be more than adequate. Treebanks are available under the label “diagram” when you double-click on a Greek word.
  4. The reading environment should support dynamic word/phrase level alignments between source text and translation(s). Here again, the The functionality that the Alpheios Project provided for the first book of the Odyssey would be adequate. More recent work implementing this functionality is visible at Tariq Yousef’s work at http://divan-hafez.com/ and http://ugarit.ialigner.com/.
  5. The system must be able to search for both specific inflected forms and for all forms of a particular word (as in Perseus 4.0) in CTS-compliant epiDoc TEI XML. The search will build upon the linguistically analyzed texts available in https://github.com/gcelano/CTSAncientGreekXML. This will enable searching by dictionary entry, by part of speech, and by inflected form. For Greek, the base collection is visible at the First Thousand Years of Greek website (which now has begun to accumulate a substantial amount of later Greek). CTS-compliant epiDoc Latin texts can be found at https://github.com/OpenGreekAndLatin/csel-dev/tree/master/data and https://github.com/PerseusDL/canonical-latinLit/tree/master/data.
  6. The system should ideally be able to search Greek and Latin that is available only as uncorrected OCR-generated text in hOCR format. Here the results may follow the image-front strategy familiar to academics from sources such as Jstor. If it is not feasible to integrate this search within the three months of core work, then we need a plan for subsequent integration that Leipzig and OGL members can implement later.
  7. The new system must be scalable and updating from Lucene to Elasticsearch is desirable. While these collections may not be large by modern standards, they are substantial. Open Greek and Latin currently has c. 67 million words of Greek and Latin at various stages of post-processing and c. 90 million words of addition translations from Greek and Latin into English,French, German and Italian, while the Lace Greek OCR Project has OCR-generated text for 1100 volumes.
  8. The system integrate translations and translation alignments into the searching system, so that users can search either in the original or in modern language translations where we provide this data. This goes back to work by David Bamman in the NEH-funded Dynamic Lexicon Project (when he was a researcher at Perseus at Tufts). For more recent examples of this, see http://divan-hafez.com/ and Ugarit. Note that one reason to adopt CTS URNs is to simplify the task of display translations of source texts — the system is only responsible for displaying translations insofar as they are available via the CTS API.
  9. The system must provide initial support for a user profile. One benefit of the profile is that users will be able to define their own reading lists — and the Scaife DL Viewer will then be able to provide personalized reading support, e.g., word X already showed up in your reading at places A, B, and C, while word Y, which is new to you, will appear 12 times in the rest of your planned readings (i.e., you should think about learning that word). By adopting the CTS data model, we can make very precise reading lists, defining precise selections from particular editions of particular works. We also want to be able to support an initial set of user contributions that are (1) easy to implement technically and (2) easy for users to understand and perform. Thus we would support fixing residual data entry errors, creating alignments between source texts and translations, improving automated part of speech tagging and lemmatization but users would go to external resources to perform more complex tasks such as syntactic markup (treebanking).
  10. We would welcome a bids that bring to bear expertise in the EPUB format and that could help develop a model for representing for representing CTS-compliant Greek and Latin sources in EPUB as a mechanism to make these materials available on smartphones. We can already convert our TEI XML into EPUB. The goal here is to exploit the easiest ways to optimize the experience. We can, for example, convert one or more of our Greek and Latin lexica into the EPUB Dictionary format and use our morphological analyses to generate links from particular forms in a text to the right dictionary entry or entries. Can we represent syntactically analyzed sentences with SVG? Can we include dynamic translation alignments?
  11. Bids should consider including a design component. We were very pleased with the Design Sprint that took place in July 2017 and would like to include a follow-up Design Sprint in early 2018 that will consider (1) next steps for Greek and Latin and (2) generalizing our work to other historical languages. This Design Sprint might well go to a separate contractor (thus providing us also with a separate point of view on the work done so far).
  12. Work must be build upon the Canonical Text Services Protocol. Bids should be prepared to build upon https://github.com/Capitains, but should also be able to build upon other CTS servers (e.g., https://github.com/ThomasK81/LightWeightCTSServer and cts.informatik.uni-leipzig.de).
  13. All source code must be available on Github under an appropriate open license so that third parties can freely reuse and build upon it.
  14. Source code must be designed and documented to facilitate actual (not just legally possible) reuse.
  15. The contractor will have the flexibility to get the job done but will be expected to work as closely as possible with, and to draw wherever possible upon the on-going work done by, the collaborators who are contributing to Open Greek and Latin. The contractor must have the right to decide how much collaboration makes sense.

You can use your data science skills to sell soap, cars, ED treatments, or even apocalyptically narcissistic politicians, or, you can advance Perseus 5.0.

Your call.

An Initial Reboot of Oxlos

Tuesday, April 18th, 2017

An Initial Reboot of Oxlos by James Tauber.

From the post:

In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.

Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.

The original project was called oxlos after the betacode transcription of the Greek word for “crowd”, a nod to “crowd-sourcing”. Work didn’t continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.

At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you’d like added to the site).
… (emphasis in the original)

Crowd sourcing, parse code errors in the CATSS LXX morphological analysis, Patrick Altman and James Tauber! What’s more could you ask for!

Well, assuming you enjoy Django development, https://github.com/jtauber/oxlos2 or have Greek morphology, sign up at: http://oxlos.org/.

After mastering Greek, you don’t really want to lose it from lack of practice. Yes? Perfect opportunity for recent or even not so recent Classics and divinity majors.

I suppose that’s a nice way to say you won’t be encountering LXX Greek on ESPN or CNN. 😉

New MorphGNT Releases and Accentuation Analysis

Thursday, February 16th, 2017

New MorphGNT Releases and Accentuation Analysis by James Tauber.

From the post:

Back in 2015, I talked about Annotating the Normalization Column in MorphGNT. This post could almost be considered Part 2.

I recently went back to that work and made a fresh start on a new repo gnt-accentuation intended to explain the accentuation of each word in the GNT (and eventually other Greek texts). There’s two parts to that: explaining why the normalized form is accented the way it but then explaining why the word-in-context might be accented differently (clitics, etc). The repo is eventually going to do both but I started with the latter.

My goal with that repo is to be part of the larger vision of an “executable grammar” I’ve talked about for years where rules about, say, enclitics, are formally written up in a way that can be tested against the data. This means:

  • students reading a rule can immediately jump to real examples (or exceptions)
  • students confused by something in a text can immediately jump to rules explaining it
  • the correctness of the rules can be tested
  • errors in the text can be found

It is the fourth point that meant that my recent work uncovered some accentuation issues in the SBLGNT, normalization and lemmatization. Some of that has been corrected in a series of new releases of the MorphGNT: 6.08, 6.09, and 6.10. See https://github.com/morphgnt/sblgnt/releases for details of specifics. The reason for so many releases was I wanted to get corrections out as soon as I made them but then I found more issues!

There are some issues in the text itself which need to be resolved. See the Github issue https://github.com/morphgnt/sblgnt/issues/52 for details. I’d very much appreciate people’s input.

In the meantime, stay tuned for more progress on gnt-accentuation.

Was it random chance that I saw this announcement from James and Getting your hands dirty with the Digital Manuscripts Toolkit on the same day?

😉

I should mention that Codex Sinaiticus (second oldest witness to the Greek New Testament) and numerous other Greek NT manuscripts have been digitized by the British Library.

Paring these resources together offers a great opportunity to discover the Greek NT text as choices made by others. (Same holds true for the Hebrew Bible as well.)

greek-accentuation 1.0.0 Released

Thursday, July 28th, 2016

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it sucessfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, code, rime or body of a syllable, judge syllable length or give you the accentuation class of word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

Modelling Stems and Principal Part Lists (Attic Greek)

Friday, June 17th, 2016

Modelling Stems and Principal Part Lists by James Tauber.

From the post:

This is part 0 of a series of blog posts about modelling stems and principal part lists, particularly for Attic Greek but hopefully more generally applicable. This is largely writing up work already done but I’m doing cleanup as I go along as well.

A core part of the handling of verbs in the Morphological Lexicon is the set of terminations and sandhi rules that can generate paradigms attested in grammars like Louise Pratt’s The Essentials of Greek Grammar. Another core part is the stem information for a broader range of verbs usually conveyed in works like Pratt’s in the form of lists of principal parts.

A rough outline of future posts is:

  • the sources of principal part lists for this work
  • lemmas in the Pratt principal parts
  • lemma differences across lists
  • what information is captured in each of the lists individually
  • how to model a merge of the lists
  • inferring stems from principal parts
  • stems, terminations and sandhi
  • relationships between stems
  • ???

I’ll update this outline with links as posts are published.

(emphasis in original)

A welcome reminder of projects that transcend the ephemera that is social media.

Or should I say “modern” social media?

The texts we parse so carefully were originally spoken, recorded and copied, repeatedly, without the benefit of modern reference grammars and/or dictionaries.

Enjoy!