Archive for the ‘Language’ Category

Comparing Comprehensive English Grammars?

Thursday, February 22nd, 2018

Neal Goldfarb, in SCOTUS cites CGEL (Props to Justice Gorsuch and the Supreme Court library), highlights two comprehensive grammars of English.

Both are known by the initials CGEL:

Being the more recent work, the Cambridge Grammar of the English Language lists today for $279.30 (1860 pages), whereas Quirk’s 1985 Comprehensive Grammar of the English Language can be had for $166.08 (1779 pages).

Interesting fact: the acronym CGEL was in use for 17 years by the Comprehensive Grammar of the English Language before the Cambridge Grammar of the English Language was published under the same initials.

Curious how much new information the Cambridge grammar added? If you had machine-readable texts of both, excluded the examples, and then calculated the semantic distance between sections covering the same material, you could produce a measurement of the distance between the two texts.
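A rough sketch of that comparison in Python, using TF-IDF cosine similarity as a crude stand-in for semantic distance (the two section texts below are invented placeholders):

# Sketch: distance between matched sections of two grammars.
# TF-IDF cosine similarity is a crude proxy for semantic distance;
# swap in sentence embeddings for a more "semantic" measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def section_distance(section_a, section_b):
    # Vectorize the two sections together; distance = 1 - cosine similarity.
    vectors = TfidfVectorizer(stop_words="english").fit_transform([section_a, section_b])
    return 1.0 - cosine_similarity(vectors[0], vectors[1])[0, 0]

print(section_distance(
    "The passive is formed with a form of be plus the -ed participle.",
    "Passive constructions combine be with the past participle."))

Averaged over all matched section pairs, that yields a single, repeatable number for how far apart the two grammars are.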

Given the prices of academic texts, standardizing a method of comparison would be a boon to scholars and graduate students!

(No comment on the over-writing of the acronym for Quirk’s work by Cambridge.)

New Draft Morphological Tags for MorphGNT

Monday, February 5th, 2018

New Draft Morphological Tags for MorphGNT by James Tauber.

From the post:

At least going back to my initial collaboration with Ulrik Sandborg-Petersen in 2005, I’ve been thinking about how I would do morphological tags in MorphGNT if I were starting from scratch.

Much later, in 2014, I had some discussions with Mike Aubrey at my first SBL conference and put together a straw proposal. There was a rethinking of some parts-of-speech, handling of tense/aspect, handling of voice, handling of syncretism and underspecification.

Even though some of the ideas were more drastic than others, a few things have remained consistent in my thinking:

  • there is value in a purely morphological analysis that doesn’t disambiguate on syntactic or semantic grounds
  • this analysis does not need the notion of parts-of-speech beyond purely Morphological Parts of Speech
  • this analysis should not attempt to distinguish middles and passives in the present or perfect system

As part of the handling of syncretism and underspecification, I had originally suggested a need for a value for the case property that didn’t distinguish nominative and accusative and a need for a value for the gender property like “non-neuter”.
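As a minimal sketch of how such underspecified values might be modeled (the value names here are my assumptions, not Tauber's scheme):

# Underspecified values name the set of fully specified values they cover:
# "nom/acc" collapses nominative and accusative; "non-neut" excludes only
# the neuter. (Both value names are assumptions for illustration.)
UNDERSPECIFIED = {
    "nom/acc": {"nom", "acc"},
    "non-neut": {"masc", "fem"},
}

def matches(tag_value, query):
    # A tag value matches a query if it equals it or covers it.
    return query == tag_value or query in UNDERSPECIFIED.get(tag_value, set())

print(matches("nom/acc", "nom"))     # True: could be nominative
print(matches("non-neut", "neut"))   # False: neuter is ruled out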

If you are interested in language encoding, Biblical Greek, or morphology, Tauber has a project for you!

Be forewarned: what you tag has a great deal to do with what you can and will see.


Getting Started with Python/CLTK for Historical Languages

Friday, January 12th, 2018

Getting Started with Python/CLTK for Historical Languages by Patrick J. Burns.

From the post:

This is an ongoing project to collect online resources for anybody looking to get started with working with Python for historical languages, esp. using the Classical Language Toolkit. If you have suggestions for this list, email me at patrick[at]diyclassics[dot]org.
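Before diving into Burns’ list, a minimal taste of CLTK itself may help. This sketch assumes the CLTK 1.x NLP pipeline API; the interface has changed between releases, so check the current documentation:

# A minimal sketch, assuming the CLTK 1.x pipeline API.
from cltk import NLP

nlp = NLP(language="lat")  # build the default Latin pipeline
doc = nlp.analyze(text="Gallia est omnis divisa in partes tres")
for word in doc.words:
    print(word.string, word.lemma, word.pos)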

What classic or historical language resources would you recommend?

Unicode Egyptian Hieroglyphic Fonts

Monday, October 16th, 2017

Unicode Egyptian Hieroglyphic Fonts by Bob Richmond.

From the webpage:

These fonts all contain the Unicode 5.2 (2009) basic set of Egyptian Hieroglyphs.

Please contact me if you know of any others, or information to include.

Also of interest:

UMdC Coding Manual for Egyptian Hieroglyphic in Unicode

UMdC (Unicode MdC) aims to provide guidelines for encoding Egyptian Hieroglyphic and related scripts in Unicode using plain text with optional lightweight mark-up.

This GitHub project is the central point for development of UMdC and associated resources. Features of UMdC are still in a discussion phase so everything here should be regarded as preliminary and subject to change. As such the project is initially oriented towards expert Egyptologists and software developers who wish to help ensure the ancient Egyptian writing system is well supported in modern digital media.

The Manuel de Codage (MdC) system for digital encoding of Ancient Egyptian textual data was adopted as an informal standard in the 1980s and has formed the basis for most subsequent digital encodings, sometimes using extensions or revisions to the original scheme. UMdC links to the traditional methodology in various ways to help with the transition to Unicode-based solutions.

As with the original MdC system, UMdC data files (.umdc) can be viewed and edited in standard text editors (such as Windows Notepad) and the HTML <textarea></textarea> control. Specialist software applications can be adapted or developed to provide a simpler workflow or enable additional techniques for working with the material.

Also see UMdC overview [pdf].

A UMdC-compatible hieroglyphic font Aaron UMdC Alpha (relative to the current draft) can be downloaded from the Hieroglyphs Everywhere Fonts project.

For news and information on Ancient Egyptian in Unicode see

I understand the need for “plain text” viewing of hieroglyphics, especially for primers and possibly for search engines, but Egyptian hieroglyphs can be written facing right or left, top to bottom and more rarely bottom to top. Moreover, artistic and other considerations can result in transposition of glyphs out of their “linear” order in a Western reading sense.

Unicode hieroglyphs are a major step forward for the interchange of hieroglyphic texts, but we should remain mindful that the “linear” presentation of inscription texts is a far cry from their originals.

The greater our capacity for graphic representation, the more we simplify complex representations from the past. Are the needs of our computers really that important?

Machine Translation and Automated Analysis of Cuneiform Languages

Monday, October 2nd, 2017

Machine Translation and Automated Analysis of Cuneiform Languages

From the webpage:

The MTAAC project develops and applies new computerized methods to translate and analyze the contents of some 67,000 highly standardized administrative documents from southern Mesopotamia (ancient Iraq) from the 21st century BC. Our methodology, which combines machine learning with statistical and neural machine translation technologies, can then be applied to other ancient languages. This methodology, the translations, and the historical, social and economic data extracted from them, will be offered to the public in open access.

A recently funded (March 2017) project that strikes a number of resonances with me!

“Open access” and cuneiform isn’t an unheard of combination but many remember when access to cuneiform primary materials was a matter of whim and caprice. There are dark pockets where such practices continue but projects like MTAAC are hard on their heels.

The use of machine learning and automated analysis has the potential, once all extant cuneiform texts are available (multiple projects such as this one are working toward that), to provide a firm basis for grammars, lexicons, and translations.

Do read Machine Translation and Automated Analysis of the Sumerian Language by Émilie Pagé-Perron, Maria Sukhareva, Ilya Khait, and Christian Chiarcos for more details about the project.

There’s more to data science than taking advantage of sex-starved neurotics with under five second attention spans and twitchy mouse fingers.

NLP tools for East Asian languages

Thursday, September 28th, 2017

NLP tools for East Asian languages

CLARIN is building a list of NLP tools for East Asian languages.

Oh, sorry:

CLARIN – European Research Infrastructure for Language Resources and Technology

CLARIN makes digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines, especially in the humanities and social sciences, through single sign-on access. CLARIN offers long-term solutions and technology services for deploying, connecting, analyzing and sustaining digital language data and tools. CLARIN supports scholars who want to engage in cutting edge data-driven research, contributing to a truly multilingual European Research Area.

CLARIN stands for “Common Language Resources and Technology Infrastructure”.

Contribute to the spreadsheet of NLP tools and enjoy the CLARIN website.

Syntacticus – Early Indo-European Languages

Saturday, September 23rd, 2017


From the about page:

Syntacticus provides easy access to around a million morphosyntactically annotated sentences from a range of early Indo-European languages.

Syntacticus is an umbrella project for the PROIEL Treebank, the TOROT Treebank and the ISWOC Treebank, which all use the same annotation system and share similar linguistic priorities. In total, Syntacticus contains 80,138 sentences or 936,874 tokens in 10 languages.

We are constantly adding new material to Syntacticus. The ultimate goal is to have a representative sample of different text types from each branch of early Indo-European. We maintain lists of texts we are working on at the moment, which you can find on the PROIEL Treebank and the TOROT Treebank pages, but this is extremely time-consuming work so please be patient!

The focus for Syntacticus at the moment is to consolidate and edit our documentation so that it is easier to approach. We are very aware that the current documentation is inadequate! But new features and better integration with our development toolchain are also on the horizon in the near future.

Language Size
Ancient Greek 250,449 tokens
Latin 202,140 tokens
Classical Armenian 23,513 tokens
Gothic 57,211 tokens
Portuguese 36,595 tokens
Spanish 54,661 tokens
Old English 29,406 tokens
Old French 2,340 tokens
Old Russian 209,334 tokens
Old Church Slavonic 71,225 tokens

The mention of Old Russian should attract attention, given the media frenzy over Russia these days. However, the data at Syntacticus is meaningful, unlike news reports that reflect Western ignorance more often than news.

You may have noticed US reports have moved from guilt by association to guilt by nationality (anyone who is Russian = Putin confidant) and are approaching guilt by proximity (citizen of any country near Russia = Putin puppet).

It’s hard to imagine a political campaign without crimes being committed by someone but traditionally, in law courts anyway, proof precedes a decision of guilt.

Looking forward to competent evidence (that’s legal terminology with a specific meaning), tested in an open proceeding against the elements of defined offenses. That’s a far cry from current discussions.

Tired of Chasing Ephemera? Open Greek and Latin Design Sprint (bids in August, 2017)

Thursday, July 27th, 2017

Tired of reading/chasing the ephemera explosion in American politics?

I’ve got an opportunity for you to contribute to a project with texts preserved by hand for thousands of years!

Design Sprint for Perseus 5.0/Open Greek and Latin

From the webpage:

We announced in June that the Center for Hellenic Studies had signed a contract with Intrepid to conduct a design sprint that would support Perseus 5.0 and the Open Greek and Latin collection that it will include. Our goal was to provide a sample model for a new interface that would support searching and reading of Greek, Latin, and other historical languages. The report from that sprint was handed over to CHS yesterday and we, in turn, have made these materials available, including both the summary presentation and associated materials. The goal is to solicit comment and to provide potential applicants to the planned RFP with access to this work as soon as possible.

The sprint took just over two weeks and was an intensive effort. An evolving Google Doc with commentary on the Intrepid Wrap-up slides for the Center for Hellenic Studies should now be visible. Readers of the report will see that questions remain to be answered. How will we represent Perseus, Open Greek and Latin, Open Philology, and other efforts? One thing that we have added and that will not change will be the name of the system with which this planned implementation phase will begin: whether it is Perseus, Open Philology or some other name, it will be powered by the Scaife Digital Library Viewer, a name that commemorates Ross Scaife, pioneer of Digital Classics and a friend whom many of us will always miss.

The Intrepid report also includes elements that we will wish to develop further — students of Greco-Roman culture may not find “relevance” a helpful way to sort search reports. The Intrepid Sprint greatly advanced our own thinking and provided us with a new starting point. Anyone may build upon the work presented here — but they can also suggest alternate approaches.

The core deliverables form an impressive list:

At the moment we would summarize core deliverables as:

  1. A new reading environment that captures the basic functionality of the Perseus 4.0 reading environment but that is more customizable and that can be localized efficiently into multiple modern languages, with Arabic, Persian, German and English as the initial target languages. The overall Open Greek and Latin team is, of course, responsible for providing the non-English content. The Scaife DL Viewer should make it possible for us to localize into multiple languages as efficiently as possible.
  2. The reading environment should be designed to support any CTS-compliant collection and should be easily configured with a look and feel for different collections.
  3. The reading environment should contain a lightweight treebank viewer — we don’t need to support editing of treebanks in the reading environment. The functionality that the Alpheios Project provided for the first book of the Odyssey would be more than adequate. Treebanks are available under the label “diagram” when you double-click on a Greek word.
  4. The reading environment should support dynamic word/phrase level alignments between source text and translation(s). Here again, the functionality that the Alpheios Project provided for the first book of the Odyssey would be adequate. More recent work implementing this functionality is visible in Tariq Yousef’s work at and
  5. The system must be able to search for both specific inflected forms and for all forms of a particular word (as in Perseus 4.0) in CTS-compliant epiDoc TEI XML. The search will build upon the linguistically analyzed texts available in This will enable searching by dictionary entry, by part of speech, and by inflected form. For Greek, the base collection is visible at the First Thousand Years of Greek website (which now has begun to accumulate a substantial amount of later Greek). CTS-compliant epiDoc Latin texts can be found at and
  6. The system should ideally be able to search Greek and Latin that is available only as uncorrected OCR-generated text in hOCR format. Here the results may follow the image-front strategy familiar to academics from sources such as Jstor. If it is not feasible to integrate this search within the three months of core work, then we need a plan for subsequent integration that Leipzig and OGL members can implement later.
  7. The new system must be scalable and updating from Lucene to Elasticsearch is desirable. While these collections may not be large by modern standards, they are substantial. Open Greek and Latin currently has c. 67 million words of Greek and Latin at various stages of post-processing and c. 90 million words of additional translations from Greek and Latin into English, French, German and Italian, while the Lace Greek OCR Project has OCR-generated text for 1100 volumes.
  8. The system should integrate translations and translation alignments into the searching system, so that users can search either in the original or in modern language translations where we provide this data. This goes back to work by David Bamman in the NEH-funded Dynamic Lexicon Project (when he was a researcher at Perseus at Tufts). For more recent examples of this, see and Ugarit. Note that one reason to adopt CTS URNs is to simplify the task of displaying translations of source texts — the system is only responsible for displaying translations insofar as they are available via the CTS API.
  9. The system must provide initial support for a user profile. One benefit of the profile is that users will be able to define their own reading lists — and the Scaife DL Viewer will then be able to provide personalized reading support, e.g., word X already showed up in your reading at places A, B, and C, while word Y, which is new to you, will appear 12 times in the rest of your planned readings (i.e., you should think about learning that word). By adopting the CTS data model, we can make very precise reading lists, defining precise selections from particular editions of particular works. We also want to be able to support an initial set of user contributions that are (1) easy to implement technically and (2) easy for users to understand and perform. Thus we would support fixing residual data entry errors, creating alignments between source texts and translations, improving automated part of speech tagging and lemmatization but users would go to external resources to perform more complex tasks such as syntactic markup (treebanking).
  10. We would welcome bids that bring to bear expertise in the EPUB format and that could help develop a model for representing CTS-compliant Greek and Latin sources in EPUB as a mechanism to make these materials available on smartphones. We can already convert our TEI XML into EPUB. The goal here is to exploit the easiest ways to optimize the experience. We can, for example, convert one or more of our Greek and Latin lexica into the EPUB Dictionary format and use our morphological analyses to generate links from particular forms in a text to the right dictionary entry or entries. Can we represent syntactically analyzed sentences with SVG? Can we include dynamic translation alignments?
  11. Bids should consider including a design component. We were very pleased with the Design Sprint that took place in July 2017 and would like to include a follow-up Design Sprint in early 2018 that will consider (1) next steps for Greek and Latin and (2) generalizing our work to other historical languages. This Design Sprint might well go to a separate contractor (thus providing us also with a separate point of view on the work done so far).
  12. Work must build upon the Canonical Text Services Protocol. Bids should be prepared to build upon, but should also be able to build upon other CTS servers (e.g., and
  13. All source code must be available on Github under an appropriate open license so that third parties can freely reuse and build upon it.
  14. Source code must be designed and documented to facilitate actual (not just legally possible) reuse.
  15. The contractor will have the flexibility to get the job done but will be expected to work as closely as possible with, and to draw wherever possible upon the on-going work done by, the collaborators who are contributing to Open Greek and Latin. The contractor must have the right to decide how much collaboration makes sense.
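Since so many of these deliverables hinge on the Canonical Text Services protocol, here is a rough sketch of a CTS GetPassage request; the endpoint URL and URN below are illustrative assumptions, so point them at whatever CTS server and edition you actually use:

# Sketch: fetch a passage by CTS URN. Endpoint and URN are assumptions.
import requests
import xml.etree.ElementTree as ET

ENDPOINT = "https://cts.perseids.org/api/cts"                 # assumed CTS endpoint
URN = "urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.7"  # Iliad 1.1-1.7

resp = requests.get(ENDPOINT, params={"request": "GetPassage", "urn": URN})
resp.raise_for_status()

# CTS replies are XML wrapping a TEI fragment; print just the passage text.
root = ET.fromstring(resp.content)
print("".join(root.itertext()).strip())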

You can use your data science skills to sell soap, cars, ED treatments, or even apocalyptically narcissistic politicians, or you can advance Perseus 5.0.

Your call.

Graphing the distribution of English letters towards…

Tuesday, July 11th, 2017

Graphing the distribution of English letters towards the beginning, middle or end of words by David Taylor.

From the post:

(partial image)

Some data visualizations tell you something you never knew. Others tell you things you knew, but didn’t know you knew. This was the case for this visualization.

Many choices had to be made to visually present this essentially semi-quantitative data (how do you compare a 3- and a 13-letter word?). I semi-exhaustively explain everything on my other, geekier blog, prooffreaderplus, and provide the code I used; I’ll just repeat the most crucial point here:

The counts here were generated from the Brown corpus, which is composed of texts printed in 1961.
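The normalization trick for the 3- versus 13-letter-word problem is to map each letter to a relative position between 0 and 1. A rough sketch, assuming NLTK’s copy of the Brown corpus (Taylor’s actual code is on prooffreaderplus):

# Sketch: relative position (0.0 = word start, 1.0 = word end) of each
# letter, so words of different lengths become comparable.
from collections import defaultdict
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

positions = defaultdict(list)
for word in brown.words():
    word = word.lower()
    if word.isalpha() and len(word) > 1:
        for i, ch in enumerate(word):
            positions[ch].append(i / (len(word) - 1))

for letter in "aeq":
    avg = sum(positions[letter]) / len(positions[letter])
    print(letter, round(avg, 2))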

Take Taylor’s post as an inducement to read both Prooffreader Plus and Prooffreader on a regular basis.

A Dictionary of Victorian Slang (1909)

Tuesday, June 20th, 2017

Passing English of the Victorian era, a dictionary of heterodox English, slang and phrase (1909) by J. Redding Ware.

Quoted from the Preface:

HERE is a numerically weak collection of instances of ‘Passing English’. It may be hoped that there are errors on every page, and also that no entry is ‘quite too dull’. Thousands of words and phrases in existence in 1870 have drifted away, or changed their forms, or been absorbed, while as many have been added or are being added. ‘Passing English’ ripples from countless sources, forming a river of new language which has its tide and its ebb, while its current brings down new ideas and carries away those that have dribbled out of fashion. Not only is ‘Passing English’ general; it is local; often very seasonably local. Careless etymologists might hold that there are only four divisions of fugitive language in London: west, east, north and south. But the variations are countless. Holborn knows little of Petty Italia behind Hatton Garden, and both these ignore Clerkenwell, which is equally foreign to Islington proper; in the South, Lambeth generally ignores the New Cut, and both look upon Southwark as linguistically out of bounds; while in Central London, Clare Market (disappearing with the nineteenth century) had, if it no longer has, a distinct fashion in words from its great and partially surviving rival through the centuries, the world of Seven Dials, which is in St Giles’s, St James’s being practically in the next parish. In the East the confusion of languages is a world of ‘variants’; there must be half-a-dozen of Anglo-Yiddish alone, all, however, outgrown from the Hebrew stem. ‘Passing English’ belongs to all the classes, from the peerage class who have always adopted an imperfection in speech or frequency of phrase associated with the court, to the court of the lowest costermonger, who gives the fashion to his immediate entourage.

A healthy reminder that language is no more fixed and unchanging than the people who use it.


How to Help Trump

Wednesday, December 21st, 2016

How to Help Trump by George Lakoff.

From the post:

Without knowing it, many Democrats, progressives and members of the news media help Donald Trump every day. The way they help him is simple: they spread his message.

Think about it: every time Trump issues a mean tweet or utters a shocking statement, millions of people begin to obsess over his words. Reporters make it the top headline. Cable TV panels talk about it for hours. Horrified Democrats and progressives share the stories online, making sure to repeat the nastiest statements in order to refute them. While this response is understandable, it works in favor of Trump.

When you repeat Trump, you help Trump. You do this by spreading his message wide and far.

I know Lakoff from his Women, Fire, and Dangerous Things: What Categories Reveal about the Mind.

I haven’t read any of his “political” books but would buy them sight unseen on the strength of Women, Fire, and Dangerous Things.

Lakoff promises a series of posts using effective framing to “…expose and undermine Trump’s propaganda.”

Whether you want to help expose Trump or use framing to promote your own product or agenda, start following Lakoff today!

Pattern Overloading

Tuesday, December 6th, 2016

Pattern Overloading by Ramsey Nasser.

From the post:

C-like languages have a problem of overloaded syntax that I noticed while teaching high school students. Consider the following snippets in such a language:


foo(x);

function foo(int x) { ... }

for (int i = 0; i < 10; i++) { ... }

if (x > 10) { ... }

case (x) { ... }

A programmer experienced with this family would see

  1. Function invocation
  2. Function definition
  3. Control flow examples

In my experience, new programmers see these constructs as instances of the same idea: name(some-stuff) more-stuff. This is not an unreasonable conclusion to reach. The syntax for each construct is shockingly similar given that their semantics are wildly different.

You won’t be called upon to re-design C but Nasser’s advice:

Syntactic similarity should mirror semantic similarity

Or, to take a quote from the UX world

Similar things should look similar and dissimilar things should look dissimilar

is equally applicable to any syntax that you design.

BBC World Service – In 40 Languages [Non-U.S. Centric Topic Mappers Take Note]

Tuesday, November 15th, 2016

BBC World Service announces biggest expansion ‘since the 1940s’

From the post:

The BBC World Service will launch 11 new language services as part of its biggest expansion “since the 1940s”, the corporation has announced.

The expansion is a result of the funding boost announced by the UK government last year.

The new languages will be Afaan Oromo, Amharic, Gujarati, Igbo, Korean, Marathi, Pidgin, Punjabi, Telugu, Tigrinya, and Yoruba.

The first new services are expected to launch in 2017.

“This is a historic day for the BBC, as we announce the biggest expansion of the World Service since the 1940s,” said BBC director general Tony Hall.

“The BBC World Service is a jewel in the crown – for the BBC and for Britain.

“As we move towards our centenary, my vision is of a confident, outward-looking BBC which brings the best of our independent, impartial journalism and world-class entertainment to half a billion people around the world.


The BBC World Service is the starting place to broaden your horizons.

In English “all shows” lists 1831 shows.

I prefer reading over listening but have resolved to start exploring the world of the BBC.

Green’s Dictionary of Slang [New Commercializing Information Model?]

Friday, October 14th, 2016

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. ad 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

The current rate for individuals is £49 (about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term encompassing a variety of art forms that utilize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soldiers’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will work with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other types of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.

Convertro doesn’t explicitly mention “supercookies,” for which Verizon just paid a $1.35 million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits and used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

Stochastic Terrorism – Usage Prior To January 10, 2011?

Wednesday, August 10th, 2016

With Donald Trump’s remarks today, you know that discussions of stochastic terrorism are about to engulf social media.

Anticipating that, I tried to run down some facts on the usage of “stochastic terrorism.”

As a starting point, Google NGrams comes up with zero (0) examples up to the year 2000.

One blog I found, appropriately named Stochastic Terrorism, has only one post, from January 26, 2011, and may share an author with Stochastic Terrorism: Triggering the shooters (Daily Kos, January 10, 2011) and the closely similar Glenn Beck- Consider yourself on notice *Pictures fixed* (Daily Kos, July 26, 2011). The January 10, 2011 post may be the origin of the phrase.

The Corpus of Contemporary American English, which is complete up to 2015, reports zero (0) hits for “stochastic terrorism.”

NOW Corpus (News on the Web) reports three (3) hits for “stochastic terrorism.”

July 18, 2016 – Salon. All hate is not created equal: The folly of perceiving murderers like Dylann Roof, Micah Johnson and Gavin Long as one and the same by Chauncey DeVega.

Dylann Roof was delusional; his understanding of reality colored by white racial paranoiac fantasies. However, Roof was not born that way. He was socialized into hatred by a right-wing news media that encourages stochastic terrorism among its audience by the repeated use of eliminationist rhetoric, subtle and overt racism against non-whites, conspiracy theories, and reactionary language such as “real America” and “take our country back.”

In case you don’t have the context for Dylann Roof:

Roof is a white supremacist. Driven by that belief, he decided to kill 9 unarmed black people after a prayer meeting in Charleston, South Carolina’s Emanuel AME Church. Roof’s manifesto explains that he wanted to kill black people because white people were “oppressed” in their “own country,” “illegal immigrants” and “Jews” were ruining the United States, and African-Americans are all criminals. Like other white supremacists and white nationalists (and yes, many “respectable” white conservatives as well) Roof’s political and intellectual cosmology is oriented around a belief that white Americans are somehow marginalized or treated badly in the United States. This is perverse and delusional: white people are the most economically and politically powerful racial group in the United States; American society is oriented around the protection of white privilege.

“Stochastic terrorism” occurs twice in:

December 7, 2015 The American Conservative. The Challenge of Lone Wolf Terrorism by Philip Jenkins.

Jenkins covers at length “leaderless resistance:”

Amazingly, the story goes back to the U.S. ultra-Right in the 1980s. Far Rightists and neo-Nazis tried to organize guerrilla campaigns against the U.S. government, which caused some damage but soon collapsed ignominiously. The problem was the federal agencies had these movements thoroughly penetrated, so that every time someone planned an attack, it was immediately discovered by means of either electronic or human intelligence. The groups were thoroughly penetrated by informers.

The collapse of that endeavor led to some serious rethinking by the movement’s intellectual leaders. Extremist theorists now evolved a shrewd if desperate strategy of “leaderless resistance,” based on what they called the “Phantom Cell or individual action.” If even the tightest of cell systems could be penetrated by federal agents, why have a hierarchical structure at all? Why have a chain of command? Why not simply move to a non-structure, in which individual groups circulate propaganda, manuals and broad suggestions for activities, which can be taken up or adapted according to need by particular groups or even individuals?

The phrase stochastic terrorism occurs twice, both in a comment:

Are they leaderless resistance tactics or is this stochastic terrorism? Stochastic terrorism is the use of mass communications/media to incite random actors to carry out violent or terrorist acts that are statistically predictable but individually unpredictable. That is, remote-control murder by lone wolf. This is by no means the sole province of one group.

The thread ends shortly thereafter with no one picking up on the distinction between “leaderless resistance,” and “stochastic terrorism,” if there is one.

I don’t have a publication date for Stochastic Terrorism? by Larry Wohlgemuth (the lack of dating on content is a rant for another day), which says:

Everybody was certain it would happen, and in the wake of the shooting in Tucson last week only the most militant teabagger was able to deny that incendiary rhetoric played a role. We knew this talk of crosshairs, Second Amendment remedies and lock and load eventually would have repercussions, and it did.

Only the most obtuse can deny that, if you talk long enough about picking up a gun and shooting people, marginal personalities and the mentally ill will respond to that suggestion. Feebleminded and disturbed people DO exist, and to believe these words wouldn’t affect them seemed inauthentic at best and criminal at worst.

Now that the unthinkable has happened, people on the left want to shove it down the throats of wingers that are denying culpability. Suddenly, like Manna from heaven, a radical new “meme” was gifted to people intended to buttress their arguments that incendiary rhetoric does indeed result in violent actions.

It begs the question, what is stochastic terrorism, and how does it apply to the shooting in Tucson.

This diary on Daily Kos by a member who calls himself G2geek was posted Monday, January 10, two days after the tragedy in Tucson. It describes in detail the mechanisms whereby “stochastic terrorism” works, and who’s vulnerable to it. Here’s the diarist’s own words in explaining stochastic terrorism:

Which puts the origin of “stochastic terrorism” back at the Daily Kos post of January 10, 2011, Stochastic Terrorism: Triggering the shooters, which appeared two days after U.S. Representative Gabrielle Giffords and eighteen others were shot in Tucson, Arizona.

As of this morning, a popular search engine returns 536 “hits” for “stochastic terrorist,” and 12,300 “hits” for “stochastic terrorism.”

The term “stochastic terrorism” isn’t a popular one; perhaps it isn’t as easy to say as “…lone wolf.”

My concern is the potential use of “stochastic terrorism” to criminalize free speech and to intimidate speakers into self-censorship.

Not to mention that we should write Privilege with a capital P when you can order the deaths of foreign leaders and prosecute anyone who suggests that violence is possible against you. Now that’s Privilege.

Suggestions on further sources?

greek-accentuation 1.0.0 Released

Thursday, July 28th, 2016

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
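A quick sketch of the modules in use; the function names follow the project README at the time, so verify them against the current documentation:

# Sketch of greek-accentuation usage; function names are assumptions
# based on the README.
from greek_accentuation.syllabify import syllabify
from greek_accentuation.accentuation import recessive

print(syllabify("γυναικος"))   # split an (unaccented) word into syllables
print(recessive("γυναικος"))   # add the accent the recessive rule predicts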

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

Sticks and Stones: How Names Work & Why They Hurt

Friday, February 26th, 2016

Sticks and Stones (1): How Names Work & Why They Hurt by Michael Ramscar.

Sticks and Stones (2): How Names Work & Why They Hurt

Sticks and Stones (3): How Names Work & Why They Hurt

From part 1:

In 1781, Christian Wilhelm von Dohm, a civil servant, political writer and historian in what was then Prussia published a two volume work entitled Über die Bürgerliche Verbesserung der Juden (“On the Civic Improvement of Jews”). In it, von Dohm laid out the case for emancipation for a people systematically denied the rights granted to most other European citizens. At the heart of his treatise lay a simple observation: The universal principles of humanity and justice that framed the constitutions of the nation-states then establishing themselves across the continent could hardly be taken seriously until those principles were, in fact, applied universally. To all.

Von Dohm was inspired to write his treatise by his friend, the Jewish philosopher Moses Mendelssohn, who wisely supposed that even though basic and universal principles were involved, there were advantages to be gained in this context by having their implications articulated by a Christian. Mendelssohn’s wisdom is reflected in history: von Dohm’s treatise was widely circulated and praised, and is thought to have influenced the French National Assembly’s decision to emancipate Jews in France in 1791 (Mendelssohn was particularly concerned at the poor treatment of Jews in Alsace), as well as laying the groundwork for an edict that was issued on behalf of the Prussian Government on the 11th of March 1812:

“We, Frederick William, King of Prussia by the Grace of God, etc. etc., having decided to establish a new constitution conforming to the public good of Jewish believers living in our kingdom, proclaim all the former laws and prescriptions not confirmed in this present edict to be abrogated.”

To gain the full rights due to a Prussian citizen, Jews were required to declare themselves to the police within six months of the promulgation of the edict. And following a proposal put forward in von Dohm’s treatise (and later approved by David Friedländer, another member of Mendelssohn’s circle who acted as a consultant in the drawing up of the edict), any Jews who wanted to take up full Prussian citizenship were further required to adopt a Prussian Nachname.

What we call in English, a ‘surname.’

From the vantage afforded by the present day, it is easy to assume that names as we now know them are an immutable part of human history. Since one’s name is ever-present in one’s own life, it might seem that fixed names are ever-present and universal, like mountains, or the sunrise. Yet in the Western world, the idea that everyone should have an official, hereditary identifier is a very recent one, and on examination, it turns out that the naming practices we take for granted in modern Western states are far from ancient.

A deep dive into personal names across the centuries and the significance attached to them.

Not an easy read but definitely worth the time!

It may help you to understand why U.S.-centric name forms are so annoying to others.

Math whizzes of ancient Babylon figured out forerunner of calculus

Thursday, January 28th, 2016

The video is very cool and goes along with:

Math whizzes of ancient Babylon figured out forerunner of calculus by Ron Cowen.


What could have happened if a forerunner to calculus hadn’t been forgotten for 1400 years?

A sharper question would be:

What if you didn’t lose corporate memory with every promotion, retirement or person leaving the company?

We have all seen it happen and all of us have suffered from it.

What if the investment in expertise and knowledge wasn’t flushed away with promotion, retirement, departure?

That would have to be one helluva ontology to capture everyone’s expertise and knowledge.

What if it wasn’t a single, unified or even “logical” ontology? What if it only represented the knowledge that was important to capture for you and yours? Not every potential user for all time.

Just as we don’t all wear the same uniforms to work everyday, we should not waste time looking for a universal business language for corporate memory.

Unless you are in the business of filling seats for such quixotic quests.

I prefer to deliver a measurable ROI, if it’s all the same to you.

Are you ready to stop hemorrhaging corporate knowledge?

why I try to teach writing when I am supposed to be teaching art history

Wednesday, December 9th, 2015

why I try to teach writing when I am supposed to be teaching art history

From the post:

My daughter asked me yesterday what I had taught for so long the day before, and I told her, “the history of photography” and “writing.” She found this rather funny, since she, as a second-grader, has lately perfected the art of handwriting, so why would I be teaching it — still — to grown ups? I told her it wasn’t really how to write so much as how to put the ideas together — how to take a lot of information and say something with it to somebody else. How to express an idea in an organised way that lets somebody know what and why you think something. So, it turns out, what we call writing is never really just writing at all. It is expressing something in the hopes of becoming less alone. Of finding a voice, yes, but also in finding an ear to hear that voice, and an ear with a mouth that can speak back. It is about learning to enter into a conversation that becomes frozen in letters, yes, but also flexible in the form of its call and response: a magic trick that has the potential power of magnifying each voice, at times in conflict, but also in collusion, and of building those voices into the choir that can be called community. I realise that there was a time before I could write, and also a time when, like my daughter, writing consisted simply of the magic of transforming a line from my pen into words that could lift off the page no different than how I had set them down. But it feels like the me that is me has always been writing, as long as I can remember. It is this voice, however far it reaches or does not reach, that has been me and will continue to be me as long as I live and, in the strange way of words, enter into history. Someday, somebody will write historiographies in which they will talk about me, and I will consist not of this body that I inhabit, but the words that I string onto a page.

This is not to say that I write for the sake of immortality, so much as its opposite: the potential for a tiny bit of immortality is the by product of my attempt to be alive, in its fullest sense. To make a mark, to piss in the corners of life as it were, although hopefully in a slightly more sophisticated and usually less smelly way. Writing is, to me, the greatest output for the least investment: by writing, I gain a voice in the world which, like the flap of a butterfly’s wing, has the potential to grow on its own, outside of me, beyond me. My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Writing is the greatest power that I can ever have. It is also an intimate passion, an orgy, between the many who write and the many who read, excitedly communicating with each other. For this reason it is not a power that I wish only for myself, for that would be no more interesting than the echo chamber of my own head. I love the power that is in others to write, the liberty they grant me to enter into their heads and hear their voices. I love our power to chime together, across time and space. I love the strange ability to enter into conversations with ghosts, as well as argue with, and just as often befriend, people I may never meet and people I hope to have a chance to meet. Even when we disagree, reading what people have written and taking it seriously feels like a deep form of respect to other human beings, to their right to think freely. It is this power of voices, of the many being able of their own accord to formulate a chorus, that appeals to the idealist deep within my superficially cynical self. To my mind, democracy can only emerge through this chorus: a cacophonous chorus that has the power to express as well as respect the diversity within itself.

A deep essay on writing that I recommend you read in full.

There is a line that hints at a reason for semantic diversity in data science and the lack of code reuse in programming.

My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Beyond the question of authority, whose writing do you understand better or more intuitively, yours or the writing or code of someone else? (At least assuming not too much time has gone by since you wrote it.)

The vast majority of us are more comfortable with our own prose or code, even when producing it required transposing prose or code written by others into our own re-telling.

Being more aware of this nearly universal re-casting of prose/code as our own should help us acknowledge our moral debts to others and point back to the sources of our prose/code.

I first saw this in a tweet by Atabey Kaygun.

Tomas Petricek on Against Method

Tuesday, October 13th, 2015

Tomas Petricek on Against Method by Tomas Petricek.

From the webpage:

How is computer science research done? What we take for granted and what we question? And how do theories in computer science tell us something about the real world? Those are some of the questions that may inspire computer scientist like me (and you!) to look into philosophy of science. I’ll present the work of one of the more extreme (and interesting!) philosophers of science, Paul Feyerabend. In “Against Method”, Feyerabend looks at the history of science and finds that there is no fixed scientific methodology and the only methodology that can encompass the rich history is ‘anything goes’. We see (not only computer) science as a perfect methodology for building correct knowledge, but is this really the case? To quote Feyerabend:

“Science is much more ‘sloppy’ and ‘irrational’ than its methodological image.”

I’ll be mostly talking about Paul Feyerabend’s “Against Method”, but as a computer scientist myself, I’ll insert a number of examples based on my experience with theoretical programming language research. I hope to convince you that looking at philosophy of science is very much worthwhile if we want to better understand what we do and how we do it as computer scientists!

The video runs an hour and about eighteen minutes but is worth every minute of it. As you can imagine, I was particularly taken with Tomas’ emphasis on the importance of language. Tomas goes so far as to suggest that disagreements about “type” in computer science stem from fundamentally different understandings of the word “type.”

I was reminded of Stanley Fish‘s Doing What Comes Naturally (DWCN).

DWCN is a long and complex work, but in brief Fish argues that we are all members of various “interpretive communities,” and that each of those communities influences how we understand language as readers. That should come as assurance to those who fear intellectual anarchy and chaos, because our interpretations are always made within the context of an interpretive community.

Two caveats on Fish. As far as I know, Fish has never made the strong move and pointed out that his concept of “interpretive communities” is just as applicable to the natural sciences as it is to the social sciences. What passes as “objective” today is part and parcel of an interpretive community that has declared it so. Other interpretive communities can and do reach other conclusions.

The second caveat is more sad than useful. Post-9/11, Fish and a number of other critics who were accused of teaching cultural relativity of values felt it necessary to distance themselves from that position. While they could not say that all cultures have the same values (factually false), they did say that Western values, as opposed to those of “cowardly, murdering,” etc. others, were superior.

If you think there is any credibility to that post-9/11 position, you haven’t read enough Chomsky. 9/11 wasn’t 1/100,000th of the violence the United States has visited on civilians in other countries since the Korean War.

Today’s Special on Universal Languages

Tuesday, July 7th, 2015

I have often wondered about the fate of the Loglan project, but never seriously enough to track down any potential successor.

Today I encountered a link to Lojban, which is described by Wikipedia as follows:

Lojban (pronounced [ˈloʒban]) is a constructed, syntactically unambiguous human language based on predicate logic, succeeding the Loglan project. The name “Lojban” is a compound formed from loj and ban, which are short forms of logji (logic) and bangu (language).

The Logical Language Group (LLG) began developing Lojban in 1987. The LLG sought to realize Loglan’s purposes, and further improve the language by making it more usable and freely available (as indicated by its official full English title, “Lojban: A Realization of Loglan”). After a long initial period of debating and testing, the baseline was completed in 1997, and published as The Complete Lojban Language. In an interview in 2010 with the New York Times, Arika Okrent, the author of In the Land of Invented Languages, stated: “The constructed language with the most complete grammar is probably Lojban—a language created to reflect the principles of logic.”

Lojban was developed to be a worldlang; to ensure that the gismu (root words) of the language sound familiar to people from diverse linguistic backgrounds, they were based on the six most widely spoken languages as of 1987—Mandarin, English, Hindi, Spanish, Russian, and Arabic. Lojban has also taken components from other constructed languages, notably the set of evidential indicators from Láadan.

I mention this just in case someone proposes to you that a universal language would increase communication and decrease ambiguity, resulting in better, more accurate communication in all fields.

Yes, yes it would. And several already exist, including Lojban. Their language can take its place alongside other universal languages; i.e., it can increase the number of languages that make up the present matrix of semantic confusion.

In case you know someone making that proposal: what part of “new languages increase the potential for semantic confusion” seems unclear?

“Bake Cake” = “Build a Bomb”?

Friday, May 29th, 2015

CNN never misses an opportunity to pollute the English language when reporting vague, wandering alerts about social media and terrorists.

In its coverage of an FBI terror bulletin, FBI issues terror bulletin on ISIS social media reach (video), CNN displays a tweet allegedly using “bake cake” for “build a bomb” at time mark 1:42.

The link pointed to is obscured and, due to censorship of my Twitter feed, I cannot confirm the authenticity of the tweet, nor to what location the link pointed.

The FBI bulletin was issued on May 21, 2015 and the tweet in question was dated May 27, 2015. Its relevance to the FBI bulletin is highly questionable.

The tweet in its entirety reads:

want to bake cake but dont know how?>

for free cake baking training>

Why is this part of the CNN story?

What better way to stoke fear than to make common phrases into fearful ones?

Hearing the phrase “bake a cake” isn’t going to send you diving under the couch but as CNN pushes this equivalence, you will become more and more aware of it.

Not unlike being in the Dallas/Ft. Worth airport for hours listening to: “Watch out for unattended packages!” Whether there is danger or not, it wears on your psyche.

Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

Wednesday, May 20th, 2015

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.


From the abstract:

Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty of choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

  • Falsification
  • Exaggeration
  • Omission
  • Misleading

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which, aside from the politicians' own names and the names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.

A web interface that accepts pointers to video, audio or text with political content and emits tagged deception, with pointers to additional information, would be a real hit for the next U.S. election cycle. Monetize it with ads, the sources of additional information, etc.
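
The classifier behind such an interface could start small. Here is a hedged sketch, not the authors' models: a four-way discriminative classifier over statements already known to be deceptive, with TF-IDF features standing in for the paper's psycholinguistic and structural cues, and invented toy data.

```python
# A hedged sketch, not the authors' pipeline: a four-way discriminative
# classifier over statements already known to be deceptive. TF-IDF
# features stand in for the paper's psycholinguistic/structural cues.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data (invented): each statement is deceptive; the label
# is the strategy employed.
train = [
    ("I never said that at any point", "falsification"),
    ("The report cleared me completely", "falsification"),
    ("It was the biggest crowd in history", "exaggeration"),
    ("Millions and millions supported the bill", "exaggeration"),
    ("I met with them to discuss the budget", "omission"),
    ("We talked only about scheduling", "omission"),
    ("Experts agree the plan pays for itself", "misleading"),
    ("Studies show the numbers speak for themselves", "misleading"),
]
texts, labels = zip(*train)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Tag a new (known-deceptive) statement with its most likely strategy.
print(model.predict(["It was the greatest deal ever made"])[0])
```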

I first saw this in a tweet by Leon Derczynski.

Glossary of linguistic terms

Wednesday, May 6th, 2015

Glossary of linguistic terms by Eugene E. Loos (general editor), Susan Anderson (editor), Dwight H. Day, Jr. (editor), Paul C. Jordan (editor), J. Douglas Wingate (editor).

An excellent source for linguistic terminology.

If you have any interest in languages or linguistics you should give SIL International a visit.

BTW, the last update on the glossary page was in 2004 so if you can suggest some updates or additions, I am sure they would be appreciated.


Unker Non-Linear Writing System

Thursday, April 23rd, 2015

Unker Non-Linear Writing System by Alex Fink & Sai.

From the webpage:


“I understood from my parents, as they did from their parents, etc., that they became happier as they more fully grokked and were grokked by their cat.”[3]

Here is another snippet from the text:

Binding points, lines and relations

Every glyph includes a number of binding points, one for each of its arguments, the semantic roles involved in its meaning. For instance, the glyph glossed as eat has two binding points—one for the thing consumed and one for the consumer. The glyph glossed as (be) fish has only one, the fish. Often we give glosses more like “X eat Y”, so as to give names for the binding points (X is eater, Y is eaten).

A basic utterance in UNLWS is put together by writing out a number of glyphs (without overlaps) and joining up their binding points with lines. When two binding points are connected, this means the entities filling those semantic roles of the glyphs involved coincide. Thus when the ‘consumed’ binding point of eat is connected to the only binding point of fish, the connection refers to an eaten fish.

This is the main mechanism by which UNLWS clauses are assembled. To take a worked example, here are four glyphs:

[The four example glyphs are images in the original and are not reproduced here.]


If you are interested in graphical representations for design or presentation, this may be of interest.
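
For the data-structure minded, here is a rough sketch assuming nothing beyond the description quoted above: glyphs as predicates with named binding points, and connecting two binding points as unifying entity variables. The Glyph class and role names are my own invention.

```python
# A rough sketch of the binding-point idea as a graph: glyphs carry
# named semantic roles, and connecting two roles asserts that they
# refer to a single entity.
from itertools import count

fresh = count()

class Glyph:
    def __init__(self, gloss, *roles):
        self.gloss = gloss
        # each binding point starts as its own entity variable
        self.binding = {r: f"e{next(fresh)}" for r in roles}

def connect(g1, r1, g2, r2):
    """Joining two binding points: their entities coincide."""
    g2.binding[r2] = g1.binding[r1]

eat = Glyph("eat", "eater", "eaten")
fish = Glyph("fish", "it")
connect(eat, "eaten", fish, "it")   # the thing eaten is the fish

print(eat.binding, fish.binding)
# {'eater': 'e0', 'eaten': 'e1'} {'it': 'e1'}  -> an eaten fish
```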

Sam Hunting forwarded this while we were exploring TeX graphics.

PS: The “cat” people on Twitter may appreciate the first graphic. 😉

North Korea vs. TED Talk

Thursday, March 12th, 2015

Quiz: North Korean Slogan or TED Talk Sound Bite? by Dave Gilson.

From the post:

North Korea recently released a list of 310 slogans, trying to rouse patriotic fervor for everything from obeying bureaucracy (“Carry out the tasks given by the Party within the time it has set”) to mushroom cultivation (“Let us turn ours into a country of mushrooms”) and aggressive athleticism (“Play sports games in an offensive way, the way the anti-Japanese guerrillas did!”). The slogans also urge North Koreans to embrace science and technology and adopt a spirit of can-do optimism—messages that might not be too out of place in a TED talk.

Can you tell which of the following exhortations are propaganda from Pyongyang and which are sound bites from TED speakers? (Exclamation points have been added to all TED quotes to match North Korean house style.)

When you discover the source of the quote, do you change your interpretation of its reasonableness, etc.?

All I will say about my score is that either I need to watch far more TED talks and/or pay closer attention to North Korean Radio. 😉


PS: I think a weekly quiz with White House, “terrorist” and Congressional quotes would be more popular than the New York Times Crossword puzzle.

Category theory for beginners

Monday, February 23rd, 2015

Category theory for beginners by Ken Scambler

From the post:

Explains the basic concepts of Category Theory, useful terminology to help understand the literature, and why it’s so relevant to software engineering.

Some two hundred and nine (209) slides, ending with pointers to other resources.

I would have dearly loved to see the presentation live!

This slide deck comes as close as any I have seen to teaching category theory as you would a natural language. Not too close but closer than others.

Think about it. When you entered school did the teacher begin with the terminology of grammar and how rules of grammar fit together?

Or, did the teacher start you off with “See Jack run.” or its equivalent in your language?

You were well on your way to being a competent language user before you were tasked with learning the rules for that language.
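
In that "See Jack run" spirit, here is a minimal sketch of my own, assuming Python functions as the arrows of a category whose objects are types: you use composition first and name the laws later.

```python
# A "See Jack run"-sized sketch: composition before grammar.
def compose(g, f):
    """Arrow composition: (g . f)(x) = g(f(x))."""
    return lambda x: g(f(x))

identity = lambda x: x          # the identity arrow

double = lambda n: n * 2        # int -> int
shout = lambda n: f"{n}!"       # int -> str

pipeline = compose(shout, double)
print(pipeline(3))              # 6!

# Only afterwards name the rules: identity is a unit, and
# composition is associative.
assert compose(identity, double)(5) == compose(double, identity)(5) == double(5)
h = lambda s: s.upper()
assert compose(h, compose(shout, double))(2) == compose(compose(h, shout), double)(2)
```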

Interesting that the exact opposite approach is taken with category theory and so many topics related to computer science.

Pointers to anyone using a natural language teaching approach for category theory or CS material?

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Wednesday, February 18th, 2015

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars by Hua He, Jimmy Lin, Adam Lopez. (Transactions of the Association for Computational Linguistics, vol. 3, pp. 87–100, 2015.)


From the abstract:

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

If you are interested in cross-language search, DNA sequence alignment or other pattern matching problems, you need to watch the progress of this work.
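
To see what "matching patterns that contain gaps" means, here is a toy CPU illustration of mine, nothing like the paper's GPU algorithm: find occurrences of a source phrase with a single gap (written as None) in a tokenized sentence.

```python
# A toy illustration of gappy pattern matching on the CPU: match a
# phrase with one gap against a tokenized sentence, yielding spans.
def match_gappy(pattern, sentence, max_gap=4):
    """Yield (start, end) spans where pattern matches sentence."""
    gap_at = pattern.index(None)
    pre, post = list(pattern[:gap_at]), list(pattern[gap_at + 1:])
    for i in range(len(sentence) - len(pre) + 1):
        if sentence[i:i + len(pre)] != pre:
            continue
        for gap in range(1, max_gap + 1):   # gap covers 1..max_gap tokens
            j = i + len(pre) + gap
            if sentence[j:j + len(post)] == post:
                yield (i, j + len(post))

sent = "the cat sat on the mat".split()
print(list(match_gappy(("the", None, "on"), sent)))  # [(0, 4)]
```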

This article and other important research is freely accessible at: Transactions of the Association for Computational Linguistics

WorldWideScience.org (Update)

Wednesday, January 28th, 2015

I first wrote about WorldWideScience.org in a post dated October 17, 2011.

A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

A current description of the system from the customer story:

In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with WorldWideScience.org, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, WorldWideScience.org instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

Automated searching and translation can't give you the semantic nuances possible with human authoring, but it certainly can provide you with the source materials to build a specialized information resource with such semantics.

Very much a site to bookmark and use on a regular basis.

Links for subjects not linked above:

Deep Web Technologies

Microsoft Translator

Modelling Plot: On the “conversional novel”

Tuesday, January 20th, 2015

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, its a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards but never with a novel: try reading the text from the back forward, a sentence at a time. At least for a text you are proofing, it provides a radically different perspective than the more normal front-to-back pass. The first thing you notice is that it interrupts your reading/skimming speed, so you will catch more errors as well as nuances in the text.
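
A minimal helper for that back-to-front pass, assuming a plain-text draft; the filename draft.txt and the naive regex splitter are mine (a real sentence tokenizer such as NLTK's punkt would do better).

```python
# Print a draft's sentences in reverse order for back-to-front proofing.
import re

def sentences_reversed(path):
    text = open(path, encoding="utf-8").read()
    # naive split on sentence-final punctuation followed by whitespace
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return list(reversed(sents))

for s in sentences_reversed("draft.txt"):  # hypothetical filename
    print(s, end="\n\n")
```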

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the "war on terrorism" narrative that is so popular and unquestioned in the United States. Ask anyone inside the Beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sort of questions but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.