Archive for the ‘Language’ Category

How to Help Trump

Wednesday, December 21st, 2016

How to Help Trump by George Lakoff.

From the post:

Without knowing it, many Democrats, progressives and members of the news media help Donald Trump every day. The way they help him is simple: they spread his message.

Think about it: every time Trump issues a mean tweet or utters a shocking statement, millions of people begin to obsess over his words. Reporters make it the top headline. Cable TV panels talk about it for hours. Horrified Democrats and progressives share the stories online, making sure to repeat the nastiest statements in order to refute them. While this response is understandable, it works in favor of Trump.

When you repeat Trump, you help Trump. You do this by spreading his message wide and far.

I know Lakoff from his Women, Fire, and Dangerous Things: What Categories Reveal about the Mind.

I haven’t read any of his “political” books but would buy them sight unseen on the strength of Women, Fire, and Dangerous Things.

Lakoff promises a series of posts using effective framing to “…expose and undermine Trump’s propaganda.”

Whether you want to help expose Trump or use framing to promote your own product or agenda, start following Lakoff today!

Pattern Overloading

Tuesday, December 6th, 2016

Pattern Overloading by Ramsey Nasser.

From the post:

C-like languages have a problem of overloaded syntax that I noticed while teaching high school students. Consider the following snippets in such a language:


foo(x);
function foo(int x) {
for(int i = 0; i < 10; i++) {
if(x > 10) {
case(x) {

A programmer experienced with this family would see

  1. Function invocation
  2. Function definition
  3. Control flow examples

In my experience, new programmers see these constructs as instances of the same idea: name(some-stuff) more-stuff. This is not an unreasonable conclusion to reach. The syntax for each construct is shockingly similar given that their semantics are wildly different.
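The name(some-stuff) more-stuff observation can be made concrete: a single regular expression matches every one of these constructs, even though their semantics run from definition to branching. A minimal Python sketch (the snippet strings paraphrase Nasser's examples):

```python
import re

# One surface shape -- an identifier, a parenthesized part, then an
# opening brace -- covers all of these C-family constructs.
shape = re.compile(r"\w+\s*\(.*\)\s*\{")

snippets = [
    "function foo(int x) {",          # function definition
    "for(int i = 0; i < 10; i++) {",  # loop
    "if(x > 10) {",                   # conditional
    "case(x) {",                      # branch
]

# Every snippet matches the same pattern, despite wildly different semantics.
matches = [bool(shape.search(s)) for s in snippets]
```

Nothing in the surface syntax distinguishes the four; a reader, and especially a novice, has to fall back on keyword knowledge, which is exactly Nasser's point.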

You won’t be called upon to re-design C, but Nasser’s advice:

Syntactic similarity should mirror semantic similarity

Or, to take a quote from the UX world

Similar things should look similar and dissimilar things should look dissimilar

is equally applicable to any syntax that you design.

BBC World Service – In 40 Languages [Non-U.S. Centric Topic Mappers Take Note]

Tuesday, November 15th, 2016

BBC World Service announces biggest expansion ‘since the 1940s’

From the post:

The BBC World Service will launch 11 new language services as part of its biggest expansion “since the 1940s”, the corporation has announced.

The expansion is a result of the funding boost announced by the UK government last year.

The new languages will be Afaan Oromo, Amharic, Gujarati, Igbo, Korean, Marathi, Pidgin, Punjabi, Telugu, Tigrinya, and Yoruba.

The first new services are expected to launch in 2017.

“This is a historic day for the BBC, as we announce the biggest expansion of the World Service since the 1940s,” said BBC director general Tony Hall.

“The BBC World Service is a jewel in the crown – for the BBC and for Britain.

“As we move towards our centenary, my vision is of a confident, outward-looking BBC which brings the best of our independent, impartial journalism and world-class entertainment to half a billion people around the world.


The BBC World Service is the starting place to broaden your horizons.

In English, the “all shows” page lists 1,831 shows.

I prefer reading over listening but have resolved to start exploring the world of the BBC.

Green’s Dictionary of Slang [New Commercializing Information Model?]

Friday, October 14th, 2016

Green’s Dictionary of Slang

From the about page:

Green’s Dictionary of Slang is the largest historical dictionary of English slang. Written by Jonathon Green over 17 years from 1993, it reached the printed page in 2010 in a three-volume set containing nearly 100,000 entries supported by over 400,000 citations from c. AD 1000 to the present day. The main focus of the dictionary is the coverage of over 500 years of slang from c. 1500 onwards.

The printed version of the dictionary received the Dartmouth Medal for outstanding works of reference from the American Library Association in 2012; fellow recipients include the Dictionary of American Regional English, the Oxford Dictionary of National Biography, and the New Grove Dictionary of Music and Musicians. It has been hailed by the American New York Times as ‘the pièce de résistance of English slang studies’ and by the British Sunday Times as ‘a stupendous achievement, in range, meticulous scholarship, and not least entertainment value’.

On this website the dictionary is now available in updated online form for the first time, complete with advanced search tools enabling search by definition and history, and an expanded bibliography of slang sources from the early modern period to the present day. Since the print edition, nearly 60,000 quotations have been added, supporting 5,000 new senses in 2,500 new entries and sub-entries, of which around half are new slang terms from the last five years.

Green’s Dictionary of Slang has an interesting commercial model.

You can search for any word, freely, but “more search features” requires a subscription:

By subscribing to Green’s Dictionary of Slang Online, you gain access to advanced search tools (including the ability to search for words by meaning, history, and usage), full historical citations in each entry, and a bibliography of over 9,000 slang sources.

Current rate for individuals is £49 (about $59.96).

In addition to being a fascinating collection of information, is the free/commercial split here of interest?

An alternative to:

The Teaser Model

Contrast the Oxford Music Online:

Grove Music Online is the eighth edition of Grove’s Dictionary of Music and Musicians, and contains articles commissioned specifically for the site as well as articles from New Grove 2001, Grove Opera, and Grove Jazz. The recently published second editions of The Grove Dictionary of American Music and The Grove Dictionary of Musical Instruments are still being put online, and new articles are added to GMO with each site update.

Oh, Oxford Music Online isn’t all pay-per-view.

It offers the following thirteen (13) articles for free viewing:

Sotiria Bellou, Greek singer of rebetiko song, famous for the special quality and register of her voice

Cell [Mobile] Phone Orchestra, ensemble of performers using programmable mobile (cellular) phones

Crete, largest and most populous of the Greek islands

Lyuba Encheva, Bulgarian pianist and teacher

Gaaw, generic term for drums, and specifically the frame drum, of the Tlingit and Haida peoples of Alaska

Johanna Kinkel, German composer, writer, pianist, music teacher, and conductor

Lady’s Glove Controller, modified glove that can control sound, mechanical devices, and lights

Outsider music, a loosely related set of recordings that do not fit well within any pre-existing generic framework

Peter (Joshua) Sculthorpe, Australian composer, seen by the Australian musical public as the most nationally representative.

Slovenia, country in southern Central Europe

Sound art, a term encompassing a variety of art forms that utilize sound, or comment on auditory cultures

Alice (Bigelow) Tully, American singer and music philanthropist

Wars in Iraq and Afghanistan, soldiers’ relationship with music is largely shaped by contemporary audio technology

Hmmm, 160,000 slang terms for free from Green’s Dictionary of Slang versus 13 free articles from Oxford Music Online.

Show of hands for the teaser model of Oxford Music Online?

The Consumer As Product

You are aware that casual web browsing and alleged “free” sites are not just supported by ads, but by the information they collect on you?

Consider this rather boastful touting of information collection capabilities:

To collect online data, we use our native tracking tags as experience has shown that other methods require a great deal of time, effort and cost on both ends and almost never yield satisfactory coverage or results since they depend on data provided by third parties or compiled by humans (!!), without being able to verify the quality of the information. We have a simple universal server-side tag that works with most tag managers. Collecting offline marketing data is a bit trickier. For TV and radio, we will work with your offline advertising agency to collect post-log reports on a weekly basis, transmitted to a secure FTP. Typical parameters include flight and cost, date/time stamp, network, program, creative length, time of spot, GRP, etc.

Convertro is also able to collect other types of offline data, such as in-store sales, phone orders or catalog feeds. Our most popular proprietary solution involves placing a view pixel within a confirmation email. This makes it possible for our customers to tie these users to prior online activity without sharing private user information with us. For some customers, we are able to match almost 100% of offline sales. Other customers that have different conversion data can feed them into our system and match it to online activity by partnering with LiveRamp. These matches usually have a success rate between 30%-50%. Phone orders are tracked by utilizing a smart combination of our in-house approach, the inputting of special codes, or by third party vendors such as Mongoose and ResponseTap.

You don’t have to be on the web, you can be tracked “in-store,” on the phone, etc.
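The confirmation-email trick Convertro boasts about is mechanically trivial: a “view pixel” is just a 1×1 image whose URL carries an order identifier. A hypothetical sketch (the tracker URL, parameter names, and helper function are my own invention, not Convertro’s actual API):

```python
from urllib.parse import urlencode

# Hypothetical sketch of an email "view pixel": a 1x1 image URL that
# carries the order id. When the mail client fetches the image, the
# tracker's server logs the request and can join the order to whatever
# cookies already identify that browser.
def pixel_tag(base_url, order_id):
    query = urlencode({"order": order_id, "event": "email_open"})
    return f'<img src="{base_url}?{query}" width="1" height="1" alt="">'

tag = pixel_tag("https://tracker.example.com/p.gif", "A-1001")
```

The server never needs to be told who you are; the image request volunteers the linkage on its own.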

Convertro doesn’t explicitly mention “supercookies,” for which Verizon just paid a $1.35 million fine. From the post:

“Supercookies,” known officially as unique identifier headers [UIDH], are short-term serial numbers used by corporations to track customer data for advertising purposes. According to Jacob Hoffman-Andrews, a technologist with the Electronic Frontier Foundation, these cookies can be read by any web server one visits and used to build individual profiles of internet habits. These cookies are hard to detect, and even harder to get rid of.

If any of that sounds objectionable to you, remember that to be valuable, user habits must be tracked.

That is if you find the idea of being a product acceptable.

The Green’s Dictionary of Slang offers an economic model that enables free access to casual users, kids writing book reports, journalists, etc., while at the same time creating a value-add that power users will pay for.

Other examples of value-add models with free access to the core information?

What would that look like for the Podesta emails?

Stochastic Terrorism – Usage Prior To January 10, 2011?

Wednesday, August 10th, 2016

With Donald Trump’s remarks today, you know that discussions of stochastic terrorism are about to engulf social media.

Anticipating that, I tried to run down some facts on the usage of “stochastic terrorism.”

As a starting point, Google NGrams comes up with zero (0) examples up to the year 2000.

One blog I found, appropriately named Stochastic Terrorism, has only one post, from January 26, 2011. It may share an author with Stochastic Terrorism: Triggering the shooters (Daily Kos, January 10, 2011) and the closely related Glenn Beck- Consider yourself on notice *Pictures fixed* (Daily Kos, July 26, 2011). The January 10, 2011 post may be the origin of the phrase.

The Corpus of Contemporary American English, which is complete up to 2015, reports zero (0) hits for “stochastic terrorism.”

NOW Corpus (News on the Web) reports three (3) hits for “stochastic terrorism.”

July 18, 2016 – Salon. All hate is not created equal: The folly of perceiving murderers like Dylann Roof, Micah Johnson and Gavin Long as one and the same by Chauncey DeVega.

Dylann Roof was delusional; his understanding of reality colored by white racial paranoiac fantasies. However, Roof was not born that way. He was socialized into hatred by a right-wing news media that encourages stochastic terrorism among its audience by the repeated use of eliminationist rhetoric, subtle and overt racism against non-whites, conspiracy theories, and reactionary language such as “real America” and “take our country back.”

In case you don’t have the context for Dylann Roof:

Roof is a white supremacist. Driven by that belief, he decided to kill 9 unarmed black people after a prayer meeting in Charleston, South Carolina’s Emanuel AME Church. Roof’s manifesto explains that he wanted to kill black people because white people were “oppressed” in their “own country,” “illegal immigrants” and “Jews” were ruining the United States, and African-Americans are all criminals. Like other white supremacists and white nationalists (and yes, many “respectable” white conservatives as well) Roof’s political and intellectual cosmology is oriented around a belief that white Americans are somehow marginalized or treated badly in the United States. This is perverse and delusional: white people are the most economically and politically powerful racial group in the United States; American society is oriented around the protection of white privilege.

“Stochastic terrorism” occurs twice in:

December 7, 2015 The American Conservative. The Challenge of Lone Wolf Terrorism by Philip Jenkins.

Jenkins covers at length “leaderless resistance:”

Amazingly, the story goes back to the U.S. ultra-Right in the 1980s. Far Rightists and neo-Nazis tried to organize guerrilla campaigns against the U.S. government, which caused some damage but soon collapsed ignominiously. The problem was the federal agencies had these movements thoroughly penetrated, so that every time someone planned an attack, it was immediately discovered by means of either electronic or human intelligence. The groups were thoroughly penetrated by informers.

The collapse of that endeavor led to some serious rethinking by the movement’s intellectual leaders. Extremist theorists now evolved a shrewd if desperate strategy of “leaderless resistance,” based on what they called the “Phantom Cell or individual action.” If even the tightest of cell systems could be penetrated by federal agents, why have a hierarchical structure at all? Why have a chain of command? Why not simply move to a non-structure, in which individual groups circulate propaganda, manuals and broad suggestions for activities, which can be taken up or adapted according to need by particular groups or even individuals?

The phrase stochastic terrorism occurs twice, both in a comment:

Are they leaderless resistance tactics or is this stochastic terrorism? Stochastic terrorism is the use of mass communications/media to incite random actors to carry out violent or terrorist acts that are statistically predictable but individually unpredictable. That is, remote-control murder by lone wolf. This is by no means the sole province of one group.

The thread ends shortly thereafter with no one picking up on the distinction between “leaderless resistance,” and “stochastic terrorism,” if there is one.

I don’t have a publication date for Stochastic Terrorism? by Larry Wohlgemuth (the lack of dating on content being a rant for another day), which says:

Everybody was certain it would happen, and in the wake of the shooting in Tucson last week only the most militant teabagger was able to deny that incendiary rhetoric played a role. We knew this talk of crosshairs, Second Amendment remedies and lock and load eventually would have repercussions, and it did.

Only the most obtuse can deny that, if you talk long enough about picking up a gun and shooting people, marginal personalities and the mentally ill will respond to that suggestion. Feebleminded and disturbed people DO exist, and to believe these words wouldn’t affect them seemed inauthentic at best and criminal at worst.

Now that the unthinkable has happened, people on the left want to shove it down the throats of wingers that are denying culpability. Suddenly, like Manna from heaven, a radical new “meme” was gifted to people intended to buttress their arguments that incendiary rhetoric does indeed result in violent actions.

It begs the question, what is stochastic terrorism, and how does it apply to the shooting in Tucson.

This diary on Daily Kos by a member who calls himself G2geek was posted Monday, January 10, two days after the tragedy in Tucson. It describes in detail the mechanisms whereby “stochastic terrorism” works, and who’s vulnerable to it. Here’s the diarist’s own words in explaining stochastic terrorism:

Which puts the origin of “stochastic terrorism” back to the Daily Kos post of January 10, 2011, Stochastic Terrorism: Triggering the shooters, which appeared two days after U.S. Representative Gabrielle Giffords and eighteen others were shot in Tucson, Arizona.

As of this morning, a popular search engine returns 536 “hits” for “stochastic terrorist,” and 12,300 “hits” for “stochastic terrorism.”

The term “stochastic terrorism” isn’t a popular one, perhaps it isn’t as easy to say as “…lone wolf.”

My concern is the potential use of “stochastic terrorism” to criminalize free speech and to intimidate speakers into self-censorship.

Not to mention that we should write Privilege with a capital P: when you can order the deaths of foreign leaders and prosecute anyone who suggests that violence is possible against you, that’s Privilege.

Suggestions on further sources?

greek-accentuation 1.0.0 Released

Thursday, July 28th, 2016

greek-accentuation 1.0.0 Released by James Tauber.

From the post:

greek-accentuation has finally hit 1.0.0 with a couple more functions and a module layout change.

The library (which I’ve previously written about here) has been sitting on 0.9.9 for a while and I’ve been using it successfully in my inflectional morphology work for 18 months. There were, however, a couple of functions that lived in the inflectional morphology repos that really belonged in greek-accentuation. They have now been moved there.

If that sounds a tad obscure, some additional explanation from an earlier post by James:

It [greek-accentuation] consists of three modules:

  • characters
  • syllabify
  • accentuation

The characters module provides basic analysis and manipulation of Greek characters in terms of their Unicode diacritics as if decomposed. So you can use it to add, remove or test for breathing, accents, iota subscript or length diacritics.

The syllabify module provides basic analysis and manipulation of Greek syllables. It can syllabify words, give you the onset, nucleus, coda, rime or body of a syllable, judge syllable length or give you the accentuation class of a word.

The accentuation module uses the other two modules to accentuate Ancient Greek words. As well as listing possible_accentuations for a given unaccented word, it can produce recessive and (given another form with an accent) persistent accentuations.
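The onset/nucleus/coda terminology in the syllabify description can be illustrated in a few lines of Python. This is an illustration of the concepts only, not the greek-accentuation API, and it uses Latin transliteration for simplicity where the real library works on Greek Unicode and its diacritics:

```python
# Illustrative sketch only -- not the greek-accentuation API.
# For a (transliterated) syllable: the onset is the consonant run before
# the first vowel, the nucleus is the vowel run, the coda is what follows;
# rime = nucleus + coda, body = onset + nucleus.
VOWELS = set("aeiou")

def parts(syllable):
    i = 0
    while i < len(syllable) and syllable[i] not in VOWELS:
        i += 1
    j = i
    while j < len(syllable) and syllable[j] in VOWELS:
        j += 1
    onset, nucleus, coda = syllable[:i], syllable[i:j], syllable[j:]
    return {"onset": onset, "nucleus": nucleus, "coda": coda,
            "rime": nucleus + coda, "body": onset + nucleus}
```

So for the transliterated syllable kos, the onset is k, the nucleus o, the coda s, the rime os and the body ko.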

Another name from my past and a welcome reminder that not all of computer science is focused on recommending ephemera for our consumption.

Sticks and Stones: How Names Work & Why They Hurt

Friday, February 26th, 2016

Sticks and Stones (1): How Names Work & Why They Hurt by Michael Ramscar.

Sticks and Stones (2): How Names Work & Why They Hurt

Sticks and Stones (3): How Names Work & Why They Hurt

From part 1:

In 1781, Christian Wilhelm von Dohm, a civil servant, political writer and historian in what was then Prussia published a two volume work entitled Über die Bürgerliche Verbesserung der Juden (“On the Civic Improvement of Jews”). In it, von Dohm laid out the case for emancipation for a people systematically denied the rights granted to most other European citizens. At the heart of his treatise lay a simple observation: The universal principles of humanity and justice that framed the constitutions of the nation-states then establishing themselves across the continent could hardly be taken seriously until those principles were, in fact, applied universally. To all.

Von Dohm was inspired to write his treatise by his friend, the Jewish philosopher Moses Mendelssohn, who wisely supposed that even though basic and universal principles were involved, there were advantages to be gained in this context by having their implications articulated by a Christian. Mendelssohn’s wisdom is reflected in history: von Dohm’s treatise was widely circulated and praised, and is thought to have influenced the French National Assembly’s decision to emancipate Jews in France in 1791 (Mendelssohn was particularly concerned at the poor treatment of Jews in Alsace), as well as laying the groundwork for an edict that was issued on behalf of the Prussian Government on the 11th of March 1812:

“We, Frederick William, King of Prussia by the Grace of God, etc. etc., having decided to establish a new constitution conforming to the public good of Jewish believers living in our kingdom, proclaim all the former laws and prescriptions not confirmed in this present edict to be abrogated.”

To gain the full rights due to a Prussian citizen, Jews were required to declare themselves to the police within six months of the promulgation of the edict. And following a proposal put forward in von Dohm’s treatise (and later approved by David Friedländer, another member of Mendelssohn’s circle who acted as a consultant in the drawing up of the edict), any Jews who wanted to take up full Prussian citizenship were further required to adopt a Prussian Nachname.

What we call in English, a ‘surname.’

From the vantage afforded by the present day, it is easy to assume that names as we now know them are an immutable part of human history. Since one’s name is ever-present in one’s own life, it might seem that fixed names are ever-present and universal, like mountains, or the sunrise. Yet in the Western world, the idea that everyone should have an official, hereditary identifier is a very recent one, and on examination, it turns out that the naming practices we take for granted in modern Western states are far from ancient.

A very deep dive on personal names across the centuries and the significance attached to them.

Not an easy read but definitely worth the time!

It may help you to understand why U.S.-centric name forms are so annoying to others.

Math whizzes of ancient Babylon figured out forerunner of calculus

Thursday, January 28th, 2016

The video is very cool and goes along with:

Math whizzes of ancient Babylon figured out forerunner of calculus by Ron Cowen.


What could have happened if a forerunner to calculus hadn’t been forgotten for 1,400 years?

A sharper question would be:

What if you didn’t lose corporate memory with every promotion, retirement or person leaving the company?

We have all seen it happen and all of us have suffered from it.

What if the investment in expertise and knowledge wasn’t flushed away with promotion, retirement, departure?

That would have to be one helluva ontology to capture everyone’s expertise and knowledge.

What if it wasn’t a single, unified or even “logical” ontology? What if it only represented the knowledge that was important to capture for you and yours? Not every potential user for all time.

Just as we don’t all wear the same uniforms to work everyday, we should not waste time looking for a universal business language for corporate memory.

Unless you are in the business of filling seats for such quixotic quests.

I prefer to deliver a measurable ROI, if it’s all the same to you.

Are you ready to stop hemorrhaging corporate knowledge?

why I try to teach writing when I am supposed to be teaching art history

Wednesday, December 9th, 2015

why I try to teach writing when I am supposed to be teaching art history

From the post:

My daughter asked me yesterday what I had taught for so long the day before, and I told her, “the history of photography” and “writing.” She found this rather funny, since she, as a second-grader, has lately perfected the art of handwriting, so why would I be teaching it — still — to grown ups? I told her it wasn’t really how to write so much as how to put the ideas together — how to take a lot of information and say something with it to somebody else. How to express an idea in an organised way that lets somebody know what and why you think something. So, it turns out, what we call writing is never really just writing at all. It is expressing something in the hopes of becoming less alone. Of finding a voice, yes, but also in finding an ear to hear that voice, and an ear with a mouth that can speak back. It is about learning to enter into a conversation that becomes frozen in letters, yes, but also flexible in the form of its call and response: a magic trick that has the potential power of magnifying each voice, at times in conflict, but also in collusion, and of building those voices into the choir that can be called community. I realise that there was a time before I could write, and also a time when, like my daughter, writing consisted simply of the magic of transforming a line from my pen into words that could lift off the page no different than how I had set them down. But it feels like the me that is me has always been writing, as long as I can remember. It is this voice, however far it reaches or does not reach, that has been me and will continue to be me as long as I live and, in the strange way of words, enter into history. Someday, somebody will write historiographies in which they will talk about me, and I will consist not of this body that I inhabit, but the words that I string onto a page.

This is not to say that I write for the sake of immortality, so much as its opposite: the potential for a tiny bit of immortality is the by product of my attempt to be alive, in its fullest sense. To make a mark, to piss in the corners of life as it were, although hopefully in a slightly more sophisticated and usually less smelly way. Writing is, to me, the greatest output for the least investment: by writing, I gain a voice in the world which, like the flap of a butterfly’s wing, has the potential to grow on its own, outside of me, beyond me. My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Writing is the greatest power that I can ever have. It is also an intimate passion, an orgy, between the many who write and the many who read, excitedly communicating with each other. For this reason it is not a power that I wish only for myself, for that would be no more interesting than the echo chamber of my own head. I love the power that is in others to write, the liberty they grant me to enter into their heads and hear their voices. I love our power to chime together, across time and space. I love the strange ability to enter into conversations with ghosts, as well as argue with, and just as often befriend, people I may never meet and people I hope to have a chance to meet. Even when we disagree, reading what people have written and taking it seriously feels like a deep form of respect to other human beings, to their right to think freely. It is this power of voices, of the many being able of their own accord to formulate a chorus, that appeals to the idealist deep within my superficially cynical self. To my mind, democracy can only emerge through this chorus: a cacophonous chorus that has the power to express as well as respect the diversity within itself.

A deep essay on writing that I recommend you read in full.

There is a line that hints at a reason for semantic diversity in data science and the lack of code reuse in programming.

My conviction that I should write is not so much because I think I’m better or have more of a right to speak than anybody else, but because I’m equally not convinced that anybody, no matter what their position of authority, is better or has more of an authorisation to write than me.

Beyond the question of authority, whose writing do you understand better or more intuitively, yours or the writing or code of someone else? (At least assuming not too much time has gone by since you wrote it.)

The vast majority of us are more comfortable with our own prose or code, even though producing it required transposing prose or code written by others into our own re-telling.

Being more aware of this nearly universal recasting of prose/code as our own should help us acknowledge our moral debts to others and point back to the sources of our prose/code.

I first saw this in a tweet by Atabey Kaygun.

Tomas Petricek on The Against Method

Tuesday, October 13th, 2015

Tomas Petricek on The Against Method by Tomas Petricek.

From the webpage:

How is computer science research done? What do we take for granted and what do we question? And how do theories in computer science tell us something about the real world? Those are some of the questions that may inspire computer scientists like me (and you!) to look into philosophy of science. I’ll present the work of one of the more extreme (and interesting!) philosophers of science, Paul Feyerabend. In “Against Method”, Feyerabend looks at the history of science and finds that there is no fixed scientific methodology and the only methodology that can encompass the rich history is ‘anything goes’. We see (not only computer) science as a perfect methodology for building correct knowledge, but is this really the case? To quote Feyerabend:

“Science is much more ‘sloppy’ and ‘irrational’ than its methodological image.”

I’ll be mostly talking about Paul Feyerabend’s “Against Method”, but as a computer scientist myself, I’ll insert a number of examples based on my experience with theoretical programming language research. I hope to convince you that looking at philosophy of science is very much worthwhile if we want to better understand what we do and how we do it as computer scientists!

The video runs an hour and about eighteen minutes but is worth every minute of it. As you can imagine, I was particularly taken with Tomas’ emphasis on the importance of language. Tomas goes so far as to suggest that disagreements about “type” in computer science stem from fundamentally different understandings of the word “type.”

I was reminded of Stanley Fish‘s “Doing What Comes Naturally” (DWCN).

DWCN is a long and complex work, but in brief Fish argues that we are all members of various “interpretive communities,” and that each of those communities influences how we understand language as readers. This should come as assurance to those who fear intellectual anarchy and chaos, because our interpretations are always made within the context of an interpretative community.

Two caveats on Fish. As far as I know, Fish has never made the strong move of pointing out that his concept of “interpretative communities” is just as applicable to the natural sciences as it is to the social sciences. What passes as “objective” today is part and parcel of an interpretative community that has declared it so. Other interpretative communities can and do reach other conclusions.

The second caveat is more sad than useful. Post-9/11, Fish and a number of other critics who were accused of teaching cultural relativity of values felt it necessary to distance themselves from that position. While they could not say that all cultures have the same values (factually false), they did say that Western values, as opposed to those of “cowardly, murdering,” etc. others, were superior.

If you think there is any credibility to that post-9/11 position, you haven’t read enough Chomsky. 9/11 wasn’t 1/100,000 of the violence the United States has visited on civilians in other countries since the Korean War.

Today’s Special on Universal Languages

Tuesday, July 7th, 2015

I have often wondered about the fate of the Loglan project, but never seriously enough to track down any potential successor.

Today I encountered a link to Lojban, which is described by Wikipedia as follows:

Lojban (pronounced [ˈloʒban]) is a constructed, syntactically unambiguous human language based on predicate logic, succeeding the Loglan project. The name “Lojban” is a compound formed from loj and ban, which are short forms of logji (logic) and bangu (language).

The Logical Language Group (LLG) began developing Lojban in 1987. The LLG sought to realize Loglan’s purposes, and further improve the language by making it more usable and freely available (as indicated by its official full English title, “Lojban: A Realization of Loglan”). After a long initial period of debating and testing, the baseline was completed in 1997, and published as The Complete Lojban Language. In an interview in 2010 with the New York Times, Arika Okrent, the author of In the Land of Invented Languages, stated: “The constructed language with the most complete grammar is probably Lojban—a language created to reflect the principles of logic.”

Lojban was developed to be a worldlang; to ensure that the gismu (root words) of the language sound familiar to people from diverse linguistic backgrounds, they were based on the six most widely spoken languages as of 1987—Mandarin, English, Hindi, Spanish, Russian, and Arabic. Lojban has also taken components from other constructed languages, notably the set of evidential indicators from Láadan.

I mention this just in case someone proposes to you that a universal language would increase communication and decrease ambiguity, resulting in better, more accurate communication in all fields.

Yes, yes it would. And several already exist, including Lojban. Their language can take its place alongside other universal languages, i.e., it can increase the number of languages that make up the present matrix of semantic confusion.

In case you were wondering, what part of: New languages increase the potential for semantic confusion, seems unclear?

“Bake Cake” = “Build a Bomb”?

Friday, May 29th, 2015

CNN never misses an opportunity to pollute the English language, as when it issues vague, wandering alerts about social media and terrorists.

In its coverage of an FBI terror bulletin, FBI issues terror bulletin on ISIS social media reach (video), CNN displays a tweet allegedly using “bake cake” for “build a bomb” at time mark 1:42.

The link pointed to is obscured and due to censorship of my Twitter feed, I cannot confirm the authenticity of the tweet, nor to what location the link pointed.

The FBI bulletin was issued on May 21, 2015 and the tweet in question was dated May 27, 2015. Its relevance to the FBI bulletin is highly questionable.

The tweet in its entirety reads:

want to bake cake but dont know how?>

for free cake baking training>

Why is this part of the CNN story?

What better way to stoke fear than to make common phrases into fearful ones?

Hearing the phrase “bake a cake” isn’t going to send you diving under the couch but as CNN pushes this equivalence, you will become more and more aware of it.

Not unlike being in the Dallas/Ft. Worth airport for hours listening to: “Watch out for unattended packages!” Whether there is danger or not, it wears on your psyche.

Detecting Deception Strategies [Godsend for the 2016 Election Cycle]

Wednesday, May 20th, 2015

Discriminative Models for Predicting Deception Strategies by Scott Appling, Erica Briscoe, C.J. Hutto.


Although a large body of work has previously investigated various cues predicting deceptive communications, especially as demonstrated through written and spoken language (e.g., [30]), little has been done to explore predicting kinds of deception. We present novel work to evaluate the use of textual cues to discriminate between deception strategies (such as exaggeration or falsification), concentrating on intentionally untruthful statements meant to persuade in a social media context. We conduct human subjects experimentation wherein subjects were engaged in a conversational task and then asked to label the kind(s) of deception they employed for each deceptive statement made. We then develop discriminative models to understand the difficulty of choosing between one and several strategies. We evaluate the models using precision and recall for strategy prediction among 4 deception strategies based on the most relevant psycholinguistic, structural, and data-driven cues. Our single strategy model results demonstrate as much as a 58% increase over baseline (random chance) accuracy and we also find that it is more difficult to predict certain kinds of deception than others.

The deception strategies studied in this paper:

  • Falsification
  • Exaggeration
  • Omission
  • Misleading

especially omission, will form the bulk of the content in the 2016 election cycle in the United States. Only deceptive statements were included in the test data, so the models were tested on correctly recognizing the deception strategy in a known deceptive statement.

The test data is remarkably similar to political content, which, aside from the candidates' names and the names of their opponents (mostly), is composed entirely of deceptive statements, albeit not marked for the strategy used in each one.
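To make the classification task concrete, here is a toy multi-class text classifier over the four strategies. Everything here is invented for illustration: the training sentences and labels are made up, and the model is a plain bag-of-words Naive Bayes, not the paper's discriminative models with psycholinguistic and structural cues.

```python
# Toy multi-class classifier for deception strategies. The labeled data
# below is invented for illustration; the paper's models used far richer
# psycholinguistic, structural, and data-driven cues.
import math
from collections import Counter, defaultdict

TRAIN = [
    ("the crowd was the biggest ever seen", "exaggeration"),
    ("millions and millions attended",      "exaggeration"),
    ("i never met that person",             "falsification"),
    ("the report does not exist",           "falsification"),
    ("i mentioned the profits",             "omission"),   # leaves out the losses
    ("we discussed some of the costs",      "omission"),
    ("technically the statement was true",  "misleading"),
    ("that depends on what you mean",       "misleading"),
]

def train(examples):
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one smoothing
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(predict("the biggest crowd ever", *model))  # → exaggeration
```

Even this crude sketch shows why omission is hard: an omission is defined by words that are *absent*, which a bag-of-words model cannot see.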

A web interface for loading pointers to video, audio or text with political content that emits tagged deception with pointers to additional information would be a real hit for the next U.S. election cycle. Monetize with ads, the sources of additional information, etc.

I first saw this in a tweet by Leon Derczynski.

Glossary of linguistic terms

Wednesday, May 6th, 2015

Glossary of linguistic terms by Eugene E. Loos (general editor), Susan Anderson (editor), Dwight H., Day, Jr. (editor), Paul C. Jordan (editor), J. Douglas Wingate (editor).

An excellent source for linguistic terminology.

If you have any interest in languages or linguistics you should give SIL International a visit.

BTW, the last update on the glossary page was in 2004 so if you can suggest some updates or additions, I am sure they would be appreciated.


Unker Non-Linear Writing System

Thursday, April 23rd, 2015

Unker Non-Linear Writing System by Alex Fink & Sai.

From the webpage:


“I understood from my parents, as they did from their parents, etc., that they became happier as they more fully grokked and were grokked by their cat.”[3]

Here is another snippet from the text:

Binding points, lines and relations

Every glyph includes a number of binding points, one for each of its arguments, the semantic roles involved in its meaning. For instance, the glyph glossed as eat has two binding points—one for the thing consumed and one for the consumer. The glyph glossed as (be) fish has only one, the fish. Often we give glosses more like “X eat Y”, so as to give names for the binding points (X is eater, Y is eaten).

A basic utterance in UNLWS is put together by writing out a number of glyphs (without overlaps) and joining up their binding points with lines. When two binding points are connected, this means the entities filling those semantic roles of the glyphs involved coincide. Thus when the ‘consumed’ binding point of eat is connected to the only binding point of fish, the connection refers to an eaten fish.

This is the main mechanism by which UNLWS clauses are assembled. To take a worked example, here are four glyphs:


If you are interested in graphical representations for design or presentation, this may be of interest.
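The binding-point mechanism quoted above lends itself to a graph model: glyphs carry named binding points (semantic roles), and lines between binding points assert that the roles' referents coincide. The following is a hypothetical data model of my own devising, not anything published by the UNLWS authors:

```python
# A hypothetical data model for the UNLWS mechanism described above:
# each glyph has named binding points (semantic roles), and an utterance
# joins binding points with lines, asserting that the connected roles
# refer to the same entity.
class Glyph:
    def __init__(self, gloss, roles):
        self.gloss = gloss
        self.roles = roles  # binding point names, e.g. ("eater", "eaten")

eat = Glyph("eat", ("eater", "eaten"))
fish = Glyph("fish", ("fish",))

# Connecting 'eaten' of eat to the only binding point of fish: "an eaten fish".
utterance = {
    "glyphs": [eat, fish],
    "lines": [((eat, "eaten"), (fish, "fish"))],
}

# Entities are equivalence classes of connected binding points
# (a small union-find over the lines).
def entities(utterance):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for g in utterance["glyphs"]:
        for r in g.roles:
            find((g, r))
    for a, b in utterance["lines"]:
        union(a, b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())

for group in entities(utterance):
    print(sorted(g.gloss + ":" + r for g, r in group))
```

Run on the eat/fish example, this yields two entities: the eater, and the eaten-thing-that-is-a-fish, which is exactly the "eaten fish" reading in the text.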

Sam Hunting forwarded this while we were exploring TeX graphics.

PS: The “cat” people on Twitter may appreciate the first graphic. 😉

North Korea vs. TED Talk

Thursday, March 12th, 2015

Quiz: North Korean Slogan or TED Talk Sound Bite? by Dave Gilson.

From the post:

North Korea recently released a list of 310 slogans, trying to rouse patriotic fervor for everything from obeying bureaucracy (“Carry out the tasks given by the Party within the time it has set”) to mushroom cultivation (“Let us turn ours into a country of mushrooms”) and aggressive athleticism (“Play sports games in an offensive way, the way the anti-Japanese guerrillas did!”). The slogans also urge North Koreans to embrace science and technology and adopt a spirit of can-do optimism—messages that might not be too out of place in a TED talk.

Can you tell which of the following exhortations are propaganda from Pyongyang and which are sound bites from TED speakers? (Exclamation points have been added to all TED quotes to match North Korean house style.)

When you discover the source of the quote, do you change your interpretation of its reasonableness?

All I will say about my score is that either I need to watch far more TED talks and/or pay closer attention to North Korean Radio. 😉


PS: I think a weekly quiz with White House, “terrorist” and Congressional quotes would be more popular than the New York Times Crossword puzzle.

Category theory for beginners

Monday, February 23rd, 2015

Category theory for beginners by Ken Scrambler

From the post:

Explains the basic concepts of Category Theory, useful terminology to help understand the literature, and why it’s so relevant to software engineering.

Some two hundred and nine (209) slides, ending with pointers to other resources.

I would have dearly loved to see the presentation live!

This slide deck comes as close as any I have seen to teaching category theory as you would a natural language. Not too close but closer than others.

Think about it. When you entered school did the teacher begin with the terminology of grammar and how rules of grammar fit together?

Or, did the teacher start you off with “See Jack run.” or its equivalent in your language?

You were well on your way to being a competent language user before you were tasked with learning the rules for that language.

Interesting that the exact opposite approach is taken with category theory and so many topics related to computer science.

Pointers to anyone using a natural language teaching approach for category theory or CS material?
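In that "See Jack run" spirit, here is my own sketch of how such a lesson might begin, before any terminology: objects as types, morphisms as functions, and the identity and associativity laws checked on sample values rather than proved. This is only an illustration of the pedagogical point, not material from the slide deck.

```python
# "See Jack run" for category theory: objects are types, morphisms are
# functions, and composition is the operation every programmer already uses.
def compose(g, f):
    """The categorical composite g . f: run f, then g."""
    return lambda x: g(f(x))

identity = lambda x: x

double = lambda n: n * 2          # int -> int
describe = lambda n: f"got {n}"   # int -> str

pipeline = compose(describe, double)  # int -> str

# The category laws, checked on sample values rather than proved:
assert pipeline(21) == "got 42"
# identity law: id . f == f == f . id
assert compose(identity, double)(5) == double(5) == compose(double, identity)(5)
# associativity: h . (g . f) == (h . g) . f
h, g, f = describe, double, identity
assert compose(h, compose(g, f))(3) == compose(compose(h, g), f)(3)
print("category laws hold on these samples")
```

Only after a learner is comfortable composing functions would the words "object," "morphism," and "functor" be introduced, just as grammar follows fluency.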

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Wednesday, February 18th, 2015

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars by Hua He, Jimmy Lin, Adam Lopez. (Transactions of the Association for Computational Linguistics, vol. 3, pp. 87–100, 2015.)


Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

If you are interested in cross-language search, DNA sequence alignment or other pattern matching problems, you need to watch the progress of this work.
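To see what "patterns that contain gaps" means, here is a naive CPU toy of my own: find occurrences of a two-part source pattern with a bounded gap, e.g. French "ne X pas", in a tiny corpus. This illustrates only the matching problem; it bears no resemblance to the paper's suffix-array-based GPU algorithm.

```python
# Toy "gappy" pattern matching: find a pattern of two token chunks
# separated by a gap of 1..max_gap tokens. Illustrative only; the paper's
# GPU algorithm over indexed parallel corpora is far more sophisticated.
def match_gappy(pattern, sentence, max_gap=4):
    """pattern: two token-tuples separated by a gap,
    e.g. (("ne",), ("pas",)) matches 'ne X pas'.
    Returns a list of (start, end, gap_filler_tokens)."""
    tokens = sentence.split()
    first, second = pattern
    results = []
    for i in range(len(tokens) - len(first) + 1):
        if tuple(tokens[i:i + len(first)]) != first:
            continue
        # look for the second chunk within the allowed gap
        for gap in range(1, max_gap + 1):
            j = i + len(first) + gap
            if j + len(second) > len(tokens):
                break
            if tuple(tokens[j:j + len(second)]) == second:
                results.append((i, j + len(second), tokens[i + len(first):j]))
    return results

corpus = [
    "je ne parle pas anglais",
    "elle ne le sait pas",
    "il parle anglais",
]
for sent in corpus:
    for start, end, filler in match_gappy((("ne",), ("pas",)), sent):
        print(sent, "->", filler)
```

The quadratic scan above is exactly the kind of per-sentence, data-parallel work that the paper offloads to a GPU for large batches.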

This article and other important research are freely accessible at: Transactions of the Association for Computational Linguistics

WorldWideScience.org (Update)

Wednesday, January 28th, 2015

I first wrote about WorldWideScience.org in a post dated October 17, 2011.

A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

A current description of the system from the customer story:

In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with WorldWideScience.org, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, WorldWideScience.org instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

Automated searching and translation can't give you the semantic nuances possible with human authoring, but they certainly can provide you with the source materials to build a specialized information resource with such semantics.

Very much a site to bookmark and use on a regular basis.

Links for subjects without them otherwise:

Deep Web Technologies

Microsoft Translator

Modelling Plot: On the “conversional novel”

Tuesday, January 20th, 2015

Modelling Plot: On the “conversional novel” by Andrew Piper.

From the post:

I am pleased to announce the acceptance of a new piece that will be appearing soon in New Literary History. In it, I explore techniques for identifying narratives of conversion in the modern novel in German, French and English. A great deal of new work has been circulating recently that addresses the question of plot structures within different genres and how we might or might not be able to model these computationally. My hope is that this piece offers a compelling new way of computationally studying different plot types and understanding their meaning within different genres.

Looking over recent work, in addition to Ben Schmidt’s original post examining plot “arcs” in TV shows using PCA, there have been posts by Ted Underwood and Matthew Jockers looking at novels, as well as a new piece in LLC that tries to identify plot units in fairy tales using the tools of natural language processing (frame nets and identity extraction). In this vein, my work offers an attempt to think about a single plot “type” (narrative conversion) and its role in the development of the novel over the long nineteenth century. How might we develop models that register the novel’s relationship to the narration of profound change, and how might such narratives be indicative of readerly investment? Is there something intrinsic, I have been asking myself, to the way novels ask us to commit to them? If so, does this have something to do with larger linguistic currents within them – not just a single line, passage, or character, or even something like “style” – but the way a greater shift of language over the course of the novel can be generative of affective states such as allegiance, belief or conviction? Can linguistic change, in other words, serve as an efficacious vehicle of readerly devotion?

While the full paper is available here, I wanted to post a distilled version of what I see as its primary findings. It’s a long essay that not only tries to experiment with the project of modelling plot, but also reflects on the process of model building itself and its place within critical reading practices. In many ways, it’s a polemic against the unfortunate binariness that surrounds debates in our field right now (distant/close, surface/depth etc.). Instead, I want us to see how computational modelling is in many ways conversional in nature, if by that we understand it as a circular process of gradually approaching some imaginary, yet never attainable centre, one that oscillates between both quantitative and qualitative stances (distant and close practices of reading).

Andrew writes of “…critical reading practices….” I’m not sure that technology will increase the use of “…critical reading practices…” but it certainly offers the opportunity to “read” texts in different ways.

I have done this with IT standards but never with a novel: attempt reading it from back to front, a sentence at a time. At least with text you are proofing, it provides a radically different perspective than the normal front-to-back reading. The first thing you notice is that it interrupts your reading/skimming speed, so you will catch more errors as well as nuances in the text.

Before you think that literary analysis is a bit far afield from “practical” application, remember that narratives (think literature) are what drive social policy and decision making.

Take the "war on terrorism" narrative that is currently so popular and unquestioned in the United States. Ask anyone inside the Beltway in D.C. and they will blather on and on about the need to defend against terrorism. But there is an absolute paucity of terrorists, at least by deed, in the United States. Why does the narrative persist in the absence of any evidence to support it?

The various Red Scares in U.S. history were similar narratives that have never completely faded. They too had a radical disconnect between the narrative and the “facts on the ground.”

Piper doesn’t offer answers to those sort of questions but a deeper understanding of narrative, such as is found in novels, may lead to hints with profound policy implications.

How Language Shapes Thought:…

Wednesday, December 24th, 2014

How Language Shapes Thought: The languages we speak affect our perceptions of the world by Lera Boroditsky.

From the article:

I am standing next to a five-year-old girl in Pormpuraaw, a small Aboriginal community on the western edge of Cape York in northern Australia. When I ask her to point north, she points precisely and without hesitation. My compass says she is right. Later, back in a lecture hall at Stanford University, I make the same request of an audience of distinguished scholars—winners of science medals and genius prizes. Some of them have come to this very room to hear lectures for more than 40 years. I ask them to close their eyes (so they don’t cheat) and point north. Many refuse; they do not know the answer. Those who do point take a while to think about it and then aim in all possible directions. I have repeated this exercise at Harvard and Princeton and in Moscow, London and Beijing, always with the same results.

A five-year-old in one culture can do something with ease that eminent scientists in other cultures struggle with. This is a big difference in cognitive ability. What could explain it? The surprising answer, it turns out, may be language.

Michael Nielsen mentioned this article in a tweet about a new book due out from Lera in the Fall of 2015.

Looking further I found: 7,000 Universes: How the Language We Speak Shapes the Way We Think [Kindle Edition] by Lera Boroditsky. (September, 2015, available for pre-order now)

As Michael says, looking forward to seeing this book! Sounds like a good title to forward to Steve Newcomb. Steve would argue, correctly I might add, that any natural language may contain an infinite number of possible universes of discourse.

I assume some of this issue will be caught by testing your topic map UIs with actual users in whatever subject domain and language you are offering information in. That is, rather than considering the influence of language in the abstract, you will be silently taking it into account through user feedback. You are testing your topic map deliverables with live users before delivery. Yes?

There are other papers by Lera available for your leisure reading.

The Sense of Style [25 December 2014 – 10 AM – C-SPAN2]

Tuesday, December 23rd, 2014

Steve Pinker discussing his book The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century.

From the description:

Steven Pinker talked about his book, The Sense of Style: The Thinking Person’s Guide to Writing in the 21st Century, in which he questions why so much of our writing today is bad. Professor Pinker said that while texting and the internet are blamed for developing bad writing habits, especially among young people, good writing has always been a difficult task.

The transcript, made for closed captioning, will convince you of the power of paragraphing if you attempt to read it. I may copy it, watch the lecture Christmas morning, insert paragraphing and ask CSPAN if they would like a corrected copy. 😉

One suggestion for learning to write (like learning to program), that I have heard but never followed, is to type out text written by known good writers. As you probably suspect, my excuse is a lack of time. Perhaps that will be a New Year’s resolution for the coming year.

Becoming a better writer automatically means you will be communicating better with your audience. For some of us that may be a plus or a negative. You have been forewarned.


In case you miss the broadcast, I found the video archive of the presentation. Nothing that will startle you but Pinker is an entertaining speaker.

I am watching the video early and Pinker points out an “inherent problem in the design of language.” [paraphrasing] We hold knowledge in a semantic network in our brains but when we use language to communicate some piece of that knowledge, the order of words in a sentence has to do two things at once:

* Serve as a code for meaning (who did what to whom)

* Present some bits of information to the reader before others (affects how the information is absorbed)

Pinker points out that the passive voice can make for better prose because it keeps the focus on the subject. (The passive is prevalent in bad prose, but Pinker argues that is due to the curse of knowledge.)

Question: Do we need a form of passive voice in computer languages? What would that look like?
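One speculative answer, not anything Pinker proposes: keyword arguments already give code a passive-like option. Positional arguments make presentation order carry the semantic roles; named arguments decouple the two, so you can lead with the "patient" while the roles stay fixed.

```python
# Pinker's two jobs of word order, read into code: with positional
# arguments, presentation order encodes who-did-what-to-whom; with
# keyword arguments the roles are named, so presentation order is free,
# much as the passive lets a sentence lead with the patient.
# A speculative illustration only.
def eat(eater, eaten):
    return f"{eater} eats {eaten}"

# "Active" order: agent first.
assert eat("cat", "fish") == "cat eats fish"

# "Passive-like" order: lead with the patient; same meaning, because
# roles are named rather than positional.
assert eat(eaten="fish", eater="cat") == "cat eats fish"
print(eat(eaten="fish", eater="cat"))
```

In Pinker's terms, the named roles serve as the code for meaning, leaving argument order free to present some bits of information to the reader before others.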


LT-Accelerate

Tuesday, December 16th, 2014

LT-Accelerate: LT-Accelerate is a conference designed to help businesses, researchers and public administrations discover business value via Language Technology.

From the about page:

LT-Accelerate is a joint production of LT-Innovate, the European Association of the Language Technology Industry, and Alta Plana Corporation, a Washington DC based strategy consultancy headed by analyst Seth Grimes.

Held December 4-5, 2014 in Brussels, the website reports seven (7) interviews with key speakers and slides from thirty-eight speakers.

Not as in depth as papers nor as useful as videos of the presentations but still capable of sparking new ideas as you review the slides.

For example, the slides from Multi-Dimensional Sentiment Analysis by Stephen Pulman made me wonder what sentiment detection design would be appropriate for the Michael Brown grand jury transcripts?

Sentiment detection has been successfully used with tweets (140 character limit) and I am reliably informed that most of the text strings in the Michael Brown grand jury transcript are far longer than one hundred and forty (140) characters. 😉

Any sentiment detectives in the audience?

Inheritance Patterns in Citation Networks Reveal Scientific Memes

Sunday, December 14th, 2014

Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)


Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

Popular Summary:

It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.
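To make the "frequency times propagation" idea concrete, here is a deliberately simplified meme score over a toy citation graph of my own invention: a term's frequency multiplied by how much more often it appears in papers that cite a term-carrying paper than in papers that do not. The paper's actual formula adds smoothing terms and carefully controlled sticking and sparking factors, so treat this strictly as an illustration.

```python
# A simplified meme score over a toy citation graph: frequency of a term
# times a propagation ratio (how much more often the term appears in
# papers citing a term-carrying paper than in papers that do not).
# Illustrative reduction only; the paper's formula differs in detail.
papers = {
    # paper id -> (terms in the paper, ids of papers it cites)
    1: ({"graphene"}, set()),
    2: ({"graphene"}, {1}),
    3: ({"graphene"}, {2}),
    4: (set(), {1}),
    5: ({"black"}, set()),
    6: (set(), {5}),
}

def meme_score(term, papers):
    has = {p for p, (terms, _) in papers.items() if term in terms}
    frequency = len(has) / len(papers)
    cites_carrier = {p for p, (_, cites) in papers.items() if cites & has}
    no_carrier = set(papers) - cites_carrier
    # propagation: P(term | cites a carrier) / P(term | cites no carrier)
    num = len(cites_carrier & has) / len(cites_carrier) if cites_carrier else 0
    den = len(no_carrier & has) / len(no_carrier) if no_carrier else 1
    propagation = num / den if den else float("inf")
    return frequency * propagation

print(meme_score("graphene", papers))  # propagates along citations
print(meme_score("black", papers))     # present, but does not propagate
```

Even this crude ratio separates a term that travels with citations ("graphene") from one that merely occurs ("black"), which is the paper's core intuition.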

You definitely should grab the PDF version of this article for printing and a slow read.

From Section III Discussion:

We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.

Fair enough, but "black," "inflation," and "traffic flow" all appear in the top fifty memes in physics. I don't know that I would consider any of them to be "memes."

There is much left to be discovered about memes, such as who is good at propagating them. It would not hurt if your research paper were the origin of a very popular meme.

I first saw this in a tweet by Max Fisher.

When Do Natural Language Metaphors Influence Reasoning?…

Thursday, December 11th, 2014

When Do Natural Language Metaphors Influence Reasoning? A Follow-Up Study to Thibodeau and Boroditsky (2013) by Gerard J. Steen, W. Gudrun Reijnierse, and Christian Burgers.


In this article, we offer a critical view of Thibodeau and Boroditsky who report an effect of metaphorical framing on readers’ preference for political measures after exposure to a short text on the increase of crime in a fictitious town: when crime was metaphorically presented as a beast, readers became more enforcement-oriented than when crime was metaphorically framed as a virus. We argue that the design of the study has left room for alternative explanations. We report four experiments comprising a follow-up study, remedying several shortcomings in the original design while collecting more encompassing sets of data. Our experiments include three additions to the original studies: (1) a non-metaphorical control condition, which is contrasted to the two metaphorical framing conditions used by Thibodeau and Boroditsky, (2) text versions that do not have the other, potentially supporting metaphors of the original stimulus texts, (3) a pre-exposure measure of political preference (Experiments 1–2). We do not find a metaphorical framing effect but instead show that there is another process at play across the board which presumably has to do with simple exposure to textual information. Reading about crime increases people’s preference for enforcement irrespective of metaphorical frame or metaphorical support of the frame. These findings suggest the existence of boundary conditions under which metaphors can have differential effects on reasoning. Thus, our four experiments provide converging evidence raising questions about when metaphors do and do not influence reasoning.

The influence of metaphors on reasoning raises an interesting question for those attempting to duplicate the human brain in silicon: Can a previously recorded metaphor influence the outcome of AI reasoning?

Or can hearing the same information multiple times from different sources influence an AI’s perception of the validity of that information? (In a non-AI context, a relevant question for the Michael Brown grand jury discussion.)

On its own merits, a very good read and recommended to anyone who enjoys language issues.

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

Saturday, December 6th, 2014

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

From the post:

A dialect is a particular form of language that is limited to a specific location or population group. Linguists are fascinated by these variations because they are determined both by geography and by demographics. So studying them can produce important insights into the nature of society and how different groups within it interact.

That’s why linguists are keen to understand how new words, abbreviations and usages spread on new forms of electronic communication, such as social media platforms. It is easy to imagine that the rapid spread of neologisms could one day lead to a single unified dialect of netspeak. An interesting question is whether there is any evidence that this is actually happening.

Today, we get a fascinating insight into this problem thanks to the work of Jacob Eisenstein at the Georgia Institute of Technology in Atlanta and a few pals. These guys have measured the spread of neologisms on Twitter and say they have clear evidence that online language is not converging at all. Indeed, they say that electronic dialects are just as common as ordinary ones and seem to reflect the same fault lines in society.

Disappointment for those who thought the Net would help people overcome the curse of Babel.

When we move into new languages or means of communication, we simply take our linguistic diversity with us, like well traveled but familiar luggage.

If you think about it, the difficulties of multiple semantics for owl:sameAs are another instance of the same phenomenon. Semantically distinct groups assigned the same token, owl:sameAs, different semantics. That should not have been a surprise. But it was, and it will be every time one community privileges itself as the giver of meaning for any term.

If you want to see the background for the post in full:

Diffusion of Lexical Change in Social Media by Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing.


Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter’s sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity — especially with regard to race — plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.

Hebrew Astrolabe:…

Thursday, December 4th, 2014

Hebrew Astrolabe: A History of the World in 100 Objects, Status Symbols (1200 – 1400 AD) by Neil MacGregor.

From the webpage:

Neil MacGregor’s world history as told through objects at the British Museum. This week he is exploring high status objects from across the world around 700 years ago. Today he has chosen an astronomical instrument that could perform multiple tasks in the medieval age, from working out the time to preparing horoscopes. It is called an astrolabe and originates from Spain at a time when Christianity, Islam and Judaism coexisted and collaborated with relative ease – indeed this instrument carries symbols recognisable to all three religions. Neil considers who it was made for and how it was used. The astrolabe’s curator, Silke Ackermann, describes the device and its markings, while the historian Sir John Elliott discusses the political and religious climate of 14th century Spain. Was it as tolerant as it seems?

The astrolabe that is the focus of this podcast is quite remarkable. The Hebrew, Arabic and Spanish words on this astrolabe are all written in Hebrew characters.

Would you say that is multilingual?

BTW, this series from the British Museum will not be available indefinitely so start listening to these podcasts soon!

Cliques are nasty but Cliques are nastier

Tuesday, December 2nd, 2014

Cliques are nasty but Cliques are nastier by Lance Fortnow.

A heteronym that fails to make the listing at: The Heteronym Homepage.

From the Heteronym Homepage:

Heteronyms are words that are spelled identically but have different meanings when pronounced differently.

Before you jump to Lance’s post (see the comments as well), care to guess the pronunciations and meanings of “clique?”


Old World Language Families

Sunday, November 30th, 2014

language tree

By design (limitations of space), not all languages were included.

Despite that, the original post has gotten seven hundred and twenty-two (722) comments as of today. A large number of which mention wanting a poster of this visualization.

I could assemble the same information, sans the interesting graphic, and get no comments and no requests for a poster version.


What makes this presentation (map) compelling? Could you transfer it to another body of information with the same impact?

What do you make of: “The approximate sizes of our known living language populations, compared to year 0.”

Suggested reading on what makes some graphics compelling and others not?

Originally from: Stand Still Stay Silent Comic, although I first saw it at: Old World Language Families by Randy Krum.

PS: For extra credit, how many languages can you name that don’t appear on this map?

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are usually based on only one language, usually English. However, it would be better to be able to process documents in any language. For example, a case in a recommender system would be a user that speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”. So, for next recommendations, we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. If the user gave positive ratings to pages in English containing “Airplane”, and in French containing “Avion”, we would also be able to merge them easily into the same keyword to build a language-independent user profile that will be used for accurate French and English recommendations.

This article shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.
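The “Airplane”/“Avion” merging described above can be sketched in a few lines. This is a minimal illustration (mine, not from the article): the CROSS_LINKS table is a tiny hand-built stand-in for Wikipedia’s real interlanguage (“langlinks”) data, and the Q-style concept ids are purely illustrative.

```python
# Map (language, keyword) pairs to a shared concept id, then aggregate
# user ratings per concept rather than per surface keyword, so French
# and English ratings land in the same language-independent profile.
from collections import defaultdict

CROSS_LINKS = {
    ("en", "Airplane"): "Q197",   # illustrative concept ids, not real lookups
    ("fr", "Avion"): "Q197",
    ("en", "Cheese"): "Q10943",
    ("fr", "Fromage"): "Q10943",
}

def concept_for(lang, keyword):
    """Resolve a (language, keyword) pair to a language-independent concept."""
    return CROSS_LINKS.get((lang, keyword))

def build_profile(ratings):
    """Aggregate (lang, keyword, score) ratings into a per-concept profile."""
    profile = defaultdict(float)
    for lang, keyword, score in ratings:
        concept = concept_for(lang, keyword)
        if concept is not None:
            profile[concept] += score
    return dict(profile)

ratings = [("en", "Airplane", 1.0), ("fr", "Avion", 1.0), ("fr", "Fromage", -0.5)]
print(build_profile(ratings))  # {'Q197': 2.0, 'Q10943': -0.5}
```

In a real system the lookup table would be built from Wikipedia’s langlinks dump or queried through the Wikipedia Miner toolkit the article uses, but the aggregation step is the same.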

The NSA can hire translators so I would not bother sharing this technique for harnessing the thousands of expert hours in Wikipedia with them.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.