Archive for the ‘Language’ Category

Inheritance Patterns in Citation Networks Reveal Scientific Memes

Sunday, December 14th, 2014

Inheritance Patterns in Citation Networks Reveal Scientific Memes by Tobias Kuhn, Matjaž Perc, and Dirk Helbing. (Phys. Rev. X 4, 041036 – Published 21 November 2014.)


Memes are the cultural equivalent of genes that spread across human culture by means of imitation. What makes a meme and what distinguishes it from other forms of information, however, is still poorly understood. Our analysis of memes in the scientific literature reveals that they are governed by a surprisingly simple relationship between frequency of occurrence and the degree to which they propagate along the citation graph. We propose a simple formalization of this pattern and validate it with data from close to 50 million publication records from the Web of Science, PubMed Central, and the American Physical Society. Evaluations relying on human annotators, citation network randomizations, and comparisons with several alternative approaches confirm that our formula is accurate and effective, without a dependence on linguistic or ontological knowledge and without the application of arbitrary thresholds or filters.

Popular Summary:

It is widely known that certain cultural entities—known as “memes”—in a sense behave and evolve like genes, replicating by means of human imitation. A new scientific concept, for example, spreads and mutates when other scientists start using and refining the concept and cite it in their publications. Unlike genes, however, little is known about the characteristic properties of memes and their specific effects, despite their central importance in science and human culture in general. We show that memes in the form of words and phrases in scientific publications can be characterized and identified by a simple mathematical regularity.

We define a scientific meme as a short unit of text that is replicated in citing publications (“graphene” and “self-organized criticality” are two examples). We employ nearly 50 million digital publication records from the American Physical Society, PubMed Central, and the Web of Science in our analysis. To identify and characterize scientific memes, we define a meme score that consists of a propagation score—quantifying the degree to which a meme aligns with the citation graph—multiplied by the frequency of occurrence of the word or phrase. Our method does not require arbitrary thresholds or filters and does not depend on any linguistic or ontological knowledge. We show that the results of the meme score are consistent with expert opinion and align well with the scientific concepts described on Wikipedia. The top-ranking memes, furthermore, have interesting bursty time dynamics, illustrating that memes are continuously developing, propagating, and, in a sense, fighting for the attention of scientists.

Our results open up future research directions for studying memes in a comprehensive fashion, which could lead to new insights in fields as disparate as cultural evolution, innovation, information diffusion, and social media.
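The meme score described in the summary (frequency of occurrence multiplied by a propagation score over the citation graph) can be illustrated with a toy citation network. The sketch below is a simplification of the paper’s formula, and all the data are invented:

```python
# Toy sketch of the meme-score idea: frequency of a phrase times a
# propagation score (how strongly carriers of the phrase cite other
# carriers, versus picking the phrase up "out of thin air").
# A simplification of the paper's formula; all data are invented.

def meme_score(phrase, papers, cites):
    """papers: id -> set of phrases; cites: id -> set of cited ids."""
    carriers = {p for p, phrases in papers.items() if phrase in phrases}
    frequency = len(carriers) / len(papers)

    # Sticking: among papers citing at least one carrier, the fraction
    # that carry the phrase themselves.
    citing_carrier = {p for p, refs in cites.items() if refs & carriers}
    sticking = len(citing_carrier & carriers) / max(len(citing_carrier), 1)

    # Sparking: among papers citing no carrier, the fraction that
    # nevertheless carry the phrase.
    not_citing = set(papers) - citing_carrier
    sparking = len(not_citing & carriers) / max(len(not_citing), 1)

    return frequency * sticking / max(sparking, 1e-9)

papers = {1: {"graphene"}, 2: {"graphene"}, 3: {"noise"}, 4: set()}
cites = {1: set(), 2: {1}, 3: {4}, 4: set()}
print(meme_score("graphene", papers, cites))  # propagates along citations
print(meme_score("noise", papers, cites))     # does not
```

A phrase that replicates along citation edges (“graphene” here) scores higher than one that appears without citation support, which is the core intuition behind the score.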

You definitely should grab the PDF version of this article for printing and a slow read.

From Section III Discussion:

We show that the meme score can be calculated exactly and exhaustively without the introduction of arbitrary thresholds or filters and without relying on any kind of linguistic or ontological knowledge. The method is fast and reliable, and it can be applied to massive databases.

Fair enough, but “black,” “inflation,” and “traffic flow” all appear in the top fifty memes in physics. I don’t know that I would consider any of them to be “memes.”

There is much left to be discovered about memes, such as: who is good at propagating them? It would not hurt if your research paper turned out to be the origin of a very popular meme.

I first saw this in a tweet by Max Fisher.

When Do Natural Language Metaphors Influence Reasoning?…

Thursday, December 11th, 2014

When Do Natural Language Metaphors Influence Reasoning? A Follow-Up Study to Thibodeau and Boroditsky (2013) by Gerard J. Steen, W. Gudrun Reijnierse, and Christian Burgers.


In this article, we offer a critical view of Thibodeau and Boroditsky who report an effect of metaphorical framing on readers’ preference for political measures after exposure to a short text on the increase of crime in a fictitious town: when crime was metaphorically presented as a beast, readers became more enforcement-oriented than when crime was metaphorically framed as a virus. We argue that the design of the study has left room for alternative explanations. We report four experiments comprising a follow-up study, remedying several shortcomings in the original design while collecting more encompassing sets of data. Our experiments include three additions to the original studies: (1) a non-metaphorical control condition, which is contrasted to the two metaphorical framing conditions used by Thibodeau and Boroditsky, (2) text versions that do not have the other, potentially supporting metaphors of the original stimulus texts, (3) a pre-exposure measure of political preference (Experiments 1–2). We do not find a metaphorical framing effect but instead show that there is another process at play across the board which presumably has to do with simple exposure to textual information. Reading about crime increases people’s preference for enforcement irrespective of metaphorical frame or metaphorical support of the frame. These findings suggest the existence of boundary conditions under which metaphors can have differential effects on reasoning. Thus, our four experiments provide converging evidence raising questions about when metaphors do and do not influence reasoning.

The influence of metaphors on reasoning raises an interesting question for those attempting to duplicate the human brain in silicon: Can a previously recorded metaphor influence the outcome of AI reasoning?

Or can hearing the same information multiple times from different sources influence an AI’s perception of the validity of that information? (In a non-AI context, a relevant question for the Michael Brown grand jury discussion.)

On its own merits, a very good read, recommended to anyone who enjoys language issues.

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

Saturday, December 6th, 2014

Cultural Fault Lines Determine How New Words Spread On Twitter, Say Computational Linguists

From the post:

A dialect is a particular form of language that is limited to a specific location or population group. Linguists are fascinated by these variations because they are determined both by geography and by demographics. So studying them can produce important insights into the nature of society and how different groups within it interact.

That’s why linguists are keen to understand how new words, abbreviations and usages spread on new forms of electronic communication, such as social media platforms. It is easy to imagine that the rapid spread of neologisms could one day lead to a single unified dialect of netspeak. An interesting question is whether there is any evidence that this is actually happening.

Today, we get a fascinating insight into this problem thanks to the work of Jacob Eisenstein at the Georgia Institute of Technology in Atlanta and a few pals. These guys have measured the spread of neologisms on Twitter and say they have clear evidence that online language is not converging at all. Indeed, they say that electronic dialects are just as common as ordinary ones and seem to reflect the same fault lines in society.

Disappointment for those who thought the Net would help people overcome the curse of Babel.

When we move into new languages or means of communication, we simply take our linguistic diversity with us, like well-traveled but familiar luggage.

If you think about it, the difficulty with the multiple semantics of OWL’s owl:sameAs is another instance of the same phenomenon: semantically distinct groups assigned the same token, owl:sameAs, different semantics. That should not have been a surprise, but it was, and it will be every time one community privileges itself as the giver of meaning for any term.

If you want to see the background for the post in full:

Diffusion of Lexical Change in Social Media by Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, Eric P. Xing.


Computer-mediated communication is driving fundamental changes in the nature of written language. We investigate these changes by statistical analysis of a dataset comprising 107 million Twitter messages (authored by 2.7 million unique user accounts). Using a latent vector autoregressive model to aggregate across thousands of words, we identify high-level patterns in diffusion of linguistic change over the United States. Our model is robust to unpredictable changes in Twitter’s sampling rate, and provides a probabilistic characterization of the relationship of macro-scale linguistic influence to a set of demographic and geographic predictors. The results of this analysis offer support for prior arguments that focus on geographical proximity and population size. However, demographic similarity — especially with regard to race — plays an even more central role, as cities with similar racial demographics are far more likely to share linguistic influence. Rather than moving towards a single unified “netspeak” dialect, language evolution in computer-mediated communication reproduces existing fault lines in spoken American English.
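The “latent vector autoregressive model” in the abstract builds on ordinary vector autoregression, which is easy to illustrate in miniature: each city’s usage of a word at the next time step is a linear function of all cities’ usage at the current step. The influence matrix and values below are invented; this shows the generic mechanism, not the paper’s latent model:

```python
# Minimal vector autoregression step: usage[t+1] = A @ usage[t].
# A[i][j] captures how much city j influences city i.
# Generic illustration only, not the paper's latent model.

def var_step(A, usage):
    return [sum(a * u for a, u in zip(row, usage)) for row in A]

# Two hypothetical cities: city 0 influences city 1 strongly.
A = [[0.9, 0.0],
     [0.5, 0.5]]
usage = [1.0, 0.0]          # a new word appears only in city 0
for _ in range(3):
    usage = var_step(A, usage)
print(usage)                # the word has diffused to city 1
```

With an asymmetric influence matrix like this one, the model can express exactly the kind of directional, demographically structured diffusion the paper reports.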

Hebrew Astrolabe:…

Thursday, December 4th, 2014

Hebrew Astrolabe: A History of the World in 100 Objects, Status Symbols (1200 – 1400 AD) by Neil MacGregor.

From the webpage:

Neil MacGregor’s world history as told through objects at the British Museum. This week he is exploring high status objects from across the world around 700 years ago. Today he has chosen an astronomical instrument that could perform multiple tasks in the medieval age, from working out the time to preparing horoscopes. It is called an astrolabe and originates from Spain at a time when Christianity, Islam and Judaism coexisted and collaborated with relative ease – indeed this instrument carries symbols recognisable to all three religions. Neil considers who it was made for and how it was used. The astrolabe’s curator, Silke Ackermann, describes the device and its markings, while the historian Sir John Elliott discusses the political and religious climate of 14th century Spain. Was it as tolerant as it seems?

The astrolabe that is the focus of this podcast is quite remarkable. The Hebrew, Arabic and Spanish words on this astrolabe are all written in Hebrew characters.

Would you say that is multilingual?

BTW, this series from the British Museum will not be available indefinitely so start listening to these podcasts soon!

Cliques are nasty but Cliques are nastier

Tuesday, December 2nd, 2014

Cliques are nasty but Cliques are nastier by Lance Fortnow.

A heteronym that fails to make the listing at: The Heteronym Homepage.

From the Heteronym Homepage:

Heteronyms are words that are spelled identically but have different meanings when pronounced differently.

Before you jump to Lance’s post (see the comments as well), care to guess the pronunciations and meanings of “clique?”


Old World Language Families

Sunday, November 30th, 2014

[Image: language family tree visualization]

By design (a limitation of space), not all languages were included.

Despite that, the original post has gotten seven hundred and twenty-two (722) comments as of today. A large number of which mention wanting a poster of this visualization.

I could assemble the same information, sans the interesting graphic, and get no comments and no requests for a poster version.


What makes this presentation (map) compelling? Could you transfer it to another body of information with the same impact?

What do you make of: “The approximate sizes of our known living language populations, compared to year 0.”

Suggested reading on what makes some graphics compelling and others not?

Originally from: Stand Still Stay Silent Comic, although I first saw it at: Old World Language Families by Randy Krum.

PS: For extra credit, how many languages can you name that don’t appear on this map?

Building a language-independent keyword-based system with the Wikipedia Miner

Monday, October 27th, 2014

Building a language-independent keyword-based system with the Wikipedia Miner by Gauthier Lemoine.

From the post:

Extracting keywords from texts and HTML pages is a common subject that opens doors to a lot of potential applications. These include classification (what is this page topic?), recommendation systems (identifying user likes to recommend the more accurate content), search engines (what is this page about?), document clustering (how can I pack different texts into a common group) and much more.

Most applications of these are based on only one language, usually English. However, it would be better to be able to process documents in any language. For example, consider a recommender system with a user who speaks French and English. In his history, he gave positive ratings to a few pages containing the keyword “Airplane”, so for the next recommendations we would boost this keyword. With a language-independent approach, we would also be able to boost pages containing “Avion”, the French term for airplane. If the user gave positive ratings to pages in English containing “Airplane” and to pages in French containing “Avion”, we would be able to merge them easily into the same keyword, building a language-independent user profile for accurate French and English recommendations.

This article shows one way to achieve good results using an easy strategy. It is obvious that we can achieve better results using more complex algorithms.
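The merging step the post describes can be sketched by mapping each (language, keyword) pair to a shared concept identifier. The tiny lookup table below is a stand-in for Wikipedia’s interlanguage links; the concept ids and data are invented for illustration:

```python
# Merge keywords across languages by mapping each surface form to a
# shared Wikipedia concept. The link table is invented; a real system
# would use interlanguage-link data, e.g. via the Wikipedia Miner.

CONCEPT_OF = {
    ("en", "airplane"): "Q197",    # hypothetical concept ids
    ("fr", "avion"): "Q197",
    ("en", "graphene"): "Q83202",
}

def merged_profile(rated_keywords):
    """rated_keywords: list of ((lang, word), rating) pairs."""
    profile = {}
    for key, rating in rated_keywords:
        concept = CONCEPT_OF.get(key)
        if concept is not None:
            profile[concept] = profile.get(concept, 0) + rating
    return profile

ratings = [(("en", "airplane"), 1), (("fr", "avion"), 1)]
print(merged_profile(ratings))   # both ratings land on one concept
```

Both the English and French ratings accumulate on a single concept, which is exactly the language-independent user profile described in the quoted example.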

The NSA can hire translators so I would not bother sharing this technique for harnessing the thousands of expert hours in Wikipedia with them.

Bear in mind that Wikipedia does not reach a large number of minority languages, dialects, and certainly not deliberate obscurity in any language. Your mileage will vary depending upon your particular use case.

Grammatical theory: From transformational grammar to constraint-based approaches

Wednesday, October 22nd, 2014

Grammatical theory: From transformational grammar to constraint-based approaches by Stefan Müller.

From the webpage:

To appear 2015 in Lecture Notes in Language Sciences, No 1, Berlin: Language Science Press. The book is a translation and extension of the second edition of my grammar theory book that appeared 2010 in the Stauffenburg Verlag.

This book introduces formal grammar theories that play a role in current linguistics or contributed tools that are relevant for current linguistic theorizing (Phrase Structure Grammar, Transformational Grammar/Government & Binding, Generalized Phrase Structure Grammar, Lexical Functional Grammar, Categorial Grammar, Head-Driven Phrase Structure Grammar, Construction Grammar, Tree Adjoining Grammar). The key assumptions are explained and it is shown how the respective theory treats arguments and adjuncts, the active/passive alternation, local reorderings, verb placement, and fronting of constituents over long distances. The analyses are explained with German as the object language.

In a final chapter the approaches are compared with respect to their predictions regarding language acquisition and psycholinguistic plausibility. The Nativism hypothesis that assumes that humans possess genetically determined innate language-specific knowledge is examined critically and alternative models of language acquisition are discussed. In addition this chapter addresses issues that are discussed controversially in current theory building, as for instance the question whether flat or binary branching structures are more appropriate, the question whether constructions should be treated on the phrasal or the lexical level, and the question whether abstract, non-visible entities should play a role in syntactic analyses. It is shown that the analyses that are suggested in the respective frameworks are often translatable into each other. The book closes with a section that shows how properties that are common to all languages or to certain language classes can be captured.

The webpage offers a download link for the current draft, teaching materials and a BibTeX file of all publications that the author cites in his works.

Interesting because of the application of these models to a language other than English and the author’s attempt to help readers avoid semantic confusion:

Unfortunately, linguistics is a scientific field which is afflicted by an unbelievable degree of terminological chaos. This is partly due to the fact that terminology originally defined for certain languages (e. g. Latin, English) was later simply adopted for the description of other languages as well. However, this is not always appropriate since languages differ from one another greatly and are constantly changing. Due to the problems this caused, the terminology started to be used differently or new terms were invented. When new terms are introduced in this book, I will always mention related terminology or differing uses of each term so that readers can relate this to other literature.

Unfortunately, it does not appear that the author gathered the new terms into a table or list. Creating such a list from the book would be a very useful project.

Growing a Language

Saturday, September 20th, 2014

Growing a Language by Guy L. Steele, Jr.

The first paper in a new series of posts from the Hacker School blog, “Paper of the Week.”

I haven’t found a good way to summarize Steele’s paper but can observe that a central theme is the growth of programming languages.

While enjoying the Steele paper, ask yourself how would you capture the changing nuances of a language, natural or artificial?


Python-ZPar – Python Wrapper for ZPAR

Monday, September 8th, 2014

Python-ZPar – Python Wrapper for ZPAR by Nitin Madnani.

From the webpage:

python-zpar is a python wrapper around the ZPar parser. ZPar was written by Yue Zhang while he was at Oxford University. According to its home page: ZPar is a statistical natural language parser, which performs syntactic analysis tasks including word segmentation, part-of-speech tagging and parsing. ZPar supports multiple languages and multiple grammar formalisms. ZPar has been most heavily developed for Chinese and English, while it provides generic support for other languages. ZPar is fast, processing above 50 sentences per second using the standard Penn Treebank (Wall Street Journal) data.

I wrote python-zpar since I needed a fast and efficient parser for my NLP work which is primarily done in Python and not C++. I wanted to be able to use this parser directly from Python without having to create a bunch of files and run them through subprocesses. python-zpar not only provides a simple python wrapper but also provides an XML-RPC ZPar server to make batch-processing of large files easier.

python-zpar uses ctypes, a very cool foreign function library bundled with Python that allows calling functions in C DLLs or shared libraries directly.
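The ctypes mechanism mentioned above is part of the Python standard library. Here is a minimal example of calling a function in a C shared library directly; libm’s sqrt stands in for ZPar’s much larger C++ interface:

```python
# Calling a C shared library directly from Python via ctypes,
# the same mechanism python-zpar uses to wrap ZPar.
import ctypes
import ctypes.util

# Locate and load the C math library.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments correctly.
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))   # → 1.4142135623730951
```

Without the argtypes/restype declarations, ctypes would pass a Python float as an integer and return garbage, which is why wrappers like python-zpar declare every foreign function’s signature up front.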

Just in case you are looking for a language parser for Chinese or English.

It is only a matter of time before commercial opportunities are going to force greater attention on non-English languages. Forewarned is forearmed.

How Could Language Have Evolved?

Monday, September 1st, 2014

How Could Language Have Evolved? by Johan J. Bolhuis, Ian Tattersall, Noam Chomsky, Robert C. Berwick.


The evolution of the faculty of language largely remains an enigma. In this essay, we ask why. Language’s evolutionary analysis is complicated because it has no equivalent in any nonhuman species. There is also no consensus regarding the essential nature of the language “phenotype.” According to the “Strong Minimalist Thesis,” the key distinguishing feature of language (and what evolutionary theory must explain) is hierarchical syntactic structure. The faculty of language is likely to have emerged quite recently in evolutionary terms, some 70,000–100,000 years ago, and does not seem to have undergone modification since then, though individual languages do of course change over time, operating within this basic framework. The recent emergence of language and its stability are both consistent with the Strong Minimalist Thesis, which has at its core a single repeatable operation that takes exactly two syntactic elements a and b and assembles them to form the set {a, b}.
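That “single repeatable operation” (known in the minimalist literature as Merge) is simple enough to sketch as code: it takes exactly two syntactic elements and forms the set {a, b}, and repeated application yields hierarchical structure. The toy sentence below is illustrative only, not a serious linguistic analysis:

```python
# Merge: take exactly two syntactic objects and form the set {a, b}.
# Repeating the one operation yields hierarchical, binary-branching
# structure. A toy illustration, not a linguistic formalism.

def merge(a, b):
    return frozenset([a, b])

def depth(x):
    """Nesting depth of a merged structure."""
    if isinstance(x, frozenset):
        return 1 + max(depth(e) for e in x)
    return 0

np = merge("the", "cake")    # {the, cake}
vp = merge("ate", np)        # {ate, {the, cake}}
s = merge("dog", vp)         # {dog, {ate, {the, cake}}}

print(depth(s))   # → 3
```

Note that merge("a", "b") equals merge("b", "a"): the operation produces an unordered set, so hierarchy (not linear order) is what the operation directly delivers, matching the abstract’s characterization.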

Interesting that Chomsky and his co-authors have seized upon “hierarchical syntactic structure” as “the key distinguishing feature of language.”

Remember text as an Ordered Hierarchy of Content Objects (OHCO), which has made the rounds in markup circles since 1993. Its staying power is quite surprising, since examples are hard to find outside of markup text encodings. Your average text prior to markup can be mapped to OHCO only with difficulty in most cases.

Syntactic structures are attributed to languages by analysts, so be mindful that any “hierarchical syntactic structure” is entirely of human origin, separate and apart from language itself.

New York Times Annotated Corpus Add-On

Wednesday, August 27th, 2014

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

  • entity index
  • automatically inferred salience {0, 1}
  • mention count (from our coreference system)
  • first mention’s text
  • byte offset start position for the first mention
  • byte offset end position for the first mention
  • MID (from our entity resolution system)
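A parser for one such entity line can be sketched as follows. The field names are my own labels for the fields described above, and the sample line is invented, not taken from the corpus:

```python
# Parse one tab-separated entity annotation line from the NYT
# salience release. Field names follow the release's description;
# the sample line is invented for illustration.

FIELDS = ["entity_index", "salience", "mention_count",
          "first_mention", "byte_start", "byte_end", "mid"]

def parse_entity_line(line):
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    # Numeric fields arrive as strings; convert them.
    for key in ("entity_index", "salience", "mention_count",
                "byte_start", "byte_end"):
        record[key] = int(record[key])
    return record

sample = "0\t1\t7\tBarack Obama\t120\t132\t/m/02mjmr"
print(parse_entity_line(sample)["salience"])   # → 1
```

Document boundaries (the empty lines and the id/title header lines) would be handled by an outer loop; this sketch covers only the per-entity records.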

The background in Teaching machines to read between the lines (and a new corpus with entity salience annotations) by Dan Gillick and Dave Orr, will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is: if we “recognize” an entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see enough split-up posts and other content that forces readers to endure more ads that I consciously avoid buying anything for which I see a web ad. I suggest you do the same, if possible. I buy books, for example, because someone known to me recommends them, not because some marketeer pushes them at me across many domains.

Biscriptal juxtaposition in Chinese

Tuesday, August 26th, 2014

Biscriptal juxtaposition in Chinese by Victor Mair.

From the post:

We have often seen how the Roman alphabet is creeping into Chinese writing, both for expressing English words and morphemes that have been borrowed into Chinese, but also increasingly for writing Mandarin and other varieties of Chinese in Pinyin (spelling). Here are just a few earlier Language Log posts dealing with this phenomenon:

“A New Morpheme in Mandarin” (4/26/11)

“Zhao C: a Man Who Lost His Name” (2/27/09)

“Creeping Romanization in Chinese” (8/30/12)

Now an even more intricate application of alphabetic usage is developing in internet writing, namely, the juxtaposition and intertwining of simultaneous phrases with contrasting meaning.

Highly entertaining post on the complexities of evolving language usage.

The sort of usage that hasn’t made it into a dictionary, yet, but still needs to be captured and shared.

Sam Hunting brought this to my attention.

Using Category Theory to design…

Tuesday, July 29th, 2014

Using Category Theory to design implicit conversions and generic operators by John C. Reynolds.


A generalization of many-sorted algebras, called category-sorted algebras, is defined and applied to the language-design problem of avoiding anomalies in the interaction of implicit conversions and generic operators. The definition of a simple imperative language (without any binding mechanisms) is used as an example.

The greatest exposure most people have to implicit conversions is that they are handled properly.

This paper dates from 1980, so some of the category theory jargon will seem odd, but consider it a “practical” application of category theory.

That should hold your interest. 😉

I first saw this in a tweet by scottfleischman.

Introducing Source Han Sans:…

Wednesday, July 16th, 2014

Introducing Source Han Sans: An open source Pan-CJK typeface by Caleb Belohlavek.

From the post:

Adobe, in partnership with Google, is pleased to announce the release of Source Han Sans, a new open source Pan-CJK typeface family that is now available on Typekit for desktop use. If you don’t have a Typekit account, it’s easy to set one up and start using the font immediately with our free subscription. And for those who want to play with the original source files, you can get those from our download page on SourceForge.

It’s rather difficult to describe your semantics when you can’t write in your own language.

Kudos to Adobe and Google for sponsoring this project!

I first saw this in a tweet by James Clark.

An Empirical Investigation into Programming Language Syntax

Monday, July 14th, 2014

An Empirical Investigation into Programming Language Syntax by Greg Wilson.

A great synopsis of Andreas Stefik and Susanna Siebert’s “An Empirical Investigation into Programming Language Syntax.” ACM Transactions on Computing Education, 13(4), Nov. 2013.

A sample to interest you in the post:

  1. Programming language designers needlessly make programming languages harder to learn by not doing basic usability testing. For example, “…the three most common words for looping in computer science, for, while, and foreach, were rated as the three most unintuitive choices by non-programmers.”
  2. C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax. Again, this pain is needless, because the syntax of other languages (such as Python and Ruby) is significantly easier.

Let me repeat part of that:

C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax.

Randomly-designed syntax?

Now, think about the latest semantic syntax or semantic query syntax you have read about.

Was it designed for users? Was there any user testing at all?

Is there a lesson here for designers of semantic syntaxes and query languages?


I first saw this in Greg Wilson’s Software Carpentry: Lessons Learned video.

Ontology-Based Interpretation of Natural Language

Thursday, July 10th, 2014

Ontology-Based Interpretation of Natural Language by Philipp Cimiano, Christina Unger, John McCrae.

Authors’ description:

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge.

The book Ontology-based interpretation of natural language presents an approach to the interpretation of natural language with respect to specific domain knowledge captured in ontologies. It puts ontologies at the center of the interpretation process, meaning that ontologies not only provide a formalization of domain knowledge necessary for interpretation but also support and guide the construction of meaning representations.

The links under Resources for Ontologies, Lexica and Grammars, as of today return “coming soon.”

Implementations fares a bit better, returning information on various aspects of lemon.

lemon is a proposed meta-model for describing ontology lexica with RDF. It is declarative, thus abstracts from specific syntactic and semantic theories, and clearly separates lexicon and ontology. It follows the principle of semantics by reference, which means that the meaning of lexical entries is specified by pointing to elements in the ontology.


It may just be me but the Lemon model seems more complicated than asking users what identifies their subjects and distinguishes them from other subjects.

Lemon is said to be compatible with RDF, OWL, SPARQL, etc.

But, accurate (to a user) identification of subjects and their relationships to other subjects is more important to me than compatibility with RDF, SPARQL, etc.


I first saw this in a tweet by Stefano Bertolo.

The Proceedings of the Old Bailey, 1674-1913

Tuesday, July 1st, 2014

The Proceedings of the Old Bailey, 1674-1913

From the webpage:

A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London’s central criminal court. If you are new to this site, you may find the Getting Started and Guide to Searching videos and tutorials helpful.

While writing about using The WORD on the STREET for examples of language change, I remembered that the proceedings of the Old Bailey were online.

An extremely rich site with lots of help for the average reader but there was one section in particular I wanted to point out:

Gender in the Proceedings

Men’s and women’s experiences of crime, justice and punishment

Virtually every aspect of English life between 1674 and 1913 was influenced by gender, and this includes behaviour documented in the Old Bailey Proceedings. Long-held views about the particular strengths, weaknesses, and appropriate responsibilities of each sex shaped everyday lives, patterns of crime, and responses to crime. This page provides an introduction to gender roles in this period; a discussion of how they affected crime, justice, and punishment; and advice on how to analyse the Proceedings for information about gender.

Gender relations are but one example of the semantic distance that exists between us and our ancestors. We cannot ever eliminate that distance, any more than we can talk about the moon without remembering we have walked upon it.

But, we can do our best to honor that semantic distance by being aware that their world is not ours. Closely attending to language is a first step in that direction.


Early Canadiana Online

Friday, May 23rd, 2014

Early Canadiana Online

From the webpage:

These collections contain over 80,000 rare books, magazines and government publications from the 1600s to the 1940s.

This rare collection of documentary heritage will be of interest to scholars, genealogists, history buffs and anyone who enjoys reading about Canada’s early days.

The Early Canadiana Online collection of rare books, magazines and government publications has over 80,000 titles (3,500,000 pages) and is growing. The collection includes material published from the time of the first European settlers to the first four decades of the 20th Century.

You will find books written in 21 languages including French, English, 10 First Nations languages and several European languages, Latin and Greek.

Every online collection such as this one increases the volume of accessible information, and with it the difficulty of finding related information on any given subject. But the latter is such a nice problem to have!

I first saw this in a tweet from Lincoln Mullen.

Speak and learn with Spell Up, our latest Chrome Experiment

Thursday, May 15th, 2014

Speak and learn with Spell Up, our latest Chrome Experiment by Xavier Barrade.

From the post:

As a student growing up in France, I was always looking for ways to improve my English, often with a heavy French-to-English dictionary in tow. Since then, technology has opened up a world of new educational opportunities, from simple searches to Google Translate (and our backpacks have gotten a lot lighter). But it can be hard to find time and the means to practice a new language. So when the Web Speech API made it possible to speak to our phones, tablets and computers, I got curious about whether this technology could help people learn a language more easily.

That’s the idea behind Spell Up, a new word game and Chrome Experiment that helps you improve your English using your voice—and a modern browser, of course. It’s like a virtual spelling bee, with a twist.

This rocks!

If Google is going to open source another project and support it, Spell Up should be it.

The machine pronunciation could use some work, or at least it seems that way to me. (My hearing may be a factor there.)

Think of the impact a Spell Up could have for less commonly taught languages.

Online Language Taggers

Tuesday, May 13th, 2014

UCREL Semantic Analysis System (USAS)

From the homepage:

The UCREL semantic analysis system is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects and this page collects together various pointers to those projects and publications produced since 1990.

The semantic tagset used by USAS was originally loosely based on Tom McArthur’s Longman Lexicon of Contemporary English (McArthur, 1981). It has a multi-tier structure with 21 major discourse fields (shown here on the right), subdivided, and with the possibility of further fine-grained subdivision in certain cases. We have written an introduction to the USAS category system (PDF file) with examples of prototypical words and multi-word units in each semantic field.

There are four online taggers available:

English: 100,000 word limit

Italian: 2,000 word limit

Dutch: 2,000 word limit

Chinese: 3,000 character limit
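The multi-tier structure of a USAS-style tagset (a letter for the major discourse field, dotted digits for finer subdivisions) can be sketched with a toy lookup tagger. The tiny lexicon and its tag assignments below are invented for illustration; they are not USAS's actual mappings.

```python
# Toy USAS-style lexicon: tag = major field letter + dotted subdivisions.
# Assignments are illustrative, not USAS's real ones.
toy_lexicon = {
    "yesterday": "T1.1.1",   # a time tag (illustrative)
    "cash": "I1",            # a money tag (illustrative)
    "happy": "E4.1",         # an emotion tag (illustrative)
}

def tag(tokens, lexicon, unknown="Z99"):
    """Look each token up in the lexicon; unmatched words get Z99."""
    return [(t, lexicon.get(t.lower(), unknown)) for t in tokens]

def major_field(usas_tag):
    """The major discourse field is the tag's leading letter."""
    return usas_tag[0]

print(tag("Yesterday I was happy".split(), toy_lexicon))
```

The real system, of course, does far more than dictionary lookup (disambiguation, multi-word units), but the tag shape is the interesting part: coarse field and fine subdivision live in one string, so you can aggregate at whatever tier suits your analysis.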


I first saw this in a tweet by Paul Rayson.

Non-English/Spanish Language by State

Tuesday, May 13th, 2014

I need your help. I saw this on a twitter feed from Slate.


I don’t have confirmation that any member of Georgia (United States) government reads Slate, but putting this type of information where it might be seen by Georgia government staffers strikes me as irresponsible news reporting.

Publishing all of Snowden’s documents as an unedited dump would have less of an impact than members of the Georgia legislature finding out there is yet another race to worry about in Georgia.

The legislature hardly knows which way to turn now, knowing about African-Americans and Latinos. Adding another group to that list will only make matters worse.

Question: How to suppress information about the increasing diversity of the population of Georgia?

Not for long, just until it becomes diverse enough to replace all the sitting members of the Georgia legislature in one fell swoop. 😉

The more diverse Georgia becomes, the more vibrant its rebirth will be following its current period of stagnation trying to hold onto the “good old days.”

What I Said Is Not What You Heard

Sunday, May 4th, 2014

Another example of where semantic impedance can impair communication, not to mention public policy decisions:


Those are just a few terms that are used in public statements from scientists.

I can hardly imagine the disconnect between lawyers and the public. Or economists and the public.

To say nothing of computer science in general and the public.

I’m not sold on the proposed fix for “bias” being heard as “distortion, political motive,” namely glossing it as “offset from an observation.”

Literally true but I’m not sure omitting the reason for the offset is all that helpful.

Something more along the lines of: “test A misses the true value B by C, so we (subtract/add) C to A to get a more correct value.”

A lot more words but clearer.
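A toy version of that offset correction, with made-up numbers:

```python
# "Bias" in the scientist's sense: a systematic offset from the true
# value, which can be corrected once measured. Numbers are invented.
true_value = 20.0                        # what we want to know
readings = [20.9, 21.1, 21.0, 20.8]      # instrument reads consistently high

bias = sum(readings) / len(readings) - true_value   # the offset C
corrected = [r - bias for r in readings]            # "subtract C from A"

print(round(bias, 2))  # 0.95
```

Nothing about motive or distortion anywhere: the instrument misses high by about 0.95, so we subtract 0.95. That is all the scientist meant.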

The image is from: Communicating the science of climate change by Richard C. J. Somerville and Susan Joy Hassol. A very good article on the perils of trying to communicate with the general public about climate change.

But it isn’t just the general public that has difficulty understanding scientists. Scientists have difficulty understanding other scientists, particularly if the scientists in question are from different domains or even different fields within a domain.

All of which has to make you wonder: if human beings, including scientists, fail to understand each other on a regular basis, who is watching for misunderstandings between computers?

I first saw this in a tweet by Austin Frakt.

PS: Pointers to research on words that fail to communicate greatly appreciated.

Language is a Map

Wednesday, April 30th, 2014

Language is a Map by Tim O’Reilly.

From the post:

I’ve twice given an Ignite talk entitled Language is a Map, but I’ve never written up the fundamental concepts underlying that talk. Here I do that.

When I first moved to Sebastopol, before I raised horses, I’d look out at a meadow and I’d see grass. But over time, I learned to distinguish between oats, rye, orchard grass, and alfalfa. Having a language to make distinctions between different types of grass helped me to see what I was looking at.

I first learned this notion, that language is a map that reflects reality, and helps us to see it more deeply – or if wrong, blinds us to it – from George Simon, whom I first met in 1969. Later, George went on to teach workshops at the Esalen Institute, which was to the human potential movement of the 1970s as the Googleplex or Apple’s Infinite Loop is to the Silicon Valley of today. I taught at Esalen with George when I was barely out of high school, and his ideas have deeply influenced my thinking ever since.

If you accept Tim’s premise that “language is a map,” the next question that comes to mind is how faithfully can an information system represent your map?

Your map, not the map of an IT developer or a software vendor but your map?

Does your information system capture the shades and nuances of your map?


…Generalized Language Models…

Wednesday, April 16th, 2014

How Generalized Language Models outperform Modified Kneser Ney Smoothing by a Perplexity drop of up to 25% by René Pickhardt.

René reports on the core of his dissertation work.

From the post:

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and interpolating longer order models (models with longer word sequences) with lower order language models. While this idea is strong and helpful it is usually applied in the same way. In order to use a shorter model the first word of the sequence is omitted. This will be iterated. The problem occurs if one of the last words of the sequence is the really rare word. In this way omitting words in the front will not help.

So the simple trick of Generalized Language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.

Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.

Unlike some white papers, webinars and demos, you don’t have to register, list your email and phone number, etc. to see both the test data and code that implements René’s ideas.

Data, Source.

Please send René useful feedback as a way to say thank you for sharing both data and code.


Wednesday, March 5th, 2014

Q by Bernard Lambeau.

From the webpage:

Q is a data language. For now, it is limited to a data definition language (DDL). Think “JSON/XML schema”, but the correct way. Q comes with a dedicated type system for defining data and a theory, called information contracts, for interoperability with programming and data exchange languages.

I am sure this will be useful but limited since it doesn’t extend to disclosing the semantics of data or the structures that contain data.

Unfortunate but it seems like the semantics of data are treated as: “…you know what the data means…,” which is rather far from the truth.

Sometimes some people may know what the data “means,” but that is hardly a sure thing.

My favorite example is the pyramids: built in front of hundreds of thousands of people over decades, yet because everyone “…knew how it was done…,” no one bothered to write it down.

Now H2 can consult with “ancient astronaut theorists” (I’m not lying, that is what they called their experts) about the building of the pyramids.

Do you want your data to be interpreted by the data equivalent of an “ancient astronaut theorist?” If not, you had better give some consideration to documenting the semantics of your data.

I first saw this in a tweet by Carl Anderson.

How to learn Chinese and Japanese [and computing?]

Saturday, March 1st, 2014

How to learn Chinese and Japanese by Victor Mair.

From the post:

Victor concludes after a discussion of various authorities and sources:

If you delay introducing the characters, students’ mastery of pronunciation, grammar, vocabulary, syntax, and so forth, are all faster and more secure. Surprisingly, when later on they do start to study the characters (ideally in combination with large amounts of reading interesting texts with phonetic annotation), students acquire mastery of written Chinese much more quickly and painlessly than if writing is introduced at the same time as the spoken language.

An interesting debate follows in the comments.

I am wondering if the current emphasis on “coding” would be better shifted to an emphasis on computing.

That is teaching the fundamental concepts of computing, separate and apart from any particular coding language or practice.

Much as I have taught the principles of subject identification separate and apart from a particular model or syntax.

The nooks and crannies of particular models or syntaxes can wait until later.

Arabic Natural Language Processing

Saturday, February 8th, 2014

Arabic Natural Language Processing

From the webpage:

Arabic is the largest member of the Semitic language family and is spoken by nearly 500 million people worldwide. It is one of the six official UN languages. Despite its cultural, religious, and political significance, Arabic has received comparatively little attention by modern computational linguistics. We are remedying this oversight by developing tools and techniques that deliver state-of-the-art performance in a variety of language processing tasks. Machine translation is our most active area of research, but we have also worked on statistical parsing and part-of-speech tagging. This page provides links to our freely available software along with a list of relevant publications.

Software and papers from the Stanford NLP group.

An important capability to add to your toolkit, especially if you are dealing with the U.S. security complex.

I first saw this at: Stanford NLP Group Tackles Arabic Machine Translation.

So You Want To Write Your Own Language?

Tuesday, February 4th, 2014

So You Want To Write Your Own Language? by Walter Bright.

From the post:

The naked truth about the joys, frustrations, and hard work of writing your own programming language

My career has been all about designing programming languages and writing compilers for them. This has been a great joy and source of satisfaction to me, and perhaps I can offer some observations about what you’re in for if you decide to design and implement a professional programming language. This is actually a book-length topic, so I’ll just hit on a few highlights here and avoid topics well covered elsewhere.

In case you are wondering if Walter is a good source for language writing advice, I pulled this bio from the Dr. Dobb’s site:

Walter Bright is a computer programmer known for being the designer of the D programming language. He was also the main developer of the first native C++ compiler, Zortech C++ (later to become Symantec C++, now Digital Mars C++). Before the C++ compiler he developed the Datalight C compiler, also sold as Zorland C and later Zortech C.

I am sure writing a language is an enormous amount of work, but Walter makes it sound quite attractive.
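For a small taste of what the very first steps look like, here is a toy hand-written lexer and recursive-descent evaluator for arithmetic. Illustrative only; Walter's post is about the vastly larger job of a professional language, with types, scopes, code generation, and years of polish on top.

```python
import re

# Lexer: numbers or single-character operators, skipping whitespace.
TOKEN = re.compile(r"\s*(?:(\d+)|(.))")

def tokenize(src):
    return [num or op for num, op in TOKEN.findall(src)]

def parse_expr(toks):
    """expr := term (('+'|'-') term)*"""
    val = parse_term(toks)
    while toks and toks[0] in "+-":
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == "+" else val - rhs
    return val

def parse_term(toks):
    """term := factor (('*'|'/') factor)*"""
    val = parse_factor(toks)
    while toks and toks[0] in "*/":
        op = toks.pop(0)
        rhs = parse_factor(toks)
        val = val * rhs if op == "*" else val // rhs
    return val

def parse_factor(toks):
    """factor := NUMBER | '(' expr ')'"""
    tok = toks.pop(0)
    if tok == "(":
        val = parse_expr(toks)
        toks.pop(0)  # drop ')'
        return val
    return int(tok)

print(parse_expr(tokenize("2 + 3 * (4 - 1)")))  # 11
```

Even this thirty-line toy already encodes design decisions (operator precedence, integer division, no error handling), which hints at why real language design fills a book.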