Archive for August, 2014

Web Data Commons Extraction Framework …

Sunday, August 31st, 2014

Web Data Commons Extraction Framework for the Distributed Processing of CC Data by Robert Meusel.

Interested in a framework to process all the Common Crawl data?

From the post:

We used the extraction tool for example to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 CC corpus (over 100TB when uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$500. The extracted graph had a size of less than 100GB zipped.

NSA-level processing it's not, but then you are most likely looking for useful results, not data for the sake of filling up drives.

An Introduction to Graphical Models

Sunday, August 31st, 2014

An Introduction to Graphical Models by Michael I. Jordan.

A bit dated (1997), these slides, although "wordy" ones, introduce you to graphical models.

Makes a nice outline to check your knowledge of graphical models.

I first saw this in a tweet by Data Tau.

Resisting Tyranny – Customer-Centric-Cloud (CCCl)

Sunday, August 31st, 2014

Microsoft Defies Court Order, Will Not Give Emails to US Government by Paul Thurrott.

From the post:

Despite a federal court order directing Microsoft to turn overseas-held email data to federal authorities, the software giant said Friday it will continue to withhold that information as it waits for the case to wind through the appeals process. The judge has now ordered both Microsoft and federal prosecutors to advise her how to proceed by next Friday, September 5.

Let there be no doubt that Microsoft’s actions in this controversial case are customer-centric. The firm isn’t just standing up to the US government on moral principles. It’s now defying a federal court order.

“Microsoft will not be turning over the email and plans to appeal,” a Microsoft statement notes. “Everyone agrees this case can and will proceed to the appeals court. This is simply about finding the appropriate procedure for that to happen.”

Hooray for Microsoft! So say we all!

Unlike when it victimized an individual like Aaron Swartz, the United States government now faces an opponent with no fear of personal injury and phalanxes of lawyers to defend itself. An opponent that is heavily wired into the halls of government itself.

Why should other corporations join Microsoft in fighting government overreach?

Thurrott's post captures the answer in a single phrase: "customer-centric."

If you want to protect your part of a projected market of $235.1 billion in 2017, being customer-centric is a necessity.

That's it. Bottom line, it is a question of cold, hard economics and profit. Lose potential customers because of government overreach and your bottom line shrinks.

Unless a shrinking bottom line is your corporate strategy for the future (you need to tell your shareholders about that), the time to start fighting government overreach is now.

For example, all orders for customer data or data about customers should be subject to appellate review up to and including the United States Supreme Court.

Congress has established lower courts and even secret courts, so it can damned well make their orders subject to appeal up to the Supreme Court.

If IT businesses and their customers want a "customer-centric" Cloud, the time has come to call on their elected representatives to make it so.

Possible slogan? Customer-Centric-Cloud! CCCl

Scoring tennis using finite-state automata

Saturday, August 30th, 2014

Scoring tennis using finite-state automata by Michael McCandless.

From the post:

For some reason having to do with the medieval French, the scoring system for tennis is very strange.

In actuality, the game is easy to explain: to win, you must score at least 4 points and win by at least 2. Yet in practice, you are supposed to use strange labels like “love” (0 points), “15” (1 point), “30” (2 points), “40” (3 points), “deuce” (3 or more points each, and the players are tied), “all” (players are tied) instead of simply tracking points as numbers, as other sports do.

This is of course wildly confusing to newcomers. Fortunately, the convoluted logic is easy to express as a finite-state automaton (FSA):

And you thought that CS course in automata wasn’t going to be useful. 😉
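The post's automaton diagram isn't reproduced here, but the quoted rules are easy to encode directly. A minimal Python sketch (the state encoding and labels are mine, not McCandless's FSA):

```python
# Tennis game scoring as a small finite-state machine.
# A state is either a (points_a, points_b) tuple or a terminal string.
# Deuce/advantage are collapsed back onto (3, 3) so the automaton stays finite.

LABELS = {0: "love", 1: "15", 2: "30", 3: "40"}

def advance(state, winner):
    """Return the next state after `winner` ('A' or 'B') takes a point."""
    if isinstance(state, str):              # game already over
        return state
    a, b = state
    a, b = (a + 1, b) if winner == "A" else (a, b + 1)
    if a >= 4 or b >= 4:
        if abs(a - b) >= 2:                 # won by at least 2
            return "game " + ("A" if a > b else "B")
        if a == b:                          # back to deuce
            return (3, 3)
        return (4, 3) if a > b else (3, 4)  # advantage
    return (a, b)

def label(state):
    """Translate a state into the strange traditional labels."""
    if isinstance(state, str):
        return state
    a, b = state
    if a == b == 3:
        return "deuce"
    if (a, b) == (4, 3):
        return "advantage A"
    if (a, b) == (3, 4):
        return "advantage B"
    if a == b:
        return LABELS[a] + "-all"
    return LABELS[a] + "-" + LABELS[b]
```

Feeding a point sequence through `advance` and printing `label` at each step reproduces the "love"/"15"/"deuce"/"advantage" progression the post describes.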

Michael goes on to say:

FSA minimization saved only 3 states for the game of tennis, resulting in a 10% smaller automaton, and maybe this simplifies keeping track of scores in your games by a bit, but in other FSA applications in Lucene, such as the analyzing suggester, MemoryPostingsFormat and the terms index, minimization is vital since it saves substantial disk and RAM for Lucene applications!

A funny introduction with a serious purpose!

National Museum of Denmark – Images

Friday, August 29th, 2014

Nationalmuseet frigiver tusindvis af historiske fotos

The National Museum of Denmark has released nearly 50,000 images, with a long-term goal of 750,000 images, under a Creative Commons BY-SA license for the photos where the museum owns the copyright.

Should have an interesting impact on object recognition in images. What objects are “common” in a particular period? What objects are associated with particular artists or themes?


I first saw this in a tweet by Michael Peter Edson.

Onyx: Distributed Workflows….

Thursday, August 28th, 2014

Onyx: Distributed Workflows for Dynamic Systems by Michael Drogalis.

From the post:

If you’ve ever jumped heads down into a codebase maintaining complex distributed activity and tried to simplify or change the processing workflow, not only will you scratch your head for 7 sleepless nights before you can get anywhere, but you’ll come to realize that workflows are often deeply complected with their mechanism of execution.

In this talk, we’ll survey contemporary frameworks such as Storm and Cascading. We’ll identify the pain points that seem to crop up time and time again: workflow specification, stateful lifecycle management, and developer testing – to name a few.

Onyx is a new distributed computation system written in Clojure that addresses these problems head-on. Hardware advancements in the last 10 years have enabled new designs that leverage fast networks and SSDs. Onyx takes advantage and adapts to this new environment. The concepts and tools discussed remove the incidental complexity that plagues modern frameworks.

Attendees will come away with new perspective on leveraging immutability, persistent data structures, queues, and transactions to tackle increasingly complex problem spaces.

This and much more at Strange Loop, St. Louis, Sept. 17-19, 2014.

…Deep Learning Text Classification

Thursday, August 28th, 2014

Using a Graph Database for Deep Learning Text Classification by Kenny Bastani.

From the post:

Graphify is a Neo4j unmanaged extension that provides plug and play natural language text classification.

Graphify gives you a mechanism to train natural language parsing models that extract features of a text using deep learning. When training a model to recognize the meaning of a text, you can send an article of text with a provided set of labels that describe the nature of the text. Over time the natural language parsing model in Neo4j will grow to identify those features that optimally disambiguate a text to a set of classes.

Similarity and graphs. What’s there to not like?

Outside the Closed World…

Thursday, August 28th, 2014

Outside the Closed World: On Using Machine Learning For Network Intrusion Detection by Robin Sommer and Vern Paxson.


In network intrusion detection research, one popular strategy for finding attacks is monitoring a network’s activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational “real-world” settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

Keywords: anomaly detection; machine learning; intrusion detection; network security.

From the introduction:

In this paper we set out to examine the differences between the intrusion detection domain and other areas where machine learning is used with more success. Our main claim is that the task of finding attacks is fundamentally different from other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We believe that a significant part of the problem already originates in the premise, found in virtually any relevant textbook, that anomaly detection is suitable for finding novel attacks; we argue that this premise does not hold with the generality commonly implied. Rather, the strength of machine-learning tools is finding activity that is similar to something previously seen, without the need however to precisely describe that activity up front (as misuse detection must).

Between data breaches at firms that should know better and the ongoing dribble of Snowden revelations, cybersecurity will be a hot topic for years.

Beyond security concerns, the authors' isolation of machine learning as detecting something "similar to something previously seen" sets a limit on the usefulness of machine learning for detecting "new" subjects/concepts in a data stream.
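That characterization, flagging deviation from what was previously seen, can be illustrated with a toy detector (my own sketch, not from the paper):

```python
# Toy anomaly detector: score a point by its Euclidean distance to the
# nearest "benign" training point. It can flag deviation from the seen,
# but it has no notion of what a novel attack actually *means*.
def nearest_distance(x, benign):
    return min(
        sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5
        for p in benign
    )

def is_anomalous(x, benign, threshold=1.0):
    return nearest_distance(x, benign) > threshold

# A learned "profile of normality" from benign traffic (invented numbers)
benign = [(0.9, 1.1), (1.0, 1.0), (1.1, 0.9)]

assert not is_anomalous((1.05, 0.95), benign)  # similar to the seen
assert is_anomalous((5.0, 5.0), benign)        # deviates, so flagged
```

Note that the detector says nothing about *why* the outlier matters, which is exactly the gap between anomaly detection and attack detection that the paper highlights.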

A news account I saw earlier this week described machine processing of text as “objective.” A better term would have been “unimaginative.” Machine learning can return the meanings it has been taught but it will not offer a new meaning. Something to bear in mind when mining large bodies of texts.

50 UNIX Commands

Thursday, August 28th, 2014

50 Most Frequently Used UNIX / Linux Commands (With Examples) by Ramesh Natarajan.

From the post:

This article provides practical examples for 50 most frequently used commands in Linux / UNIX.

This is not a comprehensive list by any means, but this should give you a jumpstart on some of the common Linux commands. Bookmark this article for your future reference.

Nothing new but handy if someone asks for guidance on basic Unix commands. Sending this list might save you some time.

Or, if you are a recruiter, edit out the examples and ask for an example of using each command. 😉

I first saw this in a tweet by Lincoln Mullen.

Game Dialogue + FOL + Clojure

Thursday, August 28th, 2014

Representing Game Dialogue as Expressions in First Order Logic by Kaylen Wheeler.


Despite advancements in graphics, physics, and artificial intelligence, modern video games are still lacking in believable dialogue generation. The more complex and interactive stories in modern games may allow the player to experience different paths in dialogue trees, but such trees are still required to be manually created by authors. Recently, there has been research on methods of creating emergent believable behaviour, but these are lacking true dialogue construction capabilities. Because the mapping of natural language to meaningful computational representations (logical forms) is a difficult problem, an important first step may be to develop a means of representing in-game dialogue as logical expressions. This thesis introduces and describes a system for representing dialogue as first-order logic predicates, demonstrates its equivalence with current dialogue authoring techniques, and shows how this representation is more dynamic and flexible.

If you remember the Knights and Knaves from Labyrinth or other sources, you will find this an enjoyable read. After solving the puzzle, Kaylen’s discussion shows that a robust solution requires information hiding and the capacity for higher-order questioning.

Clojure fans will appreciate the use of clojure.core.logic.


I first saw this in a tweet by David Nolen.

Thou Shalt Share!

Thursday, August 28th, 2014

NIH Tells Genomic Researchers: ‘You Must Share Data’ by Paul Basken.

From the post:

Scientists who use government money to conduct genomic research will now be required to quickly share the data they gather under a policy announced on Wednesday by the National Institutes of Health.

The data-sharing policy, which will take effect with grants awarded in January, will give agency-financed researchers six months to load any genomic data they collect—from human or nonhuman subjects—into a government-established database or a recognized alternative.

NIH officials described the move as the latest in a series of efforts by the federal government to improve the efficiency of taxpayer-financed research by ensuring that scientific findings are shared as widely as possible.

“We’ve gone from a circumstance of saying, ‘Everybody should share data,’ to now saying, in the case of genomic data, ‘You must share data,’” said Eric D. Green, director of the National Human Genome Research Institute at the NIH.

A step in the right direction!

Waiting for other government funding sources and private funders (including in the humanities) to take the same step.

I first saw this in a tweet by Kevin Davies.

Linked Data Platform Best Practices…

Thursday, August 28th, 2014

Linked Data Platform Best Practices and Guidelines Note Published

From the post:

The Linked Data Platform (LDP) Working Group has published a Group Note of Linked Data Platform Best Practices and Guidelines. This document provides best practices and guidelines for implementing Linked Data Platform servers and clients. Learn more about the Data Activity.

The document takes pains to distinguish "best practice" from "guideline":

For the purposes of this document, it is useful to make a minor, yet important distinction between the term ‘best practice’ and the term ‘guideline’. We define and differentiate the terms as follows:

best practice
An implementation practice (method or technique) that has consistently shown results superior to those achieved with other means and that is used as a benchmark. Best practices within this document apply specifically to the ways that LDP servers and clients are implemented as well as how certain resources are prepared and used with them. In this document, the best practices might be used as a kind of check-list against which an implementer can directly evaluate a system’s design and code. Lack of adherence to any given best practice, however, does not necessarily imply a lack of quality; they are recommendations that are said to be ‘best’ in most cases and in most contexts, but not all. A best practice is always subject to improvement as we learn and evolve the Web together.
guideline
A tip, a trick, a note, a suggestion, or answer to a frequently asked question. Guidelines within this document provide useful information that can advance an implementer’s knowledge and understanding, but that may not be directly applicable to an implementation or recognized by consensus as a ‘best practice’.

Personally I don’t see the distinction as useful but I bring it to your attention in case you are reading or authoring in this space.

US State Dept. “The Cloud is Falling!”

Thursday, August 28th, 2014

US: NSA leaks should be no excuse for local storage mandates, which harm “organic” internet by David Meyer.

From the post:

The U.S. State Department has warned against countries such as Russia forcing web service providers to store citizens’ data locally, even though such moves are at least in part inspired by Edward Snowden’s revelations of the NSA spying on foreigners’ personal data.

“[People should not] use the Snowden revelations as an excuse for taking what are essentially protectionist measures that will harm the ability of the internet to work in an organic way,” a State Department official said Thursday, ahead of the annual Internet Governance Forum meeting in Istanbul next week.

With truthful translation inserted:

“[People should not] use the Snowden revelations as an excuse for taking what are essentially protectionist measures that will prevent unlimited data mining by the NSA,” a State Department official said Thursday, ahead of the annual Internet Governance Forum meeting in Istanbul next week.

Will local storage hurt cloud computing?


You bet! It will hurt cloud providers, service providers, the whole range of activities associated with the cloud and the people who perform those activities.

The more critical point is the NSA or more properly, the United States government, should have considered the risk to the developing Cloud infrastructure before so thoroughly abusing it.

How much danger is posed by damaging the Cloud? A recent report quotes IHS Technology as saying:

Cloud-related tech spending by businesses is forecast to triple from 2011 to 2017, the research firm said. By 2017, enterprise spending on cloud computing will amount to a projected $235.1 billion, triple the $78.2 billion spent in 2011, IHS said.

As Gross Domestic Product (GDP), cloud spending of $235.1 billion puts the cloud in the top fifty countries by GDP. Or between Finland and Israel according to United Nations estimates.

How long it may take the cloud to break into the top 25 countries by GDP isn’t clear.

What is clear is that so long as the Cloud is subject to the whim and caprice of the NSA and similar organizations, it will never reach its full economic potential. If you care about the cloud and its economic future, demand the NSA be hobbled before it does any more damage.

PS: Demands for action that are accompanied by donations get the most attention. The equivalent of a top 50 (or better) GDP economic opportunity is the tech community’s to lose.

SICP Distilled (Clojure)

Thursday, August 28th, 2014

SICP Distilled (Clojure) by Tom Hall.

Tom has started a Kickstarter campaign to finance the authoring of a Clojure-based companion to SICP.

A companion which Tom describes as:

As the book [SICP] itself is available online for free I want to make the perfect accompaniment to it – an ebook summarising the key ideas, short videos describing them, screencasts of solving some of the exercises, translation of the examples into Clojure, example projects, partial solutions for you to complete (similar to 4clojure and Clojure koans) and a place to discuss solving them with people.

The project will be funded on Tue, Sept. 16, 2014, at 4:12 EDT.

This should be enough notice for even the most disorganized of us to pledge in support of SICP Distilled (Clojure).

Please pass this along, tweet, re-post, etc.

Thesaurus Joke

Thursday, August 28th, 2014

Lu Hersey tweets: “Best thesaurus joke I’ve seen all day :)”


Exploring Calculus with Julia

Wednesday, August 27th, 2014

Exploring Calculus with Julia

From the post:

This is a collection of notes for exploring calculus concepts with the Julia programming language. Such an approach is used in MTH 229 at the College of Staten Island.

These notes are broken into different sections, where most all sections have some self-grading questions at the end that allow you to test your knowledge of that material. The code should be copy-and-pasteable into a julia session. The code output is similar to what would be shown if evaluated in an IJulia cell, our recommended interface while learning julia.

The notes mostly follow topics of a standard first-semester calculus course after some background material is presented for learning julia within a mathematical framework.

Another example of pedagogical technique.

Semantic disconnects are legion and not hard to find. However, what criteria would you use to select a set of them to be solved using topic maps?

Or perhaps better, before mentioning topic maps, how would you solve them so that the solution works up to being a topic map?

Either digitally or even with pencil and paper?

Thinking that getting people to internalize the value-add of topic maps before investing effort into syntax, etc. could be a successful way to promote them.

Unofficial Bash Strict Mode

Wednesday, August 27th, 2014

Use the Unofficial Bash Strict Mode (Unless You Looove Debugging) by Aaron Maxwell.

Let’s start with the punchline. Your bash scripts will be more robust, reliable and maintainable if you start them like this:

set -euo pipefail

I call this the unofficial bash strict mode. This causes bash to behave in a way that makes many classes of subtle bugs impossible. You’ll spend much less time debugging, and also avoid having unexpected complications in production.

There is a short-term downside: these settings make certain common bash idioms harder to work with. They all have simple workarounds, detailed below: jump to Issues & Solutions. But first, let’s look at what these obscure lines actually do.
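To see concretely what the flags change, here is a small harness of my own (written in Python, and assuming `bash` is on the PATH) that runs the same failing pipeline with and without strict mode:

```python
# Demonstrate `set -euo pipefail` by running bash snippets via subprocess.
import subprocess

def run(script):
    r = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
    return r.returncode, r.stdout

# Without strict mode: the failing `false` is masked by `true`
# (a pipeline's status is its last command's), so the script continues.
code, out = run("false | true\necho reached")
assert code == 0 and "reached" in out

# With strict mode: `pipefail` surfaces the failure and `-e` aborts,
# so `echo reached` never runs.
code, out = run("set -euo pipefail\nfalse | true\necho reached")
assert code != 0 and "reached" not in out

# `-u` turns an unset (e.g. typo'd) variable into a hard error
# instead of silently expanding to the empty string.
code, out = run("set -euo pipefail\necho \"$NO_SUCH_VAR\"")
assert code != 0
```

The silent-masking behavior in the first case is exactly the class of subtle bug Maxwell's post is warning about.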

The sort of thing you only hear about in rumors or stumble across in a Twitter feed.

Bookmark and share!

I first saw this in a tweet by Neil Saunders.

New York Times Annotated Corpus Add-On

Wednesday, August 27th, 2014

New York Times corpus add-on annotations: MIDs and Entity Salience. (GitHub – Data)

From the webpage:

The data included in this release accompanies the paper, entitled “A New Entity Salience Task with Millions of Training Examples” by Jesse Dunietz and Dan Gillick (EACL 2014).

The training data includes 100,834 documents from 2003-2006, with 19,261,118 annotated entities. The evaluation data includes 9,706 documents from 2007, with 187,080 annotated entities.

An empty line separates each document annotation. The first line of a document’s annotation contains the NYT document id followed by the title. Each subsequent line refers to an entity, with the following tab-separated fields:

  • entity index
  • automatically inferred salience {0,1}
  • mention count (from our coreference system)
  • first mention’s text
  • byte offset start position for the first mention
  • byte offset end position for the first mention
  • MID (from our entity resolution system)
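A reader for the format described above might look like the following sketch. The field names are my own, and the exact layout should be confirmed against the release's README:

```python
# Hedged sketch of a parser for the entity-salience annotation format:
# blank-line-separated documents, a doc-id/title header line, then one
# tab-separated entity line per annotated entity.
from collections import namedtuple

Entity = namedtuple(
    "Entity",
    "index salience mention_count first_mention start_byte end_byte mid")

def parse_documents(lines):
    """Yield (doc_id, title, [Entity, ...]) triples from annotation lines."""
    doc, entities = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                       # blank line ends a document
            if doc:
                yield doc[0], doc[1], entities
            doc, entities = None, []
        elif doc is None:                  # first line: doc id, then title
            doc_id, _, title = line.partition("\t")
            doc = (doc_id, title)
        else:                              # entity line: 7 tab-separated fields
            f = line.split("\t")
            entities.append(Entity(int(f[0]), int(f[1]), int(f[2]),
                                   f[3], int(f[4]), int(f[5]), f[6]))
    if doc:                                # flush a trailing document
        yield doc[0], doc[1], entities
```

The separator between the document id and the title is an assumption on my part (I use a tab here); adjust to whatever the released files actually contain.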

The background in Teaching machines to read between the lines (and a new corpus with entity salience annotations) by Dan Gillick and Dave Orr, will be useful.

From the post:

Language understanding systems are largely trained on freely available data, such as the Penn Treebank, perhaps the most widely used linguistic resource ever created. We have previously released lots of linguistic data ourselves, to contribute to the language understanding community as well as encourage further research into these areas.

Now, we’re releasing a new dataset, based on another great resource: the New York Times Annotated Corpus, a set of 1.8 million articles spanning 20 years. 600,000 articles in the NYTimes Corpus have hand-written summaries, and more than 1.5 million of them are tagged with people, places, and organizations mentioned in the article. The Times encourages use of the metadata for all kinds of things, and has set up a forum to discuss related research.

We recently used this corpus to study a topic called “entity salience”. To understand salience, consider: how do you know what a news article or a web page is about? Reading comes pretty easily to people — we can quickly identify the places or things or people most central to a piece of text. But how might we teach a machine to perform this same task? This problem is a key step towards being able to read and understand an article.

Term ratios are a start, but we can do better. Search indexing these days is much more involved, using for example the distances between pairs of words on a page to capture their relatedness. Now, with the Knowledge Graph, we are beginning to think in terms of entities and relations rather than keywords. “Basketball” is more than a string of characters; it is a reference to something in the real world which we already know quite a bit about. (emphasis added)

Truly an important data set but I’m rather partial to that last line. 😉

So the question is if we “recognize” an entity as salient, do we annotate the entity and:

  • Present the reader with a list of links, each to a separate mention with or without ads?
  • Present the reader with what is known about the entity, with or without ads?

I see enough posts split across pages and other content that forces readers to endure more ads that I consciously avoid buying anything for which I see a web ad. I suggest you do the same (if possible). I buy books, for example, because someone known to me recommends them, not because some marketeer pushes them at me across many domains.

You Say “Concepts” I Say “Subjects”

Wednesday, August 27th, 2014

Researchers are cracking text analysis one dataset at a time by Derrick Harris.

From the post:

Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.
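The mention-count baseline the quote starts from can be sketched in a few lines; the normalization and the data below are invented purely for illustration:

```python
# Toy salience baseline: rank entities in a document by how much their
# mention count exceeds their usual per-document rate in the corpus.
from collections import Counter

def salience_ranking(doc_mentions, corpus_rate):
    """doc_mentions: entity names mentioned in one document (with repeats).
       corpus_rate: typical per-document mention count for each entity."""
    counts = Counter(doc_mentions)
    score = {e: c / corpus_rate.get(e, 1.0) for e, c in counts.items()}
    return sorted(score, key=score.get, reverse=True)

doc = ["Basketball"] * 5 + ["New York"] * 2
rates = {"Basketball": 1.0, "New York": 4.0}  # "New York" is common everywhere
assert salience_ranking(doc, rates)[0] == "Basketball"
```

The point of the dataset is precisely that such volume-based ranking misses entities that are salient without being frequent, which is where entity-level background knowledge comes in.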

A summary of some of the recent work on recognizing concepts in text and not just key words.

As topic mappers know, there is no universal one-to-one correspondence between words and subjects (“concepts” in this article). Finding “concepts” means that whatever words triggered that recognition, we can supply other information that is known about the same concept.

Certainly will make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.

Digital or Paper?

Wednesday, August 27th, 2014

Readers absorb less on Kindles than on paper, study finds by Alison Flood.

From the post:

A new study which found that readers using a Kindle were “significantly” worse than paperback readers at recalling when events occurred in a mystery story is part of major new Europe-wide research looking at the impact of digitisation on the reading experience.

The study, presented in Italy at a conference last month and set to be published as a paper, gave 50 readers the same short story by Elizabeth George to read. Half read the 28-page story on a Kindle, and half in a paperback, with readers then tested on aspects of the story including objects, characters and settings.

Anne Mangen of Norway’s Stavanger University, a lead researcher on the study, thought academics might “find differences in the immersion facilitated by the device, in emotional responses” to the story. Her predictions were based on an earlier study comparing reading an upsetting short story on paper and on iPad. “In this study, we found that paper readers did report higher on measures having to do with empathy and transportation and immersion, and narrative coherence, than iPad readers,” said Mangen.

But instead, the performance was largely similar, except when it came to the timing of events in the story. “The Kindle readers performed significantly worse on the plot reconstruction measure, ie, when they were asked to place 14 events in the correct order.”

Don’t panic! The choices aren’t digital vs. paper. I have any number of titles in both forms. One for searching and the other for reading.

This report is about one data point in a largely unexplored area. Not to say there haven’t been other studies, papers, etc., but on very small groups over short periods of time.

Think of it as family snapshots of a few families versus the bulk of humanity. Useful, but not the full story.

We need to keep taking these family snapshots in hopes of building up a more comprehensive picture of our interaction with interfaces.

A Public Official’s Guide to Financial Literacy

Tuesday, August 26th, 2014

A Public Official’s Guide to Financial Literacy

From the webpage:

Many individuals enter government service to effect change, but you can’t accomplish much if you don’t speak the language of public finance. The goal of this guide is to help state and local leaders become financially literate and answer questions like: “How does my jurisdiction get and spend its money?” and “Are we in sound financial shape?” The guide provides technical knowledge, essential questions and examples of what not to do to give leaders a core understanding of key public finance concepts.

I was fairly annoyed by the data-collection process required to obtain this “free” report.

It is far from everything you need to know, but it will reduce the ability of public officials to hide behind finance language.

That is the first step towards asking effective questions and tracking public financing, with topic maps perhaps.

Despite the annoying data collecting, I recommend you get someone to register and get a copy. Then widely distribute the copy! 😉

Biscriptal juxtaposition in Chinese

Tuesday, August 26th, 2014

Biscriptal juxtaposition in Chinese by Victor Mair.

From the post:

We have often seen how the Roman alphabet is creeping into Chinese writing, both for expressing English words and morphemes that have been borrowed into Chinese, but also increasingly for writing Mandarin and other varieties of Chinese in Pinyin (spelling). Here are just a few earlier Language Log posts dealing with this phenomenon:

“A New Morpheme in Mandarin” (4/26/11)

“Zhao C: a Man Who Lost His Name” (2/27/09)

“Creeping Romanization in Chinese” (8/30/12)

Now an even more intricate application of alphabetic usage is developing in internet writing, namely, the juxtaposition and intertwining of simultaneous phrases with contrasting meaning.

Highly entertaining post on the complexities of evolving language usage.

The sort of usage that hasn’t made it into a dictionary yet, but still needs to be captured and shared.

Sam Hunting brought this to my attention.

Probabilistic Topic Maps?

Tuesday, August 26th, 2014

Probabilistic Soft Logic

From the webpage:

Probabilistic soft logic (PSL) is a modeling language (with accompanying implementation) for learning and predicting in relational domains. Such tasks occur in many areas such as natural language processing, social-network analysis, computer vision, and machine learning in general.

PSL allows users to describe their problems in an intuitive, logic-like language and then apply their models to data.


  • PSL models are templates for hinge-loss Markov random fields (HL-MRFs), a powerful class of probabilistic graphical models.
  • HL-MRFs are extremely scalable models because they are log-concave densities over continuous variables that can be optimized using the alternating direction method of multipliers.
  • See the publications page for more technical information and applications.

This homepage lists three introductory videos and has a set of slides on PSL.

Under entity resolution, the slides illustrate rules that govern the “evidence” that two entities represent the same person. You will also find link prediction, mapping between different ontologies, discussion of MapReduce implementations, and other material in the slides.

Probabilistic rules could be included in a TMDM instance but I don’t know of any topic map software that supports probabilistic merging. Would be a nice authoring feature to have.

The source code is on GitHub if you want to take a closer look.
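To make the probabilistic-merging idea concrete, here is a minimal Python sketch of weighted-evidence merging. The rules, weights, fields, and threshold below are all invented for illustration; this is not the PSL language or any existing topic map API, just the shape a soft merging rule might take.

```python
# Hypothetical sketch: weighted "evidence" rules for deciding whether two
# topics represent the same person, loosely in the spirit of PSL rules.
# All names, weights, and thresholds here are invented for illustration.

def name_similarity(a, b):
    # Jaccard similarity over name tokens -- a crude stand-in for a real
    # string-similarity measure.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def merge_score(topic_a, topic_b, rules):
    # Each rule is (weight, predicate); the score is the weighted average
    # of the soft truth values the predicates return.
    total = sum(w for w, _ in rules)
    return sum(w * pred(topic_a, topic_b) for w, pred in rules) / total

rules = [
    (2.0, lambda a, b: name_similarity(a["name"], b["name"])),
    (1.0, lambda a, b: 1.0 if a.get("email") == b.get("email") else 0.0),
]

a = {"name": "John Q. Smith", "email": "jqs@example.com"}
b = {"name": "John Smith", "email": "jqs@example.com"}
score = merge_score(a, b, rules)
# Merge only when the combined evidence clears a threshold.
should_merge = score > 0.6
```

Rather than a hard yes/no on identity, the author gets a tunable dial: adjust weights and the threshold until the merges match human judgment.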

Improving sparse word similarity models…

Tuesday, August 26th, 2014

Improving sparse word similarity models with asymmetric measures by Jean Mark Gawron.


We show that asymmetric models based on Tversky (1977) improve correlations with human similarity judgments and nearest neighbor discovery for both frequent and middle-rank words. In accord with Tversky’s discovery that asymmetric similarity judgments arise when comparing sparse and rich representations, improvement on our two tasks can be traced to heavily weighting the feature bias toward the rarer word when comparing high- and mid- frequency words.

From the introduction:

A key assumption of most models of similarity is that a similarity relation is symmetric. This assumption is foundational for some conceptions, such as the idea of a similarity space, in which similarity is the inverse of distance; and it is deeply embedded into many of the algorithms that build on a similarity relation among objects, such as clustering algorithms. The symmetry assumption is not, however, universal, and it is not essential to all applications of similarity, especially when it comes to modeling human similarity judgments.

What assumptions underlie your “similarity” measures?

Not that we can escape assumptions entirely, but are your assumptions based on evidence, or are they unexamined?

Do you know of any general techniques for discovering assumptions in algorithms?
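For the curious, Tversky's ratio model is easy to sketch. The Python function below illustrates its general form over feature sets (the feature sets and the α, β weights are made up for the example); the measure is asymmetric whenever α ≠ β, which is exactly what lets it weight the sparser representation more heavily.

```python
def tversky(a, b, alpha=0.8, beta=0.2):
    """Tversky ratio-model similarity over feature sets.

    S(a, b) = |a & b| / (|a & b| + alpha*|a - b| + beta*|b - a|).
    Asymmetric when alpha != beta: features of `a` not shared with `b`
    are penalized differently from features of `b` not shared with `a`.
    """
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0

# A "rich" representation (frequent word) vs. a "sparse" one (rare word):
rich = {"animal", "pet", "furry", "barks", "loyal", "tail"}
sparse = {"animal", "barks"}

# The direction of comparison matters: sim(sparse, rich) != sim(rich, sparse).
s1 = tversky(sparse, rich)  # 2 / (2 + 0.8*0 + 0.2*4) = 0.714...
s2 = tversky(rich, sparse)  # 2 / (2 + 0.8*4 + 0.2*0) = 0.384...
```

Comparing the sparse word against the rich one scores higher than the reverse, matching Tversky's observation about human judgments of sparse versus rich representations.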

OrientDB Manual – version 1.7.8

Tuesday, August 26th, 2014

OrientDB Manual – version 1.7.8

From the post:

Welcome to OrientDB – the first Multi-Model Open Source NoSQL DBMS that brings together the power of graphs and the flexibility of documents into one scalable, high-performance operational database. OrientDB is sponsored by Orient Technologies, LTD.

OrientDB has features of both Document and Graph DBMSs. Written in Java and designed to be exceptionally fast: it can store up to 150,000 records per second on common hardware. Not only can it embed documents like any other Document database, but it manages relationships like Graph Databases with direct connections among records. You can traverse parts of or entire trees and graphs of records in a few milliseconds.

OrientDB supports schema-less, schema-full and schema-mixed modes and it has a strong security profiling system based on users and roles. Thanks to the SQL layer, OrientDB query language is straightforward and easy to use, especially for those skilled in the relational DBMS world.

Take a look at some OrientDB Presentations.

A new version of the documentation for OrientDB. I saw this last week but forgot to mention it.
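The “direct connections among records” claim is the key architectural point: traversal follows stored references instead of computing joins. Here is a toy Python sketch of that idea (this is purely illustrative, not the OrientDB API; the record IDs only mimic OrientDB's `#cluster:position` style):

```python
# Toy illustration of traversal over direct record links: each record
# holds references to its neighbors, so an n-hop traversal is n rounds
# of pointer-following, with no join computation.

class Record:
    def __init__(self, rid, **fields):
        self.rid = rid
        self.fields = fields
        self.out_edges = []  # direct references to connected records

def traverse(start, depth):
    # Breadth-first traversal up to `depth` hops, following stored links.
    seen = {start.rid}
    frontier = [start]
    for _ in range(depth):
        nxt = []
        for rec in frontier:
            for n in rec.out_edges:
                if n.rid not in seen:
                    seen.add(n.rid)
                    nxt.append(n)
        frontier = nxt
    return seen

a = Record("#12:0", name="A")
b = Record("#12:1", name="B")
c = Record("#12:2", name="C")
a.out_edges.append(b)
b.out_edges.append(c)
one_hop = traverse(a, 1)   # a and its direct neighbor
two_hops = traverse(a, 2)  # the whole chain
```

In a real graph database the links are persisted with the records, which is why traversing part of a graph can take milliseconds regardless of total database size.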


Cloudera Navigator Demo

Tuesday, August 26th, 2014

Cloudera Navigator Demo

Not long (9:50) but useful demo of Cloudera Navigator.

There was a surprise or two.

The first one was the suggestion that if there are multiple columns with different names for zip code (the equivalent of postal codes), that you should normalize all the columns to one name.

Understandable, but what if a column has a name that is non-intuitive to the user, such as CEP (the Brazilian postal code)?

It appears that “searching” matches surface tokens, and we all know the perils of that type of searching. More robust searching would allow a search for any variant name of postal code, returning the columns that share the property of being a postal code, regardless of column name.

The second surprise was that “normalization,” as described, sets the stage for repeating the normalization with each data import. That sounds subject to human error as more and more data sets are imported.

The interface itself appears easy to use, assuming you are satisfied with opaque tokens whose semantics you have to guess. You could be right, but then again, you could be wrong.
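To illustrate the difference between token matching and property-based search, here is a small Python sketch. The column names, tables, and synonym table are all invented; the point is that a search keyed on a shared semantic property finds every postal-code column, whatever it happens to be called.

```python
# Sketch of property-based search over column metadata, as opposed to
# matching raw column-name tokens. All names below are invented.

POSTAL_CODE_NAMES = {"zip", "zip_code", "zipcode", "postal_code", "cep"}

columns = [
    {"table": "customers", "column": "zip"},
    {"table": "orders",    "column": "CEP"},
    {"table": "stores",    "column": "postal_code"},
    {"table": "stores",    "column": "city"},
]

def tag_properties(cols):
    # Attach a semantic property to each column whose name is a known
    # variant, without renaming the column itself.
    for col in cols:
        if col["column"].lower() in POSTAL_CODE_NAMES:
            col["property"] = "postal_code"
    return cols

def search_by_property(cols, prop):
    # Return every column carrying the property, whatever its name.
    return [(c["table"], c["column"]) for c in cols
            if c.get("property") == prop]

hits = search_by_property(tag_properties(columns), "postal_code")
```

A search for the property `postal_code` returns `zip`, `CEP`, and `postal_code` alike, and no one has to rename columns on every import.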

Abstraction, intuition,…

Tuesday, August 26th, 2014

Abstraction, intuition, and the “monad tutorial fallacy” by Brent Yorgey.

From the post:

While working on an article for the Monad.Reader, I’ve had the opportunity to think about how people learn and gain intuition for abstraction, and the implications for pedagogy. The heart of the matter is that people begin with the concrete, and move to the abstract. Humans are very good at pattern recognition, so this is a natural progression. By examining concrete objects in detail, one begins to notice similarities and patterns, until one comes to understand on a more abstract, intuitive level. This is why it’s such good pedagogical practice to demonstrate examples of concepts you are trying to teach. It’s particularly important to note that this process doesn’t change even when one is presented with the abstraction up front! For example, when presented with a mathematical definition for the first time, most people (me included) don’t “get it” immediately: it is only after examining some specific instances of the definition, and working through the implications of the definition in detail, that one begins to appreciate the definition and gain an understanding of what it “really says.”

Unfortunately, there is a whole cottage industry of monad tutorials that get this wrong….

It isn’t often that you see a blog post from 2009 that is getting comments in 2014!

I take the post to be more about pedagogy than monads but there are plenty of pointers to monad tutorials in the comments.

Another post mentioned in the comments that you may find useful: Developing Your Intuition For Math by Kalid Azad.

What if you ran a presentation from back to front? Start with concrete examples of your solution in action across multiple cases. Explain the cases. Extract the common paths or patterns. Then run out of time before repeating what everyone already knows about the area. Would that work?

The Hitchhiker’s Guide to…

Tuesday, August 26th, 2014

The Hitchhiker’s Guide to the Curry-Howard Correspondence by Chris Ford.

From the description:

Slides can be found here:

Functions are proofs. Um, I can see how that might work. Go on.

Types are propositions. Really? In what sense?

In fact, a function is the proof of the proposition its type represents. Whoa, you’ve lost me now.

Don’t Panic!

The Curry-Howard Correspondence is an elegant bridge between the planet of logic and the planet of programming, and it’s not actually that hard to understand.

In this talk I’ll use the Idris dependently-typed functional programming language for examples, as its type system is sophisticated enough to construct interesting automated proofs simply by writing functions. This talk is not designed to convert you into a theoretical computer scientist, but to share with you a wonderful sight in your journey through the vast and peculiar universe of programming.

A familiarity with functional programming would be useful for appreciating this talk, but it will not require any significant prior study of theoretical computer science.

Great presentation by Chris Ford at EuroClojure!

The only problem I had was coordinating the slides, which aren’t very visible in the presentation, with the slide deck you can download.

Definitely a watch more than once video.


Useful links (references in the slides):

Edwin Brady, Idris

Edwin Brady, Programming in Idris: A Tutorial

Brian McKenna, EvenOdd in Agda, Idris, Haskell, Scala

Philip Wadler, Propositions as Types (Updated, June 2014)
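The correspondence can be glimpsed even outside Idris. As a rough illustration in Python type hints (Python cannot check totality the way Idris does, so these are annotations rather than machine-verified proofs, but the shape of the idea is the same):

```python
# Hedged illustration of "propositions as types": a function whose type
# is A -> ((A -> B) -> B) is, under Curry-Howard, a proof of that
# implication. Python merely annotates this; Idris would verify it.

from typing import Callable, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")

# Proposition: from A and (A implies B), conclude B (modus ponens).
# The function body IS the proof: apply the implication to the evidence.
def modus_ponens(a: A, f: Callable[[A], B]) -> B:
    return f(a)

# Proposition: (A and B) implies A. Proof: project the first component.
def fst(pair: Tuple[A, B]) -> A:
    return pair[0]

# Conversely, no total function can have the type A -> B for arbitrary,
# unrelated A and B -- the corresponding proposition is unprovable.
result = modus_ponens(3, str)  # evidence for B = str, namely "3"
```

Each well-typed, total function is an inhabitant of its type, and an inhabited type is a proved proposition; that is the bridge the talk walks across.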

Test Your Analysis With Random Numbers

Tuesday, August 26th, 2014

A critical reanalysis of the relationship between genomics and well-being by Nicholas J. L. Brown, et al. (doi: 10.1073/pnas.1407057111)


Fredrickson et al. [Fredrickson BL, et al. (2013) Proc Natl Acad Sci USA 110(33):13684–13689] claimed to have observed significant differences in gene expression related to hedonic and eudaimonic dimensions of well-being. Having closely examined both their claims and their data, we draw substantially different conclusions. After identifying some important conceptual and methodological flaws in their argument, we report the results of a series of reanalyses of their dataset. We first applied a variety of exploratory and confirmatory factor analysis techniques to their self-reported well-being data. A number of plausible factor solutions emerged, but none of these corresponded to Fredrickson et al.’s claimed hedonic and eudaimonic dimensions. We next examined the regression analyses that purportedly yielded distinct differential profiles of gene expression associated with the two well-being dimensions. Using the best-fitting two-factor solution that we identified, we obtained effects almost twice as large as those found by Fredrickson et al. using their questionable hedonic and eudaimonic factors. Next, we conducted regression analyses for all possible two-factor solutions of the psychometric data; we found that 69.2% of these gave statistically significant results for both factors, whereas only 0.25% would be expected to do so if the regression process was really able to identify independent differential gene expression effects. Finally, we replaced Fredrickson et al.’s psychometric data with random numbers and continued to find very large numbers of apparently statistically significant effects. We conclude that Fredrickson et al.’s widely publicized claims about the effects of different dimensions of well-being on health-related gene expression are merely artifacts of dubious analyses and erroneous methodology. (emphasis added)

To see the details you will need a subscription to the Proceedings of the National Academy of Sciences.

However, you can take this data analysis lesson from the abstract:

If your data can be replaced with random numbers and still yield statistically significant results, stop the publication process. Something is seriously wrong with your methodology.
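That sanity check is easy to run yourself. The Python sketch below (all data synthetic) regresses a fixed outcome on hundreds of purely random predictors and counts how often the correlation looks “significant.” With a sound procedure the rate should hover near the alpha level of 5%, nothing like the 69.2% reported above.

```python
# Sanity check: correlate random predictors with a fixed outcome and
# count "significant" results. Expect roughly the alpha level (5%).
import math
import random

random.seed(42)
n, trials = 30, 500
outcome = [random.gauss(0, 1) for _ in range(n)]

def pearson_r(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

significant = 0
for _ in range(trials):
    predictor = [random.gauss(0, 1) for _ in range(n)]
    r = pearson_r(predictor, outcome)
    # t-statistic for r with n-2 degrees of freedom; 2.048 is the
    # two-sided 5% critical value for df = 28.
    t = r * math.sqrt((n - 2) / (1 - r * r))
    if abs(t) > 2.048:
        significant += 1

false_positive_rate = significant / trials  # should be near 0.05
```

If your own pipeline reports far more “hits” than this baseline when fed random numbers, the pipeline, not the data, is generating the findings.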

I first saw this in a tweet by WvSchaik.

6,482 Datasets Available

Tuesday, August 26th, 2014

6,482 Datasets Available Across 22 Federal Agencies In Data.json Files by Kin Lane.

From the post:

It has been a few months since I ran any of my federal government data.json harvesting, so I picked back up my work, and will be doing more work around datasets that federal agencies have been making available, and telling the stories across my network.

I’m still surprised at how many people are unaware that 22 of the top federal agencies have data inventories of their public data assets, available in the root of their domain as a data.json file. This means you can go to many of them and find a machine-readable list of that agency’s current inventory of public datasets.

See Kin’s post for links to the agency data.json files.

You may also want to read: What Happened With Federal Agencies And Their Data.json Files, which details Kin’s earlier efforts with tracking agency data.json files.

Kin points out that these data.json files are governed by: OMB M-13-13 Open Data Policy—Managing Information as an Asset. It’s pretty joyless reading but if you are interested in the policy details or the requirements agencies must meet, it’s required reading.

If you are looking for datasets to clean up or combine together, it would be hard to imagine a more diverse set to choose from.
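Getting a dataset inventory out of one of these files takes only a few lines. A hedged Python sketch follows: in my experience the early data.json files were a bare JSON array of datasets, while later Project Open Data guidance wraps them as `{"dataset": [...]}`, so the helper handles both. The sample document and its titles are invented for the example; in practice you would fetch `https://<agency>.gov/data.json`.

```python
# Sketch of reading an agency data.json inventory. Handles both the
# bare-array form and the {"dataset": [...]} catalog form.
import json

def dataset_titles(data_json_text):
    doc = json.loads(data_json_text)
    datasets = doc["dataset"] if isinstance(doc, dict) else doc
    return [d.get("title", "(untitled)") for d in datasets]

# A tiny invented example standing in for a fetched data.json file.
sample = '''
[
  {"title": "Toxics Release Inventory", "accessLevel": "public"},
  {"title": "Air Quality Index Daily Values", "accessLevel": "public"}
]
'''
titles = dataset_titles(sample)
```

Point the same helper at each of the 22 agency files and you have a quick cross-agency catalog to hunt through for datasets worth cleaning up or combining.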