Update with 162 new papers to Deeplearning.University Bibliography

October 17th, 2014

Update with 162 new papers to Deeplearning.University Bibliography by Amund Tveit.

From the post:

Added 162 new Deep Learning papers to the Deeplearning.University Bibliography. If you want to see them separately from the previous papers in the bibliography, the new ones are listed below. There are many highly interesting papers; a few examples are:

  1. Deep neural network based load forecast – forecasting electricity load
  2. The relation of eye gaze and face pose: Potential impact on speech recognition – combining speech recognition with facial expression
  3. Feature Learning from Incomplete EEG with Denoising Autoencoder – Deep Learning for Brain Computer Interfaces

Underneath are the 162 new papers, enjoy!

(Complete Bibliography – at Deeplearning.University Bibliography)

Disclaimer: we’re so far only covering (a subset of) 2014 deep learning papers, so still far from a complete bibliography, but our goal is to come close eventually.

Best regards,

Amund Tveit (Memkite Team)

You could find all these papers by search, if you knew what search terms to use.

This bibliography is a reminder of the power of curated data. The grouping of papers into categories is a definite value-add. Search doesn’t have those, in case you haven’t noticed. ;-)

DevCenter 1.2 delivers support for Cassandra 2.1 and query tracing

October 17th, 2014

DevCenter 1.2 delivers support for Cassandra 2.1 and query tracing by Alex Popescu.

From the post:

We’re very pleased to announce the availability of DataStax DevCenter 1.2, which you can download now. We’re excited to see how DevCenter has already become the de facto query and development tool for those of you working with Cassandra and DataStax Enterprise, and now with version 1.2, we’ve added additional support and options to make your development work even easier.

Version 1.2 of DevCenter delivers full support for the many new features in Apache Cassandra 2.1, including user defined types and tuples. DevCenter’s built-in validations, quick fix suggestions, the updated code assistance engine and the new snippets can greatly simplify your work with all the new features of Cassandra 2.1.

The download page offers the DataStax Sandbox if you are interested in a VM version.


BBC Genome Project

October 17th, 2014

BBC Genome Project

From the post:

This site contains the BBC listings information which the BBC printed in Radio Times between 1923 and 2009. You can search the site for BBC programmes, people, dates and Radio Times editions.

We hope it helps you find that long forgotten BBC programme, research a particular person or browse your own involvement with the BBC.

This is a historical record of both the planned output and the BBC services of any given time. It should be viewed in this context and with the understanding that it reflects the attitudes and standards of its time – not those of today.

Join in

You can join in and become part of the community that is improving this resource. As a result of the scanning process there are lots of spelling mistakes and punctuation errors and you can edit the entries to accurately reflect the magazine entry. You can also tell us when the schedule changed and we will hold on to that information for the next stage of this project.

What a delightful resource to find on a Friday!

True, there are no links to the original programmes, but perhaps someday?


I first saw this in a tweet by Tom Loosemore.

Update: Genome: behind the scenes by Andy Armstrong.

From the post:

In October 2011 Helen Papadopoulos wrote about the Genome project – a mammoth effort to digitise an issue of the Radio Times from every week between 1923 and 2009 and make searchable programme listings available online.

Helen expected there to be between 3 and 3.5 million programme entries. Since then the number has grown to 4,423,653 programmes from 4,469 issues. You can now browse and search all of them at http://genome.ch.bbc.co.uk/

Back in 2011 the process of digitising the scanned magazines was well advanced and our thoughts were turning to how to present the archive online. It’s taken three years and a few prototypes to get us to our first public release.

Andy gives you the backend view of the BBC Genome Project.

I first saw this in a tweet by Jem Stone.

Mobile encryption could lead to FREEDOM

October 17th, 2014

FBI Director: Mobile encryption could lead us to ‘very dark place’ by Charlie Osborne.

Oops! Looks like I misquoted the headline!

Charlie got the FBI Director’s phrase right but I wanted to emphasize the cost of the FBI’s position.

The choices really are that stark: You can have encryption + freedom or back doors + government surveillance.

Director Comey argues that mechanisms are in place to make sure the government obeys the law. I concede there are mechanisms with that purpose, but the reason we are having this national debate is that the government chose not to use those mechanisms.

Having not followed its own rules for years, why should we accept the government’s word that it won’t do so again?

The time has come to “go dark,” not just on mobile devices but all digital communications. It won’t be easy at first but products will be created to satisfy the demand to “go dark.”

Any artists in the crowd? We’ll need buttons for “Going Dark,” “Go Dark,” and “Gone Dark.”

BTW, read Charlie’s post in full to get a sense of the arguments the FBI will be making against encryption.

PS: Charlie mentions that Google and Apple will be handing encryption keys over to customers. That means the 5th Amendment protection against self-incrimination comes into play. You can refuse to hand over the keys!

There is an essay on the 5th Amendment and encryption at: The Fifth Amendment, Encryption, and the Forgotten State Interest by Dan Terzian. 61 UCLA L. Rev. Disc. 298 (2014).


This Essay considers how the Fifth Amendment’s Self Incrimination Clause applies to encrypted data and computer passwords. In particular, it focuses on one aspect of the Fifth Amendment that has been largely ignored: its aim to achieve a fair balance between the state’s interest and the individual’s. This aim has often guided courts in defining the Self Incrimination Clause’s scope, and it should continue to do so here. With encryption, a fair balance requires permitting the compelled production of passwords or decrypted data in order to give state interests, like prosecution, an even chance. Courts should therefore interpret Fifth Amendment doctrine in a manner permitting this compulsion.

I hope Terzian’s position never prevails, but you do need to know the arguments that will be made in its support.

COLD 2014 Consuming Linked Data

October 16th, 2014

COLD 2014 Consuming Linked Data

Table of Contents

You can get an early start on your weekend reading now! ;-)

Free Public Access to Federal Materials on Guide to Law Online [Browsing, No Search]

October 16th, 2014

Free Public Access to Federal Materials on Guide to Law Online by Donna Sokol.

From the post:

Through an agreement with the Library of Congress, the publisher William S. Hein & Co., Inc. has generously allowed the Law Library of Congress to offer free online access to historical U.S. legal materials from HeinOnline. These titles are available through the Library’s web portal, Guide to Law Online: U.S. Federal, and include:

I should be happy but then I read:

These collections are browseable. For example, to locate the 1982 version of the Bankruptcy code in Title 11 of the U.S. Code you could select the year (1982) and then Title number (11) to retrieve the material. (emphasis added)

Err, actually it should say: These collections are browseable only. No search within or across the collections.

Here is an example:

supreme court default listing

If you expand volume 542 you will see:

supreme court volume 542

Look! There is Intel v. AMD, let’s look at that one!

Intel v. AMD download page

Did I just overlook a search box?

I checked the others and you can too.

I did find one that was small enough (less than 20 pages I suppose) to have a search function:

CFR General Provisions image

So, let’s search for something that ought to be in the CFR general provisions, like “department:”

Department in search box

The result?

search error

Actually that is an abbreviation of the error message. Waste of space to show more.

To summarize, the Library of Congress has arranged for all of us to have browseable access but no search to:

  • United States Code 1925-1988 (includes content up to 1993)
    • From Guide to Law Online: United States Law
  • United States Reports v. 1-542 (1754-2004)
    • From Guide to Law Online: United States Judiciary
  • Code of Federal Regulations (1938-1995)
    • From Guide to Law Online: Executive
  • Federal Register v. 1-58 (1936-1993)
    • From Guide to Law Online: Executive

Hundreds of thousands of pages of some of the most complex documents in history and no searching.

If that’s helping us, I don’t think we can afford much more help from the Library of Congress. That’s a hard thing for me to say because in the vast majority of cases I really like and support the Library of Congress (aside from the robber baron refugees holed up in the Copyright Office).

Just so I don’t end on a negative note, I have a suggestion to correct this situation:

Give Thomson Reuters (I knew them as West Publishing Company) or LexisNexis a call. Either one is capable of a better solution than you have with William S. Hein & Co., Inc. Either one has “related” products it could tastefully suggest along with search results.

Storyline Ontology

October 16th, 2014

Storyline Ontology

From the post:

The News Storyline Ontology is a generic model for describing and organising the stories news organisations tell. The ontology is intended to be flexible to support any given news or media publisher’s approach to handling news stories. At the heart of the ontology, is the concept of Storyline. As a nuance of the English language the word ‘story’ has multiple meanings. In news organisations, a story can be an individual piece of content, such as an article or news report. It can also be the editorial view on events occurring in the world.

The journalist pulls together information, facts, opinion, quotes, and data to explain the significance of world events and their context to create a narrative. The event is an award being received; the story is the triumph over adversity and personal tragedy of the victor leading up to receiving the reward (and the inevitable fall from grace due to drugs and sexual peccadillos). Or, the event is a bombing outside a building; the story is an escalating civil war or a gas mains fault due to cost cutting. To avoid this confusion, the term Storyline has been used to remove the ambiguity between the piece of creative work (the written article) and the editorial perspective on events.

Storyline ontology

I know, it’s RDF. Still, the ontology itself, aside from the RDF cruft, represents a thought-out and shared view of story development by major news producers. It is important for that reason if no other.

And you can use it as the basis for developing or integrating other story development ontologies.

Just as the post acknowledges:

As news stories are typically of a subjective nature (one news publisher’s interpretation of any given news story may be different from another’s), Storylines can be attributed to some agent to provide this provenance.

the same is true for ontologies. Ready to claim credit/blame for yours?
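The distinction the ontology draws – the creative work versus the editorial storyline, each attributable to an agent for provenance – can be sketched as a tiny data model. This is an illustrative Python sketch only; the class and field names below are invented, not the ontology’s actual terms:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Storyline:
    """An editorial perspective on events, attributed for provenance."""
    slug: str
    attributed_to: str               # the agent (e.g. a news organisation)
    events: List[str] = field(default_factory=list)

@dataclass
class Article:
    """A concrete piece of content that reports on a storyline."""
    headline: str
    storyline: Storyline
```

Two publishers could attach different Storyline objects to articles about the same event, which is exactly the subjectivity the post acknowledges.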

IBM Watson: How it Works [This is a real hoot!]

October 16th, 2014

Dibs on why “artificial intelligence” has failed, is failing, and will fail! (At least if you think “artificial intelligence” means reasoning like a human being.)

IBM describes the decision making process in humans as four steps:

  1. Observe
  2. Interpret and draw hypotheses
  3. Evaluate which hypothesis is right or wrong
  4. Decide based on the evaluation

Most of us learned those four steps or variations on them as part of research paper writing or introductions to science. And we have heard them repeated in a variety of contexts.

However, we also know that model of human “reasoning” is a fantasy. Most if not all of us claim to follow it but the truth about the vast majority of decision making has little to do with those four steps.

That’s not just a “blog opinion” but one that has been substantiated by years of research. Look at any chapter in Thinking, Fast and Slow by Daniel Kahneman and tell me how Watson’s four step process is a better explanation than the one you will find there.

One of my favorite examples was the impact of meal times on parole decisions in Israel. Shai Danziger, Jonathan Levav, and Liora Avnaim-Pesso, “Extraneous Factors in Judicial Decisions,” PNAS 108 (2011): 6889-92.

Abstract from Danziger:

Are judicial rulings based solely on laws and facts? Legal formalism holds that judges apply legal reasons to the facts of a case in a rational, mechanical, and deliberative manner. In contrast, legal realists argue that the rational application of legal reasons does not sufficiently explain the decisions of judges and that psychological, political, and social factors influence judicial rulings. We test the common caricature of realism that justice is “what the judge ate for breakfast” in sequential parole decisions made by experienced judges. We record the judges’ two daily food breaks, which result in segmenting the deliberations of the day into three distinct “decision sessions.” We find that the percentage of favorable rulings drops gradually from ≈65% to nearly zero within each decision session and returns abruptly to ≈65% after a break. Our findings suggest that judicial rulings can be swayed by extraneous variables that should have no bearing on legal decisions.

If the rate of favorable parole rulings starts at ≈65% right after breakfast or lunch and dwindles to nearly zero, I know when I want my case heard.

That is just one example from hundreds in Kahneman.

Watson lacks the irrationality necessary to “reason like a human being.”

(Note that Watson is only given simple questions. No questions about policy choices in long simmering conflicts. We save those for human beings.)

GraphLab Create™ v1.0 Now Generally Available

October 16th, 2014

GraphLab Create™ v1.0 Now Generally Available by Johnnie Konstantas.

From the post:

It is with tremendous pride in this amazing team that I am posting on the general availability of version 1.0, our flagship product. This work represents a bar being set on usability, breadth of features and productivity possible with a machine learning platform.

What’s next, you ask? It’s easy to talk about all of our great plans for scale and administration but I want to give this watershed moment its due. Have a look at what’s new.

graphlab demo

New features available in the GraphLab Create platform include:

  • Predictive Services – Companies can build predictive applications quickly, easily, and at scale.  Predictive service deployments are scalable, fault-tolerant, and high performing, enabling easy integration with front-end applications. Trained models can be deployed on Amazon Elastic Compute Cloud (EC2) and monitored through Amazon CloudWatch. They can be queried in real-time via a RESTful API and the entire deployment pipeline is seen through a visual dashboard. The time from prototyping to production is dramatically reduced for GraphLab Create users.
  • Deep Learning – These models are ideal for automatic learning of salient features, without human supervision, from data such as images. Combined with GraphLab Create image analysis tools, the Deep Learning package enables accurate and in-depth understanding of images and videos. The GraphLab Create image analysis package makes quick work of importing and preprocessing millions of images as well as numeric data. It is built on the latest architectures including Convolution Layer, Max, Sum, Average Pooling and Dropout. The available API allows for extensibility in building user custom neural networks. Applications include image classification, object detection and image similarity.
  • Boosted Trees – With this feature, GraphLab adds support for this popular class of algorithms for robust and accurate regression and classification tasks.  With an out-of-core implementation, Boosted Trees in GraphLab Create can easily scale up to large datasets that do not fit into memory.

  • Visualization – New dashboards allow users to visualize the status and health of offline jobs deployed in various environments including local, Hadoop Clusters and EC2.  Also part of GraphLab Canvas is the visualization of GraphLab SFrames and SGraphs, enabling users to explore tables, graphs, text and images, in a single interactive environment making feature engineering more efficient.

…(and more)
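The boosted-trees idea – fit a weak tree, then repeatedly fit further trees to the current residuals – is independent of GraphLab’s API. Below is a bare-bones gradient-boosting regressor built from one-split “stumps,” shown only to illustrate the mechanics (GraphLab’s implementation is out-of-core and far more capable; none of this is their code):

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on one feature that minimizes
    squared error; return a predict function for the stump."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x, t=t, lo=lmean, hi=rmean: lo if x <= t else hi

def boost(xs, ys, rounds=100, lr=0.1):
    """Gradient boosting for squared loss: start from the mean, then
    repeatedly fit a stump to the residuals and add a damped copy."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + sum(lr * s(x) for s in stumps)
```

Real systems split on many features, regularize, and stream data from disk, but the residual-fitting loop is the heart of the algorithm.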

Rather than downloading the software, go to GraphLab Create™ Quick Start to generate a product key. After you generate a product key (displayed on webpage), GraphLab offers command line code to set you up for installing GraphLab via pip. Quick and easy on Ubuntu 12.04.

Next stop: The Five-Line Recommender, Explained by Alice Zheng. ;-)


Bloom Filters

October 15th, 2014

Bloom Filters by Jason Davies.

From the post:

Everyone is always raving about bloom filters. But what exactly are they, and what are they useful for?

Very straightforward explanation along with interactive demo. The applications section will immediately suggest how Bloom filters could be used when querying.

There are other complexities, see the Bloom Filter entry at Wikipedia. But as a first blush explanation, you will be hard pressed to find one as good as Jason’s.
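For readers who want the idea in code rather than animation, a minimal sketch follows: k salted hashes map each item to k bit positions, membership checks can yield false positives but never false negatives. The sizes and hashing scheme here are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set: "probably present" (may be a false positive).
        # Any bit clear: definitely absent (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The querying application is visible immediately: an index can answer “definitely not here” cheaply before paying for an exact lookup.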

I first saw this in a tweet by Allen Day.

How To Build Linked Data APIs…

October 15th, 2014

This is the second high signal-to-noise presentation I have seen this week! I am sure that streak won’t last but I will enjoy it as long as it does.

Resources for after you see the presentation: Hydra: Hypermedia-Driven Web APIs, JSON for Linking Data, and, JSON-LD 1.0.

Near the end of the presentation, Marcus quotes Phil Archer, W3C Data Activity Lead:

Archer on Semantic Web

Which is an odd statement considering that JSON-LD 1.0 Section 7 Data Model, reads in part:

JSON-LD is a serialization format for Linked Data based on JSON. It is therefore important to distinguish between the syntax, which is defined by JSON in [RFC4627], and the data model which is an extension of the RDF data model [RDF11-CONCEPTS]. The precise details of how JSON-LD relates to the RDF data model are given in section 9. Relationship to RDF.

And section 9. Relationship to RDF reads in part:

JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model. However, JSON-LD also extends the RDF data model to optionally allow JSON-LD to serialize Generalized RDF Datasets. The JSON-LD extensions to the RDF data model are:…

Is JSON-LD “…a concrete RDF syntax…” where you can ignore RDF?

Not that I was ever a fan of RDF but standards should be fish or fowl and not attempt to be something in between.
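The dual nature at issue is easy to see in a small example: the document below is plain JSON that any JSON tool can read, while an RDF processor can expand the @context into triples. A sketch using only Python’s standard library; the schema.org terms are real, but the example identifiers are invented:

```python
import json

# A minimal JSON-LD document: plain JSON on the surface, while @context
# maps the keys onto RDF vocabulary terms.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": {"@id": "http://schema.org/url", "@type": "@id"},
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "homepage": "http://example.org/alice",
}

serialized = json.dumps(doc, indent=2)
```

A developer can ignore the RDF reading entirely and treat this as a record with two fields; whether that counts as “escaping” RDF is exactly the question raised above.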

5 Machine Learning Areas You Should Be Cultivating

October 15th, 2014

5 Machine Learning Areas You Should Be Cultivating by Jason Brownlee.

From the post:

You want to learn machine learning to have more opportunities at work or to get a job. You may already be working as a data scientist or machine learning engineer and looking to improve your skills.

It is about as easy to pigeonhole machine learning skills as it is programming skills (you can’t).

There is a wide array of tasks that require some skill in data mining and machine learning in business from data analysis type work to full systems architecture and integration.

Nevertheless there are common tasks and common skills that you will want to develop, just like you could suggest for an aspiring software developer.

In this post we will look at 5 key areas where you might want to develop skills and the types of activities that you could take on to practice in those areas.

Jason has a number of useful suggestions for the five areas and you will profit from taking his advice.

At the same time, I would keep a notebook of the assumptions or exploits that are possible with every technique or process that you learn. Results and data will be presented to you as though both are clean. It is your responsibility to test that presentation.

Concatenative Clojure

October 15th, 2014

Concatenative Clojure by Brandon Bloom.


Brandon Bloom introduces Factor and demonstrates Factjor – a concatenative DSL – and DomScript – a DOM library written in ClojureScript – in the context of concatenative programming.

Brandon compares and contrasts applicative and concatenative programming languages, concluding with this table:

presentation slide

He urges viewers to explore Factjor and to understand the differences between applicative and concatenative programming languages. It is a fast-moving presentation that will require viewing more than once!

Watch for new developments at: https://github.com/brandonbloom

I first saw this in a tweet by William Byrd.

Google details new “Poodle” bug…

October 15th, 2014

Google details new “Poodle” bug, making browsers susceptible to hacking by Jonathan Vanian.

From the post:

Google’s security team detailed today a new bug that takes advantage of a design flaw in SSL version 3.0, a security protocol created by Netscape in the mid 1990s. The researchers called it a Padding Oracle on Downgraded Legacy Encryption bug, or POODLE.

Although the protocol is old, Google said that “nearly all browsers support it” and it’s available for hackers to exploit. Even though many modern-day websites use the TLS security protocol (essentially, the next-generation SSL) as their means of encrypting data for a secure network connection between a browser and a website, things can run amok if the connection goes down for some reason.

See Jonathan’s post for more “Poodle” details.
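On the defensive side, the standard mitigation is to refuse SSLv3 outright so a downgrade attack has nothing to fall back to. A minimal sketch with Python’s ssl module (recent Python/OpenSSL builds already exclude SSLv3 by default; the explicit flag records the intent and protects older builds):

```python
import ssl

# Client-side TLS context that explicitly refuses SSLv3 connections,
# closing off the downgrade path POODLE relies on.
ctx = ssl.create_default_context()
ctx.options |= ssl.OP_NO_SSLv3
```

The same option can be set on server-side contexts; the point is that no endpoint should ever agree to negotiate SSL 3.0.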

Suggestions for a curated and relatively comprehensive collection of security bugs as they are discovered? I ask because I follow a couple of fairly active streams but I haven’t found one that I would call “curated,” in the sense that each bug is reported once and only once, with related material linked to it.

Is it just me or would others find that to be a useful resource?

Inductive Graph Representations in Idris

October 15th, 2014

Inductive Graph Representations in Idris by Michael R. Bernstein.

An early exploration of Martin Erwig’s Inductive Graphs and Functional Graph Algorithms.

Abstract (of Erwig’s paper):

We propose a new style of writing graph algorithms in functional languages which is based on an alternative view of graphs as inductively defined data types. We show how this graph model can be implemented efficiently and then we demonstrate how graph algorithms can be succinctly given by recursive function definitions based on the inductive graph view. We also regard this as a contribution to the teaching of algorithms and data structures in functional languages since we can use the functional-graph algorithms instead of the imperative algorithms that are dominant today.

You can follow Michael at: @mrb_bk or https://github.com/mrb or his blog: http://michaelrbernste.in/.

More details on Idris: A Language With Dependent Types.
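Erwig’s inductive view – a graph is either empty, or a context (predecessors, node, successors) plus a smaller graph – can be mimicked even in Python, though without the pattern matching and type guarantees Idris or Haskell provide. A rough sketch, not drawn from Bernstein’s or Erwig’s actual code:

```python
def match(node, graph):
    """Decompose `graph` (a dict: node -> set of successor nodes) into
    the context of `node` plus the remaining smaller graph, mirroring
    Erwig's inductive view of graphs."""
    if node not in graph:
        return None, graph
    succs = set(graph[node])
    preds = {n for n, out in graph.items() if node in out and n != node}
    rest = {n: (out - {node}) for n, out in graph.items() if n != node}
    return (preds, node, succs), rest

def dfs(stack, graph):
    """Depth-first order as recursion over decompositions: threading
    the shrinking graph makes 'visited' bookkeeping implicit."""
    if not stack or not graph:
        return []
    head, *tail = stack
    ctx, rest = match(head, graph)
    if ctx is None:                  # head was already consumed earlier
        return dfs(tail, graph)
    _preds, node, succs = ctx
    return [node] + dfs(sorted(succs) + tail, rest)
```

Because each recursive call receives the graph minus the node just visited, the usual mutable visited-set disappears, which is the pedagogical point of the functional-graph style.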

Cryptic genetic variation in software:…

October 14th, 2014

Cryptic genetic variation in software: hunting a buffered 41 year old bug by Sean Eddy.

From the post:

In genetics, cryptic genetic variation means that a genome can contain mutations whose phenotypic effects are invisible because they are suppressed or buffered, but under rare conditions they become visible and subject to selection pressure.

In software code, engineers sometimes also face the nightmare of a bug in one routine that has no visible effect because of a compensatory bug elsewhere. You fix the other routine, and suddenly the first routine starts failing for an apparently unrelated reason. Epistasis sucks.

I’ve just found an example in our code, and traced the origin of the problem back 41 years to the algorithm’s description in a 1973 applied mathematics paper. The algorithm — for sampling from a Gaussian distribution — is used worldwide, because it’s implemented in the venerable RANLIB software library still used in lots of numerical codebases, including GNU Octave. It looks to me that the only reason code has been working is that a compensatory “mutation” has been selected for in everyone else’s code except mine.


A bug hunting story to read and forward! Sean just bagged a forty-one (41) year old bug. What’s the oldest bug you have ever found?

When you reach the crux of the problem, you will understand why ambiguous, vague, incomplete and poorly organized standards annoy me to no end.

No guarantees of unambiguous results but if you need extra eyes on IT standards you know where to find me.

I first saw this in a tweet by Neil Saunders.

Classifying Shakespearean Drama with Sparse Feature Sets

October 14th, 2014

Classifying Shakespearean Drama with Sparse Feature Sets by Douglas Duhaime.

From the post:

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: “Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage” (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

A quick reminder that not all text analysis is concerned with 140 character strings. ;-)

Do you prefer:

high dimensional

where every letter in “high dimensional” is a hyperlink with an unknown target, or a fuller listing:

Allison, Sarah, and Ryan Heuser, Matthew Jockers, Franco Moretti, Michael Witmore. Quantitative Formalism: An Experiment

Jockers, Matthew. Machine-Classifying Novels and Plays by Genre

Hope, Jonathan and Michael Witmore. “The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches to Shakespeare’s Language of Genre

Hope, Jonathan. Shakespeare by the numbers: on the linguistic texture of the Late Plays

Hope, Jonathan and Michael Witmore. The Very Large Textual Object: A Prosthetic Reading of Shakespeare

Lenthe, Victor. Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

Stumpf, Mike. How Quickly Nature Falls Into Revolt: On Revisiting Shakespeare’s Genres

Stumpf, Mike. This Thing of Darkness (Part III)

Tootalian, Jacob A. Shakespeare, Without Measure: The Rhetorical Tendencies of Renaissance Dramatic Prose

Ullyot, Michael. Encoding Shakespeare

Witmore, Michael. A Genre Map of Shakespeare’s Plays from the First Folio (1623)

Witmore, Michael. Shakespeare Out of Place?

Witmore, Michael. Shakespeare Quarterly 61.3 Figures

Witmore, Michael. Visualizing English Print, 1530-1800, Genre Contents of the Corpus

Decompiling Shakespeare (Site is down. Was also down when the WayBack machine tried to archive the site in July of 2014)

I prefer the longer listing.

If you are interested in Shakespeare, Folger Digital Texts has free XML and PDF versions of his work.

I first saw this in a tweet by Gregory Piatetsky.

RNeo4j: Neo4j graph database combined with R statistical programming language

October 14th, 2014

From the description:

RNeo4j combines the power of a Neo4j graph database with the R statistical programming language to easily build predictive models based on connected data. From calculating the probability of friends of friends connections to plotting an adjacency heat map based on graph analytics, the RNeo4j package allows for easy interaction with a Neo4j graph database.

Nicole is the author of the RNeo4j R package. Don’t be dismayed by the “What is a Graph” and “What is R” sections in the presentation outline. Mercifully, they take only three minutes, followed by a rocking live coding demonstration of the package!

Beyond Neo4j and R, use this webinar as a standard for the useful content that should appear in a webinar!

RNeo4j at Github.

How designers prototype at GDS

October 14th, 2014

How designers prototype at GDS by Rebecca Cottrell.

From the post:

All of the designers at GDS can code or are learning to code. If you’re a designer who has used prototyping tools like Axure for a large part of your professional career, the idea of prototyping in code might be intimidating. Terrifying, even.

I’m a good example of that. When I joined GDS I felt intimidated by the idea of using Terminal and things like Git and GitHub, and just the perceived slowness of coding in HTML.

At first I felt my workflow had slowed down significantly, but the reason for that was the learning curve involved – I soon adapted and got much faster.

GDS has lots of tools (design patterns, code snippets, front-end toolkit) to speed things up. Sharing what I learned in the process felt like a good idea to help new designers get to grips with how we work.

Not a rigid set of prescriptions but experience at prototyping and pointers to other resources. Whether you have a current system of prototyping or not, you are very likely to gain a tip or two from this post.

I first saw this in a tweet by Ben Terrett.

ADW (Align, Disambiguate and Walk) [Semantic Similarity]

October 14th, 2014

ADW (Align, Disambiguate and Walk) version 1.0 by Mohammad Taher Pilehvar.

From the webpage:

This package provides a Java implementation of ADW, a state-of-the-art semantic similarity approach that enables the comparison of lexical items at different lexical levels: from senses to texts. For more details about the approach please refer to: http://wwwusers.di.uniroma1.it/~navigli/pubs/ACL_2013_Pilehvar_Jurgens_Navigli.pdf

The abstract for the paper reads:

Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

Online Demo.

The strength of this approach is the use of multiple levels of semantic similarity. It relies on WordNet but the authors promise to extend their approach to named entities and other tokens not appearing in WordNet (like your company or industry’s internal vocabulary).
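The paper’s core move is representing every lexical item – a sense, a word, or a whole text – as a probability distribution over word senses, so any two items can be compared with one measure. A toy sketch of that comparison step only; the sense identifiers and weights below are invented for illustration (ADW itself derives them from random walks over WordNet):

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse distributions over senses,
    each given as a dict mapping sense-id -> probability mass."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Invented sense distributions for two word senses and one short text.
bank_money = {"bank.n.01": 0.7, "depository.n.01": 0.3}
bank_river = {"bank.n.09": 0.8, "slope.n.01": 0.2}
text_finance = {"bank.n.01": 0.5, "loan.n.01": 0.3, "depository.n.01": 0.2}
```

Because everything lives in the same sense space, the same function compares a word to a document or a sense to a sense, which is the “multiple levels” claim in the abstract.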

The bibliography of the paper cites much of the recent work in this area so that will be an added bonus for perusing the paper.

I first saw this in a tweet by Gregory Piatetsky.

The Dirty Little Secret of Cancer Research

October 13th, 2014

The Dirty Little Secret of Cancer Research by Jill Neimark.

From the post:

Across different fields of cancer research, up to a third of all cell lines have been identified as imposters. Yet this fact is widely ignored, and the lines continue to be used under their false identities. As recently as 2013, one of Ain’s contaminated lines was used in a paper on thyroid cancer published in the journal Oncogene.

“There are about 10,000 citations every year on false lines—new publications that refer to or rely on papers based on imposter (human cancer) cell lines,” says geneticist Christopher Korch, former director of the University of Colorado’s DNA Sequencing & Analysis Core Facility. “It’s like a huge pyramid of toothpicks precariously and deceptively held together.”

For all the worry about “big data,” where is the concern over “big bad data?”

Or is “big data” too big for correctness of the data to matter?

Once you discover that a paper is based on “imposter (human cancer) cell lines,” how do you pass that information along to anyone who attempts to cite the article?

In other words, where do you write down that data about the paper, where the paper is the subject in question?

And how do you propagate that data across a universe of citations?

The post ends on a high note of current improvements but it is far from settled how to prevent reliance on compromised research.

I first saw this in a tweet by Dan Graur.

Scrape the Gibson: Python skills for data scrapers

October 13th, 2014

Scrape the Gibson: Python skills for data scrapers by Brian Abelson.

From the post:

Two years ago, I learned I had superpowers. Steve Romalewski was working on some fascinating analyses of CitiBike locations and needed some help scraping information from the city’s data portal. Cobbling together the little I knew about R, I wrote a simple scraper to fetch the json files for each bike share location and output it as a csv. When I opened the clean data in Excel, the feeling was tantamount to this scene from Hackers:

Ever since then I’ve spent a good portion of my life scraping data from websites. From movies, to bird sounds, to missed connections, and john boards (don’t ask, I promise it’s for good!), there’s not much I haven’t tried to scrape. In many cases, I don’t even analyze the data I’ve obtained, and the whole process amounts to a nerdy version of sport hunting, with my comma-delimited trophies mounted proudly on Amazon S3.

Important post for two reasons:

  • A good introduction to the art of scraping data
  • Sets a norm for sharing scraped data
    • The people who force scraping of data don’t want it shared, combined, merged or analyzed.

      You can help in disappointing them! ;-)
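The json-to-csv pattern Abelson describes (fetch a JSON feed per bike-share location, flatten it to CSV) can be sketched with the standard library alone. The station fields below are invented, and an inline sample stands in for the live feed a real scraper would fetch:

```python
# Minimal sketch of the json-to-csv scraping pattern: parse a JSON list
# of station records and flatten it to CSV rows. Field names are invented.

import csv
import io
import json

def stations_to_csv(raw_json, out):
    """Flatten a JSON list of station records into CSV."""
    stations = json.loads(raw_json)
    writer = csv.DictWriter(out, fieldnames=["id", "name", "lat", "lon"])
    writer.writeheader()
    for s in stations:
        writer.writerow({k: s.get(k, "") for k in ("id", "name", "lat", "lon")})

# In practice you would first fetch the feed, e.g. with
# urllib.request.urlopen("https://example.com/stations.json").read().
# Here a small inline sample stands in for the live feed.
sample = '[{"id": 1, "name": "W 52 St", "lat": 40.767, "lon": -73.984}]'
buf = io.StringIO()
stations_to_csv(sample, buf)
print(buf.getvalue())
```

Swap the inline sample for a loop over per-location feed URLs and you have the shape of the scraper described in the post.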

Making of: Introduction to A*

October 13th, 2014

Making of: Introduction to A* by Amit Patel.

From the post:

(Warning: these notes are rough – the main page is here and these are some notes I wrote for a few colleagues and then I kept adding to it until it became a longer page)

Several people have asked me how I make the diagrams on my tutorials.

I need to learn the algorithm and data structures I want to demonstrate. Sometimes I already know them. Sometimes I know nothing about them. It varies a lot. It can take 1 to 5 months to make a tutorial. It’s slow, but the more I make, the faster I am getting.

I need to figure out what I want to show. I start with what’s in the algorithm itself: inputs, outputs, internal variables. With A*, the input is (start, goal, graph), the output is (parent pointers, distances), and the internal variables are (open set, closed set, parent pointers, distances, current node, neighbors, child node). I’m looking for the main idea to visualize. With A* it’s the frontier, which is the open set. Sometimes the thing I want to visualize is one of the algorithm’s internal variables, but not always.

Pure gold on making diagrams for tutorials here. You may make different choices but it isn’t often that the process of making a choice is exposed.

Pass this along. We all benefit from better illustrations in tutorials!
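For readers who want code alongside the diagrams, the internals Amit enumerates, input (start, goal, graph), output (parent pointers, distances), and the frontier/open set as the thing to visualize, can be sketched as follows. The grid graph and Manhattan heuristic are illustrative choices, not taken from Amit's tutorial:

```python
# A* sketch: input (start, goal, graph), output (parent pointers,
# distances), with the frontier (open set) kept in a priority queue.

import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return (parent-pointer dict, distance dict) from start toward goal."""
    frontier = [(heuristic(start, goal), start)]  # the open set
    parents = {start: None}
    distances = {start: 0}
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            break
        for nxt, cost in neighbors(current):
            new_dist = distances[current] + cost
            if nxt not in distances or new_dist < distances[nxt]:
                distances[nxt] = new_dist
                parents[nxt] = current
                heapq.heappush(frontier, (new_dist + heuristic(nxt, goal), nxt))
    return parents, distances

# A tiny 5x5 4-connected grid as the graph, unit step costs.
def grid_neighbors(p):
    x, y = p
    return [((x + dx, y + dy), 1)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 5 and 0 <= y + dy < 5]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

parents, distances = a_star((0, 0), (4, 4), grid_neighbors, manhattan)
print(distances[(4, 4)])  # shortest path length on the grid
```

Animating the contents of `frontier` between iterations of the while loop is essentially what Amit's interactive diagrams do.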

The Big List of D3.js Examples (Approx. 2500 Examples)

October 13th, 2014

The Big List of D3.js Examples by Christophe Viau.

The interactive version has 2523 examples, whereas the numbered list has 1897 examples, as of 13 October 2014.

There is a rudimentary index of the examples. That’s an observation, not a complaint. Effective indexing of the examples would be a real challenge to the art of indexing.

The current index uses chart type, a rather open-ended category. The subject matter of the charts would be another way to index, as would the D3 techniques used, or the datasets being combined with other data.

Effective access to the techniques and data represented by this collection would be awesome!

Give it some thought.

I first saw this in a tweet by Michael McGuffin.

Introduction to Graphing with D3.js

October 13th, 2014

Introduction to Graphing with D3.js by Jan Milosh.

From the post:

D3.js (d3js.org) stands for Data-Driven Documents, a JavaScript library for data visualization. It was created by Mike Bostock, based on his PhD studies in the Stanford University data visualization program. Mike now works at the New York Times who sponsors his open source work.

D3 was designed for more than just graphs and charts. It’s also capable of presenting maps, networks, and ordered lists. It was created for the efficient manipulation of documents based on data.

This demonstration will focus on creating a simple scatter plot.

If you are not already using D3 for graphics, Jan’s post is an easy introduction with additional references to take you further.


I first saw this in a tweet by Christophe Viau.

Mirrors for Princes and Sultans:…

October 13th, 2014

Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds by Lisa Blaydes, Justin Grimmer, and Alison McQueen.


Among the most significant forms of political writing to emerge from the medieval period are texts offering advice to kings and other high-ranking officials. Books of counsel varied considerably in their content and form; scholars agree, however, that such texts reflected the political exigencies of their day. As a result, writings in the “mirrors for princes” tradition offer valuable insights into the evolution of medieval modes of governance. While European mirrors (and Machiavelli’s Prince in particular) have been extensively studied, there has been less scholarly examination of a parallel political advice literature emanating from the Islamic world. We compare Muslim and Christian advisory writings from the medieval period using automated text analysis, identify sixty conceptually distinct topics that our method automatically categorizes into three areas of concern common to both Muslim and Christian polities, and examine how they evolve over time. We offer some tentative explanations for these trends.

If you don’t know the phrase, “mirrors for princes,”:

texts that seek to offer wisdom or guidance to monarchs and other high-ranking advisors.

Since nearly all bloggers and everyone with a byline in traditional media considers themselves qualified to offer advice to “…monarchs and other high-ranking advisors,” one wonders how the techniques presented would fare with modern texts?

Certainly a different style of textual analysis than is seen outside the humanities and so instructive for that purpose.

I do wonder about comparing texts in English translation. That is obviously easier, but it runs the risk of comparing translators to translators rather than the thoughts of the original authors.

I first saw this in a tweet by Christopher Phipps.

Measuring Search Relevance

October 13th, 2014

Measuring Search Relevance by Hugh E. Williams.

From the post:

The process of asking many judges to assess search performance is known as relevance judgment: collecting human judgments on the relevance of search results. The basic task goes like this: you present a judge with a search result, and a search engine query, and you ask the judge to assess how relevant the item is to the query on (say) a four-point scale.

Suppose the query you want to assess is ipod nano 16Gb. Imagine that one of the results is a link to Apple’s page that describes the latest Apple iPod nano 16Gb. A judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale). They’d then click on a radio button to record their vote and move on to the next task. If the result we showed them was a story about a giraffe, the judge might decide this result is “irrelevant” (say the lowest rating on the four point scale). If it were information about an iPhone, it might be “partially relevant” (say the second-to-lowest), and if it were a review of the latest iPod nano, the judge might say “relevant” (it’s not perfect, but it sure is useful information about an Apple iPod).

The human judgment process itself is subjective, and different people will make different choices. You could argue that a review of the latest iPod nano is a “great result” — maybe you think it’s even better than Apple’s page on the topic. You could also argue that the definitive Apple page isn’t terribly useful in making a buying decision, and you might only rate it as relevant. A judge who knows everything about Apple’s products might make a different decision to someone who’s never owned a digital music player. You get the idea. In practice, judging decisions depend on training, experience, context, knowledge, and quality — it’s an art at best.

There are a few different ways to address subjectivity and get meaningful results. First, you can ask multiple judges to assess the same results to get an average score. Second, you can judge thousands of queries, so that you can compute metrics and be confident statistically that the numbers you see represent true differences in performance between algorithms. Last, you can train your judges carefully, and give them information about what you think relevance means.
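The pipeline Williams describes, averaging graded ratings from several judges and then computing metrics, can be sketched as below. nDCG is one common metric for graded relevance judgments; the post does not commit to a particular metric, and the ratings here are invented:

```python
# Sketch of the judgment pipeline: average per-judge graded ratings for
# each result, then compute a rank-sensitive metric (nDCG) over the
# averaged gains. Ratings are on the four-point scale described above.

from math import log2

def average_judgments(judgments):
    """judgments: list of per-judge rating lists -> mean rating per result."""
    return [sum(col) / len(col) for col in zip(*judgments)]

def dcg(gains):
    """Discounted cumulative gain for a ranked list of graded gains."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Three judges rate five ranked results (0 = irrelevant .. 3 = great).
judges = [
    [3, 2, 0, 1, 2],
    [3, 1, 1, 1, 2],
    [2, 2, 0, 0, 3],
]
mean_gains = average_judgments(judges)
print(round(ndcg(mean_gains), 3))
```

Repeating this over thousands of queries, as the post suggests, is what lets you say with statistical confidence that one algorithm outranks another.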

An illustrated walk through measuring search relevance. Useful for a basic understanding of the measurement process and its parameters.

Bookmark this post so that when you tell your judges what “…relevance means,” you can return here and post what you told them.

I ask because I deeply suspect that our ideas of “relevance” vary widely from subject to subject.


Twitter Mapping: Foundations

October 12th, 2014

Twitter Mapping: Foundations by Simon Rogers.

From the post:

With more than 500 million tweets sent every day, Twitter data as a whole can seem huge and unimaginable, like cramming the contents of the Library of Congress into your living room.

One way of trying to make that big data understandable is by making it smaller and easier to handle by giving it context; by putting it on a map.

It’s something I do a lot—I’ve published over 1,000 maps in the past five years, mostly at Guardian Data. At Twitter, with 77% of users outside the US, it’s often aimed at seeing if regional variations can give us a global picture, an insight into the way a story spreads around the globe. Here’s what I’ve learned about using Twitter data on maps.

… (lots of really cool maps and links omitted)

Creating data visualizations is simpler now than it’s ever been, with a plethora of tools (free and paid) meaning that any journalist working in any newsroom can make a chart or a map in a matter of minutes. Because of time constraints, we often use CartoDB to animate maps of tweets over time. The process is straightforward—I’ve written a how-to guide on my blog that shows how to create an animated map of dots using the basic interface, and if the data is not too big it won’t cost you anything. CartoDB is also handy for other reasons: as it has access to Twitter data, you can use it to get the geotagged tweets too. And it’s not the only one: Trendsmap is a great way to see location of conversations over time.

Have you made a map with Twitter Data that tells a compelling story? Share it with us via @TwitterData.

While composing this post I looked at the CartoDB solution for geotagged tweets and while impressive, it is currently in beta with a starting price of $300/month. That works if you get your expenses paid but is a bit pricey for occasional use.

There is a free option for CartoDB (up to 50 MB of data) but I don’t think it includes the Twitter capabilities.

Sample mapping tweets on your favorite issues. Maps are persuasive in ways that are not completely understood.

Making Your First Map

October 11th, 2014

Making Your First Map from Mapbox.

From the webpage:

Regardless of your skill level, we have the tools that allow you to quickly build maps and share them online in minutes.

In this guide, we’ll cover the basics of our online tool, the Mapbox Editor, by creating a store location map for a bike shop.

A great example of the sort of authoring interface that is needed by topic maps.

Hmmm, by the way, did you notice that “…creating a store location map for a bike shop” is creating an association between the “bike shop” and a “street location?” True, Mapbox doesn’t include roles or the association type but the role players are present.

For a topic map authoring interface, you could default the role of location for any geographic point on the map and the association type to be “street-location.”

The user would only have to pick, possibly from a pick list, the role of the role player, bike shop, bar, etc.
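That defaulting idea can be made concrete with a toy data structure. This is an illustrative sketch, not any particular topic map API; the shop name, role names, and coordinates are all invented:

```python
# Sketch of a "street-location" association as the post describes it:
# the geographic role is defaulted for any point on the map, and the
# user only picks the role of the other player (e.g. from a pick list).

def make_location_association(player, player_role, point):
    """Build an association, defaulting the geographic side."""
    return {
        "type": "street-location",
        "roles": [
            {"role": player_role, "player": player},  # chosen from a pick list
            {"role": "location", "player": point},    # defaulted for map points
        ],
    }

assoc = make_location_association("Example Bike Shop", "bike shop",
                                  (40.7128, -74.0060))
print(assoc["type"], assoc["roles"][0]["role"])
```

Every pin dropped in a Mapbox-style editor would then yield one such association, with only the first role left for the author to fill in.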

Mapbox could have started their guide with a review of map projections, used and theoretical.

Or covered the basics of surveying and a brief overview of surveying instruments. They didn’t.

I think there is a lesson there.

Microsoft’s Quantum Mechanics

October 11th, 2014

Microsoft’s Quantum Mechanics by Tom Simonite.

From the post:

In 2012, physicists in the Netherlands announced a discovery in particle physics that started chatter about a Nobel Prize. Inside a tiny rod of semiconductor crystal chilled cooler than outer space, they had caught the first glimpse of a strange particle called the Majorana fermion, finally confirming a prediction made in 1937. It was an advance seemingly unrelated to the challenges of selling office productivity software or competing with Amazon in cloud computing, but Craig Mundie, then heading Microsoft’s technology and research strategy, was delighted. The abstruse discovery—partly underwritten by Microsoft—was crucial to a project at the company aimed at making it possible to build immensely powerful computers that crunch data using quantum physics. “It was a pivotal moment,” says Mundie. “This research was guiding us toward a way of realizing one of these systems.”

Microsoft is now almost a decade into that project and has just begun to talk publicly about it. If it succeeds, the world could change dramatically. Since the physicist Richard Feynman first suggested the idea of a quantum computer in 1982, theorists have proved that such a machine could solve problems that would take the fastest conventional computers hundreds of millions of years or longer. Quantum computers might, for example, give researchers better tools to design novel medicines or super-efficient solar cells. They could revolutionize artificial intelligence.

Fairly upbeat review of current efforts to build a quantum computer.

You may want to offset it by reading Scott Aaronson’s blog, Shtetl-Optimized, which has the following header note:

If you take just one piece of information from this blog:
Quantum computers would not solve hard search problems
instantaneously by simply trying all the possible solutions at once. (emphasis added)

See in particular: Speaking Truth to Parallelism at Cornell

Whatever speedups are possible with quantum computers, getting a semantically incorrect answer faster isn’t an advantage.

Assumptions about faster computing platforms include an assumption of correct semantics, yet there have been no proofs that present-day or future computing platforms handle semantics correctly by default.

I first saw this in a tweet by Peter Lee.

PS: I saw the reference to Scott Aaronson’s blog in a comment to Tom’s post.