Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight

November 23rd, 2014

Data mining the way we use words is revealing the linguistic earthquakes that constantly change our language.

From the post:


In October 2012, Hurricane Sandy approached the eastern coast of the United States. At the same time, the English language was undergoing a small earthquake of its own. Just months before, the word “sandy” was an adjective meaning “covered in or consisting mostly of sand” or “having light yellowish brown colour”. Almost overnight, this word gained an additional meaning as a proper noun for one of the costliest storms in US history.

A similar change occurred to the word “mouse” in the early 1970s when it gained the new meaning of “computer input device”. In the 1980s, the word “apple” became a proper noun synonymous with the computer company. And later, the word “windows” followed a similar course after the release of the Microsoft operating system.

All this serves to show how language constantly evolves, often slowly but at other times almost overnight. Keeping track of these new senses and meanings has always been hard. But not anymore.

Today, Vivek Kulkarni at Stony Brook University in New York and a few pals show how they have tracked these linguistic changes by mining the corpus of words stored in databases such as Google Books, movie reviews from Amazon and of course the microblogging site Twitter.

These guys have developed three ways to spot changes in the language. The first is a simple count of how often words are used, using tools such as Google Trends. For example, in October 2012, the frequency of the words “Sandy” and “hurricane” both spiked in the run-up to the storm. However, only one of these words changed its meaning, something that a frequency count cannot spot.

A very good overview of:

Statistically Significant Detection of Linguistic Change by Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.


We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts.

We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.

We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book-ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
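The frequency-based approach can be sketched in miniature. This is a toy illustration, not the authors' method: their paper uses statistically sound change point detection algorithms, while this sketch simply scores each candidate split point by the shift in mean frequency and picks the largest.

```python
def mean_shift_scores(series):
    """For each split point t, score |mean(after) - mean(before)|."""
    scores = []
    for t in range(1, len(series)):
        before, after = series[:t], series[t:]
        scores.append(abs(sum(after) / len(after) - sum(before) / len(before)))
    return scores

def change_point(series):
    """Return the index of the first point after the largest mean shift."""
    scores = mean_shift_scores(series)
    return scores.index(max(scores)) + 1

# Hypothetical monthly counts for "sandy": usage spikes at index 5
freq = [3, 2, 3, 2, 3, 40, 42, 41]
print(change_point(freq))  # -> 5
```

As the post notes, a raw frequency spike like this cannot distinguish a word that merely became popular from one that changed meaning; that is what the distributional time series in the paper adds.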

While the authors are concerned with scaling, I would think detecting cracks, crevasses, and minor tremors in the meaning and usage of words, say between a bank and its regulators, or stock traders and the SEC, would be equally important.

Even if auto-detection of the “new” or “changed” meaning is too much to expect, simply detecting dissonance in the usage of terms would be a step in the right direction.

Detecting earthquakes in meaning is a worthy endeavor but there is more tripping on cracks than falling from earthquakes, linguistically speaking.

Show and Tell: A Neural Image Caption Generator

November 23rd, 2014

Show and Tell: A Neural Image Caption Generator by Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.


Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.
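For readers unfamiliar with the BLEU scores quoted above, a heavily simplified sketch conveys the idea. This is unigram precision with a brevity penalty only; real BLEU combines clipped n-gram precisions up to 4-grams, so treat this as an approximation, not the metric used in the paper.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Toy BLEU: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # Count candidate words that appear in the reference, clipped by
    # how often each word occurs in the reference.
    matches = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = matches / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(bleu1("a dog sits on the grass", "a dog is sitting on the grass"))
```

A perfect match scores 1.0; published BLEU scores like "59 out of a human 69" are conventionally reported on a 0-100 scale.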

Another caption generating program for images (see also Deep Visual-Semantic Alignments for Generating Image Descriptions). Not quite at the performance of a human observer, but quite respectable. The near misses are amusing enough that crowd correction could be an element in a full-blown system.

Perhaps “rough recognition” is close enough for some purposes, such as searching images for people who match a partial description and producing a much smaller set for additional processing.

I first saw this in Nat Torkington’s Four short links: 18 November 2014.

Compojure Address Book

November 22nd, 2014

Jarrod C. Taylor writes in part 1:


Clojure is a great language that is continuing to improve itself and expand its user base year over year. The Clojure ecosystem has many great libraries focused on being highly composable. This composability allows developers to easily build impressive applications from seemingly simple parts. Once you have a solid understanding of how Clojure libraries fit together, integration between them can become very intuitive. However, if you have not reached this level of understanding, knowing how all of the parts fit together can be daunting. Fear not, this series will walk you through start to finish, building a tested compojure web app backed by a Postgres Database.

Where We Are Going

The project we will build and test over the course of this blog series is an address book application. We will build the app using ring and Compojure and persist the data in a Postgres Database. The app will be a traditional client server app with no JavaScript. Here is a teaser of the final product.

Not that I need another address book but as an exercise in onboarding, this rocks!

Compojure Address Book Part 1 by Jarrod C. Taylor (see above).

Compojure Address Book Part 2

Recap and Restructure

So far we have modified the default Compojure template to include a basic POST route and used Midje and Ring-Mock to write a test to confirm that it works. Before we get started with templates and creating our address book we should provide some additional structure to our application in an effort to keep things organized as the project grows.

Compojure Address Book Part 3


In this installment of the address book series we are finally ready to start building the actual application. We have laid all of the ground work required to finally get to work.

Compojure Address Book Part 4

Persisting Data in Postgres

At this point we have an address book that will allow us to add new contacts. However, we are not persisting our new additions. It’s time to change that. You will need to have Postgres installed. If you are using a Mac, postgresapp is a very simple way of installing. If you are on another OS you will need to follow the install instructions from the Postgres website.

Once you have Postgres installed and running we are going to create a test user and two databases.

Compojure Address Book Part 5

The Finish Line

Our address book application has finally taken shape and we are in a position to put the finishing touches on it. All that remains is to allow the user the ability to edit and delete existing contacts.

One clever thing Jarrod has done is post all five (5) parts to this series on one day. You can go as fast or as slow as you choose to go.

Another clever thing is that testing is part of the development process.

How many programmers actually incorporate testing day to day? Given the prevalence of security bugs (to say nothing at all of other bugs), I would say less than one hundred percent (100%).


How much less than 100% I won’t hazard a guess.

Solr vs. Elasticsearch – Case by Case

November 22nd, 2014

Solr vs. Elasticsearch – Case by Case by Alexandre Rafalovitch.

From the description:

A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is coming later.

Just the highlights, and those from an admitted Elasticsearch user.

One very telling piece of advice for Solr:

Solr – needs to buckle down and focus on the onboarding experience

Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Just in case you don’t know the term: onboarding.

And SolrCluster podcast of October 24, 2014: Solr Usability with Steve Rowe & Tim Potter

From the description:

In this episode, Lucene/Solr Committers Steve Rowe and Tim Potter join the SolrCluster team to discuss how Lucidworks and the community are making changes and improvements to Solr to increase usability and add ease to the getting started experience. Steve and Tim discuss new features such as data-driven schema, start-up scripts, launching SolrCloud, and more. (length 33:29)


…focusing on the first five minutes of the Solr experience…hard to explore if you can’t get it started…can be a little bit scary at first…has lacked a focus on accessibility by ordinary users…need usability addressed throughout the lifecycle of the product…want to improve kicking the tires on Solr…lowering mental barriers for new users…do now have start scripts…bakes in a lot of best practices…scripts for SolrCloud…hide all the weird stuff…data driven schemas…throw data at Solr and it creates an index without creating a schema…working on improving tutorials and documentation…moving towards consolidating information…will include use cases…walk throughs…will point to different data sets…making it easier to query Solr and understand the query URLs…bringing full collections API support to the admin UI…Rest interface…components report possible configuration…plus a form to interact with it directly…forms that render in the browser…will have a continued focus on usability…not a one time push…new users need to submit any problems they encounter….

Great podcast!

Very encouraging on issues of documentation and accessibility in Solr.

Open-sourcing tools for Hadoop

November 22nd, 2014

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:


Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

  • Map and reduce task waterfalls and timing plots
  • Scalding and Cascading awareness
  • Error tracebacks for failed jobs


Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.


Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.


At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.

More open source tools for your Hadoop installation!

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop. ;-)

The structural virality of online diffusion

November 22nd, 2014

The structural virality of online diffusion by Sharad Goel, Ashton Anderson, Jake Hofman, and Duncan J. Watts.

Viral products and ideas are intuitively understood to grow through a person-to-person diffusion process analogous to the spread of an infectious disease; however, until recently it has been prohibitively difficult to directly observe purportedly viral events, and thus to rigorously quantify or characterize their structural properties. Here we propose a formal measure of what we label “structural virality” that interpolates between two conceptual extremes: content that gains its popularity through a single, large broadcast, and that which grows through multiple generations with any one individual directly responsible for only a fraction of the total adoption. We use this notion of structural virality to analyze a unique dataset of a billion diffusion events on Twitter, including the propagation of news stories, videos, images, and petitions. We find that across all domains and all sizes of events, online diffusion is characterized by surprising structural diversity. Popular events, that is, regularly grow via both broadcast and viral mechanisms, as well as essentially all conceivable combinations of the two. Correspondingly, we find that the correlation between the size of an event and its structural virality is surprisingly low, meaning that knowing how popular a piece of content is tells one little about how it spread. Finally, we attempt to replicate these findings with a model of contagion characterized by a low infection rate spreading on a scale-free network. We find that while several of our empirical findings are consistent with such a model, it does not replicate the observed diversity of structural virality.
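The measure itself can be illustrated with a toy computation. Structural virality is (roughly) the average shortest-path distance between all pairs of nodes in the diffusion tree, so a single broadcast (a star) scores low while a long person-to-person chain scores high. A sketch, assuming the tree is given as parent-child edges:

```python
from collections import deque, defaultdict

def structural_virality(edges):
    """Average shortest-path distance over all ordered pairs of nodes
    in a diffusion tree, computed by BFS from every node."""
    graph = defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)
        nodes.update((u, v))
    total, pairs = 0, 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nb in graph[cur]:
                if nb not in dist:
                    dist[nb] = dist[cur] + 1
                    queue.append(nb)
        total += sum(dist.values())
        pairs += len(nodes) - 1
    return total / pairs

# A pure broadcast: seed 0 reaches five adopters directly.
star = [(0, i) for i in range(1, 6)]
# A chain: each adopter infects the next.
chain = [(i, i + 1) for i in range(5)]
# The chain should score higher than the star.
print(structural_virality(star), structural_virality(chain))
```

The quadratic all-pairs BFS is fine for a toy tree; at the billion-event scale of the paper it would not be, which is part of why the sampling discussion below matters.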

Before you get too excited, the authors do not provide a how-to-go-viral manual.

In part because:

Large and potentially viral cascades are therefore necessarily very rare events; hence one must observe a correspondingly large number of events in order to find just one popular example, and many times that number to observe many such events. As we will describe later, in fact, even moderately popular events occur in our data at a rate of only about one in a thousand, while “viral hits” appear at a rate closer to one in a million. Consequently, in order to obtain a representative sample of a few hundred viral hits (arguably just large enough to estimate statistical patterns reliably) one requires an initial sample on the order of a billion events, an extraordinary data requirement that is difficult to satisfy even with contemporary data sources.

The authors clearly advance the state of research on “viral hits” and conclude with suggestions for future modeling work.

You can imagine the reaction of marketing departments should anyone get closer to designing successful viral advertising.

A good illustration that something we can observe, “viral hits,” in an environment where the spread can be tracked (Twitter), can still resist our best efforts to model and/or explain how to repeat the “viral hit” on command.

A good story to remember when a client claims that some action is transparent. It may well be, but that doesn’t mean there are enough instances to draw any useful conclusions.

I first saw this in a tweet by Steven Strogatz.

Clojure 1.3-1.6 Cheat Sheet (v13)

November 22nd, 2014

Clojure 1.3-1.6 Cheat Sheet (v13)

The Clojure CheatSheet has been updated to Clojure 1.6.

Available in PDF, the cheatsheet for an entire language only runs two (2) pages. Some sources would stretch that to at least twenty (20) or more and make it far less useful.

Do you know of a legend to the colors used in the PDF? I can see that Documentation, Zippers, Macros, Loading and Other are all some shade of green, but I’m not sure what binds them together. Pointers, anyone?

Thanks to the Clojure Community!

November 22nd, 2014

Thanks to the Clojure Community! by Alex Miller.

Today at the Clojure/conj, I gave thanks to many community members for their contributions. Any such list is inherently incomplete – I simply can’t capture everyone doing great work. If I missed someone important, please drop a comment and accept my apologies.

Alex has a list of people with GitHub, website and Twitter URLs.

I have extracted the Twitter URLs and created a Twitter handle followed by a Python comment marker and the users name for your convenience with Twitter feed scripts:

timbaldridge # Tim Baldridge

bbatsov # Bozhidar Batsov

fbellomi # Francesco Bellomi

ambrosebs # Ambrose Bonnaire-Sergeant

reborg # Renzo Borgatti

reiddraper # Reid Draper

colinfleming # Colin Fleming

deepbluelambda # Daniel Solano Gomez

nonrecursive # Daniel Higginbotham

bridgethillyer # Bridget Hillyer

heyzk # Zachary Kim

aphyr # Kyle Kingsbury

alanmalloy # Alan Malloy

gigasquid # Carin Meier

dmiller2718 # David Miller

bronsa_ # Nicola Mometto

ra # Ramsey Nasser

swannodette # David Nolen

ericnormand # Eric Normand

petitlaurent # Laurent Petit

tpope # Tim Pope

smashthepast # Ghadi Shayban

stuartsierra # Stuart Sierra

You will, of course, have to delete the blank lines, which I retained for ease of human reading. Any mistakes or errors in this listing are solely my responsibility.
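If it helps with those Twitter feed scripts, a minimal sketch of turning the list above into a handle-to-name mapping, dropping the blank lines along the way (the two entries in `raw` are just a sample of the full list):

```python
raw = """timbaldridge # Tim Baldridge

bbatsov # Bozhidar Batsov
"""

handles = {}
for line in raw.splitlines():
    line = line.strip()
    if not line:
        continue  # drop the blank lines retained for readability
    # Split on the Python comment marker: handle on the left, name on the right.
    handle, name = (part.strip() for part in line.split("#", 1))
    handles[handle] = name

print(handles)  # -> {'timbaldridge': 'Tim Baldridge', 'bbatsov': 'Bozhidar Batsov'}
```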


WebCorp Linguist’s Search Engine

November 22nd, 2014

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

  • Synchronic English Web Corpus: 470 million word corpus built from web-extracted texts, including a randomly selected ‘mini-web’ and high-level subject classification. About
  • Diachronic English Web Corpus: 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About
  • Birmingham Blog Corpus: 630 million word corpus built from blogging websites, including a 180 million word sub-section separated into posts and comments. About
  • Anglo-Norman Correspondence Corpus: a corpus of approximately 150 personal letters written by users of Anglo-Norman, including bespoke part-of-speech annotation. About
  • Novels of Charles Dickens: a searchable collection of the novels of Charles Dickens; results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss “subject” around on this blog, you would think it has only one meaning. Not so, as shown by the first twenty “hits” on “subject” in the Synchronic English Web Corpus:

1    Service agencies.  'Merit' is subject to various interpretations depending 
2		amount of oxygen a subject breathes in," he says, "
3		    to work on the subject again next month "to 
4	    of Durham degrees were subject to a religion test 
5	    London, which were not subject to any religious test, 
6	cited researchers in broad subject categories in life sciences, 
7    Losing Weight.  Broaching the subject of weight can be 
8    by survey respondents include subject and curriculum, assessment, pastoral, 
9       knowledge in teachers' own subject area, the use of 
10     each addressing a different subject and how citizenship and 
11	     and school staff, but subject to that they dismissed 
12	       expressed but it is subject to the qualifications set 
13	        last piece on this subject was widely criticised and 
14    saw themselves as foreigners subject to oppression by the 
15	 to suggest that, although subject to similar experiences, other 
16	       since you raise the subject, it's notable that very 
17	position of the privileged subject with their disorderly emotions 
18	 Jimmy may include radical subject matter in his scripts, 
19	   more than sufficient as subject matter and as an 
20	      the NATO script were subject to personal attacks from 
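Concordance lines like these can be approximated in a few lines of Python. This toy KWIC formatter centers the keyword with fixed-width context on either side; it ignores the real corpus tooling's tokenization, wildcards, and part-of-speech handling:

```python
def kwic(text, keyword, width=20):
    """Return keyword-in-context lines: keyword flanked by trimmed context."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Naive normalization: lowercase and strip trailing punctuation.
        if tok.lower().strip('.,') == keyword.lower():
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

sample = "Merit is subject to various interpretations and the subject was dropped"
for line in kwic(sample, "subject"):
    print(line)
```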

There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

Free for personal/academic work, commercial use requires a license.

I first saw this in a tweet by Claire Hardaker.

A modern guide to getting started with Data Science and Python

November 22nd, 2014

A modern guide to getting started with Data Science and Python by Thomas Wiecki.

From the post:

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What’s wrong with the many lists of PyData packages out there already you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you mastered these essentials you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most things a data scientist does in his day-to-day (i.e. data i/o, data munging, and data analysis).

A great “start small” post on Python.

Very appropriate considering that over sixty percent (60%) of software skill job postings mention Python (see Popular Software Skills in Data Science Job Postings). If you have a good set of basic tools, you can add specialized ones later.

Using Load CSV in the Real World

November 22nd, 2014

Using Load CSV in the Real World by Nicole White.

From the description:

In this live-coding session, Nicole will demonstrate the process of downloading a raw .csv file from the Internet and importing it into Neo4j. This will include cleaning the .csv file, visualizing a data model, and writing the Cypher query that will import the data. This presentation is meant to make Neo4j users aware of common obstacles when dealing with real-world data in .csv format, along with best practices when using LOAD CSV.

A webinar with substantive content and not marketing pitches! Unusual but it does happen.

A very good walk through importing a CSV file into Neo4j, with some modeling comments along the way and hints of best practices.

The “next” thing for users after a brief introduction to graphs and Neo4j.

The experience will build their confidence and they will learn from experience what works best for modeling their data sets.
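One routine step Nicole's session covers is cleaning the .csv file itself before Cypher's LOAD CSV ever sees it. A minimal stdlib sketch (the column names and values are invented for illustration) that strips stray whitespace and drops blank rows, both common causes of junk nodes on import:

```python
import csv
import io

# Stand-in for a downloaded raw file: padded headers, padded cells, a blank row.
raw = io.StringIO(
    "name , phone\n"
    " Alice ,555-0100\n"
    "\n"
    "Bob, 555-0199 \n"
)

cleaned = []
for row in csv.reader(raw):
    if not any(cell.strip() for cell in row):
        continue  # skip blank rows that would otherwise become empty nodes
    cleaned.append([cell.strip() for cell in row])

print(cleaned)
# -> [['name', 'phone'], ['Alice', '555-0100'], ['Bob', '555-0199']]
```

Writing the cleaned rows back out with `csv.writer` gives LOAD CSV a file whose headers map cleanly onto property names.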

Big data in minutes with the ELK Stack

November 21st, 2014

Big data in minutes with the ELK Stack by Philippe Creux.

From the post:

We’ve built a data analysis and dashboarding infrastructure for one of our clients over the past few weeks. They collect about 10 million data points a day. Yes, that’s big data.

My highest priority was to allow them to browse the data they collect so that they can ensure that the data points are consistent and contain all the attributes required to generate the reports and dashboards they need.

I chose to give the ELK stack a try: ElasticSearch, logstash and Kibana.
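To give a flavor of the "E" in that stack: data points typically reach Elasticsearch in batches through its _bulk API, whose body is newline-delimited JSON alternating action metadata and document source. A sketch that only builds the payload, with no network call; the index name and documents are made up, and in the post logstash would be doing this job:

```python
import json

def bulk_payload(index, docs):
    """Build the newline-delimited JSON body for Elasticsearch's _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action metadata
        lines.append(json.dumps(doc))                           # document source
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

events = [{"user": "alice", "action": "checkout"},
          {"user": "bob", "action": "login"}]
print(bulk_payload("events", events))
```

The returned string would be POSTed to the cluster's `/_bulk` endpoint; batching this way is what makes indexing 10 million data points a day practical.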

Is it just me or does processing “big data” seem to have gotten easier over the past several years?

But however easy or hard the processing, the value-add question is what do we know post data processing that we didn’t know before?

Building an AirPair Dashboard Using Apache Spark and Clojure

November 21st, 2014

Building an AirPair Dashboard Using Apache Spark and Clojure by Sébastien Arnaud.

From the post:

Have you ever wondered how much you should charge per hour for your skills as an expert on AirPair? How many opportunities are available on this platform? Which skills are in high demand? My name is Sébastien Arnaud and I am going to introduce you to the basics of how to gather, refine and analyze publicly available data using Apache Spark with Clojure, while attempting to generate basic insights about the AirPair platform.

1.1 Objectives

Here are some of the basic questions we will be trying to answer through this tutorial:

  • What is the lowest and the highest hourly rate?
  • What is the average hourly rate?
  • What is the most sought out skill?
  • What is the skill that pays the most on average?
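The tutorial answers these questions with Spark and Clojure via Flambo; as a plain-Python sketch of the same aggregates over hypothetical rate records:

```python
from collections import Counter, defaultdict

# Hypothetical records of the shape the tutorial harvests.
rates = [
    {"skill": "clojure", "rate": 150},
    {"skill": "python", "rate": 90},
    {"skill": "python", "rate": 110},
    {"skill": "spark", "rate": 140},
]

lowest = min(r["rate"] for r in rates)
highest = max(r["rate"] for r in rates)
average = sum(r["rate"] for r in rates) / len(rates)

# Most sought-after skill: the one appearing in the most records.
most_sought = Counter(r["skill"] for r in rates).most_common(1)[0][0]

# Best-paid skill: highest average rate per skill.
by_skill = defaultdict(list)
for r in rates:
    by_skill[r["skill"]].append(r["rate"])
best_paid = max(by_skill, key=lambda s: sum(by_skill[s]) / len(by_skill[s]))

print(lowest, highest, average, most_sought, best_paid)
# -> 90 150 122.5 python clojure
```

The point of reaching for Spark in the tutorial is, of course, that these aggregations stay expressible when the data no longer fits in one process.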

1.2 Chosen tools

In order to make this journey a lot more interesting, we are going to step out of the usual comfort zone of using a standard iPython notebook using pandas with matplotlib. Instead we are going to explore one of the hottest data technologies of the year: Apache Spark, along with one of the up and coming functional languages of the moment: Clojure.

As it turns out, these two technologies can work relatively well together thanks to Flambo, a library developed and maintained by YieldBot's engineers who have improved upon clj-spark (an earlier attempt from the Climate Corporation). Finally, in order to share the results, we will attempt to build a small dashboard using DucksBoard, which makes pushing data to a public dashboard easy.

In addition to illustrating the use of Apache Spark and Clojure, Sébastien also covers harvesting data from Twitter and processing it into a useful format.

Definitely worth some time over a weekend!

Deep Visual-Semantic Alignments for Generating Image Descriptions

November 21st, 2014

Deep Visual-Semantic Alignments for Generating Image Descriptions by Andrej Karpathy and Li Fei-Fei.

From the webpage:

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.

Excellent examples with generated text. Code and other predictions “coming soon.”

For the moment you can also read the research paper: Deep Visual-Semantic Alignments for Generating Image Descriptions

Serious potential in any event but even more so if the semantics of the descriptions could be captured and mapped across natural languages.

Land Matrix

November 21st, 2014

Land Matrix: The Online Public Database on Land Deals

From the webpage:

The Land Matrix is a global and independent land monitoring initiative that promotes transparency and accountability in decisions over land and investment.

This website is our Global Observatory – an open tool for collecting and visualising information about large-scale land acquisitions.

The data represented here is constantly evolving; to make this resource more accurate and comprehensive, we encourage your participation.

The deals collected as data must meet the following criteria:

  • Entail a transfer of rights to use, control or ownership of land through sale, lease or concession;
  • Have been initiated since the year 2000;
  • Cover an area of 200 hectares or more;
  • Imply the potential conversion of land from smallholder production, local community use or important ecosystem service provision to commercial use.

FYI, 200 hectares = 2 square kilometers.

Land ownership and its transfer are matters of law and law means government.

The project describes its data this way:

The dataset is inherently unreliable, but over time it is expected to become more accurate. Land deals are notoriously un-transparent. In many countries, established procedures for decision-making on land deals do not exist, and negotiations and decisions do not take place in the public realm. Furthermore, a range of government agencies and levels of government are usually responsible for approving different kinds of land deals. Even official data sources in the same country can therefore vary, and none may actually reflect reality on the ground. Decisions are often changed, and this may or may not be communicated publically.

I would start earlier than the year 2000 but the same techniques could be applied along the route of the Keystone XL pipeline. I am assuming that you are aware that pipelines, roads and other public works are not located purely for physical or aesthetic reasons. Yes?

Please take the time to view and support the Land Matrix project and consider similar projects in your community.

If the owners can be run to ground, you may find the parties to the transactions are linked by other “associations.”

October 2014 Crawl Archive Available

November 21st, 2014

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.
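Mechanically, turning the listed relative paths into full URLs is just string prefixing. A sketch, noting that the HTTP prefix below is a guess since the post elides it, while the S3 prefix is quoted directly:

```python
S3_PREFIX = "s3://aws-publicdatasets/"
# Hypothetical: the post truncates the HTTP prefix, so this host is assumed.
HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

relative_paths = [
    "common-crawl/crawl-data/CC-MAIN-2014-42/segment.paths.gz",
]

s3_paths = [S3_PREFIX + p for p in relative_paths]
http_paths = [HTTP_PREFIX + p for p in relative_paths]
print(s3_paths[0])
```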

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration! ;-)


CERN frees LHC data

November 21st, 2014

CERN frees LHC data

From the post:

Today CERN launched its Open Data Portal, which makes data from real collision events produced by LHC experiments available to the public for the first time.

“Data from the LHC program are among the most precious assets of the LHC experiments, that today we start sharing openly with the world,” says CERN Director General Rolf Heuer. “We hope these open data will support and inspire the global research community, including students and citizen scientists.”

The LHC collaborations will continue to release collision data over the coming years.

The first high-level and analyzable collision data openly released come from the CMS experiment and were originally collected in 2010 during the first LHC run. Open source software to read and analyze the data is also available, together with the corresponding documentation. The CMS collaboration is committed to releasing its data three years after collection, after they have been thoroughly studied by the collaboration.

“This is all new and we are curious to see how the data will be re-used,” says CMS data preservation coordinator Kati Lassila-Perini. “We’ve prepared tools and examples of different levels of complexity from simplified analysis to ready-to-use online applications. We hope these examples will stimulate the creativity of external users.”

In parallel, the CERN Open Data Portal gives access to additional event data sets from the ALICE, ATLAS, CMS and LHCb collaborations that have been prepared for educational purposes. These resources are accompanied by visualization tools.

All data on the portal are shared under a Creative Commons CC0 public domain dedication. Data and software are assigned unique DOI identifiers to make them citable in scientific articles. And software is released under open source licenses. The CERN Open Data Portal is built on the open-source Invenio Digital Library software, which powers other CERN Open Science tools and initiatives.

Awesome is the only term for this data release!

But, when you dig just a little bit further, you discover that embargoes still exist on three (3) out of four (4) experiments, both on data and software.

Disappointing but hopefully a dying practice when it comes to publicly funded data.

I first saw this in a tweet by Ben Evans.

Apache Hive on Apache Spark: The First Demo

November 21st, 2014

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up your list of technologies to master!

Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use

November 21st, 2014

Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use by Hayley Mullen.

From the post:

Grammar is one of those funny things that sparks a wide range of reactions from different people. While one person couldn’t care less about colons vs. semicolons, another person will have a visceral reaction to a misplaced apostrophe or a “there” where a “their” is needed (if you fall into the latter category, hello and welcome).

I think we can still all agree on one thing: poor grammar and spelling takes away from your message and credibility. In the worst case, a blog post rife with errors will cause you to think twice about how knowledgeable the person who wrote it really is. In lesser cases, a “then” where a “than” should be is just distracting and reflects poorly on your editing skills. Which is a bummer.

Beyond the ills mentioned by Hayley, poor writing is hard to understand. Using standards or creating topic maps is hard enough without having to decipher poor writing.

If you already write well, a refresher never hurts. If you don’t write so well, take Hayley’s post to heart and learn from it.

There are errors in standards that tend to occur over and over again. Perhaps I should write a cheat sheet for common standard writing errors. Possible entries: Avoiding Definite Article Abuse, Saying It Once Is Enough, etc.

Big bang of npm

November 21st, 2014

Big bang of npm

From the webpage:

npm is the largest package manager for javascript. This visualization gives you a small spaceship to explore the universe from inside. 106,317 stars (packages), 235,887 connections (dependencies).

Use WASD keys to move around. If you are browsing this with a modern smartphone – rotate your device around to control the camera (WebGL is required).

Navigation and other functions weren’t intuitive, at least not to me:

W – zooms in.

A – pans left.

S – zooms out.

D – pans right.

L – toggles links.

Choosing dependencies or dependents (lower left) filters the current view to show only dependencies or dependents of the chosen package.

Choosing a package name on the lower left takes you to the page for that package.

Search box at the top has a nice drop down of possible matches and displays dependencies or dependents by name, when selected below.
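The dependencies/dependents distinction above is just edge direction in the underlying graph. A toy sketch, with invented package names:

```python
# Packages are nodes; an edge A -> B means "A depends on B".
deps = {
    "pkg-a": ["pkg-b"],
    "pkg-b": ["pkg-c"],
    "pkg-c": [],
}

def dependencies(pkg):
    """Packages that pkg depends on directly (outgoing edges)."""
    return deps.get(pkg, [])

def dependents(pkg):
    """Packages that directly depend on pkg (incoming edges)."""
    return sorted(p for p, ds in deps.items() if pkg in ds)

print(dependencies("pkg-b"))  # ['pkg-c']
print(dependents("pkg-b"))    # ['pkg-a']
```

Filtering the view to one package's dependencies or dependents is just restricting the display to one of these two edge sets.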

I would prefer more clues on the main display but given the density of the graph, that would quickly render it unusable.

Perhaps a way to toggle package names when displaying only a portion of the graph?

Users would have to practice with it but this technique could be very useful for displaying dense graphs. Say a map of the known contributions by lobbyists to members of Congress for example. ;-)

I first saw this in a tweet by Lincoln Mullen.

Clojure/conj 2014 Videos! (Complete Set)

November 21st, 2014

Updated: As of November 23, 2014, 09:35 AM EST, the complete set of Clojure/conj 2014 videos has been posted to ClojureTV:

Presentations listing more than one author appear twice, once under each author’s last name.

Jeanine Adkisson – Variants are Not Unions

Bozhidar Batsov – The evolution of the Emacs tooling for Clojure

Lucas Cavalcanti & Edward Wible – Exploring four hidden superpowers of Datomic

Clojure/conj Washington, D.C. – Lightning talks

Paul deGrandis – Unlocking data-driven systems

Colin Fleming – Cursive: A different type of IDE

Julian Gamble – Applying the paradigms of core.async in ClojureScript

Brian Goetz – Stewardship: the Sobering Parts

Nathan Herzing & Chris Shea – Helping voters with Pedestal, Datomic, Om and core.async

Rich Hickey – Inside Transducers

Ashton Kemerling – Generative Integration Tests

Michał Marczyk – Persistent Data Structures for Special Occasions

Steve Miner – Generating Generators

Zach Oakes – Making Games at Runtime with Clojure

Anna Pawlicka – Om nom nom nom

David Pick – Building a Data Pipeline with Clojure and Kafka

Chris Shea & Nathan Herzing – Helping voters with Pedestal, Datomic, Om and core.async

Ghadi Shayban – JVM Creature Comforts

Zach Tellman – Always Be Composing

Glenn Vanderburg – Cló: The Algorithms of TeX in Clojure

Edward Wible & Lucas Cavalcanti – Exploring four hidden superpowers of Datomic

Steven Yi – Developing Music Systems on the JVM with Pink and Score

The set is now complete!


FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

November 20th, 2014

FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

From the post:

A legal battle between Yahoo and the government over the Protect America Act took place in 2008, but details (forced from the government’s Top Secret file folders by FISA Judge Reggie Walton) are only emerging now. A total of 1,500 pages will eventually make their way into the public domain once redactions have been applied. The most recent release is a transcript [pdf link] of oral arguments presented by Yahoo’s counsel (Mark Zwillinger) and the US Solicitor General (Gregory Garre).

Cutting to the chase:

But the most surprising assertions made in these oral arguments don’t come from the Solicitor General. They come from Judge Morris S. Arnold, who shows something nearing disdain for the privacy of the American public and their Fourth Amendment rights.

In the first few pages of the oral arguments, while discussing whether or not secret surveillance actually harms US citizens (or the companies forced to comply with government orders), Arnold pulls a complete Mike Rogers:

If this order is enforced and it’s secret, how can you be hurt? The people don’t know that — that they’re being monitored in some way. How can you be harmed by it? I mean, what’s –what’s the — what’s your — what’s the damage to your consumer?

By the same logic, all sorts of secret surveillance would be OK — like watching your neighbor’s wife undress through the window, or placing a hidden camera in the restroom — as long as the surveilled party is never made aware of it. If you don’t know it’s happening, then there’s nothing wrong with it. Right? [h/t to Alex Stamos]

In the next astounding quote, Arnold makes the case that the Fourth Amendment doesn’t stipulate the use of warrants for searches because it’s not written right up on top in bold caps… or something.

The whole thrust of the development of Fourth Amendment law has sort of emphasized the watchdog function of the judiciary. If you just look at the Fourth Amendment, there’s nothing in there that really says that a warrant is usually required. It doesn’t say that at all, and the warrant clause is at the bottom end of the Fourth Amendment, and — but that’s the way — that’s the way it has been interpreted.

What’s standing between US citizens and unconstitutional acts by their government is a very thin wall indeed.

Bear in mind that you are not harmed if you don’t know you are being spied upon.

I guess the new slogan is: Don’t Ask, Don’t Look, Don’t Worry.


UK seeks to shutter Russian site streaming video from webcams

November 20th, 2014

UK seeks to shutter Russian site streaming video from webcams by Barb Darrow.

From the post:

If you feel like someone’s watching you, you might be right.

A mega peeping Tom site out of Russia is collecting video and images from poorly secured webcams, closed-circuit TV cameras and even baby monitors around the world and is streaming the results. And now Christopher Graham, the U.K.’s information commissioner, wants to shut it down, according to this Guardian report.

According to the Guardian, Graham wants the Russian government to put the kibosh on the site and if that doesn’t happen will work with other regulators, including the U.S. Federal Trade Commission, to step in.

Earlier this month a NetworkWorld blogger wrote about a site, presumably the same one mentioned by Graham, with a Russian IP address that accesses some 73,000 unsecured security cameras.

The site has a pretty impressive inventory of images it said were gleaned from Foscam, Linksys, Panasonic security cameras, other unnamed “IP cameras” and AvTech and Hikvision DVRs, according to that post. The site was purportedly set up to illustrate the importance of updating default security passwords.

Apologies but it looks like the site is offline at the moment. Perhaps overload from visitors given the publicity.

An important reminder that security begins at home and with the most basic steps, such as changing default passwords.

Only if you access the site and find out that you have been spied upon will you suffer any harm.

I am completely serious: only if you discover you have been spied upon can you suffer any harm.

Authority for that statement? FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm.

Over 1,000 research data repositories indexed in

November 20th, 2014

Over 1,000 research data repositories indexed in

From the post:

In August 2012, the Registry of Research Data Repositories went online with 23 entries. Two years later the registry provides researchers, funding organisations, libraries and publishers with over 1,000 listed research data repositories from all over the world, making it the largest and most comprehensive online catalog of research data repositories on the web. The registry provides detailed information about the research data repositories, and its distinctive icons help researchers easily identify relevant repositories for accessing and depositing data sets.

To more than 5,000 unique visitors per month, the registry offers reliable orientation in the heterogeneous landscape of research data repositories. An average of 10 repositories are added to the registry every week. The latest indexed data infrastructure is the new CERN Open Data Portal.

Add this registry to your short list of major data repositories!

Senate Republicans are getting ready to declare war on patent trolls

November 20th, 2014

Senate Republicans are getting ready to declare war on patent trolls by Timothy B. Lee

From the post:

Republicans are about to take control of the US Senate. And when they do, one of the big items on their agenda will be the fight against patent trolls.

In a Wednesday speech on the Senate floor, Sen. Orrin Hatch (R-UT) outlined a proposal to stop abusive patent lawsuits. “Patent trolls – which are often shell companies that do not make or sell anything – are crippling innovation and growth across all sectors of our economy,” Hatch said.

Hatch, the longest-serving Republican in the US Senate, is far from the only Republican in Congress who is enthusiastic about patent reform. The incoming Republican chairmen of both the House and Senate Judiciary committees have signaled their support for patent legislation. And they largely see eye to eye with President Obama, who has also called for reform.

“We must improve the quality of patents issued by the U.S. Patent and Trademark Office,” Hatch said. “Low-quality patents are essential to a patent troll’s business model.” His speech was short on specifics here, but one approach he endorsed was better funding for the patent office. That, he argued, would allow “more and better-trained patent examiners, more complete libraries of prior art, and greater access to modern information technologies to address the agency’s growing needs.”

I would hate to agree with Senator Hatch on anything but there is no doubt that low-quality patents are rife at the U.S. Patent and Trademark Office. Whether patent trolls simply took advantage of low-quality patents or are responsible for them is hard to say.

In any event, the call for “…more complete libraries of prior art, and greater access to modern information technologies…” sounds like a business opportunity for topic maps.

After all, we all know that faster, more comprehensive search engines for the patent literature only give you more material to review. They don’t give you more relevant material to review, or material you did not know to look for. Only additional semantics has the power to accomplish either of those tasks.

There are those who will keep beating bags of words in hopes that semantics will appear.

Don’t be one of those. Choose an area of patents of interest and use interactive text mining to annotate existing terms with semantics (subject identity) which will reduce misses and increase the usefulness of “hits.”

That isn’t a recipe for mining all existing patents but who wants to do that? If you gain a large enough semantic advantage in genomics, semiconductors, etc., the start-up cost to catch up will be a tough nut to crack. Particularly since you are already selling a better product for a lower price than a start-up can match.

I first saw this in a tweet by Tim O’Reilly.

PS: A better solution for software patent trolls would be a Supreme Court ruling that eliminates all software patents. Then Congress could pass a software copyright bill that grants copyright status on published code for three (3) years, non-renewable. If that sounds harsh, consider the credibility impact of nineteen-year-old bugs.

If code had to be recast every three years and all vendors were on the same footing, there would be a commercial incentive for better software. Yes? If I had the coding advantages of a major vendor, I would start lobbying for three (3) year software copyrights tomorrow. Besides, it would make software piracy a lot easier to track.

Conflict of Interest – Reversing the Definition

November 20th, 2014

Just a quick heads up that the semantics of “conflict of interest” has changed, at least in the context of the US House of Representatives.

Traditionally, the meaning of “conflict of interest” is captured by Wikipedia’s one-liner:

A conflict of interest (COI) is a situation occurring when an individual or organization is involved in multiple interests, one of which could possibly corrupt the motivation.

That seems fairly straight forward.

However, in H.R.1422 — 113th Congress (2013-2014), passed on 11/18/2014, the House authorized paid representatives of industry interests to be appointed to the EPA Advisory Board, saying:

SEC. 2. SCIENCE ADVISORY BOARD.(b)(2)(C) – persons with substantial and relevant expertise are not excluded from the Board due to affiliation with or representation of entities that may have a potential interest in the Board’s advisory activities, so long as that interest is fully disclosed to the Administrator and the public and appointment to the Board complies with section 208 of title 18, United States Code;

So, the House of Representatives has just reversed the standard definition of “conflict of interest” to say that hired guns of industry players have no “conflict of interest” sitting on the EPA Science Board, so long as they say they are hired guns.

I thought I was fairly hardened to hearing bizarre things out of government but reversing the definition of “conflict of interest” is a new one on me.

The science board is supposed to be composed of scientists, unsurprisingly. Scientists, by the very nature of their profession, do science. Experiments, reports, projects, etc. And, no surprise, the scientists on the EPA science panel work on … that’s right, environmental science.

Care to guess who H.R.1422 prohibits from certain advisory activities?


Board members may not participate in advisory activities that directly or indirectly involve review or evaluation of their own work;

Scientists are excluded from advisory activities where they have expertise.

Being an expert in a field is a “conflict of interest” and being a hired gun is not (so long as being a hired gun is disclosed).

So the revision of “conflict of interest” is even worse than I thought.

I don’t have the heart to amend the Wikipedia article on conflict of interest. Would someone do that for me?

PS: I first saw this at House Republicans just passed a bill forbidding scientists from advising the EPA on their own research by Lindsay Abrams. Lindsay does a great job summarizing the issues and links to the legislation. I followed her link to the bill and reported just the legislative language. I think that is chilling enough.

PPS: Did your member of the House of Representatives vote for this bill?

I first saw this in a tweet by Wilson da Silva.

Video: Experts Share Perspectives on Web Annotation

November 20th, 2014

Video: Experts Share Perspectives on Web Annotation by Gary Price.

From the post:

The topic of web annotation continues to grow in interest and importance.

Here’s how the World Wide Web Consortium (W3C) describes the topic:

Web annotations are an attempt to recreate and extend that functionality as a new layer of interactivity and linking on top of the Web. It will allow anyone to annotate anything anywhere, be it a web page, an ebook, a video, an image, an audio stream, or data in raw or visualized form. Web annotations can be linked, shared between services, tracked back to their origins, searched and discovered, and stored wherever the author wishes; the vision is for a decentralized and open annotation infrastructure.

A Few Examples

In recent weeks and months, a W3C Web Annotation Working Group got underway; a company that has been working in this area for several years (and one we’ve mentioned several times on infoDOCKET) formally launched a web annotation extension for Chrome; the Mellon Foundation awarded $750,000 in research funding; and The Journal of Electronic Publishing began offering annotation for each article in the publication.

New Video

Today, a 15-minute video was posted (embedded below) in which several experts share some of their perspectives (Why the interest in the topic? Biggest challenges? Future plans?) on the topic of web annotation.

The video was recorded at the recent W3C TPAC 2014 Conference in Santa Clara, CA.

I am puzzled by more than one speaker on the video referring to the lack of robust addressing as a reason why annotation has not succeeded in the past. Perhaps they are unaware of the XLink and XPointer work at the W3C? Or HyTime for that matter?

True, none of those efforts were widely supported but that doesn’t mean that robust addressing wasn’t available.

I for one will be interested in comparing the capabilities of prior efforts against what is marketed as “new web annotation” capabilities.

Annotation, particularly what was known as “extended XLinks,” is very important for the annotation of materials to which you don’t have read/write access. Think about annotating texts distributed by a vendor on DVD. Or annotating texts that are delivered over a web stream. A separate third-party value-add product. Like a topic map for instance.

See videos from I Annotate 2014

Python Multi-armed Bandits (and Beer!)

November 20th, 2014

Python Multi-armed Bandits (and Beer!) by Eric Chiang.

From the post:

There are many ways to evaluate different strategies for solving different prediction tasks. In our last post, for example, we discussed calibration and discrimination, two measurements which assess the strength of a probabilistic prediction. Measurements like accuracy, error, and recall among others are useful when considering whether random forest “works better” than support vector machines on a problem set. Common sense tells us that knowing which analytical strategy “does the best” is important, as it will impact the quality of our decisions downstream. The trick, therefore, is in selecting the right measurement, a task which isn’t always obvious.

There are many prediction problems where choosing the right accuracy measurement is particularly difficult. For example, what’s the best way to know whether this version of your recommendation system is better than the prior version? You could – as was the case with the Netflix Prize – try to predict the number of stars a user gave to a movie and measure your accuracy. Another (simpler) way to vet your recommender strategy would be to roll it out to users and watch before and after behaviors.

So by the end of this blog post, you (the reader) will hopefully be helping me improve our beer recommender through your clicks and interactions.

The final application which this blog will explain can be found at The original post explaining beer recommenders can be found here.

I have a friend who programs in Python (as well as other languages) and who is, or is on the way to becoming, a professional beer taster.

Given a choice, I think I would prefer to become a professional beer drinker but each to their own. ;-)
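For readers who want the gist of the bandit mechanics before digging into the post, here is a minimal epsilon-greedy sketch. This is not the post's implementation; the beer click probabilities are invented for illustration:

```python
import random

def choose_arm(values, epsilon=0.1):
    """Explore a random arm with probability epsilon, else exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def update(counts, values, arm, reward):
    """Update the running mean reward for the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Three hypothetical beers; reward 1 means the user clicked the recommendation.
true_click_rates = [0.2, 0.5, 0.8]
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]
for _ in range(2000):
    arm = choose_arm(values)
    reward = 1 if random.random() < true_click_rates[arm] else 0
    update(counts, values, arm, reward)
# The estimates in `values` converge toward the true click rates,
# and the best beer ends up recommended most often.
```

This is the "watch before and after behaviors" idea made continuous: every click (or non-click) updates which recommendation gets shown next.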

The discussion of measures of distances between beers in this post is quite good. When reading it, think about beers (or other beverages) you have had and try to pick between Euclidean distance, distance correlation, and cosine similarity in describing how you compare those beverages to each other.

What? That isn’t how you evaluate your choices between beverages?

Yet, those “measures” have proven to be effective (effective != 100%) at providing distances between individual choices.
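To make the contrast between those measures concrete, here are two of them over a pair of hypothetical rating vectors. The numbers are invented; the point is that cosine similarity ignores overall magnitude while Euclidean distance does not:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rating vectors (0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

u1 = [5, 3, 1]   # user 1's ratings of three beers
u2 = [10, 6, 2]  # user 2 rates twice as generously, same relative preferences
print(euclidean(u1, u2))          # nonzero: magnitudes differ
print(cosine_similarity(u1, u2))  # 1.0: identical direction of preference
```

Which behavior you want depends on whether "rates everything higher" should count as a different taste profile or the same one.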

The “mapping” between the unknown internal scale of users and the metric measures used in recommendation systems is derived from a population of users. The resulting scale may or may not be an exact fit for any user in the tested group.

The usefulness of any such scale depends on the similarity of the population over which it was derived and the population where you want to use it. Not to mention how you validated the answers. (Users are reported to give the “expected” response as opposed to their actual choices in some scenarios.)

Geospatial Data in Python

November 20th, 2014

Geospatial Data in Python by Carson Farmer.

Materials for the tutorial: Geospatial Data in Python: Database, Desktop, and the Web by Carson Farmer (Associate Director of CARSI lab).

Important skills if you are concerned about projects such as the Keystone XL Pipeline:

keystone pipeline route

This is an instance where having the skills to combine geospatial, archaeological, and other data together will empower local communities to minimize the damage they will suffer from this project.

Having a background in processing geophysical data is the first step in that process.
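As a tiny taste of that kind of processing, here is a minimal great-circle distance sketch: the sort of primitive you would use to measure how close a community or archaeological site sits to a proposed route. The coordinates below are purely illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Illustrative only: a hypothetical site and a hypothetical route point.
print(round(haversine_km(41.0, -98.0, 41.1, -98.2), 1))
```

Real work would use the projection-aware libraries covered in the tutorial rather than hand-rolled trigonometry, but the underlying question ("how far is X from Y?") is the same.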

Indexed Database API Proposed Recommendation Published (Review Opportunity)

November 20th, 2014

Indexed Database API Proposed Recommendation Published

From the post:

The Web Applications Working Group has published a Proposed Recommendation of Indexed Database API. This document defines APIs for a database of records holding simple values and hierarchical objects. Each record consists of a key and some value. Moreover, the database maintains indexes over records it stores. An application developer directly uses an API to locate records either by their key or by using an index. A query language can be layered on this API. An indexed database can be implemented using a persistent B-tree data structure. Comments are welcome through 18 December. Learn more about the Rich Web Client Activity.

If you have the time between now and 18 December, this is a great opportunity to “get your feet wet” reviewing W3C recommendations.

The Indexed Database API document isn’t long (43 pages approximately) and you are no doubt already familiar with databases in general.

An example to get you started:

3.2 APIs

The API methods return without blocking the calling thread. All asynchronous operations immediately return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

  1. When you read:

    The API methods return without blocking the calling thread.

    How do you answer the question: What is being returned by an API method?

  2. What is your answer after reading the next sentence?

    All asynchronous operations immediately return an IDBRequest instance.

  3. Does this work?

    API methods return an IDBRequest instance without blocking the calling thread.

    (Also correcting the unnecessary definite article “The.”)

  4. One more issue with the first sentence is:

    …without blocking the calling thread.

    If you search the document, there is no other mention of calling threads.

    I suspect this is unnecessary and would ask for its removal.

    So, revised the first sentence would read:

    API methods return an IDBRequest instance.

  5. Maybe, except that the second sentence says “All asynchronous operations….”

    When you see a statement of “All …. operations…,” you should look for a definition of those operations.

    I have looked and while “asynchronous” is used thirty-four (34) times, “asynchronous operations” is used only once.

    (more comments on “asynchronous” below)

  6. I am guessing you caught the “immediately” problem on your own. It is undefined, and what other response would there be?

    If we don’t need “asynchronous” and the first sentence is incomplete, is this a good suggestion for 3.2 APIs?

    (Proposed Edit)

    3.2 APIs

    API methods return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

    There, your first edit to a W3C Recommendation removed twelve (12) words and made the text clearer.

    Plus there are the other thirty-three (33) instances of “asynchronous” to investigate.

    Not bad for your first effort!

    After looking around the rest of the proposed recommendation, I suspect that “asynchronous” is used to mean that results and requests can arrive and be processed in any order (except for some cases of overlapping scope of operations). It’s simpler just to say that once and not go about talking about “asynchronous requests,” “opening databases asynchronously,” etc.