## The Debunking Handbook

November 23rd, 2014

The Debunking Handbook by John Cook, Stephan Lewandowsky.

From the post:

The Debunking Handbook, a guide to debunking misinformation, is now freely available to download. Although there is a great deal of psychological research on misinformation, there’s no summary of the literature that offers practical guidelines on the most effective ways of reducing the influence of myths. The Debunking Handbook boils the research down into a short, simple summary, intended as a guide for communicators in all areas (not just climate) who encounter misinformation.

The Handbook explores the surprising fact that debunking myths can sometimes reinforce the myth in peoples’ minds. Communicators need to be aware of the various backfire effects and how to avoid them, such as:

It also looks at a key element to successful debunking: providing an alternative explanation. The Handbook is designed to be useful to all communicators who have to deal with misinformation (eg – not just climate myths).

I think you will find this a delightful read! From the first section, titled: Debunking the first myth about debunking,

It’s self-evident that democratic societies should base their decisions on accurate information. On many issues, however, misinformation can become entrenched in parts of the community, particularly when vested interests are involved.1,2 Reducing the influence of misinformation is a difficult and complex challenge.

A common misconception about myths is the notion that removing its influence is as simple as packing more information into people’s heads. This approach assumes that public misperceptions are due to a lack of knowledge and that the solution is more information – in science communication, it’s known as the “information deficit model”. But that model is wrong: people don’t process information as simply as a hard drive downloading data.

Refuting misinformation involves dealing with complex cognitive processes. To successfully impart knowledge, communicators need to understand how people process information, how they modify
their existing knowledge and how worldviews affect their ability to think rationally. It’s not just what people think that matters, but how they think.

I would have accepted the first sentence had it read: It’s self-evident that democratic societies don’t base their decisions on accurate information.

I don’t know of any historical examples of democracies making decisions on accurate information.

For example, there are any number of “rational” and well-meaning people who have signed off on the “war on terrorism” as though the United States is in any danger.

Deaths from terrorism in the United States since 2001 – fourteen (14).

Deaths by entanglement in bed sheets between 2001-2009 – five thousand five hundred and sixty-one (5561).

Despite being a great read, Debunking has a problem, it presumes you are dealing with a “rational” person. Rational as defined by…, as defined by what? Hard to say. It is only mentioned once and I suspect “rational” means that you agree with debunking the climate “myth.” I do as well but that’s happenstance and not because I am “rational” in some undefined way.

Realize that “rational” is a favorable label people apply to themselves and little more than that. It rather conveniently makes anyone who disagrees with you “irrational.”

I prefer to use “persuasion” on topics like global warming. You can use “facts” for people who are amenable to that approach but also religion (stewarts of the environment), greed (exploitation of the Third World for carbon credits), financial interest in government funded programs, or whatever works to persuade enough people to support your climate change program. Being aware that other people with other agendas are going to be playing the same game. The question is whether you want to be “rational” or do you want to win?

Personally I am convinced of climate change and our role in causing it. I am also aware of the difficulty of sustaining action by people with an average attention span of fifteen (15) seconds over the period of the fifty (50) years it will take for the environment to stabilize if all human inputs stopped tomorrow. It’s going to take far more than “facts” to obtain a better result.

## Pride & Prejudice & Word Embedding Distance

November 23rd, 2014

Pride & Prejudice & Word Embedding Distance by Lynn Cherny.

From the webpage:

An experiment: Train a word2vec model on Jane Austen’s books, then replace the nouns in P&P with the nearest word in that model. The graph shows a 2D t-SNE distance plot of the nouns in this book, original and replacement. Mouse over the blue words!

In her blog post, Visualizing Word Embeddings in Pride and Prejudice, Lynn explain more about the project and the process she followed.

From that post:

Overall, the project as launched consists of the text of Pride and Prejudice, with the nouns replaced by the most similar word in a model trained on all of Jane Austen’s books’ text. The resulting text is pretty nonsensical. The blue words are the replaced words, shaded by how close a “match” they are to the original word; if you mouse over them, you see a little tooltip telling you the original word and the score.

I don’t agree that: “The resulting test is pretty nonsensical.”

True, it’s not Jane Austin’s original text and it is challenging to read, but that may be because our assumptions about Pride and Prejudice and literature in general are being defeated by the similar word replacements.

The lack of familiarity and smoothness of a received text may (no guarantees) enable us to see the text differently than we would on a casual re-reading.

What novel corpus would you use for such an experiment?

## …ambiguous phrases in research papers…

November 23rd, 2014

When scientists use ambiguous phrases in research papers… And what they might actually mean

This graphic was posted to Twitter by Jan Lentzos.

This sort of thing makes the rounds every now and again. From the number of retweets of Jan’s post, it never fails to amuse.

Enjoy!

## Visual Classification Simplified

November 23rd, 2014

Visual Classification Simplified

From the post:

Virtually all information governance initiatives depend on being able to accurately and consistently classify the electronic files and scanned documents being managed. Visual classification is the only technology that classifies both types of documents regardless of the amount or quality of text associated with them.

From the user perspective, visual classification is extremely easy to understand and work with. Once documents are collected, visual classification clusters or groups documents based on their appearance. This normalizes documents regardless of the types of files holding the content. The Word document that was saved to PDF will be grouped with that PDF and with the TIF that was made from scanning a paper copy of either document.

The clustering is automatic, there are no rules to write up front, no exemplars to select, no seed sets to try to tune. This is what a collection of documents might look like before visual classification is applied – no order and no way to classify the documents:

When the initial results of visual classification are presented to the client, the clusters are arranged according to the number of documents in each cluster. Reviewing the first clusters impacts the most documents. Based on reviewing one or two documents per cluster, the reviewer is able to determine (a) should the documents in the cluster be retained, and (b) if they should be retained, what document-type label to associate with the cluster.

By easily eliminating clusters that have no business or regulatory value, content collections can be dramatically reduced. Clusters that remain can have granular retention policies applied, be kept under appropriate access restrictions, and can be assigned business unit owners. Plus of course, the document-type labels can greatly assist users trying to find specific documents. (emphasis in original)

I suspect that BeyondRecognition, the host of this post, really means classification at the document level. A granularity that has plagued information retrieval for decades. Better than no retrieval at all but only just.

However, the graphics of visualization were just too good to pass up! Imagine that you are selecting merging criteria for a set of topics that represent subjects at a far lower granularity than document level.

With the results of those selections being returned to you as part of an interactive process.

If most topic map authoring is for aggregation, that is you author so that topics will merge, this would be aggregation by selection.

Hard to say for sure but I suspect that aggregation (merging) by selection would be far easier than authoring for aggregation.

Suggestions on how to test that premise?

## Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight

November 23rd, 2014

Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight Data mining the way we use words is revealing the linguistic earthquakes that constantly change our language.

From the post:

In October 2012, Hurricane Sandy approached the eastern coast of the United States. At the same time, the English language was undergoing a small earthquake of its own. Just months before, the word “sandy” was an adjective meaning “covered in or consisting mostly of sand” or “having light yellowish brown colour”. Almost overnight, this word gained an additional meaning as a proper noun for one of the costliest storms in US history.

A similar change occurred to the word “mouse” in the early 1970s when it gained the new meaning of “computer input device”. In the 1980s, the word “apple” became a proper noun synonymous with the computer company. And later, the word “windows” followed a similar course after the release of the Microsoft operating system.

All this serves to show how language constantly evolves, often slowly but at other times almost overnight. Keeping track of these new senses and meanings has always been hard. But not anymore.

Today, Vivek Kulkarni at Stony Brook University in New York and a few pals show how they have tracked these linguistic changes by mining the corpus of words stored in databases such as Google Books, movie reviews from Amazon and of course the microblogging site Twitter.

These guys have developed three ways to spot changes in the language. The first is a simple count of how often words are used, using tools such as Google Trends. For example, in October 2012, the frequency of the words “Sandy” and “hurricane” both spiked in the run-up to the storm. However, only one of these words changed its meaning, something that a frequency count cannot spot.

A very good overview of:

Statistically Significant Detection of Linguistic Change by Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena.

Abstract:

We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts.

We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track it’s linguistic displacement over time.

We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book-ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.

While the authors are concerned with scaling, I would think detecting cracks, crevasses, and minor tremors in the meaning and usage of words, say between a bank and its regulators, or stock traders and the SEC, would be equally important.

Even if auto-detection of the “new” or “changed” meaning is too much to expect, simply detecting dissonance in the usage of terms would be a step in the right direction.

Detecting earthquakes in meaning is a worthy endeavor but there is more tripping on cracks than falling from earthquakes, linguistically speaking.

## Show and Tell: A Neural Image Caption Generator

November 23rd, 2014

Show and Tell: A Neural Image Caption Generator by Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan.

Abstract:

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.

Another caption generating program for images. (see also, Deep Visual-Semantic Alignments for Generating Image Descriptions) Not quite to the performance of a human observer but quite respectable. The near misses are amusing enough for crowd correction to be an element in a full blown system.

Perhaps “rough recognition” is close enough for some purposes. Searching images for people who match a partial description and producing a much smaller set for additional processing.

I first saw this in Nat Torkington’s Four short links: 18 November 2014.

November 22nd, 2014

Jarrod C. Taylor writes in part 1:

Introduction

Clojure is a great language that is continuing to improve itself and expand its user base year over year. The Clojure ecosystem has many great libraries focused on being highly composable. This composability allows developers to easily build impressive applications from seemingly simple parts. Once you have a solid understanding of how Clojure libraries fit together, integration between them can become very intuitive. However, if you have not reached this level of understanding, knowing how all of the parts fit together can be daunting. Fear not, this series will walk you through start to finish, building a tested compojure web app backed by a Postgres Database.

Where We Are Going

The project we will build and test over the course of this blog series is an address book application. We will build the app using ring and Compojure and persist the data in a Postgres Database. The app will be a traditional client server app with no JavaScript. Here is a teaser of the final product.

Not that I need another address book but as an exercise in onboarding, this rocks!

(see above)

Recap and Restructure

So far we have modified the default Compojure template to include a basic POST route and used Midje and Ring-Mock to write a test to confirm that it works. Before we get started with templates and creating our address book we should provide some additional structure to our application in an effort to keep things organized as the project grows.

Introduction

In this installment of the address book series we are finally ready to start building the actual application. We have laid all of the ground work required to finally get to work.

Persisting Data in Postgres

At this point we have an address book that will allow us to add new contacts. However, we are not persisting our new additions. It’s time to change that. You will need to have Postgres installed. If you are using a Mac, postgresapp is a very simple way of installing. If you are on another OS you will need to follow the install instructions from the Postgres website.

Once you have Postgres installed and running we are going to create a test user and two databases.

The Finish Line

Our address book application has finally taken shape and we are in a position to put the finishing touches on it. All that remains is to allow the user the ability to edit and delete existing contacts.

One clever thing Jarrod has done is post all five (5) parts to this series on one day. You can go as fast or as slow as you choose to go.

Another clever thing is that testing is part of the development process.

How many programmers actually incorporate testing day to day? Given the prevalence of security bugs (to say nothing at all of other bugs), I would say less than one hundred percent (100%).

You?

How much less than 100% I won’t hazard a guess.

## Solr vs. Elasticsearch – Case by Case

November 22nd, 2014

Solr vs. Elasticsearch – Case by Case by Alexandre Rafalovitch.

From the description:

A presentation given at the Lucene/Solr Revolution 2014 conference to show Solr and Elasticsearch features side by side. The presentation time was only 30 minutes, so only the core usability features were compared. The full video is coming later.

Just the highlights and those from an admitted ElasticSearch user.

One very telling piece of advice for Solr:

Solr – needs to buckle down and focus on the onboarding experience

Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Just in case you don’t know the term: onboarding.

And SolrCluster podcast of October 24, 2014: Solr Usability with Steve Rowe & Tim Potter

From the description:

In this episode, Lucene/Solr Committers Steve Rowe and Tim Potter join the SolrCluster team to discuss how Lucidworks and the community are making changes and improvements to Solr to increase usability and add ease to the getting started experience. Steve and Tim discuss new features such as data-driven schema, start-up scripts, launching SolrCloud, and more. (length 33:29)

Paraphrasing:

…focusing on the first five minutes of the Solr experience…hard to explore if you can’t get it started…can be a little bit scary at first…has lacked a focus on accessibility by ordinary users…need usability addressed throughout the lifecycle of the product…want to improve kicking the tires on Solr…lowering mental barriers for new users…do now have start scripts…bakes in a lot of best practices…scripts for SolrCloud…hide all the weird stuff…data driven schemas…throw data at Solr and it creates an index without creating a schema…working on improving tutorials and documentation…moving towards consolidating information…will include use cases…walk throughs…will point to different data sets…making it easier to query Solr and understand the query URLs…bringing full collections API support to the admin UI…Rest interface…components report possible configuration…plus a form to interact with it directly…forms that render in the browser…will have a continued focus on usability…not a one time push…new users need to submit any problems they encounter….

Great podcast!

Very encouraging on issues of documentation and accessibility in Solr.

November 22nd, 2014

Open-sourcing tools for Hadoop by Colin Marc.

From the post:

Stripe’s batch data infrastructure is built largely on top of Apache Hadoop. We use these systems for everything from fraud modeling to business analytics, and we’re open-sourcing a few pieces today:

Timberlake

Timberlake is a dashboard that gives you insight into the Hadoop jobs running on your cluster. Jeff built it as a replacement for YARN’s ResourceManager and MRv2’s JobHistory server, and it has some features we’ve found useful:

• Map and reduce task waterfalls and timing plots
• Error tracebacks for failed jobs

Brushfire

Avi wrote a Scala framework for distributed learning of ensemble decision tree models called Brushfire. It’s inspired by Google’s PLANET, but built on Hadoop and Scalding. Designed to be highly generic, Brushfire can build and validate random forests and similar models from very large amounts of training data.

Sequins

Sequins is a static database for serving data in Hadoop’s SequenceFile format. I wrote it to provide low-latency access to key/value aggregates generated by Hadoop. For example, we use it to give our API access to historical fraud modeling features, without adding an online dependency on HDFS.

Herringbone

At Stripe, we use Parquet extensively, especially in tandem with Cloudera Impala. Danielle, Jeff, and Avi wrote Herringbone (a collection of small command-line utilities) to make working with Parquet and Impala easier.

I am considering creating a list of closed source tools for Hadoop. It would be shorter and easier to maintain than a list of open source tools for Hadoop.

## The structural virality of online diffusion

November 22nd, 2014

The structural virality of online di ffusion by Sharad Goel, Ashton Anderson, Jake Hofman, and Duncan J. Watts.

Viral products and ideas are intuitively understood to grow through a person-to-person di ffusion process analogous to the spread of an infectious disease; however, until recently it has been prohibitively difficult to directly observe purportedly viral events, and thus to rigorously quantify or characterize their structural properties. Here we propose a formal measure of what we label “structural virality” that interpolates between two conceptual extremes: content that gains its popularity through a single, large broadcast, and that which grows through multiple generations with any one individual directly responsible for only a fraction of the total adoption. We use this notion of structural virality to analyze a unique dataset of a billion di ffusion events on Twitter, including the propagation of news stories, videos, images, and petitions. We find that across all domains and all sizes of events, online di ffusion is characterized by surprising structural diversity. Popular events, that is, regularly grow via both broadcast and viral mechanisms, as well as essentially all conceivable combinations of the two. Correspondingly, we find that the correlation between the size of an event and its structural virality is surprisingly low, meaning that knowing how popular a piece of content is tells one little about how it spread. Finally, we attempt to replicate these fi ndings with a model of contagion characterized by a low infection rate spreading on a scale-free network. We fi nd that while several of our empirical fi ndings are consistent with such a model, it does not replicate the observed diversity of structural virality.

Before you get too excited, the authors do not provide a how-to-go-viral manual.

In part because:

Large and potentially viral cascades are therefore necessarily very rare events; hence one must observe a correspondingly large number of events in order to fi nd just one popular example, and many times that number to observe many such events. As we will describe later, in fact, even moderately popular events occur in our data at a rate of only about one in a thousand, while “viral hits” appear at a rate closer to one in a million. Consequently, in order to obtain a representative sample of a few hundred viral hits arguably just large enough to estimate statistical patterns reliably one requires an initial sample on the order of a billion events, an extraordinary data requirement that is difficult to satisfy even with contemporary data sources.

The authors clearly advance the state of research on “viral hits” and conclude with suggestions for future modeling work.

You can imagine the reaction of marketing departments should anyone get closer to designing successful viral advertising.

A good illustration that something we can observe, “viral hits,” in an environment where the spread can be tracked (Twitter), can still resist our best efforts to model and/or explain how to repeat the “viral hit” on command.

A good story to remember when a client claims that some action is transparent. It may well be, but that doesn’t mean there are enough instances to draw any useful conclusions.

I first saw this in a tweet by Steven Strogatz.

## Clojure 1.3-1.6 Cheat Sheet (v13)

November 22nd, 2014

Clojure 1.3-1.6 Cheat Sheet (v13)

The Clojure CheatSheet has been updated to Clojure 1.6.

Available in PDF, the cheatsheet for an entire language only runs two (2) pages. Some sources would stretch that to at least twenty (20) or more and make it far less useful.

Do you know of a legend to the colors used in the PDF? I can see that Documentation, Zippers, Macros, Loading and Other are all some shade of green but I’m not sure what binds them together? Pointers anyone?

## Thanks to the Clojure Community!

November 22nd, 2014

Thanks to the Clojure Community! by Alex Miller.

Today at the Clojure/conj, I gave thanks to many community members for their contributions. Any such list is inherently incomplete – I simply can’t capture everyone doing great work. If I missed someone important, please drop a comment and accept my apologies.

Alex has a list of people with GitHub, website and Twitter URLs.

I have extracted the Twitter URLs and created a Twitter handle followed by a Python comment marker and the users name for your convenience with Twitter feed scripts:

timbaldridge # Tim Baldridge

bbatsov # Bozhidar Batsov

fbellomi # Francesco Bellomi

ambrosebs # Ambrose Bonnaire-Sergeant

reborg # Renzo Borgatti

reiddraper # Reid Draper

colinfleming # Colin Fleming

deepbluelambda # Daniel Solano Gomez

nonrecursive # Daniel Higginbotham

bridgethillyer # Bridget Hillyer

heyzk # Zachary Kim

aphyr # Kyle Kingsbury

alanmalloy # Alan Malloy

gigasquid # Carin Meier

dmiller2718 # David Miller

bronsa_ # Nicola Mometto

ra # Ramsey Nasser

swannodette # David Nolen

ericnormand # Eric Normand

petitlaurent # Laurent Petit

tpope # Tim Pope

stuartsierra # Stuart Sierra


You will, of course, have to delete the blank lines with I retained for ease of human reading. Any mistakes or errors in this listing are solely my responsibility.

Enjoy!

## WebCorp Linguist’s Search Engine

November 22nd, 2014

WebCorp Linguist’s Search Engine

From the homepage:

The WebCorp Linguist’s Search Engine is a tool for the study of language on the web. The corpora below were built by crawling the web and extracting textual content from web pages. Searches can be performed to find words or phrases, including pattern matching, wildcards and part-of-speech. Results are given as concordance lines in KWIC format. Post-search analyses are possible including time series, collocation tables, sorting and summaries of meta-data from the matched web pages.

Synchronic English Web Corpus 470 million word corpus built from web-extracted texts. Including a randomly selected ‘mini-web’ and high-level subject classification. About

Diachronic English Web Corpus 130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month. About

Birmingham Blog Corpus 630 million word corpus built from blogging websites. Including a 180 million word sub-section separated into posts and comments. About

Anglo-Norman Correspondence Corpus A corpus of approximately 150 personal letters written by users of Anglo-Norman. Including bespoke part-of-speech annotation. About

Novels of Charles Dickens A searchable collection of the novels of Charles Dickens. Results can be visualised across chapters and novels. About

You have to register to use the service but registration is free.

The way I toss subject around on this blog you would think it has only one meaning. Not so as shown by the first twenty “hits” on subject in the Synchronic English Web Corpus:

1    Service agencies.  'Merit' is subject to various interpretations depending
2		amount of oxygen a subject breathes in," he says, "
3		    to work on the subject again next month "to
4	    of Durham degrees were subject to a religion test
5	    London, which were not subject to any religious test,
6	cited researchers in broad subject categories in life sciences,
7    Losing Weight.  Broaching the subject of weight can be
8    by survey respondents include subject and curriculum, assessment, pastoral,
9       knowledge in teachers' own subject area, the use of
10     each addressing a different subject and how citizenship and
11	     and school staff, but subject to that they dismissed
12	       expressed but it is subject to the qualifications set
13	        last piece on this subject was widely criticised and
14    saw themselves as foreigners subject to oppression by the
15	 to suggest that, although subject to similar experiences, other
16	       since you raise the subject, it's notable that very
17	position of the privileged subject with their disorderly emotions
18	 Jimmy may include radical subject matter in his scripts,
19	   more than sufficient as subject matter and as an
20	      the NATO script were subject to personal attacks from


There are a host of options for using the corpus and exporting the results. See the Users Guide for full details.

A great tool not only for linguists but anyone who wants to explore English as a language with professional grade tools.

If you re-read Dickens with concordance in hand, please let me know how it goes. That has the potential to be a very interesting experience.

I first saw this in a tweet by Claire Hardaker

## A modern guide to getting started with Data Science and Python

November 22nd, 2014

A modern guide to getting started with Data Science and Python by Thomas Wiecki.

From the post:

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What’s wrong with the many lists of PyData packages out there already you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you mastered these essentials you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most things a data scientist does in his day-to-day (i.e. data i/o, data munging, and data analysis).

A great “start small” post on Python.

Very appropriate considering that over sixty percent (60%) of software skill job postings mention Python. Popular Software Skills in Data Science Job postings. If you have a good set of basic tools, you can add specialized ones later.

## Using Load CSV in the Real World

November 22nd, 2014

Using Load CSV in the Real World by Nicole White.

From the description:

In this live-coding session, Nicole will demonstrate the process of downloading a raw .csv file from the Internet and importing it into Neo4j. This will include cleaning the .csv file, visualizing a data model, and writing the Cypher query that will import the data. This presentation is meant to make Neo4j users aware of common obstacles when dealing with real-world data in .csv format, along with best practices when using LOAD CSV.

A webinar with substantive content and not marketing pitches! Unusual but it does happen.

A very good walk through importing a CSV file into Neo4j, with some modeling comments along the way and hints of best practices.

The “next” thing for users after a brief introduction to graphs and Neo4j.

The experience will build their confidence and they will learn from experience what works best for modeling their data sets.

## Big data in minutes with the ELK Stack

November 21st, 2014

Big data in minutes with the ELK Stack by Philippe Creux.

From the post:

We’ve built a data analysis and dashboarding infrastructure for one of our clients over the past few weeks. They collect about 10 million data points a day. Yes, that’s big data.

My highest priority was to allow them to browse the data they collect so that they can ensure that the data points are consistent and contain all the attributes required to generate the reports and dashboards they need.

I chose to give the ELK stack a try: ElasticSearch, logstash and Kibana.

Is it just me or does processing “big data” seem to have gotten easier over the past several years?

But however easy or hard the processing, the value-add question is what do we know post data processing that we didn’t know before?

## Building an AirPair Dashboard Using Apache Spark and Clojure

November 21st, 2014

Building an AirPair Dashboard Using Apache Spark and Clojure by Sébastien Arnaud.

From the post:

Have you ever wondered how much you should charge per hour for your skills as an expert on AirPair.com? How many opportunities are available on this platform? Which skills are in high demand? My name is Sébastien Arnaud and I am going to introduce you to the basics of how to gather, refine and analyze publicly available data using Apache Spark with Clojure, while attempting to generate basic insights about the AirPair platform.

1.1 Objectives

Here are some of the basic questions we will be trying to answer through this tutorial:

• What is the lowest and the highest hourly rate?
• What is the average hourly rate?
• What is the most sought out skill?
• What is the skill that pays the most on average?

1.2 Chosen tools

In order to make this journey a lot more interesting, we are going to step out of the usual comfort zone of using a standard iPython notebook using pandas with matplotlib. Instead we are going to explore one of the hottest data technologies of the year: Apache Spark, along with one of the up and coming functional languages of the moment: Clojure.

As it turns out, these two technologies can work relatively well together thanks to Flambo, a library developed and maintained by YieldBot's engineers who have improved upon clj-spark (an earlier attempt from the Climate Corporation). Finally, in order to share the results, we will attempt to build a small dashboard using DucksBoard, which makes pushing data to a public dashboard easy.

In addition to illustrating the use of Apache Spark and Clojure, Sébastien also covers harvesting data from Twitter and processing it into a useful format.

Definitely worth some time over a weekend!

## Deep Visual-Semantic Alignments for Generating Image Descriptions

November 21st, 2014

Deep Visual-Semantic Alignments for Generating Image Descriptions by Andrej Karpathy and Li Fei-Fei.

From the webpage:

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.

Excellent examples with generated text. Code and other predictions “coming soon.”

For the moment you can also read the research paper: Deep Visual-Semantic Alignments for Generating Image Descriptions

Serious potential in any event but even more so if the semantics of the descriptions could be captured and mapped across natural languages.

## Land Matrix

November 21st, 2014

Land Matrix: The Online Public Database on Land Deals

From the webpage:

The Land Matrix is a global and independent land monitoring initiative that promotes transparency and accountability in decisions over land and investment.

This website is our Global Observatory – an open tool for collecting and visualising information about large-scale land acquisitions.

The data represented here is constantly evolving; to make this resource more accurate and comprehensive, we encourage your participation.

The deals collected as data must meet the following criteria:

• Entail a transfer of rights to use, control or ownership of land through sale, lease or concession;
• Have been initiated since the year 2000;
• Cover an area of 200 hectares or more;
• Imply the potential conversion of land from smallholder production, local community use or important ecosystem service provision to commercial use.

FYI, 200 hectares = 2 square kilometers.

Land ownership and its transfer are matters of law and law means government.

The project describes its data this way:

The dataset is inherently unreliable, but over time it is expected to become more accurate. Land deals are notoriously un-transparent. In many countries, established procedures for decision-making on land deals do not exist, and negotiations and decisions do not take place in the public realm. Furthermore, a range of government agencies and levels of government are usually responsible for approving different kinds of land deals. Even official data sources in the same country can therefore vary, and none may actually reflect reality on the ground. Decisions are often changed, and this may or may not be communicated publically.

I would start earlier than the year 2000 but the same techniques could be applied along the route of the Keystone XL pipeline. I am assuming that you are aware that pipelines, roads and other public works are not located purely for physical or aesthetic reasons. Yes?

Please take the time to view and support the Land Matrix project and consider similar projects in your community.

If the owners can be run to ground, you may find the parties to the transactions are linked by other “associations.”

## October 2014 Crawl Archive Available

November 21st, 2014

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration!

Enjoy!

## CERN frees LHC data

November 21st, 2014

CERN frees LHC data

From the post:

Today CERN launched its Open Data Portal, which makes data from real collision events produced by LHC experiments available to the public for the first time.

“Data from the LHC program are among the most precious assets of the LHC experiments, that today we start sharing openly with the world,” says CERN Director General Rolf Heuer. “We hope these open data will support and inspire the global research community, including students and citizen scientists.”

The LHC collaborations will continue to release collision data over the coming years.

The first high-level and analyzable collision data openly released come from the CMS experiment and were originally collected in 2010 during the first LHC run. Open source software to read and analyze the data is also available, together with the corresponding documentation. The CMS collaboration is committed to releasing its data three years after collection, after they have been thoroughly studied by the collaboration.

“This is all new and we are curious to see how the data will be re-used,” says CMS data preservation coordinator Kati Lassila-Perini. “We’ve prepared tools and examples of different levels of complexity from simplified analysis to ready-to-use online applications. We hope these examples will stimulate the creativity of external users.”

In parallel, the CERN Open Data Portal gives access to additional event data sets from the ALICE, ATLAS, CMS and LHCb collaborations that have been prepared for educational purposes. These resources are accompanied by visualization tools.

All data on OpenData.cern.ch are shared under a Creative Commons CC0 public domain dedication. Data and software are assigned unique DOI identifiers to make them citable in scientific articles. And software is released under open source licenses. The CERN Open Data Portal is built on the open-source Invenio Digital Library software, which powers other CERN Open Science tools and initiatives.

Awesome is the only term for this data release!

But, when you dig just a little bit further, you discover that embargoes still exist on three (3) out of (4) experiments. Both on data and software.

Disappointing but hopefully a dying practice when it comes to publicly funded data.

I first saw this in a tweet by Ben Evans.

## Apache Hive on Apache Spark: The First Demo

November 21st, 2014

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and a provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up on your technology to master list!

## Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use

November 21st, 2014

Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use by Hayley Mullen.

From the post:

Grammar is one of those funny things that sparks a wide range of reactions from different people. While one person couldn’t care less about colons vs. semicolons, another person will have a visceral reaction to a misplaced apostrophe or a “there” where a “their” is needed (if you fall into the latter category, hello and welcome).

I think we can still all agree on one thing: poor grammar and spelling takes away from your message and credibility. In the worst case, a blog post rife with errors will cause you to think twice about how knowledgeable the person who wrote it really is. In lesser cases, a “then” where a “than” should be is just distracting and reflects poorly on your editing skills. Which is a bummer.

More than the ills mentioned by Hayley, poor writing is hard to understand. Using standards or creating topic maps is hard enough without having to decipher poor writing.

If you already write well, a refresher never hurts. If you don’t write so well, take Hayley’s post to heart and learn from it.

There are errors in standards that tend to occur over and over again. Perhaps I should write a cheat sheet for common standard writing errors. Possible entries: Avoiding Definite Article Abuse, Saying It Once Is Enough, etc.

## Big bang of npm

November 21st, 2014

Big bang of npm

From the webpage:

npm is the largest package manager for javascript. This visualization gives you a small spaceship to explore the universe from inside. 106,317 stars (packages), 235,887 connections (dependencies).

Use WASD keys to move around. If you are browsing this with a modern smartphone – rotate your device around to control the camera (WebGL is required).

Navigation and other functions weren’t intuitive, at least not to me:

W – zooms in.

A – pans left.

S – zooms out.

D – pans right.

Choosing dependencies or dependents (lower left) filters the current view to show only dependencies or dependents of the chosen package.

Choosing a package name on lower left take you to the page for that package.

Search box at the top has a nice drop down of possible matches and displays dependencies or dependents by name, when selected below.

I would prefer more clues on the main display but given the density of the graph, that would quickly render it unusable.

Perhaps a way to toggle package names when displaying only a portion of the graph?

Users would have to practice with it but this technique could be very useful for displaying dense graphs. Say a map of the known contributions by lobbyists to members of Congress for example.

I first saw this in a tweet by Lincoln Mullen.

## Clojure/conf 2014 Videos! (Complete Set)

November 21st, 2014

## FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

November 20th, 2014

FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

From the post:

A legal battle between Yahoo and the government over the Protect America Act took place in 2008, but details (forced from the government’s Top Secret file folders by FISA Judge Reggie Walton) are only emerging now. A total of 1,500 pages will eventually make their way into the public domain once redactions have been applied. The most recent release is a transcript [pdf link] of oral arguments presented by Yahoo’s counsel (Mark Zwillinger) and the US Solicitor General (Gregory Garre).

Cutting to the chase:

But the most surprising assertions made in these oral arguments don’t come from the Solicitor General. They come from Judge Morris S. Arnold, who shows something nearing disdain for the privacy of the American public and their Fourth Amendment rights.

In the first few pages of the oral arguments, while discussing whether or not secret surveillance actually harms US citizens (or the companies forced to comply with government orders), Arnold pulls a complete Mike Rogers:

If this order is enforced and it’s secret, how can you be hurt? The people don’t know that — that they’re being monitored in some way. How can you be harmed by it? I mean, what’s –what’s the — what’s your — what’s the damage to your consumer?

By the same logic, all sorts of secret surveillance would be OK — like watching your neighbor’s wife undress through the window, or placing a hidden camera in the restroom — as long as the surveilled party is never made aware of it. If you don’t know it’s happening, then there’s nothing wrong with it. Right? [h/t to Alex Stamos]

In the next astounding quote, Arnold makes the case that the Fourth Amendment doesn’t stipulate the use of warrants for searches because it’s not written right up on top in bold caps… or something.

The whole thrust of the development of Fourth Amendment law has sort of emphasized the watchdog function of the judiciary. If you just look at the Fourth Amendment, there’s nothing in there that really says that a warrant is usually required. It doesn’t say that at all, and the warrant clause is at the bottom end of the Fourth Amendment, and — but that’s the way — that’s the way it has been interpreted.

What’s standing between US citizens and unconstitutional acts by their government is a very thin wall indeed.

Bear in mind that you are not harmed if you don’t know you are being spied upon.

I guess the new slogan is: Don’t Ask, Don’t Look, Don’t Worry.

Suggestions?

## UK seeks to shutter Russian site streaming video from webcams

November 20th, 2014

UK seeks to shutter Russian site streaming video from webcams by Barb Darrow.

From the post:

If you feel like someone’s watching you, you might be right.

A mega peeping Tom site out of Russia is collecting video and images from poorly secured webcams, closed-circuit TV cameras and even baby monitors around the world and is streaming the results. And now Christopher Graham, the U.K.’s information commissioner, wants to shut it down, according to this Guardian report.

According to the Guardian, Graham wants the Russian government to put the kibosh on the site and if that doesn’t happen will work with other regulators, including the U.S. Federal Trade Commission, to step in.

Earlier this month a NetworkWorld blogger wrote about a site, presumably the same one mentioned by Graham, with a Russian IP address that accesses some 73,000 unsecured security cameras.

The site has a pretty impressive inventory of images it said were gleaned from Foscam, Linksys, Panasonic security cameras, other unnamed “IP cameras” and AvTech and Hikvision DVRs, according to that post. The site was purportedly set up to illustrate the importance of updating default security passwords.

Apologies but it looks like the site is offline at the moment. Perhaps overload from visitors given the publicity.

An important reminder that security begins at home and with the most basic steps, such as changing default passwords.

Only if you access the site and find out that you have been spied upon will you suffer any harm.

I am completely serious, only if you discover you have been spied upon can you suffer any harm.

Authority for that statement? FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm.

## Over 1,000 research data repositories indexed in re3data.org

November 20th, 2014

Over 1,000 research data repositories indexed in re3data.org

From the post:

In August 2012 re3data.org – the Registry of Research Data Repositories went online with 23 entries. Two years later the registry provides researchers, funding organisations, libraries and publishers with over 1,000 listed research data repositories from all over the world making it the largest and most comprehensive online catalog of research data repositories on the web. re3data.org provides detailed information about the research data repositories, and its distinctive icons help researchers easily identify relevant repositories for accessing and depositing data sets.

To more than 5,000 unique visitors per month re3data.org offers reliable orientation in the heterogeneous landscape of research data repositories. An average of 10 repositories are added to the registry every week. The latest indexed data infrastructure is the new CERN Open Data Portal.

## Senate Republicans are getting ready to declare war on patent trolls

November 20th, 2014

Senate Republicans are getting ready to declare war on patent trolls by Timothy B. Lee

From the post:

Republicans are about to take control of the US Senate. And when they do, one of the big items on their agenda will be the fight against patent trolls.

In a Wednesday speech on the Senate floor, Sen. Orrin Hatch (R-UT) outlined a proposal to stop abusive patent lawsuits. “Patent trolls – which are often shell companies that do not make or sell anything – are crippling innovation and growth across all sectors of our economy,” Hatch said.

Hatch, the longest-serving Republican in the US Senate , is far from the only Republican in Congress who is enthusiastic about patent reform. The incoming Republican chairmen of both the House and Senate Judiciary committees have signaled their support for patent legislation. And they largely see eye to eye with President Obama, who has also called for reform.

“We must improve the quality of patents issued by the U.S. Patent and Trademark Office,” Hatch said. “Low-quality patents are essential to a patent troll’s business model.” His speech was short on specifics here, but one approach he endorsed was better funding for the patent office. That, he argued, would allow “more and better-trained patent examiners, more complete libraries of prior art, and greater access to modern information technologies to address the agency’s growing needs.”

I would hate to agree with Senator Hatch on anything but there is no doubt that low-quality patents are rife at the U.S. Patent and Trademark Office. Whether patent trolls simply took advantage of the quality of patents or are responsible for low quality patents it’s hard to say.

In any event, the call for “…more complete libraries of prior art, and greater access to modern information technologies…” sounds like a business opportunity for topic maps.

After all, we all know that faster, more comprehensive search engines of the patent literature only gives you more material to review. It doesn’t give you more relevant material to review. Or give you material you did not know to look for. Only additional semantics has the power to accomplish either of those tasks.

There are those who will keep beating bags of words in hopes that semantics will appear.

Don’t be one of those. Choose an area of patents of interest and use interactive text mining to annotate existing terms with semantics (subject identity) which will reduce misses and increase the usefulness of “hits.”

That isn’t a recipe for mining all existing patents but who wants to do that? If you gain a large enough semantic advantage in genomics, semiconductors, etc., the start-up cost to catch up will be a tough nut to crack. Particularly since you are already selling a better product for a lower price than a start-up can match.

I first saw this in a tweet by Tim O’Reilly.

PS: A better solution for software patent trolls would be a Supreme Court ruling that eliminates all software patents. Then Congress could pass a software copyright bill that grants copyright status on published code for three (3) years, non-renewable. If that sounds harsh, consider the credibility impact of nineteen year old bugs.

If code had to be recast every three years and all vendors were on the same footing, there would be a commercial incentive for better software. Yes? If I had the coding advantages of a major vendor, I would start lobbying for three (3) year software copyrights tomorrow. Besides, it would make software piracy a lot easier to track.