How To Publish Open Data (in the UK)

March 2nd, 2015

No way this will display properly, so I just linked to it.

I don’t know about the UK, but a very similar discussion takes place in academic circles before releasing data that fewer than a dozen people have asked to see, ever.


I first saw this in a tweet by Irina Bolychevsky.

RAD – Outlier Detection on Big Data

March 2nd, 2015

RAD – Outlier Detection on Big Data by Jeffrey Wong, Chris Colburn, Elijah Meeks, and Shankar Vedaraman.

From the post:

Outlier detection can be a pain point for all data driven companies, especially as data volumes grow. At Netflix we have multiple datasets growing by 10B+ record/day and so there’s a need for automated anomaly detection tools ensuring data quality and identifying suspicious anomalies. Today we are open-sourcing our outlier detection function, called Robust Anomaly Detection (RAD), as part of our Surus project.

As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”

  • High cardinality dimensions: High cardinality data sets – especially those with large combinatorial permutations of column groupings – makes human inspection impractical.
  • Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
  • Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
  • Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.

In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.
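RAD itself rests on more sophisticated machinery than can fit here (the post links to details), but the last bullet’s point, robustness on data that is not normally distributed, can be illustrated with a much simpler median/MAD detector. A sketch of the idea, not Netflix’s algorithm:

```python
def mad_outliers(values, threshold=3.5):
    """Flag outliers using the median and MAD (median absolute deviation).

    Unlike mean/standard-deviation rules, median-based statistics are
    robust: a few extreme points barely move them, so the method still
    works when the data is not normally distributed.
    """
    n = len(values)
    med = sorted(values)[n // 2]
    mad = sorted(abs(v - med) for v in values)[n // 2]
    if mad == 0:
        return []  # no spread at all: nothing to flag
    # 0.6745 rescales MAD to be comparable to a standard deviation
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 500]  # one obvious anomaly
print(mad_outliers(data))  # -> [500]
```

Note that the single extreme value barely shifts the median, which is exactly why a mean-based rule would do worse here.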

Looking for “suspicious anomalies” is always popular, in part because it implies someone has deliberately departed from “normal” behavior.

Certainly important, but as the FBI-staged terror plots we discussed earlier today show, the normal FBI “m.o.” is to stage terror plots; an anomaly would be a real terror plot, one not staged by the FBI.

The lesson: don’t assume outliers are departures from a desired norm. They can be, but they aren’t always.

CartoDB and Plotly Analyze Earthquakes

March 2nd, 2015

CartoDB and Plotly Analyze Earthquakes

From the post:

CartoDB lets you easily make web-based maps driven by a PostgreSQL/PostGIS backend, so data management is easy. Plotly is a cloud-based graphing and analytics platform with Python, R, & MATLAB APIs where collaboration is easy. This IPython Notebook shows how to use them together to analyze earthquake data.

Assuming your data/events have geographic coordinates, this post should enable you to plot that information as easily as earthquakes.

For example, if you had traffic accident locations, delays caused by those accidents and weather conditions, you could plot where the most disruptive accidents happen and the weather conditions in which they occur.
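A sketch of that accident example, with made-up field names and records purely for illustration:

```python
from collections import defaultdict

# Hypothetical records: accident locations, delay minutes, and weather.
accidents = [
    {"location": "I-85 exit 33", "delay_min": 95, "weather": "rain"},
    {"location": "I-85 exit 33", "delay_min": 40, "weather": "fog"},
    {"location": "Hwy 29 bridge", "delay_min": 10, "weather": "clear"},
]

# Total disruption per location, and per weather condition.
by_location = defaultdict(int)
by_weather = defaultdict(int)
for a in accidents:
    by_location[a["location"]] += a["delay_min"]
    by_weather[a["weather"]] += a["delay_min"]

print(max(by_location, key=by_location.get))  # most disruptive location
```

Once aggregated like this, the per-location totals are exactly what you would hand to CartoDB or Plotly to map.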

Drilling Down: A Quick Guide to Free and Inexpensive Data Tools

March 2nd, 2015

Drilling Down: A Quick Guide to Free and Inexpensive Data Tools by Nils Mulvad.

From the post:

Newsrooms don’t need large budgets for analyzing data–they can easily access basic data tools that are free or inexpensive. The summary below is based on a five-day training session at Delo, the leading daily newspaper in Slovenia. Anuška Delić, journalist and project leader of DeloData at the paper, initiated the training with the aim of getting her team to work on data stories with easily available tools and a lot of new data.

“At first it seemed that not all of the 11 participants, who had no or almost no prior knowledge of this exciting field of journalism, would ‘catch the bug’ of data-driven thinking about stories, but soon it became obvious” once the training commenced, said Delić.

Encouraging story about data journalism as well as a source for inexpensive tools.

Even knowing the most basic tools will make you stand out from people who repeat the government or party line (depending on where you are located).

Code for DeepMind & Commentary

March 2nd, 2015

If you are following the news of Google’s Atari buster, ;-), the following items will be of interest:

Code for Human-Level Control through Deep Reinforcement Learning, which offers the source code to accompany the Nature article.

DeepMind’s Nature Paper and Earlier Related Work by Jürgen Schmidhuber. Jürgen takes issue with some of the claims made in the abstract of the Nature paper. Quite usefully he cites references and provides links to numerous other materials on deep learning.

How soon before this comes true?

In an online multiplayer game, no one knows you are an AI.

Azure Machine Learning Videos: February 2015

March 2nd, 2015

Azure Machine Learning Videos: February 2015 by Mark Tabladillo.

From the post:

With the general availability of Azure Machine Learning, Microsoft released a collection of eighteen new videos which accurately summarize what the product does and how to use it. Most of the videos are short, and some of the material overlaps: I don’t have a recommended order, but you could play the shorter ones first. In all cases, you can download a copy of each video for your own library or offline use.

Eighteen new videos of varying lengths; the shortest and longest are:

  • Getting Started with Azure Machine Learning – Step3 (35 seconds)
  • Preprocessing Data in Azure Machine Learning Studio (10 minutes 52 seconds)

Believe it or not, it is possible to say something meaningful in 35 seconds. Not a lot but enough to suggest an experiment based on information from a previous module.

For those of you on the MS side of the house or anyone who likes a range of employment options.


Operationalizing a Hadoop Eco-System

March 2nd, 2015

Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster) by Louis Frolio.

From the post:

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data. The relationship between these disciplines and data can be complex. However, if careful consideration is given to a tutorial, it is a practical expectation that the layman can be brought online quickly. With that said, I am extremely excited to bring this tutorial on the Hadoop Eco-system. Hadoop & MapReduce (at a high level) are not complicated ideas. Basically, you take a large volume of data and spread it across many servers (HDFS). Once at rest, the data can be acted upon by the many CPU’s in the cluster (MapReduce). What makes this so cool is that the traditional approach to processing data (bring data to cpu) is flipped. With MapReduce, CPU is brought to the data. This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data. In part 1 of this multi-part series, I am going to demonstrate how to install, configure and run a 3-node Hadoop cluster. Finally, at the end I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet. Future installments of this series will include topics such as: 1. Creating an advanced word count with MapReduce, 2. Installing and running Hive, 3. Installing and running Pig, 4. Using Sqoop to extract and import structured data into HDFS. The goal is to illuminate all the popular and useful tools that support Hadoop.
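The word count Louis runs against Hamlet follows the classic map/shuffle/reduce shape. A single-process Python sketch of that shape (an illustration of the idea, not Hadoop code):

```python
from collections import defaultdict

def map_phase(text):
    # Emit (word, 1) pairs, like a Hadoop mapper.
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    # Group values by key, like the framework's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, like a Hadoop reducer.
    return {word: sum(counts) for word, counts in groups.items()}

text = "to be or not to be"
counts = reduce_phase(shuffle(map_phase(text)))
print(counts)  # -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Hadoop’s contribution is running the map and reduce phases on many machines at once, with HDFS holding the data where the CPUs are.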

Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)

Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)

Be forewarned that Louis suggests hosting three Linux VMs on a fairly robust machine. He worked on a Windows 7 x64 machine with 1 TB of storage and 24G of RAM. (How much of that was used by Windows and Office he doesn’t say. ;-) )

The last post in this series was in April 2014 so you may have to look elsewhere for tutorials on Pig and Sqoop.


‘Keep Fear Alive.’ Keep it alive.

March 2nd, 2015

Why Does the FBI Have To Manufacture Its Own Plots If Terrorism And ISIS Are Such Grave Threats? by Glenn Greenwald.

From the post:

The FBI and major media outlets yesterday trumpeted the agency’s latest counterterrorism triumph: the arrest of three Brooklyn men, ages 19 to 30, on charges of conspiring to travel to Syria to fight for ISIS (photo of joint FBI/NYPD press conference, above). As my colleague Murtaza Hussain ably documents, “it appears that none of the three men was in any condition to travel or support the Islamic State, without help from the FBI informant.” One of the frightening terrorist villains told the FBI informant that, beyond having no money, he had encountered a significant problem in following through on the FBI’s plot: his mom had taken away his passport. Noting the bizarre and unhinged ranting of one of the suspects, Hussain noted on Twitter that this case “sounds like another victory for the FBI over the mentally ill.”

In this regard, this latest arrest appears to be quite similar to the overwhelming majority of terrorism arrests the FBI has proudly touted over the last decade. As my colleague Andrew Fishman and I wrote last month — after the FBI manipulated a 20-year-old loner who lived with his parents into allegedly agreeing to join an FBI-created plot to attack the Capitol — these cases follow a very clear pattern:

The known facts from this latest case seem to fit well within a now-familiar FBI pattern whereby the agency does not disrupt planned domestic terror attacks but rather creates them, then publicly praises itself for stopping its own plots.


In an update to the post, Greenwald quotes former FBI assistant director Thomas Fuentes as saying:

If you’re submitting budget proposals for a law enforcement agency, for an intelligence agency, you’re not going to submit the proposal that “We won the war on terror and everything’s great,” cuz the first thing that’s gonna happen is your budget’s gonna be cut in half. You know, it’s my opposite of Jesse Jackson’s ‘Keep Hope Alive’—it’s ‘Keep Fear Alive.’ Keep it alive. (emphasis in the original)

The FBI-run terror operations give a ring of validity to the imagined plots that the rest of the intelligence and law enforcement community is alleged to be fighting.

It’s unfortunate that the mainstream media can’t divorce itself from the government long enough to notice the shortage of terrorists in the United States. As in zero, judging from terrorist attacks on government and many other institutions.

For example, the federal, state and local governments employ 21,831,255 people. Let’s see, how many died last year in terrorist attacks against any level of government? Err, that would be 0, empty set, nil.

What about all the local, state, federal elected officials? Certainly federal officials would be targets for terrorists. How many died last year in terrorist attacks? Again, 0, empty set, nil.

Or the 900,000 police officers? Again, 0, empty set, nil. (About 150 police officers die every year in the line of duty. Auto accidents, violent encounters with criminals, etc. but no terrorists.)

That covers some of the likely targets for any terrorist and we came up with zero deaths. Either terrorists aren’t in the United States or their mother won’t let them buy a gun.

Either way, you can see why everyone should be rejecting the fear narrative.

PS: Suggestion: Let’s cut all the terrorist related budgets in half and if there are no terrorist attacks within a year, halve them again. Then there would be no budget crisis, we could pay down the national debt, save Social Security and not live in fear.

Beginning deep learning with 500 lines of Julia

March 2nd, 2015

Beginning deep learning with 500 lines of Julia by Deniz Yuret.

From the post:

There are a number of deep learning packages out there. However most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginner student to understand what the code is doing, which is a shame because sometimes the code can be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package used already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, ADAGRAD, Nesterov’s accelerated gradient etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The remainder of this post will present (a slightly cleaned up version of) the code as a beginner’s neural network tutorial (modeled after Honnibal’s excellent parsing example).
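The ingredients Yuret lists (a loss function, gradients, SGD updates) can be tasted in a few lines of plain Python. A toy logistic regression trained with SGD, to show the update rule; this is not KUnet.jl, which is written in Julia and does far more:

```python
import math

# Toy dataset: the label is simply the first feature, so the problem
# is linearly separable and SGD converges quickly.
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.5

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

for epoch in range(200):
    for x, y in data:
        p = predict(x)
        err = p - y              # gradient of cross-entropy loss wrt z
        w[0] -= lr * err * x[0]  # SGD update, feature by feature
        w[1] -= lr * err * x[1]
        b -= lr * err

print([round(predict(x)) for x, _ in data])  # -> [0, 0, 1, 1]
```

Backprop in a deep network is the same `err`-times-input bookkeeping, applied layer by layer via the chain rule, which is why Yuret can fit it in 500 lines.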

This tutorial “begins” with you coding deep learning. If you need a bit more explanation on deep learning, you could do far worse than consulting Deep Learning: Methods and Applications or Deep Learning in Neural Networks: An Overview.

If you are already at the programming stage of deep learning, enjoy!

For Julia itself, the Julia homepage, the online manual, and the Julia blog aggregator should be enough to get you started.

I first saw this in a tweet by Andre Pemmelaar.

Let Me Get That Data For You (LMGTDFY)

March 1st, 2015

Let Me Get That Data For You (LMGTDFY) by U.S. Open Data.

From the post:

LMGTDFY is a web-based utility to catalog all open data file formats found on a given domain name. It finds CSV, XML, JSON, XLS, XLSX, XML, and Shapefiles, and makes the resulting inventory available for download as a CSV file. It does this using Bing’s API.

This is intended for people who need to inventory all data files on a given domain name—these are generally employees of state and municipal government, who are creating an open data repository, and performing the initial step of figuring out what data is already being emitted by their government.

LMGTDFY powers U.S. Open Data’s LMGTDFY site, but anybody can install the software and use it to create their own inventory. You might want to do this if you have more than 300 data files on your site. U.S. Open Data’s LMGTDFY site caps the number of results at 300, in order to avoid winding up with an untenably large invoice for using Bing’s API. (Microsoft allows 5,000 searches/month for free.)
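The inventory step LMGTDFY automates can be sketched in miniature: classify a list of URLs (hard-coded below; LMGTDFY gets them from Bing’s API) by open-data file format. The URLs and the extension list are illustrative, not LMGTDFY’s actual code:

```python
from collections import Counter
from urllib.parse import urlparse

DATA_EXTENSIONS = {"csv", "xml", "json", "xls", "xlsx", "shp"}

def inventory(urls):
    """Count how many URLs end in each open-data file extension."""
    counts = Counter()
    for url in urls:
        ext = urlparse(url).path.rsplit(".", 1)[-1].lower()
        if ext in DATA_EXTENSIONS:
            counts[ext] += 1
    return counts

urls = [
    "http://example.gov/budget/2014.csv",
    "http://example.gov/budget/2015.csv",
    "http://example.gov/parcels.shp",
    "http://example.gov/about.html",  # not a data file, ignored
]
print(inventory(urls))  # -> Counter({'csv': 2, 'shp': 1})
```

The hard part LMGTDFY solves is discovery, finding those URLs across an entire domain, which is where the Bing API (and its 300-result cap) comes in.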

Now there’s a useful utility!


I first saw this in a tweet by Pycoders Weekly.

John Snow, and OpenStreetMap

March 1st, 2015

John Snow, and OpenStreetMap by Arthur Charpentier.

From the post:


While I was working for a training on data visualization, I wanted to get a nice visual for John Snow’s cholera dataset. This dataset can actually be found in a great package of famous historical datasets.

You know the story, right? Cholera epidemic in Soho, London, 1854. After Snow established that the Broad Street water pump was at the center of the outbreak, the Broad Street pump handle was removed.

But the story doesn’t end there, Wikipedia notes:

After the cholera epidemic had subsided, government officials replaced the Broad Street pump handle. They had responded only to the urgent threat posed to the population, and afterward they rejected Snow’s theory. To accept his proposal would have meant indirectly accepting the oral-fecal method transmission of disease, which was too unpleasant for most of the public to contemplate.

Government has been looking out for public opinion, rather than public health and well-being, for quite some time.

Replicating the Snow analysis is important, but it is even more important to realize that the equivalents of cholera are present in modern urban environments: street violence, bad drugs, high-interest loans, food deserts, lack of child care, etc.

What if a John Snow-like mapping demonstrated that living in particular areas made you some N% more likely to spend X number of years in a state prison? Do you think that would affect the property values of housing owned by slum lords? Or impact the allocation of funds for schools and libraries?


Big Data Never Sleeps

March 1st, 2015


Suggestion: Enlarge and print out this graphic on 8 1/2 x 11 (or A4 outside of the US) paper. When “big data” sales people come calling, hand them a copy of it and ask them to outline the relevancy of any of the shown data to your products and business model.

Don’t get me wrong, there are areas seen and unforeseen, where big data is going to have unimaginable impacts.

However, big data solutions will be sold where appropriate and where not. The only way to protect yourself is to ask the same questions of big data sales people as you would of any vendor selling you more conventional equipment for your business. What is the cost? What benefits do you gain? How does it impact your profit margins? Will it result in new revenue streams and what has been the experience of others with those streams?

Or do you want to be YouTube, still not making a profit? If you like churn, perhaps so, but churn is a business model for hedge fund managers, for the most part.

I first saw this in a tweet by Veronique Milsant.

Clojure and Overtone Driving Minecraft

March 1st, 2015

Clojure and Overtone Driving Minecraft by Joseph Wilk.

From the post:

Using Clojure we create interesting 3D shapes in Minecraft to the beat of music generated from Overtone.

We achieve this by embedding a Clojure REPL inside a Java Minecraft server which loads Overtone and connects to an external Supercollider instance (What Overtone uses for sound).

Speaking of functional programming, you may find this useful.

The graphics and music are impressive!

Help! Lost Source! (for story)

March 1st, 2015

I read a delightful account of functional versus imperative programming yesterday while in the middle of a major system upgrade. Yes, I can tell by your laughter that you realize I either failed to bookmark the site and/or lost it somewhere along the way. Yes, I have tried searching for it but with all the interest in functional programming, I was about as successful as the NSA in predicting the next terror attack.

Let me relate to you as much of it as I remember, in no particular order, and perhaps you will recognize the story. It was quite clever and I want to cite it properly as well as excerpt parts of it for this blog.

The story starts off talking about functional programming and how this is the author’s take on that subject. They start with Turing and the Turing machine, observing that the Turing machine writes down results on a tape, results that are consulted by later operations.

After a paragraph or two, they move on to Church and the lambda calculus. Rather than writing down results, functional programming passes results from function to function.

I thought it was an excellent illustration of why Turing machines have “state” (marks on the tape) whereas functional languages (in theory at any rate) have none.

Other writers have made the same distinction but I found the author’s focus on whether results are captured or not being the clearest I have seen.
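That distinction can be shown in a few lines: the imperative version keeps marks on a “tape” (a mutable variable) that later steps consult, while the functional version hands each result directly to the next call. A toy illustration of my own, not from the lost post:

```python
# Imperative: results are written to mutable state ("marks on the tape")
# that later iterations consult and overwrite.
def total_imperative(numbers):
    total = 0
    for n in numbers:
        total = total + n  # state updated step by step
    return total

# Functional: no mutable cell; each call passes its partial result
# forward as an argument to the next call.
def total_functional(numbers, acc=0):
    if not numbers:
        return acc
    return total_functional(numbers[1:], acc + numbers[0])

print(total_imperative([1, 2, 3, 4]), total_functional([1, 2, 3, 4]))  # 10 10
```

Same answer either way; the difference is where the intermediate results live.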

My impression is that the piece was fairly recent, in the last month or two but I could be mistaken in that regard. It was a blog post and not terribly long. (Exclude published articles and the like.)

Are you the author? Know of the author?

Pointers are most welcome!


Algorithmia

February 28th, 2015


Algorithmia was born in 2013 with the goal of advancing the art of algorithm development, discovery and use. As developers ourselves we believe that given the right tools the possibilities for innovation and discovery are limitless.

Today we build what we believe to be the next era of programming: a collaborative, always live and community driven approach to making the machines that we interact with better.

The community drives the Algorithmia API. One API that exposes the collective knowledge of algorithm developers across the globe.

Currently in private beta but sounds very promising!

I first saw Algorithmia mentioned in Algorithmia API Exposes Collective Knowledge of Developers by Martin W. Brennan.

MI5 accused of covering up sexual abuse at boys’ home

February 28th, 2015

MI5 accused of covering up sexual abuse at boys’ home by Vikram Dodd and Richard Norton-Taylor.

From the post:

MI5 is facing allegations it was complicit in the sexual abuse of children, the high court in Northern Ireland will hear on Tuesday.

Victims of the abuse are taking legal action to force a full independent inquiry with the power to compel witnesses to testify and the security service to hand over documents.

The case, in Belfast, is the first in court over the alleged cover-up of British state involvement at the Kincora children’s home in Northern Ireland in the 1970s. It is also the first of the recent sex abuse cases allegedly tying in the British state directly. Victims allege that the cover-up over Kincora has lasted decades.

The victims want the claims of state collusion investigated by an inquiry with full powers, such as the one set up into other sex abuse scandals chaired by the New Zealand judge Lowell Goddard.

Amnesty International branded Kincora “one of the biggest scandals of our age” and backed the victims’ calls for an inquiry with full powers: “There are longstanding claims that MI5 blocked one or more police investigations into Kincora in the 1970s in order to protect its own intelligence-gathering operation, a terrible indictment which raises the spectre of countless vulnerable boys having faced further years of brutal abuse.

It’s too early to claim victory, but see Belfast boys’ home abuse victims win legal bid by Henry McDonald:

Residents of a notorious Northern Ireland boys’ home are to be allowed to challenge a decision to exclude it from the UK-wide inquiry into establishment paedophile rings.

A high court judge in Belfast on Tuesday granted a number of former inmates from the Kincora home a judicial review into the decision to keep this scandal out of the investigation, headed by judge Lowell Goddard from New Zealand.

The Kincora boys’ home has been linked to a paedophile ring, some of whose members were allegedly being blackmailed by MI5 and other branches of the security forces during the Troubles.

Until now, the home secretary, Theresa May, has resisted demands from men who were abused at the home – and Amnesty International – that the inquiry be widened to include Kincora.

The campaigners want to establish whether the security services turned a blind eye to the abuse and instead used it to compromise a number of extreme Ulster loyalists guilty of abusing boys at the home.

If you read carefully you will see the abuse victims have won the right to challenge the exclusion of the boys’ home from a UK-wide investigation. That is a long way from forcing MI5 and other collaborators in the sexual abuse of children to provide an accounting in the clear light of day.

Leaked documents, caches of spy cables and spy documents always show agents of the government protecting war criminals and paedophiles, engaging in torture, including rape, and other dishonorable conduct.

My question is: why does the mainstream media honor the fiction that government secrets are meant to protect the public? Government secrets are used to protect the guilty, the dishonorable and the despicable. What’s unclear about that?

ClojureScript Tutorial

February 27th, 2015

ClojureScript Tutorial by Andrey Antukh.

From the webpage:

This tutorial aims to provide an introduction to ClojureScript, from very basic setup to a more complex application, in little incremental steps.

It includes:

  • Setup initial clojure app layout.
  • Setup initial clojurescript app layout.
  • First contact to clojurescript language.
  • Working with dom events.
  • Working with routing in the browser.
  • Working with ajax requests.
  • First contact with core.async.
  • Working with events and ajax using core.async.
  • First contact with om/reactjs.
  • Working with om and timetraveling.
  • Working with om and peristent state.
  • Little bonus: browser enabled repl.

I mention this because it will be helpful background for the following talk:

From the description:

Facebook’s React uses a virtual DOM diff implementation for high performance. It updates the view only when it’s needed. But David Nolen’s Om library (ClojureScript wrapper over React) goes even further. It stores application state in one place and passes “branches” of that state to a number of components. Data is immutable, and components are reusable. No more juggling around with JavaScript object literals. If anyone likes data as much as I do they will enjoy working with Om. It’s a great tool for building user interfaces around your data.

This talk will show how to combine core.async, liberator and om with JavaScript visualisation libraries to create interactive charts and maps. I will walk everyone through how to:

  • Generate a resource that can be used in a route and use it to pull data for visualisations
  • Use om to create reusable components out of JavaScript libraries: dimple.js and leaflet.js
  • Create core.async channels and use them communicate user clicks and selections between those components, e.g. selection on a checkbox component triggers reloading of data on a chart component.

The talk will be a practical guide to building small web applications and will be accessible to Clojurians with a basic knowledge of Clojure and HTML.

Enjoy!

Onion City – a search engine bringing the Dark Web into the light

February 27th, 2015

Onion City – a search engine bringing the Dark Web into the light by Mark Stockley.

From the post:

The Dark Web is reflecting a little more light these days.

On Monday I wrote about Memex, DARPA’s Deep Web search engine. Memex is a sophisticated tool set that has been in the hands of a few select law enforcement agencies for a year now, but it isn’t available to regular users like you and me.

There is another search engine that is though.

Just a few days before I wrote that article, on 11 February, user Virgil Griffith went onto the Tor-talk mailing list and announced Onion City, a Dark Web search engine for the rest of us.

The search engine delves into the anonymous Tor network, finds .onion sites and makes them available to regular users on the ordinary World Wide Web.


Search and Access to Onion sites for Amusement ONLY! All of your activities are transparent to anyone capturing your web traffic.

If you need security and privacy, use a Tor client.

With that understanding: Onion City awaits your requests.
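Tor2web-style gateways like Onion City generally work by mapping hidden-service hostnames onto subdomains of an ordinary web domain, which is why the gateway (not Tor) sees your traffic. A sketch of that rewriting; the gateway domain here is hypothetical, for illustration only:

```python
from urllib.parse import urlparse, urlunparse

# Hypothetical gateway domain, standing in for a real Tor2web service.
GATEWAY = "gateway.example"

def to_gateway_url(onion_url):
    """Rewrite a .onion URL as a clearnet URL served via the gateway."""
    parts = urlparse(onion_url)
    host = parts.netloc.replace(".onion", ".onion." + GATEWAY)
    return urlunparse(parts._replace(scheme="https", netloc=host))

print(to_gateway_url("http://abcdef1234567890.onion/page"))
# -> https://abcdef1234567890.onion.gateway.example/page
```

The rewrite is trivial; the privacy cost is the point: every request and response passes through the gateway in the clear, which is exactly the warning above.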

Is there demand for an internal-to-Tor search engine? Supported by internal-to-Tor advertising? Or is most Tor “marketing” by referral?

300 Data journalism blogs [1 Feedly OPML File]

February 27th, 2015

Data journalism blogs by Winny De Jong.

From the post:

At the News Impact Summit in Brussels I presented my workflow for getting ideas. Elsewhere on the blog a recap including interesting links. The RSS reader Feedly is a big part of my setup: together with Pocket its my most used app. Both are true lifesavers when reading is your default.

Since a lot of people of the News Summit audience use Feedly as well, I made this page to share my Feedly OPML file. If you’re not sure what an OPML file is read this page at

Download my Feedly OPML export containing 300+ data journalism related sites here
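OPML is just XML: each feed is an `<outline>` element carrying an `xmlUrl` attribute. A sketch of pulling every feed URL out of an export like De Jong’s, using a made-up two-feed sample in place of her real file:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up OPML export with two feeds.
sample_opml = """<opml version="1.0">
  <body>
    <outline text="Data journalism">
      <outline text="Example blog" type="rss"
               xmlUrl="http://example.com/feed.xml"/>
      <outline text="Another blog" type="rss"
               xmlUrl="http://example.org/rss"/>
    </outline>
  </body>
</opml>"""

def feed_urls(opml_text):
    """Return the xmlUrl of every <outline> that has one."""
    root = ET.fromstring(opml_text)
    return [node.get("xmlUrl") for node in root.iter("outline")
            if node.get("xmlUrl")]

print(len(feed_urls(sample_opml)))  # -> 2
```

Run against the real 300-blog export, the same function gives you a plain list of feed URLs to import anywhere that doesn’t speak OPML.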

Now that is a great way to start the weekend!

With a file of three hundred (300) data blogs!


Comparing supervised learning algorithms

February 27th, 2015

Comparing supervised learning algorithms by Kevin Markham.

From the post:

In the data science course that I instruct, we cover most of the data science pipeline but focus especially on machine learning. Besides teaching model evaluation procedures and metrics, we obviously teach the algorithms themselves, primarily for supervised learning.

Near the end of this 11-week course, we spend a few hours reviewing the material that has been covered throughout the course, with the hope that students will start to construct mental connections between all of the different things they have learned. One of the skills that I want students to be able to take away from this course is the ability to intelligently choose between supervised learning algorithms when working a machine learning problem. Although there is some value in the “brute force” approach (try everything and see what works best), there is a lot more value in being able to understand the trade-offs you’re making when choosing one algorithm over another.

I decided to create a game for the students, in which I gave them a blank table listing the supervised learning algorithms we covered and asked them to compare the algorithms across a dozen different dimensions. I couldn’t find a table like this on the Internet, so I decided to construct one myself! Here’s what I came up with:

Eight (8) algorithms compared across a dozen (12) dimensions.

What algorithms would you add? Comments to add or take away?

Looks like the start of a very useful community resource.

Po’ Boy MapReduce

February 27th, 2015


Posted by Mirko Krivanek as What Is MapReduce?, credit @Tgrall

Have You Tried DRAKON Comrade? (Russian Space Program Specification Language)

February 27th, 2015


From the webpage:

DRAKON is a visual language for specifications from the Russian space program. DRAKON is used for capturing requirements and building software that controls spacecraft.

The rules of DRAKON are optimized to ensure easy understanding by human beings.

DRAKON is gaining popularity in other areas beyond software, such as medical textbooks. The purpose of DRAKON is to represent any knowledge that explains how to accomplish a goal.

DRAKON Editor is a free tool for authoring DRAKON flowcharts. It also supports sequence diagrams, entity-relationship and class diagrams.

With DRAKON Editor, you can quickly draw diagrams for:

  • software requirements and specifications;
  • documenting existing software systems;
  • business processes;
  • procedures and rules;
  • any other information that tells “how to do something”.

DRAKON Editor runs on Windows, Mac and Linux.

The user interface of DRAKON Editor is extremely simple and straightforward.

Software developers can build real programs with DRAKON Editor. Source code can be generated in several programming languages, including Java, D, C#, C/C++ (with Qt support), Python, Tcl, Javascript, Lua, Erlang, AutoHotkey and Verilog.

I note with amusement that the DRAKON Editor has no “save” button. Rest easy! It saves all input automatically, removing the need for one. About time!

Download DRAKON editor.

I am in the middle of an upgrade so look for sample images next week.

Banning p < .05 In Psychology [Null Hypothesis Significance Testing Procedure (NHSTP)]

February 27th, 2015

The recent banning of the Null Hypothesis Significance Testing Procedure (NHSTP) in psychology should be a warning to would-be data scientists that even “well established” statistical procedures may be deeply flawed.

Sorry, you may not have seen the news. In Basic and Applied Social Psychology (BASP), Banning Null Hypothesis Significance Testing Procedure (NHSTP) (2015) David Trafimow and Michael Marks write

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid, and thus authors would be not required to perform it (Trafimow, 2014). However, to allow authors a grace period, the Editorial stopped short of actually banning the NHSTP. The purpose of the present Editorial is to announce that the grace period is over. From now on, BASP is banning the NHSTP.

You may be more familiar with seeing p < .05 rather than Null Hypothesis Significance Testing Procedure (NHSTP).

In the 2014 editorial warning about NHSTP, David Trafimow cites his earlier work, Hypothesis Testing and Theory Evaluation at the Boundaries: Surprising Insights From Bayes’s Theorem (2003), as justification for the non-use and later ban of NHSTP.

His argument is summarized in the introduction:

Despite a variety of different criticisms, the standard null-hypothesis significance-testing procedure (NHSTP) has dominated psychology over the latter half of the past century. Although NHSTP has its defenders when used “properly” (e.g., Abelson, 1997; Chow, 1998; Hagen, 1997; Mulaik, Raju, & Harshman, 1997), it has also been subjected to virulent attacks (Bakan, 1966; Cohen, 1994; Rozeboom, 1960; Schmidt, 1996). For example, Schmidt and Hunter (1997) argue that NHSTP is “logically indefensible and retards the research enterprise by making it difficult to develop cumulative knowledge” (p. 38). According to Rozeboom (1997), “Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students” (p. 336). The most important reason for these criticisms is that although one can calculate the probability of obtaining a finding given that the null hypothesis is true, this is not equivalent to calculating the probability that the null hypothesis is true given that one has obtained a finding. Thus, researchers are in the position of rejecting the null hypothesis even though they do not know its probability of being true (Cohen, 1994). One way around this problem is to use Bayes’s theorem to calculate the probability of the null hypothesis given that one has obtained a finding, but using Bayes’s theorem carries with it some problems of its own, including a lack of information necessary to make full use of the theorem. Nevertheless, by treating the unknown values as variables, it is possible to conduct some analyses that produce some interesting conclusions regarding NHSTP. These analyses clarify the relations between NHSTP and Bayesian theory and quantify exactly why the standard practice of rejecting the null hypothesis is, at times, a highly questionable procedure.
In addition, some surprising findings come out of the analyses that bear on issues pertaining not only to hypothesis testing but also to the amount of information gained from findings and theory evaluation. It is possible that the implications of the following analyses for information gain and theory evaluation are as important as the NHSTP debate.

The most important lines for someone who was trained with the null hypothesis as an undergraduate many years ago:

The most important reason for these criticisms is that although one can calculate the probability of obtaining a finding given that the null hypothesis is true, this is not equivalent to calculating the probability that the null hypothesis is true given that one has obtained a finding. Thus, researchers are in the position of rejecting the null hypothesis even though they do not know its probability of being true (Cohen, 1994).

If you don’t know the probability of the null hypothesis, any conclusion you draw is on very shaky grounds.
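Bayes’s theorem makes the problem concrete. The numbers below are hypothetical, chosen only for illustration: suppose 10% of the hypotheses a field tests are actually true, tests have 80% power, and the significance threshold is the usual .05. Then even among results that pass p < .05, a substantial fraction are false positives:

```python
# Sketch: why p < .05 does not give you P(null hypothesis | data).
# All three numbers are assumptions for illustration, not from the editorial.
prior_h1 = 0.10   # fraction of tested hypotheses that are actually true
power    = 0.80   # P(significant result | effect is real)
alpha    = 0.05   # P(significant result | null is true) -- the "p < .05" rate

# Bayes' theorem: P(effect real | significant result)
p_sig = power * prior_h1 + alpha * (1 - prior_h1)
ppv = (power * prior_h1) / p_sig

print(f"P(significant)            = {p_sig:.3f}")   # 0.125
print(f"P(effect real | p < .05)  = {ppv:.3f}")     # 0.640
print(f"P(null true | p < .05)    = {1 - ppv:.3f}") # 0.360
```

With these (assumed) inputs, over a third of “significant” findings come from true nulls, even though every one of them cleared p < .05. The alpha level tells you nothing about that fraction; the prior does.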

Do you think any of the big data “shake-n-bake” mining/processing services are going to call that problem to your attention? True enough, such services may “empower” users but if “empowerment” means producing meaningless results, no thanks.

Trafimow cites Jacob Cohen’s The Earth is Round (p < .05) (1994) in his 2003 work. Cohen is angry and in full voice as only a senior scholar can afford to be.

Take the time to read both Trafimow and Cohen. Many statistical errors are lurking outside your door, and reading them will help you recognize this one.

Making Master Data Management Fun with Neo4j – Part 1, 2

February 27th, 2015

Making Master Data Management Fun with Neo4j – Part 1 by Brian Underwood.

From Part 1:

Joining multiple disparate data-sources, commonly dubbed Master Data Management (MDM), is usually not a fun exercise. I would like to show you how to use a graph database (Neo4j) and an interesting dataset (developer-oriented collaboration sites) to make MDM an enjoyable experience. This approach will allow you to quickly and sensibly merge data from different sources into a consistent picture and query across the data efficiently to answer your most pressing questions.

To start I’ll just be importing one data source: StackOverflow questions tagged with neo4j and their answers. In future blog posts I will discuss how to integrate other data sources into a single graph database to provide a richer view of the world of Neo4j developers’ online social interactions.

I’ve created a GraphGist to explore questions about the imported data, but in this post I’d like to briefly discuss the process of getting data from StackOverflow into Neo4j.

Part 1 imports data from Stack Overflow into Neo4j.
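The MDM-friendly trick in this kind of import is Cypher’s MERGE, which upserts nodes by key so the same user or tag arriving from different sources collapses into one node. A minimal sketch (not Brian Underwood’s actual code; the labels and property names are assumptions) of turning a StackOverflow question record into Cypher:

```python
# Sketch of generating idempotent Cypher MERGE statements from a
# StackOverflow question record. Labels/properties are illustrative.

def question_to_cypher(q):
    """Build a Cypher statement linking a question to its author and tags."""
    parts = [
        f"MERGE (u:User {{id: {q['owner_id']}}})",
        f"MERGE (q:Question {{id: {q['question_id']}}}) "
        f"SET q.title = '{q['title']}'",
        "MERGE (u)-[:ASKED]->(q)",
    ]
    for tag in q["tags"]:
        parts.append(f"MERGE (t:Tag {{name: '{tag}'}})")
        parts.append("MERGE (q)-[:TAGGED]->(t)")
    return "\n".join(parts)

sample = {
    "question_id": 1,
    "owner_id": 42,
    "title": "How do I model MDM in Neo4j?",
    "tags": ["neo4j", "cypher"],
}
print(question_to_cypher(sample))
```

Because MERGE matches on the key before creating, re-running the import (or importing GitHub data that mentions the same users) updates the existing graph instead of duplicating it, which is exactly the consistency property MDM needs.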

Making Master Data Management Fun with Neo4j – Part 2 imports Github data:

All together I was able to import:

  • 6,337 repositories
  • 6,232 users
  • 11,011 issues
  • 474 commits

In my next post I’ll show the process of how I linked the original StackOverflow data with the new GitHub data. Stay tuned for that, but in the meantime I’d also like to share the more technical details of what I did for those who are interested.

Definitely looking forward to seeing the reconciliation of data between StackOverflow and GitHub.

Data journalism: How to find stories in numbers

February 27th, 2015

Data journalism: How to find stories in numbers by Sandra Crucianelli.

From the post:

Colleagues often ask me what data journalism is. They’re confused by why it needs its own name — don’t all journalists use data?

The term is shorthand for ‘database journalism’ or ‘data-driven journalism’, where journalists find stories, or angles for stories, within large volumes of data.

It overlaps with investigative journalism in requiring lots of research, sometimes against people’s wishes. It can also overlap with data visualisation, as it requires close collaboration between journalists and digital specialists to find the best ways of presenting data.

So why get involved with spreadsheets and visualisation tools? At its most basic, adding data can give a story a new, factual dimension. But delving into datasets can also reveal new stories, or new aspects to them, that may not have otherwise surfaced.

Data journalism can also sometimes tell complicated stories more easily or clearly than relying on words alone — so it’s particularly useful for science journalists.

It can seem daunting if you’re trained in print or broadcast media. But I’ll introduce you to some new skills, and show you some excellent digital tools, so you too can soon find your feet as a data journalist.

Sandra gives as good an introduction to data journalism as you are likely to find. Her post covers everything from finding story ideas, researching relevant data, data processing and of course, presenting your findings in a persuasive way.

A must-read for beginning journalists, but also for anyone needing an introduction to looking at data that supports a story (or not).

Gregor Aisch – Information Visualization, Data Journalism and Interactive Graphics

February 26th, 2015

Gregor has two sites that I wanted to bring to your attention on information visualization, data journalism and interactive graphics.

The first collects graphics from New York Times stories created by Gregor and others. Impressive graphics. If you are looking for visualization ideas, it is not a bad place to stop.

The second is a blog that features Gregor’s work. But it is more than a blog; the navigation links at the top of the page offer:

Color – Posts on color.

Code – Posts focused on code.

Cartography – Posts on cartography.

Advice – Advice (not for the lovelorn).

Archive – Archive of his posts.

Rather than a long list of categories (ahem), Gregor has divided his material into easy-to-recognize, easy-to-use divisions.

Always nice when you see a professional at work!


Data Visualization with JavaScript

February 26th, 2015

Data Visualization with JavaScript by Stephen A. Thomas.

From the webpage:

It’s getting hard to ignore the importance of data in our lives. Data is critical to the largest social organizations in human history. It can affect even the least consequential of our everyday decisions. And its collection has widespread geopolitical implications. Yet it also seems to be getting easier to ignore the data itself. One estimate suggests that 99.5% of the data our systems collect goes to waste. No one ever analyzes it effectively.

Data visualization is a tool that addresses this gap.

Effective visualizations clarify; they transform collections of abstract artifacts (otherwise known as numbers) into shapes and forms that viewers quickly grasp and understand. The best visualizations, in fact, impart this understanding intuitively. Viewers comprehend the data immediately—without thinking. Such presentations free the viewer to more fully consider the implications of the data: the stories it tells, the insights it reveals, or even the warnings it offers. That, of course, defines the best kind of communication.

If you’re developing web sites or web applications today, there’s a good chance you have data to communicate, and that data may be begging for a good visualization. But how do you know what kind of visualization is appropriate? And, even more importantly, how do you actually create one? Answers to those very questions are the core of this book. In the chapters that follow, we explore dozens of different visualizations, techniques, and tool kits. Each example discusses the appropriateness of the visualization (and suggests possible alternatives) and provides step-by-step instructions for including the visualization in your own web pages.

With a publication date of March 2015, it’s hard to get any more current information on data visualization and JavaScript!

You can view the text online or buy a proper ebook/hard copy.


Structure and Interpretation of Computer Programs (LFE Edition)

February 26th, 2015

Structure and Interpretation of Computer Programs (LFE Edition)

From the webpage:

This Gitbook (available here) is a work in progress, converting the MIT classic Structure and Interpretation of Computer Programs to Lisp Flavored Erlang. We are forever indebted to Harold Abelson, Gerald Jay Sussman, and Julie Sussman for their labor of love and intelligence. Needless to say, our gratitude also extends to the MIT press for their generosity in licensing this work as Creative Commons.


This is a huge project, and we can use your help! Got an idea? Found a bug? Let us know!

Writing, or re-writing if you are transposing a CS classic into another language, is far harder than most people imagine. Probably even more difficult than the original because your range of creativity is bound by the organization and themes of the underlying text.

I may have some cycles to donate to proofreading. Anyone else?

Making A Mouse Seem Like A Dragon

February 26th, 2015

Ishaan Tharoor writes of a new edition of ‘Mein Kampf’ in What George Orwell said about Hitler’s ‘Mein Kampf’, saying in part:

But, in my view, the most poignant section of Orwell’s article dwells less on the underpinnings of Nazism and more on Hitler’s dictatorial style. Orwell gazes at the portrait of Hitler published in the edition he’s reviewing:

It is a pathetic, dog-like face, the face of a man suffering under intolerable wrongs. In a rather more manly way it reproduces the expression of innumerable pictures of Christ crucified, and there is little doubt that that is how Hitler sees himself. The initial, personal cause of his grievance against the universe can only be guessed at; but at any rate the grievance is here. He is the martyr, the victim, Prometheus chained to the rock, the self-sacrificing hero who fights single-handed against impossible odds. If he were killing a mouse he would know how to make it seem like a dragon. One feels, as with Napoleon, that he is fighting against destiny, that he can’t win, and yet that he somehow deserves to.

The line:

If he were killing a mouse he would know how to make it seem like a dragon.

is particularly appropriate in a time of defense budgets at all-time highs, restrictions on travel, social media, “homeland” a/k/a “fatherland” security, torture as an instrument of democratic governments, etc.

Where is this dragon that threatens us so? Multiple smallish bands of people with no country, no national industrial base, no navy, no air force, no armored divisions, no ICBMs, no nuclear weapons, no CBW, who are most skilled with knives and light arms.

How many terrorists? In How Many Terrorists Are There: Not As Many As You Might Think Becky Ackers does the math based on the helpful report from the U.S. Department of State, Country Reports on Terrorism.

Before I give you Becky’s total, which errs on the generous side of rounding up, know that the Department of Homeland Security alone already has them outnumbered.

Try 184,000.

Yep, just 184,000. Even big, bad “Al-Qa’ida (AQ)” and its three affiliates (“Al-Qa’ida in the Arabian Peninsula”; “Al-Qa’ida in Iraq”; and “Al-Qa’ida in the Islamic Maghreb”) boast only 4000 bad guys combined. (The main Al-Qa’ida’s “strength” is “impossible to estimate,” but the Reports admits that its “core has been seriously degraded” following “the death or arrest of dozens of mid- and senior-level AQ operatives.” “Dozens,” not “hundreds.” Hmmm.)

And remember, 184,000 is a ridiculously inflated figure – both because of our generous accounting and also because governments often expand a word’s meaning well beyond the dictionary’s. You may recall the Feds’ contending with straight faces in 2004 that if “a little old lady in Switzerland gave money to a charity for an Afghan orphanage, and the money was passed to al Qaeda,” she met the definition of “enemy combatant.” Five years later, a federal Fusion Center decreed that “if you’re an anti-abortion activist, or if you display political paraphernalia supporting a third-party candidate or [Ron Paul], if you possess subversive literature, you very well might be a member of a domestic paramilitary group.” No telling how many confused Swiss grandmothers and readers of Techdirt’s subversive articles cluster among those 184,000.

That number grows even more absurd when we compare it with the aforementioned Homeland Security’s 240,000 Warriors on Terror. Meanwhile, something like 780,000 cops stalk us nationwide, whose duties also encompass tilting at terrorism’s windmill. And that’s to say nothing of the scores of other bureaucracies at the national, state, and local levels hunting these same 184,000 guerrillas as well as an additional 1,368,137 troops from the armed forces [click on “Rank/Grade – current month”].

Even if you round the absurd number of terrorists up to 200,000 and round our total down to 2,000,000, at present the United States alone has the terrorists outnumbered 10 to 1. Now add in Europe, China, India, etc. and you get the idea that terrorists really are the mice of the world.
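The arithmetic behind that ratio, using only the figures quoted above, is easy to check:

```python
# Back-of-the-envelope check of the counts quoted above.
terrorists = 184_000      # Becky Ackers' (generous) worldwide total
dhs        = 240_000      # Homeland Security "Warriors on Terror"
police     = 780_000      # police nationwide
military   = 1_368_137    # U.S. armed forces

counter = dhs + police + military
print(f"Counter-terrorism side: {counter:,}")      # 2,388,137
print(f"Ratio: {counter / terrorists:.1f} to 1")   # ~13.0 to 1
```

Unrounded, the U.S. headcount alone outnumbers the inflated terrorist total nearly 13 to 1; the 10-to-1 figure is what survives after rounding against the argument.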

Personally I’m glad they are re-printing ‘Mein Kampf.’

Good opportunity to be reminded that leaders who are making dragons out of the mice of terrorism aren’t planning on sacrificing themselves, they are going to sacrifice us, each and every one.

Category Theory – Reading List

February 26th, 2015

Category Theory – Reading List by Peter Smith.

Notes along with pointers to other materials.

About Peter Smith:

These pages are by me, Peter Smith. I retired in 2011 from the Faculty of Philosophy at Cambridge. It was my greatest good fortune to have secure, decently paid, university posts for forty years in leisurely times, with a very great deal of freedom to follow my interests wherever they led. Like many of my generation, I am sure I didn’t at the time really appreciate just how lucky I and my contemporaries were. Some of the more student-orientated areas of this site, then, such as the Teach Yourself Logic Guide, constitute my small but heartfelt effort to give something back by way of thanks.

There is much to explore at Peter’s site beside his notes on category theory.