Archive for November, 2015

A Challenge to Data Scientists

Sunday, November 22nd, 2015

A Challenge to Data Scientists by Renee Teate.

From the post:

As data scientists, we are aware that bias exists in the world. We read up on stories about how cognitive biases can affect decision-making. We know that, for instance, a resume with a white-sounding name will receive a different response than the same resume with a black-sounding name, and that writers of performance reviews use different language to describe contributions by women and men in the workplace. We read stories in the news about ageism in healthcare and racism in mortgage lending.

Data scientists are problem solvers at heart, and we love our data and our algorithms that sometimes seem to work like magic, so we may be inclined to try to solve these problems stemming from human bias by turning the decisions over to machines. Most people seem to believe that machines are less biased and more pure in their decision-making – that the data tells the truth, that the machines won’t discriminate.

Renee’s post summarizes a lot of information about bias, inside and outside of data science, and issues this challenge:

Data scientists, I challenge you. I challenge you to figure out how to make the systems you design as fair as possible.

An admirable sentiment but one hard part is defining “…as fair as possible.”

Being professionally trained in a day to day “hermeneutic of suspicion,” as opposed to Paul Ricoeur‘s analysis of texts (Paul Ricoeur and the Hermeneutics of Suspicion: A Brief Overview and Critique by G.D. Robinson.), I have yet to encounter a definition of “fair” that does not define winners and losers.

Data science relies on classification, which has as its avowed purpose the separation of items into different categories. Some categories will be treated differently than others. Otherwise there would be no reason to perform the classification.

Another hard part is that employers of data scientists are more likely to say:

Analyze data X for market segments responding to ad campaign Y.

As opposed to:

What do you think about our ads targeting tweens by the use of sexual-content for our unhealthy product A?

Or change the questions to fit those asked of data scientists at any government intelligence agency.

The vast majority of data scientists are hired as data scientists, not amateur theologians.

Competence in data science has no demonstrable relationship to competence in ethics, fairness, morality, etc. Data scientists can have opinions about the same but shouldn’t presume to poach on other areas of expertise.

How would you feel if a competent user of spreadsheets decided to label themselves a “data scientist”?

Keep that in mind the next time someone starts to pontificate on “ethics” in data science.

PS: Renee is in the process of creating and assembling high quality resources for anyone interested in data science. Be sure to explore her blog and other links after reading her post.

Manufacturing Terror

Saturday, November 21st, 2015

Manufacturing Terror: An FBI Informant Seduced Eric McDavid Into a Bomb Plot. Then the Government Lied About It by Trevor Aaronson and Katie Galloway.

From the post:

Anna would go on to lead McDavid and two other activists in their 20s in a loose plot to bomb targets in Northern California. Maybe in the name of the Earth Liberation Front. Or maybe not. Fitting for the muddied plot, their motivation was as unclear as their targets. Anna, at the direction of the FBI, made the entire plot possible — providing the transportation, money, and a cabin in the woods that the FBI had wired up with hidden cameras. Anna even provided the recipe for homemade explosives, drawn up by FBI bomb experts. Members of the group suggested, in conversations with her, that they regarded her as their leader.
At trial, McDavid’s lawyer, Mark Reichel, argued that the FBI had used Anna to lure McDavid into a terrorism conspiracy through the promise of a sexual relationship once the mission was complete. “That’s inducement,” Reichel told the federal jury. “That’s entrapment.” The jurors weren’t persuaded, however. In 2007, McDavid was convicted of conspiring to use fire or explosives to damage corporate and government property, and he was sentenced to nearly 20 years in prison, one of the longest sentences given to an alleged eco-terrorist in the United States. At the time of his conviction, the FBI had built a network of more than 15,000 informants like Anna and the government had classified eco-terrorism as the No. 1 domestic terrorism threat — even though so-called eco-terrorism crimes in the United States were rare and never fatal.

Seven years after his conviction, the government’s deceit was finally revealed. Last November, federal prosecutors admitted they had potentially violated rules of evidence by withholding approximately 2,500 pages of documents from McDavid. Among the belatedly disclosed documents were love letters between Anna and McDavid and evidence that Anna’s handler, Special Agent Ricardo Torres, had quashed the FBI’s request to put Anna through a polygraph test, commonly used by the FBI to ensure informants aren’t lying to agents as they collect evidence. The new documents also revealed which of the letters and emails the FBI’s Behavioral Analysis Unit had reviewed before offering instructions on how to manipulate McDavid and guide him toward a terrorist conspiracy.

McDavid was released earlier this year as part of an unusual settlement: He agreed to plead guilty to a lesser charge of general conspiracy in exchange for his immediate release. Yet when his lawyers demanded to know why the government had withheld evidence that had been specifically requested before trial, the government made a veiled threat to throw McDavid back into prison for violating the terms of his plea agreement.

The full story is much longer and makes a great read, as well as a good holiday discussion topic.

This is another example of why I advocate a leak upon possession policy.

Whatever protest a government official may make, they may even be telling the truth as known to them, but it doesn’t mean the government isn’t lying to them and via them to the public.

The only way to combat systemic and widespread deception by government is for citizens to obtain concealed information and to leak it for use by other citizens.

Leaking Classified Information

Saturday, November 21st, 2015

I saw a tweet recently extolling the number of classified documents that could have been obtained.

Not obtaining and/or leaking classified documents of any government denies the public information it can use.

Two suggestions:

If you can obtain classified information, do.

If you have classified information, leak it in its entirety.

Before some ambitious assistant US attorney decides I am advocating illegal activity, recall that some leaks of classified information are in fact authorized by the executive branch of the United States government. Read All Leaks Are Illegal, but Some Leaks Are More Illegal Than Others by Conor Friedersdorf for some example cases.

Classification is used to conceal embarrassing information or failures. No government has a right to conceal embarrassing information or failures.

Agonizing over what to leak creates power for those who hold a government’s leaked information. Do you see yourself as that petty and vain?

Just leak it. Let the chips fall where they may.

The history of leaking is on the side of no harm to anyone.

Start with the Pentagon Papers (U.S. Archives), Watergate at 40, Public Library of US Diplomacy, which also includes Cablegate, the Kissinger cables and Carter cables parts 1 and 2, Afghan War Diaries, the Snowden leaks and count the bodies.

So far, I’ve got nothing. Zero. The empty set.

Over forty years of leaking and no bodies. If there was even one, it would be front and center at every leak story.

Doesn’t that tell you something about the truthfulness of government objections to leaks?

Committee Work (humor, maybe)

Friday, November 20th, 2015

Code Monkey Hate Bug tweets:

Is it inevitable that committee designs end up looking like this?

It isn’t statistically inevitable that committee designs have this result.

However, the history of the U.S. Congress indicates the odds of a different outcome are extremely low.

Four free online plagiarism checkers

Friday, November 20th, 2015

Four free online plagiarism checkers

From the post:

“Detecting duplicate content online has become so easy that spot-the-plagiarist is almost a party game,” former IJNet editor Nicole Martinelli wrote in 2012. “It’s no joke, however, for news organizations who discover they have published copycat content.”

When IJNet first ran Martinelli’s post, “Five free online plagiarism checkers,” two prominent U.S. journalists had recently been caught in the act: Fareed Zakaria and Jonah Lehrer.

Following acknowledgement that he had plagiarized sections of an article about gun control, Time and CNN suspended Zakaria. Lehrer first came under scrutiny for “self-plagiarism” at The New Yorker. Later, a journalist revealed Lehrer also fabricated or changed quotes attributed to Bob Dylan in his book, “Imagine.”

To date, Martinelli’s list of free plagiarism checkers has been one of IJNet’s most popular articles across all languages. It’s clear readers want to avoid the pitfalls of plagiarism, so we’ve updated the post with four of the best free online plagiarism checkers available to anyone, revised for 2015:

Great resource for checking your content and that of others for plagiarism.

The one caveat I offer is to not limit the use of text similarity software solely to plagiarism.

Text similarity can be a test for finding content that you would not otherwise discover. It depends on how high you set the threshold for “similarity.”

It may also find content that is so similar, while not plagiarism (say, multiple outlets writing from the same wire service copy), that it isn’t worth the effort to read every story that repeats the same material with minor edits.

Multiple stories but only one wire service source. In that sense, a “plagiarism” checker can enable you to skip duplicative content.
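As a toy illustration of using similarity scoring to skip duplicative coverage (the sample texts and the 0.8 threshold are my own inventions, and the standard library’s difflib is a crude stand-in for real similarity measures):

```python
from difflib import SequenceMatcher

wire = "The central bank raised interest rates by a quarter point on Tuesday."
outlet_a = "The central bank raised interest rates by a quarter point on Tuesday, officials said."
unrelated = "Local team wins championship after dramatic overtime finish."

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the two texts are identical.
    return SequenceMatcher(None, a, b).ratio()

# A high threshold flags near-duplicates worth skipping;
# lowering it surfaces loosely related coverage instead.
THRESHOLD = 0.8
print(similarity(wire, outlet_a) > THRESHOLD)   # near-duplicate: True
print(similarity(wire, unrelated) > THRESHOLD)  # unrelated story: False
```

Tuning the threshold is the whole game: set it high to deduplicate wire copy, lower it to discover related content you would not otherwise find.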

The post I quote above was published by the International Journalists’ Network (IJNet). Even if you aren’t a journalist, it is a great source to follow for developing news technology.

The History of SQL Injection…

Friday, November 20th, 2015

The History of SQL Injection, the Hack That Will Never Go Away by Joseph Cox.

From the post:

One of the hackers suspected of being behind the TalkTalk breach, which led to the personal details of at least 150,000 people being stolen, used a vulnerability discovered two years before he was even born.

That method of attack was SQL injection (SQLi), where hackers typically enter malicious commands into forms on a website to make it churn out juicy bits of data. It’s been used to steal the personal details of World Health Organization employees, grab data from the Wall Street Journal, and hit the sites of US federal agencies.

“It’s the most easy way to hack,” the pseudonymous hacker w0rm, who was responsible for the Wall Street Journal hack, told Motherboard. The attack took only a “few hours.”

But, for all its simplicity, as well as its effectiveness at siphoning the digital innards of corporations and governments alike, SQLi is relatively easy to defend against.

So why, in 2015, is SQLi still leading to some of the biggest breaches around?

SQLi was possibly first documented by Jeff Forristal in the hacker zine Phrack. Back then, Forristal went by the handle rain.forest.puppy, but he’s now CTO of mobile security at cybersecurity vendor Bluebox security.
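The defense the post alludes to is parameterized queries. A minimal sketch in Python with sqlite3 (the toy table, data, and payload are my own) shows both the hole and the fix:

```python
import sqlite3

# In-memory database with a toy users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")

def login_vulnerable(name):
    # Classic SQLi: attacker-controlled input is pasted into the query text.
    query = "SELECT name FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()

def login_safe(name):
    # Parameterized query: the driver treats the input as data, not SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(login_vulnerable(payload))  # returns every row: [('alice',)]
print(login_safe(payload))        # returns nothing: []
```

The vulnerable version turns the payload into `WHERE name = '' OR '1'='1'`, which is true for every row; the parameterized version looks for a user literally named `' OR '1'='1` and finds none.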

Joseph’s history is another data point for the proposition:

To a vendor, your security falls under “…not my problem.”

Android Smartphone+ for Christmas?

Friday, November 20th, 2015

I say Android Smartphone+ because Swati Khandelwal reports it’s a gift that keeps on giving.

This Malware Can Secretly Auto-Install any Android App to Your Phone

From the post:

Own an Android Smartphone?

Hackers can install any malicious third-party app on your smartphone remotely even if you have clearly tapped a reject button of the app.

Security researchers have uncovered a trojanized adware family that has the capability to automatically install any app on an Android device by abusing the operating system’s accessibility features.

Swati has a video of this remote installation in action. This is not a theoretical hack.

Full Disclosure: I don’t have an iPhone either.

Infinite Dimensional Word Embeddings [Variable Representation, Death to Triples]

Thursday, November 19th, 2015

Infinite Dimensional Word Embeddings by Eric Nalisnick and Sachin Ravi.


We describe a method for learning word embeddings with stochastic dimensionality. Our Infinite Skip-Gram (iSG) model specifies an energy-based joint distribution over a word vector, a context vector, and their dimensionality, which can be defined over a countably infinite domain by employing the same techniques used to make the Infinite Restricted Boltzmann Machine (Cote & Larochelle, 2015) tractable. We find that the distribution over embedding dimensionality for a given word is highly interpretable and leads to an elegant probabilistic mechanism for word sense induction. We show qualitatively and quantitatively that the iSG produces parameter-efficient representations that are robust to language’s inherent ambiguity.

Even better from the introduction:

To better capture the semantic variability of words, we propose a novel embedding method that produces vectors with stochastic dimensionality. By employing the same mathematical tools that allow the definition of an Infinite Restricted Boltzmann Machine (Côté & Larochelle, 2015), we describe a log-bilinear energy-based model–called the Infinite Skip-Gram (iSG) model–that defines a joint distribution over a word vector, a context vector, and their dimensionality, which has a countably infinite domain. During training, the iSGM allows word representations to grow naturally based on how well they can predict their context. This behavior enables the vectors of specific words to use few dimensions and the vectors of vague words to elongate as needed. Manual and experimental analysis reveals this dynamic representation elegantly captures specificity, polysemy, and homonymy without explicit definition of such concepts within the model. As far as we are aware, this is the first word embedding method that allows representation dimensionality to be variable and exhibit data-dependent growth.
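The growth mechanism can be caricatured in a few lines of Python. This is a deliberately crude illustration of data-dependent dimensionality, not the iSG model; the threshold, the error values, and the random initialization are all invented:

```python
import random

random.seed(0)

# Toy sketch: each word's vector grows a dimension whenever its current
# representation predicts context poorly. Vague/ambiguous words accumulate
# dimensions; specific words stay short.

embeddings = {}  # word -> list of floats; lengths may differ per word

def vector(word):
    return embeddings.setdefault(word, [random.uniform(-1, 1)])

def grow_if_needed(word, prediction_error, threshold=0.5):
    if prediction_error > threshold:
        vector(word).append(random.uniform(-1, 1))

grow_if_needed("bank", 0.9)      # ambiguous word: high error, grows
grow_if_needed("bank", 0.8)      # still ambiguous, grows again
grow_if_needed("aardvark", 0.1)  # specific word: low error, stays small

print(len(vector("bank")))       # 3
print(len(vector("aardvark")))   # 1
```

The iSG learns this growth from an energy-based objective rather than a hand-set threshold, but the qualitative behavior is the same: representation size follows how hard the word is to pin down.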

Imagine a topic map model that “allow[ed] representation dimensionality to be variable and exhibit data-dependent growth.”

Simple subjects can have simple representations.

More complex subjects, say the notion of “person” in U.S. statutory law (no, I won’t attempt to list them here), can extend their dimensional representations as far as necessary.

Of course in this case, the dimensions are learned from a corpus but I don’t see any barrier to the intentional creation of dimensions for subjects and/or a combined automatic/directed creation of dimensions.

Or as I put it in the title, Death to All Triples.

More precisely, not just triples but any pre-determined limit on representation.

Looking forward to taking a slow read on this article and those it cites. Very promising.

In the realm of verification, context is king

Thursday, November 19th, 2015

In the realm of verification, context is king by Fergus Bell.

From the post:

By thinking about the wider context around shared UGC you can often avoid a lengthy forensic verification process where it isn’t required. For publishers looking at how they tackle competition with platforms – it is easy. Context is where you can make a distinction through strong editorial work and storytelling.

Fergus has four quick tips that will help you fashion a context for user-generated content (UGC).

Content always has a context. If you don’t supply one, consumers will invent a context for your content. (They may anyway but you can at least take the first shot at it.)

It is interesting that user-generated content (UGC) isn’t held in high regard, yet news outlets parrot the latest rantings of elected officials and public figures as gospel.

When public statements are false, such as suggesting that Syrian refugees pose a danger of terrorism, why aren’t those statements simply ignored? Why mis-inform the public?

Stop Comparing JSON and XML

Thursday, November 19th, 2015

Stop Comparing JSON and XML by Yegor Bugayenko.

From the post:

JSON or XML? Which one is better? Which one is faster? Which one should I use in my next project? Stop it! These things are not comparable. It’s similar to comparing a bicycle and an AMG S65. Seriously, which one is better? They both can take you from home to the office, right? In some cases, a bicycle will do it better. But does that mean they can be compared to each other? The same applies here with JSON and XML. They are very different things with their own areas of applicability.

Yegor follows that time-honored Web tradition of telling people, who aren’t listening, why they should follow his advice.


If nothing else, circulate this around the office to get everyone’s blood pumping this late in the week.

I would amend Yegor’s headline to read: Stop Comparing JSON and XML Online!

As long as your discussions don’t gum up email lists, news feeds, or Twitter, have at it.


XSL Transformations (XSLT) Version 3.0 [Comments by 31 March 2016]

Thursday, November 19th, 2015

XSL Transformations (XSLT) Version 3.0


This specification defines the syntax and semantics of XSLT 3.0, a language designed primarily for transforming XML documents into other XML documents.

XSLT 3.0 is a revised version of the XSLT 2.0 Recommendation [XSLT 2.0] published on 23 January 2007.

The primary purpose of the changes in this version of the language is to enable transformations to be performed in streaming mode, where neither the source document nor the result document is ever held in memory in its entirety. Another important aim is to improve the modularity of large stylesheets, allowing stylesheets to be developed from independently-developed components with a high level of software engineering robustness.

XSLT 3.0 is designed to be used in conjunction with XPath 3.0, which is defined in [XPath 3.0]. XSLT shares the same data model as XPath 3.0, which is defined in [XDM 3.0], and it uses the library of functions and operators defined in [Functions and Operators 3.0]. XPath 3.0 and the underlying function library introduce a number of enhancements, for example the availability of higher-order functions.

As an implementer option, XSLT 3.0 can also be used with XPath 3.1. All XSLT 3.0 processors provide maps, an addition to the data model which is specified (identically) in both XSLT 3.0 and XPath 3.1. Other features from XPath 3.1, such as arrays, and new functions such as random-number-generatorFO31 and sortFO31, are available in XSLT 3.0 stylesheets only if the implementer chooses to support XPath 3.1.

Some of the functions that were previously defined in the XSLT 2.0 specification, such as the format-dateFO30 and format-numberFO30 functions, are now defined in the standard function library to make them available to other host languages.

XSLT 3.0 also includes optional facilities to serialize the results of a transformation, by means of an interface to the serialization component described in [XSLT and XQuery Serialization]. Again, the new serialization capabilities of [XSLT and XQuery Serialization 3.1] are available at the implementer’s option.

This document contains hyperlinks to specific sections or definitions within other documents in this family of specifications. These links are indicated visually by a superscript identifying the target specification: for example XP30 for XPath 3.0, DM30 for the XDM data model version 3.0, FO30 for Functions and Operators version 3.0.

Comments are due by 31 March 2016.

That may sound like a long time for comments but it is shorter than you might think.

It is a long document and standards are never an “easy” read.

Fortunately it is cold weather, or about to be, in many parts of the world, with holidays rapidly approaching. Some extra time to curl up with XSL Transformations (XSLT) Version 3.0 and its related documents for a slow read.

Something I have never done before that I plan to attempt with this draft is running the test cases, almost 11,000 of them. I’m not an implementer but being more familiar with the test cases will improve my understanding of new features in XSLT 3.0.

Comment early and often!


Knowing the Name of Something vs. Knowing How To Identify Something

Wednesday, November 18th, 2015

Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something

From the post:

In this short clip (below), Feynman articulates the difference between knowing the name of something and understanding it.

See that bird? It’s a brown-throated thrush, but in Germany it’s called a halzenfugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird. You only know something about people; what they call the bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way.

Knowing the name of something doesn’t mean you understand it. We talk in fact-deficient, obfuscating generalities to cover up our lack of understanding.

You won’t get to see the Feynman quote live because it has been blocked by BBC Worldwide on copyright grounds. No doubt they make a bag full of money every week off that 179-second clip of Feynman.

The stronger point for Feynman would be that you can’t recognize anything solely on the basis of knowing its name.

I may be sitting next to Cindy Lou Who on the bus but knowing her name isn’t going to help me to recognize her.

Knowing the name of someone or something isn’t useful unless you know something about the person or thing you associate with a name.

That is, you know when it is appropriate to use the name you have learned and when to say: “Sorry, I don’t know your name or the name of (indicating in some manner).” At which point you will learn a new name and store a new set of properties that tell you when to use that name, instead of any other name you know.

Everyone does that exercise, learning new names and the properties that establish when it is appropriate to use a particular name. And we do so seamlessly.

So seamlessly that when called upon to make explicit “how” we know which name to use, subject identification in other words, it takes a lot of effort.

It’s enough effort that it should be done only when necessary and when we can show the user an immediate semantic ROI for their effort.
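The difference between a bare name and a set of identifying properties can be caricatured in code. This is a toy illustration, not a topic map engine; the records and property keys are invented:

```python
# Toy subject identification: a name alone is ambiguous; a set of
# identifying properties lets us decide when two references point
# at the same subject.

def same_subject(a, b, keys=("name", "birthplace", "occupation")):
    return all(a.get(k) == b.get(k) for k in keys)

ref1 = {"name": "Cindy Lou Who", "birthplace": "Whoville", "occupation": "singer"}
ref2 = {"name": "Cindy Lou Who", "birthplace": "Whoville", "occupation": "singer"}
ref3 = {"name": "Cindy Lou Who", "birthplace": "Atlanta",  "occupation": "lawyer"}

print(same_subject(ref1, ref2))  # True: same name AND same properties
print(same_subject(ref1, ref3))  # False: same name, different subject
```

Choosing which properties count as identifying is exactly the “explicit subject identification” effort described above, which is why it should only be spent where the semantic ROI justifies it.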

More on this to follow.

State of Georgia Mails Out 6 Million+ SSNs, Birthdays, etc.

Wednesday, November 18th, 2015

In the race to be the most cyberinsecure state government, the Georgia Secretary of State sent out 6 million voter records that included social security numbers and birth dates, along with other information about Georgia voters.

Unlike the Paris attack reporting, all of the foregoing has been verified and even admitted by the Secretary of State’s office.

Georgia: ‘Clerical error’ in data breach involving 6 million voters by Kristina Torres reports:

Two Georgia women have filed a class action lawsuit alleging a massive data breach by Secretary of State Brian Kemp involving the Social Security numbers and other private information of more than six million voters statewide.

The suit, filed Tuesday in Fulton County Superior Court, alleges Kemp’s office released the information including personal identifying information to the media, political parties and other paying subscribers who legally buy voter information from the state.

In response, Kemp’s office blamed a “clerical error” and said Wednesday afternoon that they did not consider it to be a breach of its system. It said 12 organizations, including statewide political parties, news media organizations and Georgia GunOwner Magazine, received the file.

So a “clerical error” doesn’t count as a data breach?

Given that even a sanity check for file size didn’t prevent this breach, leak, or clerical error, I have to wonder why they are so certain about the number of organizations that received the file.

And who they may have shared it with since October of 2015?

That’s the other odd fact. The file was sent in October but it takes someone filing a lawsuit in mid-November for the breach, leak, or clerical error to come to light?

How’s your state government’s security?

PS: The case details (but not the pleadings) can be found at:

Christopher Meiklejohn – Doctoral Thesis Proposal

Wednesday, November 18th, 2015

Christopher Meiklejohn – Doctoral Thesis Proposal.

From the proposal:

The goal of this research is to provide a declarative way to design distributed, fault-tolerant applications that do not contain observable nondeterminism. These applications should be able to be placed at arbitrary locations in the network: mobile devices, “Internet of Things” hardware, or personal computers. Applications should be tolerant to arbitrary message delays, duplication and reordering: these are first-class requirements of distributed computations over unreliable networks. When writing these applications, developers should not have to use traditional concurrency control or synchronization mechanisms such as mutexes, semaphores, or monitors: the primitive operations for composition in the language should yield “deterministic-by-construction” applications.
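A standard CRDT such as a grow-only counter gives the flavor of “deterministic-by-construction”: replicas converge to the same state no matter how messages are delayed, duplicated, or reordered, with no locks or monitors. This is a generic textbook example, not code from the proposal:

```python
# Grow-only counter (G-Counter): each replica tracks per-node counts,
# and merging takes per-node maxima. Merge is commutative, associative,
# and idempotent, so duplicated or reordered updates converge.

def increment(state, node):
    state = dict(state)
    state[node] = state.get(node, 0) + 1
    return state

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def value(state):
    return sum(state.values())

r1 = increment({}, "node1")                       # {'node1': 1}
r2 = increment(increment({}, "node2"), "node2")   # {'node2': 2}

# Merging in either order, even repeatedly, yields the same result.
assert merge(r1, r2) == merge(r2, r1) == merge(merge(r1, r2), r2)
print(value(merge(r1, r2)))  # 3
```

The research question in the proposal is how to get this convergence property for whole applications by construction, rather than for individual data types one at a time.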

Christopher is looking for comments on his doctoral thesis proposal.

His proposal is dated November 11, 2015, so time remains for you to review the proposal and make comments.

It would be really nice if the community that will benefit from Christopher’s work would contribute some comments on it.

Antidote to Network News Reporting

Wednesday, November 18th, 2015

Public Trust Through Public Access to CRS Reports by Rep. Mike Quigley.

Rep. Quigley addresses Congress to urge support for House Resolution 34 saying in part:

When the average American wants to learn about a policy, where do they turn for information?

Often, the answer is the 24-hour news cycle. Often filled by talking heads and sensationalism,

Or social media and message boards, where anyone can post anything – credible or completely misinformed.

The American public is no longer being informed by the likes of Cronkite and Murrow, and it is making our public debate increasingly partisan, polarized and misinformed.

What few realize, or like to admit, is that there is a way Congress can help elevate the debate and educate our constituents with neutral, unbiased, non-partisan information from the Congressional Research Service, or CRS.

For over 100 years, CRS has served as Congress’ publicly-funded think tank.

Because they serve policy-makers on both sides of the aisle, CRS researchers produce exemplary work that is accurate, non-partisan, and easy to understand.

Despite the fact that CRS receives over $100 million from taxpayers each year, its reports are not made available to the public.

Instead, constituents must request individual reports through a Congressional office.

Rep. Quigley goes on to make several public policy point in favor of House Resolution 34 but he had me at:

  1. Citizens pay for it.
  2. Citizens can’t access it online.

Citizens of the United States are paying for some of the best research in the world but can’t access it online.

That is wrong on so many levels that I don’t think it needs much discussion or debate.

All U.S. citizens need to contact their representative to urge support for House Resolution 34.


PS: Congressional Research Service (CRS) reports don’t look like coiffed news anchors but then you won’t find rank speculation, rumor and falsehoods reported as facts. It’s a trade-off I’m willing to make.

Paris: The Power of Unencrypted Vanilla SMS (Network News: You are now dumber…)

Wednesday, November 18th, 2015

After Endless Demonization Of Encryption, Police Find Paris Attackers Coordinated Via Unencrypted SMS by Karl Bode.

From the post:

In the wake of the tragic events in Paris last week encryption has continued to be a useful bogeyman for those with a voracious appetite for surveillance expansion. Like clockwork, numerous reports were quickly circulated suggesting that the terrorists used incredibly sophisticated encryption techniques, despite no evidence by investigators that this was the case. These reports varied in the amount of hallucination involved, the New York Times even having to pull one such report offline. Other claims the attackers had used encrypted Playstation 4 communications also wound up being bunk.

Yet pushed by their sources in the government, the media quickly became a sound wall of noise suggesting that encryption was hampering the government’s ability to stop these kinds of attacks. NBC was particularly breathless this week over the idea that ISIS was now running a 24 hour help desk aimed at helping its less technically proficient members understand encryption (even cults help each other use technology, who knew?). All of the reports had one central, underlying drum beat implication: Edward Snowden and encryption have made us less safe, and if you disagree the blood is on your hands.

You have heard that cybersecurity is too hard for most users?

Apparently cybersecurity is too hard for most terrorists too.

Perhaps we can gauge the progress of terrorist use of encryption by adoption of the same by the OPM?

Another consequence of the Paris attacks is more evidence for the proposition:

Network News: You are now dumber for having heard it.

There was no reason to speculate about how the attackers communicated with each other. Waiting for facts from the police investigation wasn’t going to harm the victims further.

Reporting facts about the Paris attack could have advanced public discussion of the attacks.

We will never know due to the network news generated cloud of mistakes, falsehoods and speculation around such events.

Update: See: Too little too late: The horror of Paris proves the media need to debunk rumours in real time by Claire Wardle.

A delightful piece on how fact-checking in real time isn’t all that difficult. Makes you wonder about the “value-add” of news reporting that doesn’t.

Follow First Draft on Twitter for more coverage on junk news and efforts to stem it.

As I said yesterday in Lies, Damn Lies, and Viral Content [I Know a Windmill When I See One]:

What journalism needs is pro-active readers to rebel against superficial, inaccurate and misleading reporting. Voting with their feet will be far more effective than exhortations to do better.

Unless and until there is economic pain from bad reporting, it is going to continue.

Conference Videos for the Holidays

Wednesday, November 18th, 2015

As you know, I saw Alexander Songe’s “CRDT: Datatype for the Apocalypse” presentation earlier today.

With holidays approaching next week, November 23rd-27th, 2015 in the United States, I thought some of you may need additional high quality video references.

Clojure TV

Elixir Conf 2014.

Elixir Conf 2015

Erlang Solutions



No slight intended for any conference videos I didn’t list. I will list different conference videos for the next holiday list, which will appear in December 2015.


PS: I have to apologize for the poor curating of videos by their hosts. With only a little more effort, these videos could be a valuable day-to-day resource.

On Teaching XQuery to Digital Humanists [Lesson in Immediate ROI]

Wednesday, November 18th, 2015

On Teaching XQuery to Digital Humanists by Clifford B. Anderson.

A paper presented at Balisage 2014 but still a great read for today. In particular where Clifford makes the case for teaching XQuery to humanists:

Making the Case for XQuery

I may as well state upfront that I regard XQuery as a fantastic language for digital humanists. If you are involved in marking up documents in XML, then learning XQuery will pay long-term dividends. I do have arguments for this bit of bravado. My reasons for lifting up XQuery as a programming language of particular interest to digital humanists are essentially three:

  • XQuery is domain-appropriate for digital humanists.

Let’s take each of these points in turn.

First, XQuery fits the domain of digital humanists. Admittedly, I am focusing here on a particular area of the digital humanities, namely the domain of digital text editing and analysis. In that domain, however, XQuery proves a nearly perfect match to the needs of digital humanists.

If you scour the online communities related to digital humanities, you will repeatedly find conversations about which programming languages to learn. Predictably, the advice is all over the map. PHP is easy to learn, readily accessible, and the language of many popular projects in the digital humanities such as Omeka and Neatline. Javascript is another obvious choice given its ubiquity. Others recommend Python or Ruby. At the margins, you’ll find the statistically-inclined recommending R. There are pluses and minuses to learning any of these languages. When you are working with XML, however, they all fall short. Inevitably, working with XML in these languages will require learning how to use packages to read XML and convert it to other formats.

Learning XQuery eliminates any impedance between data and code. There is no need to import any special packages to work with XML. Rather, you can proceed smoothly from teaching XML basics to showing how to navigate XML documents with XPath to querying XML with XQuery. You do not need to jump out of context to teach students about classes, objects, tables, or anything as awful-sounding as “shredding” XML documents or storing them as “blobs.” XQuery makes it possible for students to become productive without having to learn as many computer science or software engineering concepts. A simple four or five line FLWOR expression can easily demonstrate the power of XQuery and provide a basis for students’ tinkering and exploration. (emphasis added)

I commend the rest of the paper to you for reading but Clifford’s first point nails why learn XQuery for humanists and others.

The part I highlighted above sums it up:

XQuery makes it possible for students to become productive without having to learn as many computer science or software engineering concepts. A simple four or five line FLWOR expression can easily demonstrate the power of XQuery and provide a basis for students’ tinkering and exploration. (emphasis added)

Whether you are a student, scholar or even a type-A business type, what do you want?

To get sh*t done!

A few of us like tinkering with edge cases, proofs, theorems and automata, but having the needed output on time or sooner really makes the day for most folks.

A handful of XQuery expressions will open up XML-encoded data for your custom exploration. You can see an immediate ROI on the time you spend learning XQuery, which will prompt you to learn more.
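To make the comparison concrete, here is what a simple FLWOR-style query ("for each poem by Blake, return its title") looks like in Python with the standard library's ElementTree. This is my own sketch, with a made-up miniature document, and it illustrates Anderson's point: even the simplest XML query in a general-purpose language pulls in a package and its idioms.

```python
# A rough Python analogue of a small FLWOR query over a tiny,
# invented anthology document. In XQuery this would be roughly:
#   for $p in //poem where $p/author = 'Blake' return $p/title
import xml.etree.ElementTree as ET

doc = """
<anthology>
  <poem><title>The Tyger</title><author>Blake</author></poem>
  <poem><title>Ozymandias</title><author>Shelley</author></poem>
</anthology>
"""

root = ET.fromstring(doc)

# Filter poems by author, then project out the titles.
titles = [p.findtext("title")
          for p in root.findall("poem")
          if p.findtext("author") == "Blake"]

print(titles)  # ['The Tyger']
```

The logic is short, but note that you already need to know about parsing, element trees, and a find/findtext API before writing it; in XQuery the document model is the native data model.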

Think of learning XQuery as a step towards user independence. Independence from the choices made by unseen and unknown programmers.

Are you ready to take that step?

A Timeline of Terrorism Warning: Incomplete Data

Wednesday, November 18th, 2015

A Timeline of Terrorism by Trevor Martin.

From the post:

The recent terrorist attacks in Paris have unfortunately once again brought terrorism to the front of many people’s minds. While thinking about these attacks and what they mean in a broad historical context I’ve been curious about if terrorism really is more prevalent today (as it feels), and if data on terrorism throughout history can offer us perspective on the terrorism of today.

In particular:

  • Have incidents of terrorism been increasing over time?
  • Does the amount of attacks vary with the time of year?
  • What type of attack and what type of target are most common?
  • Are the terrorist groups committing attacks the same over decades long time scales?

In order to perform this analysis I’m using a comprehensive data set on 141,070 terrorist attacks from 1970-2014 compiled by START.

Trevor writes a very good post, and the visualizations are ones that you will find useful for this and other data.
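The first of Trevor's questions, whether incidents have increased over time, reduces to a per-year count over the data set. A minimal sketch with only the standard library, assuming a CSV export with a year column named `iyear` (as in the START codebook; adjust to the actual file you download):

```python
# Count terrorist incidents per year from a CSV export.
# The three-row sample below is invented for illustration only.
import csv
import io
from collections import Counter

sample = io.StringIO(
    "iyear,attacktype\n"
    "1970,Bombing\n"
    "1970,Assassination\n"
    "2014,Bombing\n"
)

by_year = Counter(row["iyear"] for row in csv.DictReader(sample))
print(sorted(by_year.items()))  # [('1970', 2), ('2014', 1)]
```

The same counting code would, of course, only reflect whatever inclusion criteria the data set imposes, which is exactly the complaint below.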

However, there is a major incompleteness in Trevor’s data. If you follow the link for “comprehensive data set” and the FAQ you find there, you will find excluded from this data set:

Criterion III: The action must be outside the context of legitimate warfare activities.

So that excludes the equivalent of five Hiroshimas dropped on rural Cambodia (1969-1973), the first and second Iraq wars, the invasion of Afghanistan, and numerous other acts of terrorism carried out with cruise missiles and drones, all by the United States, to say nothing of the atrocities committed by Russia against a variety of opponents and other governments since 1970.

Depending on how you count separate acts, I would say the comprehensive data set is short by several orders of magnitude in accounting for all the acts of terrorism between 1970 to 2014.

If that additional data were added to the data set, I suspect (don’t know because the data set is incomplete) that who is responsible for more deaths and more terror would have a quite different result from that offered by Trevor.

So I don’t just idly complain, I will contact the United States Air Force to see if there are public records on how many bombing missions and how many bombs were dropped on Cambodia and in subsequent campaigns. That could be a very interesting data set all on its own.

CRDT: Datatype for the Apocalypse

Wednesday, November 18th, 2015

CRDT: Datatype for the Apocalypse by Alexander Songe.

From the description:

Conflict-free Replicated Data Types (CRDTs) are a hot new datatype in distributed programming that promise coordination-free and always eventually-consistent writes. Most of the stuff you will read about them is limited to the context of high-end distributed databases, but CRDT’s are more flexible and shouldn’t be limited to this field. In the talk, I will demonstrate how CRDT’s are great for applications that have partial connectivity (including websites): updates can be applied locally, and when communication is possible, you can send the data back up, and the data will remain consistent even in the most byzantine (or apocalyptic) scenarios. There are even scenarios that can support multiple simultaneous editors.

Beyond that, I will also demonstrate how Elixir’s metaprogramming can be used to compose complex models out of CRDT’s that themselves exhibit the same exact features. I will also exhibit some newer CRDT features, such as shared-context and delta-operation CRDT’s to overcome some of the shortcomings of older CRDT’s.

I plan to keep the talk light on theory (the academic literature is sufficient for that).

Great presentation on CRDTs!

I hope we are closer to use of CRDTs with documents than Alexander appears to think. Starting at time mark 5:27, Alexander says that document CRDTs can be ten times the size of the document for collaborative editing. (rough paraphrase)

I have written to Alexander to ask whether more granular CRDTs are possible: for example, a CRDT per <p> element that tracks changes only within a particular paragraph, plus a separate document-level CRDT that covers document-level changes (insertion/deletion of paragraphs).
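For readers new to CRDTs, here is a minimal Python sketch (mine, not Alexander's) of one of the simplest state-based CRDTs, the grow-only counter. The merge is commutative, associative and idempotent, which is what makes writes coordination-free and eventually consistent:

```python
# Grow-only counter (G-Counter) CRDT. Each replica increments only
# its own slot; merging takes the element-wise max, so replicas can
# merge in any order, any number of times, and converge.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

# Two replicas diverge while disconnected, then merge in either
# order and agree on the same total.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

A per-paragraph document CRDT of the kind asked about above would compose many small structures like this, one per element, rather than one monolithic structure for the whole document.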

Alexander is the author of the loom CRDT library and the one link I didn’t see in Alexander’s presentation was for his GitHub page:

Additional resources cited in the presentation:

Lindsey Kuper –

Christopher Meiklejohn –

Carlos Baquero


Summary Paper: Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski

(I supplied the links for Carlos Baquero and riak_dt.)

If you think about it, static data is an edge case of data sources that are always in flux. (Data arising from the measurement or recording of those sources.)

Using Twitter To Control A Botnet

Wednesday, November 18th, 2015

Twitter Direct Messages to control hacked computers by John Zorabedian.

From the post:

Direct Messages on Twitter are a way for users to send messages to individuals or a group of users privately, as opposed to regular tweets, which can be seen by everyone.

Twitter has expended a lot of effort to stamp out the predictable abuses of the Direct Message medium – namely spam and phishing attacks.

But now, self-styled security researcher Paul Amar has created a free Python-based tool called Twittor that uses Direct Messages on Twitter as a command-and-control server for botnets.

As you probably know, cybercriminals use botnets in a variety of ways to launch attacks.

But the one thing we don’t quite get in all of this is, “Why?”

Many security tools, like Nmap and Metasploit, cut both ways, being useful for researchers and penetration testers but also handy for crooks.

But publishing a free tool that helps you operate a botnet via Twitter Direct Message seems a strange way to conduct security research, especially when Twitbots are nothing new.

Amusing indignant stance by naked security on yet another tool for controlling botnets.

Notice the "self-styled security researcher." I guess Anonymous are "self-styled" hackers. And "…a strange way to conduct security research…," as though anyone would appoint naked security as a security research censor.

Software is neither good nor bad, and the conduct of governments, police departments, corporations, and security researchers has left little doubt that presuming a "good side" is at best naive, if not fatally stupid.

There are those who, for present purposes, are not known to be on some other side but that is about as far as you can go safely.

You can find a highly similar article at: Tool Controls Botnet With Twitter Direct Messages by Kelly Jackson Higgins, which supplies the link missing from the naked security post:

Twittor is available on Github.

Kelly reports that Amar is working on adding a data extraction tool to Twittor.

Lies, Damn Lies, and Viral Content [I Know a Windmill When I See One]

Tuesday, November 17th, 2015

Lies, Damn Lies, and Viral Content How News Websites Spread (and Debunk) Online Rumors, Unverified Claims and Misinformation by Craig Silverman.

From the executive summary:

News organizations are meant to play a critical role in the dissemination of quality, accurate information in society. This has become more challenging with the onslaught of hoaxes, misinformation, and other forms of inaccurate content that flow constantly over digital platforms.

Journalists today have an imperative—and an opportunity—to sift through the mass of content being created and shared in order to separate true from false, and to help the truth to spread.

Unfortunately, as this paper details, that isn’t the current reality of how news organizations cover unverified claims, online rumors, and viral content. Lies spread much farther than the truth, and news organizations play a powerful role in making this happen.

News websites dedicate far more time and resources to propagating questionable and often false claims than they do working to verify and/or debunk viral content and online rumors. Rather than acting as a source of accurate information, online media frequently promote misinformation in an attempt to drive traffic and social engagement.

The above conclusions are the result of several months spent gathering and analyzing quantitative and qualitative data about how news organizations cover unverified claims and work to debunk false online information. This included interviews with journalists and other practitioners, a review of relevant scientific literature, and the analysis of over 1,500 news articles about more than 100 online rumors that circulated in the online press between August and December of 2014.

Many of the trends and findings detailed in the paper reflect poorly on how online media behave. Journalists have always sought out emerging (and often unverified) news. They have always followed-on the reports of other news organizations. But today the bar for what is worth giving attention seems to be much lower. There are also widely used practices in online news that are misleading and confusing to the public. These practices reflect short-term thinking that ultimately fails to deliver the full value of a piece of emerging news.

Silverman writes a compelling account (at length, some 164 pages including endnotes) to prove:

News websites dedicate far more time and resources to propagating questionable and often false claims than they do working to verify and/or debunk viral content and online rumors. Rather than acting as a source of accurate information, online media frequently promote misinformation in an attempt to drive traffic and social engagement.

We have all had the experience of watching news reports where we know the “facts” and see reporters making absurd claims about our domain of expertise. But their words may be reaching millions and you can only complain to your significant other.

I fully understand Silverman’s desire to make news reporting better, just as I labor to impress upon standards editors the difference between a reference (is used in the standard itself) and further reading (as listed in a bibliography). That distinction seems particularly difficult for some reason.

The reason I mention windmills in my title is because Silverman offers this rationale for improving verification by news outlets:

Another point of progress for journalists includes prioritizing verification and some kind of value-add to rumors and claims before engaging in propagation. This, in many cases, requires an investment of minutes rather than hours, and it helps push a story forward. The practice will lead to debunking false claims before they take hold in the collective consciousness. It will lead to fewer misinformed readers. It will surface new and important information faster. Most importantly, it will be journalism.

The benefits are:

  1. Debunking false claims before they take hold in the collective consciousness
  2. Fewer misinformed readers
  3. Surface new and important information faster
  4. It will be journalism

Starting from the top: Debunking false claims before they take hold in the collective consciousness.

How does “debunking false claims” impact traffic and social engagement? If my news outlet doesn’t have the attention grabbing headline about an image of Mary in a cheese sandwich, don’t I lose that traffic? Do you seriously think that debunking stories have the audience share of fantastic claim stories?

I suppose if the debunking involved “proving” that the image of Mary was due to witchcraft, that might drive traffic but straight up debunking seems unlikely to do so.

The second benefit was Fewer misinformed readers.

I’m at a loss to say how “fewer misinformed readers” is going to benefit the news outlet? The consequences of being misinformed accrue to the reader and not to the news outlet. I suspect the average attention span is short enough that news outlets could take the other side tomorrow without readers being overly disturbed. They would just be misinformed in a different direction.

The benefit of Surface new and important information faster comes in third.

I can see that argument, but it presumes that news outlets want to report "new and important information" in the first place. What Silverman successfully argues is that the practice is to report news that drives traffic and social engagement. Being "new and important" has only a tangential relationship to traffic and engagement.

You probably remember, during the wall-to-wall reporting on Katrina or the earthquakes in Haiti, the members of the news media interviewing each other. That was nearly negative content. Even rumors and lies would have been better.

The final advantage Silverman cites is It will be journalism.

As I said, I'm not unsympathetic to Silverman, but when was journalism ever concerned with not reporting questionable and false claims? During the American Revolution perhaps? The Civil War? WWI? WWII? Korea? Vietnam? And the list goes on.

There have been “good” journalists (depending upon your point of view) and “bad” journalists (again depending on your point of view). Yet, journalism, just like theology, has survived being populated in part by scalawags, charlatans, and rogues.

What journalism needs is pro-active readers to rebel against superficial, inaccurate and misleading reporting. Voting with their feet will be far more effective than exhortations to do better.

Building Software, Building Community: Lessons from the rOpenSci Project

Tuesday, November 17th, 2015

Building Software, Building Community: Lessons from the rOpenSci Project by Carl Boettiger, Scott Chamberlain, Edmund Hart, Karthik Ram.


rOpenSci is a developer collective originally formed in 2011 by graduate students and post-docs from ecology and evolutionary biology to collaborate on building software tools to facilitate a more open and synthetic approach in the face of transformative rise of large and heterogeneous data. Born on the internet (the collective only began through chance discussions over social media), we have grown into a widely recognized effort that supports an ecosystem of some 45 software packages, engages scores of collaborators, has taught dozens of workshops around the world, and has secured over $480,000 in grant support. As young scientists working in an academic context largely without direct support for our efforts, we have first hand experience with most of the technical and social challenges WSSSPE seeks to address. In this paper we provide an experience report which describes our approach and success in building an effective and diverse community.

Given the state of world affairs, I can’t think of a better time for the publication of this article.

The key lesson that I urge you to draw from this paper is the proactive stance of the project in involving and reaching out to build a community around this project.

Too many projects (and academic organizations for that matter) take the approach that others know they exist and so they sit waiting for volunteers and members to queue up.

Very often they are surprised and bitter that the queue of volunteers and members is so sparse. If anyone dares to venture that more outreach might be helpful, the response is nearly always, sure, you go do that and let us know when it is successful.

How proactive are you in promoting your favorite project?

PS: The rOpenSci website.

DegDB (Open Source Distributed Graph Database) [Tackling Who Pays For This Data?]

Tuesday, November 17th, 2015

DegDB (Open Source Distributed Graph Database) (GitHub)

The Design Doc/Ramble reads in part:

Problems With Existing Graph Databases

  • Owned by private companies with no incentive to share.
  • Public databases are used by few people with no incentive to contribute.
  • Large databases can’t fit on one machine and are expensive to traverse.
  • Existing distributed graph databases require all nodes to be trusted.

Incentivizing Hosting of Data

Every request will have either a debit (with attached bitcoin) or credit (with bitcoin promised on delivery) payment system. The server nodes will attempt to estimate how much it will cost to serve the data and if there isn’t enough bitcoin attached, will drop the request. This makes large nodes want to serve as much popular data as possible, because it allows for faster responses as well as not having to pay other nodes for their data. At the same time, little used data will cost more to access due to requiring more hops to find the data and “cold storage” servers can inflate the prices thus making it profitable for them.

Incentivizing Creation of Data

Data Creation on Demand

A system for requesting certain data to be curated can be employed. The requestor would place a bid for a certain piece of data to be curated, and after n-sources add the data to the graph and verify its correctness the money would be split between them.
This system could be ripe for abuse by having bots automatically fulfilling every request with random data.

Creators Paid on Usage

This method involves the consumers of the data keeping track of their data sources and upon usage paying them. This is a trust based model and may end up not paying creators anything.
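The debit model under "Incentivizing Hosting of Data" above can be sketched in a few lines. This is my own hedged illustration, not DegDB code; the pricing constants and function name are invented:

```python
# A node estimates the cost of serving a request and drops it when
# the attached payment (in bitcoin) falls short of the estimate.
def should_serve(request_size_bytes, hops_estimate, attached_payment,
                 price_per_kb=0.0001, price_per_hop=0.0005):
    # Cost grows with payload size and with how far the node must
    # reach into the network to find the data -- which is why
    # little-used data naturally costs more to access.
    cost = (request_size_bytes / 1024) * price_per_kb \
         + hops_estimate * price_per_hop
    return attached_payment >= cost

print(should_serve(2048, 3, attached_payment=0.01))    # True
print(should_serve(2048, 3, attached_payment=0.0001))  # False
```

Even this toy version shows the intended dynamic: nodes holding popular data locally (zero extra hops) can profitably undercut nodes that must forward requests.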

The one “wow” factor of this project is the forethought to put the discussion of “who pays for this data?” up front and center.

We have all seen the failing model that starts with:

For only $35.00 (U.S.) you can view this article for 24 hours.

That makes you feel like you are almost robbing the publisher at that price. (NOT!)

Right. I’m tracking down a citation to make sure a quote or data is correct and I am going to pay $35.00 (U.S.) to have access for 24 hours. Considering that the publishers with those pricing models have already made back their costs of production and publication plus a profit from institutional subscribers (challenge them for the evidence if they deny), a very low micro-payment would be more suitable. Say $00.01 per paragraph or something on that order. Payable out of a deposit with the publisher.

I would amend the Creators Paid on Usage section to have created content unlocked only upon payment (set by the creator). Over time, creators would develop reputations for the value of their data and if you choose to buy from a private seller with no online history, that’s just your bad.

Imagine that for the Paris incident (hypothetical, none of the following is true) I had the school records for half of the people carrying out that attack. Not only do I have the originals but I also have them translated into English, assuming some or all of them are in some other language. I could cast that data (I'm not fond of the poverty of triples) into a graph format and make it known as part of a distributed graph system.

Some of the data, such as the identities of the people for whom I had records, would appear in the graphs of others as "new" data. It is up to the readers of the graph to decide if the data and the conditions for seeing it are acceptable to them.

Data could even carry a public price tag. That is if you want to pay a large enough sum, then the data in question will be opened up for everyone to have access to it.

I don’t know of any micropayment systems that are eating at the foundations of traditional publishers now but there will be many attempts before one eviscerates them one and all.

The choices we face now, "free" (read: unpaid-for research, writing and publication, which excludes many) versus the "pay-per-view" model that supports early 20th century practices of sloth, cronyism and gate-keeping, aren't the only ones. We need to actively seek out better and more nuanced choices.

Debugging with the Scientific Method [Debugging Search Semantics]

Tuesday, November 17th, 2015

Debugging with the Scientific Method by Stuart Halloway.

This webpage points to a video of Stuart’s keynote address at Clojure/conj 2015 with the same title and has pointers to other resources on debugging.

Stuart summarizes the scientific method for debugging in his closing as:

know where you are going

make well-founded choices

write stuff down

Programmers, using Clojure or not, will profit from Stuart’s advice on debugging program code.

A group that Stuart does not mention, those of us interested in creating search interfaces for users will benefit as well.

We have all had a similar early library experience: facing (in my youth) what seemed like an endless rack of card files with the desire to find information on a subject.

Of course the first problem, from Stuart’s summary, is that we don’t know where we are going. At best we have an ill-defined topic on which we are supposed to produce a report. Let’s say “George Washington, father of our country” for example. (Yes, U.S. specific but I wasn’t in elementary school outside of the U.S. Feel free to post or adapt this with other examples.)

The first step, with help from a librarian, is to learn the basic author, subject, title organization of the card catalog, and things like the fact that looking for "George Washington" under "George" isn't likely to produce a useful result. Eliding over the other details that a librarian would convey, you are now somewhat equipped to move to step two.

Understanding the basic organization and mechanics of a library card catalog, you can develop a plan to search for information on George Washington. Such a plan would include excluding works over the reading level of the searcher, for example.

The third step of course is to capture all the information that is found from the resources located by using the library card catalog.

I mention that scenario not just out of nostalgia for card catalogs but to illustrate the difference between a card catalog and its electronic counterparts, which have an externally defined schema and search interfaces with no disclosed search semantics.

That is to say, if a user doesn’t find an expected result for their search, how do you debug that failure?

You could say the user should have used “term X” instead of “term Y” but that isn’t solving the search problem, that is fixing the user.

Fixing users, as any 12-step program can attest, is a very difficult and fraught with failure process.

Fixing search semantics, debugging search semantics as it were, can fix the search results for a large number of users with little or no effort on their part.

There are any number of examples of debugging or fixing search semantics, but the most prominent one that comes to mind is spelling correction by search engines, which return results for the "correct" spelling and offer the user an opportunity to pursue their "incorrect" spelling.

At one time search engines returned “no results” in the event of mis-spelled words.
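The spelling-correction fix is small enough to sketch. A hedged Python illustration (invented index terms, stdlib only) of debugging the search semantics rather than "fixing the user":

```python
# When a query term matches nothing in the index, suggest the
# closest indexed term instead of returning "no results".
import difflib

index_terms = ["washington", "jefferson", "lincoln", "roosevelt"]

def search(term):
    if term in index_terms:
        return {"results_for": term}
    # Fuzzy fallback: closest term above difflib's default cutoff.
    suggestion = difflib.get_close_matches(term, index_terms, n=1)
    if suggestion:
        return {"results_for": suggestion[0], "did_you_mean": suggestion[0]}
    return {"results_for": None}

print(search("washingtin"))
# {'results_for': 'washington', 'did_you_mean': 'washington'}
```

One change on the application side, and every user who misspells the term gets useful results, with no effort on their part.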

The reason I mention this is that you are likely to be debugging search semantics on a smaller-than-global scale, but the same principle applies, as does Stuart's scientific method.

Treat complaints about search results as an opportunity to debug the search semantics of your application. Follow up with users and test your improved search semantics.

Recalling that, in all events, some user signs your check, not your application.

Multiagent Systems

Monday, November 16th, 2015

Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations by Yoav Shoham and Kevin Leyton-Brown.

From the webpage:

Multiagent systems consist of multiple autonomous entities having different information and/or diverging interests. This comprehensive introduction to the field offers a computer science perspective, but also draws on ideas from game theory, economics, operations research, logic, philosophy and linguistics. It will serve as a reference for researchers in each of these fields, and be used as a text for advanced undergraduate and graduate courses.

Emphasizing foundations, the authors offer a broad and rigorous treatment of their subject, with thorough presentations of distributed problem solving, non-cooperative game theory, multiagent communication and learning, social choice, mechanism design, auctions, coalitional game theory, and logical theories of knowledge, belief, and other aspects of rational agency. For each topic, basic concepts are introduced, examples are given, proofs of key results are offered, and algorithmic considerations are examined. An appendix covers background material in probability theory, classical logic, Markov decision processes, and mathematical programming.

Even better from the introduction:

Imagine a personal software agent engaging in electronic commerce on your behalf. Say the task of this agent is to track goods available for sale in various online venues over time, and to purchase some of them on your behalf for an attractive price. In order to be successful, your agent will need to embody your preferences for products, your budget, and in general your knowledge about the environment in which it will operate. Moreover, the agent will need to embody your knowledge of other similar agents with which it will interact (e.g., agents who might compete with it in an auction, or agents representing store owners)—including their own preferences and knowledge. A collection of such agents forms a multiagent system. The goal of this book is to bring under one roof a variety of ideas and techniques that provide foundations for modeling, reasoning about, and building multiagent systems.

Somewhat strangely for a book that purports to be rigorous, we will not give a precise definition of a multiagent system. The reason is that many competing, mutually inconsistent answers have been offered in the past. Indeed, even the seemingly simpler question—What is a (single) agent?—has resisted a definitive answer. For our purposes, the following loose definition will suffice: Multiagent systems are those systems that include multiple autonomous entities with either diverging information or diverging interests, or both.
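As a toy illustration of the electronic-commerce example in the introduction, here is a hedged Python sketch (mine, not from the book) of a sealed-bid second-price auction among agents, the setting where the classic mechanism-design result says bidding your true valuation is a dominant strategy:

```python
# Sealed-bid second-price (Vickrey) auction: the highest bidder
# wins but pays only the second-highest bid. Agent names and
# valuations below are invented.
def second_price_auction(bids):
    """bids: dict of agent -> bid. Returns (winner, price_paid)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1]  # pay the second-highest bid
    return winner, price

bids = {"alice": 120, "bob": 100, "carol": 90}
print(second_price_auction(bids))  # ('alice', 100)
```

The interesting part, which the book develops rigorously, is why this pricing rule makes truth-telling rational even for purely self-interested agents.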

This looks like a great item for a wish list this close to the holidays. Broad enough to keep your interest up and relevant enough to argue you are “working” and not just reading. 😉

Recreational Constraint Programmer

Monday, November 16th, 2015 [Embedding of the video disabled at the source. Follow the link.]

From the description:

Many of us have hazy memories of finite state machines from computer science theory classes in college. But finite state machines (FSMs) have real, practical value, and it is useful to know how to build and apply them in Clojure. For example, FSMs have long been popular to model game AIs and workflow rules, and FSMs provide the behind-the-scenes magic that powers Java's regexes and core.async's go blocks. In this talk, we'll look at two programming puzzles that, surprisingly, have very elegant solutions when looked at through the lens of FSMs, with code demonstrations using two different Clojure libraries for automata (automat and reduce-fsm), as well as loco, a Clojure constraint solver.

If you have never heard anyone describe themselves as a “recreational constraint programmer,” you really need to see this video!

If you think about having a single representative for a subject as a constraint on a set of topics, the question becomes: what properties must each topic have to satisfy that constraint?

Some properties, such as family names, will lead to over-merging of topics; other properties, such as possession of one and only one social security number, will under-merge topics where a person actually has multiple social security numbers.

The best code demonstration in the video was the generation of a fairly complex cross-word puzzle, sans the clues for each word. I think the clues were left as an exercise for the reader. 😉
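The talk's claim that FSMs power regex matching is easy to see in miniature. A hedged Python sketch (not from the talk, which uses Clojure) of a hand-rolled automaton recognizing the same strings as the regex `ab+`:

```python
# A tiny deterministic finite state machine for the language "ab+":
# an 'a' followed by one or more 'b's. Missing transitions reject.
TRANSITIONS = {
    ("start", "a"): "saw_a",
    ("saw_a", "b"): "accept",
    ("accept", "b"): "accept",
}

def matches(s):
    state = "start"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:       # no transition: dead state
            return False
    return state == "accept"

print([w for w in ["ab", "abbb", "a", "ba", "abc"] if matches(w)])
# ['ab', 'abbb']
```

The transition table is data, which is the same move the automat and reduce-fsm libraries make: the machine is a value you can build, compose and inspect.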

Code Repositories:

Encouraging enough that you might want to revisit regular expressions.


Connecting News Stories and Topic Maps

Monday, November 16th, 2015

New WordPress plug-in Catamount aims to connect data sets and stories by Mădălina Ciobanu.

From the post:

Non-profit news organisation VT Digger, based in the United States, is building an open-source WordPress plug-in that can automatically link news stories to relevant information collected in data sets.

The tool, called Catamount, is being developed with a $35,000 (£22,900) grant from Knight Foundation Prototype Fund, and aims to give news organisations a better way of linking existing data to their daily news coverage.

Rather than hyperlinking a person’s name in a story and sending readers to a different website, publishers can use the open-source plug-in to build a small window that pops up when readers hover over a selected section of the text.

“We have this great data set, but if people don’t know it exists, they’re not going to be racing to it every single day.

“The news cycle, however, provides a hook into data,” Diane Zeigler, publisher at VT Digger, told

If a person is mentioned in a news story and they are also a donor, candidate or representative of an organisation involved in campaign finance, for example, an editor would be able to check that the two names coincide, and give Catamount permission to link the individual to all relevant information that exists in the database.

A brief overview of this information will then be available in a pop-up box, which readers can click in order to access the full data in a separate browser window or tab.
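The core mechanism being described, matching names in a story against a data set and surfacing a short summary for the pop-up, can be sketched in a few lines. This is not Catamount's code (which is a WordPress plug-in); the data set, field names, and people below are all invented for illustration:

```python
# Hypothetical sketch: scan a story for names that appear in a
# campaign-finance data set and build a short pop-up summary for each hit.

FINANCE_DB = {
    "Jane Doe": {"role": "donor", "total": "$5,000"},
    "John Roe": {"role": "candidate", "total": "$120,000"},
}

def annotate(story):
    """Return (name, summary) pairs for every database name found in the story."""
    matches = []
    for name, record in FINANCE_DB.items():
        if name in story:
            summary = f"{name}: {record['role']}, {record['total']} in filings"
            matches.append((name, summary))
    return matches

story = "Jane Doe spoke at the hearing alongside several lawmakers."
print(annotate(story))
# [('Jane Doe', 'Jane Doe: donor, $5,000 in filings')]
```

The hard part in practice is the editorial check the article mentions: confirming that the name in the story and the name in the data set really refer to the same person before the link goes live.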

“It’s about being able to take large data sets and make them relevant to a daily news story, so thinking about ‘why does it matter that this data has been collected for years and years’?

“In theory, it might just sit there if people don’t have a reason to draw a connection,” said Zeigler.

While Catamount only works with WordPress, the code will be made available for publishers to customise and integrate with their own content management systems. The grant and the other winners are covered in Knight Foundation awards $35,000 grant to VTDigger.

Assuming that the plugin will be agnostic as to the data source, this looks like an excellent opportunity to bind topic map managed content to news stories.

You could, I suppose, return one of those dreary listings of all the prior related stories from a news source.

But that is always a lot of repetitive text to wade through for very little gain.

If you curated content with a topic map, excerpting paragraphs from prior stories when necessary for quotes, that would be a high value return for a user following your link.

Since the award was made only days ago I assume there isn’t much to be reported on the Catamount tool, as of yet. I will be following the project and will report back when something testable surfaces.

I first saw this story in an alert. If you aren’t already following them you should be.

Unpronounceable — why can’t people give bioinformatics tools sensible names?

Monday, November 16th, 2015

Unpronounceable — why can’t people give bioinformatics tools sensible names? by Keith Bradnam.

From the post:

Okay, so many of you know that I have a bit of an issue with bioinformatics tools with names that are formed from very tenuous acronyms or initialisms. I’ve handed out many JABBA awards for cases of ‘Just Another Bogus Bioinformatics Acronym’. But now there is another blight on the landscape of bioinformatics nomenclature…that of unpronounceable names.

If you develop bioinformatics tools, you would hopefully want to promote those tools to others. This could be in a formal publication, or at a conference presentation, or even over a cup of coffee with a colleague. In all of these situations, you would hope that the name of your bioinformatics tool should be memorable. One way of making it memorable is to make it pronounceable. Surely, that’s not asking that much? And yet…

The examples Keith recites are quite amusing and you can find more at the JABBA awards.

He also includes some helpful advice on naming:

There is a lot of bioinformatics software in this world. If you choose to add to this ever growing software catalog, then it will be in your interest to make your software easy to discover and easy to promote. For your own sake, and for the sake of any potential users of your software, I strongly urge you to ask yourself the following five questions:

  1. Is the name memorable?
  2. Does the name have one obvious pronunciation?
  3. Could I easily spell the name out to a journalist over the phone?
  4. Is the name of my database tool free from any needless mixed capitalization?
  5. Have I considered whether my software name is based on such a tenuous acronym or initialism that it will probably end up receiving a JABBA award?

To which I would add:

6. Have you searched the name in popular Internet search engines?

I read a fair amount of computer news, and little is more annoying than searching for a new “name” only to find it has 10 million “hits.” Any results relevant to the new usage are buried somewhere in that long list.

Two-word names do better in searches, and three-word names better still. That is, if you want people to find your project, paper, or software.

If not, then by all means use one of the most popular child name lists. You will know where to find your work, but the rest of us won’t.
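Several of the checklist items above can even be tested mechanically. Here is a rough sketch in Python; the heuristics are my own approximations (pronounceability guessed from vowel content, mixed capitalization flagged by pattern), and the tool names are invented:

```python
# Rough, heuristic name checks inspired by the checklist: vowel-free names
# tend to be unpronounceable, mid-word capital letters are "needless mixed
# capitalization", and multi-word names are easier to find in search engines.

import re

def name_report(name):
    words = name.split()
    return {
        "has_vowels": all(re.search(r"[aeiouy]", w, re.I) for w in words),
        "mixed_caps": any(re.search(r"[a-z][A-Z]", w) for w in words),
        "word_count": len(words),  # two or three words search better than one
    }

print(name_report("XSQRT"))            # no vowels: likely unpronounceable
print(name_report("SeqZap"))           # mixed capitalization flagged
print(name_report("Deep Sky Mapper"))  # three words: easy to search for
```

No script can tell you whether an acronym is tenuous enough to earn a JABBA award; that judgment remains with the author.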

On-Demand Data Journalism Training Site [Free Access Ends Nov. 30th]

Sunday, November 15th, 2015

Investigative Reporters and Editors launches on-demand data journalism training site

From the post:

Want to become a data journalist? You’re going to need a lot of perseverance — as well as the right training.

To help make data journalism more accessible to all, Investigative Reporters and Editors (IRE) recently launched NICAR-Learn, an online platform of training videos that can be accessed from anywhere, at any time.

“NICAR-Learn is a place for journalists to demonstrate their best tricks and strategies for working with data and for others to learn from some of the best data journalists in the business,” IRE wrote in a statement.

Unlike many online training platforms, NICAR-Learn’s content won’t consist of hour-long webinars. Instead, NICAR-Learn will produce a library of short videos, often less than 10 minutes long, to train journalists on specific topics or techniques relating to data journalism.

The first NICAR-Learn videos come from data journalist MaryJo Webster, who has produced four tutorials that draw from her popular “Excel Magic” course. Users can request specific tutorials by submitting their ideas to IRE.

These videos will be available at no charge to non-IRE members through the end of November. Beginning in December, IRE will add more videos to NICAR-Learn and place them behind a paywall.

To learn more, visit NICAR-Learn’s “Getting Started” page.

I can’t say that I like “paywalls,” which I prefer to call “privilegewalls.”

Privilegewalls, because that is exactly what paywalls are meant to do: create a feeling of privilege among those who have access, and separate them from those who don’t.

And beyond a feeling of privilege, privilegewalls are meant to advantage insiders over those unfortunate enough to be outsiders. Whether those advantages are real or in the imagination of members I leave for you to debate.

Personally I think helping anyone interested to become a better journalist or data journalist will benefit everyone: journalists, members of the public who read their publications, perhaps even the profession itself.

Here’s an example of where being a better “data journalist” would make a significant difference:

So far as I know, despite several Republican and Democratic presidential candidate debates, no journalist has ever asked how the candidates propose to stop bank robberies in the United States.

In 2014 there were almost 4,000 of them, all at known locations, that is to say, banks. If the government can’t stop robberies at known locations, how does it propose to stop terrorist attacks, which can occur anywhere?

Just one fact, US bank robbery statistics, plus a little creative thinking, would enable journalists to pierce the foggy posturing on Paris and any future or past terror attacks.

The true answer is that you can’t. Not without monitoring everyone 24/7: location, conversations, purchases, and so on. But so far, no reporter has forced that admission from anyone. Curious, don’t you think?

Bank Crime Statistics 2014.