Archive for October, 2015

A Cartoon Guide to Flux

Saturday, October 31st, 2015

From the webpage:

Flux is both one of the most popular and one of the least understood topics in current web development. This guide is an attempt to explain it in a way everyone can understand.

Lin uses cartoons to explain Flux (and in a separate posting Redux).

For more formal documentation, see Flux and Redux.

BTW, in our semantically uncertain times, searching for Redux Facebook will not give you useful results for Redux as it is used in this post.

Successful use of cartoons as explanation is harder than more technical and precise explanations, in part because you have to abandon the shortcuts that technical jargon makes available to the writer. That same jargon imposes a burden on the reader.

What technology would you want to explain using cartoons?

Fact Checking

Saturday, October 31st, 2015

Perhaps humor will help keep this in mind. 😉

I saw this in a Facebook post by Paul Prescott.

Saturday, October 31st, 2015

I should have thought about this book when I posted How to Read a Paper. I haven’t seen a copy in years but that’s a flimsy excuse for forgetting about it. I was reminded of it today when I saw it in a tweet by Michael Nielsen.

Amazon has this description:

With half a million copies in print, How to Read a Book is the best and most successful guide to reading comprehension for the general reader, completely rewritten and updated with new material.

Originally published in 1940, this book is a rare phenomenon, a living classic that introduces and elucidates the various levels of reading and how to achieve them—from elementary reading, through systematic skimming and inspectional reading, to speed reading. Readers will learn when and how to “judge a book by its cover,” and also how to X-ray it, read critically, and extract the author’s message from the text.

Also included is instruction in the different techniques that work best for reading particular genres, such as practical books, imaginative literature, plays, poetry, history, science and mathematics, philosophy and social science works.

Finally, the authors offer a recommended reading list and supply reading tests you can use to measure your own progress in reading skills, comprehension, and speed.

Is How to Read a Book as relevant today as it was in 1940?

In chapter 1, Adler makes a critical distinction between facts and understanding and laments the packaging of opinions:

Perhaps we know more about the world than we used to, and insofar as knowledge is prerequisite to understanding, that is all to the good. But knowledge is not as much a prerequisite to understanding as is commonly supposed. We do not have to know everything about something in order to understand it; too many facts are often as much of an obstacle to understanding as too few. There is a sense in which we moderns are inundated with facts to the detriment of understanding.

One of the reasons for this situation is that the very media we have mentioned are so designed as to make thinking seem unnecessary (though this is only an appearance). The packaging of intellectual positions and views is one of the most active enterprises of some of the best minds of our day. The viewer of television, the listener to radio, the reader of magazines, is presented with a whole complex of elements—all the way from ingenious rhetoric to carefully selected data and statistics—to make it easy for him to “make up his own mind” with the minimum of difficulty and effort. But the packaging is often done so effectively that the viewer, listener, or reader does not make up his own mind at all. Instead, he inserts a packaged opinion into his mind, somewhat like inserting a cassette into a cassette player. He then pushes a button and “plays back” the opinion whenever it seems appropriate to do so. He has performed acceptably without having had to think.

I can’t imagine Adler’s characterization of Fox News, CNN, Facebook and other forums that inundate us with nothing but pre-packaged opinions and repetition of the same.

Although not in modern gender neutral words:

…he inserts a packaged opinion into his mind, somewhat like inserting a cassette into a cassette player. He then pushes a button and “plays back” the opinion whenever it seems appropriate to do so. He has performed acceptably without having had to think.

In a modern context, such viewers, listeners, or readers, in addition to the “play back” function are also quick to denounce anyone who questions their pre-recorded narrative as a “troll.” Fearing discussion of other narratives, alternative experiences or explanations, is a sure sign of a pre-recorded opinion. Discussion interferes with the propagation of pre-recorded opinions.

How to Mark a Book has delightful advice from Adler on marking books. It captures the essence of Adler’s love of books and reading.

PAPERS ARE AMAZING: Profiling threaded programs with Coz

Saturday, October 31st, 2015

PAPERS ARE AMAZING: Profiling threaded programs with Coz by Julia Evans.

I don’t often mention profiling at all but I mention Julia’s post because:

1. It reports a non-intuitive insight in profiling threaded programs (at least until you have seen it).
2. Julia writes a great post on new ideas with perf.

From the post:

The core idea in this paper is – if you have a line of code in a thread, and you want to know if it’s making your program slow, speed up that line of code to see if it makes the whole program faster!

Of course, you can’t actually speed up a thread. But you can slow down all other threads! So that’s what they do. The implementation here is super super super interesting – they use the perf Linux system to do this, and in particular they can do it without modifying the program’s code. So this is a) wizardry, and b) uses perf

Which are both things we love here (omg perf). I’m going to refer you to the paper for now to learn more about how they use perf to slow down threads, because I honestly don’t totally understand it myself yet. There are some difficult details like “if the thread is already waiting on another thread, should we slow it down even more?” that they get into.

The insight that slowing down all but one thread is equivalent, for performance evaluation, to speeding up the thread of interest sounds obvious when mentioned. But only after it is mentioned.
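The equivalence can be checked on a toy model. The sketch below is not how Coz works internally (it does no perf sampling). It assumes a hypothetical program whose threads run fully in parallel, so total runtime is simply the longest thread, and all function names are made up for illustration.

```python
def runtime(thread_times):
    """Toy model: threads run fully in parallel, so the slowest one decides."""
    return max(thread_times)


def actual_speedup(thread_times, target, saving):
    """Runtime if the target thread really did `saving` fewer units of work."""
    sped_up = list(thread_times)
    sped_up[target] -= saving
    return runtime(sped_up)


def virtual_speedup(thread_times, target, saving):
    """Delay every *other* thread by `saving`, then subtract the inserted delay."""
    slowed = [t if i == target else t + saving for i, t in enumerate(thread_times)]
    return runtime(slowed) - saving


threads = [10, 7]  # hypothetical work per thread, in arbitrary time units

for target in range(len(threads)):
    real = actual_speedup(threads, target, 2)
    virtual = virtual_speedup(threads, target, 2)
    print(f"thread {target}: actual={real} virtual={virtual}")
```

In this model, saving 2 units on thread 0 (the critical path) shortens the run, while the same saving on thread 1 changes nothing, and the virtual measurement agrees with the real one in both cases.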

I suspect the ability to have that type of insight isn’t teachable other than by demonstration across a wide range of cases. If you know of other such insights, ping me.

For those interested in “real world” application of insights, Julia mentions the use of this profiler on SQLite and Memcached.

See Julia’s post for the paper and other references.

If you aren’t already checking Julia’s blog on a regular basis you might want to start.

What is Scholarly HTML?

Saturday, October 31st, 2015

What is Scholarly HTML? by Robin Berjon and Sébastien Ballesteros.

Abstract:

Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.

The abstract is accurate enough but the “Motivation” section provides a better sense of this project:

Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.

To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «schema.org» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.

Our high-level goals were:

• Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
• Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
• Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
• Entirely built on top of open, royalty-free standards.
• Long-term viability as a data format.

Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.

An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.

Where semantic modelling is concerned, our approach is to stick as much as possible to schema.org. Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.

In a small, circumscribed number of cases we have had to depart from schema.org, using the https://ns.science.ai/ (prefixed with sa:) vocabulary instead (Science.AI, 2015a). Our goal is to work with schema.org in order to extend their vocabulary, and we will align our usage with the outcome of these discussions.
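To make the idea of HTML overlaid with schema.org metadata concrete, here is a minimal sketch that emits an article fragment annotated with RDFa Lite attributes (vocab, typeof, property). It is an assumption-laden illustration, not the markup the Scholarly HTML specification actually requires. The function name and the sample title, authors and date are hypothetical.

```python
from html import escape


def scholarly_article_html(title, authors, date_published):
    """Build an HTML fragment carrying schema.org metadata via RDFa Lite.

    Hypothetical helper: the element layout is an assumption, not the
    structure mandated by Scholarly HTML.
    """
    author_items = "\n".join(
        f'    <li property="author" typeof="Person">'
        f'<span property="name">{escape(name)}</span></li>'
        for name in authors
    )
    return (
        '<article vocab="http://schema.org/" typeof="ScholarlyArticle">\n'
        f'  <h1 property="headline">{escape(title)}</h1>\n'
        f'  <time property="datePublished" datetime="{escape(date_published)}">'
        f"{escape(date_published)}</time>\n"
        "  <ol>\n"
        f"{author_items}\n"
        "  </ol>\n"
        "</article>"
    )


# Made-up sample data, for illustration only.
fragment = scholarly_article_html(
    "An Example Article", ["A. Author", "B. Author"], "2015-10-31"
)
print(fragment)
```

The fragment still renders as ordinary HTML in a browser, while a metadata-aware consumer can read the type, headline, date and authors from the attributes.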

I especially enjoyed the observation:

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

I don’t doubt the truth of that story but after all, a large number of people are interested in baking cupcakes. In many cases, no more than three people are interested in reading any particular academic paper.

The use of schema.org will provide advantages for common concepts but to be truly useful for scholarly writing, it will require serious extension.

Take for example my post yesterday Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]. What microdata from schema.org would help readers find Propositionalisation and Aggregates, 2001, which describes substantially the same technique, without claims of surpassing human intuition? (Uncited by the authors of the paper on deep feature synthesis.)

Or the 161 papers on propositionalisation that you can find at CiteSeer?

A crude classification that can be used by search engines is very useful but falls far short of the mark in terms of finding and retrieving scholarly writing.

Semantic uniformity for classifying scholarly content hasn’t been reached by scholars or librarians despite centuries of effort. Rather than taking up that Sisyphean task, let’s map across the ever increasing universe of semantic diversity.

Justice Department Press Gang News

Friday, October 30th, 2015

From the post:

Here’s the latest in the encryption case we’ve been writing about in which the Justice Department is asking Magistrate Judge James Orenstein to order Apple to unlock a criminal defendant’s passcode-protected iPhone. The government seized and has authority to search the phone pursuant to a search warrant. Rather than promptly grant the request (as other magistrates have done), Judge Orenstein expressed doubt that the law the government is relying on, the All Writs Act of 1789 (AWA), in fact authorizes him to enter such an order. After receiving briefing from Apple and the DOJ, Judge Orenstein heard oral arguments from both sides on Monday. He then invited them to submit additional briefing to address issues raised during the hearing.

Follow posts on the All Writs Act.

Whatever legal fiction the Justice Department invokes, impressing corporations or individuals into government service is involuntary servitude (slavery) and theft.

If you don’t speak up when they come for Apple, who’s going to be around to speak up when they come for you?

Lessons in Truthful Disparagement

Friday, October 30th, 2015

Cathy O’Neil (mathbabe) featured a guest post on her blog about the EU Human Brain Project.

I am taking notes on truthful disparagement from Dirty Rant About The Human Brain Project.

Just listing the main section headers:

1. We have no fucking clue how to simulate a brain.
2. We have no fucking clue how to wire up a brain.
3. We have no fucking clue what makes human brains work so well.
4. We have no fucking clue what the parameters are.
5. We have no fucking clue what the important thing to simulate is.

The guest post was authored by a neuroscientist.

Cathy has just posted her slides for a day long workshop on data science (to be held in Stockholm), if you want something serious to read after you stop laughing about the EU Human Brain Project.

Free Your Maps from Web Mercator!

Friday, October 30th, 2015

Free Your Maps from Web Mercator! by Mamata Akella.

From the post:

Most maps that we see on the web use the Web Mercator projection. Web Mercator gained its popularity because it provides an efficient way for a two-dimensional flat map to be chopped up into seamless 256×256 pixel map tiles that load quickly into the rectangular shape of your browser.

If you asked a cartographer which map projection you should choose for your map, most of the time the answer would not be Web Mercator. What projection you choose depends on your map’s extent, the type of data you are mapping, and as with all maps, the story you want to tell.

Well, get excited because with a few lines of SQL in CartoDB, you can free your maps from Web Mercator!

Not only can you choose from a variety of projections at CartoDB but you can also define your own projections!
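For readers who have not looked inside Web Mercator tiling, the sketch below shows the widely used "slippy map" arithmetic that maps a longitude/latitude pair to a tile index at a given zoom level. It is background only and is not CartoDB's SQL interface. The function name is hypothetical, and the latitude clamp reflects the usual Web Mercator limit of about 85.05 degrees.

```python
import math

# Web Mercator cannot represent the poles; tiles stop at roughly +/-85.05 degrees.
MAX_LATITUDE = 85.05112878


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Return the (x, y) index of the tile containing a point at `zoom`.

    Tile (0, 0) is the top-left (north-west) tile; there are 2**zoom tiles
    along each axis.
    """
    n = 2 ** zoom
    lat_deg = max(-MAX_LATITUDE, min(MAX_LATITUDE, lat_deg))
    lat_rad = math.radians(lat_deg)
    x = (lon_deg + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n
    # Clamp so points on the far east/south edge land in the last tile.
    return min(n - 1, max(0, int(x))), min(n - 1, max(0, int(y)))


print(lonlat_to_tile(0.0, 0.0, 1))
```

The vertical stretch in the y formula is what makes the grid square and seamless, and it is also the distortion that a different projection lets you avoid.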

Mamata’s post walks you through these new features and promises that more detailed posts are to follow with “advanced cartographic effects on a variety of maps….”

You are probably already following the CartoDB blog but if not…, well today is a good day to start!

Time Curves

Friday, October 30th, 2015

Time Curves by Benjamin Bach, Conglei Shi, Nicolas Heulot, Tara Madhyastha, Tom Grabowski, Pierre Dragicevic.

From What are time curves?:

Time curves are a general approach to visualize patterns of evolution in temporal data, such as:

• Progression and stagnation,
• sudden changes,
• regularity and irregularity,
• reversals to previous states,
• temporal states and transitions,
• etc.

Time curves are based on the metaphor of folding a timeline visualization into itself so as to bring similar time points close to each other. This metaphor can be applied to any dataset where a similarity metric between temporal snapshots can be defined, thus it is largely datatype-agnostic. We illustrate how time curves can visually reveal informative patterns in a range of different datasets.

A website to accompany:

Time Curves: Folding Time to Visualize Patterns of Temporal Evolution in Data

Abstract:

We introduce time curves as a general approach for visualizing patterns of evolution in temporal data. Examples of such patterns include slow and regular progressions, large sudden changes, and reversals to previous states. These patterns can be of interest in a range of domains, such as collaborative document editing, dynamic network analysis, and video analysis. Time curves employ the metaphor of folding a timeline visualization into itself so as to bring similar time points close to each other. This metaphor can be applied to any dataset where a similarity metric between temporal snapshots can be defined, thus it is largely datatype-agnostic. We illustrate how time curves can visually reveal informative patterns in a range of different datasets.

From the introduction:

The time curve technique is a generic approach for visualizing temporal data based on self-similarity. It only assumes that the underlying information artefact can be broken down into discrete time points, and that the similarity between any two time points can be quantified through a meaningful metric. For example, a Wikipedia article can be broken down into revisions, and the edit distance can be used to quantify the similarity between any two revisions. A time curve can be seen as a timeline that has been folded into itself to reflect self-similarity (see Figure 1(a)). On the initial timeline, each dot is a time point, and position encodes time. The timeline is then stretched and folded into itself so that similar time points are brought close to each other (bottom). Quantitative temporal information is discarded as spacing now reflects similarity, but the temporal ordering is preserved.

Figure 1(a) also appears on the webpage as:

Obviously a great visualization tool for temporal data but the treatment of self-similarity is greatly encouraging:

that the similarity between any two time points can be quantified through a meaningful metric.

Time curves don’t dictate to users what “meaningful metric” to use for similarity.
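Because the metric is pluggable, the first step toward a time curve is just a pairwise distance matrix over the snapshots. The sketch below is a minimal illustration under stated assumptions: it uses Python's difflib ratio as a stand-in for the edit distance mentioned in the paper, the function names and sample revisions are made up, and the folding (layout) step itself is not shown.

```python
from difflib import SequenceMatcher


def text_dissimilarity(a, b):
    """Stand-in similarity metric: 0.0 for identical texts, up to 1.0."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def distance_matrix(snapshots, metric=text_dissimilarity):
    """Pairwise distances between time-ordered snapshots, using any metric."""
    return [[metric(a, b) for b in snapshots] for a in snapshots]


# Hypothetical revision history: the third revision reverts the second.
revisions = [
    "Time curves fold a timeline.",
    "Time curves fold a timeline so that similar points meet.",
    "Time curves fold a timeline.",
]

distances = distance_matrix(revisions)
for row in distances:
    print(["%.2f" % d for d in row])
```

The zero distance between the first and third revisions is what a layout algorithm would turn into a curve that loops back on itself, the "reversal to a previous state" pattern. Swapping in a different metric only requires passing another function as `metric`.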

BTW, as a bonus, you can upload your data (JSON format) to generate time curves from your own data.

Users/analysts of temporal data need to take a long look at time curves. A very long look.

I first saw this in a tweet by Moritz Stefaner.

Apple Open Sources Cryptographic Libraries

Friday, October 30th, 2015

Cryptographic Libraries

From the webpage:

The same libraries that secure iOS and OS X are available to third‑party developers to help them build advanced security features.

If you are requesting or implementing new features for a product, make cryptography a top priority.

Why?

The more strong cryptography that is already embedded in legacy software if and when the feds decide on a position on cryptography, the better.

Or put another way, the more secure your data, the harder it is for legislation to force you to make it less secure.

Word to the wise?

I first saw this in a tweet by Matthew J. Weaver.

SXSW Conference Reinstates Two Panels… [Summit on Harassment – Free Streaming]

Friday, October 30th, 2015

In politics this is called “flip-flop.”

SXSW flipped one way because of fear of violence and now SXSW has flopped the other way because of public anger at their flip.

From the SXSW flop statement:

It is clear that online harassment is a problem that requires more than two panel discussions to address.

To that end, we’ve added a day-long summit to examine this topic. Scheduled on Saturday, March 12, the Online Harassment Summit will take place at SXSW 2016, and we plan to live-stream the content free for the public throughout the day.

Hope and pray that Hugh Forrest doesn’t attempt to cross a piece of paisley between now and the Summit on Harassment.

The strain of changing his colors that rapidly could be harmful.

Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]

Friday, October 30th, 2015

Deep Feature Synthesis: Towards Automating Data Science Endeavors by James Max Kanter and Kalyan Veeramachaneni.

Abstract:

In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission’s score.
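The phrase "follows relationships in the data to a base field, and then sequentially applies mathematical functions" can be illustrated with a toy example. This is a sketch of the general idea only, not the authors' Deep Feature Synthesis implementation. The tables, column names and helper function are hypothetical.

```python
from statistics import mean

# Hypothetical relational data: orders point at customers via customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 5.0},
]

# Functions applied to the base field once the relationship has been followed.
AGGREGATES = {"count": len, "mean": mean, "max": max}


def synthesize_features(parents, children, key, field):
    """For each parent row, aggregate `field` over its related child rows.

    `key` is the foreign key joining the two tables; it comes from the
    schema, which is given as input rather than discovered.
    """
    rows = []
    for parent in parents:
        values = [child[field] for child in children if child[key] == parent[key]]
        features = dict(parent)
        for name, function in AGGREGATES.items():
            features[f"{name}({field})"] = function(values) if values else None
        rows.append(features)
    return rows


features = synthesize_features(customers, orders, "customer_id", "amount")
for row in features:
    print(row)
```

Each output row is a customer enriched with count, mean and max of the order amounts, the kind of derived column a predictive model would then consume.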

The most common phrase I saw in headlines about this paper included some variation on: MIT algorithm replaces human intuition or words to that effect. For example, MIT developing a system that replaces human intuition for big data analysis siliconAngle, An Algorithm May Be Better Than Humans at Breaking Down Big Data Newsweek, Is an MIT algorithm better than human intuition? Christian Science Monitor, and A new AI algorithm can outperform human intuition The World Weekly, just to name a few.

Being the generous sort of reviewer that I am, ;-), I am going to assume that the reporters who wrote about the imperiled status of human intuition either didn’t read the article or were working from a poorly written press release.

The error is not buried in a deeply mathematical or computational part of the paper.

Take a look at the second, fourth and seventh paragraphs of the introduction to see if you can spot the error:

To begin with, we observed that many data science problems, such as the ones released by KAGGLE, and competitions at conferences (KDD cup, IJCAI, ECML) have a few common properties. First, the data is structured and relational, usually presented as a set of tables with relational links. Second, the data captures some aspect of human interactions with a complex system. Third, the presented problem attempts to predict some aspect of human behavior, decisions, or activities (e.g., to predict whether a customer will buy again after a sale [IJCAI], whether a project will get funded by donors [KDD Cup 2014], or even where a taxi rider will choose to go [ECML]). [Second paragraph of introduction]

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition. While recent developments in deep learning and automated processing of images, text, and signals have enabled significant automation in feature engineering for those data types, feature engineering for relational and human behavioral data remains iterative, human-intuition driven, and challenging, and hence, time consuming. At the same time, because the efficacy of a machine learning algorithm relies heavily on the input features [1], any replacement for a human must be able to engineer them acceptably well. [Fourth paragraph of introduction]

With these components in place, we present the Data Science Machine — an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling. Most parameters of the system are optimized automatically, in pursuit of good general purpose performance. [Seventh paragraph of introduction]

Have you spotted the problem yet?

In the second paragraph the authors say:

First, the data is structured and relational, usually presented as a set of tables with relational links.

In the fourth paragraph the authors say:

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition.

In the seventh paragraph the authors say:

…an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features…

That is the first time I have ever heard relational database tables and links called raw data.

Human intuition was baked into the data by the construction of the relational tables and links between them, before the Data Science Machine was ever given the data.

The Data Science Machine is wholly and solely dependent upon the human intuition already baked into the relational database data to work at all.

The researchers say as much in the seventh paragraph, unless you think data spontaneously organizes itself into relational tables. Spontaneous relational tables?

If you doubt that human intuition (decision making) is involved in the creation of relational tables, take a quick look at: A Quick-Start Tutorial on Relational Database Design.

This isn’t to take anything away from Kanter and Veeramachaneni. Their Data Science Machine builds upon human intuition captured in relational databases. That is no mean feat. Human intuition should be captured and used to augment machine learning whenever possible.

That isn’t the same as “replacing” human intuition.

PS: Please forward to any news outlet/reporter who has been repeating false information about “deep feature synthesis.”

I first saw this in a tweet by Kirk Borne.

Amateur Discovery Confirmed by NASA

Friday, October 30th, 2015

From the post:

High in the skies over Kazakhstan, space-age technology has revealed an ancient mystery on the ground.

Satellite pictures of a remote and treeless northern steppe reveal colossal earthworks — geometric figures of squares, crosses, lines and rings the size of several football fields, recognizable only from the air and the oldest estimated at 8,000 years old.

The largest, near a Neolithic settlement, is a giant square of 101 raised mounds, its opposite corners connected by a diagonal cross, covering more terrain than the Great Pyramid of Cheops. Another is a kind of three-limbed swastika, its arms ending in zigzags bent counterclockwise.

Described last year at an archaeology conference in Istanbul as unique and previously unstudied, the earthworks, in the Turgai region of northern Kazakhstan, number at least 260 — mounds, trenches and ramparts — arrayed in five basic shapes.

Spotted on Google Earth in 2007 by a Kazakh economist and archaeology enthusiast, Dmitriy Dey, the so-called Steppe Geoglyphs remain deeply puzzling and largely unknown to the outside world.

Two weeks ago, in the biggest sign so far of official interest in investigating the sites, NASA released clear satellite photographs of some of the figures from about 430 miles up.

More evidence you don’t need to be a globe trotter to make major discoveries!

A few of the satellite resources I have blogged about for your use: Free Access to EU Satellite Data, Planet Platform Beta & Open California:…, Skybox: A Tool to Help Investigate Environmental Crime.

Good luck!

Thursday, October 29th, 2015

From the post:

About 1.8 million new scientific papers are published each year, and most are of little consequence to the general public — or even read, really; one study estimates that up to half of all academic studies are only read by their authors, editors, and peer reviewers.

But the papers that are read can change our understanding of the universe — traces of water on Mars! — or impact our lives here on earth — sea levels rising! — and when journalists get called upon to cover these stories, they’re often thrown into complex topics without much background or understanding of the research that led to the breakthrough.

As a result, a group of researchers at Columbia and Stanford are in the process of developing Science Surveyor, a tool that algorithmically helps journalists get important context when reporting on scientific papers.

“The idea occurred to me that you could characterize the wealth of scientific literature around the topic of a new paper, and if you could do that in a way that showed the patterns in funding, or the temporal patterns of publishing in that field, or whether this new finding fit with the overall consensus with the field — or even if you could just generate images that show images very rapidly what the huge wealth, in millions of articles, in that field have shown — [journalists] could very quickly ask much better questions on deadline, and would be freed to see things in a larger context,” Columbia journalism professor Marguerite Holloway, who is leading the Science Surveyor effort, told me.

Science Surveyor is still being developed, but broadly the idea is that the tool takes the text of an academic paper and searches academic databases for other studies using similar terms. The algorithm will surface relevant articles and show how scientific thinking has changed through its use of language.

For example, look at the evolving research around neurogenesis, or the growth of new brain cells. Neurogenesis occurs primarily while babies are still in the womb, but it continues through adulthood in certain sections of the brain.

Up until a few decades ago, researchers generally thought that neurogenesis didn’t occur in humans — you had a set number of brain cells, and that’s it. But since then, research has shown that neurogenesis does in fact occur in humans.

“This tells you — aha! — this discovery is not an entirely new discovery,” Columbia professor Dennis Tenen, one of the researchers behind Science Surveyor, told me. “There was a period of activity in the ’70s, and now there is a second period of activity today. We hope to produce this interactive visualization, where given a paper on neurogenesis, you can kind of see other related papers on neurogenesis to give you the context for the story you’re telling.”
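The "other studies using similar terms" step described in the post can be sketched with a plain bag-of-words comparison. This is an assumption about one simple way to do it, not Science Surveyor's actual algorithm (which the post says is still being developed). The function names and the three sample titles are made up.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its alphabetic terms."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 when disjoint)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_similar(query, corpus):
    """Return (score, document) pairs, most similar to the query first."""
    query_terms = term_counts(query)
    scored = [(cosine(query_terms, term_counts(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini-corpus of paper titles.
corpus = [
    "neurogenesis in the adult brain",
    "sea levels rising along coasts",
    "water traces on mars",
]
ranked = rank_similar("adult neurogenesis in humans", corpus)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

A real system would add stemming, stop-word handling and publication dates, so that the ranked results could be laid out over time as the quote describes.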

Given the number of papers published every year, an algorithmic approach like Science Surveyor is an absolute necessity.

But imagine how much richer the results would be if one of the three or four people who actually read the paper could easily link it to other research and context?

Or if a researcher who discovers the article could then blaze a trail to non-obvious literature that is also relevant?

Search engines now capture what choices users make in the links they follow but that’s a fairly crude approximation of the relevancy of a particular resource. It does not, for example, record why a particular resource is relevant.

Usage of literature should decide which articles merit greater attention from machine or human annotators. A vast amount of humanities literature is never cited by anyone. Why spend resources annotating content that no one is likely to read?

Consequences for use of “found” USB flash drives?

Thursday, October 29th, 2015

Social experiment: 200 USB flash drives left in public locations

From the post:

Nearly one in five people who found a random USB stick in a public setting proceeded to use the drive in ways that posed cybersecurity risks to their personal devices and information and potentially, that of their employer, a recent experiment conducted on behalf of CompTIA revealed.

In a social experiment, 200 unbranded USB flash drives were left in high-traffic, public locations in Chicago, Cleveland, San Francisco and Washington, D.C. In about one in five instances, the flash drives were picked up and plugged into a device. Users then proceeded to engage in several potentially risky behaviors: opening text files, clicking on unfamiliar web links or sending messages to a listed email address.

“These actions may seem innocuous, but each has the potential to open the door to the very real threat of becoming the victim of a hacker or a cybercriminal,” Thibodeaux noted.

What I found missing from this article was any mention of the consequences for the employees who “found” USB drives and then plugged them into work computers.

Social experiment or not, the results indicate that forty people are too risky to be allowed to use their work computers.

If there are consequences for security failures (sharing passwords with Edward Snowden comes to mind), they are rarely reported in the mass media.

It is hardly surprising that cybersecurity is such a pressing issue when there are no consequences for distribution of deeply flawed software, no consequences for user-related breaches of security, and an almost universal failure to capture and punish hackers for breaching your security.

Where are the incentives to improve cybersecurity?

Cassandra Summit 2015 (videos)

Thursday, October 29th, 2015

Cassandra Summit 2015

Courtesy of DataStax, thirty-six (36) presentations from Cassandra Summit 2015 are now online!

Spinning up a Spark Cluster on Spot Instances: Step by Step [$0.69 for 6 hours]

Thursday, October 29th, 2015

From the post:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

This following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down your cluster after you are finished using it.
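As a rough check on the figures quoted above, the sketch below backs out the spot price implied by $0.69 for 4 nodes over 6 hours and scales it to a month. The assumptions are mine, not the tutorial's: that the $0.69 covers all four nodes for the full six hours, and that a month is 30 days. The `cluster_cost` helper is hypothetical.

```python
# Back-of-the-envelope check of the tutorial's cost figures.
# Assumption: the $0.69 covers all 4 nodes for the full 6 hours.
NODES = 4
HOURS_ALLOTTED = 6
QUOTED_COST = 0.69  # dollars, as quoted in the tutorial

# Implied spot price per instance-hour (derived here, not an official AWS price).
rate = QUOTED_COST / (NODES * HOURS_ALLOTTED)


def cluster_cost(nodes, hours, rate_per_instance_hour):
    """Total cost in dollars for `nodes` instances running for `hours` hours."""
    return nodes * hours * rate_per_instance_hour


# Assumption: a 30-day month with the cluster left running around the clock.
monthly = cluster_cost(NODES, 24 * 30, rate)

print(f"implied rate: ${rate:.4f} per instance-hour")
print(f"30-day cost:  ${monthly:.2f}")
```

Under those assumptions the monthly figure comes out at roughly $83, which is in line with the "around $80" in the quoted text. Spot prices vary by region and over time, so treat the result as an estimate only.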

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.

Odd I know but it does happen (or so I have heard). 😉

I first saw this in a tweet by Kirk Borne.

Harvard Law Library Readies Trove of Decisions for Digital Age

Thursday, October 29th, 2015

From the post:

Shelves of law books are an august symbol of legal practice, and no place, save the Library of Congress, can match the collection at Harvard’s Law School Library. Its trove includes nearly every state, federal, territorial and tribal judicial decision since colonial times — a priceless potential resource for everyone from legal scholars to defense lawyers trying to challenge a criminal conviction.

Now, in a digital-age sacrifice intended to serve grand intentions, the Harvard librarians are slicing off the spines of all but the rarest volumes and feeding some 40 million pages through a high-speed scanner. They are taking this once unthinkable step to create a complete, searchable database of American case law that will be offered free on the Internet, allowing instant retrieval of vital records that usually must be paid for.

“Improving access to justice is a priority,” said Martha Minow, dean of Harvard Law School, explaining why Harvard has embarked on the project. “We feel an obligation and an opportunity here to open up our resources to the public.”

While Harvard’s “Free the Law” project cannot put the lone defense lawyer or citizen on an equal footing with a deep-pocketed law firm, legal experts say, it can at least guarantee a floor of essential information. The project will also offer some sophisticated techniques for visualizing relations among cases and searching for themes.

Complete state results will become publicly available this fall for California and New York, and the entire library will be online in 2017, said Daniel Lewis, chief executive and co-founder of Ravel Law, a commercial start-up in California that has teamed up with Harvard Law for the project. The cases will be available at www.ravellaw.com. Ravel is paying millions of dollars to support the scanning. The cases will be accessible in a searchable format and, along with the texts, they will be presented with visual maps developed by the company, which graphically show the evolution through cases of a judicial concept and how each key decision is cited in others.

A very challenging dataset for capturing and mapping semantics!

If you think current legal language is confusing, strap on a couple of centuries of decisions plus legislation as the meaning of words and concepts morph.

Some people will search it as flatly as they do Google Ngrams and that will be reflected in the quality of their results.

Yet another dataset where sharing search trails with commentary would enrich the data with every visit. Less experienced searchers could follow the trails of more accomplished searchers.

Whether capturing and annotating search trails and other non-WestLaw/LexisNexis features will make it into user facing interfaces remains to be seen.

There is some truth to the Westlaw claim that “Core primary law is only the beginning…” but the more court data becomes available, the greater the chance for innovative tools.

How to build and run your first deep learning network

Thursday, October 29th, 2015

From the post:

When I first became interested in using deep learning for computer vision I found it hard to get started. There were only a couple of open source projects available, they had little documentation, were very experimental, and relied on a lot of tricky-to-install dependencies. A lot of new projects have appeared since, but they’re still aimed at vision researchers, so you’ll still hit a lot of the same obstacles if you’re approaching them from outside the field.

In this article — and the accompanying webcast — I’m going to show you how to run a pre-built network, and then take you through the steps of training your own. I’ve listed the steps I followed to set up everything toward the end of the article, but because the process is so involved, I recommend you download a Vagrant virtual machine that I’ve pre-loaded with everything you need. This VM lets us skip over all the installation headaches and focus on building and running the neural networks.

I have been unable to find the posts that were to follow in this series.

Even by itself this will be enough to get you going on deep learning but the additional posts would be nice.

Pointers anyone?

NarcoData [Why Not TrollData?] + Zero Trollerance

Thursday, October 29th, 2015

From the post:

NarcoData, a collaboration between Mexican digital news site Animal Politico and data journalism platform Poderopedia, launched Tuesday with a mission to shine light on organized crime and drug trafficking in Mexico.

“The Mexican state has failed in giving its citizens accurate, updated, and systematic information about the fight against organized crime,” said Dulce Ramos, editor-in-chief of Animal Politico and the general coordinator for NarcoData. “NarcoData wants to fill that empty space.”

The site examines four decades of data to explain how drug trafficking reached its current size and influence in the country. The idea for the project came about last year, when Animal Politico obtained, via the Mexican transparency act, a government chart outlining all of the criminal cells operating in the country. Instead of immediately publishing an article with the data, Animal Politico delved further to fill in the information that the document was missing.

Even a couple of months later, when the document went public and some legacy media outlets wrote articles about it and made infographics from it, “we remained sure that that document had great potential, and we didn’t want to waste it,” Ramos said. Instead, Animal Politico requested and obtained more documents and corroborated the data with information from books, magazines, and interviews.

If you are unfamiliar with the status of the drug war in Mexico, consider the following:

At least 60,000 people are believed to have died between 2006 and 2012 as a result of the drug war as cartels, vigilante groups, and the Mexican army and police have battled each other.

Last week, the Mexican government released new data showing that between 2007 and 2014 — a period that accounts for some of the bloodiest years of the nation’s war against the drug cartels — more than 164,000 people were victims of homicide. Nearly 20,000 died last year alone, a substantial number, but still a decrease from the 27,000 killed at the peak of fighting in 2011.

Over the same seven-year period, slightly more than 103,000 died in Afghanistan and Iraq, according to data from the and the website .

Journalists and press freedom groups have expressed growing anger at Mexican authorities’ failure to tackle escalating violence against reporters and activists who dare to speak out against political corruption and organised crime.

Espinosa was the 13th journalist working in Veracruz to be killed since Governor Javier Duarte from the ruling Institutional Revolutionary party (PRI) came to power in 2011. According to the press freedom organisation Article 19, the state is now the most dangerous place to be a journalist in Latin America.

According to the Committee to Protect Journalists, about 90% of journalist murders in Mexico since 1992 have gone unpunished.

Patrick Timmons, a human rights expert who investigated violence against journalists while working for the UK embassy in Mexico City, said the massacre was another attempt to silence the press: “These are targeted murders which are wiping out a whole generation of critical leaders.”

Against that background of violence and terror, NarcoData emerges. Mexican journalists speak out against the drug cartels and on behalf of the people of Mexico who suffer under the cartels.

I am embarrassed to admit sharing U.S. citizenship with the organizers of South by Southwest (SXSW). Under undisclosed “threats” of violence because of panels to discuss online harassment, the SXSW organizers cancelled the panels. Lisa Vaas captures those organizers perfectly in her headline: SXSW turns tail and runs, nixing panels on harassment.

I offer thanks that the SXSW organizers were not civil rights organizers in: SXSW turns tail and runs… [Rejoice SXSW Organizers Weren’t Civil Rights Organizers] Troll Police.

NarcoData sets an example of how to respond to drug cartels or Internet trolls. Shine a bright light on them. Something the SXSW organizers were too timid to even contemplate.

Fighting Internet trolls requires more than anecdotal accounts of abuse. Imagine a TrollData database that collects data from all forms of social media, including SMS messages and email forwarded to it. So that data analytics can be brought to bear on the data with a view towards identifying trolls by their real world identities.

Limited to Twitter but a start in that direction is described in: How do you stop Twitter trolls? Unleash a robot swarm to troll them back by Jamie Bartlett.

Knowing how to deal with Internet trolls is tricky, because the separating line between offensive expression and harassment is very fine, and usually depends on your vantage point. But one subspecies, the misogynist troll, has been causing an awful lot of trouble lately. Online abuse seems to accompany every woman that pops her head over the parapet: Mary Beard, Caroline Criado-Perez, Zelda Williams and so on. It’s not just the big fish, either. The non-celebs women cop it too, but we don’t hear about it. Despite near universal condemnation of this behaviour, it just seems to be getting worse.

Today, a strange and mysterious advocacy group based in Berlin called the “Peng! Collective” have launched a new way of tackling the misogynistic Twitter trolls. They’re calling it “Zero Trollerance.”

Here’s what they are doing. If a Twitter user posts any one of around one hundred preselected terms or words that are misogynistic, a bot – an automated account – spots it, and records that user’s Twitter handle in a database. (These terms, in case you’re wondering, include, but are not limited to, the following gems: #feministsareugly #dontdatesjws “die stupid bitch”, “feminazi” and “stupid whore”.)

This is the clever bit. This is a lurking, listening bot. It’s patrolling Twitter silently as we speak and taking details of the misogynists. But then there is another fleet of a hundred or so bots – I’ll call them the attack bots – that, soon after the offending post has been identified, will start auto-tweeting messages @ the offender (more on what they tweet below).
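The "listening bot" step described above can be sketched in a few lines. This is only an illustration of the idea, not the Peng! Collective's code: the watch terms are neutral placeholders, the posts are invented, and no Twitter API is involved. The function name `find_offenders` is hypothetical.

```python
# Sketch of the "listening bot" step: flag handles whose posts contain
# any term from a watch list. Terms and posts here are hypothetical.
WATCH_TERMS = ["#exampleslur", "placeholder insult"]  # stand-ins for the real list


def find_offenders(posts, terms=WATCH_TERMS):
    """Return the set of handles whose text contains a watched term.

    `posts` is an iterable of (handle, text) pairs. Matching is a plain
    case-insensitive substring test.
    """
    lowered = [term.lower() for term in terms]
    offenders = set()
    for handle, text in posts:
        body = text.lower()
        if any(term in body for term in lowered):
            offenders.add(handle)
    return offenders


sample = [
    ("@alice", "Great talk today!"),
    ("@troll1", "You are a PLACEHOLDER INSULT"),
    ("@troll2", "lol #ExampleSlur"),
]
flagged = find_offenders(sample)
print(sorted(flagged))
```

Plain substring matching is the crudest possible choice. It will flag people quoting abuse in order to condemn it, which is one reason the public debate over criteria matters.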

“Zero Trollerance” is a great idea and I applaud it. But it doesn’t capture the true power of data mining, which could uncover trolls that use multiple accounts, trolls that are harassing other users via other social media, not to mention being able to shine light directly on trolls in public, quite possibly the thing they fear the most.

TrollData would require high levels of security, monitoring of all public social media and the ability to accept email and SMS messages forwarded to it, governance and data mining tools.

Mexican journalists are willing to face death to populate NarcoData. What do you say to facing down trolls?

In case you want to watch or forward the Zero Trollerance videos:

Zero Trollerance Step 1: Zero Denial

Zero Trollerance Step 2: Zero Internet

Zero Trollerance Step 3: Zero Anger

Zero Trollerance Step 4: Zero Fear

Zero Trollerance Step 5: Zero Hate

Zero Trollerance Step 6: Zero Troll

Is It Foolish To Model Nature’s Complexity With Equations?

Thursday, October 29th, 2015

Is It Foolish To Model Nature’s Complexity With Equations? by Gabriel Popkin.

From the post:

Sometimes ecological data just don’t make sense. The sockeye salmon that spawn in British Columbia’s Fraser River offer a prime example. Scientists have tracked the fishery there since 1948, through numerous upswings and downswings. At first, population numbers seemed inversely correlated with ocean temperatures: The northern Pacific Ocean surface warms and then cools again every few decades, and in the early years of tracking, fish numbers seemed to rise when sea surface temperature fell. To biologists this seemed reasonable, since salmon thrive in cold waters. Represented as an equation, the population-temperature relationship also gave fishery managers a basis for setting catch limits so the salmon population did not crash.

But in the mid-1970s something strange happened: Ocean temperatures and fish numbers went out of sync. The tight correlation that scientists thought they had found between the two variables now seemed illusory, and the salmon population appeared to fluctuate randomly.

Trying to manage a major fishery with such a primitive understanding of its biology seems like folly to George Sugihara, an ecologist at the Scripps Institution of Oceanography in San Diego. But he and his colleagues now think they have solved the mystery of the Fraser River salmon. Their crucial insight? Throw out the equations.

Sugihara’s team has developed an approach based on chaos theory that they call “empirical dynamic modeling,” which makes no assumptions about salmon biology and uses only raw data as input. In designing it, the scientists found that sea surface temperature can in fact help predict population fluctuations, even though the two are not correlated in a simple way. Empirical dynamic modeling, Sugihara said, can reveal hidden causal relationships that lurk in the complex systems that abound in nature.

Sugihara and others are now starting to apply his methods not just in ecology but in finance, neuroscience and even genetics. These fields all involve complex, constantly changing phenomena that are difficult or impossible to predict using the equation-based models that have dominated science for the past 300 years. For such systems, DeAngelis said, empirical dynamic modeling “may very well be the future.”
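The article does not spell out the algorithm, so what follows is a simplified sketch in the spirit of simplex projection, a nearest-neighbor forecasting method from Sugihara's earlier work. It makes no use of a governing equation when forecasting: it rebuilds the system's state from lagged values of the data and predicts from what happened after similar past states. The logistic map stands in for real field data, and the embedding dimension and library size are arbitrary choices of mine.

```python
import math


def logistic_series(n, r=3.9, x0=0.4):
    """Generate a chaotic test series (a stand-in for real field data)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def embed(series, dim):
    """Delay-embed: the point at time t is (x[t], x[t-1], ..., x[t-dim+1])."""
    return {t: tuple(series[t - k] for k in range(dim))
            for t in range(dim - 1, len(series))}


def simplex_forecast(series, n_library, dim=2):
    """One-step forecasts for times after the library window."""
    points = embed(series, dim)
    # Library points need a known "next value" inside the library window.
    library = [t for t in points if t + 1 < n_library]
    forecasts = {}
    for t in range(n_library, len(series) - 1):
        target = points[t]
        # The dim + 1 nearest library states form the "simplex".
        nearest = sorted(
            (math.dist(target, points[s]), s) for s in library
        )[: dim + 1]
        d_min = max(nearest[0][0], 1e-12)  # guard against division by zero
        weights = [math.exp(-d / d_min) for d, _ in nearest]
        total = sum(weights)
        # Forecast = weighted average of where the neighbors went next.
        forecasts[t + 1] = sum(
            w * series[s + 1] for w, (_, s) in zip(weights, nearest)
        ) / total
    return forecasts


series = logistic_series(500)
forecasts = simplex_forecast(series, n_library=400)
mae = sum(abs(series[t] - p) for t, p in forecasts.items()) / len(forecasts)
print(f"mean absolute one-step error: {mae:.4f}")
```

The forecaster only ever sees the raw numbers. Which variables to embed and how many lags to use are still choices made by the modeler.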

If you like success stories with threads of chaos, strange attractors, and fractals running through them, you will enjoy Gabriel’s account of empirical dynamic modeling.

I have been a fan of chaos and fractals since reading Computer Recreations: A computer microscope zooms in for a look at the most complex object in mathematics in 1985 (Scientific American). That article was reposted as part of: DIY Fractals: Exploring the Mandelbrot Set on a Personal Computer by A. K. Dewdney.

Despite that long association with and appreciation of chaos theory, I would answer the title question with a firm maybe.

The answer depends upon whether equations or empirical dynamic modeling provide the amount of precision needed for some articulated purpose.

Both methods ignore any number of dimensions of data, each of which is as chaotic as any of the others. Which ones are taken into account and which ones are ignored is a design question.

Recitation of the uncertainty of data and analysis would be boring as a preface to every publication, but those factors should be uppermost in the minds of every editor or reviewer.

Our choice of data or equations or some combination of both to simplify the world for reporting to others shapes the view we report.

What is foolish is to confuse those views with the world. They are not the same.

Concurrency, Specification & Programming (CS&P 2015)

Thursday, October 29th, 2015

Concurrency, Specification & Programming, volume 1, Zbigniew Suraj, Ludwik Czaja (Eds.)

Concurrency, Specification & Programming, volume 2, Zbigniew Suraj, Ludwik Czaja (Eds.)

From the preface:

This two-volume book contains the papers selected for presentation at the Concurrency, Specification and Programming (CS&P) Workshop. It is taking place from 28th to 30th September 2015 in Rzeszow, the biggest city in southeastern Poland. CS&P provides an international forum for exchanging scientific, research, and technological achievements in concurrency, programming, artificial intelligence, and related fields. In particular, major areas selected for CS&P 2015 include mathematical models of concurrency, data mining and applications, fuzzy computing, logic and probability in theory of computing, rough and granular computing, unconventional computing models. In addition, three plenary keynote talks were delivered.

Not for the faint of heart but if you are interested in the future of computing, these two volumes should be on your reading list.

Model-Based Machine Learning

Wednesday, October 28th, 2015

Model-Based Machine Learning by John Winn and Christopher Bishop with Thomas Diethe.

From How can machine learning solve my problem? (first chapter):

In this book we look at machine learning from a fresh perspective which we call model-based machine learning. This viewpoint helps to address all of these challenges, and makes the process of creating effective machine learning solutions much more systematic. It is applicable to the full spectrum of machine learning techniques and application domains, and will help guide you towards building successful machine learning solutions without requiring that you master the huge literature on machine learning.

The core idea at the heart of model-based machine learning is that all the assumptions about the problem domain are made explicit in the form of a model. In fact a model is just made up of this set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable. For example, in the next chapter we build a model to help us solve a simple murder mystery. The assumptions of the model include the list of suspected culprits, the possible murder weapons, and the tendency for particular weapons to be preferred by different suspects. This model is then used to create a model-specific algorithm to solve the specific machine learning problem. Model-based machine learning can be applied to pretty much any problem, and its general-purpose approach means you don’t need to learn a huge number of machine learning algorithms and techniques.

So why do the assumptions of the model play such a key role? Well it turns out that machine learning cannot generate solutions purely from data alone. There are always assumptions built into any algorithm, although usually these assumptions are far from explicit. Different algorithms correspond to different sets of assumptions, and when the assumptions are implicit the only way to decide which algorithm is likely to give the best results is to compare them empirically. This is time-consuming and inefficient, and it requires software implementations of all of the algorithms being compared. And if none of the algorithms tried gives good results it is even harder to work out how to create a better algorithm.
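To make "the model is just the set of assumptions" concrete, here is a toy version of a murder-mystery model of the kind the quoted passage describes. It is not the book's code, and every name and probability below is made up for illustration. The assumptions are a prior over two suspects and each suspect's preference for two weapons. The inference step is nothing more than Bayes' rule applied to those assumptions.

```python
# Toy model in the spirit of the murder-mystery example.
# All names and probabilities below are made up for illustration.
prior = {"suspect_a": 0.3, "suspect_b": 0.7}  # P(culprit)
weapon_given_culprit = {                      # P(weapon | culprit)
    "suspect_a": {"revolver": 0.9, "dagger": 0.1},
    "suspect_b": {"revolver": 0.2, "dagger": 0.8},
}


def posterior(observed_weapon):
    """Bayes' rule: P(culprit | weapon) is proportional to prior * likelihood."""
    joint = {who: prior[who] * weapon_given_culprit[who][observed_weapon]
             for who in prior}
    evidence = sum(joint.values())  # P(weapon), the normalizing constant
    return {who: p / evidence for who, p in joint.items()}


result = posterior("revolver")
for who, p in result.items():
    print(f"{who}: {p:.3f}")
```

Change any assumption, such as a prior or a weapon preference, and the conclusion changes with it. That is the quoted passage's point about making assumptions explicit.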

Four chapters are complete now and four more are coming.

Not a fast read but has a great deal of promise, particularly if readers are honest about their assumptions when modeling problems.

When Lobbyists Write Legislation,…

Wednesday, October 28th, 2015

From the post:

Most kids learn the grade school civics lesson about how a bill becomes a law. What those lessons usually neglect to show is how legislation today is often birthed on a lobbyist’s desk.

But even for expert researchers, journalists, and government transparency groups, tracing a bill’s lineage isn’t easy, especially at the state level. Last year alone, there were 70,000 state bills introduced in 50 states. It would take one person five weeks to even read them all. Groups that do track state legislation usually focus narrowly on a single topic, such as abortion, or perhaps a single lobby group.

Computers can do much better. A prototype tool, presented in September at Bloomberg’s Data for Good Exchange 2015 conference, mines the Sunlight Foundation’s database of more than 500,000 bills and 200,000 resolutions for the 50 states from 2007 to 2015. It also compares them to 1,500 pieces of “model legislation” written by a few lobbying groups that made their work available, such as the conservative group ALEC (American Legislative Exchange Council) and the liberal group the State Innovation Exchange (formerly called ALICE).

Jessica gives a great overview of Legislative Influence Detector (LID). That wasn’t how I heard “lid” in my youth but even acronyms change meaning over time. 😉

Legislative Influence Detector (LID) has this introduction at its website:

Journalists, researchers, and concerned citizens would like to know who’s actually writing legislative bills. But trying to read those bills, let alone trace their source, is tedious and time consuming. This is especially true at the state level, where important policy decisions are made every day. State legislatures consider roughly 70,000 bills each year, covering taxes, education, healthcare, crime, transportation, and more.

To solve this problem, we have created a tool we call the “Legislative Influence Detector” (LID, for short). LID helps watchdogs turn a mountain of text into digestible insights about the origin and diffusion of policy ideas and the real influence of various lobbying organizations. LID draws on more than 500,000 state bills (collected by the Sunlight Foundation) and 2,400 pieces of model legislation written by lobbyists (collected by us, ALEC Exposed, and other groups), searches for similarities, and flags them for review. LID users can then investigate the matches to look for possible lobbyist and special interest influence.
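The "searches for similarities, and flags them for review" step can be illustrated with a deliberately simple stand-in: word shingles compared by Jaccard similarity. This is not LID's actual matching method, which the quoted introduction does not describe. The bill and model texts, the identifiers, and the 0.3 threshold are all invented for the example.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles (overlapping word n-grams), lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_matches(bills, models, threshold=0.3):
    """Return (bill_id, model_id, score) triples at or above the threshold."""
    model_shingles = {mid: shingles(text) for mid, text in models.items()}
    hits = []
    for bid, text in bills.items():
        bill_set = shingles(text)
        for mid, mset in model_shingles.items():
            score = jaccard(bill_set, mset)
            if score >= threshold:
                hits.append((bid, mid, score))
    return sorted(hits, key=lambda hit: -hit[2])


# Invented texts, for illustration only.
models = {
    "model-1": "No person shall be required to join a labor organization "
               "as a condition of employment.",
}
bills = {
    "bill-A": "No person shall be required to join a labor organization "
              "as a condition of employment in this state.",
    "bill-B": "The department shall publish an annual report on highway "
              "maintenance costs.",
}
hits = flag_matches(bills, models)
for bill_id, model_id, score in hits:
    print(bill_id, model_id, round(score, 3))
```

Comparing every bill against every model text, as this sketch does, would be slow at the scale of 500,000 bills. A real system would need an index or a pre-filter before any detailed comparison.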

Improvements are planned and I am sure help would be welcome.

Looks like a great tool for detecting public influencing of legislation, that is, influence by an identifiable group that has posted its proposed legislation.

Not quite so great for detecting the influencing of legislation as reflected in details of statutes.

Yes, there is a reason why the income tax laws in the United States are nearly 75,000 pages long. Each detail, exception, qualifier is there at the behest of some interest group.

The same is true with the rest of the United States Code but the benefits are more immediately obvious with tax law.

Text Mining Meets Neural Nets: Mining the Biomedical Literature

Wednesday, October 28th, 2015

From the webpage:

Text mining and natural language processing employ a range of techniques from syntactic parsing, statistical analysis, and more recently deep learning. This presentation presents recent advances in dense word representations, also known as word embedding, and their advantages over sparse representations, such as the popular term frequency-inverse document frequency (tf-idf) approach. It also discusses convolutional neural networks, a form of deep learning that is proving surprisingly effective in natural language processing tasks. Reference papers and tools are included for those interested in further details. Examples are drawn from the bio-medical domain.
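For readers who have not met the sparse representation the abstract mentions, here is a minimal sketch of one common tf-idf variant (term frequency as count over document length, inverse document frequency as the natural log of N over document frequency). The three "documents" are invented, and the tokenizer is a bare whitespace split. It is meant only to show why the result is sparse: each document stores weights only for the terms it contains.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one sparse {term: weight} dict per document.

    Uses tf = count / document length and idf = ln(N / df); other
    weighting variants are common.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Invented example documents.
docs = [
    "gene expression in tumor cells",
    "tumor growth and gene mutation",
    "protein folding in cells",
]
vectors = tfidf(docs)
vocabulary = {term for vec in vectors for term in vec}
print(len(vocabulary), "terms in the vocabulary")
print(len(vectors[0]), "non-zero weights in the first document")
```

A dense word embedding, by contrast, gives every word a short fixed-length vector of real numbers, so related words can end up near each other even when they never co-occur in the same document.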

Basically an abstract for the 58 slides you will find here: http://www.slideshare.net/DanSullivan10/text-mining-meets-neural-nets.

The best thing about these slides is the wealth of additional links to other resources. There is only so much you can say on a slide so links to more details should be a standard practice.

Slide 53, Formalize a Mathematical Model of Semantics, seems a bit ambitious to me, considering that mathematics is a subset of natural language. It is difficult to see how the lesser could model the greater.

You could create a mathematical model of some semantics and say it was all that is necessary, but that’s been done before. Always strive to make new mistakes.

Wednesday, October 28th, 2015

I tried to create a new Twitter account today but much to my surprise I could not use a phone number already in use by another Twitter account.

Moreover, the phone number has to be of an SMS-enabled phone.

I understand the need for security but you do realize that the SMS-enabled phone requirement ties your Twitter account to a particular phone. Yes?

Now, who was it that was tracking all phone traffic?

Oh, I remember, Justice Department plotting to resume NSA bulk phone records collection, it was the NSA!

The number of government mis-steps and outrages in just a few months is enough to drive earlier ones from immediate memory. It’s sad to have a government that deeply incompetent and dishonest.

Although it will be portrayed as requiring sophisticated analysis tools in order to justify the NSA’s budget.

Suggestion: Twitter should display the SMS code on a page returned to the browser requesting an account.

Unless of course, Twitter has already joined itself at the hip to the NSA.

SXSW turns tail and runs… [Rejoice SXSW Organizers Weren’t Civil Rights Organizers] Troll Police

Wednesday, October 28th, 2015

From the post:

Threats of violence have led the popular South by Southwest (SXSW) festival to nix two panel discussions about online harassment, organizers announced on Monday.

In his post, SXSW Interactive Director Hugh Forrest didn’t go into detail about the threats.

But given the names of the panels cancelled, there’s a strong smell of #gamergate in the air.

Namely, the panels for the 2016 event, announced about a week ago, were titled “SavePoint: A Discussion on the Gaming Community” and “Level Up: Overcoming Harassment in Games.”

This reaction sure isn’t what they had in mind, Forrest wrote:

We had hoped that hosting these two discussions in March 2016 in Austin would lead to a valuable exchange of ideas on this very important topic.

However, in the seven days since announcing these two sessions, SXSW has received numerous threats of on-site violence related to this programming. SXSW prides itself on being a big tent and a marketplace of diverse people and diverse ideas.

However, preserving the sanctity of the big tent at SXSW Interactive necessitates that we keep the dialogue civil and respectful.

Arthur Chu, who was going to be a male ally on the Level Up panel, has written up the behind-the-scenes mayhem for The Daily Beast.

As Chu tells it, SXSW has a process of making proposed panels available for – disastrously enough, given the tactics of torch-bearing villagers – a public vote.

I rejoice the SXSW organizers weren’t civil rights organizers.

Here is an entirely fictional account of that possible conversation about marching across the Pettus Bridge.

Hugh Forrest: Yesterday (March 6, 1965), Gov. Wallace ordered the state police to prevent a march between Selma and Montgomery by “whatever means are necessary….”

SXSW organizer: I heard that! And the police turned off the street lights and beat a large group on February 18, 1965 and followed Jimmie Lee Jackson into a cafe, shooting him. He died eight days later.

Another SXSW organizer: There has been nothing but violence and more violence for weeks, plus threats of more violence.

Hugh Forrest: Run away! Run away!

A video compilation of the violence Hugh Forrest and his fellow cowards would have dodged as civil rights organizers: Selma-to-Montgomery “Bloody Sunday” – Video Compilation.

Hugh Forrest and SXSW have pitched a big tent that is comfortable for abusers.

I consider that siding with the abusers.

Safety and Physical Violence at Public Gatherings:

Assume that a panel discussion on online harassment does attract threats of physical violence. Isn’t that what police officers are trained to deal with?

And for that matter, victims of online harassment are more likely to be harmed in the real world when they are alone, aren’t they?

So a public panel discussion, with the police in attendance, is actually safer for victims of online harassment than any other place for a real world confrontation.

Their abusers and their vermin-like supporters would have to come out from under their couches and closets into the light to harass them. Police officers are well equipped to hand out immediate consequences for such acts.

Abusers would become entangled in a legal system with little patience with or respect for their online presences.

Lessons from the Pettus Bridge:

In my view, civil and respectful dialogue isn’t how you deal with abusers, online or off. Civil and respectful dialogue didn’t protect the marchers to Montgomery and it won’t protect victims of online harassment.

The marchers to Montgomery were protected when forces more powerful than the local and state police moved in to protect them.

What is required to protect targets of online harassment is a force larger and more powerful than their abusers.

Troll Police:

Consider this a call upon those with long histories of fighting online abuse individually and collectively to create a crowd-sourced Troll Police.

Public debate over the criteria for troll behavior and appropriate responses will take time but is an essential component of community validation for such an effort.

Imagine the Troll Police amassing a “big data”-sized database of online abuse. A database where members of the public can contribute analysis or research to help identify trolls.

That would be far more satisfying than wringing your hands when you hear stories of abuse and wishing things were better. Things can be better, but only if we take steps to make them better.

I have some ideas and cycles I would contribute to such an effort.

Five Design Sheet [TM Interface Design]

Wednesday, October 28th, 2015

Five Design Sheet

Blog, resources and introductory materials for the Five Design Sheet (FdS) methodology.

FdS is described more formally in:

Abstract:

Sketching designs has been shown to be a useful way of planning and considering alternative solutions. The use of lo-fidelity prototyping, especially paper-based sketching, can save time, money and converge to better solutions more quickly. However, this design process is often viewed to be too informal. Consequently users do not know how to manage their thoughts and ideas (to first think divergently, to then finally converge on a suitable solution). We present the Five Design Sheet (FdS) methodology. The methodology enables users to create information visualization interfaces through lo-fidelity methods. Users sketch and plan their ideas, helping them express different possibilities, think through these ideas to consider their potential effectiveness as solutions to the task (sheet 1); they create three principle designs (sheets 2,3 and 4); before converging on a final realization design that can then be implemented (sheet 5). In this article, we present (i) a review of the use of sketching as a planning method for visualization and the benefits of sketching, (ii) a detailed description of the Five Design Sheet (FdS) methodology, and (iii) an evaluation of the FdS using the System Usability Scale, along with a case-study of its use in industry and experience of its use in teaching.

Abstract:

There are many challenges for a developer when creating an information visualization tool of some data for a client. In particular students, learners and in fact any designer trying to apply the skills of information visualization often find it difficult to understand what, how and when to do various aspects of the ideation. They need to interact with clients, understand their requirements, design some solutions, implement and evaluate them. Thus, they need a process to follow. Taking inspiration from product design, we present the Five design-Sheet approach. The FdS methodology provides a clear set of stages and a simple approach to ideate information visualization design solutions and critically analyze their worth in discussion with the client.

As written, FdS is entirely appropriate for a topic map interface, but how do you capture the subjects users do or want to talk about?

Suggestions?

PSA (sort of): The Legislative Process

Tuesday, October 27th, 2015

The Legislative Process Congress.gov.

Nine videos from Congress.gov break the legislative process into nine stages.

A civics class view of the legislative process but also useful if you are trying to pass a citizenship exam for the United States.

The videos are done using Flash so view them on a school computer that you don’t care about getting infected.

The biggest departure from reality (I read the transcripts, I did not watch the videos) is where the legislative process says:

Committee members and staff focus much of their time on drafting and considering legislative proposals,….

Some staff spend time on legislative proposals but any number of legislative proposals are written by law firms on behalf of their clients and then proposed by members of Congress.

As for “considering legislative proposals,” recall the Patriot Act was passed without time for it to be read, 357 to 66 in the House and 98 to 1 in the Senate.

If you are asking a favor from a delusional person, you had best share their delusion or at least appear to do so. Watch these videos before any favor-seeking visit to Washington, D.C.

Statistical Reporting Errors in Psychology (1985–2013) [1 in 8]

Tuesday, October 27th, 2015

Do you remember your parents complaining about how far the latest psychology report departed from their reality?

Turns out there may be a scientific reason why those reports were as far off as your parents thought (or not).

The prevalence of statistical reporting errors in psychology (1985–2013) by Michèle B. Nuijten, Chris H. J. Hartgerink, Marcel A. L. M. van Assen, Sacha Epskamp, and Jelte M. Wicherts, reports:

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

This is an open access article so dig in for all the details discovered by the authors.

The R package statcheck: Extract Statistics from Articles and Recompute P Values is quite amazing. The manual for statcheck should have you up and running in short order.
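If you are curious what "recompute P values" amounts to, here is a minimal sketch in Python rather than R. It is not statcheck's code and does not use its API. It assumes the simplest case, a two-tailed z test, and the function names, the rounding tolerance and the 0.05 threshold are my own illustrative choices. It also ignores rounding in the reported test statistic. A reported p-value counts as "inconsistent" if it does not match the recomputed one at the reported precision, and "grossly inconsistent" if the mismatch also flips the significance decision.

```python
import math


def two_tailed_p_from_z(z: float) -> float:
    """Two-tailed p-value for a standard normal (z) test statistic."""
    # P(|Z| >= |z|) for Z ~ N(0, 1) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_p(z: float, reported_p: float,
                     decimals: int = 2, alpha: float = 0.05):
    """Compare a reported p-value with the one recomputed from z.

    Returns (recomputed_p, consistent, gross), where:
      consistent: recomputed p matches the reported p once rounding to
                  `decimals` places is allowed for (hypothetical rule)
      gross:      inconsistent AND the two values fall on opposite
                  sides of `alpha`
    """
    recomputed = two_tailed_p_from_z(z)
    # Half a unit in the last reported decimal place, plus float slack.
    tolerance = 0.5 * 10 ** (-decimals) + 1e-12
    consistent = abs(recomputed - reported_p) <= tolerance
    gross = (not consistent) and ((reported_p < alpha) != (recomputed < alpha))
    return recomputed, consistent, gross


if __name__ == "__main__":
    # Made-up reports for illustration: (z statistic, reported p)
    for z, p in [(1.96, 0.05), (2.20, 0.01), (1.50, 0.04)]:
        recomputed, consistent, gross = check_reported_p(z, p)
        print(f"z={z:.2f} reported p={p:.2f} recomputed p={recomputed:.4f} "
              f"consistent={consistent} gross={gross}")
```

The three rows in the demo are invented, not taken from the paper. The real package also handles t, F, chi-square and correlation tests, which need their own distributions.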

I did puzzle over the proposed solutions:

Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

All of those are good suggestions, but we already have the much-valued process of “peer review” and the value-add of both non-profit and commercial publishers. Surely those weighty contributions to the process of review and publication should be enough to quell this “…systematic bias in favor of significant results.”

Unless, of course, dependence on “peer review” and the value-add of publishers for article quality is entirely misplaced. Yes?

What area with “p-values reported as significant” will fall to statcheck next?