$175K to Identify Plankton

December 21st, 2014

Oregon marine researchers offer $175,000 reward for ‘big data’ solution to identifying plankton by Kelly House.

From the post:

The marine scientists at Oregon State University need to catalog tens of millions of plankton photos, and they’re willing to pay good money to anyone willing to do the job.

The university’s Hatfield Marine Science Center on Monday announced the launch of the National Data Science Bowl, a competition that comes with a $175,000 reward for the best “big data” approach to sorting through the photos.

It’s a job that, done by human hands, would take two lifetimes to finish.

Data crunchers have 90 days to complete their task. Authors of the top three algorithms will share the $175,000 purse and Hatfield will gain ownership of their algorithms.

From the competition description:

The 2014/2015 National Data Science Bowl challenges you to create an image classification algorithm to automatically classify plankton species. This challenge is not easy— there are 100 classes of plankton, the images may contain non-plankton organisms and particles, and the plankton can appear in any orientation within three-dimensional space. The winning algorithms will be used by Hatfield Marine Science Center for simpler, faster population assessment. They represent a $1 million in-kind donation by the data science community!

There is a comprehensive tutorial to get you started and weekly blog posts on the contest.

You may also see this billed as the first National Data Science Bowl.

The contest runs from December 15, 2014 until March 16, 2015.

Competing is free and even if you don’t win the big prize, you will have gained valuable experience from the tutorials and discussions during the contest.

I first saw this in a tweet by Gregory Piatetsky

Our Favorite Maps of the Year Cover Everything From Bayous to Bullet Trains

December 20th, 2014

Our Favorite Maps of the Year Cover Everything From Bayous to Bullet Trains by Greg Miller (Wired MapLab)

From the post:

What makes a great map? It depends, of course, on who’s doing the judging. Teh internetz loves a map with dazzling colors and a simple message, preferably related to some pop-culture phenomenon. Professional mapmakers love a map that’s aesthetically pleasing and based on solid principles of cartographic design.

We love maps that have a story to tell, the kind of maps where the more you look the more you see. Sometimes we fall for a map mostly because of the data behind it. Sometimes, we’re not ashamed to say, we love a map just for the way it looks. Here are some of the maps we came across this year that captivated us with their brains, their beauty, and in many cases, both.

First, check out the animated map below to see a day’s worth of air traffic over the UK, then toggle the arrow at top right to see the rest of the maps in fullscreen mode.

The “arrow at top right” refers to an arrow that appears when you mouse over the map of the United States at the top of the post. An impressive collection of maps!

For an even more impressive display of air traffic:

Bear in mind that there are approximately 93,000 flights per day, zero (0) of which are troubled by terrorists. The next time your leaders decry terrorism, do remember to ask where?

Creating Tor Hidden Services With Python

December 20th, 2014

Creating Tor Hidden Services With Python by Jordan Wright.

From the post:

Tor is often used to protect the anonymity of someone who is trying to connect to a service. However, it is also possible to use Tor to protect the anonymity of a service provider via hidden services. These services, operating under the .onion TLD, allow publishers to anonymously create and host content viewable only by other Tor users.

The Tor project has instructions on how to create hidden services, but this can be a manual and arduous process if you want to setup multiple services. This post will show how we can use the fantastic stem Python library to automatically create and host a Tor hidden service.

If you are interested in the Tor network, this is a handy post to bookmark.

I was thinking about exploring the Tor network in the new year but you should be aware of a more recent post by Jordan:

What Happens if Tor Directory Authorities Are Seized?

From the post:

The Tor Project has announced that they have received threats about possible upcoming attempts to disable the Tor network through the seizure of Directory Authority (DA) servers. While we don’t know the legitimacy behind these threats, it’s worth looking at the role DA’s play in the Tor network, showing what effects their seizure could have on the Tor network.*

Nothing to panic about, yet, but if you know anyone you can urge to protect Tor, do so.

Mapazonia (Mapping the Amazon)

December 20th, 2014

Mapazonia (Mapping the Amazon)

From the about page:

Mapazonia has the aim of improve the OSM data in the Amazon region, using satellite images to map roads and rivers geometry.

A detailed cartography will help many organizations that are working in the Amazon to accomplish their objectives. Together we can collaborate to look after the Amazon and its inhabitants.

The project was born as an initiative of the Latinamerican OpenStreetMap Community with the objective of go ahead with collaborative mapping of common areas and problems in the continent.

We use the Tasking Manager of the Humanitarian OpenStreetMap Team to define the areas where we are going to work. Furthermore we will organize mapathons to teach the persons how to use the tools of collaborative mapping.

Normally I am a big supporter of mapping and especially crowd-sourced mapping projects.

However, a goal of an improved mapping of the Amazon makes me wonder who benefits from such a map?

The local inhabitants have known their portions of the Amazon for centuries well enough for their purposes. So I don’t think they are going to benefit from such a map for their day to day activities.

Hmmm, hmmm, who else might benefit from such a map? I haven’t seen any discussion of that topic in the mailing list archives. There seems to be a great deal of enthusiasm for the project, which is a good thing, but little awareness of potential future uses.

Who uses maps of as of yet not well mapped places? Oil, logging, and mining companies, just to name of few of the more pernicious users of maps that come to mind.

To say that the depredations of such users will be checked by government regulations is a jest too cruel for laughter.

There is a valid reason why maps were historically considered as military secrets. One’s opponent could use them to better plan their attacks.

An accurate map of the Amazon will be putting the Amazon directly in the cross-hairs of multiple attackers, with no effective defenders in sight. The Amazon may become as polluted as some American waterways but being unmapped will delay that unhappy day.

I first saw this in a tweet by Alex Barth.

Leading from the Back: Making Data Science Work at a UX-driven Business

December 20th, 2014

Leading from the Back: Making Data Science Work at a UX-driven Business by John Foreman. (Microsoft Visiting Speaker Series)

The first thirty (30) minutes are easily the best ones I have spent on a video this year. (I haven’t finished the Q&A part yet.)

John is a very good speaker but in part his presentation is fascinating because it illustrates how to “sell” data analysis to customers (internal and external).

You will find that while John can do the math, he is also very adept at delivering value to his customer.

Not surprisingly, customers are less interested in bells and whistles or your semantic religion and more interested in value as they perceive it.

Catch the switch in point of view, it isn’t value from your point of view but the customer’s point of view.

You need to set aside some time to watch at least the first thirty minutes of this presentation.

BTW, John Foreman is the author of Data Smart, which he confesses is “not sexy.”

I first saw this in a tweet by Microsoft Research.

Teaching Deep Convolutional Neural Networks to Play Go

December 20th, 2014

Teaching Deep Convolutional Neural Networks to Play Go by Christopher Clark and Amos Storkey.


Mastering the game of Go has remained a long standing challenge to the field of AI. Modern computer Go systems rely on processing millions of possible future positions to play well, but intuitively a stronger and more ‘humanlike’ way to play the game would be to rely on pattern recognition abilities rather then brute force computation. Following this sentiment, we train deep convolutional neural networks to play Go by training them to predict the moves made by expert Go players. To solve this problem we introduce a number of novel techniques, including a method of tying weights in the network to ‘hard code’ symmetries that are expect to exist in the target function, and demonstrate in an ablation study they considerably improve performance. Our final networks are able to achieve move prediction accuracies of 41.1% and 44.4% on two different Go datasets, surpassing previous state of the art on this task by significant margins. Additionally, while previous move prediction programs have not yielded strong Go playing programs, we show that the networks trained in this work acquired high levels of skill. Our convolutional neural networks can consistently defeat the well known Go program GNU Go, indicating it is state of the art among programs that do not use Monte Carlo Tree Search. It is also able to win some games against state of the art Go playing program Fuego while using a fraction of the play time. This success at playing Go indicates high level principles of the game were learned.

If you are going to pursue the study of Monte Carlo Tree Search for semantic purposes, there isn’t any reason to not enjoy yourself as well. ;-)

And following the best efforts in game playing will be educational as well.

I take the efforts at playing Go by computer as well as those for chess, as indicating how far ahead humans are to AI.

Both of those two-player, complete knowledge games were mastered long ago by humans. Multi-player games with extended networds of influence and motives, not to mention incomplete information as well, seem securely reserved for human players for the foreseeable future. (I wonder if multi-player scenarios are similar to the multi-body problem in physics? Except with more influences.)

I first saw this in a tweet by Ebenezer Fogus.

Monte-Carlo Tree Search for Multi-Player Games [Semantics as Multi-Player Game]

December 20th, 2014

Monte-Carlo Tree Search for Multi-Player Games by Joseph Antonius Maria Nijssen.

From the introduction:

The topic of this thesis lies in the area of adversarial search in multi-player zero-sum domains, i.e., search in domains having players with conflicting goals. In order to focus on the issues of searching in this type of domains, we shift our attention to abstract games. These games provide a good test domain for Artificial Intelligence (AI). They offer a pure abstract competition (i.e., comparison), with an exact closed domain (i.e., well-defined rules). The games under investigation have the following two properties. (1) They are too complex to be solved with current means, and (2) the games have characteristics that can be formalized in computer programs. AI research has been quite successful in the field of two-player zero-sum games, such as chess, checkers, and Go. This has been achieved by developing two-player search techniques. However, many games do not belong to the area where these search techniques are unconditionally applicable. Multi-player games are an example of such domains. This thesis focuses on two different categories of multi-player games: (1) deterministic multi-player games with perfect information and (2) multi-player hide-and-seek games. In particular, it investigates how Monte-Carlo Tree Search can be improved for games in these two categories. This technique has achieved impressive results in computer Go, but has also shown to be beneficial in a range of other domains.

This chapter is structured as follows. First, an introduction to games and the role they play in the field of AI is provided in Section 1.1. An overview of different game properties is given in Section 1.2. Next, Section 1.3 defines the notion of multi-player games and discusses the two different categories of multi-player games that are investigated in this thesis. A brief introduction to search techniques for two-player and multi-player games is provided in Section 1.4. Subsequently, Section 1.5 defines the problem statement and four research questions. Finally, an overview of this thesis is provided in Section 1.6.

This thesis is great background reading on the use of Monte-Carol tree search in games. While reading the first chapter, I realized that assigning semantics to a token is an instance of a multi-player game with hidden information. That is the “semantic” of any token doesn’t exist in some Platonic universe but rather is the result of some N number of players who also accept a particular semantic for some given token in a particular context. And we lack knowledge of the semantic and the reasons for it that will be assigned by some N number of players, which may change over time and context.

The semiotic triangle of Ogden and Richards (The Meaning of Meaning):


for any given symbol, represents the view of a single speaker. But as Ogden and Richards note, what is heard by listeners should be represented by multiple semiotic triangles:

Normally, whenever we hear anything said we spring spontaneously to an immediate conclusion, namely, that the speaker is referring to what we should be referring to were we speaking the words ourselves. In some cases this interpretation may be correct; this will prove to be what he has referred to. But in most discussions which attempt greater subtleties than could be handled in a gesture language this will not be so. (The Meaning of Meaning, page 15 of the 1923 edition)

Is RDF/OWL more subtle than can be handled by a gesture language? If you think so then you have discovered one of the central problems with the Semantic Web and any other universal semantic proposal.

Not that topic maps escape a similar accusation, but with topic maps you can encode additional semiotic triangles in an effort to avoid confusion, at least to the extent of funding and interest. And if you aren’t trying to avoid confusion, you can supply semiotic triangles that reach across understandings to convey additional information.

You can’t avoid confusion altogether nor can you achieve perfect communication with all listeners. But, for some defined set of confusions or listeners, you can do more than simply repeat your original statements in a louder voice.

Whether Monte-Carlo Tree searches will help deal with the multi-player nature of semantics isn’t clear but it is an alternative to repeating “…if everyone would use the same (my) system, the world would be better off…” ad nauseam.

I first saw this in a tweet by Ebenezer Fogus.

Linked Open Data Visualization Revisited: A Survey

December 20th, 2014

Linked Open Data Visualization Revisited: A Survey by Oscar Peña, Unai Aguilera and Diego López-de-Ipiña.


Mass adoption of the Semantic Web’s vision will not become a reality unless the benefits provided by data published under the Linked Open Data principles are understood by the majority of users. As technical and implementation details are far from being interesting for lay users, the ability of machines and algorithms to understand what the data is about should provide smarter summarisations of the available data. Visualization of Linked Open Data proposes itself as a perfect strategy to ease the access to information by all users, in order to save time learning what the dataset is about and without requiring knowledge on semantics.

This article collects previous studies from the Information Visualization and the Exploratory Data Analysis fields in order to apply the lessons learned to Linked Open Data visualization. Datatype analysis and visualization tasks proposed by Ben Shneiderman are also added in the research to cover different visualization features.

Finally, an evaluation of the current approaches is performed based on the dimensions previously exposed. The article ends with some conclusions extracted from the research.

I would like to see a version of this article after it has had several good editing passes. From the abstract alone, “…benefits provided by data…” and “…without requiring knowledge on semantics…” strike me as extremely problematic.

Data, accessible or not, does not provide benefits. The results of processing data may, which may explain the lack of enthusiasm when large data dumps are made web accessible. In and of itself, it is just another large dump of data. The results of processing that data may be very useful, but that is another step in the process.

I don’t think “…without requiring knowledge of semantics…” is in line with the rest of the article. I suspect the authors meant the semantics of data sets could be conveyed to users without their researching them prior to using the data set. I think that is problematic but it has the advantage of being plausible.

The various theories of visualization and datatypes (pages 3-8) don’t seem to advance the discussion and I would either drop that content or tie it into the actual visualization suites discussed. It’s educational but its relationship to the rest of the article is tenuous.

The coverage of visualization suites is encouraging and useful, but with an overall tighter focus, more time could be spent on each one and their entries being correspondingly longer.

Hopefully we will see a later, edited version of this paper as a good summary/guide to visualization tools for linked data would be a useful addition to the literature.

I first saw this in a tweet by Marin Dimitrov.

BigDataScript: a scripting language for data pipelines

December 19th, 2014

BigDataScript: a scripting language for data pipelines by Pablo Cingolani, Rob Sladek, and Mathieu Blanchette.


Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.

Availability and implementation: BigDataScript is available under open-source license at http://pcingola.github.io/BigDataScript.

How would you compare this pipeline proposal to: XProc 2.0: An XML Pipeline Language?

I prefer XML solutions because I can reliably point to an element or attribute to endow it with explicit semantics.

While explicit semantics is my hobby horse, it may not be yours. Curious how you view this specialized language for bioinformatics pipelines?

I first saw this in a tweet by Pierre Lindenbaum.

Terms Defined in the W3C HTML5 Recommendation

December 19th, 2014

Terms Defined in the W3C HTML5 RecommendationHTML5 Recommendation.

I won’t say this document has much of a plot or that it is an easy read. ;-)

If you are using HTML5, however, this should either be a bookmark or open in your browser.


I first saw this in a tweet by AdobeWebCC.

DeepSpeech: Scaling up end-to-end speech recognition [Is Deep the new Big?]

December 19th, 2014

DeepSpeech: Scaling up end-to-end speech recognition by Awni Hannun, et al.


We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a “phoneme.” Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called DeepSpeech, outperforms previously published results on the widely studied Switchboard Hub5’00, achieving 16.5% error on the full test set. DeepSpeech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Although the academic papers, so far, are using “deep learning” in a meaningful sense, early 2015 is likely to see many vendors rebranding their offerings as incorporating or being based on deep learning.

When approached with any “deep learning” application or service, check out the Internet Archive WayBack Machine to see how they were marketing their software/service before “deep learning” became popular.

Is there a GPU-powered box in your future?

I first saw this in a tweet by Andrew Ng.

Update: After posting I encountered: Baidu claims deep learning breakthrough with Deep Speech by Derrick Harris. Talks to Andrew Ng, great write-up.

The top 10 Big data and analytics tutorials in 2014

December 19th, 2014

The top 10 Big data and analytics tutorials in 2014 by Sarah Domina.

From the post:

At developerWorks, our Big data and analytics content helps you learn to leverage the tools and technologies to harness and analyze data. Let’s take a look back at the top 10 tutorials from 2014, in no particular order.

There are a couple of IBM product line specific tutorials but the majority of them you will enjoy whether you are an IBM shop or not.

Oddly enough, the post for the top ten (10) in 2014 was made on 26 September 2014.

Either Watson is far better than I have ever imagined or IBM has its own calendar.

In favor of an IBM calendar, I would point out that IBM has its own song.

A flag:


IBM ranks ahead of Morocco in terms of GDP at $99.751 billion.

Does IBM have its own calendar? Hard to say for sure but I would not doubt it. ;-)

Collection of CRS reports released to the public

December 19th, 2014

Collection of CRS reports released to the public by Kevin Kosar.

From the post:

Something rare has occurred—a collection of reports authored by the Congressional Research Service has been published and made freely available to the public. The 400-page volume, titled, “The Evolving Congress,” and was produced in conjunction with CRS’s celebration of its 100th anniversary this year. Congress, not CRS, published it. (Disclaimer: Before departing CRS in October, I helped edit a portion of the volume.)

The Congressional Research Service does not release its reports publicly. CRS posts its reports at CRS.gov, a website accessible only to Congress and its staff. The agency has a variety of reasons for this policy, not least that its statute does not assign it this duty. Congress, with ease, could change this policy. Indeed, it already makes publicly available the bill digests (or “summaries”) CRS produces at Congress.gov.

The Evolving Congress” is a remarkable collection of essays that cover a broad range of topic. Readers would be advised to start from the beginning. Walter Oleszek provides a lengthy essay on how Congress has changed over the past century. Michael Koempel then assesses how the job of Congressman has evolved (or devolved depending on one’s perspective). “Over time, both Chambers developed strategies to reduce the quantity of time given over to legislative work in order to accommodate Members’ other duties,” Koempel observes.

The NIH (National Institutes of Health) requires that NIH funded research be made available to the public. Other government agencies are following suite. Isn’t it time for the Congressional Research Service to make its publicly funded research available to the public that paid for it?

Congress needs to require it. Contact your member of Congress today. Ask for all Congressional Research Service reports, past, present and future be made available to the public.

You have already paid for the reports, why shouldn’t you be able to read them?

Senate Joins House In Publishing Legislative Information In Modern Formats [No More Sneaking?]

December 19th, 2014

Senate Joins House In Publishing Legislative Information In Modern Formats by Daniel Schuman.

From the post:

There’s big news from today’s Legislative Branch Bulk Data Task Force meeting. The United States Senate announced it would begin publishing text and summary information for Senate legislation, going back to the 113th Congress, in bulk XML. It would join the House of Representatives, which already does this. Both chambers also expect to have bill status information available online in XML format as well, but a little later on in the year.

This move goes a long way to meet the request made by a coalition of transparency organizations, which asked for legislative information be made available online, in bulk, in machine-processable formats. These changes, once implemented, will hopefully put an end to screen scraping and empower users to build impressive tools with authoritative legislative data. A meeting to spec out publication methods will be hosted by the Task Force in late January/early February.

The Senate should be commended for making the leap into the 21st century with respect to providing the American people with crucial legislative information. We will watch closely to see how this is implemented and hope to work with the Senate as it moves forward.

In addition, the Clerk of the House announced significant new information will soon be published online in machine-processable formats. This includes data on nominees, election statistics, and members (such as committee assignments, bioguide IDs, start date, preferred name, etc.) Separately, House Live has been upgraded so that all video is now in H.264 format. The Clerk’s website is also undergoing a redesign.

The Office of Law Revision Counsel, which publishes the US Code, has further upgraded its website to allow pinpoint citations for the US Code. Users can drill down to the subclause level simply by typing the information into their search engine. This is incredibly handy.

This is great news!

Law is a notoriously opaque domain and the process of creating it even more so. Getting the data is a great first step, parsing out steps in the process and their meaning is another. To say nothing of the content of the laws themselves.

Still, progress is progress and always welcome!

Perhaps citizen review will stop the Senate from sneaking changes past sleepy members of the House.

New in Cloudera Labs: SparkOnHBase

December 19th, 2014

New in Cloudera Labs: SparkOnHBase by Ted Malaska.

From the post:

Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.

In this post, we will share the work being done in Cloudera Labs to make integrating Spark and HBase super-easy in the form of the SparkOnHBase project. (As with everything else in Cloudera Labs, SparkOnHBase is not supported and there is no timetable for possible support in the future; it’s for experimentation only.) You’ll learn common patterns of HBase integration with Spark and see Scala and Java examples for each. (It may be helpful to have the SparkOnHBase repository open as you read along.)

Is it too late to amend my wish list to include an eighty-hour week with Spark? ;-)

This is an excellent opportunity to follow along with lab quality research on an important technology.

The Cloudera Labs discussion group strikes me as dreadfully under used.


A non-comprehensive list of awesome things other people did in 2014

December 19th, 2014

A non-comprehensive list of awesome things other people did in 2014 by Jeff Leek.

Thirty-eight (38) top resources from 2014! Ranging from data analysis and statistics to R and genomics and places in between.

If you missed or overlooked any of these resources during 2014, take the time to correct that error!

Thanks Jeff!

I first saw this in a tweet by Nicholas Horton.

XProc 2.0: An XML Pipeline Language

December 19th, 2014

XProc 2.0: An XML Pipeline Language W3C First Public Working Draft 18 December 2014


This specification describes the syntax and semantics of XProc 2.0: An XML Pipeline Language, a language for describing operations to be performed on documents.

An XML Pipeline specifies a sequence of operations to be performed on documents. Pipelines generally accept documents as input and produce documents as output. Pipelines are made up of simple steps which perform atomic operations on documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.

For your proofing responses:

Please report errors in this document by raising issues on the specification
. Alternatively, you may report errors in this document to the public mailing list public-xml-processing-model-comments@w3.org (public archives are available).

First drafts always need a close reading for omissions and errors. However, after looking at the editors of XProc 2.0, you aren’t likely to find any “cheap” errors. Makes proofing all the more fun.


XQuery, XPath, XQuery/XPath Functions and Operators 3.1

December 19th, 2014

XQuery, XPath, XQuery/XPath Functions and Operators 3.1 were published on 18 December 2014 as a call for implementation of these specifications.

The changes most often noted were the addition of capabilities for maps and arrays. “Support for JSON” means sections 17.4 and 17.5 of XPath and XQuery Functions and Operators 3.1.

XQuery 3.1 and XPath 3.1 depend on XPath and XQuery Functions and Operators 3.1 for JSON support. (Is there no acronym for XPath and XQuery Functions and Operators? Suggest XF&O.)

For your reading pleasure:

XQuery 3.1: An XML Query Language

    3.10.1 Maps.

    3.10.2 Arrays.

XML Path Language (XPath) 3.1

  1. 3.11.1 Maps
  2. 3.11.2 Arrays

XPath and XQuery Functions and Operators 3.1

  1. 17.1 Functions that Operate on Maps
  2. 17.3 Functions that Operate on Arrays
  3. 17.4 Conversion to and from JSON
  4. 17.5 Functions on JSON Data

Hoping that your holiday gifts include a large box of highlighters and/or a box of red pencils!

Oh, these specifications will “…remain as Candidate Recommendation(s) until at least 13 February 2015. (emphasis added)”

Less than two months so read quickly and carefully.


I first saw this in a tweet by Jonathan Robie.

The Top 10 Posts of 2014 from the Cloudera Engineering Blog

December 18th, 2014

The Top 10 Posts of 2014 from the Cloudera Engineering Blog by Justin Kestelyn.

From the post:

Our “Top 10″ list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie (although skew for publishing date—clearly, posts that publish earlier in the year have pole position—has to be taken into account).

In 2014, a strong interest in various new components that bring real time or near-real time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

See Justin’s post for the top ten (10) list!

The Cloudera blog always has high quality content so this the cream of the crop!


Announcing Apache Storm 0.9.3

December 18th, 2014

Announcing Apache Storm 0.9.3 by Taylor Goetz

From the post:

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. Apache Storm brings real-time data processing capabilities to help capture new business opportunities by powering low-latency dashboards, security alerts, and operational enhancements integrated with other applications running in the Hadoop cluster.


Now there’s an early holiday surprise!


GovTrack’s Summer/Fall Updates

December 18th, 2014

GovTrack’s Summer/Fall Updates by Josh Tauberer.

From the post:

Here’s what’s been improved on GovTrack in the summer and fall of this year.


  • Permalinks to individual paragraphs in bill text is now provided (example).
  • We now ask for your congressional district so that we can customize vote and bill pages to show how your Members of Congress voted.
  • Our bill action/status flow charts on bill pages now include activity on certain related bills, which are often crucially important to the main bill.
  • The bill cosponsors list now indicates when a cosponsor of a bill is no longer serving (i.e. because of retirement or death).
  • We switched to gender neutral language when referring to Members of Congress. Instead of “congressman/woman”, we now use “representative.”
  • Our historical votes database (1979-1989) from voteview.com was refreshed to correct long-standing data errors.
  • We dropped support for Internet Explorer 6 in order to address with POODLE SSL security vulnerability that plagued most of the web.
  • We dropped support for Internet Explorer 7 in order to allow us to make use of more modern technologies, which has always been the point of GovTrack.

The comment I posted was:

Great work! But I read the other day about legislation being “snuck” by the House (Senate changes), US Congress OKs ‘unprecedented’ codification of warrantless surveillance.

Do you have plans for a diff utility that warns members of either house of changes to pending legislation?

In case you aren’t familiar with GovTrack.us.

From the about page:

GovTrack.us, a project of Civic Impulse, LLC now in its 10th year, is one of the worldʼs most visited government transparency websites. The site helps ordinary citizens find and track bills in the U.S. Congress and understand their representatives’ legislative record.

In 2013, GovTrack.us was used by 8 million individuals. We sent out 3 million legislative update email alerts. Our embeddable widgets were deployed on more than 80 official websites of Members of Congress.

We bring together the status of U.S. federal legislation, voting records, congressional district maps, and more (see the table at the right).
and make it easier to understand. Use GovTrack to track bills for updates or get alerts about votes with email updates and RSS feeds. We also have unique statistical analyses to put the information in context. Read the «Analysis Methodology».

GovTrack openly shares the data it brings together so that other websites can build other tools to help citizens engage with government. See the «Developer Documentation» for more.

A Survey of Monte Carlo Tree Search Methods

December 18th, 2014

A Survey of Monte Carlo Tree Search Methods by Cameron Browne, et al.


Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm’s derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.

At almost fifty (50) pages, this review of the state of the art for MCTS research as of 2012, should keep even dedicated readers occupied for several days. The extensive bibliography will enhance your reading experience!

I first saw this in a tweet by Ebenezer Fogus.

Google’s alpha-stage email encryption plugin lands on GitHub

December 18th, 2014

Google’s alpha-stage email encryption plugin lands on GitHub by David Meyer.

From the post:

Google has updated its experimental End-to-End email encryption plugin for Chrome and moved the project to GitHub. The firm said in a Tuesday blog post that it had “always believed strongly that End-To-End must be an open source project.” The alpha-stage, OpenPGP-based extension now includes the first contributions from Yahoo’s chief security officer, Alex Stamos. Google will also make its new crypto library available to several other projects that have expressed interest. However, product manager Stephan Somogyi said the plugin still wasn’t ready for the Chrome Web Store, and won’t be widely released until Google is happy with the usability of its key distribution and management mechanisms.

Not to mention that being open source makes it harder to lean on management to make compromises to suit governments. Imagine that, the strength to resist tyranny in openness.

If you are looking for a “social good” project for 2015, it is hard to imagine a better one in the IT area.


December 18th, 2014


From the homepage:

DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage on domain-specific knowledge and incorporates user feedback to improve the quality of its analysis.

DeepDive differs from traditional systems in several ways:

  • DeepDive is aware that data is often noisy and imprecise: names are misspelled, natural language is ambiguous, and humans make mistakes. Taking such imprecisions into account, DeepDive computes calibrated probabilities for every assertion it makes. For example, if DeepDive produces a fact with probability 0.9 it means the fact is 90% likely to be true.
  • DeepDive is able to use large amounts of data from a variety of sources. Applications built using DeepDive have extracted data from millions of documents, web pages, PDFs, tables, and figures.
  • DeepDive allows developers to use their knowledge of a given domain to improve the quality of the results by writing simple rules that inform the inference (learning) process. DeepDive can also take into account user feedback on the correctness of the predictions, with the goal of improving the predictions.
  • DeepDive is able to use the data to learn "distantly". In contrast, most machine learning systems require tedious training for each prediction. In fact, many DeepDive applications, especially at early stages, need no traditional training data at all!
  • DeepDive’s secret is a scalable, high-performance inference and learning engine. For the past few years, we have been working to make the underlying algorithms run as fast as possible. The techniques pioneered in this project
    are part of commercial and open source tools including MADlib, Impala, a product from Oracle, and low-level techniques, such as Hogwild!. They have also been included in Microsoft's Adam.

This is an example of why I use Twitter for current awareness. My odds for encountering DeepDive on a web search, due primarily to page-ranked search results, are very, very low. From the change log, it looks like DeepDive was announced in March of 2014, which isn’t very long to build up a page-rank.

You do have to separate the wheat from the chaff with Twitter, but DeepDive is an example of what you may find. You won’t find it with search, not for another year or two, perhaps longer.

How does that go? He said he had a problem and was going to use search to find a solution? Now he has two problems? ;-)

I first saw this in a tweet by Stian Danenbarger.

PS: Take a long and careful look at DeepDive. Unless I find other means, I am likely to be using DeepDive to extract text and the redactions (character length) from a redacted text.

Michael Brown – Grand Jury Witness Index – Part 1

December 17th, 2014

I have completed the first half of the grand jury witness index for the Michael Brown case, covering volumes 1 – 12. (index volumes 13 -24, forthcoming)

The properties with each witness, along with others, will be used to identify that witness using a topic map.

Donate here to support this ongoing effort.

  1. Volume 1 Page 25 Line: 7 – Medical legal investigator – His report is Exhibit #1. (in released documents, 2014-5143-narrative-report-01.pdf)

  2. Volume 2 Page 20 Line: 6 – Crime Scene Detective with St. Louis County Police
  3. Volume 3 Page 7 Line: 7 – Crime Scene Detective with St. Louis County Police – 22 years with St. Louis – 14 years as crime scene detective
  4. Volume 3 Page 51 Line: 12 – Forensic Pathologist – St Louis City Medical Examiner’s Office (assistant medical examiner)
  5. Volume 4 Page 17 Line: 7 – Dorian Johnson
  6. Volume 5 Page 12 Line: 9 – Police Sergent – Ferguson Police – Since December 2001 (Volume-5 Page 14 – Prepared no written report)
  7. Volume 5 Page 75 Line: 11 – Detective St. Louis Police Department Two and 1/2 years
  8. Volume 5 Page 140 Line: 11 – Female FBI agent three and one-half years
  9. Volume 5 Page 196 Line: 23 – Darren Wilson (Volume-5 Page 197 talked to prosecutor before appearing)
  10. Volume 6 Page 149 Line: 18 – Witness #10
  11. Volume 6 Page 232 Line: 5 – Witness with marketing firm
  12. Volume 7 Page 9 Line: 1 – Canfield Green Apartments (female, no #)
  13. Volume 7 Page 153 Line: 9 – coming from a young lady’s house, passenger in white Monte Carlo
  14. Volume 8 Page 97 Line: 14 – Canfield Green Apartments, second floor, collecting Social Security, brother and his wife come over
  15. Volume 8 Page 173 Line: 9 – Detective St. Louis County Police Department – Since March 2008 (as detective) **primary case officer**
  16. Volume 8 Page 196 Line: 2 – Previously testified on Sept. 9th, page 7 Crime Scene Detective with St. Louis County Police – 22 years with St. Louis – 14 years as crime scene detective
  17. Volume 9 Page 7 Line: 7 – Sales consultant – Canfield Drive
  18. Volume 9 Page 68 Line: 15 – Visitor to Canfield Green Apartment Complex with wife
  19. Volume-10 Page 7 Line: 10 – Wife of witness in volume 9? visitor to complex
  20. Volume-10 Page 68 Line: 24 – Police officer, St. Louis County Police Department, assigned as a firearm and tool mark examiner in the crime laboratory.
  21. Volume-10 Page 128 Line: 8 – Detective, Crime Scene Unit for St. Louis County, 18 years as police officer, 3 years with crime scene – photographed Darren Wilson
  22. Volume-11 Page 6 Line: 21 – Canfield Apartment Complex, top floor, Living with girlfriend
  23. Volume-11 Page 59 Line: 7 – Girlfriend of witness at volume 11, page 6 – prosecutor has her renounce prior statements
  24. Volume-11 Page 80 Line: 7 – Drug chemist – crime lab
  25. Volume-11 Page 111 Line: 7 – Latent (fingerprint) examiner for the St. Louis County Police Department.
  26. Volume-11 Page 137 Line: 7 – Canfield Green Apartment Complex, fiancee for 3 1/2 to 4 years, south end of building, one floor above them, has children (boys)
  27. Volume-11 Page 169 Line: 16 – Doesn’t live at the Canfield Apartments, returning on August 9th to return?, in a van with husband, two daughters and granddaughter
  28. Volume-12 Page 11 Line: 7 – Husband of the witness driving the van, volume 11, page 169
  29. Volume-12 Page 51 Line: 15 – Special agent with the FBI assigned to the St. Louis field office, almost 24 years
  30. Volume-12 Page 102 Line: 18 – Lives in Northwinds Apartments, white ’99 Monte Carlo
  31. Volume-12 Page 149 Line: 6 – Contractor, retaining wall and brick patios

Caution: This list presents witnesses as they appeared and does not include the playing of prior statements and interviews. Those will be included in a separate index of statements because they play a role in identifying the witnesses who appeared before the grand jury.

The outcome of the Michael Brown grand jury was not the fault of the members of the grand jury. It was a result that was engineered by departing from usual and customary practices, distortion of evidence and misleading the grand jury about applicable law, among other things. All of that is hiding in plain sight in the grand jury transcripts.

Other Michael Brown Posts

Missing From Michael Brown Grand Jury Transcripts December 7, 2014. (The witness index I propose to replace.)

New recordings, documents released in Michael Brown case [LA Times Asks If There’s More?] Yes! December 9, 2014 (before the latest document dump on December 14, 2014).

Michael Brown Grand Jury – Presenting Evidence Before Knowing the Law December 10, 2014.

How to Indict Darren Wilson (Michael Brown Shooting) December 12, 2014.

More Missing Evidence In Ferguson (Michael Brown) December 15, 2014.

Michael Brown – Grand Jury Witness Index – Part 1 December 17, 2014. (above)

History & Philosophy of Computational and Genome Biology

December 17th, 2014

History & Philosophy of Computational and Genome Biology by Mark Boguski.

A nice collection of books and articles on computational and genome biology. It concludes with this anecdote:

Despite all of the recent books and biographies that have come out about the Human Genome Project, I think there are still many good stories to be told. One of them is the origin of the idea for whole-genome shotgun and assembly. I recall a GRRC (Genome Research Review Committee) review that took place in late 1996 or early 1997 where Jim Weber proposed a whole-genome shotgun approach. The review panel, at first, wanted to unceremoniously “NeRF” (Not Recommend for Funding) the grant but I convinced them that it deserved to be formally reviewed and scored, based on Jim’s pioneering reputation in the area of genetic polymorphism mapping and its impact on the positional cloning of human disease genes and the origins of whole-genome genotyping. After due deliberation, the GRRC gave the Weber application a non-fundable score (around 350 as I recall) largely on the basis of Weber’s inability to demonstrate that the “shotgun” data could be assembled effectively.

Some time later, I was giving a ride to Jim Weber who was in Bethesda for a meeting. He told me why his grant got a low score and asked me if I knew any computer scientists that could help him address the assembly problem. I suggested he talk with Gene Myers (I knew Gene and his interests well since, as one of the five authors of the BLAST algorithm, he was a not infrequent visitor to NCBI).

The following May, Weber and Myers submitted a “perspective” for publication in Genome Research entitled “Human whole-genome shotgun sequencing“. This article described computer simulations which showed that assembly was possible and was essentially a rebuttal to the negative review and low priority score that came out of the GRRC. The editors of Genome Research (including me at the time) sent the Weber/Myers article to Phil Green (a well-known critic of shotgun sequencing) for review. Phil’s review was extremely detailed and actually longer that the Weber/Myers paper itself! The editors convinced Phil to allow us to publish his critique entitled “Against a whole-genome shotgun” as a point-counterpoint feature alongside the Weber-Myers article in the journal.

The rest, as they say, is history, because only a short time later, Craig Venter (whose office at TIGR had requested FAX copies of both the point and counterpoint as soon as they were published) and Mike Hunkapiller announced their shotgun sequencing and assembly project and formed Celera. They hired Gene Myers to build the computational capabilities and assemble their shotgun data which was first applied to the Drosophila genome as practice for tackling a human genome which, as is now known, was Venter’s own. Three of my graduate students (Peter Kuehl, Jiong Zhang and Oxana Pickeral) and I participated in the Drosophila annotation “jamboree” (organized by Mark Adams of Celera and Gerry Rubin) working specifically on an analysis of the counterparts of human disease genes in the Drosophila genome. Other aspects of the Jamboree are described in a short book by one of the other participants, Michael Ashburner.

The same type of stories exist not only from the early days of computer science but since then as well. Stories that will capture the imaginations of potential CS majors as well as illuminate areas where computer science can or can’t be useful.

How many of those stories have you captured?

I first saw this in a tweet by Neil Saunders.

U.S. Says Europeans Tortured by Assad’s Death Machine

December 17th, 2014

U.S. Says Europeans Tortured by Assad’s Death Machine by Josh Rogin.

From the post:

The U.S. State Department has concluded that up to 10 European citizens have been tortured and killed while in the custody of the Syrian regime and that evidence of their deaths could be used for war crimes prosecutions against Bashar al-Assad in several European countries.

The new claim, made by the State Department’s ambassador at large for war crimes, Stephen Rapp, in an interview with me, is based on a newly completed FBI analysis of 27,000 photographs smuggled out of Syria by the former military photographer known as “Caesar.” The photos show evidence of the torture and murder of over 11,000 civilians in custody. The FBI spent months pouring over the photos and comparing them to consular databases with images of citizens from countries around the world.

Last month, the FBI gave the State Department its report, which included a group of photos that had been tentatively matched to individuals who were already in U.S. government files. “The group included multiple individuals who were non-Syrian, but none who had a birthplace in the United States, according to our information,” Rapp told me. “There were Europeans within that group.”

The implications could be huge for the international drive to prosecute Assad and other top Syrian officials for war crimes and crimes against humanity. While it’s unlikely that multilateral organizations such as the United Nations or the International Criminal Court will pursue cases against Assad in the near term, due to opposition by Assad’s allies including Russia, legal cases against the regime could be brought in individual countries whose citizens were victims of torture and murder.

Is this a “heads up” from the State Department that lists of war criminals in the CIA Torture Report should be circulated in European countries?

Even if they won’t be actively prosecuted, the threat of arrest might help keep Europe free of known American war criminals. Unfortunately that would mean they would still be in the United States but the American public supported them so that seems fair.

I first saw this in a tweet by the U.S. Dept. of Fear.

Endless Parentheses

December 17th, 2014

Endless Parentheses

From the about page:

Endless Parentheses is a blog about Emacs. It features concise posts on improving your productivity and making Emacs life easier in general.

Code included is predominantly emacs-lisp, lying anywhere in the complexity spectrum with a blatant disregard for explanations or tutorials. The outcome is that the posts read quickly and pleasantly for experienced Emacsers, while new enthusiasts are invited to digest the code and ask questions in the comments.

What you can expect:

  • Posts are always at least weekly, coming out on every weekend and on the occasional Wednesday.
  • Posts are always about Emacs. Within this constraint you can expect anything, from sophisticated functions to brief comments on my keybind preferences.
  • Posts are usually short, 5-minute reads, as opposed to 20+-minute investments. Don’t expect huge tutorials.

The editor if productivity is your goal.

I first saw this blog mentioned in a tweet by Anna Pawlicka.

Learn Physics by Programming in Haskell

December 17th, 2014

Learn Physics by Programming in Haskell by Scott N. Walck.


We describe a method for deepening a student’s understanding of basic physics by asking the student to express physical ideas in a functional programming language. The method is implemented in a second-year course in computational physics at Lebanon Valley College. We argue that the structure of Newtonian mechanics is clarified by its expression in a language (Haskell) that supports higher-order functions, types, and type classes. In electromagnetic theory, the type signatures of functions that calculate electric and magnetic fields clearly express the functional dependency on the charge and current distributions that produce the fields. Many of the ideas in basic physics are well-captured by a type or a function.

A nice combination of two subjects of academic importance!

Anyone working on the use of the NLTK to teach David Copperfield or Great Expectations? ;-)

I first saw this in a tweet by José A. Alonso.

Orleans Goes Open Source

December 17th, 2014

Orleans Goes Open Source

From the post:

Since the release of the Project “Orleans” Public Preview at //build/ 2014 we have received a lot of positive feedback from the community. We took your suggestions and fixed a number of issues that you reported in the Refresh release in September.

Now we decided to take the next logical step, and do the thing many of you have been asking for – to open-source “Orleans”. The preparation work has already commenced, and we expect to be ready in early 2015. The code will be released by Microsoft Research under an MIT license and published on GitHub. We hope this will enable direct contribution by the community to the project. We thought we would share the decision to open-source “Orleans” ahead of the actual availability of the code, so that you can plan accordingly.

The real excitement for me comes from a post just below this announcement: A Framework for Cloud Computing,

To avoid these complexities, we built the Orleans programming model and runtime, which raises the level of the actor abstraction. Orleans targets developers who are not distributed system experts, although our expert customers have found it attractive too. It is actor-based, but differs from existing actor-based platforms by treating actors as virtual entities, not as physical ones. First, an Orleans actor always exists, virtually. It cannot be explicitly created or destroyed. Its existence transcends the lifetime of any of its in-memory instantiations, and thus transcends the lifetime of any particular server. Second, Orleans actors are automatically instantiated: if there is no in-memory instance of an actor, a message sent to the actor causes a new instance to be created on an available server. An unused actor instance is automatically reclaimed as part of runtime resource management. An actor never fails: if a server S crashes, the next message sent to an actor A that was running on S causes Orleans to automatically re-instantiate A on another server, eliminating the need for applications to supervise and explicitly re-create failed actors. Third, the location of the actor instance is transparent to the application code, which greatly simplifies programming. And fourth, Orleans can automatically create multiple instances of the same stateless actor, seamlessly scaling out hot actors.

Overall, Orleans gives developers a virtual “actor space” that, analogous to virtual memory, allows them to invoke any actor in the system, whether or not it is present in memory. Virtualization relies on indirection that maps from virtual actors to their physical instantiations that are currently running. This level of indirection provides the runtime with the opportunity to solve many hard distributed systems problems that must otherwise be addressed by the developer, such as actor placement and load balancing, deactivation of unused actors, and actor recovery after server failures, which are notoriously difficult for them to get right. Thus, the virtual actor approach significantly simplifies the programming model while allowing the runtime to balance load and recover from failures transparently. (emphasis added)

Not in a distributed computing context but the “look and its there” model is something I recall from HyTime. So nice to see good ideas resurface!

Just imagine doing that with topic maps, including having properties of a topic, should you choose to look for them. If you don’t need a topic, why carry the overhead around? Wait for someone to ask for it.

This week alone, Microsoft continues its fight for users, announces an open source project that will make me at least read about .Net, ;-), I think Microsoft merits a lot of kudos and good wishes for the holiday season!

I first say this at: Microsoft open sources cloud framework that powers Halo by Jonathan Vanian.