Finger trees:…

July 23rd, 2014

Finger trees: a simple general-purpose data structure by Ralf Hinze and Ross Paterson.


We introduce 2-3 finger trees, a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece. Representations achieving these bounds have appeared previously, but 2-3 finger trees are much simpler, as are the operations on them. Further, by defining the split operation in a general form, we obtain a general-purpose data structure that can serve as a sequence, priority queue, search tree, priority search queue and more.
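To give a feel for the structure, here is a toy Python sketch of my own devising, not the paper's typed, lazy version: it covers only cons and the "digit" overflow at the front, while the real trees also support snoc, concat and split.

```python
# A toy Python sketch of 2-3 finger trees, front end only. Purely
# illustrative: the paper's version is typed and lazy, and also supports
# snoc, concat and split; this shows only cons and the digit overflow.

class Empty:
    pass

class Single:
    def __init__(self, x):
        self.x = x

class Deep:
    def __init__(self, left, middle, right):
        self.left = left      # 1-4 elements: the left "fingers"
        self.middle = middle  # finger tree of 3-tuples (simplified 2-3 nodes)
        self.right = right    # 1-4 elements: the right "fingers"

def cons(x, t):
    """Add x at the front without mutating t (persistence)."""
    if isinstance(t, Empty):
        return Single(x)
    if isinstance(t, Single):
        return Deep([x], Empty(), [t.x])
    if len(t.left) < 4:
        return Deep([x] + t.left, t.middle, t.right)
    # Digit full: push three elements down as one node, keep two on top.
    return Deep([x, t.left[0]], cons(tuple(t.left[1:]), t.middle), t.right)

def _flatten(x):
    if isinstance(x, tuple):
        return [e for sub in x for e in _flatten(sub)]
    return [x]

def to_list(t):
    if isinstance(t, Empty):
        return []
    if isinstance(t, Single):
        return _flatten(t.x)
    return ([e for d in t.left for e in _flatten(d)]
            + to_list(t.middle)
            + [e for d in t.right for e in _flatten(d)])

t = Empty()
for i in reversed(range(10)):
    t = cons(i, t)
print(to_list(t))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The recursive push fires at most once per three insertions at each level, which is the intuition behind the amortized constant-time bound.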

Before the Hinze and Paterson article you may want to read: 2-3 finger trees in ASCII by Jens Nicolay.

Note 2-3 finger trees go unmentioned in Purely Functional Data Structures by Chris Okasaki.

Other omissions of note?

First World War Digital Resources

July 23rd, 2014

First World War Digital Resources by Christopher Phillips.

From the post:

The centenary of the First World War has acted as a catalyst for intense public and academic attention. One of the most prominent manifestations of this increasing interest in the conflict is in the proliferation of digital resources made available recently. Covering a range of national and internationally-focused websites, this review makes no pretence at comprehensiveness; indeed it will not cover the proliferation of locally-oriented sites such as the Tynemouth World War One Commemoration Project, or those on neutral territories like Switzerland and the First World War. Instead, this review will offer an introduction to some of the major repositories of information for both public and academic audiences seeking further understanding of the history of the First World War.

The Imperial War Museum (IWM) in London has been designated by the British government as the focal point of British commemorations of the war. The museum itself has been the recipient of a £35million refurbishment, and the IWM’s Centenary Website acts as a collecting point for multiple regional, national and international cultural and educational organisations through the First World War Centenary Partnership. This aspect of the site is a real triumph, providing a huge, regularly updated events calendar which demonstrates both the geographical spread and the variety of the cultural and academic contributions scheduled to take place over the course of the centenary.

Built upon the stunning visual collections held by the museum, the website contains a number of introductory articles on a wide range of subjects. In addition to the relatively familiar subjects of trenches, weaponry and poets, the website also provides contributions on the less-traditional aspects of the conflict. The varied roles taken by women, the ‘sideshow’ theatres of war outside the Western Front, and the myriad animals used by the armed forces are also featured. Although the many beautiful photographs and images from the IWM itself are individually recorded, the lack of a ‘further reading’ section to supplement the brief written descriptions is a weakness, particularly as the site is clearly geared towards those at an early stage in their research into the conflict (the site contains a number of advertisements for interactive talks at IWM sites aimed at students at KS3 and above).

The keystone of the IWM’s contribution to the centenary, however, is the Lives of the First World War project. Lives aims to create a ‘permanent digital memorial to more than eight million men and women from across Britain and the Commonwealth’ before the end of the centenary. Built upon the foundation of official medal index cards, the site relies upon contributions from the public, inputting data, photographs and information to help construct the ‘memorial’. Launched in February 2014, the database is currently sparsely populated, with very little added to the life stories of the majority of soldiers. Concentration at the moment appears to be on the more ‘celebrity’ soldiers of the war, men such as Captain Noel Chavasse and Wilfred Owen, upon whom significant research has already been undertaken. Although a search option is available to find individual soldiers by name, unit, or service number, the limitations of the search engine render a comparison of soldiers from the same city or from a shared workplace impossible. Lives is undoubtedly an ambitious project; however at this time there is little available for genealogists or academic researchers on the myriad stories still locked in attics and archives across Britain.

If you are interested in World War I and its history, this is an excellent starting point. Unlike military histories, the projects covered here paint a broader picture of the war, a picture that includes a wider cast of characters.

Awesome Big Data

July 23rd, 2014

Awesome Big Data by Onur Akpolat.

From the webpage:

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!


Great list of projects.

Curious to see if it develops enough community support to sustain the curation of the listing.

Finding resource collections like this one is so haphazard on the WWW that authors oftentimes duplicate the work of others. Not intentionally, just unaware of a similar resource.

Similar to the repeated questions that appear on newsgroups and email lists about basic commands or flaws in programs. The answer probably already exists in an archive or FAQ, but how is a new user to find it?

The social aspects of search and knowledge sharing are likely as important, if not more so, than the technologies we use to implement them.

Suggestions for reading on the social aspects of search and knowledge sharing?

Web Annotation Working Group Charter

July 23rd, 2014

Web Annotation Working Group Charter

From the webpage:

Annotating, which is the act of creating associations between distinct pieces of information, is a widespread activity online in many guises but currently lacks a structured approach. Web citizens make comments about online resources using either tools built into the hosting web site, external web services, or the functionality of an annotation client. Readers of ebooks make use of the tools provided by reading systems to add and share their thoughts or highlight portions of texts. Comments about photos on Flickr, videos on YouTube, audio tracks on SoundCloud, people’s posts on Facebook, or mentions of resources on Twitter could all be considered to be annotations associated with the resource being discussed.

The possibility of annotation is essential for many application areas. For example, it is standard practice for students to mark up their printed textbooks when familiarizing themselves with new materials; the ability to do the same with electronic materials (e.g., books, journal articles, or infographics) is crucial for the advancement of e-learning. Submissions of manuscripts for publication by trade publishers or scientific journals undergo review cycles involving authors and editors or peer reviewers; although the end result of this publishing process usually involves Web formats (HTML, XML, etc.), the lack of proper annotation facilities for the Web platform makes this process unnecessarily complex and time consuming. Communities developing specifications jointly, and published, eventually, on the Web, need to annotate the documents they produce to improve the efficiency of their communication.

There is a large number of closed and proprietary web-based “sticky note” and annotation systems offering annotation facilities on the Web or as part of ebook reading systems. A common complaint about these is that the user-created annotations cannot be shared, reused in another environment, archived, and so on, due to a proprietary nature of the environments where they were created. Security and privacy are also issues where annotation systems should meet user expectations.

Additionally, there are the related topics of comments and footnotes, which do not yet have standardized solutions, and which might benefit from some of the groundwork on annotations.

The goal of this Working Group is to provide an open approach for annotation, making it possible for browsers, reading systems, JavaScript libraries, and other tools, to develop an annotation ecosystem where users have access to their annotations from various environments, can share those annotations, can archive them, and use them how they wish.

Depending on how fine-grained you want your semantics, annotation is one way to convey them to others.
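An annotation payload need not be RDF at all. Here is a sketch of one as a plain Python dict, with field names loosely borrowed from the Open Annotation draft (body, target, selector); the specific names and values are my own illustration, not anything the working group has specified.

```python
import json

# A hypothetical annotation record as a plain dict. "TextQuoteSelector"
# appears in the Open Annotation community draft; everything else here
# (field layout, values, creator) is an illustrative assumption.
annotation = {
    "body": {
        "type": "TextualBody",
        "value": "This claim needs a citation.",
    },
    "target": {
        "source": "http://example.org/article.html",
        "selector": {
            "type": "TextQuoteSelector",   # anchor by quoting the text
            "exact": "lacks a structured approach",
        },
    },
    "creator": "alice@example.org",
}

print(json.dumps(annotation, indent=2))
```

Nothing in a payload like this requires RDF or OWL on the client; the payload can be whatever the annotating community agrees it is.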

Unfortunately, looking at the starting point for this working group, “open” means RDF, OWL and other non-commercially adopted technologies from the W3C.

Defining the ability to point, using XQuery perhaps and reserving to users the ability to create standards for annotation payloads would be a much more “open” approach. That is an approach you are unlikely to see from the W3C.

I would be more than happy to be proven wrong on that point.

Supplying Missing Semantics? (IKWIM)

July 23rd, 2014

Chris Ford, in Creating music with Clojure and Overtone, uses an example of a harmonic sound missing its first two harmonic components, and yet, when heard, our ears supply the missing components. Quite spooky when you first see it, but there is no doubt that the components are quite literally “missing” from the result.

Which makes me wonder, do we generally supply semantics, appropriately or inappropriately, to data?

Unless it is written in an unknown script, we “know” what data must mean, based on what we would mean by such data.

Using “data” in the broadest sense to include all recorded information.

Even unknown scripts don’t stop us from assigning our “meanings” to texts. I will have to run down some of the 17th century works on Egyptian Hieroglyphics at some point.

Entertaining, and, according to current work on historical Egyptian, not even close to what we now understand the texts to mean.

The “I know what it means” (IKWIM) syndrome may be the biggest single barrier to all semantic technologies. Capturing the semantics of texts is always an expensive proposition and if I already IKWIM, then why bother?

If you capture something I already know, that knowledge can then be shared with others. For some, that is another disincentive for capturing semantics.

To paraphrase a tweet I saw today by no-hacker-news

Why take 1 minute to document when others can waste a day guessing?

Creating music with Clojure and Overtone

July 23rd, 2014

Creating music with Clojure and Overtone by Chris Ford.

From the description:

Chris Ford will show how to make music with Clojure, starting with the basic building block of sound, the sine wave, and gradually accumulating abstractions culminating in a canon by Johann Sebastian Bach.

Very impressive! You will pick up some music theory, details on sound that you have forgotten since high school physics, and perhaps yet another reason to learn Clojure!
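Chris starts from the sine wave, and that starting point is easy to reproduce in Python (a rough sketch of raw sample generation only; Overtone itself is Clojure over SuperCollider, and the parameter names below are my own):

```python
import math

def sine_wave(freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Raw samples for a pure tone: the basic building block of sound."""
    n = int(sample_rate * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# Concert A (440 Hz) for 10 ms at CD sample rate.
samples = sine_wave(440.0, 0.01)
print(len(samples))  # 441
```

Summing several of these at harmonic multiples (880 Hz, 1320 Hz, and so on) is Chris's "gradually accumulating abstractions" step in miniature.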

Chris mentions: Music For Geeks And Nerds (Python) as a resource.

An A to Z of…D3 force layout

July 23rd, 2014

An A to Z of extra features for the D3 force layout by Simon Raper.

From the post:

Since d3 can be a little inaccessible at times I thought I’d make things easier by starting with a basic skeleton force directed layout (Mike Bostock’s original example) and then giving you some blocks of code that can be plugged in to add various features that I have found useful.

The idea is that you can pick the features you want and slot in the code. In other words I’ve tried to make things sort of modular. The code I’ve taken from various places and adapted so thank you to everyone who has shared. I will try to provide the credits as far as I remember them!

A great read and an even greater bookmark for graph layouts.

In Simon’s alphabet:

A is for arrows.

B is for breaking links.

C is for collision detection.

F is for fisheye.

H is for highlighting.

L is for labels.

P is for pinning down nodes.

S is for search.

T is for tooltip.

Not only does Simon show the code, he also shows the result of the code.

A model of how to post useful information on D3.

NLTK 3.0 Beta!

July 23rd, 2014

NLTK 3.0 Beta!

The official name is nltk 3.0.0b1 but I thought 3.0 beta rolls off the tongue better. ;-)

Interface changes.

Grab the latest, contribute bug reports, etc.

Commonplace Books at Harvard

July 22nd, 2014

Commonplace Books

From the webpage:

In the most general sense, a commonplace book contains a collection of significant or well-known passages that have been copied and organized in some way, often under topical or thematic headings, in order to serve as a memory aid or reference for the compiler. Commonplace books serve as a means of storing information, so that it may be retrieved and used by the compiler, often in his or her own work.

The commonplace book has its origins in antiquity in the idea of loci communes, or “common places,” under which ideas or arguments could be located in order to be used in different situations. The florilegium, or “gathering of flowers,” of the Middle Ages and early modern era, collected excerpts primarily on religious and theological themes. Commonplace books flourished during the Renaissance and early modern period: students and scholars were encouraged to keep commonplace books for study, and printed commonplace books offered models for organizing and arranging excerpts. In the 17th, 18th, and 19th centuries printed commonplace books, such as John Locke’s A New Method of Making Common-Place-Books (1706), continued to offer new models of arrangement. The practice of commonplacing continued to thrive in the modern era, as writers appropriated the form for compiling passages on various topics, including the law, science, alchemy, ballads, and theology. The manuscript commonplace books in this collection demonstrate varying degrees and diverse methods of organization, reflecting the idiosyncratic interests and practices of individual readers.

A great collection of selections from commonplace books!

I am rather “lite” on posts for the day because I tried to chase down John Locke’s publication of A New Method of Making Common-Place-Books in French, circa 1686/87.

Unfortunately, the scanned version of Bibliotheque Universelle et Historique I was using listed “volumes” when they were actually four (4) issues per year, and the issue containing Locke’s earlier publication is missing. A translation appears in John Locke, The Works of John Locke in Nine Volumes (London: Rivington, 1824, 12th ed.), Vol. 2, which gives this reference:

Translated out of the French from the second volume of Bibliotheque Universelle.

You can view an image of the work on page 441.

Someone who could not read Roman numerals gave varying dates for the “volumes” of Bibliotheque Universelle et Historique which didn’t improve my humor. I will try to find a complete scanned set tomorrow and try to chase down the earlier version of A New Method of Making Common-Place-Books. My concern is the graphic that appears in the translation and what appears to be examples at the end. I wanted to confirm that both appear in the original French version.


PS: I know, this isn’t as “practical” as functional programming, writing Pig or Cuda code, but on the other hand, understanding where you are going is at least as important as getting there quickly. Yes?

Readings in conflict-free replicated data types

July 22nd, 2014

Readings in conflict-free replicated data types by Christopher Meiklejohn.

From the post:

This is a work in progress post outlining research topics related to conflict-free replicated data types, or CRDTs.

Yesterday, Basho announced the release of Riak 2.0.0 RC1, which contains a comprehensive set of “data types” that can be used for building more robust distributed applications. For an overview of how to use these data types in Riak to avoid custom, and error prone, merge functions, see the Basho documentation site.

You’re probably more familiar with another name for these data types: conflict-free replicated data types (CRDTs). Simply put, CRDTs are data structures which capture some aspect of causality, along with providing interfaces for safely operating over the value and correctly merging state with diverged and concurrently edited structures.

This provides a very useful property when combined with an eventual consistency, or AP-focused, data store: Strong Eventual Consistency (SEC). Strong Eventual Consistency is an even stronger convergence property than eventual consistency: given that all updates are delivered to all replicas, there is no need for conflict resolution, given the conflict-free merge properties of the data structure. Simply put, correct replicas which have received all updates have the same state.

Here’s a great overview by one of the inventors of CRDTs, Marc Shapiro, where he discusses conflict-free replicated data types and their relation to strong eventual consistency.

In this Hacker News thread, there was an interesting discussion about why one might want to implement these on the server, why implementing them is non-trivial, and what the most recent research related to them consists of.

This post serves as a reading guide on the various areas of conflict-free replicated data types. Papers are broken down into various areas and sorted in reverse chronological order.

Relevant to me because the new change tracking in ODF is likely to be informed by CRDTs and because eventually consistent merging is important for distributed topic maps.

Confusion would be the result if the order of merging topics results in different topic maps.

CRDTs are an approach to avoid that unhappy outcome.
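A toy illustration of how that works is the grow-only counter, about the simplest CRDT there is (this sketch is mine, not Riak's implementation):

```python
# A grow-only counter (G-Counter). Each replica increments only its own
# slot; merge is an element-wise max, which is commutative, associative
# and idempotent, so replicas converge no matter the order (or
# repetition) of state exchanges.

def increment(state, replica_id, amount=1):
    new = dict(state)  # persistent-style: copy, don't mutate
    new[replica_id] = new.get(replica_id, 0) + amount
    return new

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas diverge, then exchange state in either order.
a = increment({}, "a", 3)
b = increment({}, "b", 2)
assert merge(a, b) == merge(b, a)            # order-independent
assert merge(a, merge(a, b)) == merge(a, b)  # idempotent
print(value(merge(a, b)))  # 5
```

That order-independence of merge is exactly the property a merging of topics would need.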


PS: Remember to grab a copy of Riak 2.0.0 RC1.

Clojure in the Large

July 22nd, 2014

Clojure in the Large by Stuart Sierra.

From the summary:

Stuart Sierra discusses various Clojure features: protocols, records, DI, managing startup/shutdown of components, dynamic binding, interactive development workflow, testing and mocking.

Stuart describes this presentation as “intermediate” level.

Great examples to get you past common problems and “thinking” in Clojure.

Announcing Apache Pig 0.13.0

July 22nd, 2014

Announcing Apache Pig 0.13.0 by Daniel Dai.

From the post:

The Apache Pig community released Pig 0.13 earlier this month. Pig uses a simple scripting language to perform complex transformations on data stored in Apache Hadoop. The Pig community has been working diligently to prepare Pig to take advantage of the DAG processing capabilities in Apache Tez. We also improved usability and performance.

This blog post summarizes the progress we’ve made.

Pig 0.13 improvements

If you missed the Pig 0.13.0 release (I did), here’s a chance to catch up on the latest improvements.

Interactive Documents with R

July 22nd, 2014

Interactive Documents with R by Ramnath Vaidyanathan.

From the webpage:

The main goal of this tutorial is to provide a comprehensive overview of the workflow required to create, customize and share interactive web-friendly documents from R. We will cover a range of topics from how to create interactive visualizations and dashboards, to web pages and applications, straight from R. At the end of this tutorial, attendees will be able to apply these learnings to turn their own analyses and reports into interactive, web-friendly content.

Ramnath gave this tutorial at UseR2014. The slides have now been posted at:

The tutorial is listed as six (6) separate tutorials:

  1. Interactive Documents
  2. Slidify
  3. Frameworks
  4. Layouts
  5. Widgets
  6. How Slidify Works

I am always impressed by useful interactive web pages. Leaving aside the ones that jump, pop and whizz with no discernible purpose, interactive documents add value to their content for readers.


I first saw this in a tweet by Gregory Piatetsky.

Why Functional Programming Matters

July 22nd, 2014

Why Functional Programming Matters by John Hughes.


As software becomes more and more complex, it is more and more important to structure it well. Well-structured software is easy to write and to debug, and provides a collection of modules that can be reused to reduce future programming costs. In this paper we show that two features of functional languages in particular, higher-order functions and lazy evaluation, can contribute significantly to modularity. As examples, we manipulate lists and trees, program several numerical algorithms, and implement the alpha-beta heuristic (an algorithm from Artificial Intelligence used in game-playing programs). We conclude that since modularity is the key to successful programming, functional programming offers important advantages for software development.

There’s a bottom line issue you can raise with your project leader or manager. “[R]educe future programming costs.” That it is also true isn’t the relevant point. Management isn’t capable of tracking programming costs well enough to know. It is the rhetoric of cheaper, faster, better that moves money down the management colon.

The same is very likely true for semantic integration tasks.

Many people have pointed out that topic maps can make the work that goes into semantic integration efforts re-usable. Which is true and would be a money saver to boot, but none of that really matters to management. Budgets and their share of a budget for a project are management motivators.

Perhaps the better way to sell the re-usable semantics of topic maps is to say the builders of integration systems can re-use the semantics of a topic map. Customers/users, on the other hand, just get the typical semantically opaque results. Just like now.

So the future cost for semantic integration experts to extend or refresh a current semantic integration solution goes down. They have re-usable semantics that they can re-apply to the customer’s situation. Either script it or even have interns do the heavy lifting. Which helps their bottom line.

Thinking about it that way, creating disclosed semantics for popular information resources would have the same impact. Something to think about.

I first saw this in a tweet by Computer Science.


July 22nd, 2014


From the webpage:

  • integrated LaTeX environment for Windows
  • powerful LaTeX editor with auto completion
  • full UTF-8 support
  • document navigator for easy navigation and referencing
  • tight viewer integration with forward and inverse search
  • quick setup wizard for MiKTeX
  • trusted by millions of users around the world
  • free and open source (GPL)

I need to install this so I can run it in a VM on Ubuntu. ;-)

Seriously, this looks really cool.

Let me know what you think. This could make a great recommendation for Windows users.

I first saw this in a tweet by TeX tips.

Commonplace Books

July 21st, 2014

Commonplace Books as a Source for Networked Knowledge and Combinatorial Creativity by Shane Parrish.

From the post:

“You know that I voluntarily communicated this method to you, as I have done to many others, to whom I believed it would not be unacceptable.”

There is an old saying that the truest form of poverty is “when if you have occasion for any thing, you can’t use it, because you know not where it is laid.”

The flood of information is nothing new.

“In fact,” the Harvard historian Ann Blair writes in her book Too Much to Know: Managing Scholarly Information Before the Modern Age, “many of our current ways of thinking about and handling information descend from patterns of thought and practices that extend back for centuries.” Her book explores “the history of one of the longest-running traditions of information management— the collection and arrangement of textual excerpts designed for consultation.” She calls them reference books.

Large collections of textual material, consisting typically of quotations, examples, or bibliographical references, were used in many times and places as a way of facilitating access to a mass of texts considered authoritative. Reference books have sometimes been mined for evidence about commonly held views on specific topics or the meanings of words, and some (encyclopedias especially) have been studied for the genre they formed.


No doubt we have access to and must cope with a much greater quantity of information than earlier generations on almost every issue, and we use technologies that are subject to frequent change and hence often new. Nonetheless, the basic methods we deploy are largely similar to those devised centuries ago in early reference books. Early compilations involved various combinations of four crucial operations: storing, sorting, selecting, and summarizing, which I think of as the four S’s of text management. We too store, sort, select, and summarize information, but now we rely not only on human memory, manuscript, and print, as in earlier centuries, but also on computer chips, search functions, data mining, and Wikipedia, along with other electronic techniques.

Knowing some of the background on the commonplace book will be helpful:

Commonplace books (or commonplaces) were a way to compile knowledge, usually by writing information into books. Such books were essentially scrapbooks filled with items of every kind: medical recipes, quotes, letters, poems, tables of weights and measures, proverbs, prayers, legal formulas. Commonplaces were used by readers, writers, students, and scholars as an aid for remembering useful concepts or facts they had learned. Each commonplace book was unique to its creator’s particular interests. They became significant in Early Modern Europe.

“Commonplace” is a translation of the Latin term locus communis (from Greek tópos koinós, see literary topos) which means “a theme or argument of general application”, such as a statement of proverbial wisdom. In this original sense, commonplace books were collections of such sayings, such as John Milton‘s commonplace book. Scholars have expanded this usage to include any manuscript that collects material along a common theme by an individual.

Commonplace books are not diaries nor travelogues, with which they can be contrasted: English Enlightenment philosopher John Locke wrote the 1706 book A New Method of Making a Common Place Book, “in which techniques for entering proverbs, quotations, ideas, speeches were formulated. Locke gave specific advice on how to arrange material by subject and category, using such key topics as love, politics, or religion. Commonplace books, it must be stressed, are not journals, which are chronological and introspective.” By the early eighteenth century they had become an information management device in which a note-taker stored quotations, observations and definitions. They were even used by influential scientists. Carl Linnaeus, for instance, used commonplacing techniques to invent and arrange the nomenclature of his Systema Naturae (which is still used by scientists today).

[footnote links omitted]

Have you ever had a commonplace book?

Impressed enough by Shane’s post to think about keeping one. In hard copy.

Curious how you would replicate a commonplace book in software?

Or perhaps better, what aspects of a commonplace book can you capture in software and what aspects can’t be captured?
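The storing-and-sorting half, at least, is trivial to sketch (a bare hypothetical toy, assuming topical headings in the spirit of Locke's method; the passages and sources below are just sample entries):

```python
from collections import defaultdict

# A bare-bones commonplace book: excerpts filed under topical headings.
# What this plainly cannot capture is the physical act of copying by
# hand, which was itself part of the memory aid.
book = defaultdict(list)

def commonplace(topic, passage, source):
    """File a passage under a topical heading."""
    book[topic].append({"passage": passage, "source": source})

def recall(topic):
    """Retrieve everything filed under a heading."""
    return book[topic]

commonplace("memory", "the truest form of poverty ...", "Fuller, via Parrish")
commonplace("memory", "storing, sorting, selecting, and summarizing",
            "Ann Blair, Too Much to Know")
print(len(recall("memory")))  # 2
```

Selecting and summarizing, Blair's other two S's, are where the compiler's judgment lives, and no data structure supplies that.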

I first saw this in a tweet by Aaron Kirschenfeld.

You’re not allowed bioinformatics anymore

July 21st, 2014

You’re not allowed bioinformatics anymore by Mick Watson.

Bump this to the head of your polemic reading list! Excellent writing.

To be fair, collaboration with others is a two-way street.

That is both communities in this tale needed to be reaching out to the other on a continuous basis. It isn’t enough that you offered once or twice and were rebuffed so now you will wait them out.

Successful collaborations don’t start with grudges and bad attitudes about prior failures to collaborate.

I know of two organizations that share common members, operate in the same area and despite both being more than a century old, have had only one, brief, collaborative project.

The collaboration fell apart because leadership in both was waiting for the other to call.

It is hard to sustain a collaboration when both parties consider themselves to be the center of the universe. (I have it on good authority that neither one of them is the center of the universe.)

I can’t promise fame, professional success, etc., but reaching out and genuinely collaborating with others will advance your field of endeavor. Promise.

Enjoy the story.

I first saw this in a tweet by Neil Saunders.


July 21st, 2014

Graffeine by Julian Browne

From the webpage:

Caffeinated Graph Exploration for Neo4J

Graffeine is both a useful interactive demonstrator of graph capability and a simple visual administration interface for small graph databases.

Here it is with the, now canonical, Dr Who graph loaded up:

Dr. Who graph

From the description:

Graffeine plugs into Neo4J and renders nodes and relationships as an interactive D3 SVG graph so you can add, edit, delete and connect nodes. It’s not quite as easy as a whiteboard and a pen, but it’s close, and all interactions are persisted in Neo4J.

You can either make a graph from scratch or browse an existing one using search and paging. You can even “fold” your graph to bring different aspects of it together on the same screen.

Nodes can be added, updated, and removed. New relationships can be made using drag and drop and existing relationships broken.

It’s by no means phpmyadmin for Neo4J, but one day it could be (maybe).

A great example of D3 making visual editing possible.

Christmas in July?

July 21st, 2014

It won’t be Christmas in July, but bioinformatics folks will feel like it with the release of the full annotation of the human genome assembly (GRCh38), due to drop at the end of July 2014.

Dan Murphy covers progress on the annotation and information about the upcoming release in: The new human annotation is almost here!

This is an important big data set.

How would you integrate it with other data sets?

I first saw this in a tweet by Neil Saunders.

Deploying Dionaea…

July 21st, 2014

Deploying Dionaea on a Raspberry Pi using MHN

A complete with screenshots guide to installing Dionaea on a Raspberry Pi.

MHN = Modern Honey Network.

With enough honeypots, do you think a “crowd” could capture most malware within days of its appearance?

I guess the NSA needs to run a honeypot inside its network firewalls. ;-)

I first saw this in a tweet by Jason Trost.

Security Data Science Papers

July 21st, 2014

Security Data Science Papers by Jason Trost

From the webpage:

Over the past several years I have collected and read many security research papers/slides and have started a small catalog of sorts. The topics of these papers range from intrusion detection, anomaly detection, machine learning/data mining, Internet scale data collection, malware analysis, and intrusion/breach reports. I figured this collection might be useful to others. All links lead to PDFs hosted here.

I hope to clean this up (add author info, date, and publication) when I get some more time as well as adding some detailed notes I have on the various features, models, algorithms, and datasets used in many of these papers.

Here are some of my favorites (nice uses of machine learning, graph analytics, and/or anomaly detection to solve interesting security problems):

A nice-looking collection, but it doesn’t help a reader decide:

  • Is this the latest word on this problem?
  • What has this author written that is more recent? On this problem or others?
  • Does this paper cover concept X?
  • What does this paper say about concept X?
  • What other papers are there on concept X?
  • How does this paper fit into the domain as defined by it and other papers?

Not that I am picking on Jason; I do the same thing all the time.

Question: What information is the most useful, beyond location for a paper?

Serious question. I know what I look for related to an interesting paper. What do you look for?
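For what it's worth, the questions above suggest the minimum metadata a catalog entry would need to answer them. A sketch (the field names and sample entries are my own invention, not Jason's format):

```python
from dataclasses import dataclass, field

@dataclass
class PaperEntry:
    """A catalog entry with enough metadata to answer the questions above."""
    title: str
    authors: list
    year: int
    concepts: set = field(default_factory=set)  # topics the paper covers

def papers_on(catalog, concept):
    """All papers covering a concept, newest first."""
    hits = [p for p in catalog if concept in p.concepts]
    return sorted(hits, key=lambda p: p.year, reverse=True)

# Hypothetical entries, for illustration only.
catalog = [
    PaperEntry("Old anomaly detection survey", ["A. Author"], 2009,
               {"anomaly detection"}),
    PaperEntry("Botnet detection via graph analytics", ["B. Author"], 2013,
               {"graph analytics", "anomaly detection"}),
]
latest = papers_on(catalog, "anomaly detection")[0]
print(latest.title)
```

Even this much would answer "what is the latest word on this problem?" and "what other papers cover concept X?" — the questions a bare list of PDFs cannot.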

I first saw this in a tweet by Adam Sealey.

Ninety-Nine Haskell Problems [Euler/Clojure too]

July 21st, 2014

Ninety-Nine Haskell Problems

From the webpage:

These are Haskell translations of Ninety-Nine Lisp Problems, which are themselves translations of Ninety-Nine Prolog Problems.

Also listed are:

Naming isn’t the only hard problem in computer science. The webpage points out that due to gaps and use of letters, there are 88 problems and not 99.

If you want something a bit more challenging, consider the Project Euler problems. No peeking, but there is a wiki with some Clojure answers.
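For a taste of the exercises, here is the first Project Euler problem (sum of the natural numbers below 1000 that are multiples of 3 or 5) in Python; the wiki's answers are in Clojure, so treat this as an illustrative sketch:

```python
# Project Euler, Problem 1: sum of all multiples of 3 or 5 below a limit.
def euler1(limit=1000):
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)

print(euler1())  # 233168
```

Most of the early problems in both collections are this size: small enough to solve in minutes, instructive enough to reveal the idioms of a new language.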


I first saw this in a tweet by Computer Science.

Non-Moral Case For Diversity

July 21st, 2014

Groups of diverse problem solvers can outperform groups of high-ability problem solvers by Lu Hong and Scott E. Page.


We introduce a general framework for modeling functionally diverse problem-solving agents. In this framework, problem-solving agents possess representations of problems and algorithms that they use to locate solutions. We use this framework to establish a result relevant to group composition. We find that when selecting a problem-solving team from a diverse population of intelligent agents, a team of randomly selected agents outperforms a team comprised of the best-performing agents. This result relies on the intuition that, as the initial pool of problem solvers becomes large, the best-performing agents necessarily become similar in the space of problem solvers. Their relatively greater ability is more than offset by their lack of problem-solving diversity.

I have heard people say that diverse teams are better, but always in the context of contending for members of one group or another to be included on a team.

Reading the paper carefully, I don’t think that is the authors’ point at all.

From the conclusion:

The main result of this paper provides conditions under which, in the limit, a random group of intelligent problem solvers will outperform a group of the best problem solvers. Our result provides insights into the trade-off between diversity and ability. An ideal group would contain high-ability problem solvers who are diverse. But, as we see in the proof of the result, as the pool of problem solvers grows larger, the very best problem solvers must become similar. In the limit, the highest-ability problem solvers cannot be diverse. The result also relies on the size of the random group becoming large. If not, the individual members of the random group may still have substantial overlap in their local optima and not perform well. At the same time, the group size cannot be so large as to prevent the group of the best problem solvers from becoming similar. This effect can also be seen by comparing Table 1. As the group size becomes larger, the group of the best problem solvers becomes more diverse and, not surprisingly, the group performs relatively better.

A further implication of our result is that, in a problem-solving context, a person’s value depends on her ability to improve the collective decision (8). A person’s expected contribution is contextual, depending on the perspectives and heuristics of others who work on the problem. The diversity of an agent’s problem-solving approach, as embedded in her perspective-heuristic pair, relative to the other problem solvers is an important predictor of her value and may be more relevant than her ability to solve the problem on her own. Thus, even if we were to accept the claim that IQ tests, Scholastic Aptitude Test scores, and college grades predict individual problem-solving ability, they may not be as important in determining a person’s potential contribution as a problem solver as would be measures of how differently that person thinks. (emphasis added)

Some people accept gender, race, nationality, etc. as markers for thinking differently, and no doubt that is true in some cases. But presuming so is just as uninformed as presuming there are no differences in how people of different genders, races, and nationalities think.

You could ask: for example, present candidates for a team with open-ended problems that admit multiple solutions. Group similar solutions together and then pick randomly across the solution groups.
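That selection procedure is easy to sketch: group candidates by how they approached an open-ended problem, then draw one candidate from each distinct approach. (The grouping key here is a simple label; in practice it would be whatever similarity measure you apply to the candidates' solutions.)

```python
import random
from collections import defaultdict

def diverse_team(candidates, approach_of, seed=None):
    """Group candidates by solution approach, then pick one per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for c in candidates:
        groups[approach_of(c)].append(c)
    # One randomly chosen member from each distinct approach.
    return [rng.choice(members) for members in groups.values()]

# Hypothetical candidates tagged with the approach they took.
candidates = [
    ("Ada", "greedy"), ("Ben", "greedy"),
    ("Cho", "dynamic"), ("Dee", "random-restart"),
]
team = diverse_team(candidates, approach_of=lambda c: c[1], seed=42)
print(len(team))  # one member per distinct approach: 3
```

The team size is the number of distinct approaches, not the number of applicants, which is exactly the diversity-over-headcount trade the paper argues for.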

You may have a gender, race, nationality diverse team but if they think the same way, say Antonin Scalia and Clarence Thomas, then your team isn’t usefully diverse.

Diversity of thinking should be your goal, not diversity of markers of diversity.

I first saw this in a tweet by Chris Dixon.

Hadoop Doesn’t Cure HIV

July 21st, 2014

If I were Gartner, I could get IBM to support my stating the obvious. I would have to dress it up by repeating a lot of other obvious things but that seems to be the role for some “analysts.”

If you need proof of that claim, consider this report: Hadoop Is Not a Data Integration Solution. Really? Did any sane person familiar with Hadoop think otherwise?

The “key” findings from the report:

  • Many Hadoop projects perform extract, transform and load workstreams. Although these serve a purpose, the technology lacks the necessary key features and functions of commercially-supported data integration tools.
  • Data integration requires a method for rationalizing inconsistent semantics, which helps developers rationalize various sources of data (depending on some of the metadata and policy capabilities that are entirely absent from the Hadoop stack).
  • Data quality is a key component of any appropriately governed data integration project. The Hadoop stack offers no support for this, other than the individual programmer’s code, one data element at a time, or one program at a time.
  • Because Hadoop workstreams are independent — and separately programmed for specific use cases — there is no method for relating one to another, nor for identifying or reconciling underlying semantic differences.

All true, all obvious and all a function of Hadoop’s design. It never had data integration as a requirement so finding that it doesn’t do data integration isn’t a surprise.

If you switch between “commercially-supported data integration tools,” you will again be working “…one data element at a time,” because common data integration tools don’t capture their own semantics. That means you can’t re-use the data integration work done with one tool when you transition to another. Does that sound like vendor lock-in?

Odd that Gartner didn’t mention that.

Perhaps that’s stating the obvious as well.

A topic mapping of your present data integration solution will enable you to capture and re-use your investment in its semantics, with any data integration solution.
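In code form, the idea is to record each field mapping together with what the field means and where it came from, rather than burying that knowledge in tool-specific transformation scripts. A minimal, tool-neutral sketch (the field names and transforms are invented for illustration):

```python
# Tool-neutral field mappings: each entry documents the semantics of the
# mapping, so the knowledge survives a switch of integration tools.
field_mappings = [
    {
        "source": "crm.cust_nm",
        "target": "warehouse.customer_name",
        "meaning": "Legal name of the customer, as registered",
        "transform": str.strip,
    },
    {
        "source": "crm.cust_ph",
        "target": "warehouse.customer_phone",
        "meaning": "Primary contact phone, digits only",
        "transform": lambda v: "".join(ch for ch in v if ch.isdigit()),
    },
]

def apply_mappings(record, mappings):
    """Translate a source record to target fields using documented mappings."""
    return {m["target"]: m["transform"](record[m["source"]])
            for m in mappings if m["source"] in record}

row = {"crm.cust_nm": "  Acme Corp ", "crm.cust_ph": "(555) 123-4567"}
print(apply_mappings(row, field_mappings))
```

Because the "meaning" of each mapping travels with the mapping itself, a subsequent tool (or researcher) can reconstruct why each transformation was made, instead of reverse-engineering it element by element.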

Did I hear someone say “increased ROI?”

German Record Linkage Center

July 20th, 2014

German Record Linkage Center

From the webpage:

The German Record Linkage Center (GermanRLC) was established in 2011 to promote research on record linkage and to facilitate practical applications in Germany. The Center will provide several services related to record linkage applications as well as conduct research on central topics of the field. The services of the GermanRLC are open to all academic disciplines.

Wikipedia describes record linkage as:

Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record Linkage is called Data Linkage in many jurisdictions, but is the same process.

While very similar to topic maps, record linkage relies upon the creation of a common record for further processing, as opposed to pointing into an infoverse to identify subjects in their natural surroundings.

Another difference in practice is that the containers of subjects (headers, fields, etc.) are not themselves treated as subjects with identity. That is to say, how the mapping from an original form to the target form was made is opaque to a subsequent researcher.
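As a minimal illustration of what record linkage does, compare two records that lack a common identifier and decide whether they refer to the same entity. (The normalization and the similarity threshold here are arbitrary choices for the sketch, not anyone's production pipeline.)

```python
from difflib import SequenceMatcher

def normalize(name):
    """Crude normalization: lowercase, drop punctuation, squeeze spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def same_entity(rec_a, rec_b, threshold=0.85):
    """Link two records if their normalized names are similar enough."""
    a, b = normalize(rec_a["name"]), normalize(rec_b["name"])
    return SequenceMatcher(None, a, b).ratio() >= threshold

r1 = {"name": "ACME Corp."}
r2 = {"name": "Acme   Corp"}
r3 = {"name": "Widget Industries"}
print(same_entity(r1, r2))  # True
print(same_entity(r1, r3))  # False
```

Real record linkage adds blocking, multiple comparison fields, and calibrated match weights, but the shape of the task (normalize, compare, decide) is the same.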

I first saw this in a tweet by Lars Marius Garshol.

Combinatorial Algorithms

July 20th, 2014

Combinatorial Algorithms For Computers and Calculators by Albert Nijenhuis and Herbert S. Wilf. (PDF)

I suspect the word “calculators” betrays the age of this item. This edition was published in 1978. Still, Amazon is currently asking > $50.00 U.S. for it, so knowing about the free PDF can save you some money.

Not that the algorithms covered have changed but the authors say that combinatorics changed enough between 1975 and 1978 to warrant a second edition.

I suspect that is true several times over between 1978 and 2014.

Still, there is value in a different presentation than you would see today, even without the latest content.
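Much of the book's subject matter, systematically generating combinatorial objects, now ships with standard libraries; a quick sketch of the staple routines in Python's itertools (not the book's own algorithms):

```python
from itertools import combinations, permutations
from math import factorial

# Enumerating combinatorial objects, the book's core topic.
subsets = list(combinations(range(4), 2))   # all 2-subsets of {0,1,2,3}
perms = list(permutations("abc"))           # all orderings of three items

print(len(subsets))  # C(4,2) = 6
print(len(perms))    # 3! = 6
assert len(perms) == factorial(3)
```

What a modern library won't give you is the book's discussion of why the generation orders (lexicographic, Gray-code, random) are chosen as they are, which is where the older presentation still pays off.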

I first saw this in a tweet by Atabey Kaygun.

Learn Haskell

July 20th, 2014

Learn Haskell by Chris Allen.

Chris has created a GitHub repository laying out his recommended path for learning Haskell.

Quite a list of resources but if he has missed anything, please file a pull request.

I first saw this in a tweet by Debasish Ghosh.

SciPy Videos – Title Sort Order

July 20th, 2014

Good news! You have probably seen that the SciPy 2014 videos are up: SciPy 2014.

You may have also noticed that the videos are in no discernible order. Not so good news.

However, I have created a list of the SciPy Videos in Title Sort Order.
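Producing such a list is a one-liner once you have the titles; a case-insensitive sort keeps differently capitalized titles from scattering across the list (the titles below are invented, not the actual SciPy 2014 lineup):

```python
videos = [
    "Zarr and big arrays",
    "an introduction to NumPy",
    "Pandas for time series",
    "bokeh plotting tutorial",
]
# Case-insensitive title sort, so capitalization doesn't scatter entries.
for title in sorted(videos, key=str.lower):
    print(title)
```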


Ad-hoc Biocuration Workflows?

July 19th, 2014

Text-mining-assisted biocuration workflows in Argo by Rafal Rak, et al. (Database (2014) 2014 : bau070 doi: 10.1093/database/bau070)


Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.


From the introduction:

Data curation from biomedical literature had been traditionally carried out as an entirely manual effort, in which a curator handpicks relevant documents and creates annotations for elements of interest from scratch. To increase the efficiency of this task, text-mining methodologies have been integrated into curation pipelines. In curating the Biomolecular Interaction Network Database (1), a protein–protein interaction extraction system was used and was shown to be effective in reducing the curation work-load by 70% (2). Similarly, a usability study revealed that the time needed to curate FlyBase records (3) was reduced by 20% with the use of a gene mention recognizer (4). Textpresso (5), a text-mining tool that marks up biomedical entities of interest, was used to semi-automatically curate mentions of Caenorhabditis elegans proteins from the literature and brought about an 8-fold increase in curation efficiency (6). More recently, the series of BioCreative workshops have fostered the synergy between biocuration efforts and text-mining solutions. The user-interactive track of the latest workshop saw nine Web-based systems featuring rich graphical user interfaces designed to perform text-mining-assisted biocuration tasks. The tasks can be broadly categorized into the selection of documents for curation, the annotation of mentions of relevant biological entities in text and the annotation of interactions between biological entities (7).
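The mention annotations such workbenches create are typically "standoff": character offsets into the text plus a type, kept separate from the document itself. A minimal sketch of the idea (not Argo's actual format):

```python
text = "BRCA1 mutations increase breast cancer risk."

# Standoff annotations: character spans over the text, plus a type.
annotations = [
    {"start": 0, "end": 5, "type": "Gene"},
    {"start": 25, "end": 38, "type": "Disease"},
]

def mention(text, ann):
    """Recover the annotated surface string from its offsets."""
    return text[ann["start"]:ann["end"]]

for ann in annotations:
    print(ann["type"], "->", mention(text, ann))
```

Keeping annotations standoff is what lets formats like XMI, RDF and BioC interchange the same curated spans between platforms without touching the source text.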

Argo is a truly impressive text-mining-assisted biocuration application but the first line of a biocuration article needs to read:

Data curation from biomedical literature had been traditionally carried out as an entirely ad-hoc effort, after the author has submitted their paper for publication.

There is an enormous backlog of material that desperately needs biocuration and Argo (and other systems) have a vital role to play in that effort.

However, the situation of ad-hoc biocuration is never going to improve unless and until biocuration is addressed in the authoring of papers to appear in biomedical literature.

Who better to resolve the questions or ambiguities that arise during biocuration than the authors of the papers themselves?

That would require working to extend MS Office and Apache OpenOffice, to name two of the more common authoring platforms.

But the return would be higher quality publications earlier in the publication cycle, which would enable publishers to provide enhanced services based upon higher quality products and enhance tracing and searching of the end products.

No offense to ad-hoc efforts but higher quality sooner in the publication process seems like an unbeatable deal.

…Ad-hoc Contextual Inquiry

July 19th, 2014

Honing Your Research Skills Through Ad-hoc Contextual Inquiry by Will Hacker.

From the post:

It’s common in our field to hear that we don’t get enough time to regularly practice all the types of research available to us, and that’s often true, given tight project deadlines and limited resources. But one form of user research–contextual inquiry–can be practiced regularly just by watching people use the things around them and asking a few questions.

I started thinking about this after a recent experience returning a rental car to a national brand at the Phoenix, Arizona, airport.

My experience was something like this: I pulled into the appropriate lane and an attendant came up to get the rental papers and send me on my way. But, as soon as he started, someone farther up the lane called loudly to him saying he’d been waiting longer. The attendant looked at me, said “sorry,” and ran ahead to attend to the other customer.

A few seconds later a second attendant came up, took my papers, and jumped into the car to check it in. She was using an app on a tablet that was attached to a large case with a battery pack, which she carried over her shoulder. She started quickly tapping buttons, but I noticed she kept navigating back to the previous screen to tap another button.

Curious being that I am, I asked her if she had to go back and forth like that a lot. She said “yes, I keep hitting the wrong thing and have to go back.”

Will expands his story into why and how to explore random user interactions with technology.

If you want to become better at contextual inquiry and observation, Will has the agenda for you.

He concludes:

Although exercises like this won’t tell us the things we’d like to know about the products we work on, they do let us practice the techniques of contextual inquiry and observation and make us more sensitive to various design issues. These experiences may also help us build the case in more companies for scheduling time and resources for in-field research with our actual customers.