Introducing Splainer…

August 25th, 2014

Introducing Splainer — The Open Source Search Sandbox That Tells You Why by Doug Turnbull.

Splainer is a step towards addressing two problems:

From the post:

  • Collaboration: At OpenSource Connections, we believe that collaboration with non-techies is the secret ingredient of search relevancy. We need to arm business analysts and content experts with a human readable version of the explain information so they can inform the search tuning process.
  • Usability: I want to paste a Solr URL, full of query paramaters and all, and go! Then, once I see more helpful explain information, I want to tweak (and tweak and tweak) until I get the search results I want. Much like some of my favorite regex tools. Get out of the way and let me tune!
  • ….

    We hope you’ll give it a spin and let us know how it can be improved. We welcome your bugs, feedback, and pull requests. And if you want to try the Splainer experience over multiple queries, with diffing, results grading, a develoment history, and more — give Quepid a spin for free!

Improving the information content of the tokens you are searching is another way to improve search results.

Desperately Seeking Algorithms!

August 25th, 2014

I don’t know for sure that Christophe Grand is “desperately” seeking algorithms but he has tweeted a request for “favorite algorithms” to be cast into posts similar to:

Tarjan’s strongly connected components algorithm

I dislike algorithms that are full of indices and mutations. Not because they are bad but because I always have the feeling that the core idea is buried. As such, Tarjan’s SCC algorithm irked me.

So I took the traditional algorithm, implemented it in Clojure with explicit environment passing, then I replaced indices by explicit stacks (thanks to persistence!) and after some tweaks, I realized that I’ve gone full circle and could switch to stacks lengths instead of stacks themselves and get rid of the loop. However the whole process made the code cleaner to my eye. You can look at the whole history.

Here is the resulting code:

See the Tarjan post for the Clojure version. Something similar is planned for “favorite” algorithms.

What algorithm are you going to submit?

Pass this along.

Research topics in e-discovery

August 25th, 2014

Research topics in e-discovery by William Webber.

From the post:

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I’d capture some of these topics here:

E-discovery, if you don’t know, is found in civil litigation and government investigations. Think of it as hacking with rules as the purpose of e-discovery is to find information that supports your claims or defense. E-discovery is high stakes data mining that pays very well. Need I say more?

Webber lists the following research topics:

  1. Classification across heterogeneous document types
  2. Automatic detection of document types
  3. Faceted categorization
  4. Label propagation across related documents
  5. Identifying unclassifiable documents
  6. Identifying poor training examples
  7. Identifying significant fragments in non-significant text
  8. Routing of documents to specialized trainers
  9. Total cost of annotation

“Label propagation across related documents” looks like a natural for topic maps but searching over defined properties that identify subjects as opposed to opaque tokens would enhance the results for a number of these topics.

Speedy Short and Long DNA Reads

August 25th, 2014

Acceleration of short and long DNA read mapping without loss of accuracy using suffix array by Joaquín Tárraga, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu553)


HPG Aligner applies suffix arrays for DNA read mapping. This implementation produces a highly sensitive and extremely fast mapping of DNA reads that scales up almost linearly with read length. The approach presented here is faster (over 20x for long reads) and more sensitive (over 98% in a wide range of read lengths) than the current, state-of-the-art mappers. HPG Aligner is not only an optimal alternative for current sequencers but also the only solution available to cope with longer reads and growing throughputs produced by forthcoming sequencing technologies.

Always nice to see an old friend, suffix arrays, in the news!

Source code:

For documentation and software:

I first saw this in a tweet by Bioinfocipf.

Where to get help with Common Lisp

August 25th, 2014

Where to get help with Common Lisp by Zach Beane.

An annotated listing of help resources for Common Lisp.

I especially appreciated his quote/comment from one mailing list:

Beware that “the accuracy of postings made to the list is not guaranteed.” (I think that goes for every mailing list, ever.)


I first saw this in a tweet by Planet Lisp.

Who Dat?

August 24th, 2014


From the about page:

Dat is an grant funded, open source project housed in the US Open Data Institute. While dat is a general purpose tool, we have a focus on open science use cases.

The high level goal of the dat project is to build a streaming interface between every database and file storage backend in the world. By building tools to build and share data pipelines we aim to bring to data a style of collaboration similar to what git brings to source code.

The first alpha release is now out!

More on this project later this coming week.

I first saw this in Nat Torkington’s Four short links: 21 August 2014.

Introducing Riffmuse

August 24th, 2014

Introducing Riffmuse by Dave Yarwood.

From the post:

I’ve written a simple command line app in Clojure that will take a musical scale as a command line argument and algorithmically generate a short musical idea or “riff” using the notes in that scale. I call it Riffmuse.

Here’s how Riffmuse works in a nutshell: it takes the command line argument(s), figures out what scale you want (For this, I used the awesome parser generator library Instaparse to create a parser that allows a flexible syntax in specifying scales. C major can be represented as, e.g., “C major,” “CMAJ” or even just “c”), determines what notes are in that scale, comes up with a rhythmic pattern for the notes (represented as 16 “slots” that can either have a note or no note), and then fills the slots with notes from the scale you’ve specified.

On the theory that you never know what will capture someone’s interest, this is a post that needs to be shared. It may spark an interest in some future Clojure or music rock star!

I first saw this in a tweet by coderpost.

Pro Daniel

August 23rd, 2014

Michael Daniel, cybersecurity coordinator for the White House is being taken to task for saying:

“You don’t have to be a coder in order to really do well in this position,” Daniel said, when asked if his job required knowledge of the technology behind information security. “In fact, actually, I think being too down in the weeds at the technical level could actually be a little bit of a distraction.”

“You can get taken up and enamored with the very detailed aspects of some of the technical solutions,” he explained, arguing that “the real issue is looking at the broad strategic picture.”

That quote, from White House cybersecurity czar brags about his lack of technical expertise by Timothy B. Lee, has provoked all manner of huffing and puffing across the computer security community.

It reminds me of candidates for the county commission who would brag about their expertise at running graders, backhoes, and similar heavy equipment. Always puzzled me because I assumed county government would hire people with those skills. Commissioners needed skills at representing the county for grants, making policy decisions, etc.

The security community, or at least reporters purporting to speak for the security community don’t appear to understand the difference between cybersecurity software and cybersecurity policy. You need coders for the former and policy wonks for the latter. Someone could be both but that’s fairly unlikely.

For example, assume a new security algorithm is discovered that can encrypt telephone and email communications with very little overhead for encryption/decryption. Further assume that Daniel has been assured by none other than Bruce Schneier that the algorithm and software that implements it, performs as advertised. And assume Daniel understands none of the details about the algorithm and software.

How does his ignorance impact the formulation of cybersecurity policy with regard to this algorithm or software?

The FBI opposes it because the FBI prefers non-encrypted communications like in the old days when it could just plug into a phone junction box.

The NSA opposes it, at least for others, because then it could not easily tap into email and phone conversations.

The Department of Defense opposes it, primarily because it has long term contractual relationships for security services with firms that don’t have access to the algorithm.

The Library of Congress supports it, at least those outside of the copyright office support it.

Various other groups take positions that seem reasonable to them.

So, how are coding skills going to help Daniel balance the political, social, agency and other politics for a policy concerning such an algorithm?

We all know the answer to that question.

Not at all.

PS: If my example looks like a strawman, come up with one of your own. Technical expertise Daniel can hire, policy expertise, meaning what is expedient given the stakeholders and their influence, not so much.

Large-Scale Object Classification…

August 23rd, 2014

Large-Scale Object Classi cation using Label Relation Graphs by Jia Deng, et al.


In this paper we study how to perform object classi cation in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classifi cation model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can signifi cantly improve object classifi cation by exploiting the label relations.

Let’s hear it for “real world labels!”

By which the authors mean:

  • An object can have more than one label.
  • There are relationships between labels.

From the introduction:

We first introduce Hierarchy and Exclusion (HEX) graphs, a new formalism allowing flexible specifi cation of relations between labels applied to the same object: (1) mutual exclusion (e.g. an object cannot be dog and cat), (2) overlapping (e.g. a husky may or may not be a puppy and vice versa), and (3) subsumption (e.g. all huskies are dogs). We provide theoretical analysis on properties of HEX graphs such as consistency, equivalence, and computational implications.

Next, we propose a probabilistic classi fication model leveraging HEX graphs. In particular, it is a special type of Conditional Random Field (CRF) that encodes the label relations as pairwise potentials. We show that this model enjoys
a number of desirable properties, including flexible encoding of label relations, predictions consistent with label relations, efficient exact inference for typical graphs, learning labels with varying specifi city, knowledge transfer, and uni fication of existing models.

Having more than one label is trivially possible in topic maps. The more interesting case is the authors choosing to treat semantic labels as subjects and to define permitted associations between those subjects.

A world of possibilities opens up when you can treat something as a subject that can have relationships defined to other subjects. Noting that those relationships can also be treated as subjects should someone desire to do so.

I first saw this at: Is that husky a puppy?

Data + Design

August 23rd, 2014

Data + Design: A simple introduction to preparing and visualizing information by Trina Chiasson, Dyanna Gregory and others.

From the webpage:


Information design is about understanding data.

Whether you’re writing an article for your newspaper, showing the results of a campaign, introducing your academic research, illustrating your team’s performance metrics, or shedding light on civic issues, you need to know how to present your data so that other people can understand it.

Regardless of what tools you use to collect data and build visualizations, as an author you need to make decisions around your subjects and datasets in order to tell a good story. And for that, you need to understand key topics in collecting, cleaning, and visualizing data.

This free, Creative Commons-licensed e-book explains important data concepts in simple language. Think of it as an in-depth data FAQ for graphic designers, content producers, and less-technical folks who want some extra help knowing where to begin, and what to watch out for when visualizing information.

As of today, the Data + Design is the product of fifty (50) volunteers from fourteen (14) countries. At eighteen (18) chapters and just shy of three-hundred (300) pages, this is a solid introduction to data and its visualization.

The source code is on GitHub, along with information on how you can contribute to this project.

A great starting place but my social science background is responsible for my caution concerning chapters 3 and 4 on survey design and questions.

All of the information and advice in those chapters is good, but it leaves the impression that you (the reader) can design an effective survey instrument. There is a big difference between an “effective” survey instrument and a series of questions pretending to be a survey instrument. Both will measure “something” but the question is whether a survey instrument provides you will actionable intelligence.

For a survey on any remotely mission critical, like user feedback on an interface or service, get as much professional help as you can afford.

When was the last time you heard of a candidate for political office or serious vendor using Survey Monkey? There’s a reason for that lack of reports. Can you guess that reason?

I first saw this in a tweet by Meta Brown.


August 23rd, 2014

National Library of Medicine RSS Feeds

RSS feeds covering a broad range National Library of Medicine activities.

I am reporting it here because as soon as I don’t, I will need the listing.

NLM Technical Bulletin

August 23rd, 2014

NLM Technical Bulletin

A publication of the U.S. National Library of Medicine (NLM). The about page for NLM gives the following overview:

The National Library of Medicine (NLM), on the campus of the National Institutes of Health in Bethesda, Maryland, has been a center of information innovation since its founding in 1836. The world’s largest biomedical library, NLM maintains and makes available a vast print collection and produces electronic information resources on a wide range of topics that are searched billions of times each year by millions of people around the globe. It also supports and conducts research, development, and training in biomedical informatics and health information technology. In addition, the Library coordinates a 6,000-member National Network of Libraries of Medicine that promotes and provides access to health information in communities across the United States.

The bulletin about page says:

The NLM Technical Bulletin, your source for the latest searching information, is produced by: MEDLARS Management Section, National Library of Medicine, Bethesda, Maryland, USA.

Which is true but seems inadequate to describe the richness of what you can find at the bulletin.

For example, in 2014 July&emdash;August No. 399 you find:

MeSH on Demand Update: How to Find Citations Related to Your Text

New CMT Subsets Available

New Tutorial: Searching Drugs or Chemicals in PubMed

If medical terminology touches your field of interest, this is a must read.

MeSH on Demand Tool:…

August 23rd, 2014

MeSH on Demand Tool: An Easy Way to Identify Relevant MeSH Terms by Dan Cho.

From the post:

Currently, the MeSH Browser allows for searches of MeSH terms, text-word searches of the Annotation and Scope Note, and searches of various fields for chemicals. These searches assume that users are familiar with MeSH terms and using the MeSH Browser.

Wouldn’t it be great if you could find MeSH terms directly from your text such as an abstract or grant summary? MeSH on Demand has been developed in close collaboration among MeSH Section, NLM Index Section, and the Lister Hill National Center for Biomedical Communications to address this need.

Using MeSH on Demand

Use MeSH on Demand to find MeSH terms relevant to your text up to 10,000 characters. One of the strengths of MeSH on Demand is its ease of use without any prior knowledge of the MeSH vocabulary and without any downloads.

Now there’s a clever idea!

Imagine extending it just a bit so that it produces topics for subjects it detects in your text and associations with the text and author of the text. I would call that assisted topic map authoring. You?

I followed a tweet by Michael Hoffman, which lead to: MeSH on Demand Update: How to Find Citations Related to Your Text, which describes an enhancement to MeSH on demands that finds relevant citations (10) based on your text.

The enhanced version mimics the traditional method of writing court opinions. A judge writes his decision and then a law clerk finds cases that support the positions taken in the opinion. You really thought it worked some other way? ;-)

Imaging Planets and Disks [Not in our Solar System]

August 22nd, 2014

Videos From the 2014 Sagan Summer Workshop On-line

From the post:

The NASA Exoplanet Science Center (NEXScI) hosts the Sagan Workshops, annual themed conferences aimed at introducing the latest techniques in exoplanet astronomy to young researchers. The workshops emphasize interaction with data, and include hands-on sessions where participants use their laptops to follow step-by-step tutorials given by experts. This year’s conference topic was “Imaging Planets and Disks”. It covered topics such as

  • Properties of Imaged Planets
  • Integrating Imaging and RV Datasets
  • Thermal Evolution of Planets
  • The Challenges and Science of Protostellar And Debris Disks…

You can see the agenda and the presentations here, and the videos have been posted here. Some of the talks are also on youtube at

The presentations showcase the extraordinary richness of exoplanet research. If you are unfamiliar with NASA’s exoplanet program, Gary Lockwood provides an introduction (not available for embedding – visit the web page). My favorite talk, of many good ones, was Travis Barman speaking on the “Crown Jewels of Young Exoplanets.”

Looking to expand you data processing horizons? ;-)


Manhattan District History

August 22nd, 2014

Manhattan District History

From the post:

General Leslie Groves, head of the Manhattan Engineer District, in late 1944 commissioned a multi-volume history of the Manhattan Project called the Manhattan District History. Prepared by multiple authors under the general editorship of Gavin Hadden, a longtime civil employee of the Army Corps of Engineers, the classified history was “intended to describe, in simple terms, easily understood by the average reader, just what the Manhattan District did, and how, when, and where.” The volumes record the Manhattan Project’s activities and achievements in research, design, construction, operation, and administration, assembling a vast amount of information in a systematic, readily available form. The Manhattan District History contains extensive annotations, statistical tables, charts, engineering drawings, maps, photographs, and detailed indices. Only a handful of copies of the history were prepared. The Department of Energy’s Office of History and Heritage Resources is custodian of one of these copies.

The history is arranged in thirty-six volumes grouped in eight books. Some of the volumes were further divided into stand-alone chapters. Several of the volumes and stand-alone chapters were never security classified. Many of the volumes and chapters were declassified at various times and were available to the public on microfilm. Parts of approximately a third of the volumes remain classified.

The Office of Classification and the Office of History and Heritage Resources, in collaboration with the Department’s Office of Science and Technical Information, have made the full-text of the entire thirty-six volume Manhattan District History available on this OpenNet website. Unclassified and declassified volumes have been scanned and posted. Classified volumes were declassified in full or with redactions, i.e., still classified terms, phrases, sentences, and paragraphs were removed and the remaining unclassified parts made available to the public. All volumes have been posted.

In case you are interested in the Manhattan project generally or want to follow its participants into the late 20th century, this is the resource for you!

Just occurred to me that the 1940 Census Records are now online. What other records would you want to map together from this time period?

I first saw this in a tweet by Michael Nielsen.

Getty Thesaurus of Geographic Names (TGN)

August 22nd, 2014

Getty Thesaurus of Geographic Names Released as Linked Open Data by James Cuno.

From the post:

We’re delighted to announce that the Getty Research Institute has released the Getty Thesaurus of Geographic Names (TGN)® as Linked Open Data. This represents an important step in the Getty’s ongoing work to make our knowledge resources freely available to all.

Following the release of the Art & Architecture Thesaurus (AAT)® in February, TGN is now the second of the four Getty vocabularies to be made entirely free to download, share, and modify. Both data sets are available for download at under an Open Data Commons Attribution License (ODC BY 1.0).

What Is TGN?

The Getty Thesaurus of Geographic Names is a resource of over 2,000,000 names of current and historical places, including cities, archaeological sites, nations, and physical features. It focuses mainly on places relevant to art, architecture, archaeology, art conservation, and related fields.

TGN is powerful for humanities research because of its linkages to the three other Getty vocabularies—the Union List of Artist Names, the Art & Architecture Thesaurus, and the Cultural Objects Name Authority. Together the vocabularies provide a suite of research resources covering a vast range of places, makers, objects, and artistic concepts. The work of three decades, the Getty vocabularies are living resources that continue to grow and improve.

Because they serve as standard references for cataloguing, the Getty vocabularies are also the conduits through which data published by museums, archives, libraries, and other cultural institutions can find and connect to each other.

A resource where you could loose some serious time!

Try this entry for London.

Or Paris.

Bear in mind the data that underlies this rich display is now available for free downloading.

The Truth About Triplestores [Opaqueness]

August 22nd, 2014

The Truth About Triplestores

A vendor “truth” document from Ontotext. Not that being from a vendor is a bad thing, but you should always consider the source of a document when evaluating its claims.

Quite naturally I jumped to: “6. Data Integration & Identity Resolution: Identifying the same entity across disparate data sources.”

With so many different databases and systems existing inside any single organization, how do companies integrate all of their data? How do they recognize that an entity in one database is the same entity in a completely separate database?

Resolving identities across disparate sources can be tricky. First, they need to be identified and then linked.

To do this effectively, you need two things. Earlier, we mentioned that through the use of text analysis, the same entity spelled differently can be recognized. Once this happens, the references to entities need to be stored correctly in the triplestore. The triplestore needs to support predicates that can declare two different Universal Resource Indicators (URIs) as one in the same. By doing this, you can align the same real-world entity used in different data sources. The most standard and powerful predicate used to establish mappings between multiple URIs of a single object is owl:sameAs. In turn, this allows you to very easily merge information from multiple sources including linked open data or proprietary sources. The ability to recognize entities across multiple sources holds great promise helping to manage your data more effectively and pinpointing connections in your data that may be masked by slightly different entity references. Merging this information produces more accurate results, a clearer picture of how entities are related to one another and the ability to improve the speed with which your organization operates.

In case you are unfamiliar with owl:sameAS, here is an example from OWL Web Ontology Language Reference

<rdf:Description rdf:about="#William_Jefferson_Clinton"&gt:
  <owl:sameAs rdf:resource="#BillClinton"/>

The owl:sameAs in this case is opaque because there is no way to express why an author thought #William_Jefferson_Clinton and #BillClinton were about the same subject. You could argue that any prostitute in Columbia would recognize that mapping so let’s try a harder case.

<rdf:Description rdf:about="#United States of America"&gt:
  <owl:sameAs rdf:resource="#الولايات المتحدة الأمريكية"/>

Less confident than you were about the first one?

The problem with owl:sameAs is its opaqueness. You don’t know why an author used owl:sameAs. You don’t know what property or properties they saw that caused them to use one of the various understandings of owl:sameAs.

Without knowing those properties, accepting any owl:sameAs mapping is buying a pig in a poke. Not a proposition that interests me. You?

I first saw this in a tweet by graphityhq.

Computer Science – Know Thyself!

August 22nd, 2014

Putting the science in computer science by Felienne Hermans.

From the description:

Programmers love science! At least, so they say. Because when it comes to the ‘science’ of developing code, the most used tool is brutal debate. Vim versus emacs, static versus dynamic typing, Java versus C#, this can go on for hours at end. In this session, software engineering professor Felienne Hermans will present the latest research in software engineering that tries to understand and explain what programming methods, languages and tools are best suited for different types of development.

Great slides from Felienne’s keynote at ALE 2014.

I mention this to emphasize the need for social science research techniques and methodologies for application development. Investigation of computer science debates with such methods may lead to less resistance to them for user facing issues.

Perhaps a recognition that we are all “users,” bringing common human experiences to different interfaces with computers, will result in better interfaces for all.

Data Carpentry (+ Sorted Nordic Scores)

August 21st, 2014

Data Carpentry by David Mimno.

From the post:

The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.

What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry.

Note: data carpentry seems to already be a thing

I’m not convinced that “carpentry” is the best prestige target.

The first mention of carpenters on a sorted version of the Nordic Scores (Colorado Adoption Project: Resources for Researchers. Institute for Behavioral Genetics, University of Colorado Boulder) is at 147.*

I would go for data scientist since mercenary isn’t listed as an occupation. ;-)

The usual cautions apply. Prestige is as difficult or perhaps more so to measure than any other social construct. The data is from 1989 and so may not reflect “current” prestige rankings.

*(I have removed the classes and sorted by prestige score, to create Sorted Nordic Scores.)

…Loosely Consistent Distributed Programming

August 21st, 2014

Language Support for Loosely Consistent Distributed Programming by Neil Conway.


Driven by the widespread adoption of both cloud computing and mobile devices, distributed computing is increasingly commonplace. As a result, a growing proportion of developers must tackle the complexity of distributed programming—that is, they must ensure correct application behavior in the face of asynchrony, concurrency, and partial failure.

To help address these difficulties, developers have traditionally relied upon system infrastructure that provides strong consistency guarantees (e.g., consensus protocols and distributed transactions). These mechanisms hide much of the complexity of distributed computing—for example, by allowing programmers to assume that all nodes observe the same set of events in the same order. Unfortunately, providing such strong guarantees becomes increasingly expensive as the scale of the system grows, resulting in availability and latency costs that are unacceptable for many modern applications.

Hence, many developers have explored building applications that only require loose consistency guarantees—for example, storage systems that only guarantee that all replicas eventually converge to the same state, meaning that a replica might exhibit an arbitrary state at any particular time. Adopting loose consistency involves making a well-known tradeoff: developers can avoid paying the latency and availability costs incurred by mechanisms for achieving strong consistency, but inexchange they must deal with the full complexity of distributed computing. As a result, achieving correct application behavior in this environment is very difficult.

This thesis explores how to aid developers of loosely consistent applications by providing programming language support for the difficulties they face. The language level is a natural place to tackle this problem: because developers that use loose consistency have fewer system facilities that they can depend on, consistency concerns are naturally pushed into application logic. In part, our goal has been to recognize, formalize, and automate application-level consistency patterns.

We describe three language variants that each tackle a different challenge in distributed programming. Each variant is a modification of Bloom, a declarative language for distributed programming we have developed at UC Berkeley. The first variant of Bloom, BloomL, enables deterministic distributed programming without the need for distributed coordination. Second, Edelweiss allows distributed storage reclamation protocols to be generated in a safe and automatic fashion. Finally, BloomPO adds sophisticated ordering constraints that we use to develop a declarative, high-level implementation of concurrent editing, a particularly difficult class of loosely consistent programs.

Unless you think of topic maps as static files, recent developments in “loosely consistent distributed programming” should be high on your reading list.

It’s entirely possible to have a topic map that is a static file, even one that has been printed out to paper. But that seems like a poor target for development. Captured information begins progressing towards staleness from the moment of its capture.

I first saw this in a tweet by Peter Bailis.

The Little Book of Semaphores

August 21st, 2014

The Little Book of Semaphores by Allen Downey.

From the webpage:

The Little Book of Semaphores is a free (in both senses of the word) textbook that introduces the principles of synchronization for concurrent programming.

In most computer science curricula, synchronization is a module in an Operating Systems class. OS textbooks present a standard set of problems with a standard set of solutions, but most students don’t get a good understanding of the material or the ability to solve similar problems.

The approach of this book is to identify patterns that are useful for a variety of synchronization problems and then show how they can be assembled into solutions. After each problem, the book offers a hint before showing a solution, giving students a better chance of discovering solutions on their own.

The book covers the classical problems, including “Readers-writers,” “Producer-consumer”, and “Dining Philosophers.” In addition, it collects a number of not-so-classical problems, some written by the author and some by other teachers and textbook writers. Readers are invited to create and submit new problems.

If you want a deep understanding of concurrency, this looks like a very good place to start!

Some of the more colorful problem names:

  • The dining savages problem
  • The Santa Claus problem
  • The unisex bathroom problem
  • The Senate Bus problem

There are problems (and patterns) for your discovery and enjoyment!

I first saw this in a tweet by Computer Science.

CSV Fingerprints

August 21st, 2014

CSV Fingerprints by Victor Powell.

From the post:

CSV is a simple and common format for tabular data that uses commas to separate rows and columns. Nearly every spreadsheet and database program lets users import from and export to CSV. But until recently, these programs varied in how they treated special cases, like when the data itself has a comma in it.

It’s easy to make a mistake when you try to make a CSV file fit a particular format. To make it easier to spot mistakes, I’ve made a “CSV Fingerprint” viewer (named after the “Fashion Fingerprints” from The New York Times’s “Front Row to Fashion Week” interactive ). The idea is to provide a birdseye view of the file without too much distracting detail. The idea is similar to Tufte’s Image Quilts…a qualitative view, as opposed to a rendering of the data in the file themselves. In this sense, the CSV Fingerprint is a sort of meta visualization.

This is very clever. Not only can you test a CSV snippet on the webpage, but the source code is on Github. (source code)

Of course, it does rely on the most powerful image processing system known to date. Err, that would be you. ;-)

Pass this along. I can imagine any number of data miners who will be glad you did.

Math for machine learning

August 20th, 2014

Math for machine learning by Zygmunt Zając.

From the post:

Sometimes people ask what math they need for machine learning. The answer depends on what you want to do, but in short our opinion is that it is good to have some familiarity with linear algebra and multivariate differentiation.

Linear algebra is a cornerstone because everything in machine learning is a vector or a matrix. Dot products, distance, matrix factorization, eigenvalues etc. come up all the time.

Differentiation matters because of gradient descent. Again, gradient descent is almost everywhere*. It found its way even into the tree domain in the form of gradient boosting – a gradient descent in function space.

We file probability under statistics and that’s why we don’t mention it here.

Following this introduction you will find a series of books, MOOCs, etc. on linear algebra, calculus and other math resources.

Pass it along!

Mapping Out Lambda Land:…

August 20th, 2014

Mapping Out Lambda Land: An Introduction to Functional Programming by Katie Miller.

From the post:

Anyone who has met me will probably know that I am wildly enthusiastic about functional programming (FP). I co-founded a group for women in FP, have presented a series of talks and workshops about functional concepts, and have even been known to create lambda-branded clothing and jewellery. In this blog post, I will try to give some insight into what the fuss is about. I will briefly explain what functional programming is, why you should care, and how you can use OpenShift to learn more about FP.

With the publicity around OpenShift and functional programming, it seems entirely reasonable to put them together.

Katie gives you a quick overview of functional programming along with resources and next steps for your OpenShift account.

I first saw this in a post by Jonathan Murray.

Web Annotation Working Group (Preventing Semantic Rot)

August 20th, 2014

Web Annotation Working Group

From the post:

The W3C Web Annotation Working Group is chartered to develop a set of specifications for an interoperable, sharable, distributed Web annotation architecture. The chartered specs consist of:

  1. Abstract Annotation Data Model
  2. Data Model Vocabulary
  3. Data Model Serializations
  5. Client-side API

The working group intends to use the Open Annotation Data Model and Open Annotation Extension specifications, from the W3C Open Annotation Community Group, as a starting point for development of the data model specification.

The Robust Link Anchoring specification will be jointly developed with the WebApps WG, where many client-side experts and browser implementers participate.

Some good news for the middle of a week!

Shortcomings to watch for:

Can annotations be annotated?

Can non-Web addressing schemes be used by annotators?

Can the structure of files (visible or not) in addition to content be annotated?

If we don’t have all three of those capabilities, then the semantics of annotations will rot, just as semantics of earlier times have rotted away. The main distinction is that most of our ancestors didn’t choose to allow the rot to happen.

I first saw this in a tweet by Rob Sanderson.

Not just the government’s playbook

August 20th, 2014

Not just the government’s playbook by Mike Loukides.

From the post:

Whenever I hear someone say that “government should be run like a business,” my first reaction is “do you know how badly most businesses are run?” Seriously. I do not want my government to run like a business — whether it’s like the local restaurants that pop up and die like wildflowers, or megacorporations that sell broken products, whether financial, automotive, or otherwise.

If you read some elements of the press, it’s easy to think that is the first time that a website failed. And it’s easy to forget that a large non-government website was failing, in surprisingly similar ways, at roughly the same time. I’m talking about the Common App site, the site high school seniors use to apply to most colleges in the US. There were problems with pasting in essays, problems with accepting payments, problems with the app mysteriously hanging for hours, and more.

I don’t mean to pick on Common App; you’ve no doubt had your own experience with woefully bad online services: insurance companies, Internet providers, even online shopping. I’ve seen my doctor swear at the Epic electronic medical records application when it crashed repeatedly during an appointment. So, yes, the government builds bad software. So does private enterprise. All the time. According to TechRepublic, 68% of all software projects fail. We can debate why, and we can even debate the numbers, but there’s clearly a lot of software #fail out there — in industry, in non-profits, and yes, in government.

With that in mind, it’s worth looking at the U.S. CIO’s Digital Services Playbook. It’s not ideal, and in many respects, its flaws reveal its origins. But it’s pretty good, and should certainly serve as a model, not just for the government, but for any organization, small or large, that is building an online presence.

See Mike’s post for the extracted thirteen (13) principles (plays in Obama-speak) for software projects.

While everybody needs a reminder, what puzzles me is that none of the principles are new. That being the case, shouldn’t we be asking:

Why haven’t projects been following these rules?

Reasoning that if we (collectively) know what makes software projects succeed, what are the barrier to implementing those steps in all software projects?

Re-stating rules that we already know to be true, without more, isn’t very helpful. Projects that start tomorrow with have a fresh warning in their ears and commit the same errors that doom 68% of all other projects.

My favorite suggestion and the one I have seen violated most often is:

Bring in experienced teams

I am told, “…our staff don’t know how to do X, Y or Z….” That sounds to me like a personnel problem. In an IT recession, a problem that isn’t hard to fix. But no, the project has to succeed with IT staff known to lack the project management or technical skills to succeed. You can guess the outcome of such projects in advance.

The restatement of project rules isn’t a bad thing to have but your real challenge is going to be following them. Suggestions for everyone’s benefit welcome!

International Conference on Functional Programming 2014

August 20th, 2014

The 19th ACM SIGPLAN International Conference on Functional Programming Complete Proceedings of ICFP 2014 available for free for one year.

I count thirty-one (31) papers that you can access for the next year.

Be aware the ACM has imposed a petty 10 second “wait” screen even though the articles are available without charge.

I first saw this in a tweet by David Van Horn.

Exposing Resources in Datomic…

August 20th, 2014

Exposing Resources in Datomic Using Linked Data by Ratan Sebastian.

From the post:

Financial data feeds from various data providers tend to be closed off from most people due to high costs, licensing agreements, obscure documentation, and complicated business logic. The problem of understanding this data, and providing access to it for our application is something that we (and many others) have had to solve over and over again. Recently at Pellucid we were faced with three concrete problems

  1. Adding a new data set to make data visualizations with. This one was a high-dimensional data set and we were certain that the queries that would be needed to make the charts had to be very parameterizable.

  2. We were starting to come to terms with the difficulty of answering support questions about the data we use in our charts given that we were serving up the data using a Finagle service that spoke a binary protocol over TCP. Support staff should not have to learn Datomic’s highly expressive query language, Datalog or have to set up a Scala console to look at the raw data that was being served up.

  3. Different data sets that we use had semantically equivalent data that was being accessed in ways specific to that data set.

And as a long-term goal we wanted to be able to query across data sets instead of doing multiple queries and joining in memory.

These are very orthogonal goals to be sure. We embarked on a project which we thought might move us in those three directions simultaneously. We’d already ingested the data set from the raw file format into Datomic, which we love. Goal 2 was easily addressable by conveying data over a more accessible protocol. And what’s more accessible than REST. Goal 1 meant that we’d have expose quite a bit of Datalog expressivity to be able to write all the queries we needed. And Goal 3 hinted at the need for some way to talk about things in different data silos using a common vocabulary. Enter the Linked Data Platform. A W3C project, the need for which is brilliantly covered in this talk. What’s the connection? Wait for it…

The RDF Datomic Mapping

If you are happy with Datomic and RDF primitives, for semantic purposes, this may be all you need.

You have to appreciate Ratan’s closing sentiments:

We believe that a shared ontology of financial data could be very beneficial to many and open up the normally closeted world of handling financial data.

Even though we know as a practical matter that no “shared ontology of financial data” is likely to emerge.

In the absence of such a shared ontology, there are always topic maps.

Deep Learning (MIT Press Book)

August 20th, 2014

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow, and Aaron Courville.

From the webpage:

Draft chapters available for feedback – August 2014
Please help us make this a great book! This draft is still full of typos and can be improved in many ways. Your suggestions are more than welcome. Do not hesitate to contact any of the authors directly by e-mail or Google+ messages: Yoshua, Ian, Aaron.

Teaching a subject isn’t the only way to learn it cold. Proofing a book on a subject is another way to learn material cold.

Ready to dig in?

I first saw this in a tweet by Gregory Piatetsky

Wandora 2014-08-20 Release

August 20th, 2014

Wandora 2014-08-20 Release

From the change log:


For a file with the distribution date, in case you have multiple versions, try

In the latest round of new features, Rekognition extractor and Alchemy API image keyword extractor are the two I am most likely to try first. Images are one of the weakest forms of evidence but they still carry the imprimatur of being “photographic.”

What photo collection will you be tagging first?