Archive for January, 2014

RProtoBuf:… [will it be clear later?]

Friday, January 31st, 2014

RProtoBuf: Efficient Cross-Language Data Serialization in R by Dirk Eddelbuettel, Murray Stokely, and Jeroen Ooms.


Modern data collection and analysis pipelines often involve a sophisticated mix of applications written in general purpose and specialized programming languages. Many formats commonly used to import and export data between diff erent programs or systems, such as CSV or JSON , are verbose, inefficient, not type-safe, or tied to a specifi c programming language. Protocol Bu ffers are a popular method of serializing structured data between applications|while remaining independent of programming languages or operating systems. They o er a unique combination of features, performance, and maturity that seems particulary well suited for data-driven applications and numerical computing. The RProtoBuf package provides a complete interface to Protocol Bu ers from the R environment for statistical computing. This paper outlines the general class of data serialization requirements for statistical computing, describes the implementation of the RProtoBuf package, and illustrates its use with example applications in large-scale data collection pipelines and web services.

Anyone using RProtoBuf, or any other encoding where a “schema” is separated from data, needs to assign someone the task of reuniting the data with its schema.

Sandra Blakeslee reported on the consequences of failing to document data in Lost on Earth: Wealth of Data Found in Space some fourteen (14) years ago this coming March 20th.

In attempting to recover data from an Viking mission, one NASA staffer observed:

After tracking down the data, Mr. Eliason looked up the NASA documents that described how they were entered. ”It was written in technical jargon,” he said. ”Maybe it was clear to the person who wrote it but it was not clear to me 20 years later.” (emphasis added)

You may say, “…but that’s history, we know better now…,” but can you name who is responsible for documentation on your data and or the steps for processing it? Is it current?

I have no problem with binary formats for data interchange in processing pipelines. But, data going into a pipeline should be converted from a documented format and data coming out of a pipeline should be serialized into a documented format.

The Gold Book

Friday, January 31st, 2014

IUPAC Compendium of Chemical Terminology (Gold Book)

From the webpage:

The Compendium is popularly referred to as the “Gold Book”, in recognition of the contribution of the late Victor Gold, who initiated work on the first edition. It is one of the series of IUPAC “Colour Books” on chemical nomenclature, terminology, symbols and units (see the list of source documents), and collects together terminology definitions from IUPAC recommendations already published in Pure and Applied Chemistry and in the other Colour Books.

Terminology definitions published by IUPAC are drafted by international committees of experts in the appropriate chemistry sub-disciplines, and ratified by IUPAC’s Interdivisional Committee on Terminology, Nomenclature and Symbols (ICTNS). In this edition of the Compendium these IUPAC-approved definitions are supplemented with some definitions from ISO and from the International Vocabulary of Basic and General Terms in Metrology; both these sources are recognised by IUPAC as authoritative. The result is a collection of nearly 7000 terms, with authoritative definitions, spanning the whole range of chemistry.

Some minor editorial changes were made to the originally published definitions, to harmonise the presentation and to clarify their applicability, if this is limited to a particular sub-discipline. Verbal definitions of terms from Quantities, Units and Symbols in Physical Chemistry (the IUPAC Green Book, in which definitions are generally given as mathematical expressions) were developed specially for this Compendium by the Physical Chemistry Division of IUPAC. Definitions of a few physicochemical terms not mentioned in the Green Book were added at the same time (referred to here as Physical Chemistry Division, unpublished).

The first reference given at the end of each definition is to the page of Pure Appl. Chem. or other source where the original definition appears; other references given designate other places where compatible definitions of the same term or additional information may be found, in other IUPAC documents. The complete reference citations are given in the appended list of source documents. Highlighted terms within individual definitions link to other entries where additional information is available.

If you are looking for authoritative chemistry terminology, you may not need to look any further!

IUPUC – International Union of Pure and Applied Chemistry.

The “color” books that were mentioned:

Chemical Terminology (Gold book)

Quantities, Units and Symbols in Physical Chemistry (Green Book)

Nomenclature of Organic Chemistry (Blue book)

Macromolecular Nomenclature (Purple book)

Analytical Terminology (Orange book)

Biochemical Nomenclature (White Book)

Nomenclature of Inorganic Chemistry (Red Book)

Some “lite” weekend reading. 😉

Apps for Energy

Friday, January 31st, 2014

Apps for Energy

Deadline: March 9, 2014

From the webpage:

The Department of Energy is awarding $100,000 in prizes for the best web and mobile applications that use one or more featured APIs, standards or ideas to help solve a problem in a unique way.

Submit an application by March 9, 2014!

Not much in the way of semantic integration opportunities, at least as the contest is written.

Still, it is an opportunity to work with government data and there is a chance you could win some money!

Sigma.js Version 1.0 Released!

Friday, January 31st, 2014

Sigma.js Version 1.0 Released!

From the homepage:

Sigma is a JavaScript library dedicated to graph drawing. It makes easy to publish networks on Web pages, and allows developers to integrate network exploration in rich Web applications.

Appreciated the inclusion of Victor Hugo’s Les Misérables example that comes with Gephi.

Something familiar always makes learning easier.

I first saw this in a tweet by Bryan Connor.

Introducing R

Friday, January 31st, 2014

Introducing R by Germán Rodríguez. (PDF) HTML Version.

A brief but very useful introduction to R.

I first saw this at: Princeton’s guide to linear modeling and logistic regression with R. I was lead there by a tweet by David Smith.


Friday, January 31st, 2014

Transform DOCX to HTML/CSS with High-Fidelity using PowerTools for Open XML by Eric White.

From the post:

Today I am happy to announce the release of HtmlConverter version 2.06.00, which is a high fidelity conversion from DOCX to HTML/CSS. HtmlConverter is a module in the PowerTools for Open XML project.

HtmlConverter.cs 2.06.00 supports:

  • Paragraph styles, character styles, and table styles, including styles that are based on other styles.
  • Table styles includes support for conditional table style options (header row, total row, banded rows, first column, last column, and banded columns.
  • Fonts, including font styles such as bold, italic, underline, strikethrough, foreground and background colors, shading, sub-script, super-script, and more.  HtmlConverter is, in effect, guidance on how to correctly determine the font and formatting for each paragraph and text run in a document.
  • Numbered and bulleted lists.  Current support is only for en-US and fr-FR; however, HtmlConverter is factored and parameterized so that you can support other languages without altering the source code.  In the near future, I’ll be publishing guidance and instructions on how to support additional languages, and I’ll be asking for volunteers to write and contribute the bits of code to generate canonical (one, two, three) and ordinal (first, second, third) implementations for your native language, as well as the various Asian and RTL numbering systems.
  • Tabs, including left tabs, right tabs, centered tabs, and decimal tabs.  HtmlConverter takes the approach of using font metrics to calculate the exact width of the various pieces of text in a line, and inserts <span> elements with precisely calculated widths.
  • High fidelity support for vertical white space and horizontal white space, including indented text, hanging indents, centered text, right justified text, and justified text.
  • Borders around paragraphs, and high fidelity for borders of tables.
  • Horizontally and vertically merged cells in tables.
  • External hyperlinks, and internal hyperlinks to bookmarks within the document.
  • You have much more control over the conversion when compared to other approaches to converting to HTML.  There are already a number of parameters that enable you to control the transformation, and in the future I’ll be adding many more knobs and levers to fine tune the conversion.  And of course, you have the source code, so you can customize the conversion for your scenario.

See Eric’s post for questions about what priority desired features should have for addition to HtmlConverter.


PowerTools for Open XML is licensed under the Microsoft Public License (Ms-PL), which gives you wide latitude in how you use the code, including its use in commercial products and open source projects.

It won’t be long until “not open source” software will be worthy of comment.

I first saw this in a tweet by Open Microsoft.

…only the information that they can ‘see’…

Friday, January 31st, 2014

Jumping NLP Curves: A Review of Natural Language Processing Research by Erik Cambria and Bebo White.

From the post:

Natural language processing (NLP) is a theory-motivated range of computational techniques for the automatic analysis and representation of human language. NLP research has evolved from the era of punch cards and batch processing (in which the analysis of a sentence could take up to 7 minutes) to the era of Google and the likes of it (in which millions of webpages can be processed in less than a second). This review paper draws on recent developments in NLP research to look at the past, present, and future of NLP technology in a new light. Borrowing the paradigm of `jumping curves’ from the eld of business management and marketing prediction, this survey article reinterprets the evolution of NLP research as the intersection of three overlapping curves –namely Syntactics, Semantics, and Pragmatics Curves– which will eventually lead NLP research to evolve into natural language understanding.

This is not your average review of the literature as the authors point out:

…this review paper focuses on the evolution of NLP research according to three diff erent paradigms, namely: the bag-of-words, bag-of-concepts, and bag-of-narratives models.

But what caught my eye was:

All such capabilities are required to shift from mere NLP to what is usually referred to as natural language understanding (Allen, 1987). Today, most of the existing approaches are still based on the syntactic representation of text, a method which mainly relies on word co-occurrence frequencies. Such algorithms are limited by the fact that they can process only the information that they can `see’. As human text processors, we do not have such limitations as every word we see activates a cascade of semantically related concepts, relevant episodes, and sensory experiences, all of which enable the completion of complex NLP tasks &endash; such as word-sense disambiguation, textual entailment, and semantic role labeling &endash; in a quick and e ffortless way. (emphasis added)

The phrase, “only the information that they can `see’” captures the essence of the problem that topic maps address. A program can only see the surface of a text, nothing more.

The next phrase summarizes the promise of topic maps, to capture “…a cascade of semantically related concepts, relevant episodes, and sensory experiences…” related to a particular subject.

Not that any topic map could capture the full extent of related information to any subject but it can capture information to the extent plausible and useful.

I first saw this in a tweet by Marin Dimitrov.

Scientific Data

Friday, January 31st, 2014

Scientific Data

From the homepage:

Scientific Data is a new open-access, online-only publication for descriptions of scientifically valuable datasets. It introduces a new type of content called the Data Descriptor designed to make your data more discoverable, interpretable and reusable. Scientific Data is currently calling for submissions, and will launch in May 2014.

The Data Descriptors are described in more detail in Metadata associated with Data Descriptor articles to be released under CC0 waiver with this overview:

Box 1. Overview of information in Data Descriptor metadata

Metadata files will be released in the ISA-Tab format, and potentially in other formats in the future, such as Linked Data. An example metadata file is available here, associated with one of our sample Data Descriptors. The information in these files is designed to be a machine-readable supplement to the main Data Descriptor article.

  • Article citation information: Manuscript title, Author list, DOI, publication date, etc
  • Subject terms: according to NPG’s own subject categorization system
  • Annotation of the experimental design and main technologies used: Annotation terms will be derived from community-based ontologies wherever possible. Fields are derived from the ISA framework and include: Design Type, Measurement Type, Factors, Technology Type, and Technology Platform.
  • Information about external data records: Names of the data repositories, data record accession or DOIs, and links to the externally-stored data records
  • Structured tables that provide a detailed accounting of the experimental samples and data-producing assays, including characteristics of samples or subjects of the study, such as species name and tissue type, described using standard terminologies.

For more information on the value of this structured content and how it relates to the narrative article-like content see this earlier blog post by our Honorary Academic Editor, Susanna-Assunta Sansone.

Nature is taking the lead in this effort, which should bring a sense of security to generations of researchers. Security in knowing Nature takes the rights of authors seriously but also knowing the results will be professional grade.

I am slightly concerned that there is no obvious mechanism for maintenance of “annotation terms” from community-based ontologies or other terms, as terminology changes over time. Change in the vocabulary of for any discipline is too familiar to require citation. As those terms change, so will access to valuable historical resources.

Looking at the Advisory Panel, it is heavily weighted in favor of medical and biological sciences. Is there an existing publication that performs a similar function for data sets from physics, astronomy, botany, etc.?

I first saw this in a tweet by ChemConnector.

Open Science Leaps Forward! (Johnson & Johnson)

Friday, January 31st, 2014

In Stunning Win For Open Science, Johnson & Johnson Decides To Release Its Clinical Trial Data To Researchers by Matthew Herper.

From the post:

Drug companies tend to be secretive, to say the least, about studies of their medicines. For years, negative trials would not even be published. Except for the U.S. Food and Drug Administration, nobody got to look at the raw information behind those studies. The medical data behind important drugs, devices, and other products was kept shrouded.

Today, Johnson & Johnson is taking a major step toward changing that, not only for drugs like the blood thinner Xarelto or prostate cancer pill Zytiga but also for the artificial hips and knees made for its orthopedics division or even consumer products. “You want to know about Listerine trials? They’ll have it,” says Harlan Krumholz of Yale University, who is overseeing the group that will release the data to researchers.


Here’s how the process will work: J&J has enlisted The Yale School of Medicine’s Open Data Access Project (YODA) to review requests from physicians to obtain data from J&J products. Initially, this will only include products from the drug division, but it will expand to include devices and consumer products. If YODA approves a request, raw, anonymized data will be provided to the physician. That includes not just the results of a study, but the results collected for each patient who volunteered for it with identifying information removed. That will allow researchers to re-analyze or combine that data in ways that would not have been previously possible.


Scientists can make a request for data on J&J drugs by going to

The ability to “…re-analyze or combine that data in ways that would not have been previously possible…” is the public benefit of Johnson & Johnson’s sharing of data.

With any luck, this will be the start of a general trend among drug companies.

Mappings of the semantics of such data sets should be contributed back to the Yale School of Medicine’s Open Data Access Project (YODA), to further enhance re-use of these data sets.


Friday, January 31st, 2014


From the webpage:

Semantic Web technologies are emerging as an increasingly important approach to distribute and integrate scientific data. These technologies include the trio of the Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL query language. The PubChemRDF project provides RDF formatted information for the PubChem Compound, Substance, and Bioassay databases.

This document provides detailed technical information (release notes) about the PubChemRDF project. Downloadable RDF data is available on the PubChemRDF FTP Site. Past presentations on the PubChemRDF project are available giving a PubChemRDF introduction and on the PubChemRDF details. The PubChem Blog may provide most recent updates on the PubChemRDF project. Please note that the PubChemRDF is evolving as a function of time. However, we intend for such enhancements to be backwards compatible by adding additional information and annotations.

A twitter post commented on there being 59 billion triples.

Nothing to sneeze at but I was more impressed with the types of connections at page 8 of

I am sure there are others but just on that slide:

  • sio:has_component
  • sio:is_stereoisomer_of
  • sio:is_isotopologue_of
  • sio:has_same_connectivity_as
  • sio:similar_to_by_PubChem_2D_similarity_algorithm
  • sio:similar_to_by_PubChem_3D_similarity_algorithm

Using such annotations, the user could decide on what basis to consider compounds “similar” or not.

True, it is non-obvious how I would offer an alternative vocabulary for isotopologue but in this domain, that may not be a requirement.

That we can offer alternative vocabularies for any domain does not mean there is a requirement for alternative vocabularies in any particular domain.

A great source of data!

I first saw this in a tweet by Paul Groth.

Visualize your Twitter followers…

Thursday, January 30th, 2014

Visualize your Twitter followers in 3 fairly easy — and totally free — steps by Derrick Harris.

From the post:

Twitter is a great service, but it’s not exactly easy for users without programming skills to access their account data, much less do anything with it. Until now.

There already are services that will let you download reports about when you tweet and which of your tweets were the most popular, some — like SimplyMeasured and FollowerWonk — will even summarize data about your followers. If you’re willing to wait hours to days (Twitter’s API rate limits are just that — limiting) and play around with open source software, NodeXL will help you build your own social graph. (I started and gave up after realizing how long it would take if you have more than a handful of followers.) But you never really see the raw data, so you have to trust the services and you have to hope they present the information you want to see.

Then, last week, someone from ScraperWiki tweeted at me, noting that service can now gather raw data about users’ accounts. (I’ve used the service before to measure tweet activity.) I was intrigued. But I didn’t want to just see the data in a table, I wanted to do something more with it. Here’s what I did.

Another illustration that the technology expertise gap between users does not matter as much as the imagination gap between users.

The Google Fusion Table image is quite good.

The Data Visualization Catalogue

Thursday, January 30th, 2014

The Data Visualization Catalogue by Drew Skau.

From the post:

If you’ve ever struggled with what visualization to create to best show the data you have, The Data Visualization Catalogue might provide just the help you need.

Severino Ribecca has begun the process of categorizing data visualizations based on what relationships and properties of data that they show. With 54 visualizations currently slated to be categorized, the catalog aims to be a comprehensive list of visualizations, searchable by what you want to show.

Just having a quick reference to the different visualization types is helpful by itself. The details make it even more helpful.

Certified Programming with Dependent Types

Thursday, January 30th, 2014

Certified Programming with Dependent Types by Adam Chlipala.

From the introduction:

We would all like to have programs check that our programs are correct. Due in no small part to some bold but unfulfilled promises in the history of computer science, today most people who write software, practitioners and academics alike, assume that the costs of formal program verification outweigh the benefits. The purpose of this book is to convince you that the technology of program verification is mature enough today that it makes sense to use it in a support role in many kinds of research projects in computer science. Beyond the convincing, I also want to provide a handbook on practical engineering of certified programs with the Coq proof assistant. Almost every subject covered is also relevant to interactive computer theorem-proving in general, such as for traditional mathematical theorems. In fact, I hope to demonstrate how verified programs are useful as building blocks in all sorts of formalizations.

The idea of certified program features prominently in this book’s title. Here the word “certified” does not refer to governmental rules for how the reliability of engineered systems may be demonstrated to sufficiently high standards. Rather, this concept of certification, a standard one in the programming languages and formal methods communities, has to do with the idea of a certificate, or formal mathematical artifact proving that a program meets its specification. Government certification procedures rarely provide strong mathematical guarantees, while certified programming provides guarantees about as strong as anything we could hope for. We trust the definition of a foundational mathematical logic, we trust an implementation of that logic, and we trust that we have encoded our informal intent properly in formal specifications, but few other opportunities remain to certify incorrect software. For compilers and other programs that run in batch mode, the notion of a certifying program is also common, where each run of the program outputs both an answer and a proof that the answer is correct. Any certifying program can be composed with a proof checker to produce a certified program, and this book focuses on the certified case, while also introducing principles and techniques of general interest for stating and proving theorems in Coq.

It is hard to say whether this effort at certified programming will prove to be any more successful than Z notation.

On the other hand, the demand for programs that are provably free of government intrusion or backdoors, is at an all time high.

Government overreaching, overreaching that was disproportionate to any rational goal, will power the success of open source programming and the advent of certified programs.

Ironic that such a pernicious activity will have such unintended good results.

I first saw this in a tweet by Computer Science.

100 numpy exercises

Thursday, January 30th, 2014

100 numpy exercises A joint effort of the numpy community.

The categories are:


Further on Numpy.


I first saw this in a tweet by Gregory Piatetsky.

Resources for learning D3.js

Thursday, January 30th, 2014

Resources for learning D3.js

Nineteen “pinned” resources.

Capabilities of D3.js?

The TweetMap I mentioned yesterday uses D3.js.

Other questions about the capabilities of D3.js?

Capstone [Open Source + Binaries, The New Norm?]

Thursday, January 30th, 2014


From the webpage:

Capstone is a lightweight multi-platform, multi-architecture disassembly framework.Our target is to make Capstone the ultimate disassembly engine for binary analysis and reversing in the security community.


  • Support hardware architectures: ARM, ARM64 (aka ARMv8), Mips, PowerPC & X86 (more details).
  • Clean/simple/lightweight/intuitive architecture-neutral API.
  • Provide details on disassembled instruction (called “decomposer” by others).
  • Provide some semantics of the disassembled instruction, such as list of implicit registers read & written.
  • Implemented in pure C language, with bindings for Python, Ruby, C#, Java, GO, OCaml & Vala available.
  • Native support for Windows & *nix (including MacOSX, Linux, *BSD & Solaris platforms).
  • Thread-safe by design.
  • Distributed under the open source BSD license.

Some of the reasons make Capstone unique are elaborated here.

The faithless in the software industry have no one but themselves to blame if open source and binary distributions become the norm for all software. Having proven themselves unworthy of trust, at any level, it is hard to imagine steps to regain that trust.

Perhaps we should reword the old adage that “to many eyes all bugs are shallow,” to “to many eyes all surveillance attempts are shallow?” To make it clear that open source code can decreases your risk of government or industrial surveillance.

Note the emphasis on “can decrease your risk.” No guarantees but an open and vigilant open source community is a step in the right direction.

Before that day arrives, however, you are going to need tools to discover what people are talking about in binary code. Which is where products like Capstone come into play.

Disassembly is more difficult than vetting source code but the greater the need, the more likely that frameworks like Capstone will become easier and easier to use. You may even spot patterns in how particular agencies attempt to suborn software that you purchased.

If source code isn’t publicly available, the best answer to software vendors is “…thanks, but no thanks.”

PS: Apache really should develop an NSA-Free icon to go with the feather. Pass the word along.

XQueryX 3.0 Proposed Recommendation Published

Thursday, January 30th, 2014

XQueryX 3.0 Proposed Recommendation Published

From the post:

The XML Query Working Group has published a Proposed Recommendation of XQueryX 3.0. XQueryX is an XML representation of an XQuery. It was created by mapping the productions of the XQuery grammar into XML productions. The result is not particularly convenient for humans to read and write, but it is easy for programs to parse, and because XQueryX is represented in XML, standard XML tools can be used to create, interpret, or modify queries. Comments are welcome through 25 February 2014.

Be mindful of the 25 February 2014 deadline for comments and enjoy!

The Structure Data awards:… [Vote For GraphLab]

Wednesday, January 29th, 2014

The Structure Data awards: Honoring the best data startups of 2013 by Derrick Harris.

From the post:

Data is taking over the world, which makes for an exciting time to be covering information technology. Almost every new company understands the importance of analyzing data, and many of their products — from fertility apps to stream-processing engines — are based on this understanding. Whether it’s helping users do new things or just do the same old things better, data analysis really is changing the enterprise and consumer technology spaces, and the world, in general.

With that in mind, we have decided to honor some of the most-promising, innovative and useful data-based startups with our inaugural Structure Data awards. The criteria were simple. Companies (or projects) must have launched in 2013; must have been covered in Gigaom; and, most importantly, must make the collection and analysis of data a key part of the user experience. Identifying these companies was the easy part; the hard part was paring down the list of categories and candidates to a reasonable number.

Just a quick head’s up about the Readers’ Choice awards at GIgaom. Voting closes 14 February 2014.

If you need a suggestion under Machine Learning/AI, vote for GraphLab!

Map-D (the details)

Wednesday, January 29th, 2014

MIT Spinout Exploits GPU Memory for Vast Visualization by Alex Woodie.

From the post:

An MIT research project turned open source project dubbed the Massively Parallel Database (Map-D) is turning heads for its capability to generate visualizations on the fly from billions of data points. The software—an SQL-based, column-oriented database that runs in the memory of GPUs—can deliver interactive analysis of 10TB datasets with millisecond latencies. For this reason, its creator feels comfortable is calling it “the fastest database in the world.”

Map-D is the brainchild of Todd Mostak, who created the software while taking a class in database development at MIT. By optimizing the database to run in the memory of off-the-shelf graphics processing units (GPUs), Mostak found that he could create a mini supercomputer cluster that offered an order of magnitude better performance than a database running on regular CPUs.

“Map-D is an in-memory column store coded into the onboard memory of GPUs and CPUs,” Mostak said today during Webinar on Map-D. “It’s really designed from the ground up to maximize whatever hardware it’s using, whether it’s running on Intel CPU or Nvidia GPU. It’s optimized to maximize the throughput, meaning if a GPU has this much memory bandwidth, what we really try to do is make sure we’re hitting that memory bandwidth.”

During the webinar, Mostak and Tom Graham, his fellow co-founder of the startup Map-D, demonstrated the technology’s capability to interactively analyze datasets composed of a billion individual records, constituting more than 1TB of data. The demo included a heat map of Twitter posts made from 2010 to the present. Map-D’s “TweetMap” (which the company also demonstrated at the recent SC 2013 conference) runs on eight K40 Tesla GPUs, each with 12 GB of memory, in a single node configuration.

You really need to try the TweetMap example. This rocks!

The details on TweetMap:

You can search tweet text, heatmap results, identify and animate trends, share maps and regress results against census data.

For each click Map-D scans the entire database and visualizes results in real-time. Unlike many other tweetmap demos, nothing is canned or pre-rendered. Recent tweets also stream live onto the system and are available for view within seconds of broadcast.

TweetMap is powered by 8 NVIDIA Tesla K40 GPUs with a total of 96GB of GPU memory in a single node. While we sometimes switch between interesting datasets of various size, for the most part TweetMap houses over 1 billion tweets from 2010 to the present.

Imagine interactive “merging” of subjects based on their properties.

Come to think of it, don’t GPUs handle edges between nodes? As in graphs? 😉

A couple of links for more information, although I suspect the list of resources on Map-D is going to grow by leaps and bounds:

Resources page (included videos of demonstrations).

An Overview of MapD (Massively Parallel Database) by Todd Mostak. (whitepaper)

Map of Preventable Diseases

Wednesday, January 29th, 2014

preventable disease

Be sure to see the interactive version of this map by the Council on Foreign Relations.

I first saw this at Chart Porn, which was linking to Map of preventable disease outbreaks shows the influence of anti-vaccination movements by Rich McCormick, which in turn pointed to the CFG map.

The dataset is downloadable from the CFG.

Vaccination being more a matter of public health, I have always wondered by anyone would be allowed an option to decline. Certainly some people will have adverse reactions, even die, and they or their families should be cared for and/or compensated. But they should not be allowed to put large numbers of others at risk.

BTW, when you look at the interactive map, locate Georgia in the United States and you will see the large green dot reports 247 cases of whooping cough for Georgia. The next green dot which slightly overlaps with it, reports 2 cases. While being more than half the size of the dot on Georgia.

Disproportionate scaling of icons reduces the accuracy of the information conveyed by the map. Unfortunate because this is an important public health issue.

Create a Simple Hadoop Cluster with VirtualBox ( < 1 Hour)

Wednesday, January 29th, 2014

How-to: Create a Simple Hadoop Cluster with VirtualBox by Christian Javet.

From the post:

I wanted to get familiar with the big data world, and decided to test Hadoop. Initially, I used Cloudera’s pre-built virtual machine with its full Apache Hadoop suite pre-configured (called Cloudera QuickStart VM), and gave it a try. It was a really interesting and informative experience. The QuickStart VM is fully functional and you can test many Hadoop services, even though it is running as a single-node cluster.

I wondered what it would take to install a small four-node cluster…

I did some research and I found this excellent video on YouTube presenting a step by step explanation on how to setup a cluster with VMware and Cloudera. I adapted this tutorial to use VirtualBox instead, and this article describes the steps used.

Watch for the line:

Overall we will allocate 14GB of memory, so ensure that the host machine has sufficient memory, otherwise this will impact your experience negatively.

Yes, “…impact your experience negatively.”



Applying linked data approaches to pharmacology:…

Wednesday, January 29th, 2014

Applying linked data approaches to pharmacology: Architectural decisions and implementation by Alasdair J.G. Gray, et. al.


The discovery of new medicines requires pharmacologists to interact with a number of information sources ranging from tabular data to scientific papers, and other specialized formats. In this application report, we describe a linked data platform for integrating multiple pharmacology datasets that form the basis for several drug discovery applications. The functionality offered by the platform has been drawn from a collection of prioritised drug discovery business questions created as part of the Open PHACTS project, a collaboration of research institutions and major pharmaceutical companies. We describe the architecture of the platform focusing on seven design decisions that drove its development with the aim of informing others developing similar software in this or other domains. The utility of the platform is demonstrated by the variety of drug discovery applications being built to access the integrated data.

An alpha version of the OPS platform is currently available to the Open PHACTS consortium and a first public release will be made in late 2012, see for details.

The paper acknowledges that present database entries lack semantics.

A further challenge is the lack of semantics associated with links in traditional database entries. For example, the entry in UniProt for the protein “kinase C alpha type homo sapien4 contains a link to the Enzyme database record 5, which has complementary data about the same protein and thus the identifiers can be considered as being equivalent. One approach to resolve this, proposed by, is to provide a URI for the concept which contains links to the database records about the concept [27]. However, the UniProt entry also contains a link to the DrugBank compound “Phosphatidylserine6. Clearly, these concepts are not identical as one is a protein and the other a chemical compound. The link in this case is representative of some interaction between the compound and the protein, but this is left to a human to interpret. Thus, for successful data integration one must devise strategies that address such inconsistencies within the existing data.

I would have said databases lack properties to identify the subjects in question but there is little difference in the outcome of our respective positions, i.e., we need more semantics to make robust use of existing data.

Perhaps even more importantly, the paper treats “equality” as context dependent:

Equality is context dependent

Datasets often provide links to equivalent concepts in other datasets. These result in a profusion of “equivalent” identifiers for a concept. provide a single identifier that links to all the underlying equivalent dataset records for a concept. However, this constrains the system to a single view of the data, albeit an important one.

A novel approach to instance level links between the datasets is used in the OPS platform. Scientists care about the types of links between entities: different scientists will accept concepts being linked in different ways and for different tasks they are willing to accept different forms of relationships. For example, when trying to find the targets that a particular compound interacts with, some data sources may have created mappings to gene rather than protein identifiers: in such instances it may be acceptable to users to treat gene and protein IDs as being in some sense equivalent. However, in other situations this may not be acceptable and the OPS platform needs to allow for this dynamic equivalence within a scientific context. As a consequence, rather than hard coding the links into the datasets, the OPS platform defers the instance level links to be resolved during query execution by the Identity Mapping Service (IMS). Thus, by changing the set of dataset links used to execute the query, different interpretations over the data can be provided.

Opaque mappings between datasets, i.e., mappings that don’t assign properties to source, target and then say what properties or conditions must be met for the mapping to be vaild, are of little use. Rely on opaque mappings at your own risk.

On the other hand, I fully agree that equality is context dependent and the choice of the criteria for equivalence should be left up to users. I suppose in that sense if users wanted to rely on opaque mappings, that would be their choice.

While an exciting paper, it is discussing architectural decisions and so we are not at the point of debating these issues in detail. It promises to be an exciting discussion!

Identifying Case Law

Wednesday, January 29th, 2014

Costs of the (Increasingly) Lengthy Path to U.S. Report Pagination by Peter W. Martin.

If you are not familiar with the U.S. Supreme Court, the thumbnail sketch is that the court publishes its opinions without official page numbers and they remain that way for years. When the final printed version appears, all the cases citing a case without official page numbers, have to be updated. Oh joy! 😉

Peter does a great job illustrating the costs of this approach.

From the post:

On May 17, 2010, the U.S. Supreme Court decided United States v. Comstock, holding that Congress had power under the Necessary and Proper Clause of the U.S. Constitution to authorize civil commitment of a mentally ill, sexually dangerous federal prisoner beyond his release date. (18 U.S.C. § 4248). Three and a half years later, the Court communicated the Comstock decision’s citation pagination with the shipment of the “preliminary print” of Part 1 of volume 560 of the United States Reports. That paperbound publication was logged into the Cornell Law Library on January 3 of this year. (According to the Court’s web site the final bound volume shouldn’t be expected for another year.) United States v. Comstock, appears in that volume at page 126, allowing the full case finally to be cited: United States v. Comstock, 560 U.S. 126 (2010) and specific portions of the majority, concurring and dissenting opinions to be cited by means of official page numbers.

This lag between opinion release and attachment of official volume and page numbers along the slow march to a final bound volume has grown in recent years, most likely as a result of tighter budgets at the Court and the Government Printing Office. Less than two years separated the end of the Court’s term in 2001 and our library’s receipt of the bound volume containing its last decisions. By 2006, five years later, the gap had widened to a full three years. Volume 554 containing the last decisions from the term ending in 2008 didn’t arrive until July 9 of last year. That amounts to nearly five years of delay.

If the printed volumes of the Court’s decisions served solely an archival function, this increasingly tardy path to print would warrant little concern or comment. But because the Court provides no means other than volume and page numbers to cite its decisions and their constituent parts, the increasing delays cast a widening ripple of costs on the federal judiciary, the services that distribute case law, and the many who need to cite it.

The nature of those costs can be illustrated using the Comstock case itself.

In addition to detailing the costs of delayed formal citation, Peter’s analysis is equally applicable to multiple gene names, for example, that precede any attempt at an official name.

What happens to all the literature that was published using the “interim” names?

Yes, we can map between them or create synonym tables, but who knows on what basis we created those tables or mappings?

Legal citations aren’t changing rapidly but the fact they are changing at all is fairly remarkable. Taken as lessons in the management of identifiers, it is a area to watch closely.

Apache Lucene 4.6.1 and Apache SolrTM 4.6.1

Wednesday, January 29th, 2014

Download: Apache Lucene 4.6.1


Lucene CHANGES.txt

New features include: FreeTextSuggester (Lucene-5214), New Document Dictionary (Lucene-5221), and twenty-one (21) others!


Download: Apache SolrTM 4.6.1

Solr CHANGES.txt

New features include: support for AnalyzingInfixSuggester (Solr-5167), new field type EnumField (Solr-5084), and fifteen (15) others!

Not that I will know anytime soon but I am curious how well the AnalyzingInfixSuggester would work with Akkadian.

Change Tracking, Excel, and Subjects

Wednesday, January 29th, 2014

Change tracking is an active topic of discussion in the OpenDocument TC at OASIS. So much so that a sub-committee was formed to create a change tracking proposal for ODF 1.3. OpenDocument – Advanced Document Collaboration SC

In a recent discussion, the sub-committee was reminded that MS Excel, that change tracking is only engaged when working on a “shared workbook.”

If I am working on a non-shared workbook, any changes I make, of whatever nature, formatting, data in cells, formulas, etc., are not being tracked.

Without change tracking, what are several subjects we can’t talk about in an Excel spreadsheet?

  1. We can’t talk about the author of a particular change.
  2. We can’t talk about the author of a change relative to other people or events (such as emails).
  3. We can’t talk about a prior value or formula “as it was.”
  4. We can’t talk about the origin of a prior value or formula.
  5. We can’t talk about a prior value or formula as compared to a later value or formula.

Transparency is the watchword of government and industry.

Opacity is the watchword of spreadsheet change tracking.

Do you see a conflict there?

Supporting the development change tracking in Open Document (ODF) at the OpenDocument TC could shine a bright light in a very dark place.

ZooKeys 50 (2010) Special Issue

Wednesday, January 29th, 2014

Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research by Lyubomir Penev, et. al.

From the editorial:

The principles of Open Access greatly facilitate dissemination of information through the Web where it is freely accessed, shared and updated in a form that is accessible to indexing and data mining engines using Web 2.0 technologies. Web 2.0 turns the taxonomic information into a global resource well beyond the taxonomic community. A significant bottleneck in naming species is the requirement by the current Codes of biological nomenclature ruling that new names and their associated descriptions must be published on paper, which can be slow, costly and render the new information difficult to find. In order to make progress in documenting the diversity of life, we must remove the publishing impediment in order to move taxonomy “from a cottage industry into a production line” (Lane et al. 2008), and to make best use of new technologies warranting the fastest and widest distribution of these new results.

In this special edition of ZooKeys we present a practical demonstration of such a process. The issue opens with a forum paper from Penev et al. (doi: 10.3897/zookeys.50.538) that presents the landscape of semantic tagging and text enhancements in taxonomy. It describes how the content of the manuscript is enriched by semantic tagging and marking up of four exemplar papers submitted to the publisher in three different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (Stoev et al., doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads (Blagoderov et al., doi: 10.3897/zookeys.50.506 and Brake and Tschirnhaus, doi: 10.3897/zookeys.50.505); (iii) generated from an author’s database (Taekul et al., doi: 10.3897/zookeys.50.485). The latter two were submitted as XML-tagged manuscript. These examples demonstrate the suitability of the workflow to a range of possibilities that should encompass most current taxonomic efforts. To implement the aforementioned routes for XML mark up in prospective taxonomic publishing, a special software tool (Pensoft Mark Up Tool, PMT) was developed and its features were demonstrated in the current issue. The XML schema used was version #123 of TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine (NLM) (

A second forum paper from Blagoderov et al. (doi: 10.3897/zookeys.50.539) sets out a workflow that describes the assembly of elements from a Scratchpad taxon page ( to export a structured XML file. The publisher receives the submission, automatically renders the file into the journal‘s layout style as a PDF and transmits it to a selection of referees, based on the key words in the manuscript and the publisher’s database. Several steps, from the author’s decision to submit the manuscript to final publication and dissemination, are automatic. A journal editor first spends time on the submission when the referees’ reports are received, making the decision to publish, modify or reject the manuscript. If the decision is to publish, then PDF proofs are sent back to the author and, when verified, the paper is published both on paper and on-line, in PDF, HTML and XML formats. The original information is also preserved on the original Scratchpad where it may, in due course, be updated. A visitor arriving at the web site by tracing the original publication will be able to jump forward to the current version of the taxon page.

This sounds like the promise of SGML/XML made real doesn’t it?

See the rest of the editorial or ZooKeys 50 for a very good example of XML and semantics in action.

This is a long way from the “related” or “recent” article citations in most publisher interfaces. Thoughts on how to make that change?

A Semantic Web Example? Nearly a Topic Map?

Wednesday, January 29th, 2014

Morphological and Geographical Traits of the British Odonata by Gary D Powney, el. al.


Trait data are fundamental for many aspects of ecological research, particularly for modeling species response to environmental change. We synthesised information from the literature (mainly field guides) and direct measurements from museum specimens, providing a comprehensive dataset of 26 attributes, covering the 43 resident species of Odonata in Britain. Traits included in this database range from morphological traits (e.g. body length) to attributes based on the distribution of the species (e.g. climatic restriction). We measured 11 morphometric traits from five adult males and five adult females per species. Using digital callipers, these measurements were taken from dry museum specimens, all of which were wild caught individuals. Repeated measures were also taken to estimate measurement error. The trait data are stored in an online repository (, alongside R code designed to give an overview of the morphometric data, and to combine the morphometric data to the single value per trait per species data.

A great example of publishing data along with software to manipulate it.

I mention it here because the publisher, Pensoft, references the Semantic Web saying:

The Semantic Web could also be called a “linked Web” because most semantic enhancements are in fact provided through various kinds of links to external resources. The results of these linkages will be visualized in the HTML versions of the published papers through various cross-links within the text and more particularly through the Pensoft Taxon Profile (PTP) ( PTP is a web-based harvester that automatically links any taxon name mentioned within a text to external sources and creates a dynamic web-page for that taxon. PTP saves readers a great amount of time and effort by gathering for them the relevant information on a taxon from leading biodiversity sources in real time.

A substantial feature of the semantic Web is open data publishing, where not only analysed results, but original datasets can be published as citeable items so that the data authors may receive academic dredit for their efforts. For more information, please visit our detailed Data Publishing Policies and Guidelines for Biodiversity Data.

When you view the article, you will find related resources displayed next to the article. A lot of related resources.

Of course it remains to every reader to assemble data across varying semantics but this is definitely a step in the right direction.


I first saw this in a tweet by S.K. Morgan Ernest.

JSON-LD and Why I Hate the Semantic Web

Tuesday, January 28th, 2014

JSON-LD and Why I Hate the Semantic Web by Manu Sporny.

From the post:

JSON-LD became an official Web Standard last week. This is after exactly 100 teleconferences typically lasting an hour and a half, fully transparent with text minutes and recorded audio for every call. There were 218+ issues addressed, 2,000+ source code commits, and 3,102+ emails that went through the JSON-LD Community Group. The journey was a fairly smooth one with only a few jarring bumps along the road. The specification is already deployed in production by companies like Google, the BBC,, Yandex, Yahoo!, and Microsoft. There is a quickly growing list of other companies that are incorporating JSON-LD. We’re off to a good start.

In the previous blog post, I detailed the key people that brought JSON-LD to where it is today and gave a rough timeline of the creation of JSON-LD. In this post I’m going to outline the key decisions we made that made JSON-LD stand out from the rest of the technologies in this space.

I’ve heard many people say that JSON-LD is primarily about the Semantic Web, but I disagree, it’s not about that at all. JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web. The Semantic Web was near the bottom of my list of “things to care about” when working on JSON-LD, and anyone that tells you otherwise is wrong. :P

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.


Something to get your blood pumping early in the week.

Although, I don’t think it is healthy for Manu to hold back so much. 😉

Read the comments to the post as well.

ICML 2014

Tuesday, January 28th, 2014

Volume 32: Proceedings of The 31st International Conference on Machine Learning. Edited by Xing, Eric P. and Jebara, Tony.

I count some eighty-five (85) papers, many with supplementary materials.


I first saw this in a tweet by Mark Reid.


Tuesday, January 28th, 2014

Large scale data analysis made easier with SparkR by Shivaram Venkataraman.

From the post:

R is a widely used statistical programming language and supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited as the runtime is single threaded and can only process data sets that fit in a single machine.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a light-weight frontend to use Spark from R. SparkR allows users to create and transform RDDs in R and interactively run jobs from the R shell on a Spark cluster. You can can try out SparkR today by installing it from our github repo.

Be mindful of the closing caveat:

Right now, SparkR works well for algorithms like gradient descent that are parallelizable but requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLLib. More examples and details about SparkR can be found at

Early days for SparkR but it has a lot of promise.

I first saw this in a tweet by Jason Trost.