Ad-hoc Biocuration Workflows?

July 19th, 2014

Text-mining-assisted biocuration workflows in Argo by Rafal Rak, et al. (Database (2014) 2014 : bau070 doi: 10.1093/database/bau070)


Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced.

Database URL:

From the introduction:

Data curation from biomedical literature had been traditionally carried out as an entirely manual effort, in which a curator handpicks relevant documents and creates annotations for elements of interest from scratch. To increase the efficiency of this task, text-mining methodologies have been integrated into curation pipelines. In curating the Biomolecular Interaction Network Database (1), a protein–protein interaction extraction system was used and was shown to be effective in reducing the curation work-load by 70% (2). Similarly, a usability study revealed that the time needed to curate FlyBase records (3) was reduced by 20% with the use of a gene mention recognizer (4). Textpresso (5), a text-mining tool that marks up biomedical entities of interest, was used to semi-automatically curate mentions of Caenorhabditis elegans proteins from the literature and brought about an 8-fold increase in curation efficiency (6). More recently, the series of BioCreative workshops have fostered the synergy between biocuration efforts and text-mining solutions. The user-interactive track of the latest workshop saw nine Web-based systems featuring rich graphical user interfaces designed to perform text-mining-assisted biocuration tasks. The tasks can be broadly categorized into the selection of documents for curation, the annotation of mentions of relevant biological entities in text and the annotation of interactions between biological entities (7).

Argo is a truly impressive text-mining-assisted biocuration application but the first line of a biocuration article needs to read:

Data curation from biomedical literature had been traditionally carried out as an entirely ad-hoc effort, after the author has submitted their paper for publication.

There is an enormous backlog of material that desperately needs biocuration and Argo (and other systems) have a vital role to play in that effort.

However, the situation of ad-hoc biocuration is never going to improve unless and until biocuration is addressed in the authoring of papers to appear in biomedical literature.

Who better to answer questions or resolve ambiguities that appear in biocuration than the authors of the papers?

That would require working to extend MS Office and Apache OpenOffice, to name two of the more common authoring platforms.

But the return would be higher quality publications earlier in the publication cycle, which would enable publishers to provide enhanced services based upon higher quality products and enhance tracing and searching of the end products.

No offense to ad-hoc efforts but higher quality sooner in the publication process seems like an unbeatable deal.

…Ad-hoc Contextual Inquiry

July 19th, 2014

Honing Your Research Skills Through Ad-hoc Contextual Inquiry by Will Hacker.

From the post:

It’s common in our field to hear that we don’t get enough time to regularly practice all the types of research available to us, and that’s often true, given tight project deadlines and limited resources. But one form of user research–contextual inquiry–can be practiced regularly just by watching people use the things around them and asking a few questions.

I started thinking about this after a recent experience returning a rental car to a national brand at the Phoenix, Arizona, airport.

My experience was something like this: I pulled into the appropriate lane and an attendant came up to get the rental papers and send me on my way. But, as soon as he started, someone farther up the lane called loudly to him saying he’d been waiting longer. The attendant looked at me, said “sorry,” and ran ahead to attend to the other customer.

A few seconds later a second attendant came up, took my papers, and jumped into the car to check it in. She was using an app on a tablet that was attached to a large case with a battery pack, which she carried over her shoulder. She started quickly tapping buttons, but I noticed she kept navigating back to the previous screen to tap another button.

Curious being that I am, I asked her if she had to go back and forth like that a lot. She said “yes, I keep hitting the wrong thing and have to go back.”

Will expands his story into why and how to explore random user interactions with technology.

If you want to become better at contextual inquiry and observation, Will has the agenda for you.

He concludes:

Although exercises like this won’t tell us the things we’d like to know about the products we work on, they do let us practice the techniques of contextual inquiry and observation and make us more sensitive to various design issues. These experiences may also help us build the case in more companies for scheduling time and resources for in-field research with our actual customers.

Government-Grade Stealth Malware…

July 19th, 2014

Government-Grade Stealth Malware In Hands Of Criminals by Sara Peters.

From the post:

Malware originally developed for government espionage is now in use by criminals, who are bolting it onto their rootkits and ransomware.

The malware, dubbed Gyges, was first discovered in March by Sentinel Labs, which just released an intelligence report outlining their findings. From the report: “Gyges is an early example of how advanced techniques and code developed by governments for espionage are effectively being repurposed, modularized and coupled with other malware to commit cybercrime.”

Sentinel was able to detect Gyges with on-device heuristic sensors, but many intrusion prevention systems would miss it. The report states that Gyges’ evasion techniques are “significantly more sophisticated” than the payloads attached. It includes anti-detection, anti-tampering, anti-debugging, and anti-reverse-engineering capabilities.

The figure I keep hearing quoted is that cybersecurity attackers are ten years ahead of cybersecurity defenders.

Is that what you hear?

Whatever the actual gap, what makes me curious is why the gap exists at all? I assume the attackers and defenders are on par as far as intelligence, programming skills, financial support, etc., so what is the difference that accounts for the gap?

I don’t have the answer or even a suspicion of a suggestion but suspect someone else does.

Pointers anyone?

First complex, then simple

July 19th, 2014

First complex, then simple by James D Malley and Jason H Moore. (BioData Mining 2014, 7:13)


At the start of a data analysis project it is often suggested that the researcher look first at multiple simple models. That is, always begin with simple, one variable at a time analyses, such as multiple single-variable tests for association or significance. Then, later, somehow (how?) pull all the separate pieces together into a single comprehensive framework, an inclusive data narrative. For detecting true compound effects with more than just marginal associations, this is easily defeated with simple examples. But more critically, it is looking through the data telescope from the wrong end.

I would have titled this article: “Data First, Models Later.”

That is, the authors start with no formal theories about what the data will prove and, upon finding signals in the data, then generate simple models to explain the signals.

I am sure their questions of the data are driven by a suspicion of what the data may prove, but that isn’t the same thing as asking questions designed to prove a model generated before the data is queried.
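The abstract's point that one-variable-at-a-time analysis is "easily defeated with simple examples" can be illustrated with a toy case. The sketch below is my own illustration, not an example from the paper: the outcome is the XOR of two binary variables, so each variable alone predicts no better than a coin flip while the pair predicts perfectly. The data and the `best_accuracy` helper are assumptions made for the demonstration.

```python
from collections import Counter, defaultdict
from itertools import product

# Toy data (made up): 100 rows of (a, b, outcome) where outcome = a XOR b.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25

def best_accuracy(rows, columns):
    """Accuracy of the best lookup-table model that sees only `columns`."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[c] for c in columns)
        groups[key][row[2]] += 1
    # In each group, the best guess is the most common outcome.
    correct = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return correct / len(rows)

acc_a = best_accuracy(rows, [0])      # variable a alone
acc_b = best_accuracy(rows, [1])      # variable b alone
acc_ab = best_accuracy(rows, [0, 1])  # both variables together

print(acc_a, acc_b, acc_ab)
```

Neither single-variable view shows any signal, so a "simple models first" screen would discard both variables before the compound effect could ever be seen.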

Government Software Design Questions

July 19th, 2014

10 questions to ask when reviewing design work by Ben Terrett.

Ben and a colleague reduced a list of design review questions by Jason Fried down to ten:

10 questions to ask when reviewing design work

1. What is the user need?

2. Is what it says and what it means the same thing?

3. What’s the take away after 3 seconds? (We thought 8 seconds was a bit long.)

4. Who needs to know that?

5. What does someone know now that they didn’t know before?

6. Why is that worth a click?

7. Are we assuming too much?

8. Why that order?

9. What would happen if we got rid of that?

10. How can we make this more obvious?


I’m Ben, Director of Design at GDS. You can follow me on twitter @benterrett

A great list for reviewing any design!

Where design doesn’t just mean an interface but presentation of data as well.

I am now following @benterrett and you should too.

It is a healthy reminder that not everyone in government wants to harm their own citizens and others. A minority do but let’s not forget true public servants while opposing tyrants.

I first saw the ten questions post in Nat Torkington’s Four short links: 18 July 2014.

What is deep learning, and why should you care?

July 19th, 2014

What is deep learning, and why should you care? by Pete Warden.

From the post:


When I first ran across the results in the Kaggle image-recognition competitions, I didn’t believe them. I’ve spent years working with machine vision, and the reported accuracy on tricky tasks like distinguishing dogs from cats was beyond anything I’d seen, or imagined I’d see anytime soon. To understand more, I reached out to one of the competitors, Daniel Nouri, and he demonstrated how he used the Decaf open-source project to do so well. Even better, he showed me how he was quickly able to apply it to a whole bunch of other image-recognition problems we had at Jetpac, and produce much better results than my conventional methods.

I’ve never encountered such a big improvement from a technique that was largely unheard of just a couple of years before, so I became obsessed with understanding more. To be able to use it commercially across hundreds of millions of photos, I built my own specialized library to efficiently run prediction on clusters of low-end machines and embedded devices, and I also spent months learning the dark arts of training neural networks. Now I’m keen to share some of what I’ve found, so if you’re curious about what on earth deep learning is, and how it might help you, I’ll be covering the basics in a series of blog posts here on Radar, and in a short upcoming ebook.

Pete gives a brief sketch of “deep learning” and promises more posts and a short ebook to follow.

Along those same lines you will want to see:

Microsoft Challenges Google’s Artificial Brain With ‘Project Adam’ by Daniela Hernandez (WIRED).

If you want in depth (technical) coverage, see: Deep Learning…moving beyond shallow machine learning since 2006! The reading list and references here should keep you busy for some time.

BTW, on “…shallow machine learning…” you do know the “Dark Ages” really weren’t “dark” but were so named in the Renaissance in order to show the fall into darkness (the Fall of Rome), the “Dark Ages,” and then the return of “light” in the Renaissance? See: Dark Ages (historiography).

Don’t overly credit characterizations of ages or technologies by later ages or newer technologies. They too will be found primitive and superstitious.


July 19th, 2014

Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.


Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild! which allows processors access to shared memory with the possibility of over-writing each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild! achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild! outperforms alternative schemes that use locking by an order of magnitude. (emphasis in original)
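To make the update scheme in the abstract concrete, here is a minimal sketch of lock-free SGD on a sparse problem. It is not the authors' implementation: the toy regression problem, the constants, and every function name are assumptions of mine. CPython threads also share a global interpreter lock, so this shows the unsynchronized update pattern only, not the speedup reported in the paper.

```python
import random
import threading

DIM = 50             # size of the shared decision variable
NONZEROS = 3         # each example touches only a few coordinates (sparsity)
LEARNING_RATE = 0.1

rng = random.Random(0)
true_w = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def make_example(r):
    """Return a sparse example as ({index: value}, target)."""
    idx = r.sample(range(DIM), NONZEROS)
    x = {j: r.choice([-1.0, 1.0]) for j in idx}
    y = sum(true_w[j] * v for j, v in x.items())
    return x, y

def mean_squared_error(w, examples):
    return sum((sum(w[j] * v for j, v in x.items()) - y) ** 2
               for x, y in examples) / len(examples)

def worker(w, seed, steps):
    """Run SGD steps against the shared list `w` without taking any lock."""
    r = random.Random(seed)
    for _ in range(steps):
        x, y = make_example(r)
        err = sum(w[j] * v for j, v in x.items()) - y
        for j, v in x.items():
            # Unsynchronized read-modify-write: another thread may overwrite it.
            w[j] -= LEARNING_RATE * err * v

shared_w = [0.0] * DIM
held_out = [make_example(rng) for _ in range(200)]
mse_before = mean_squared_error(shared_w, held_out)

threads = [threading.Thread(target=worker, args=(shared_w, seed, 2000))
           for seed in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

mse_after = mean_squared_error(shared_w, held_out)
print(mse_before, mse_after)
```

Because each step modifies only three of the fifty coordinates, two threads rarely collide on the same entry, which is the sparsity condition the abstract relies on.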

From further in the paper:

Our second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. The data set consists of 18,167 entities and 180,110 mentions and similarities given by string similarity. In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167.

A 9X speedup on 10 cores. (Against Vowpal Wabbit.)
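The quoted passage mentions a Euclidean projection onto a simplex at every gradient step. The excerpt does not say how the authors compute it, so the following is only a sketch of one well-known sort-based method for projecting onto the probability simplex (non-negative entries summing to 1); the function name is mine.

```python
def project_onto_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) == 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the largest k that passes fixes the threshold
    # Shift every entry by the threshold and clip negatives to zero.
    return [max(x - theta, 0.0) for x in v]

projected = project_onto_simplex([0.2, 0.9, -0.3])
corner = project_onto_simplex([2.0, 0.0])
print(projected, corner)
```

The sort dominates the cost, so at dimension 18,167 each projection is O(n log n) work per gradient step.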

A must read paper.

I first saw this in Nat Torkington’s Four short links: 15 July 2014. Nat says:

the algorithm that Microsoft credit with the success of their Adam deep learning system.

Artificial Intelligence | Natural Language Processing

July 18th, 2014

Artificial Intelligence | Natural Language Processing by Christopher Manning.

From the webpage:

This course is designed to introduce students to the fundamental concepts and ideas in natural language processing (NLP), and to get them up to speed with current research in the area. It develops an in-depth understanding of both the algorithms available for the processing of linguistic information and the underlying computational properties of natural languages. Word-level, syntactic, and semantic processing from both a linguistic and an algorithmic perspective are considered. The focus is on modern quantitative techniques in NLP: using large corpora, statistical models for acquisition, disambiguation, and parsing. Also, it examines and constructs representative systems.

Lectures with notes.

If you are new to natural language processing, it would be hard to point at a better starting point.


Build Roads not Stagecoaches

July 18th, 2014

Build Roads not Stagecoaches by Martin Fenner.

Describing Eric Hysen’s keynote, Martin says:

In his keynote he described how travel from Cambridge to London in the 18th and early 19th century improved mainly as a result of better roads, made possible by changes in how these roads are financed. Translated to today, he urged the audience to think more about the infrastructure and less about the end products:

Ecosystems, not apps

— Eric Hysen

On Tuesday at csv,conf, Nick Stenning – Technical Director of the Open Knowledge Foundation – talked about data packages, an evolving standard to describe data that are passed around between different systems. He used the metaphor of containers, and how they have dramatically changed the transportation of goods in the last 50 years. He argued that the cost of shipping was in large part determined by the cost of loading and unloading, and the container has dramatically changed that equation. We are in a very similar situation with datasets, where most of the time is spent translating between different formats, joining things together that use different names for the same thing [emphasis added], etc.

…different names for the same thing.

Have you heard that before? ;-)

But here is the irony:

When I thought more about this I realized that these building blocks are exactly the projects I get most excited about, i.e. projects that develop standards or provide APIs or libraries. Some examples would be

  • ORCID: unique identifiers for scholarly authors

OK, but many authors already have unique identifiers in DBLP, Library of Congress, Twitter, and at places I have not listed.

Nothing against ORCID, but adding yet another identifier isn’t all that helpful.

A mapping between identifiers, so having one means I can leverage the others, now that is what I call infrastructure.
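As a rough sketch of what such a mapping could look like, the class below groups identifiers that have been asserted to name the same author, so that holding any one of them retrieves the rest. The class, its method names, and all of the identifier values are hypothetical; none of them come from ORCID, DBLP, or Twitter.

```python
class IdentifierMap:
    """Groups identifiers that are asserted to refer to the same author."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        self._parent.setdefault(ident, ident)
        while self._parent[ident] != ident:
            # Path halving keeps later lookups short.
            self._parent[ident] = self._parent[self._parent[ident]]
            ident = self._parent[ident]
        return ident

    def same_as(self, a, b):
        """Record that identifiers a and b name the same subject."""
        self._parent[self._find(a)] = self._find(b)

    def equivalents(self, ident):
        """Return every known identifier for the subject named by ident."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)

ids = IdentifierMap()
# Made-up identifiers, written as (scheme, value) pairs.
ids.same_as(("orcid", "0000-0000-0000-0000"), ("dblp", "example/author"))
ids.same_as(("dblp", "example/author"), ("twitter", "@example_author"))
ids.same_as(("orcid", "1111-1111-1111-1111"), ("twitter", "@someone_else"))

known = ids.equivalents(("twitter", "@example_author"))
print(known)
```

Nothing here requires any one scheme to be the master identifier; each assertion simply merges two groups.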


Duplicate Tool Names

July 18th, 2014

You wait ages for somebody to develop a bioinformatics tool called ‘Kraken’ and then three come along at once by Keith Bradnam.

From the post:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The latter publication is from 2013, and the other two are from this year (2014).

Yet another illustration that names are not enough.

A URL identifier would not help unless you recognize the URL.

Identification with name/value plus other key/value pairs?

Leaves everyone free to choose whatever names they like.

It also enables the rest of us to distinguish between tools (or other subjects) that share the same name.

Simple concept. Easy to apply. Disappoints people who want to be in charge of naming things.
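A minimal sketch of the idea, using the three Kraken tools from the quoted post: each tool keeps the name its authors chose, and additional key/value pairs do the distinguishing. The key names (`purpose`, `published`) and the `find` helper are my own choices for illustration, and I have assigned the 2013 date to the last tool listed on my reading of "the latter publication."

```python
tools = [
    {"name": "Kraken", "purpose": "genomic coordinate translation",
     "published": 2014},
    {"name": "Kraken", "purpose": "metagenomic sequence classification",
     "published": 2014},
    {"name": "Kraken", "purpose": "quality control of high-throughput sequence data",
     "published": 2013},
]

def find(subjects, **properties):
    """Return the subjects whose key/value pairs include all given properties."""
    return [s for s in subjects
            if all(s.get(k) == v for k, v in properties.items())]

by_name = find(tools, name="Kraken")                          # ambiguous
by_name_and_year = find(tools, name="Kraken", published=2013)  # unambiguous

print(len(by_name), by_name_and_year[0]["purpose"])
```

The name alone matches three subjects; one extra property is enough to pick out a single tool.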

Sounds like three good reasons to me, especially the last one.

Scikit-learn 0.15 release

July 17th, 2014

Scikit-learn 0.15 release by Gaël Varoquaux.

From the post:


Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical agglomerative clustering— Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMM are deprecated— We have been discussing for a long time removing HMMs, that do not fit in the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…
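The release notes' remark that complete and average linkage "can take an arbitrary distance matrix" is easier to see in code. This is a plain-Python sketch of agglomerative clustering that reads nothing but a precomputed matrix; it is not scikit-learn's implementation or API, and the function names and the toy matrix are assumptions of mine.

```python
def agglomerate(dist, n_clusters, linkage=max):
    """Merge clusters bottom-up using only a precomputed distance matrix.

    linkage=max gives complete linkage; pass a mean function for average linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([dist[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

def average(values):
    return sum(values) / len(values)

# A hand-made dissimilarity matrix over four items. It violates the triangle
# inequality (9 > 1 + 7), so it is not a Euclidean distance.
dist = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
complete = agglomerate(dist, 2)
avg = agglomerate(dist, 2, linkage=average)
print(complete, avg)
```

Both linkages only ever compare entries of the matrix, which is why any pairwise dissimilarity will do.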

Get thee to Scikit-learn!

April 2014 Crawl Data Available

July 17th, 2014

April 2014 Crawl Data Available by Stephen Merity.

From the post:

The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. The new data is located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-15/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.
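The prefixing step described in the post can be sketched as follows. I am assuming each line of a listing is a bucket-relative path, and the two file names in the sample are made up. Only the S3 prefix that appears in the quoted text is used; the HTTP prefix is missing from the excerpt above, so it is left as a parameter to supply.

```python
import gzip
import io

S3_PREFIX = "s3://aws-publicdatasets/"

def full_paths(gzipped_listing, prefix):
    """Yield prefix + line for every non-empty line of a gzipped path listing."""
    with gzip.open(gzipped_listing, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield prefix + line

# Stand-in for a downloaded listing; these two file names are hypothetical.
sample = io.BytesIO(gzip.compress(
    b"common-crawl/crawl-data/CC-MAIN-2014-15/example-segment/file-0.warc.gz\n"
    b"common-crawl/crawl-data/CC-MAIN-2014-15/example-segment/file-1.warc.gz\n"
))
s3_paths = list(full_paths(sample, S3_PREFIX))
print(s3_paths[0])
```

Because `full_paths` is a generator, a real listing can be streamed line by line without loading it into memory.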

Thanks again to Blekko for their ongoing donation of URLs for our crawl!

Well, at 183TB, I don’t guess I am going to have a local copy. ;-)


FDA Recall Data

July 16th, 2014

OpenFDA Provides Ready Access to Recall Data by Taha A. Kass-Hout.

From the post:

Every year, hundreds of foods, drugs, and medical devices are recalled from the market by manufacturers. These products may be labeled incorrectly or might pose health or safety issues. Most recalls are voluntary; in some cases they may be ordered by the U.S. Food and Drug Administration. Recalls are reported to the FDA, and compiled into its Recall Enterprise System, or RES. Every week, the FDA releases an enforcement report that catalogues these recalls. And now, for the first time, there is an Application Programming Interface (API) that offers developers and researchers direct access to all of the drug, device, and food enforcement reports, dating back to 2004.

The recalls in this dataset provide an illuminating window into both the safety of individual products and the safety of the marketplace at large. Recent reports have included such recalls as certain food products (for not containing the vitamins listed on the label), a soba noodle salad (for containing unlisted soy ingredients), and a pain reliever (for not following laboratory testing requirements).
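For a sense of what querying the API involves, here is a small sketch that only builds a request URL and makes no network call. The host, the `/enforcement.json` path, the `search` and `limit` parameters, and the `reason_for_recall` field are written from my recollection of the openFDA documentation and are not taken from the post, so check them against the current API reference before relying on them.

```python
from urllib.parse import urlencode

BASE = "https://api.fda.gov"  # assumed host, verify against the openFDA docs

def enforcement_url(category, search=None, limit=5):
    """Build a query URL for drug, device, or food enforcement reports."""
    if category not in ("drug", "device", "food"):
        raise ValueError("category must be 'drug', 'device', or 'food'")
    params = {"limit": limit}
    if search:
        params["search"] = search
    return f"{BASE}/{category}/enforcement.json?{urlencode(params)}"

# Hypothetical query: food recalls whose stated reason mentions soy.
url = enforcement_url("food", search='reason_for_recall:"soy"', limit=3)
print(url)
```

Fetching the URL would return JSON, which any HTTP client and `json.loads` can handle.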

You will get warnings that this data is “not for clinical use.”

Sounds like a treasure trove of data if you are looking for products still being sold despite being recalled.

Or if you want to advertise for “victims” of faulty products that have been recalled.

I think both of those are non-clinical uses. ;-)

Darwin’s ship library goes online

July 16th, 2014

Darwin’s ship library goes online by Dennis Normile.

From the post:

As Charles Darwin cruised the world on the HMS Beagle, he had access to an unusually well-stocked 400-volume library. That collection, which contained the observations of numerous other naturalists and explorers, has now been recreated online. As of today, all of more than 195,000 pages and 5000 illustrations from the works are available for the perusal of scholars and armchair naturalists alike, thanks to the Darwin Online project.

Perhaps it isn’t the amount of information you have available but how deeply you understand it that makes a difference.


Which gene did you mean?

July 16th, 2014

Which gene did you mean? by Barend Mons.


Computational Biology needs computer-readable information records. Increasingly, meta-analysed and pre-digested information is being used in the follow up of high throughput experiments and other investigations that yield massive data sets. Semantic enrichment of plain text is crucial for computer aided analysis. In general people will think about semantic tagging as just another form of text mining, and that term has quite a negative connotation in the minds of some biologists who have been disappointed by classical approaches of text mining. Efforts so far have tried to develop tools and technologies that retrospectively extract the correct information from text, which is usually full of ambiguities. Although remarkable results have been obtained in experimental circumstances, the wide spread use of information mining tools is lagging behind earlier expectations. This commentary proposes to make semantic tagging an integral process to electronic publishing.

From within the post:

If all words had only one possible meaning, computers would be perfectly able to analyse texts. In reality however, words, terms and phrases in text are highly ambiguous. Knowledgeable people have few problems with these ambiguities when they read, because they use context to disambiguate ‘on the fly’. Even when fed a lot of semantically sound background information, however, computers currently lag far behind humans in their ability to interpret natural language. Therefore, proper semantic tagging of concepts in texts is crucial to make Computational Biology truly viable. Electronic Publishing has so far only scratched the surface of what is needed.

Open Access publication shows great potential, and is essential for effective information mining, but it will not achieve its full potential if information continues to be buried in plain text. Having true semantic mark up combined with open access for mining is an urgent need to make possible a computational approach to life sciences.

Creating semantically enriched content as part and parcel of the publication process should be a winning strategy.

First, for current data, estimates of what others will be searching for should not be hard to find out. That will help focus tagging on the material users are seeking. Second, a current and growing base of enriched material will help answer questions about the return on enriching material.

Other suggestions for BMC Bioinformatics?

Introducing Source Han Sans:…

July 16th, 2014

Introducing Source Han Sans: An open source Pan-CJK typeface by Caleb Belohlavek.

From the post:

Adobe, in partnership with Google, is pleased to announce the release of Source Han Sans, a new open source Pan-CJK typeface family that is now available on Typekit for desktop use. If you don’t have a Typekit account, it’s easy to set one up and start using the font immediately with our free subscription. And for those who want to play with the original source files, you can get those from our download page on SourceForge.

It’s rather difficult to describe your semantics when you can’t write in your own language.

Kudos to Adobe and Google for sponsoring this project!

I first saw this in a tweet by James Clark.

…[S]emantically enriched open pharmacological space…

July 16th, 2014

Scientific competency questions as the basis for semantically enriched open pharmacological space development by Kamal Azzaoui, et al. (Drug Discovery Today, Volume 18, Issues 17–18, September 2013, Pages 843–852)


Molecular information systems play an important part in modern data-driven drug discovery. They do not only support decision making but also enable new discoveries via association and inference. In this review, we outline the scientific requirements identified by the Innovative Medicines Initiative (IMI) Open PHACTS consortium for the design of an open pharmacological space (OPS) information system. The focus of this work is the integration of compound–target–pathway–disease/phenotype data for public and industrial drug discovery research. Typical scientific competency questions provided by the consortium members will be analyzed based on the underlying data concepts and associations needed to answer the questions. Publicly available data sources used to target these questions as well as the need for and potential of semantic web-based technology will be presented.

Pharmacology may not be your space but this is a good example of what it takes for semantic integration of resources in a complex area.

Despite the “…you too can be a brain surgeon with our new web-based app…” from various sources, semantic integration has been, is and will remain difficult under the best of circumstances.

I don’t say that to discourage anyone but to avoid the let-down when integration projects don’t provide easy returns.

It is far better to plan for incremental and measurable benefits along the way than to fashion grandiose goals that are ever receding on the horizon.

I first saw this in a tweet by ChemConnector.

Free Companies House data to boost UK economy

July 15th, 2014

Free Companies House data to boost UK economy

From the post:

Companies House is to make all of its digital data available free of charge. This will make the UK the first country to establish a truly open register of business information.

As a result, it will be easier for businesses and members of the public to research and scrutinise the activities and ownership of companies and connected individuals. Last year (2013/14), customers searching the Companies House website spent £8.7 million accessing company information on the register.

This is a considerable step forward in improving corporate transparency; a key strand of the G8 declaration at the Lough Erne summit in 2013.

It will also open up opportunities for entrepreneurs to come up with innovative ways of using the information.

This change will come into effect from the second quarter of 2015 (April – June).

In a side bar, Business Secretary Vince Cable said in part:

Companies House is making the UK a more transparent, efficient and effective place to do business.

I’m not sure about “efficient,” but providing incentives for lawyers and others to track down insider trading and other business as usual practices and arming them with open data would be a start in the right direction.

I first saw this in a tweet by Hadley Beeman.

Spy vs. Spies

July 15th, 2014

XRay: Enhancing the Web’s Transparency with Differential Correlation by Mathias Lécuyer, et al.


Today’s Web services – such as Google, Amazon, and Facebook – leverage user data for varied purposes, including personalizing recommendations, targeting advertisements, and adjusting prices. At present, users have little insight into how their data is being used. Hence, they cannot make informed choices about the services they choose. To increase transparency, we developed XRay, the first fine-grained, robust, and scalable personal data tracking system for the Web. XRay predicts which data in an arbitrary Web account (such as emails, searches, or viewed products) is being used to target which outputs (such as ads, recommended products, or prices). XRay’s core functions are service agnostic and easy to instantiate for new services, and they can track data within and across services. To make predictions independent of the audited service, XRay relies on the following insight: by comparing outputs from different accounts with similar, but not identical, subsets of data, one can pinpoint targeting through correlation. We show both theoretically, and through experiments on Gmail, Amazon, and YouTube, that XRay achieves high precision and recall by correlating data from a surprisingly small number of extra accounts.

Not immediately obvious, until someone explains it, but any system that reacts based on input you control can be investigated. Whether that includes dark marketing forces or government security agencies.
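A toy version of that insight, as I read the abstract: give several accounts overlapping but different subsets of inputs, record whether an output shows up, and see which input's presence tracks the output. This is not XRay's algorithm, only a naive difference-of-rates score; the account data and all names are invented for the example.

```python
def targeting_scores(accounts):
    """Score each input by how much its presence raises the chance of an output.

    accounts: list of (set_of_inputs, output_was_seen) pairs, one per account.
    """
    all_inputs = set().union(*(inputs for inputs, _ in accounts))
    scores = {}
    for item in all_inputs:
        with_item = [seen for inputs, seen in accounts if item in inputs]
        without_item = [seen for inputs, seen in accounts if item not in inputs]
        rate_with = sum(with_item) / len(with_item)
        rate_without = sum(without_item) / len(without_item) if without_item else 0.0
        scores[item] = rate_with - rate_without
    return scores

# Hypothetical shadow accounts: overlapping subsets of three email topics,
# plus whether a particular ad appeared in that account.
accounts = [
    ({"flights", "shoes"}, True),
    ({"flights", "mortgage"}, False),
    ({"shoes", "mortgage"}, True),
    ({"flights"}, False),
    ({"shoes"}, True),
    ({"mortgage"}, False),
]
scores = targeting_scores(accounts)
likely_trigger = max(scores, key=scores.get)
print(likely_trigger, scores[likely_trigger])
```

With real services the outputs are noisy, which is presumably why the paper needs a more careful statistical treatment than this.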

Be aware that provoking government security agencies is best left to professionals. ;-)

The next step will be to have bots that project false electronic trails for us to throw advertisers (or others) off track.

Very much worth your time to read.

Graph Classes and their Inclusions

July 15th, 2014

Information System on Graph Classes and their Inclusions

From the webpage:

What is ISGCI?

ISGCI is an encyclopaedia of graphclasses with an accompanying java application that helps you to research what’s known about particular graph classes. You can:

  • check the relation between graph classes and get a witness for the result
  • draw clear inclusion diagrams
  • colour these diagrams according to the complexity of selected problems
  • find the P/NP boundary for a problem
  • save your diagrams as Postscript, GraphML or SVG files
  • find references on classes, inclusions and algorithms

As of 2014-07-06, the database contains 1497 classes and 176,888 inclusions.

If you are past the giddy stage of “Everything’s a graph!,” you may find this site useful.


July 15th, 2014

RDFUnit – an RDF Unit-Testing suite

From the post:

RDFUnit is a test driven data-debugging framework that can run automatically generated (based on a schema) and manually generated test cases against an endpoint. All test cases are executed as SPARQL queries using a pattern-based transformation approach.

For more information on our methodology please refer to our report:

Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web.

RDFUnit in a Nutshell

  • Test case: a data constraint that involves one or more triples. We use SPARQL as a test definition language.
  • Test suite: a set of test cases for testing a dataset
  • Status: Success, Fail, Timeout (complexity) or Error (e.g. network). A Fail can be an actual error, a warning or a notice
  • Data Quality Test Pattern (DQTP): Abstract test cases that can be instantiated into concrete test cases using pattern bindings
  • Pattern Bindings: valid replacements for a DQTP variable
  • Test Auto Generators (TAGs): Converts RDFS/OWL axioms into concrete test cases
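
To make the DQTP and pattern-binding idea concrete, here is a minimal sketch of turning an abstract pattern into a concrete SPARQL test query. The `%%NAME%%` placeholder syntax, the pattern text, the `instantiate` helper and the DBpedia-style property names are assumptions for illustration, not RDFUnit's actual code or pattern library.

```python
import re

# Abstract pattern: find resources where two property values
# stand in a relation that should never hold.
PATTERN = """SELECT DISTINCT ?resource WHERE {
  ?resource %%P1%% ?v1 .
  ?resource %%P2%% ?v2 .
  FILTER (?v1 %%OP%% ?v2)
}"""


def instantiate(pattern: str, bindings: dict) -> str:
    """Replace every %%NAME%% placeholder with its bound value."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for pattern variable {name}")
        return bindings[name]

    return re.sub(r"%%(\w+)%%", replace, pattern)


# Concrete test case: nobody should be born after they died.
query = instantiate(
    PATTERN,
    {"P1": "dbo:birthDate", "P2": "dbo:deathDate", "OP": ">"},
)
```

Any resource the resulting query returns would count as a Fail for that test case.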

If you are working with RDF data, this will certainly be helpful.

BTW, don’t miss the publications further down on the homepage for RDFUnit.

I first saw this in a tweet by Marin Dimitrov.

Classification and regression trees

July 15th, 2014

Classification and regression trees by Wei-Yin Loh.


Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weakness in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 14–23 DOI: 10.1002/widm.8.
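
The partitioning step for a regression tree can be sketched as a search for the single split that minimizes squared error. This is a bare-bones illustration on one numeric feature, not any of the algorithms the paper reviews; the names `best_split` and `sum_squared_error` and the sample data are made up for the example.

```python
from typing import List, Optional, Tuple


def sum_squared_error(values: List[float]) -> float:
    """Squared error when every value is predicted by the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs: List[float], ys: List[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, total squared error) of the best binary split on x."""
    pairs = sorted(zip(xs, ys))
    best: Optional[Tuple[float, float]] = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or error < best[1]:
            best = (threshold, error)
    return best


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
split = best_split(xs, ys)
```

A full tree applies the same search recursively to each side of the split until a stopping rule is met.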

A bit more challenging than CSV formats but also very useful.

I heard a joke many years ago by a then U.S. Assistant Attorney General who said:

To create a suspect list for a truck hijacking in New York, you choose files with certain name characteristics, delete the ones that are currently in prison and those that remain are your suspect list. (paraphrase)

If topic maps can represent any “subject” then they should be able to represent “group subjects” as well. We may know that our particular suspect is the member of a group, but we just don’t know which member of the group is our suspect.

Think of it as a topic map that evolves as more data/analysis is brought to the map and members of a group subject can be broken out into smaller groups or even individuals.

In fact, displaying summaries of characteristics of members of a group in response to classification/regression could well help with the subject analysis process. An interactive construction/mining of the topic map as it were.

Great paper whether you use it for topic map subject analysis or more traditional purposes.

Linked Data Guidelines (Australia)

July 15th, 2014

First Version of Guidelines for Publishing Linked Data released by Allan Barger.

From the post:

The Australian Government Linked Data Working group (AGLDWG) is pleased to announce the release of a first version of a set of guidelines for the publishing of Linked Datasets on at:

The “URI Guidelines for publishing Linked Datasets on” document provides a set of general guidelines aimed at helping Australian Government agencies to define and manage URIs for Linked Datasets and the resources described within that are published on The Australian Government Linked Data Working group has developed the report over the last two years while the first datasets under the sub-domain have been published following the patterns defined in this document.

Thought you might find this useful in mapping linked data sets from the Australian government to:

  • non-Australian government linked data sets
  • non-government linked data sets
  • non-linked data data sets (all sources)
  • pre-linked data data sets (all sources)
  • post-linked data data sets (all sources)


CSV validator – a new digital preservation tool

July 15th, 2014

CSV validator – a new digital preservation tool by David Underdown.

From the post:

Today marks the official release of a new digital preservation tool developed by The National Archives, CSV Validator version 1.0. This follows on from well known tools such as DROID and PRONOM database used in file identification (discussed in several previous blog posts). The release comprises the validator itself, but perhaps more importantly, it also includes the formal specification of a CSV schema language for describing the allowable content of fields within CSV (Comma Separated Value) files, which gives something to validate against.
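
To show what "something to validate against" means in practice, here is a small sketch that checks CSV fields against per-column rules. It does not use the CSV Schema language or the CSV Validator itself; the regular-expression rules, the `validate_csv` function and the sample data are assumptions for illustration, and the line numbering assumes no field spans multiple lines.

```python
import csv
import io
import re
from typing import Dict, List, Optional, Tuple


def validate_csv(
    text: str, rules: Dict[str, str]
) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line number, column, value) for every field that breaks a rule."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column, pattern in rules.items():
            value = row.get(column)
            if value is None or not re.fullmatch(pattern, value):
                errors.append((line_no, column, value))
    return errors


sample = "id,date\n1,2014-07-15\nx,2014-7-15\n"
rules = {"id": r"\d+", "date": r"\d{4}-\d{2}-\d{2}"}
errors = validate_csv(sample, rules)
```

Here the second data row fails both rules: `x` is not a number and `2014-7-15` lacks the zero-padded month.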

Odd to find two presentations about CSV on the same day!

Adam Retter presented on this project today. slides.

It will be interesting to see how much cross-pollination occurs with the CSV on the Web Working Group.

Suggest you follow both groups.

CSV on the Web

July 15th, 2014

CSV on the Web – What’s Happening in the W3C Working Group by Jeni Tennison.

After seeing Software Carpentry: Lessons Learned yesterday, I have a new appreciation for documenting the semantics of data as used by its users.

Not to say we don’t need specialized semantic syntaxes and technologies, but if we expect market share, then we need to follow the software and data users are using.

How important is CSV?

Jeni gives the stats as:

  • >90% open data is tabular
  • 2/3rds “CSV” files on aren’t machine readable

Which means people use customized solutions (read: vendor lock-in).

A good overview of the CSV WG’s work so far with a request for your assistance:

I need to start following this workgroup. Curious to see if they reuse XQuery addressing to annotate CSV files, columns, rows, cells.

PS: If you don’t see arrows in the presentation (I didn’t), use your space bar to change slides and Esc to see all the slides.

Visualizing ggplot2 internals…

July 15th, 2014

Visualizing ggplot2 internals with shiny and D3 by Carson Sievert.

From the post:

As I started this project, I became frustrated trying to understand/navigate through the nested list-like structure of ggplot objects. As you can imagine, it isn’t an optimal approach to print out the structure every time you want to check out a particular element. Out of this frustration came an idea to build this tool to help interact with and visualize this structure. Thankfully, my wonderful GSoC mentor Toby Dylan Hocking agreed that this project could bring value to the ggplot2 community and encouraged me to pursue it.

By default, this tool presents a radial Reingold–Tilford Tree of this nested list structure, but also has options to use the collapsible or cartesian versions. It also leverages the shinyAce package which allows users to send arbitrary ggplot2 code to a shiny server that evaluates the results and re-renders the visuals. I’m quite happy with the results as I think this tool is a great way to quickly grasp the internal building blocks of ggplot(s). Please share your thoughts below!

I started with the blog post about the visualization but seeing the visualization is more powerful:

Visualizing ggplot2 internals (demo)

I rather like the radial layout.

For either topic map design or analysis, this looks like a good technique to explore the properties we assign to subjects.

Be Secure, Be Very Secure

July 15th, 2014

Using strong crypto, TunnelX offers a conversation tool that no one can snoop on by Jeff John Roberts.

From the post:

Between NSA surveillance and giant corporations that sniff our messages for ad money, it sometimes feels as if there’s no such thing as a private online conversation. An intriguing group of techno-types and lawyers are trying to change that with a secure new messaging service called TunnelX.

TunnelX, which is free, offers online “tunnels” where two people can meet and share messages and media in a space no one else can see. While TunnelX isn’t the only company trying to restore privacy in the post-Snowden era, its tool is worth a look because it is aimed at everyday people — and not just the usual crowd of crypto-heads and paranoiacs.

Jeff gives a good overview of TunnelX and how it can be used by ordinary users.

TunnelX gives the technical skinny as:

Tunnel X “superenciphers” all stored messages and uploaded files with AES, TwoFish, and Serpent using different 256-bit keys for each layer. (AES is the cipher approved by the U.S. National Security Agency for encrypting classified data across all U.S. government agencies; TwoFish and Serpent are the two most well-known “runner-up” AES candidates.)

Tunnel X allows only SSL/TLS-encrypted connections (sometimes called “https” connections). Furthermore, we strongly encourage you to connect with the latest version of TLS (1.2). Finally, as part of our SSL/TLS setup, Tunnel X only allows connections which are secured with a PFS (perfect forward secrecy) ciphersuite. PFS is a technology which prevents encrypted messages from being stored and then decrypted in the future if a server’s private SSL key is ever compromised.
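
For a sense of what "TLS 1.2 with a PFS ciphersuite" looks like from the client side, here is a sketch using Python's standard `ssl` module. It is not TunnelX's configuration: the function name `strict_client_context` and the cipher string are assumptions. The string limits TLS 1.2 to ECDHE key exchange, which provides forward secrecy; TLS 1.3 suites, which are forward secret by design, stay enabled.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client context that refuses anything older than TLS 1.2
    and, for TLS 1.2, offers only ECDHE (forward-secret) suites."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # OpenSSL cipher-list syntax; affects TLS 1.2 and below only.
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    return ctx


ctx = strict_client_context()
cipher_names = [cipher["name"] for cipher in ctx.get_ciphers()]
```

A context like this can be handed to `urllib.request.urlopen(url, context=ctx)` so that connections to servers without a forward-secret suite fail instead of silently downgrading.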

Under “What is a tunnel?” on the homepage you will find a list of technologies that TunnelX does not use!

I just created an account and the service merits high marks for ease of use!

The one useful feature I did not see is a “delete on read” setting, so that messages are deleted as soon as they are read by the intended target.

Just another layer of security on top of what TunnelX already offers.

For all the layers of security, realize the black shirts don’t need to decrypt your messages once they discover your identity.

Knowing your identity, they can apply very unreliable techniques to extract messages from you personally. That is one of the problems with saviors of civilization. Given the stakes, no atrocity is beyond them.

Flax Clade PoC

July 14th, 2014

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer

Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, “Social Psychology”). There is no distinction between parent and child nodes (except that the former has children) and the hierarchical structure of the taxonomy is completely orthogonal to the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in Solr, but see below for how to export the document categorisation to an XML or CSV file.
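
The keyword rule described above (keywords joined with OR, some keywords excluded) maps naturally onto a Lucene-style query string. Here is a sketch of how such a query might be assembled. It is not Clade's code: the `category_query` function and the sample keywords are assumptions, and the query is written against whatever default search field the Solr index defines.

```python
from typing import Iterable, List


def quote(term: str) -> str:
    """Quote a word or phrase, escaping backslashes and double quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def category_query(keywords: List[str], excluded: Iterable[str] = ()) -> str:
    """Build a query matching any keyword and rejecting excluded ones."""
    if not keywords:
        raise ValueError("a category needs at least one keyword")
    query = "(" + " OR ".join(quote(k) for k in keywords) + ")"
    for term in excluded:
        query += " -" + quote(term)
    return query


q = category_query(["attitude", "social influence"], excluded=["marketing"])
```

Sent to Solr as the `q` parameter, a string like this would return matching documents in relevance order, which is how a document can be ranked under several categories at once.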

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.

CMU Machine Learning Summer School (2014)

July 14th, 2014

CMU Machine Learning Summer School (2014)

From the webpage:

Machine Learning is a foundational discipline that forms the basis of much modern data analysis. It combines theory from areas as diverse as Statistics, Mathematics, Engineering, and Information Technology with many practical and relevant real life applications. The focus of the current summer school is big data analytics, distributed inference, scalable algorithms, and applications to the digital economy. The event is targeted at research students, IT professionals, and academics from all over the world.

This school is suitable for all levels, both for researchers without previous knowledge in Machine Learning, and those wishing to broaden their expertise in this area. That said, some background will prove useful. For a research student, the summer school provides a unique, high-quality, and intensive period of study. It is ideally suited for students currently pursuing, or intending to pursue, research in Machine Learning or related fields. Limited scholarships are available for students to cover accommodation, registration costs, and partial travel expenses.

Videos have been posted at YouTube!


An Empirical Investigation into Programming Language Syntax

July 14th, 2014

An Empirical Investigation into Programming Language Syntax by Greg Wilson.

A great synopsis of Andreas Stefik and Susanna Siebert’s “An Empirical Investigation into Programming Language Syntax.” ACM Transactions on Computing Education, 13(4), Nov. 2013.

A sample to interest you in the post:

  1. Programming language designers needlessly make programming languages harder to learn by not doing basic usability testing. For example, “…the three most common words for looping in computer science, for, while, and foreach, were rated as the three most unintuitive choices by non-programmers.”
  2. C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax. Again, this pain is needless, because the syntax of other languages (such as Python and Ruby) is significantly easier.

Let me repeat part of that:

C-style syntax, as used in Java and Perl, is just as hard for novices to learn as a randomly-designed syntax.

Randomly-designed syntax?

Now, think about the latest semantic syntax or semantic query syntax you have read about.

Was it designed for users? Was there any user testing at all?

Is there a lesson here for designers of semantic syntaxes and query languages?


I first saw this in Greg Wilson’s Software Carpentry: Lessons Learned video.