The Dirty Little Secret of Cancer Research

October 13th, 2014

The Dirty Little Secret of Cancer Research by Jill Neimark.

From the post:

Across different fields of cancer research, up to a third of all cell lines have been identified as imposters. Yet this fact is widely ignored, and the lines continue to be used under their false identities. As recently as 2013, one of Ain’s contaminated lines was used in a paper on thyroid cancer published in the journal Oncogene.

“There are about 10,000 citations every year on false lines—new publications that refer to or rely on papers based on imposter (human cancer) cell lines,” says geneticist Christopher Korch, former director of the University of Colorado’s DNA Sequencing Analysis & Core Facility. “It’s like a huge pyramid of toothpicks precariously and deceptively held together.”

For all the worry about “big data,” where is the concern over “big bad data?”

Or is “big data” too big for correctness of the data to matter?

Once you discover that a paper is based on “imposter (human cancer) cell lines,” how do you pass that information along to anyone who attempts to cite the article?

In other words, where do you write down that data about the paper, where the paper is the subject in question?

And how do you propagate that data across a universe of citations?
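One way to picture the propagation question is as a walk over the citation graph: start from the papers known to rest on imposter cell lines and follow "cited by" links outward. The sketch below is only an illustration. The paper identifiers and the `cited_by` mapping are hypothetical, and a real system would need to distinguish papers that rely on a result from papers that merely mention it.

```python
from collections import deque

def papers_at_risk(cited_by, flagged):
    """Breadth-first walk over 'cited by' links.

    Returns {paper: number of citation hops from the nearest flagged paper}.
    """
    distance = {paper: 0 for paper in flagged}
    queue = deque(flagged)
    while queue:
        paper = queue.popleft()
        for citing in cited_by.get(paper, ()):
            if citing not in distance:          # first visit is the shortest route
                distance[citing] = distance[paper] + 1
                queue.append(citing)
    return distance

# Hypothetical citation data: key is cited by each paper in the list.
cited_by = {
    "paper-A": ["paper-B", "paper-C"],
    "paper-B": ["paper-D"],
    "paper-C": ["paper-D"],
    "paper-E": ["paper-F"],
}

at_risk = papers_at_risk(cited_by, ["paper-A"])
for paper, hops in sorted(at_risk.items()):
    print(paper, hops)
```

The traversal is the easy part. Where the flag is recorded, and who is trusted to set it, remains the open question.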

The post ends on a high note of current improvements but it is far from settled how to prevent reliance on compromised research.

I first saw this in a tweet by Dan Graur.

Scrape the Gibson: Python skills for data scrapers

October 13th, 2014

Scrape the Gibson: Python skills for data scrapers by Brian Abelson.

From the post:

Two years ago, I learned I had superpowers. Steve Romalewski was working on some fascinating analyses of CitiBike locations and needed some help scraping information from the city’s data portal. Cobbling together the little I knew about R, I wrote a simple scraper to fetch the json files for each bike share location and output it as a csv. When I opened the clean data in Excel, the feeling was tantamount to this scene from Hackers:

Ever since then I’ve spent a good portion of my life scraping data from websites. From movies, to bird sounds, to missed connections, and john boards (don’t ask, I promise it’s for good!), there’s not much I haven’t tried to scrape. In many cases, I don’t even analyze the data I’ve obtained, and the whole process amounts to a nerdy version of sport hunting, with my comma-delimited trophies mounted proudly on Amazon S3.

Important post for two reasons:

  • Good introduction to the art of scraping data
  • Set the norm for sharing scraped data
    • The people who force scraping of data don’t want it shared, combined, merged or analyzed.

      You can help in disappointing them! ;-)
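The JSON-to-CSV step Brian describes (he used R for his first scraper) can be sketched with nothing but the Python standard library. This is a minimal illustration, not code from his post: the sample payload and its field names are hypothetical, and the network fetch is left out so the example runs offline.

```python
import csv
import io
import json

# Hypothetical payload standing in for what a data portal might return.
raw = """
[
  {"id": 72, "name": "W 52 St & 11 Ave", "lat": 40.767, "lon": -73.994},
  {"id": 79, "name": "Franklin St & W Broadway", "lat": 40.719, "lon": -74.007}
]
"""

def json_to_csv(json_text, fieldnames):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = json_to_csv(raw, ["id", "name", "lat", "lon"])
print(csv_text)
```

Writing to a file instead of a string only requires swapping the `StringIO` buffer for `open(path, "w", newline="")`.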

Making of: Introduction to A*

October 13th, 2014

Making of: Introduction to A* by Amit Patel.

From the post:

(Warning: these notes are rough – the main page is here and these are some notes I wrote for a few colleagues and then I kept adding to it until it became a longer page)

Several people have asked me how I make the diagrams on my tutorials.

I need to learn the algorithm and data structures I want to demonstrate. Sometimes I already know them. Sometimes I know nothing about them. It varies a lot. It can take 1 to 5 months to make a tutorial. It’s slow, but the more I make, the faster I am getting.

I need to figure out what I want to show. I start with what’s in the algorithm itself: inputs, outputs, internal variables. With A*, the input is (start, goal, graph), the output is (parent pointers, distances), and the internal variables are (open set, closed set, parent pointers, distances, current node, neighbors, child node). I’m looking for the main idea to visualize. With A* it’s the frontier, which is the open set. Sometimes the thing I want to visualize is one of the algorithm’s internal variables, but not always.

Pure gold on making diagrams for tutorials here. You may make different choices but it isn’t often that the process of making a choice is exposed.

Pass this along. We all benefit from better illustrations in tutorials!
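For readers who want the inputs, outputs and frontier from the quoted passage in concrete form, here is a minimal A* sketch. It is not Amit's code: the function names, the grid helper and the Manhattan heuristic are assumptions made for illustration, and the closed set is left implicit in the `distance` dictionary.

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """graph maps node -> [(neighbor, step_cost), ...].

    Returns (parent pointers, distances), the outputs named in the quote.
    """
    frontier = [(heuristic(start, goal), start)]   # the open set, as a priority queue
    parent = {start: None}
    distance = {start: 0}
    while frontier:
        _, current = heapq.heappop(frontier)
        if current == goal:
            break
        for neighbor, cost in graph.get(current, ()):
            new_distance = distance[current] + cost
            if neighbor not in distance or new_distance < distance[neighbor]:
                distance[neighbor] = new_distance
                parent[neighbor] = current
                priority = new_distance + heuristic(neighbor, goal)
                heapq.heappush(frontier, (priority, neighbor))
    return parent, distance

def path_to(parent, goal):
    """Follow parent pointers back from a reached goal to the start."""
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def grid_graph(width, height, walls):
    """Four-connected grid with unit step costs; wall cells are left out."""
    graph = {}
    for x in range(width):
        for y in range(height):
            if (x, y) in walls:
                continue
            steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            graph[(x, y)] = [
                (n, 1) for n in steps
                if 0 <= n[0] < width and 0 <= n[1] < height and n not in walls
            ]
    return graph

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

graph = grid_graph(3, 3, walls={(1, 0), (1, 1)})
parent, distance = a_star(graph, (0, 0), (2, 0), manhattan)
print(path_to(parent, (2, 0)), distance[(2, 0)])
```

Capturing the contents of `frontier` on each pass through the loop gives the sequence of frontiers that a diagram of the kind Amit describes would draw.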

The Big List of D3.js Examples (Approx. 2500 Examples)

October 13th, 2014

The Big List of D3.js Examples by Christophe Viau.

The interactive version has 2523 examples, whereas the numbered list has 1897 examples, as of 13 October 2014.

There is a rudimentary index of the examples. That’s an observation, not a complaint. Effective indexing of the examples would be a real challenge to the art of indexing.

The current index uses chart type, a rather open ended category. The subject matter of the chart would be another way to index. Indexing by the D3 techniques used would be useful. Data that is being combined with other data?

Effective access to the techniques and data represented by this collection would be awesome!

Give it some thought.

I first saw this in a tweet by Michael McGuffin.

Introduction to Graphing with D3.js

October 13th, 2014

Introduction to Graphing with D3.js by Jan Milosh.

From the post:

D3.js stands for Data-Driven Documents, a JavaScript library for data visualization. It was created by Mike Bostock, based on his PhD studies in the Stanford University data visualization program. Mike now works at the New York Times who sponsors his open source work.

D3 was designed for more than just graphs and charts. It’s also capable of presenting maps, networks, and ordered lists. It was created for the efficient manipulation of documents based on data.

This demonstration will focus on creating a simple scatter plot.

If you are not already using D3 for graphics, Jan’s post is an easy introduction with additional references to take you further.


I first saw this in a tweet by Christophe Viau.

Mirrors for Princes and Sultans:…

October 13th, 2014

Mirrors for Princes and Sultans: Advice on the Art of Governance in the Medieval Christian and Islamic Worlds by Lisa Blaydes, Justin Grimmer, and Alison McQueen.


Among the most significant forms of political writing to emerge from the medieval period are texts offering advice to kings and other high-ranking officials. Books of counsel varied considerably in their content and form; scholars agree, however, that such texts reflected the political exigencies of their day. As a result, writings in the “mirrors for princes” tradition offer valuable insights into the evolution of medieval modes of governance. While European mirrors (and Machiavelli’s Prince in particular) have been extensively studied, there has been less scholarly examination of a parallel political advice literature emanating from the Islamic world. We compare Muslim and Christian advisory writings from the medieval period using automated text analysis, identify sixty conceptually distinct topics that our method automatically categorizes into three areas of concern common to both Muslim and Christian polities, and examine how they evolve over time. We offer some tentative explanations for these trends.

If you don’t know the phrase, “mirrors for princes,”:

texts that seek to offer wisdom or guidance to monarchs and other high-ranking advisors.

Since nearly all bloggers and everyone with a byline in traditional media considers themselves qualified to offer advice to “…monarchs and other high-ranking advisors,” one wonders how the techniques presented would fare with modern texts?

Certainly a different style of textual analysis than is seen outside the humanities and so instructive for that purpose.

I do wonder about the comparison of texts in translation into English. Obviously easier but runs the risk of comparing translators to translators and not so much the thoughts of the original authors.

I first saw this in a tweet by Christopher Phipps.

Measuring Search Relevance

October 13th, 2014

Measuring Search Relevance by Hugh E. Williams.

From the post:

The process of asking many judges to assess search performance is known as relevance judgment: collecting human judgments on the relevance of search results. The basic task goes like this: you present a judge with a search result, and a search engine query, and you ask the judge to assess how relevant the item is to the query on (say) a four-point scale.

Suppose the query you want to assess is ipod nano 16Gb. Imagine that one of the results is a link to Apple’s page that describes the latest Apple iPod nano 16Gb. A judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale). They’d then click on a radio button to record their vote and move on to the next task. If the result we showed them was a story about a giraffe, the judge might decide this result is “irrelevant” (say the lowest rating on the four point scale). If it were information about an iPhone, it might be “partially relevant” (say the second-to-lowest), and if it were a review of the latest iPod nano, the judge might say “relevant” (it’s not perfect, but it sure is useful information about an Apple iPod).

The human judgment process itself is subjective, and different people will make different choices. You could argue that a review of the latest iPod nano is a “great result” — maybe you think it’s even better than Apple’s page on the topic. You could also argue that the definitive Apple page isn’t terribly useful in making a buying decision, and you might only rate it as relevant. A judge who knows everything about Apple’s products might make a different decision to someone who’s never owned a digital music player. You get the idea. In practice, judging decisions depend on training, experience, context, knowledge, and quality — it’s an art at best.

There are a few different ways to address subjectivity and get meaningful results. First, you can ask multiple judges to assess the same results to get an average score. Second, you can judge thousands of queries, so that you can compute metrics and be confident statistically that the numbers you see represent true differences in performance between algorithms. Last, you can train your judges carefully, and give them information about what you think relevance means.

An illustrated walk through measuring search relevance. Useful for a basic understanding of the measurement process and its parameters.

Bookmark this post so that when you tell your judges what “…relevance means”, you can return here and post what you told them.

I ask because I deeply suspect that our ideas of “relevance” vary widely from subject to subject.
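The first remedy Hugh mentions, asking several judges to rate the same result and averaging, is easy to sketch. Everything concrete here is an assumption for illustration: the 0 to 3 numeric mapping of the four-point scale, the three judges, and their votes.

```python
from statistics import mean

# Assumed numeric mapping for the four-point scale in the quoted example.
SCALE = {
    "irrelevant": 0,
    "partially relevant": 1,
    "relevant": 2,
    "great result": 3,
}

# Hypothetical votes: (query, result) -> one label per judge.
judgments = {
    ("ipod nano 16Gb", "Apple product page"): ["great result", "great result", "relevant"],
    ("ipod nano 16Gb", "iPod nano review"): ["relevant", "great result", "relevant"],
    ("ipod nano 16Gb", "iPhone information"): ["partially relevant", "partially relevant", "irrelevant"],
    ("ipod nano 16Gb", "giraffe story"): ["irrelevant", "irrelevant", "irrelevant"],
}

def average_scores(judgments):
    """Average the judges' numeric scores for each (query, result) pair."""
    return {
        pair: mean(SCALE[label] for label in labels)
        for pair, labels in judgments.items()
    }

def disagreement(labels):
    """Spread between the highest and lowest rating given to one result."""
    scores = [SCALE[label] for label in labels]
    return max(scores) - min(scores)

scores = average_scores(judgments)
for (query, result), score in scores.items():
    spread = disagreement(judgments[(query, result)])
    print(f"{result}: mean {score:.2f}, spread {spread}")
```

The spread column is where differing ideas of “relevance” would show up: a result with a high average and a wide spread is one the judges do not agree on.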


Twitter Mapping: Foundations

October 12th, 2014

Twitter Mapping: Foundations by Simon Rogers.

From the post:

With more than 500 million tweets sent every day, Twitter data as a whole can seem huge and unimaginable, like cramming the contents of the Library of Congress into your living room.

One way of trying to make that big data understandable is by making it smaller and easier to handle by giving it context; by putting it on a map.

It’s something I do a lot—I’ve published over 1,000 maps in the past five years, mostly at Guardian Data. At Twitter, with 77% of users outside the US, it’s often aimed at seeing if regional variations can give us a global picture, an insight into the way a story spreads around the globe. Here’s what I’ve learned about using Twitter data on maps.

… (lots of really cool maps and links omitted)

Creating data visualizations is simpler now than it’s ever been, with a plethora of tools (free and paid) meaning that any journalist working in any newsroom can make a chart or a map in a matter of minutes. Because of time constraints, we often use CartoDB to animate maps of tweets over time. The process is straightforward—I’ve written a how-to guide on my blog that shows how to create an animated map of dots using the basic interface, and if the data is not too big it won’t cost you anything. CartoDB is also handy for other reasons: as it has access to Twitter data, you can use it to get the geotagged tweets too. And it’s not the only one: Trendsmap is a great way to see location of conversations over time.

Have you made a map with Twitter Data that tells a compelling story? Share it with us via @TwitterData.

While composing this post I looked at the CartoDB solution for geotagged tweets. It is impressive, but it is currently in beta with a starting price of $300/month. Works if you get your expenses paid but a bit pricey for occasional use.

There is a free option for CartoDB (up to 50 MB of data) but I don’t think it includes the Twitter capabilities.

Sample mapping tweets on your favorite issues. Maps are persuasive in ways that are not completely understood.

Making Your First Map

October 11th, 2014

Making Your First Map from Mapbox.

From the webpage:

Regardless of your skill level, we have the tools that allow you to quickly build maps and share them online in minutes.

In this guide, we’ll cover the basics of our online tool, the Mapbox Editor, by creating a store location map for a bike shop.

A great example of the sort of authoring interface that is needed by topic maps.

Hmmm, by the way, did you notice that “…creating a store location map for a bike shop” is creating an association between the “bike shop” and a “street location?” True, Mapbox doesn’t include roles or the association type but the role players are present.

For a topic map authoring interface, you could default the role of location for any geographic point on the map and the association type to be “street-location.”

The user would only have to pick, possibly from a pick list, the role of the role player, bike shop, bar, etc.
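A minimal sketch of what such defaults could look like as a data structure. The class and function names are hypothetical; this is neither a Mapbox API nor any existing topic map library, only the association, roles and role players from the paragraphs above written down in Python.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Role:
    role_type: str   # e.g. "location" or "bike-shop"
    player: str      # identifier of the subject playing the role

@dataclass
class Association:
    association_type: str
    roles: list = field(default_factory=list)

def street_location(point, player, player_role):
    """Build an association with the defaults suggested above.

    The association type and the "location" role are filled in
    automatically; the user supplies only the other player's role.
    """
    return Association(
        association_type="street-location",
        roles=[Role("location", point), Role(player_role, player)],
    )

# Hypothetical map click: coordinates plus a role chosen from a pick list.
assoc = street_location("40.7359,-73.9911", "Example Bike Shop", "bike-shop")
print(assoc)
```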

Mapbox could have started their guide with a review of map projections, used and theoretical.

Or covered the basics of surveying and a brief overview of surveying instruments. They didn’t.

I think there is a lesson there.

Microsoft’s Quantum Mechanics

October 11th, 2014

Microsoft’s Quantum Mechanics by Tom Simonite.

From the post:

In 2012, physicists in the Netherlands announced a discovery in particle physics that started chatter about a Nobel Prize. Inside a tiny rod of semiconductor crystal chilled cooler than outer space, they had caught the first glimpse of a strange particle called the Majorana fermion, finally confirming a prediction made in 1937. It was an advance seemingly unrelated to the challenges of selling office productivity software or competing with Amazon in cloud computing, but Craig Mundie, then heading Microsoft’s technology and research strategy, was delighted. The abstruse discovery—partly underwritten by Microsoft—was crucial to a project at the company aimed at making it possible to build immensely powerful computers that crunch data using quantum physics. “It was a pivotal moment,” says Mundie. “This research was guiding us toward a way of realizing one of these systems.”

Microsoft is now almost a decade into that project and has just begun to talk publicly about it. If it succeeds, the world could change dramatically. Since the physicist Richard Feynman first suggested the idea of a quantum computer in 1982, theorists have proved that such a machine could solve problems that would take the fastest conventional computers hundreds of millions of years or longer. Quantum computers might, for example, give researchers better tools to design novel medicines or super-efficient solar cells. They could revolutionize artificial intelligence.

Fairly upbeat review of current efforts to build a quantum computer.

You may want to offset it by reading Scott Aaronson’s blog, Shtetl-Optimized, which has the following header note:

If you take just one piece of information from this blog:
Quantum computers would not solve hard search problems
instantaneously by simply trying all the possible solutions at once. (emphasis added)

See in particular: Speaking Truth to Parallelism at Cornell

Whatever speedups are possible with quantum computers, getting a semantically incorrect answer faster isn’t an advantage.

Assumptions about faster computing platforms include an assumption of correct semantics. There have been no proofs of default correct handling of semantics by present day or future computing platforms.

I first saw this in a tweet by Peter Lee.

PS: I saw the reference to Scott Aaronson’s blog in a comment to Tom’s post.

The 100 Worst Landlords in New York City [Here Be Bastards]

October 11th, 2014

The 100 Worst Landlords in New York City

A great illustration of the power of mapping to bring information together! (Like a topic map does.)

I don’t live in New York so the classes of violations (mis-named “details”) weren’t helpful to me. Nor were the actual “details” of particular violations available. (If I am wrong on that, please post a response saying how to obtain the details via the map interface.)

Suggested Improvement: Names of owners as hyperlinks to their residences on a map with the notation “Here Be Bastards” (a riff off of the “Here Be Dragons” from early sea maps).

Spark Breaks Previous Large-Scale Sort Record

October 11th, 2014

Spark Breaks Previous Large-Scale Sort Record by Reynold Xin.

From the post:

Apache Spark has seen phenomenal adoption, being widely slated as the successor to Hadoop MapReduce, and being deployed in clusters from a handful to thousands of nodes. While it was clear to everybody that Spark is more efficient than MapReduce for data that fits in memory, we heard that some organizations were having trouble pushing it to large scale datasets that could not fit in memory. Therefore, since the inception of Databricks, we have devoted much effort, together with the Spark community, to improve the stability, scalability, and performance of Spark. Spark works well for gigabytes or terabytes of data, and it should also work well for petabytes.

To evaluate these improvements, we decided to participate in the Sort Benchmark. With help from Amazon Web Services, we participated in the Daytona Gray category, an industry benchmark on how fast a system can sort 100 TB of data (1 trillion records). The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2100 nodes. Using Spark on 206 EC2 nodes, we completed the benchmark in 23 minutes. This means that Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.

Additionally, while no official petabyte (PB) sort competition exists, we pushed Spark further to also sort 1 PB of data (10 trillion records) on 190 machines in under 4 hours. This PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines). To the best of our knowledge, this is the first petabyte-scale sort ever done in a public cloud.

Bottom line: Sorted 100 TB of data in 23 minutes, beat old record of 72 minutes, on fewer machines.
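The “3X faster using 10X fewer machines” claim can be checked from the figures quoted above. The node-minutes comparison at the end is my own rough yardstick, not a metric from the Sort Benchmark, and it ignores differences in hardware between the two clusters.

```python
# Figures as quoted from the post.
hadoop = {"minutes": 72, "nodes": 2100}
spark = {"minutes": 23, "nodes": 206}

speedup = hadoop["minutes"] / spark["minutes"]       # about 3.1
node_ratio = hadoop["nodes"] / spark["nodes"]        # about 10.2

# Rough cost measure: machines multiplied by wall-clock minutes.
hadoop_node_minutes = hadoop["minutes"] * hadoop["nodes"]   # 151200
spark_node_minutes = spark["minutes"] * spark["nodes"]      # 4738
resource_ratio = hadoop_node_minutes / spark_node_minutes   # about 31.9

print(f"{speedup:.1f}x faster on {node_ratio:.1f}x fewer nodes")
print(f"{resource_ratio:.1f}x fewer node-minutes")
```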

Read Reynold’s post and then get thee to Apache Spark!

I first saw this in a tweet by Paco Nathan.

Visualizing MNIST: An Exploration of Dimensionality Reduction

October 10th, 2014

Visualizing MNIST: An Exploration of Dimensionality Reduction by Christopher Olah.

From the post:

At some fundamental level, no one understands machine learning.

It isn’t a matter of things being too complicated. Almost everything we do is fundamentally very simple. Unfortunately, an innate human handicap interferes with us understanding these simple things.

Humans evolved to reason fluidly about two and three dimensions. With some effort, we may think in four dimensions. Machine learning often demands we work with thousands of dimensions – or tens of thousands, or millions! Even very simple things become hard to understand when you do them in very high numbers of dimensions.

Reasoning directly about these high dimensional spaces is just short of hopeless.

As is often the case when humans can’t directly do something, we’ve built tools to help us. There is an entire, well-developed field, called dimensionality reduction, which explores techniques for translating high-dimensional data into lower dimensional data. Much work has also been done on the closely related subject of visualizing high dimensional data.

These techniques are the basic building blocks we will need if we wish to visualize machine learning, and deep learning specifically. My hope is that, through visualization and observing more directly what is actually happening, we can understand neural networks in a much deeper and more direct way.

And so, the first thing on our agenda is to familiarize ourselves with dimensionality reduction. To do that, we’re going to need a dataset to test these techniques on.

Extremely useful illustration of dimensionality reduction, exploring “recognition” of handwritten digits.

I agree that “Reasoning directly about these high dimensional spaces is just short of hopeless.” However, unlike our machines, humans don’t need high dimensional spaces in order to recognize handwritten digits. ;-)
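To make “translating high-dimensional data into lower dimensional data” concrete, here is a toy sketch of principal component analysis, a standard linear technique chosen here for its simplicity rather than taken from Christopher's code. The data is synthetic: hypothetical three-dimensional points scattered near a line, reduced to one coordinate each. Power iteration stands in for a proper eigen-solver so the example needs only the standard library.

```python
import math
import random

def first_component(rows, iterations=100, seed=0):
    """Estimate the first principal component by power iteration.

    Returns (column means, unit vector along the direction of greatest variance).
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    rng = random.Random(seed)
    vector = [rng.random() for _ in range(dims)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return means, vector

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(component)))
            for row in rows]

# Synthetic data: points near the line through (1, 2, -1), with small jitter.
noise = random.Random(1)
points = [(t + noise.uniform(-0.05, 0.05),
           2 * t + noise.uniform(-0.05, 0.05),
           -t + noise.uniform(-0.05, 0.05)) for t in range(-3, 4)]

means, component = first_component(points)
coordinates = project(points, means, component)
print(component)
print(coordinates)
```

Three dimensions to one is a long way from MNIST's 784 pixels per digit, but the shape of the operation is the same: find directions that preserve most of the variation and discard the rest.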

I first saw this in a tweet by Christophe Lalanne.

Supporting Open Annotation

October 10th, 2014

Supporting Open Annotation by Gerben.

From the post:

In its mission to connect the world’s knowledge and thoughts, the solution pursues is a web-wide mechanism to create, share and discover annotations. One of our principal steps towards this end is providing a browser add-on that works with our annotation server, enabling people to read others’ annotations on any web page they visit, and to publish their own annotations for others to see.

I spent my summer interning at to work towards a longer term goal, taking annotation sharing to the next level: an open, decentralised approach. In this post I will describe how such an approach could work, how standardisation efforts are taking off to get us there, and how we are involved in making this happen – the first step being support for the preliminary Open Annotation data model.

An annotation ecosystem

While we are glad to provide a service enabling people to create and share annotations on the web, we by no means want to become the sole annotation service provider, as this would imply creating a monopoly position and a single point of failure. Rather, we encourage anyone to build annotation tools and services, possibly using the same code we use. Of course, a problematic consequence of having multiple organisations each running separate systems is that even more information silos emerge on the web, and quite likely the most popular service would obtain a monopoly position after all.

To prevent either fragmentation or monopolisation of the world’s knowledge, we would like an ecosystem to evolve, comprising interoperable annotation services and client implementations. Such an ecosystem would promote freedom of innovation, prevent dependence on a single party, and provide scalability and robustness. It would be like the architecture of the web itself.

Not a bad read if you accept the notion that interoperable annotation servers are an acceptable architecture for annotation of web resources.

Me? I would just as soon put:


on my annotation and mail the URL to the CIA, FBI, NSA and any foreign intelligence agencies that I can think of with a copy of my annotation.

You can believe that government agencies will follow the directives of Congress with regard to spying on United States citizens, but then that was already against the law. Remember the old saying, “Fool me once, shame on you. Fool me twice, shame on me.”? That is applicable to government surveillance.

We need robust annotation mechanisms but not ones that make sitting targets out of our annotations. Local, encrypted annotation mechanisms that can cooperate with other local, encrypted annotation mechanisms would be much more attractive to me.

I first saw this in a tweet by Ivan Herman.

Flexible Neo4j Batch Import with Groovy

October 10th, 2014

Flexible Neo4j Batch Import with Groovy by Michael Hunger.

From the post:

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of million lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

  • not create legacy indexes
  • not index properties at all that you just need for connecting data
  • create schema indexes
  • skip certain columns
  • rename properties from the column names
  • create your own labels based on the data in the row
  • convert column values into Neo4j types (e.g. split strings or parse JSON)

Michael helps you avoid the defaults of batch importing into Neo4j.
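Michael's post does this in Groovy against the Neo4j batch inserter. As a language-neutral illustration of the per-row requirements in the list (skip columns, rename properties, derive labels from the row, convert values), here is a Python sketch that only transforms CSV rows and never touches Neo4j. The column names, the conversion table and the sample row are all hypothetical.

```python
import csv
import io
import json

SKIP = {"internal_id"}                        # columns to drop entirely
RENAME = {"full_name": "name"}                # column name -> property name
CONVERT = {
    "age": int,
    "tags": lambda text: text.split(";"),     # split a delimited string
    "meta": json.loads,                       # parse embedded JSON
}

def row_to_node(row):
    """Turn one CSV row (a dict) into (labels, properties)."""
    # Label derived from data in the row rather than fixed up front.
    labels = ["Person", row["kind"].capitalize()]
    properties = {}
    for column, value in row.items():
        if column in SKIP or column == "kind" or value == "":
            continue
        convert = CONVERT.get(column, str)
        properties[RENAME.get(column, column)] = convert(value)
    return labels, properties

# Hypothetical input file, inlined so the example is self-contained.
sample = io.StringIO(
    'internal_id,full_name,kind,age,tags,meta\n'
    '17,Ada Lovelace,author,36,math;poetry,"{""born"": 1815}"\n'
)

nodes = [row_to_node(row) for row in csv.DictReader(sample)]
print(nodes)
```

Keeping the skip, rename and convert rules in plain tables means the import loop itself stays the same when the input columns change.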

Lance’s Lesson – Gödel Incompleteness

October 10th, 2014

Lance’s Lesson – Gödel Incompleteness by Lance Fortnow.

The “entertainment” category on YouTube is very flexible since it included this lesson on Gödel Incompleteness. ;-)

Lance uses Turing machines to “prove” the first and second incompleteness theorems in under a page of notation.

Recognizing patterns in genomic data

October 10th, 2014

Recognizing patterns in genomic data – New visualization software uncovers cancer subtypes from a vast repository of biomedical information by Stephanie Dutchen.

From the post:

Much of biomedical research these days is about big data—collecting and analyzing vast, detailed repositories of information about health and disease. These data sets can be treasure troves for investigators, often uncovering genetic mutations that drive a particular kind of cancer, for example.

Trouble is, it’s impossible for humans to browse that much data, let alone make any sense of it.

“It’s [StratomeX] a tool to help you make sense of the data you’re collecting and find the right questions to ask,” said Nils Gehlenborg, research associate in biomedical informatics at HMS and co-senior author of the correspondence in Nature Methods. “It gives you an unbiased view of patterns in the data. Then you can explore whether those patterns are meaningful.”

The software, called StratomeX, was developed to help researchers distinguish subtypes of cancer by crunching through the incredible amount of data gathered as part of The Cancer Genome Atlas, a National Institutes of Health–funded initiative. Identifying distinct cancer subtypes can lead to more effective, personalized treatments.

When users input a query, StratomeX compares tumor data at the molecular level that was collected from hundreds of patients and detects patterns that might indicate significant similarities or differences between groups of patients. The software presents those connections in an easy-to-grasp visual format.

“It helps you make meaningful distinctions,” said co-first author Alexander Lex, a postdoctoral researcher in the Pfister group.

Other than the obvious merits of this project, note the role of software as the assistant to the user. It crunches the numbers in a specific domain and presents those results in a meaningful fashion.

It is up to the user to decide which patterns are useful and which are not. Shades of “recommending” other instances of the “same” subject?

StratomeX is available for download.

I first saw this in a tweet by Harvard SEAS.

No Query Language Needed: Using Python with an Ordered Key-Value Store

October 10th, 2014

No Query Language Needed: Using Python with an Ordered Key-Value Store by Stephen Pimentel.

From the post:

FoundationDB is a complex and powerful database, designed to handle sharding, replication, network hiccups, and server failures gracefully and automatically. However, when we designed our Python API, we wanted most of that complexity to be hidden from the developer. By utilizing familiar features, such as generators, itertools, and comprehensions, we tried to make FoundationDB’s API as easy to use as a Python dictionary.

In the video below, I show how FoundationDB lets you query data directly using Python language features, rather than a separate query language.

Most applications have back-end data stores that developers need to query. This talk presents an approach to storing and querying data that directly employs Python language features. Using the Key-Value Store, we can make our data persistent with an interface similar to a Python dictionary. Python then gives us a number of tools “out of the box” that we can use to form queries:

  • generators for memory-efficient data retrieval;
  • itertools to filter and group data;
  • comprehensions to assemble the query results.

Taken together, these features give us a query capability using straight Python. The talk walks through a number of example queries using the Enron email dataset. For code and the details.

More motivation to take a look at FoundationDB!
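The three tools in the list can be shown without FoundationDB itself. The sketch below uses a small in-memory stand-in for an ordered key-value store; `OrderedStore` and its `scan` method are hypothetical and are not FoundationDB's API. The email keys are made up, not drawn from the Enron dataset.

```python
from bisect import bisect_left
from itertools import groupby

class OrderedStore:
    """In-memory stand-in for an ordered key-value store with tuple keys."""

    def __init__(self):
        self._keys = []       # kept sorted
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def scan(self, prefix):
        """Generator over (key, value) pairs whose key starts with prefix."""
        index = bisect_left(self._keys, prefix)
        while index < len(self._keys):
            key = self._keys[index]
            if key[:len(prefix)] != prefix:
                break
            yield key, self._values[key]
            index += 1

store = OrderedStore()
store[("email", "alice", 1)] = "Q3 forecast"
store[("email", "bob", 2)] = "Lunch?"
store[("email", "alice", 3)] = "Re: Q3 forecast"
store[("user", "alice")] = "Alice Example"

# Comprehension over a generator: subjects of everything alice sent.
alice_subjects = [subject for _, subject in store.scan(("email", "alice"))]

# groupby works because the store returns keys in sorted order.
counts = {
    sender: sum(1 for _ in group)
    for sender, group in groupby(store.scan(("email",)), key=lambda kv: kv[0][1])
}

print(alice_subjects)
print(counts)
```

The “query” is ordinary Python throughout. What the ordered store contributes is that related keys sit next to each other, so a prefix scan plays the part a `WHERE` clause would play in a query language.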

I do wonder about the “no query language needed.” Users, despite their poor results, appear to be committed to querying and query languages.

Whether it is the illusion of “empowerment” of users, the current inability to measure the cost of ineffectual searching, or acceptance of poor search results, search and search operators continue to be the preferred means of interaction. Plan accordingly.

I first saw this in a tweet by Hari Kishan.

Super-Detailed Interactive 3-D Seafloor Map

October 10th, 2014

Super-Detailed Interactive 3-D Seafloor Map by Nick Stockton.

From the post:

This super-detailed map of the ocean floor’s topography is based on satellite measurements of subtle lumps on the ocean’s surface. These lumps of water, which are subtle, low, and wide on the ocean’s surface, are caused by the gravitational pull of underwater features like mountains and ridges. The team of scientists wrapped their data around a Google Earth globe, so you and I could explore it ourselves, in the visualization above.

The map has more than twice the resolution of previous seafloor maps, and shows a plethora of never-before-seen features. These include thousands of volcanoes and what could be the ridge where two plates pulled apart to create the Gulf of Mexico. The map is part of new research published last week in Science.

The visualization at the top of the page (click here for a full screen view) lets you play with the vertical exaggeration of both continental and subsea topography using the upper left drop-down menu. (They might seem huge to us at ground level, but the planet’s mountains and valleys are almost imperceptible from the vantage of space.) Another visualization of the study’s map lets you drag a time bar to simulate the movement of tectonic plates.

Great seafloor map and visualization techniques!

Read Nick’s post to get some background on “gravitational mapping.” In short, gravitational mapping relies on the impact of features of the seafloor on ocean height to create detailed seafloor maps.

Sounds like very interesting data sets with many discoveries left to be made.

Last Call: XQuery 3.1 and XQueryX 3.1; and additional supporting documents

October 10th, 2014

Last Call: XQuery 3.1 and XQueryX 3.1; and additional supporting documents

From the post:

Today the XQuery Working Group published a Last Call Working Draft of XQuery 3.1 and XQueryX 3.1. Additional supporting documents were published jointly with the XSLT Working Group: a Last Call Working Draft of XPath 3.1, together with XPath Functions and Operators, XQuery and XPath Data Model, and XSLT and XQuery Serialization. XQuery 3.1 and XPath 3.1 introduce improved support for working with JSON data with map and array data structures as well as loading and serializing JSON; additional support for HTML class attributes, HTTP dates, scientific notation, cross-calling between XSLT and XQuery and more. Comments are welcome through 7 November 2014. Learn more about the XML Activity.

How closely do you read?

To answer that question, read all the mentioned documents by 7 November 2014, keeping a list of errors you spot.

Submit your list to the XQuery Working Group by 7 November 2014 and score your reading based on the number of “errors” accepted by the working group.

What is your W3C Proofing Number? (Total number of accepted “errors” divided by the number of W3C drafts where “errors” were submitted.)

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining

October 10th, 2014

Annotated Chemical Patent Corpus: A Gold Standard for Text Mining by Saber A. Akhondi, et al. (Published: September 30, 2014 DOI: 10.1371/journal.pone.0107477)


Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line break due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at

Highly recommended both as a “gold standard” for chemical patent text mining and as the state of the art in developing such a standard.

To say nothing of annotation as a means of automatic creation of topic maps where entities are imbued with subject identity properties.

I first saw this in a tweet by ChemConnector.

These aren’t the reducing functions you are looking for

October 9th, 2014

These aren’t the reducing functions you are looking for by Christophe Grand.

From the post:

Transducers are powerful and easy to grasp when they claim they transform reducing functions. However once you scratch their surface you quickly realize that’s not their true nature: they transform stateful processes.

In a previous post, I explained why seeded transduce forces transducers to return stateful reducing functions. However this can be fixed. The current implementation of transduce reads:

Heavy sledding but the leading edge always is. Relevance depends on your need for/dedication to advancing a paradigm.
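
If the phrase “transform reducing functions” is new to you, here is a rough sketch in Python rather than Clojure. None of these names (mapping, filtering, taking, compose) come from Christophe’s post or from Clojure’s implementation; they are my own stand-ins, and taking simply ignores extra input instead of terminating early the way Clojure’s take does. The point is only to show where state sneaks in:

```python
from functools import reduce

def mapping(f):
    """Transducer: wrap a reducing function so f is applied to each item first."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, f(x))
        return step
    return transducer

def filtering(pred):
    """Transducer: pass an item on to the wrapped reducing function only if pred holds."""
    def transducer(rf):
        def step(acc, x):
            return rf(acc, x) if pred(x) else acc
        return step
    return transducer

def taking(n):
    """Stateful transducer: the returned step function carries a mutable counter."""
    def transducer(rf):
        remaining = n
        def step(acc, x):
            nonlocal remaining
            if remaining <= 0:
                return acc          # simplification: ignore items, no early exit
            remaining -= 1
            return rf(acc, x)
        return step
    return transducer

def compose(*xforms):
    """Compose transducers so that data flows through them left to right."""
    def composed(rf):
        for xf in reversed(xforms):
            rf = xf(rf)
        return rf
    return composed

def append(acc, x):
    """A plain reducing function: build up a list."""
    return acc + [x]

xform = compose(filtering(lambda x: x % 2 == 0),
                mapping(lambda x: x * x),
                taking(2))

step = xform(append)                     # a reducing function with hidden state
first = reduce(step, range(10), [])      # [0, 4]
second = reduce(step, range(10), [])     # [] because the counter is already used up
fresh = reduce(xform(append), range(10), [])  # [0, 4] again, new state
```

Reusing step gives a different answer the second time, which is one way to see why “stateful process” may be the more honest description than “reducing function.”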

I first saw this in a tweet by Johnathan Winandy.

2014 State of Clojure & ClojureScript Survey

October 9th, 2014

2014 State of Clojure & ClojureScript Survey by Alex Miller.

From the post:

For the past four years, Chas Emerick has run an annual State of Clojure survey (expanded last year to cover ClojureScript as well). Due to recent happy arrivals in the Emerick household, Chas has asked Cognitect to run the survey this year.

The survey has been broken into two parts. A link to the second survey will appear after you submit the first, or you can use these links directly:

The surveys will be open until October 17th. Shortly thereafter we will release all of the data and some analysis.

If you’re not yet planning to attend the Clojure/conj in Washington DC, Nov 20-22, tickets are on sale now (regular registration rate ends Oct. 17th)!


Two Important Dates:

Oct. 17th – Deadline for State of Clojure/ClojureScript 2014 surveys and regular registration for Clojure/conj.

Nov. 20-22 – Clojure/conj in Washington, DC.

Avoid entering Oct. 17th in your calendar by completing the surveys and purchasing your ticket to Clojure/conj after reading this post.

Simple Testing Can Prevent Most Critical Failures:…

October 9th, 2014

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, et al.


Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop Map Reduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures.

We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code–the last line of defense–even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.

If you aren’t already convinced you need to read this paper, consider one more quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. (emphasis added)

How will catastrophic system failure reflect on your product or service? Hint: It doesn’t reflect well on topic maps or any other service or technology.

I say “read” this paper; perhaps putting it on a 90-day reading rotation would be better.
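
To give a flavor of what “simple testing on error handling code” might look like, here is a toy checker. It is not Aspirator, and the abstract quoted above does not list the three rules, so I am not claiming this is one of them. The systems in the study are not Python programs either. This sketch assumes just one obviously suspicious pattern, an exception handler whose whole body is pass, and uses Python’s standard ast module to find it. The function name and the sample code are made up for illustration:

```python
import ast

def find_swallowed_errors(source):
    """Return the line numbers of except handlers whose body is only `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            # Assumption: a handler consisting solely of `pass` ignores the error.
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                hits.append(node.lineno)
    return sorted(hits)

# Hypothetical code under inspection (parsed only, never executed).
SAMPLE = '''
def flush(region):
    try:
        region.write()
    except IOError:
        pass

def close(region):
    try:
        region.close()
    except IOError as err:
        raise RuntimeError("close failed") from err
'''

print(find_swallowed_errors(SAMPLE))  # prints [5]
```

A dozen lines will not save your cluster, but they do suggest how little machinery it takes to start looking at the last line of defense.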

Intriguing properties of neural networks [Gaming Neural Networks]

October 9th, 2014

Intriguing properties of neural networks by Christian Szegedy, et al.


Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties.

First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.

Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. Specifically, we find that we can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network’s prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

Both findings are of interest, but the discovery of “adversarial examples” that can cause a trained network to misclassify images is the more intriguing of the two.

How do you validate a result from a neural network? Possessing the same network and data isn’t going to help if it contains “adversarial examples.” I suppose you could “spot” a misclassification but one assumes a neural network is being used because physical inspection by a person isn’t feasible.

What “adversarial examples” work best against particular neural networks? How to best generate such examples?
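
As a very rough intuition pump for the “how to generate” question, consider the simplest possible case: a linear classifier. This is my own toy, not the procedure from the paper, which works with deep networks and an optimization step that this sketch does not attempt to reproduce. The weights and the input are random numbers, and every name here is hypothetical. The idea is that nudging each feature by the same small amount, in the direction that hurts the score most, is enough to cross the decision boundary:

```python
import random

def score(w, b, x):
    """Linear decision function; classify as positive when the score is > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def adversarial(w, b, x, margin=1e-6):
    """Return (x_adv, eps): x shifted by at most eps per feature so its class flips."""
    s = score(w, b, x)
    l1 = sum(abs(wi) for wi in w)
    # Smallest uniform per-feature step that pushes the score across zero.
    eps = abs(s) / l1 + margin
    direction = -1 if s > 0 else 1
    x_adv = [xi + direction * eps * sign(wi) for xi, wi in zip(x, w)]
    return x_adv, eps

random.seed(0)
dim = 1000                                    # think "number of pixels"
w = [random.uniform(-1, 1) for _ in range(dim)]
b = 0.0
x = [random.uniform(0, 1) for _ in range(dim)]

x_adv, eps = adversarial(w, b, x)
print("per-feature change:", eps)
print("original class:", score(w, b, x) > 0)
print("perturbed class:", score(w, b, x_adv) > 0)
```

The required step shrinks as the sum of the absolute weights grows, so with many input features a change that is tiny in each one can still add up to a flipped label.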

How do users of off-the-shelf neural networks guard against “adversarial examples?” (One of those cases where “shrink-wrap” data services may not be a good choice.)

I first saw this in a tweet by Xavier Amatriain.

Sir Tim Berners-Lee speaks out on data ownership

October 9th, 2014

Sir Tim Berners-Lee speaks out on data ownership by Alex Hern.

From the post:

The data we create about ourselves should be owned by each of us, not by the large companies that harvest it, Tim Berners-Lee, the inventor of the world wide web, said today.

Berners-Lee told the IPExpo Europe in London’s Excel Centre that the potential of big data will be wasted as its current owners use it to serve ever more “queasy” targeted advertising.

Berners-Lee, who wrote the first memo detailing the idea of the world wide web 25 years ago this year, while working for physics lab Cern in Switzerland, told the conference that the value of “merging” data was under-appreciated in many areas.

Speaking to public data providers, he said: “I’m not interested in your data; I’m interested in merging your data with other data. Your data will never be as exciting as what I can merge it with.”

No disagreement with: …the value of “merging” data was under-appreciated in many areas. ;-)

There is considerable disagreement on how best to accomplish that merging, but it will become an empirical question once people wake up to the value of “merging” data.
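
For the simplest case, where two sources already agree on an identifier for each subject, merging can be sketched in a few lines of Python. That agreement is exactly the hard part in practice, so treat the shared "id" field as a large assumption. The record layout, the example URNs, and the values are all made up for illustration:

```python
from collections import defaultdict

def merge_by_identifier(*sources):
    """Merge records that share an "id", keeping every distinct value per property."""
    merged = defaultdict(lambda: defaultdict(set))
    for source in sources:
        for record in source:
            subject = merged[record["id"]]
            for key, value in record.items():
                if key != "id":
                    subject[key].add(value)   # conflicting values are kept, not resolved
    return {sid: {key: sorted(values) for key, values in props.items()}
            for sid, props in merged.items()}

# Two hypothetical sources describing overlapping subjects.
transit = [
    {"id": "urn:example:station:1", "name": "Central"},
    {"id": "urn:example:station:2", "name": "Harbor"},
]
weather = [
    {"id": "urn:example:station:1", "name": "Central Stn", "rainfall_mm": "12"},
]

result = merge_by_identifier(transit, weather)
print(result["urn:example:station:1"])
# prints {'name': ['Central', 'Central Stn'], 'rainfall_mm': ['12']}
```

Note that the merged record keeps both names rather than picking a winner. Whether that is the right policy is one of the things people disagree about.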

Berners-Lee may be right about who “should” own data about ourselves, but that isn’t in fact who owns it now. Changing property laws means taking rights away from those with them under the current regime and creating new rights for others in a new system. Property laws have changed before but it requires more than slogans and wishful thinking to make it so.

Programming for Biologists

October 9th, 2014

Programming for Biologists by Ethan White.

From the post:

This is the website for Ethan White’s programming and database management courses designed for biologists. At the moment there are four courses being taught during Fall 2014.

The goal of these courses is to teach biologists how to use computers more effectively to make their research easier. We avoid a lot of the theory that is taught in introductory computer science classes in favor of covering more of the practical side of programming that is necessary for conducting research. In other words, the purpose of these courses is to teach you how to drive the car, not prepare you to be a mechanic.

Hmmm, less theory of engine design and more driving lessons? ;-)

Despite my qualms about turn-key machine learning solutions, more people want to learn to drive a car than want to design an engine.

Should we teach topic maps the “right way” or should we teach them to drive?

I first saw this in a tweet by Christophe Lalanne.

Twitter sues US federal agencies in attempt to remove the gag around surveillance

October 9th, 2014

Twitter sues US federal agencies in attempt to remove the gag around surveillance by Lisa Vaas.

From the post:

Twitter doesn’t want its transparency report to be fuzzy to the point of meaninglessness, full of “broad, inexact ranges” about how many times the US government has shaken the company down in its surveillance operations, it says – for example, by counting them to the nearest thousand.

So on Tuesday, Twitter sued the Feds over the surveillance laws they’re using to gag it.

Twitter’s lawyer, Ben Lee, said in a post that First Amendment rights should allow the company to be crystal clear about the actual scope of surveillance of Twitter users by the US, as opposed to the current state of affairs, where companies such as Twitter are bound by laws that punish them for disclosing requests for information.

Lisa has links to the court documents and mentions that Twitter isn’t standing alone against government surveillance:

Both Apple and Google announced in September new mobile phone encryption policies meant to thwart government attempts to get at user data – a move that’s sparked hand-wringing on the part of multiple government officials.

Other US tech companies, including Microsoft, Facebook, Dropbox, and, again, Google, have been fighting government demands for user data in other ways, including attempting to convince the Senate to reform government surveillance.

The “hand-wringing” Lisa mentions is a measure of the technical illiteracy of government policy makers. New mobile phone policies will make secure voice marginally easier for the average user, but even the semi-literate have had access to secure voice for years, see: PRISM-proof your phone with these encrypted apps and services.

Support technology company opposition to government surveillance at every opportunity.

Unicode Version 7.0…

October 8th, 2014

Unicode Version 7.0 – Complete Text of the Core Specification Published

From the post:

The Unicode® Consortium announces the publication of the core specification for Unicode 7.0. The Version 7.0 core specification contains significant changes:

  • Major reorganization of the chapters and overall layout
  • New page size tailored for easy viewing on e-readers and other mobile devices
  • Addition of twenty-two new scripts and a shorthand writing system
  • Alignment with updates to the Unicode Bidirectional Algorithm

In Version 7.0, the standard grew by 2,834 characters. This version continues the Unicode Consortium’s long-term commitment to support the full diversity of languages around the world with its newly encoded scripts and other additional characters. The text of the latest version documents two newly adopted currency symbols: the manat, used in Azerbaijan, and the ruble, used in Russia and other countries. It also includes information about newly added pictographic symbols, geometric symbols, arrows and ornaments.

This version of the Standard brings technical improvements to support implementers, including further clarification of the case pair stability policy, and a new stability policy for Numeric_Type.

All other components of Unicode 7.0 were released on June 16, 2014: the Unicode Standard Annexes, code charts, and the Unicode Character Database, to allow vendors to update their implementations of Unicode 7.0 as early as possible. The release of the core specification completes the definitive documentation of the Unicode Standard, Version 7.0.

For more information on all of The Unicode Standard, Version 7.0, see

For non-backtick + Unicode character applications, this is good news!

Following the Unicode standard should be the first test for consideration of an application. The time for ad hoc character hacks passed a long time ago.
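
As a small illustration of leaning on the standard instead of ad hoc hacks, here is a sketch using Python’s standard unicodedata module. The code points U+20BC and U+20BD for the manat and ruble signs are my addition, not part of the announcement above, and whether your Python knows their names depends on the Unicode version it ships with. The same_text helper is a hypothetical name, and it is a simplification: normalize to NFC, then case fold, then compare.

```python
import unicodedata

# Which version of the Unicode Character Database does this Python carry?
print("Unicode data version:", unicodedata.unidata_version)

# The two currency signs mentioned in the announcement (code points assumed).
for ch in ("\u20bc", "\u20bd"):
    name = unicodedata.name(ch, "<not in this Python's Unicode data>")
    print(f"U+{ord(ch):04X}", name)

def same_text(a, b):
    """Compare two strings after NFC normalization and case folding."""
    def norm(s):
        return unicodedata.normalize("NFC", s).casefold()
    return norm(a) == norm(b)

print(same_text("\u00c9", "e\u0301"))    # precomposed É vs. e + combining acute: True
print(same_text("Straße", "STRASSE"))    # case folding maps ß to ss: True
```

Neither comparison works with a byte-for-byte check or a home-grown lowercase table, which is the whole argument for following the standard.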

A look at Cayley

October 8th, 2014

A look at Cayley by Tony.

From the post:

Recently I took the time to check out Cayley, a graph database written in Go that’s been getting some good attention.


A great introduction to Cayley. Tony has some comparisons to Neo4j, but for beginners with graph databases, those comparisons may not be really useful. Come back for those comparisons once you have moved beyond example graphs.