Archive for December, 2012

[Another] … getting started with data science

Monday, December 31st, 2012

Software engineer’s guide to getting started with data science

Longer than Hilary Mason’s Getting Started with Data Science, but different people have different learning styles.

This post may or may not resonate with you.

If another post does a better job for you, please pass it along.

Spirograph with R

Monday, December 31st, 2012

Spirograph with R

A great post on duplicating a “toy” from many years ago!

A sample of the results you may see:

Spirograph Output

I never owned one, but I think you could change the color of the pen while drawing.
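The underlying math is just a hypotrochoid: a pen at distance d from the center of a wheel of radius r rolling inside a ring of radius R. A minimal Python sketch of the idea (the linked post uses R the language; the parameter values here are arbitrary):

```python
import math

def hypotrochoid(R, r, d, steps=2000):
    """Points traced by a pen at distance d from the center of a
    wheel of radius r rolling inside a ring of radius R."""
    turns = r // math.gcd(R, r)  # revolutions before the curve closes
    pts = []
    for i in range(steps):
        t = 2 * math.pi * turns * i / steps
        x = (R - r) * math.cos(t) + d * math.cos((R - r) / r * t)
        y = (R - r) * math.sin(t) - d * math.sin((R - r) / r * t)
        pts.append((x, y))
    return pts

pts = hypotrochoid(100, 58, 70)
```

Feed pts to any plotting library to see the familiar looping pattern; changing R, r, and d alters the figure the way swapping gears did on the toy.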


Let’s Make a Map

Monday, December 31st, 2012

Let’s Make a Map by Mike Bostock.

From the post:

In this tutorial, I’ll cover how to make a modest map from scratch using D3 and TopoJSON. I’ll show you a few places where you can find free geographic data online, and how to convert it into a format that is both efficient and convenient for display. I won’t cover thematic mapping, but the map we’ll make includes labels for populated places and you can extend this technique to geographic visualizations such as graduated symbol maps and choropleths.

Excellent introduction!

MongoDB Puzzlers #1

Sunday, December 30th, 2012

MongoDB Puzzlers #1 by Kristina Chodorow.

If you are not too deeply invested in the fiscal cliff debate, ;-), you may enjoy the distraction of a puzzler based on the MongoDB query language.

Collecting puzzlers for MongoDB and other query languages would be a good idea.

Something to be enjoyed in times of national “crisis,” aka, collective hand wringing by the media.

When is “Hello World,” Not “Hello World?”

Sunday, December 30th, 2012

To answer that question, you need to see the post: Travel NoSQL Application – Polyglot NoSQL with SpringData on Neo4J and MongoDB.

Just a quick sample:

In this Fuse day, Tikal Java group decided to continue its previous Fuse research for NoSQL, but this time from a different point of view – SpringData and Polyglot persistence. We had two goals in this Fuse day: try working with more than one NoSQL in the same application, and also taking advantage of SpringData data access abstractions for NoSQL databases. We decided to take MongoDB as document DB and Neo4J as graph database, and put them behind an existing, classic and well known application – the Spring Travel Sample application.

More than the usual “Hello World” example for languages and a bit more than for most applications.

It would be a nice trend to see more robust, perhaps “Hello World+” examples.

What is your enhanced “Hello World+” going to look like in 2013?


TopoJSON

Sunday, December 30th, 2012

TopoJSON

From the webpage:

TopoJSON is an extension of GeoJSON that encodes topology. Rather than representing geometries discretely, geometries in TopoJSON files are stitched together from shared line segments called arcs. TopoJSON eliminates redundancy, offering much more compact representations of geometry than with GeoJSON; typical TopoJSON files are 80% smaller than their GeoJSON equivalents. In addition, TopoJSON facilitates applications that use topology, such as topology-preserving shape simplification, automatic map coloring, and cartograms.

I stumbled on this by viewing TopoJSON Points.

Displaying airports in the example but could be any geographic feature.

See the wiki for more details.
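The arc encoding is the heart of the format: coordinates are quantized to integers and stored as deltas from the previous point, then rescaled by a per-file transform. A rough Python sketch of how a client might decode one arc (the transform and arc values below are made up for illustration):

```python
def decode_arc(arc, transform):
    """Decode one delta-encoded, quantized TopoJSON arc into
    absolute [lon, lat] positions."""
    sx, sy = transform["scale"]
    tx, ty = transform["translate"]
    x = y = 0
    points = []
    for dx, dy in arc:
        x += dx  # positions are stored as deltas from the previous point...
        y += dy
        points.append([x * sx + tx, y * sy + ty])  # ...then de-quantized
    return points

# A hypothetical transform and arc, just to show the mechanics:
transform = {"scale": [0.001, 0.001], "translate": [-100.0, 40.0]}
arc = [[4000, 2000], [10, 5], [-5, 10]]
points = decode_arc(arc, transform)
```

Because neighboring shapes share arcs, each shared border is stored once, which is where much of the size reduction comes from.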

Getting Started with Data Science

Sunday, December 30th, 2012

Getting Started with Data Science by Hilary Mason.

From the post:

I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:

The best way to get started in data science is to DO data science!

First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.

Everyone keeps talking about the shortage of data scientists but not doing anything about it.

Well, everyone but Hilary.

Hilary has specific advice that if followed, would go a long way to producing more data scientists.

You may not need it. If so, pass it along to someone who does.

Natural Language Processing (NLP) Tools

Sunday, December 30th, 2012

Natural Language Processing (NLP) Tools

A very good collection of NLP tools, running from the general to taggers and pointers to other NLP resource pages.

Over The Fiscal Cliff – Blindfolded

Saturday, December 29th, 2012

The United States government is about to go over the “fiscal cliff.”

The really sad part is that the people of the United States are going with it, but they are blindfolded.

Intentionally blindfolded by their own government.

The OMB (“o” stands for opaque) report: OMB Report Pursuant to the Sequestration Transparency Act of 2012 (P. L. 112–155), Appendix B. Preliminary Sequestrable / Exempt Classification, classifies accounts as sequestrable, exempt, etc.

One reason to be “exempt” is funds were already sequestered elsewhere in the budget. Makes sense on the face of it.

But for 487 entries out of 2,126 in Appendix B, or 22.9%, the funds are being sequestered from some unstated part of the government.

Totally opaque.

Unlike the OMB, I am willing to share an electronic version of the files: satisfy yourself whether I am right or wrong.

You can make it the last time the US government puts a blindfold on the American people.

Contact the White House, your Senator or Representative.

Parallel Computing – Prof. Alan Edelman

Saturday, December 29th, 2012

Parallel Computing – Prof. Alan Edelman MIT Course Number 18.337J / 6.338J.

From the webpage:

This is an advanced interdisciplinary introduction to applied parallel computing on modern supercomputers. It has a hands-on emphasis on understanding the realities and myths of what is possible on the world’s fastest machines. We will make prominent use of the Julia Language software project.

A “modern supercomputer” may be in your near term future. Would not hurt to start preparing now.

Similar courses that you would recommend?

GRADES: Graph Data-management Experiences & Systems

Saturday, December 29th, 2012

GRADES: Graph Data-management Experiences & Systems

Workshop: Sunday June 23, 2013

Papers Due: March 31, 2013

Notification: April 22, 2013

Camera-ready: May 19, 2013

Workshop Scope:

Application Areas

A new data economy is emerging, based on the analysis of distributed, heterogeneous, and complexly structured data sets. GRADES focuses on the problem of managing such data, specifically when data takes the form of graphs that connect many millions of nodes, and the worth of the data and its analysis is not only in the attribute values of these nodes, but in the way these nodes are connected. Specific application areas that exhibit the growing need for management of such graph shaped data include:

  • life science analytics, e.g., tracking relationships between illnesses, genes, and molecular compounds.
  • social network marketing, e.g., identifying influential speakers and trends propagating through a community.
  • digital forensics, e.g., analyzing the relationships between persons and entities for law enforcement purposes.
  • telecommunication network analysis, e.g., directed at fixing network bottlenecks and costing of network traffic.
  • digital publishing, e.g., enriching entities occurring in digital content with external data sources, and finding relationships among the entities.


The GRADES workshop solicits contributions from two perspectives:

  • Experiences. This includes topics that describe use case scenarios, datasets, and analysis opportunities occurring in real-life graph-shaped data, as well as benchmark descriptions and benchmark results.
  • Systems. This includes topics that describe data management system architectures for processing of Graph and RDF data, and specific techniques and algorithms employed inside such systems.

The combination of the two (Experiences with Systems), and the benchmarking of RDF and graph database systems, is of special interest.

Topics Of Interest

The following is a non-exhaustive list describing the scope of GRADES:

  • vision papers describing potential applications and benefits of graph data management.
  • descriptions of graph data management use cases and query workloads.
  • experiences with applying data management technologies in such situations.
  • experiences or techniques for specific operations such as traversals or RDF reasoning.
  • proposals for benchmarks for data integration tasks (instance matching and ETL techniques).
  • proposals for benchmarks for RDF and graph database workloads.
  • evaluation of benchmark performance results on RDF or graph database systems.
  • system descriptions covering RDF or graph database technology.
  • data and index structures that can be used in RDF and graph database systems.
  • query processing and optimization algorithms for RDF and graph database systems.
  • methods and techniques for measuring graph characteristics.
  • methods and techniques for visualizing graphs and graph query results.
  • proposals and experiences with graph query languages.

The GRADES workshop is co-located with and sponsored by SIGMOD, in recognition that these problems are only interesting at large scale, and that the SIGMOD community’s contribution to handling such topics on data of many millions or even billions of nodes and edges is of critical importance.

That sounds promising doesn’t it? (Please email, copy, post, etc.)

Treat Your Users Like Children

Saturday, December 29th, 2012

Treat Your Users Like Children by Jamal Jackson.

From the post:

Do you have kids of your own? How about young nieces or nephews? Do you spend time around your friends’ children? Is there that one neighbor whose youngsters make it a point to disturb you any chance they get? If you’ve answered yes to any of these questions, then you understand that caring for kids is difficult! Many people would argue that my use of the word “difficult” is a strong understatement. They’d be right!

Young minds are almost impossible to predict and equally hard to control. A parent, or any other adult, can plan out an assortment of ideal procedures for a kid to follow to accomplish something, but it will likely feel like wasted time. This is because kids have no intention of following any form of procedures, no matter how beneficial to them.

Speaking of people with no intention of following any form of procedures, no matter how beneficial those procedures may be, I can’t help but wonder why dealing with children reminds me of the life of a UX professional.

How many hours have you spent toiling away in front of your monitor and notepad, hoping the end result will be to the user’s benefit? If they even bother to proceed as you predicted, that is. In the end, the majority of users end up navigating your site in a way that leaves head-scratching as the only suitable reaction. This is why web users should be treated like kids.

The post is worth reading if only for the images!

But having said that, it gives good advice on changing your perspective on design, to that of a user.

Designing for ourselves is a lot easier, at least for us.

Unfortunately, that isn’t the same as designing an interface users will find helpful or intuitive.

I “prefer” an interface that most users find intuitive.

An audience/market of < 10 can be pretty lonely, not to mention unprofitable.

The Top 5 Website UX Trends of 2012

Saturday, December 29th, 2012

The Top 5 Website UX Trends of 2012

From the post:

User interface techniques continued to evolve in 2012, often blurring the lines between design, usability, and technology in positive ways to create an overall experience that has been both useful and pleasurable.

Infinite scrolling, for example, is a technological achievement that also helps the user by enabling a more seamless experience. Similarly, advances in Web typography have an aesthetic dimension but also represent a movement toward greater clarity of communication.

Quick coverage of:

  1. Single-Page Sites
  2. Infinite Scrolling
  3. Persistent Top Navigation or “Sticky Nav”
  4. The Death of Web 2.0 Aesthetics
  5. Typography Returns

Examples of each trend are given, but you are left on your own for the details.

Good time to review your web presence for the coming year.

Missing-Data Imputation

Saturday, December 29th, 2012

New book by Stef van Buuren on missing-data imputation looks really good! by Andrew Gelman.

From the post:

Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with their particular implementation, which is one reason we’re developing our own package); he supplies lots of intuition, examples, and graphs.

Steve Newcomb makes the point that data is dirty. Always.

Stef van Buuren suggests that data may be missing and requires imputation.

Together that means dirty data may be missing and requires imputation.


Imputed or not, data is no more reliable than we are. Use with caution.
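To make the idea concrete, here is a deliberately naive single-pass sketch of regression-based imputation in Python — a caricature of what van Buuren's MICE approach does iteratively across columns, with multiple chains and added noise. The data is invented:

```python
import statistics

def impute_linear(xs, ys):
    """Fill missing entries (None) in ys using a simple linear
    regression of y on x fit to the complete cases only."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = statistics.fmean(x for x, _ in pairs)
    my = statistics.fmean(y for _, y in pairs)
    # Ordinary least-squares slope from the complete cases.
    slope = sum((x - mx) * (y - my) for x, y in pairs) / \
            sum((x - mx) ** 2 for x, _ in pairs)
    # Keep observed values; predict the missing ones.
    return [y if y is not None else my + slope * (x - mx)
            for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, None, 8.0, None]
imputed = impute_linear(xs, ys)
```

A real implementation draws imputations from a predictive distribution rather than plugging in point predictions, precisely so that the imputed data does not look more certain than it is.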

Analyzing the Enron Data…

Saturday, December 29th, 2012

Analyzing the Enron Data: Frequency Distribution, Page Rank and Document Clustering by Sujit Pal.

From the post:

I’ve been using the Enron Dataset for a couple of projects now, and I figured that it would be interesting to see if I could glean some information out of the data. One can of course simply read the Wikipedia article, but that would be too easy and not as much fun :-).

My focus on this analysis is on the “what” and the “who”, ie, what are the important ideas in this corpus and who are the principal players. For that I did the following:

  • Extracted the words from Lucene’s inverted index into (term, docID, freq) triples. Using this, I construct a frequency distribution of words in the corpus. Looking at the most frequent words gives us an idea of what is being discussed.
  • Extract the email (from, {to, cc, bcc}) pairs from MongoDB. Using this, I piggyback on Scalding’s PageRank implementation to produce a list of emails by page rank. This gives us an idea of the “important” players.
  • Using the triples extracted from Lucene, construct tuples of (docID, termvector), then cluster the documents using KMeans. This gives us an idea of the spread of ideas in the corpus. Originally, the idea was to use Mahout for the clustering, but I ended up using Weka instead.

I also wanted to get more familiar with Scalding beyond the basic stuff I did before, so I used that where I would have used Hadoop previously. The rest of the code is in Scala as usual.
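For the PageRank step, the core computation Sujit delegates to Scalding can be sketched in plain Python. The (from, to) pairs below are hypothetical stand-ins for the MongoDB extract:

```python
def pagerank(edges, damping=0.85, iters=50):
    """Iterative PageRank over directed (sender, recipient) pairs."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                share = damping * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
            else:  # dangling node: spread its rank evenly
                for n in nodes:
                    nxt[n] += damping * rank[src] / len(nodes)
        rank = nxt
    return rank

# Hypothetical email pairs; recipients of many mails rank higher.
edges = [("kay", "jeff"), ("andy", "jeff"), ("jeff", "ken"), ("andy", "ken")]
ranks = pagerank(edges)
```

On the Enron graph the same iteration, run over millions of pairs, surfaces the executives at the center of the correspondence.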

Good practice for discovery of the players and main ideas when the “fiscal cliff” document set “leaks,” as you know it will.

Relationships between players and their self-serving recountings versus the data set will make an interesting topic map.

Analyzing Categorical Data

Saturday, December 29th, 2012

Analyzing Categorical Data by Jeffrey S. Simonoff.

Mentioned in My Intro to Multiple Classification… but thought it merited a more prominent mention.

From the webpage:

Welcome to the web site for the book Analyzing Categorical Data, published by Springer-Verlag in July 2003 as part of the Springer Texts in Statistics series. This site allows access to the data sets used in the book, S-PLUS/R and SAS code to perform the analyses in the book, some general information on statistical software for analyzing categorical data, and an errata list. I would be very happy to receive comments on this site, and on the book itself.

Data sets, code to duplicate the analysis in the book and other information at this site.

My Intro to Multiple Classification…

Saturday, December 29th, 2012

My Intro to Multiple Classification with Random Forests, Conditional Inference Trees, and Linear Discriminant Analysis

From the post:

After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted to look for a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately it was tough for me to find such a dataset that was easy enough for me to understand.

The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it!! The following table describes the numerical breakdown of books and chapters from each author:

Introduction to authorship studies as they were known (may still be) in the academic circles of my youth.

I wonder if the same techniques are as viable today as on the Federalist Papers?

The Wheel of Time example demonstrates the technique remains viable for novel authors.

But what about authorship more broadly?

Can we reliably distinguish between news commentary from multiple sources?

Or between statements by elected officials?

How would your topic map represent purported authorship versus attributed authorship?

Or even a common authorship for multiple purported authors? (speech writers)
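For a feel of how stop-word counts can separate authors, here is a toy nearest-centroid sketch in Python — far simpler than the random forests and linear discriminant analysis used in the post, and the stop-word list and training texts are invented:

```python
from collections import Counter

STOPWORDS = ["as", "also", "even", "but", "and"]

def profile(text):
    """Relative stop-word frequencies for one chapter or scene."""
    counts = Counter(w for w in text.lower().split() if w in STOPWORDS)
    total = sum(counts.values()) or 1
    return [counts[s] / total for s in STOPWORDS]

def nearest_author(sample, training):
    """Attribute `sample` to the author whose mean stop-word
    profile is closest in squared Euclidean distance."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    centroids = {}
    for author, texts in training.items():
        profs = [profile(t) for t in texts]
        centroids[author] = [sum(col) / len(profs) for col in zip(*profs)]
    p = profile(sample)
    return min(centroids, key=lambda a: dist(p, centroids[a]))

# Invented training texts, one habitual stop word per author:
training = {"austen": ["and and and also the"],
            "london": ["but but even but the"]}
guess = nearest_author("and also and", training)
```

The attraction of stop words for authorship work is exactly that they are topic-independent: an author's rate of "but" travels with the author, not the subject matter.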

Installing Neo4j in an Azure Linux VM

Saturday, December 29th, 2012

Installing Neo4j in an Azure Linux VM by Howard Dierking.

From the post:

I’ve been playing with Neo4j a lot recently. I’ll be writing a lot more about that later, but at a very very high level, Neo4j is a graph database that in addition to some language-specific bindings has a slick HTTP interface. You can install it on Windows, Linux, and Mac OSX, so if you’re more comfortable on Windows, don’t read this post and think that you can’t play with this awesome database unless you forget everything you know, replace your wardrobe with black turtlenecks, and write all your code in vi (though that is an option). For me, though, I hate installers and want the power of a package manager such as homebrew (OSX) or apt-get (Linux). So I’m going to take you through the steps that I went through to get neo4j running on Linux. And just to have a little more fun with things, I’ll host neo4j on a Linux VM hosted in Azure.

Azure, Neo4j, a Linux VM and CLI tools, what more could you want?

Definitely a must read post for an easy Neo4j launch on an Azure Linux VM.

Howard promises more posts on Neo4j to follow.

IPython Notebook Viewer

Friday, December 28th, 2012

IPython Notebook Viewer

From the webpage:

A Simple way to share your IP[y]thon Notebook as Gists.

Share your own notebook, or browse others’

Scientific Python retweeted a post from Hilary Mason on the IPython Notebook Viewer so I had to go look.

For details on IPython and notebooks, see: IP[y]: IPython Interactive Computing:

IPython provides a rich toolkit to help you make the most out of using Python, with:

  • Powerful Python shells (terminal and Qt-based).
  • A web-based notebook with the same core features but support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

As Hilary says in her tweet: “…one of the coolest things I’ve seen in a long time. It makes analysis more collaborative!”

Useful for exchanging data analysis.

Possibly a good lesson in what a merging-data-by-example resource might look like.


LIBOL 0.1.0

Friday, December 28th, 2012

LIBOL 0.1.0

From the webpage:

LIBOL is an open-source library for large-scale online classification, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers. We also have made documents available for both beginners and advanced users. LIBOL is not only a machine learning tool, but also a comprehensive experimental platform for conducting online learning research.

In general, the existing online learning algorithms for linear classification tasks can be grouped into two major categories: (i) first order learning (Rosenblatt, 1958; Crammer et al., 2006), and (ii) second order learning (Dredze et al., 2008; Wang et al., 2012; Yang et al., 2009).

Example online learning algorithms in the first order learning category implemented in this library include:

• Perceptron: the classical online learning algorithm (Rosenblatt, 1958);

• ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001);

• ROMMA: the relaxed online maximum margin algorithms (Li and Long, 2002);

• OGD: the Online Gradient Descent (OGD) algorithms (Zinkevich, 2003);

• PA: Passive Aggressive (PA) algorithms (Crammer et al., 2006), one of state-of-the-art first order online learning algorithms;
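For reference, the classical Perceptron at the head of that list fits in a few lines. This is a toy stand-alone Python sketch, not LIBOL's implementation:

```python
def perceptron(stream, dim, rate=1.0):
    """Rosenblatt's Perceptron: predict with the current weights,
    and update only when a mistake is made."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:              # y is +1 or -1
        margin = sum(wi * xi for wi, xi in zip(w, x))
        if y * margin <= 0:          # wrong (or zero-margin) prediction
            mistakes += 1
            w = [wi + rate * y * xi for wi, xi in zip(w, x)]
    return w, mistakes

# A linearly separable toy stream: label is the sign of x1 - x2.
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1),
          ((2.0, 1.0), 1), ((1.0, 3.0), -1)] * 5
w, mistakes = perceptron(stream, dim=2)
```

The second-order methods that follow keep (an approximation of) a covariance over the weights as well, which lets them update confident and uncertain directions at different rates.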

Example algorithms in the second order online learning category implemented in this library include the following:

• SOP: the Second Order Perceptron (SOP) algorithm (Cesa-Bianchi et al., 2005);

• CW: the Confidence-Weighted (CW) learning algorithm (Dredze et al., 2008);

• IELLIP: online learning algorithms by improved ellipsoid method (Yang et al., 2009);

• AROW: the Adaptive Regularization of Weight Vectors (Crammer et al., 2009);

• NAROW: New variant of Adaptive Regularization (Orabona and Crammer, 2010);

• NHERD: the Normal Herding method via Gaussian Herding (Crammer and Lee, 2010)

• SCW: the recently proposed Soft Confidence-Weighted algorithms (Wang et al., 2012).

LIBOL is still being improved through feedback from practical users and new research results.

More information can be found in our project website:

Consider this an early New Year’s present!

Royal Society of Chemistry (RSC) – National Chemical Database Service

Friday, December 28th, 2012

From the homepage: (Goes live: 2nd January 2013)

National Chemical Database Service

The RSC will be operating the EPSRC National Chemical Database Service from 2013-2017

What is the RSC’s vision for the Service?

We intend to build the Service for the future – to develop a chemistry data repository for UK academia, and to build tools, models and services on this data store to increase the value and impact of researchers’ funded work. We will continue to develop this data store through the lifetime of the contract period and look forward to working with the community to make this a world-leading exemplar of the value of research data availability.

The Service will also offer access to a suite of commercial databases and services. While there will be some overlap with currently provided databases popular with the user community we will deliver new data and services and optimize the offering based on user feedback.

When will the Service be available?

The Service will start on 2nd January 2013, and will be available at

The database services we are working to have available at launch are the Cambridge Structural Database, ACD/ILab and Accelrys’ Available Chemicals Directory. The Service will also include integrated access to the RSC’s award winning ChemSpider database. As ‘live’ dates for other services become clear, they will appear here.

See also: Initial Demonstrations of the Interactive Laboratory Service as part of the Chemical Database Service


Initial Demonstration of the Integration to the Accelrys Available Chemicals Directory Web Service

I just looked at the demos but was particularly impressed with their handling of identifiers. Really impressed. There are lessons here for other information services.

BTW, I did have to hunt to discover that RSC = Royal Society of Chemistry. 😉

Computing Strongly Connected Components [Distributed Merging?]

Friday, December 28th, 2012

Computing Strongly Connected Components by Mark C. Chu-Carroll.

From the post:

As promised, today I’m going to talk about how to compute the strongly connected components of a directed graph. I’m going to go through one method, called Kosaraju’s algorithm, which is the easiest to understand. It’s possible to do better than Kosaraju’s by a factor of 2, using an algorithm called Tarjan’s algorithm, but Tarjan’s is really just a variation on the theme of Kosaraju’s.

Kosaraju’s algorithm is amazingly simple. It uses a very clever trick based on the fact that if you reverse all of the edges in a graph, the resulting graph has the same strongly connected components as the original. So using that, we can get the SCCs by doing a forward traversal to find an ordering of vertices, then doing a traversal of the reverse of the graph in the order generated by the first traversal.

That may sound a bit mysterious, but it’s really very simple. Take the graph G, and do a recursive depth-first traversal of the graph, along with an initially empty stack of vertices. As the recursion started at a vertex V finishes, push V onto the stack. At the end of the traversal, you’ll have all of the vertices of G in the stack. The order of the reverse traversal will be starting with the vertices on the top of that stack.

So you reverse all of the edges in the graph, creating the reverse graph, G’. Start with the vertex on top of the stack, and do a traversal from that vertex. All of the nodes reachable from that vertex form one strongly connected component. Remove everything in that SCC from the stack, and then repeat the process with the new top of the stack. When the stack is empty, you’ll have accumulated all of the SCCs.
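The two passes Mark describes translate almost directly into code. A compact Python sketch:

```python
def kosaraju_scc(graph):
    """Strongly connected components of a directed graph given as
    {vertex: [successors]}, via Kosaraju's two-pass algorithm."""
    # Pass 1: forward DFS, pushing each vertex as its recursion finishes.
    visited, stack = set(), []
    def dfs(v):
        visited.add(v)
        for w in graph.get(v, ()):
            if w not in visited:
                dfs(w)
        stack.append(v)
    for v in graph:
        if v not in visited:
            dfs(v)
    # Reverse every edge to build G'.
    rev = {v: [] for v in graph}
    for v, ws in graph.items():
        for w in ws:
            rev.setdefault(w, []).append(v)
    # Pass 2: traverse G' starting from the top of the stack; each
    # traversal collects exactly one SCC.
    sccs, assigned = [], set()
    while stack:
        v = stack.pop()
        if v in assigned:
            continue
        comp, todo = set(), [v]
        while todo:
            u = todo.pop()
            if u not in assigned:
                assigned.add(u)
                comp.add(u)
                todo.extend(rev.get(u, ()))
        sccs.append(comp)
    return sccs

# a→b→c→a is one cycle; d↔e is another; c→d links them one-way.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": ["d"]}
components = kosaraju_scc(g)
```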

Is graph decomposition suggestive of a way to manage distributed merging?

Instead of using “strongly connected” as the criterion for decomposition, one or more subject characteristics are used to decompose the topic map.

Natural language is one that comes immediately to mind, based on the content most likely to be viewed by a particular audience.

What subject characteristics would you use to decompose a topic map?

Making Graph Algorithms Fast, using Strongly Connected Components

Friday, December 28th, 2012

Making Graph Algorithms Fast, using Strongly Connected Components by Mark C. Chu-Carroll.

From the post:

One of the problems with many of the interesting graph algorithms is that they’re complex. As graphs get large, computations over them can become extremely slow. For example, graph coloring is NP-complete – so the time to run a true optimal graph coloring algorithm on an arbitrary graph grows exponentially in the size of the graph!

So, quite naturally, we look for ways to make it faster. We’ve already talked about using heuristics to get fast approximations. But what if we really need the true optimum? The other approach to making it faster, when you really want the true optimum, is parallelism: finding some way to subdivide the problem into smaller pieces which can be executed at the same time. This is obviously a limited technique – you can’t stop the exponential growth in execution time by adding a specific finite number of processors. But for particular problems, parallelism can make the difference between being able to do something fast enough to be useful, and not being able to. For a non-graph theoretic but compelling example, there’s been work in something called microforecasting, which is precise weather forecasting for extremely small areas. It’s useful for predicting things like severe lightning storms in local areas during events where there are large numbers of people. (For example, it was used at the Atlanta Olympics.) Microforecasting inputs a huge amount of information about local conditions – temperature, air pressure, wind direction, wind velocity, humidity, and such, and computes a short-term forecast for that area. Microforecasting is completely useless if it takes longer to compute the forecast than it takes for the actual weather to develop. If you can find a way to split the computation among a hundred processors, and get a forecast in 5 minutes, you’ve got something really useful. If you have to run it on a single processor, and it takes 2 hours – well, by then, any storm that the forecast predicts is already over.

For graphs, there’s a very natural way of decomposing problems for parallelization. Many complex graphs for real problems can be divided into a set of subgraphs, which can be handled in parallel. In fact, this can happen on multiple levels: graphs can be divided into subgraphs, which can be divided into further subgraphs. But for now, we’ll mainly focus on doing it once – one decomposition of the graph into subgraphs for parallelization.

Mark illustrates the power of graph decomposition.

Don’t skimp on the comments, there are references I am mining from there as well.

Graph Contraction and Minors [Merging by Another Name?]

Friday, December 28th, 2012

Graph Contraction and Minors by Mark C. Chu-Carroll.

From the post:

Another useful concept in simple graph theory is *contraction* and its result, *minors* of graphs. The idea is that there are several ways of simplifying a graph in order to study its properties: cutting edges, removing vertices, and decomposing a graph are all methods we’ve seen before. Contraction is a different technique that works by *merging* vertices, rather than removing them.

Here’s how contraction works. Suppose you have a graph G. Pick a pair of vertices v and w which are adjacent in G. You can create a graph G’ which is a contraction of G by replacing v and w with a *single* vertex, and taking any edge in G which is incident on either v or w, and making it incident on the newly contracted node.
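As a quick illustration, contraction on a simple undirected graph represented as a set of edges might look like this (vertex names are invented):

```python
def contract(edges, v, w):
    """Contract adjacent vertices v and w of a simple graph into a
    single vertex (kept under the name v), re-attaching every edge
    incident on either one and dropping the loop created by (v, w)."""
    merged = set()
    for a, b in edges:
        a, b = (v if a == w else a), (v if b == w else b)
        if a != b:  # discard the self-loop from the contracted edge
            merged.add(tuple(sorted((a, b))))
    return merged

edges = {("u", "v"), ("v", "w"), ("w", "x"), ("u", "w")}
contracted = contract(edges, "v", "w")
```

Note that parallel edges collapse too (here ("u", "v") and the re-attached ("u", "w") become one edge), which is what keeps the result a simple graph.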


Not quite merging in the topic map sense, because edges between vertexes, rather than properties of the subjects the vertexes represent, are the basis for “merging.”

Still, deeply intriguing work.

This post was first published in July of 2007, so recent it’s not.

More resources you would suggest?

I first saw this in a tweet by Marko A. Rodriguez.

Classic Visualization Papers

Thursday, December 27th, 2012

7 Classic Foundational Vis Papers You Might not Want to Publicly Confess you Don’t Know by Enrico Bertini.

From the post:

Even if I am definitely not a veteran of infovis research (far from it) I started reading my first papers around the year 2000 and since then I’ve never stopped. One thing I noticed is that some papers recur over and over and they really are (at least in part) the foundation of information visualization. Here is a list of those that:

  1. come from the very early days of infovis
  2. are foundational
  3. are cited over and over
  4. I like a lot

Of course this doesn’t mean these are the only ones you should read if you want to dig into this matter. Some other papers are foundational as well. For sure a side effect of the maturation of this field is that some newer papers are more solid and deep and I had to refrain myself to not include them in the list. But this is a collection of classics. A list of papers you just cannot avoid to know unless you want to risk a bad impression at VisWeek (ok ok it’s a joke … but there’s a pinch of truth in it). A retrospective. Definitely a must read. Call me nostalgic.

Take the time to read Enrico’s post and the papers he cites. Whatever your experience with visualization, you will be enriched by the experience.

I first saw this in “the sixth issue of’s Weekly Newsletter,” but I can’t give you a link to it that is not tied to my subscription, nor is there an archive page for the newsletter posts.

What’s New in Apache Cassandra 1.2 [Webinar]

Thursday, December 27th, 2012

What’s New in Apache Cassandra 1.2

From the registration page:

Date: Wednesday, January 9, 2013
Time: 11AM PT / 2 PM ET
Duration: 1 hour


Join Apache Cassandra Project Chair, Jonathan Ellis, as he looks at all the great improvements in Cassandra 1.2, including Vnodes, Parallel Leveled Compaction, Collections, Atomic Batches and CQL3.

About the Speaker:

Jonathan Ellis (@spyced), CTO at DataStax and Apache Cassandra Project Chair

Jonathan is CTO and co-founder at DataStax. Prior to DataStax, Jonathan worked extensively with Apache Cassandra while employed at Rackspace. Prior to Rackspace, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy. In addition to his work with DataStax, Jonathan is project chair of Apache Cassandra.

Another opportunity to start the new year by acquiring new skills or knowledge!

Utopia Documents

Thursday, December 27th, 2012

Checking the “sponsored by” link for pdfx v1.0 and discovered: Utopia Documents.

From the homepage:

Reading, redefined.

Utopia Documents brings a fresh new perspective to reading the scientific literature, combining the convenience and reliability of the PDF with the flexibility and power of the web. Free for Linux, Mac and Windows.

Building Bridges

The scientific article has been described as a Story That Persuades With Data, but all too often the link between data and narrative is lost somewhere in the modern publishing process. Utopia Documents helps to rebuild these connections, linking articles to underlying datasets, and making it easy to access online resources relating to an article’s content.

A Living Resource

Published articles form the ‘minutes of science’, creating a stable record of ideas and discoveries. But no idea exists in isolation, and just because something has been published doesn’t mean that the story is over. Utopia Documents reconnects PDFs with the ongoing discussion, keeping you up-to-date with the latest knowledge and metrics.


Make private notes for yourself, annotate a document for others to see or take part in an online discussion.

Explore article content

Looking for clarification of given terms? Or more information about them? Do just that, with integrated semantic search.

Interact with live data

Interact directly with curated database entries: play with molecular structures; edit sequence and alignment data; even plot and export tabular data.

A finger on the pulse

Stay up to date with the latest news. Utopia connects what you read with live data from Altmetric, Mendeley, CrossRef, Scibite and others.

A user can register for an account (enabling comments on documents) or use the application anonymously.

Presently focused on the life sciences, but there is no impediment to expansion into, for example, computer science.

It doesn’t solve semantic diversity issues, so there is an opportunity for topic maps there.

Doesn’t address the issue of documents being good at information delivery but not so good for information storage.

But semantic diversity and information storage are growth areas for Utopia Documents, not reservations about its use.

Suggest you start using and exploring Utopia Documents sooner rather than later!

pdfx v1.0 [PDF-to-XML]

Thursday, December 27th, 2012

pdfx v1.0

From the homepage:

Fully-automated PDF-to-XML conversion of scientific text

I submitted Static and Dynamic Semantics of NoSQL Languages, a paper I blogged about earlier this week: twenty-four pages with plenty of citations and equations.

I forgot to set a timer, but it isn’t for the impatient: I think the conversion ran for more than ten (10) minutes.

Some mathematical notation defeats the conversion process.

See: Static-and-Dynamic-Semantics-NoSQL-Languages.tar.gz for the original PDF plus the HTML and PDF outputs.

For occasional conversions where heavy math notation isn’t required, this may prove to be quite useful.

Why Do the New Orleans Saints Lose?…

Thursday, December 27th, 2012

Why Do the New Orleans Saints Lose? Data Visualization II by Nathan Lemoine.

I’m not a nationalist, apparatchik, school, state, profession, class, religion, language or development approach booster.

I must confess, however, I am a New Orleans Saints fan. Diversity (read: other teams) is a necessary evil to give the Saints someone to beat. 😉

An exercise you can repeat/expand with other teams (shudder), in other sports (shudder, shudder), to explore R and visualization of data.

What other stats/information would you want to incorporate/visualize?
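The shape of the comparison in the post carries over to any scripting language. A tiny sketch, with made-up game scores purely for illustration (substitute real box scores), comparing average point differential in wins vs. losses:

```python
# Hypothetical (points_for, points_against) results -- illustrative only.
games = [
    (31, 24), (27, 35), (24, 21), (17, 28), (35, 31), (20, 34),
]

# Split point differentials by outcome.
wins = [pf - pa for pf, pa in games if pf > pa]
losses = [pf - pa for pf, pa in games if pf < pa]

def avg(xs):
    return sum(xs) / len(xs)

print(f"avg margin in wins:   {avg(wins):+.1f}")
print(f"avg margin in losses: {avg(losses):+.1f}")
```

From there it is a small step to plot the differentials, or to break them out by opponent, venue, or any of the other stats you might want to incorporate.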

Majuro.JS [Useful Open Data]

Thursday, December 27th, 2012

Majuro.JS by Nick Doiron.

From the homepage:

Majuro.JS helps you make detailed, interactive maps with open buildings data.

Great examples on the homepage, but I prefer the explanation at GitHub.

This is wicked cool!

This is the type of open data I can see as the basis for “innovation,” and as a target for rich annotation by a topic map based application.