Archive for May, 2013

…How to program Ruzzle

Monday, May 27th, 2013

Graph search, algorithmic optimization, and word games: How to program Ruzzle

The graph processing aspects of programming Ruzzle are interesting.

With a suitable dictionary, this could be converted into a spelling puzzle for not so recent languages.

Or for that matter, a spelling puzzle for vocabularies in general (albeit some might require a larger board).
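The graph search itself is compact. What follows is a minimal sketch, not the linked article's code: it assumes the usual rules for this family of games (a path may step to any of the 8 neighbouring cells and may not reuse a cell) and ignores letter scores and bonuses. The function name `find_words` and the tiny board are made up for illustration.

```python
def find_words(board, dictionary):
    """Return every dictionary word that can be traced on the board."""
    words = set(dictionary)
    # Every prefix of every word, so dead-end paths are abandoned early.
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    rows, cols = len(board), len(board[0])
    found = set()

    def dfs(r, c, path, visited):
        path += board[r][c]
        if path not in prefixes:
            return                      # no word starts this way: prune
        if path in words:
            found.add(path)
        visited.add((r, c))
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in visited:
                    dfs(nr, nc, path, visited)
        visited.remove((r, c))          # backtrack so other paths can use the cell

    for r in range(rows):
        for c in range(cols):
            dfs(r, c, "", set())
    return found


# Hypothetical 3x3 board and word list, just to exercise the search.
board = ["cat", "xre", "zzs"]
print(sorted(find_words(board, {"cat", "care", "rest", "tea", "dog"})))
# prints ['care', 'cat', 'tea']
```

Swapping in a different dictionary is all it takes to turn the same search into a spelling puzzle for another language or vocabulary.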

Category Theory for Scientists

Monday, May 27th, 2013

Category Theory for Scientists by David Spivak.


There are many books designed to introduce category theory to either a mathematical audience or a computer science audience. In this book, our audience is the broader scientific community. We attempt to show that category theory can be applied throughout the sciences as a framework for modeling phenomena and communicating results. In order to target the scientific audience, this book is example-based rather than proof-based. For example, monoids are framed in terms of agents acting on objects, sheaves are introduced with primary examples coming from geography, and colored operads are discussed in terms of their ability to model self-similarity.

I first saw this at: Category Theory for Scientists by John Baez.

I forwarded it to Jack Park who responded with a link to an earlier post: Spivak on Category Theory by Bruce Bartlett.

David Spivak responds to the earlier post with a link to a Google doc for posting comments:

CT4S book: Typos, comments, questions, and suggestions

The MIT course description with links to supplemental materials: 18-S996, Spring 2013: Category theory for scientists.

If you post about this book, please include a pointer to the Google doc for comments as well.

Feature Selection with Scikit-Learn

Sunday, May 26th, 2013

Feature Selection with Scikit Learn by Sujit Pal.

From the post:

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person’s ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered – Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz’s passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

Sujit uses Scikit-Learn to investigate the accuracy of classifiers.
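To make the chi-squared idea concrete, here is a small pure-Python sketch for boolean features and a boolean outcome. It is not Sujit's code and not Scikit-Learn's implementation (Scikit-Learn offers `SelectKBest` with the `chi2` and `f_classif` scorers, which may compute the statistic differently). It uses the 2x2 contingency-table form of the statistic, and the names `chi2_score` and `select_k_best` plus the sample data are invented for illustration.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic for one boolean feature against boolean labels."""
    n11 = sum(1 for f, y in zip(feature, labels) if f and y)
    n10 = sum(1 for f, y in zip(feature, labels) if f and not y)
    n01 = sum(1 for f, y in zip(feature, labels) if not f and y)
    n00 = sum(1 for f, y in zip(feature, labels) if not f and not y)
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:      # constant feature or constant label carries no signal
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_k_best(rows, labels, k):
    """Return (sorted indices of the k best features, all scores)."""
    n_features = len(rows[0])
    scores = [chi2_score([row[j] for row in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Made-up data: feature 0 tracks the label, feature 1 is unrelated,
# feature 2 is constant.
rows = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
best, scores = select_k_best(rows, labels, 1)
print(best, scores)   # [0] [4.0, 0.0, 0.0]
```

With 200,000 markers the same ranking step is what lets you train on the top few hundred or thousand features instead of all of them.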

SecLists.Org Security Mailing List Archive

Sunday, May 26th, 2013

SecLists.Org Security Mailing List Archive

Speaking of reading material for the summer, how do you keep up with hacking news?

From the webpage:

Any hacker will tell you that the latest news and exploits are not found on any web site—not even Insecure.Org. No, the cutting edge in security research is and will continue to be the full disclosure mailing lists such as Bugtraq. Here we provide web archives and RSS feeds (now including message extracts), updated in real-time, for many of our favorite lists. Browse the individual lists below, or search them all:

Site includes one of those “hit or miss” search boxes that doesn’t learn from the successes of other users.

It’s better than reading each post separately, but only just.

With every search, you still have to read the posts, over and over again.

SIAM Archives

Sunday, May 26th, 2013

I saw an announcement for SDM 2014 : SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24 – 26, 2014, today but the call for papers hasn’t appeared, yet.

While visiting the conference site I followed the proceedings link to discover:

Workshop on Algorithm Engineering and Experiments (ALENEX) 2006 – 2013

Workshop on Analytic Algorithmics and Combinatorics (ANALCO) 2006 – 2013

ACM-SIAM Symposium on Discrete Algorithms (SODA) 2009 – 2013

Data Mining – 2001 – 2013

Mathematics for Industry 2009

Just in case you are short on reading material for the summer. 😉

The Sokal Hoax: At Whom Are We Laughing?

Sunday, May 26th, 2013

The Sokal Hoax: At Whom Are We Laughing? by Mara Beller.

The philosophical pronouncements of Bohr, Born, Heisenberg and Pauli deserve some of the blame for the excesses of the postmodernist critique of science.

The hoax perpetrated by New York University theoretical physicist Alan Sokal in 1996 on the editors of the journal Social Text quickly became widely known and hotly debated. (See Physics Today January 1997, page 61, and March 1997, page 73.) “Transgressing the Boundaries – Toward a Transformative Hermeneutics of Quantum Gravity,” was the title of the parody he slipped past the unsuspecting editors. [1]

Many readers of Sokal’s article characterized it as an ingenious exposure of the decline of the intellectual standards in contemporary academia, and as a brilliant parody of the postmodern nonsense rampant among the cultural studies of science. Sokal’s paper is variously, so we read, “a hilarious compilation of pomo gibberish”, “an imitation of academic babble”, and even “a transformative hermeneutics of total bullshit”. [2] Many scientists reported having “great fun” and “a great laugh” reading Sokal’s article. Yet whom, exactly, are we laughing at?

As telling examples of the views Sokal satirized, one might quote some other statements. Consider the following extrapolation of Heisenberg’s uncertainty and Bohr’s complementarity into the political realm:

“The thesis ‘light consists of particles’ and the antithesis ‘light consists of waves’ fought with one another until they were united in the synthesis of quantum mechanics. …Only why not apply it to the thesis Liberalism (or Capitalism), the antithesis Communism, and expect a synthesis, instead of a complete and permanent victory for the antithesis? There seems to be some inconsistency. But the idea of complementarity goes deeper. In fact, this thesis and antithesis represent two psychological motives and economic forces, both justified in themselves, but, in their extremes, mutually exclusive. …there must exist a relation between the latitudes of freedom df and of regulation dr, of the type df dr=p. …But what is the ‘political constant’ p? I must leave this to a future quantum theory of human affairs.”

Before you burst out laughing at such “absurdities,” let me disclose the author: Max Born, one of the venerated founding fathers of quantum theory [3]. Born’s words were not written tongue in cheek; he soberly declared that “epistemological lessons [from physics] may help towards a deeper understanding of social and political relations”. Such was Born’s enthusiasm to infer from the scientific to the political realm, that he devoted a whole book to the subject, unequivocally titled Physics and Politics [3].

A helpful illustration that poor or confused writing, accepted on the basis of “authority,” is not limited to the humanities.

The weakness of postmodernism does not lie exclusively in:

While publicly abstaining from criticizing Bohr, many of his contemporaries did not share his peculiar insistence on the impossibility of devising new nonclassical concepts – an insistence that put rigid strictures on the freedom to theorize. It is on this issue that the silence of other physicists had the most far-reaching consequences. This silence created and sustained the illusion that one needed no technical knowledge of quantum mechanics to fully comprehend its revolutionary epistemological lessons. Many postmodernist critics of science have fallen prey to this strategy of argumentation and freely proclaimed that physics itself irrevocably banished the notion of objective reality.

The question of “objective reality” can be answered only within some universe of discourse, such as quantum mechanics for example.

There are no reports of “objective reality” or “subjective reality” that do not originate from some human speaker situated in a cultural, social, epistemological, etc., context.

Postmodernists, Stanley Fish comes to mind, should have made the strong epistemological move of saying that all reports, of whatever nature, from literature to quantum mechanics, are reports situated in human context.

The rules for acceptable argument vary from one domain to another.

But there is no “out there” where anyone stands to judge between domains.

Should anyone lay claim to an “out there,” you should feel free to ask how they escaped the human condition of context?

And for what purpose do they claim an “out there?”

I suspect you will find they are trying to privilege some form of argumentation or to exclude other forms of argument.

That is a question of motive and not of some “out there.”

I first saw this at Pete Warden’s Five short links.

CLAVIN [Geotagging – Some Proofing Required]

Sunday, May 26th, 2013


From the webpage:

CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply “look up” location names; rather, it uses intelligent heuristics in an attempt to identify precisely which “Springfield” (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., “Ivory Coast” and “Côte d’Ivoire”) as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data.

See for an online demo, videos and other materials.

Your mileage may vary.

I used a quote from today’s New York Times (Rockets Hit Hezbollah Stronghold in Lebanon):

An ongoing battle in the Syrian town of Qusair on the Lebanese border has laid bare Hezbollah’s growing role in the Syrian conflict. The Iranian-backed militia and Syrian troops launched an offensive against the town last weekend. After dozens of Hezbollah fighters were killed in Qusair over the past week and buried in large funerals in Lebanon, Hezbollah could no longer play down its involvement.

Col. Abdul-Jabbar al-Aqidi, commander of the Syrian rebels’ Military Council in Aleppo, appeared in a video this week while apparently en route to Qusair, in which he threatened to strike in Beirut’s southern suburbs in retaliation for Hezbollah’s involvement in Syria.

“We used to say before, ‘We are coming Bashar.’ Now we say, ‘We are coming Bashar and we are coming Hassan Nasrallah,'” he said, in reference to Hezbollah’s leader.

“We will strike at your strongholds in Dahiyeh, God willing,” he said, using the Lebanese name for Hezbollah’s power center in southern Beirut. The video was still online on Youtube on Sunday.

Hezbollah lawmaker Ali Ammar said the incident targeted coexistence between the Lebanese and claimed the U.S. and Israel want to return Lebanon to the years of civil war. “They want to throw Lebanon backward into the traps of civil wars that we left behind,” he told reporters. “We will not go backward.”

The results from CLAVIN:

Locations Extracted and Resolved From Text

ID       Name      Lat, Lon            Country Code  #
272103   Lebanon   33.83333, 35.83333  LB            3
6951366  Lebanese  44.49123, 26.0877   RO            3
276781   Beirut    33.88894, 35.49442  LB            2
162037   Dahiyeh   38.19023, 57.00984  TM            1
6252001  U.S.      39.76, -98.5        US            1
103089   Qusair    25.91667, 40.45     SA            1
163843   Syria     35, 38              SY            1
163843   Syrian    35, 38              SY            1
294640   Israel    31.5, 34.75         IL            1
170062   Aleppo    36.25, 37.5         SY            1

(The highlight added to show incorrect resolutions.)


RO = Romania

SA = Saudi Arabia

TM = Turkmenistan

Plus “Qusair” appears twice in the quoted text.

Seven of the ten locations extracted were resolved correctly, a seventy percent (70%) accuracy rate.

Better than the average American but proofing is still an essential step in editorial workflow.
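A proofing pass over output like this is easy to script. The sketch below copies the name and country code pairs from the results above. The `expected` codes are my assumption of what a human proofreader would accept given the article's context, and the `proof` helper is hypothetical, not part of CLAVIN.

```python
# (name, resolved country code) pairs copied from the CLAVIN results.
resolved = [
    ("Lebanon", "LB"), ("Lebanese", "RO"), ("Beirut", "LB"),
    ("Dahiyeh", "TM"), ("U.S.", "US"), ("Qusair", "SA"),
    ("Syria", "SY"), ("Syrian", "SY"), ("Israel", "IL"), ("Aleppo", "SY"),
]

# Assumption: the codes a proofreader would expect for this article.
expected = {
    "Lebanon": "LB", "Lebanese": "LB", "Beirut": "LB", "Dahiyeh": "LB",
    "U.S.": "US", "Qusair": "SY", "Syria": "SY", "Syrian": "SY",
    "Israel": "IL", "Aleppo": "SY",
}


def proof(resolved, expected):
    """Return the accuracy and the resolutions that need human review."""
    wrong = [(name, code) for name, code in resolved if expected[name] != code]
    accuracy = (len(resolved) - len(wrong)) / len(resolved)
    return accuracy, wrong


accuracy, wrong = proof(resolved, expected)
print(f"{accuracy:.0%}", wrong)
# prints 70% [('Lebanese', 'RO'), ('Dahiyeh', 'TM'), ('Qusair', 'SA')]
```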

I first saw this in Pete Warden’s Five short links.

Weaponized Information Price Guide

Sunday, May 26th, 2013

A price guide for zero day exploits was reported in Shopping For Zero-Days: A Price List For Hackers’ Secret Software Exploits by Andy Greenberg:


A rough guide as reliable sales information is hard to obtain.

But accurate enough to know that Tavis Ormandy (see: Google Researcher Reveals Zero-Day Windows Bug by Mathew J. Schwartz) dropped the ball when he:

published full details for a zero-day Windows vulnerability, including proof-of-concept (PoC) exploit code.

A weakness present in Windows 7 and 8, possibly going back 20 years in Windows.

With the number of legacy Windows systems, particularly in government offices, that should have pushed the price up substantially.

Buyers of software exploits want secrecy to preserve an exploit’s value, which leaves brokers as the best marketing strategy for software exploits.

But what if the weaponized information, even if known, was unlikely to change?

For example, mapping response times by fire/police to metropolitan maps? Or locations of accidents that cause the longest traffic delays?

The data is available but not always assembled for easy reference. Subject to legitimate as well as illegitimate uses.

Where would you market them? How would you price them?



I encountered the story about the zero-day exploit for Windows almost two weeks later in: Google researcher discloses zero-day exploit for Windows. Slight differences. H associates were able to reproduce the bug. Opening a file is sufficient to run programs with system privileges, even with a guest account.


Saturday, May 25th, 2013

Cato’s “Deepbills” Project Advances Government Transparency by Jim Harper.

From the post:

But there’s no sense in sitting around waiting for things to improve. Given the incentives, transparency is something that we will have to force on government. We won’t receive it like a gift.

So with software we acquired and modified for the purpose, we’ve been adding data to the bills in Congress, making it possible to learn automatically more of what they do. The bills published by the Government Printing Office have data about who introduced them and the committees to which they were referred. We are adding data that reflects:

– What agencies and bureaus the bills in Congress affect;

– What laws the bills in Congress effect: by popular name, U.S. Code section, Statutes at Large citation, and more;

– What budget authorities bills include, the amount of this proposed spending, its purpose, and the fiscal year(s).

We are capturing proposed new bureaus and programs, proposed new sections of existing law, and other subtleties in legislation. Our “Deepbills” project is documented at

This data can tell a more complete story of what is happening in Congress. Given the right Web site, app, or information service, you will be able to tell who proposed to spend your taxpayer dollars and in what amounts. You’ll be able to tell how your member of Congress and senators voted on each one. You might even find out about votes you care about before they happen!

Two important points:

First, transparency must be forced upon government (I would add businesses).

Second, transparency is up to us.

Do you know something the rest of us should know?

On your mark!

Get set!


I first saw this at: Harper: Cato’s “Deepbills” Project Advances Government Transparency.

Data Visualization: Exploring Biodiversity

Saturday, May 25th, 2013

Data Visualization: Exploring Biodiversity by Sean Gonzalez.

From the post:

When you have a few hundred years worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don’t perfectly align over the years, and in some cases there is outright conflicting information. This data is important, it is our civilization’s best minds giving their all to capture and record the biological diversity of our planet. Unfortunately, as it stands today, if you or I were to decide we wanted to learn more, or if we wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career. Earlier this year an executive order was given which generally stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn’t necessarily directly apply to them, and has embarked to make their treasure of information more easily accessible. This is a laudable goal, but how do we actually go about accomplishing this? Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever increasing sources. However, we know this curation challenge can not be met by simply wrapping the data with a single structure or taxonomy. When we search and explore the BHL data we may not know precisely what we’re looking for, and we don’t want a scavenger hunt to ensue where we’re forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out…

People see relationships in the data differently, so when we go exploring one person may do better with a tree structure, others prefer a classic title/subject style search, or we may be interested in reference types and frequencies. Why we don’t think about it as one monolithic system is akin to discussing the number of Angels that fit on the head of a pin, we’ll never be able to test our theories. Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

What would you do beyond visualization?

Semantics as Data

Saturday, May 25th, 2013

Semantics as Data by Oliver Kennedy.

From the post:

Something I’ve been getting drawn to more and more is the idea of computation as data.

This is one of the core precepts in PL and computation: any sort of computation can be encoded as data. Yet, this doesn’t fully capture the essence of what I’ve been seeing. Sure you can encode computation as data, but then what do you do with it? How do you make use of the fact that semantics can be encoded?

Let’s take this question from another perspective. In Databases, we’re used to imposing semantics on data. Data has meaning because we chose to give it meaning. The number 100,000 is meaningless, until I tell you that it’s the average salary of an employee at BigCorporateCo. Nevertheless, we can still ask questions in the abstract. Whatever semantics you use, 100,000 < 120,000. We can create abstractions (query languages) that allow us to ask questions about data, regardless of their semantics.

By comparison, an encoded computation carries its own semantics. This makes it harder to analyze, as the nature of those semantics is limited only by the type of encoding used to store the computation. But this doesn’t stop us from asking questions about the computation.

The Computation’s Effects

The simplest thing we can do is to ask a question about what it will compute. These questions span the range from the trivial to the typically intractable. For example, we can ask about…

  • … what the computation will produce given a specific input, or a specific set of inputs.
  • … what inputs will produce a given (range of) output(s).
  • … whether a particular output is possible.
  • … whether two computations are equivalent.

One particularly fun example in this space is Oracle’s Expression type [1]. An Expression stores (as a datatype) an arbitrary boolean expression with variables. The result of evaluating this expression on a given valuation of the variables can be injected into the WHERE clause of any SELECT statement. Notably, Expression objects can be indexed based on variable valuations. Given 3 such expressions: (A = 3), (A = 5), (A = 7), we can build an index to identify which expressions are satisfied for a particular valuation of A.

I find this beyond cool. Not only can Expression objects themselves be queried, it’s actually possible to build index structures to accelerate those queries.

Those familiar with probabilistic databases will note some convenient parallels between the expression type and Condition Columns used in C-Tables. Indeed, the concepts are almost identical. A C-Table encodes the semantics of the queries that went into its construction. When we compute a confidence in a C-Table (or row), what we’re effectively asking about is the fraction of the input space that the C-Table (row) produces an output for.
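The indexing idea in the quoted example can be sketched in a few lines. This is not Oracle's API: `ExpressionIndex` is a hypothetical class that handles only equality expressions of the form (variable = constant), which is enough to reproduce the (A = 3), (A = 5), (A = 7) case.

```python
from collections import defaultdict


class ExpressionIndex:
    """Index stored expressions of the form (variable = constant)."""

    def __init__(self):
        # (variable, constant) -> ids of the expressions that test for it
        self._by_valuation = defaultdict(set)

    def add(self, expr_id, variable, constant):
        self._by_valuation[(variable, constant)].add(expr_id)

    def satisfied_by(self, valuation):
        """Return ids of all stored expressions that hold under the valuation."""
        hits = set()
        for variable, value in valuation.items():
            hits |= self._by_valuation.get((variable, value), set())
        return hits


index = ExpressionIndex()
index.add("e1", "A", 3)
index.add("e2", "A", 5)
index.add("e3", "A", 7)
print(index.satisfied_by({"A": 5}))   # {'e2'}
```

The lookup cost depends on the number of variables in the valuation, not on the number of stored expressions, which is the point of indexing them.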

At every level of semantics there is semantic diversity.

Whether it is code or data, there are levels of semantics, each with semantic diversity.

You don’t have to resolve all semantic diversity, just enough to give you an advantage over others.

’s Climb to the Social Gaming Throne [TM Incentives]

Saturday, May 25th, 2013

’s Climb to the Social Gaming Throne by Karina Babcock.

From the post:

This week I’d like to highlight, a European social gaming giant that recently claimed the throne for having the most daily active users (more than 66 million). has methodically and successfully expanded its reach beyond mainstream social gaming to dominate the mobile gaming market — it offers a streamlined experience that allows gamers to pick up their gaming session from wherever they left off, in any game and on any device.’s top games include “Candy Crush Saga” and “Bubble Saga”.

And — you guessed it — runs on CDH.

With a business model that offers all games for free, relies advertising and in-game products like boosters and extra lives to generate revenue. In other words, it has to be smart in every communication with customers in order to create value for both the gamer and the advertiser. uses Hadoop to process, store, and analyze massive volumes of log data generated from the games along with other data sources such as daily currency exchange rates from the European Central bank, multiple metadata feeds, and advertising servers’ log files.

Karina ends with links to more details on the Hadoop setup at

I don’t know how to make a useful topic map as easy as “Candy Crush Saga” or “Bubble Saga,” but you might.

Or perhaps a combination of topic maps and games.

For example, buying up extra lives for popular games and awarding them as incentives for using a topic map interface?

You can search G.*e with no prize or use Topic Map X, with a prize.

Which one would you choose?

Protests of unfairness from a house that rigs counts aren’t going to bother me.


Apache Pig Editor in Hue 2.3

Saturday, May 25th, 2013

Apache Pig Editor in Hue 2.3

From the post:

In the previous installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how to analyze data with Hue using Apache Hive via Hue’s Beeswax and Catalog applications. In this installment, we’ll focus on using the new editor for Apache Pig in Hue 2.3.

Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:

  • UDFs and parameters (with default value) support
  • Autocompletion of Pig keywords, aliases, and HDFS paths
  • Syntax highlighting
  • One-click script submission
  • Progress, result, and logs display
  • Interactive single-page application

Here’s a short video demoing its capabilities and ease of use:


How are you editing your Pig scripts now?

How are you documenting the semantics of your Pig scripts?

How do you search across your Pig scripts?

“Correlation versus causation in a single graph”

Saturday, May 25th, 2013


From Chris Blattman.

Familiar story for readers of this blog, but a lesson worth repeating.

Open Source Release: postgresql-hll

Saturday, May 25th, 2013

Open Source Release: postgresql-hll

From the post:

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm, as new engineering insights come up.

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

Would you say topic maps aggregate data?

I thought so.

How would you adapt HLL to synonymous identifications?

I ask because of this line in the post:

Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

Performing “…interactive set-intersection of operands with millions or billions of entries in milliseconds…” sounds like an attractive feature for topic map software.
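For readers who have not met the technique, the inclusion-exclusion step is |A ∩ B| = |A| + |B| − |A ∪ B|, with each term estimated from an hll. Below is a toy sketch in Python, not the postgresql-hll implementation: the `HLL` class is hypothetical, it uses SHA-1 from the standard library as a stand-in for MurmurHash3, and it keeps plain registers only, with none of the extension's hybrid small-set representation.

```python
import hashlib
import math


class HLL:
    """Toy HyperLogLog counter supporting add, union, and cardinality."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def union(self, other):
        """Register-wise maximum; both counters must use the same p."""
        merged = HLL(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return a.cardinality() + b.cardinality() - a.union(b).cardinality()


a, b = HLL(), HLL()
for i in range(10000):
    a.add(i)
for i in range(5000, 15000):
    b.add(i)
print(round(a.cardinality()), round(intersection_estimate(a, b)))
```

The true values here are 10,000 and 5,000. Note that the error of the intersection estimate scales with the size of the union, so small intersections of large sets are estimated poorly.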


Pornography: what we know, what we don’t

Friday, May 24th, 2013

Pornography: what we know, what we don’t by Mona Chalabi.

From the post:

Unsurprisingly, on the Datablog we often write articles about data when we have data. But some topics, like pornography, aren’t conducive to statistical analysis, no matter how important many claim they are.

Despite these challenges, a report released today has sought to assess children and young people’s exposure to pornography and understand its impact. Led by Middlesex University and commissioned by the Children’s Commissioner, this was a rapid evidence assessment – completed in the space of just three months as part of a much larger ongoing inquiry into child sexual exploitation.

The report found that a “significant proportion of children and young people are exposed to or access pornography”, and that this is linked to “unrealistic attitudes about sex” as well as “less progressive gender role attitudes (e.g. male dominance and female submission)”.

Though the report makes these and other important conclusions, you’ll notice that numbers are conspicuously absent in its language. One reason is that its findings were not based on primary research but a literature review that began with 41,000 identified sources and concluded by using 276 of those that were deemed relevant.

The post doesn’t even mention that we will know pornography when we see it.


Perhaps that is part of the problem of measurement.

Rather than processing a trillion triples, the next big data measure should be indexing all the pornography on the WWW over some time period.


Graph Databases and Software Metrics & Analysis

Friday, May 24th, 2013

Graph Databases and Software Metrics & Analysis by Michael Hunger

From the post:

This is the first in a series of blog posts that discuss the usage of a graph database like Neo4j to store, compute and visualize a variety of software metrics and other types of software analytics (method call hierarchies, transitive closure, critical path analysis, volatility & code quality). Follow up posts by different contributors will be linked from this one.

Everyone who works in software development comes across software metrics at some point.

Just because of curiosity about the quality or complexity of the code we’ve written, or a real interest to improve quality and reduce technical debt, there are many reasons.

In general there are many ways of approaching this topic, from just gathering and rendering statistics in diagrams to visualizing the structure of programs and systems.

There are a number of commercial and free tools available that compute software metrics and help expose the current trend in your project’s development.

Software metrics can cover different areas. Computing cyclomatic complexity, analysing dependencies or call traces is probably easy, using static analysis to find smaller or larger issues is more involved and detecting code smells can be an interesting challenge in AST parsing.

Interesting work on using graph databases (here Neo4j) for software analysis.

Be sure to see the resources listed at the end of the post.
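As a taste of the kind of question involved, here is the transitive closure of a method call hierarchy in plain Python. This is a stand-in for illustration only: the post stores such graphs in Neo4j and queries them there, and the `call_graph` data and `transitive_calls` function below are made up.

```python
from collections import deque


def transitive_calls(call_graph, start):
    """Return every method reachable from `start` through one or more calls."""
    reachable = set()
    queue = deque(call_graph.get(start, ()))
    while queue:
        method = queue.popleft()
        if method in reachable:
            continue                    # already expanded: avoids cycles
        reachable.add(method)
        queue.extend(call_graph.get(method, ()))
    return reachable


# Hypothetical call hierarchy: caller -> direct callees.
call_graph = {
    "main": ["parse", "render"],
    "parse": ["tokenize"],
    "render": ["format", "parse"],
    "format": [],
}
print(sorted(transitive_calls(call_graph, "main")))
# prints ['format', 'parse', 'render', 'tokenize']
```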

Universal Properties

Friday, May 24th, 2013

Universal Properties by Jeremy Kun.

From the post:

Previously in this series we’ve seen the definition of a category and a bunch of examples, basic properties of morphisms, and a first look at how to represent categories as types in ML. In this post we’ll expand these ideas and introduce the notion of a universal property. We’ll see examples from mathematics and write some programs which simultaneously prove certain objects have universal properties and construct the morphisms involved.

Just in time for a long holiday weekend in the U.S., Jeremy continues his series on category theory.
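If "universal property" is new to you, the product is the standard first example: given f: Z → A and g: Z → B, there is exactly one map h: Z → A × B whose projections give back f and g. Jeremy's programs are in ML. The Python below is my own rough sketch with made-up names, and it only spot-checks the commuting triangles on sample inputs, which is not a proof.

```python
def fst(pair):
    """Projection A x B -> A."""
    return pair[0]


def snd(pair):
    """Projection A x B -> B."""
    return pair[1]


def pairing(f, g):
    """The mediating map Z -> A x B induced by f: Z -> A and g: Z -> B."""
    return lambda z: (f(z), g(z))


def compose(outer, inner):
    return lambda x: outer(inner(x))


f = len          # Z = str, A = int
g = str.upper    # Z = str, B = str
h = pairing(f, g)

# Spot-check that both triangles commute: fst . h = f and snd . h = g.
samples = ["", "cat", "functor"]
assert all(compose(fst, h)(z) == f(z) for z in samples)
assert all(compose(snd, h)(z) == g(z) for z in samples)
print(h("cat"))   # (3, 'CAT')
```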


Wakari.IO Web-based Python Data Analysis

Friday, May 24th, 2013

Wakari.IO Web-based Python Data Analysis

From: Continuum Analytics Launches Full-Featured, In-Browser Data Analytics Environment by Corinna Bahr.

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets.

Hosted on Amazon’s Elastic Compute Cloud (EC2), Wakari gives users the ability to share analyses and results via IPython notebook, visualize with Matplotlib, easily switch between multiple versions of Python and its scientific libraries, and quickly collaborate on analyses without having to download data locally to their laptops or workstations. Users can share code and results as simple web URLs, from which other users can easily create their own copies to modify and explore.

Previously in beta, the version 1.0 release of Wakari boasts a number of new features, including:

  • Premium access to SSH, ipcluster configuration, and the full range of Amazon compute nodes and clusters via a drop-down menu
  • Enhanced IPython notebook support, most notably an IPython notebook gallery and an improved UI for sharing
  • Bundles for simplified sharing of files, folders, and Python library dependencies
  • Expanded Wakari documentation
  • Numerous enhancements to the user interface

This looks quite interesting. There is a free option if you are undecided.

I first saw this at: Wakari: Continuum In-Browser Data Analytics Environment.

From data to analysis:… [Data Integration For a Purpose]

Friday, May 24th, 2013

From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language by Wibe A de Jong, Andrew M Walker and Marcus D Hanwell. (Journal of Cheminformatics 2013, 5:25 doi:10.1186/1758-2946-5-25)



Multidisciplinary integrated research requires the ability to couple the diverse sets of data obtained from a range of complex experiments and computer simulations. Integrating data requires semantically rich information. In this paper an end-to-end use of semantically rich data in computational chemistry is demonstrated utilizing the Chemical Markup Language (CML) framework. Semantically rich data is generated by the NWChem computational chemistry software with the FoX library and utilized by the Avogadro molecular editor for analysis and visualization.


The NWChem computational chemistry software has been modified and coupled to the FoX library to write CML compliant XML data files. The FoX library was expanded to represent the lexical input files and molecular orbitals used by the computational chemistry software. Draft dictionary entries and a format for molecular orbitals within CML CompChem were developed. The Avogadro application was extended to read in CML data, and display molecular geometry and electronic structure in the GUI allowing for an end-to-end solution where Avogadro can create input structures, generate input files, NWChem can run the calculation and Avogadro can then read in and analyse the CML output produced. The developments outlined in this paper will be made available in future releases of NWChem, FoX, and Avogadro.


The production of CML compliant XML files for computational chemistry software such as NWChem can be accomplished relatively easily using the FoX library. The CML data can be read in by a newly developed reader in Avogadro and analysed or visualized in various ways. A community-based effort is needed to further develop the CML CompChem convention and dictionary. This will enable the long-term goal of allowing a researcher to run simple “Google-style” searches of chemistry and physics and have the results of computational calculations returned in a comprehensible form alongside articles from the published literature.

Aside from its obvious importance for cheminformatics, I think there is another lesson in this article.

Integration of data required “…semantically rich information…,” but just as importantly, integration was not a goal in and of itself.

Integration was only part of a workflow that had other goals.

No doubt some topic maps are useful as end products of integrated data, but what of cases where integration is part of a workflow?

Think of the non-reusable data integration mappings that are offered by many enterprise integration packages.

How Does A Search Engine Work?…

Friday, May 24th, 2013

How Does A Search Engine Work? An Educational Trek Through A Lucene Postings Format by Doug Turnbull.

From the post:

A new feature of Lucene 4 – pluggable codecs – allows for the modification of Lucene’s underlying storage engine. Working with codecs and examining their output yields fascinating insights into how exactly Lucene’s search works in its most fundamental form.

The centerpiece of a Lucene codec is its postings format. Postings are a commonly thrown around word in the Lucene space. A Postings format is the representation of the inverted search index – the core data structure used to look up documents that contain a term. I think nothing really captures the logical look-and-feel of Lucene’s postings better than Mike McCandless’s SimpleTextPostingsFormat. SimpleText is a text-based representation of postings created for educational purposes. I’ve indexed a few documents in Lucene using SimpleText to demonstrate how postings are structured to allow for fast search:

A first step towards moving beyond being a search engine result consumer.
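If the idea of an inverted index is new to you, a toy version may help before reading Doug's post. This is only a sketch of the general data structure, not Lucene's SimpleText format or API; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to a sorted list of (doc_id, positions) postings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return {term: sorted(entries.items()) for term, entries in index.items()}


def search(postings, term):
    """Return the ids of the documents that contain the term."""
    return [doc_id for doc_id, _ in postings.get(term.lower(), [])]


# Hypothetical documents, used only for illustration.
docs = ["the quick brown fox", "the lazy dog", "quick quick dog"]
postings = build_postings(docs)

print(postings["quick"])
print(search(postings, "dog"))
```

A lookup goes straight from the term to the documents that contain it, without scanning any document text, which is what makes searching the index fast.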

Postgres Demystified

Friday, May 24th, 2013

From the description:

Postgres has long been known as a stable database product that reliably stores your data. However, in recent years it has picked up many features, allowing it to become a much sexier database.

This video covers a whirlwind of Postgres features, which highlight why you should consider it for your next project. These include:

  • Datatypes
  • Using other languages within Postgres
  • Extensions including NoSQL inside your SQL database
  • Accessing your non-Postgres data (Redis, Oracle, MySQL) from within Postgres
  • Window Functions

Craig Kerstiens gives a fast-paced overview of Postgres.

Introducing Fact Tank

Friday, May 24th, 2013

Introducing Fact Tank by Alan Murray.

From the post:

Welcome to Fact Tank, a new, real-time platform from the Pew Research Center, dedicated to finding news in the numbers.

Fact Tank will build on the Pew Research Center’s unique brand of data journalism. For years, our teams of writers and social scientists have combined rigorous research with high-quality storytelling to provide important information on issues and trends shaping the nation and the world.

Fact Tank will allow us to provide that sort of information at a faster pace, in an attempt to provide you with the information you need when you need it. We’ll fill the gap between major surveys and reports with shorter pieces using our data to give context to the news of the day. And we’ll scour other data sources, bringing you important insights on politics, religion, technology, media, economics and social trends.

An interesting source of additional data on current news stories.

Installing Distributed Solr 4 with Fabric

Thursday, May 23rd, 2013

Installing Distributed Solr 4 with Fabric by Martijn Koster

From the post:

Solr 4 has a subset of features that allow it to be run as a distributed fault-tolerant cluster, referred to as “SolrCloud”. Installing and configuring Solr on a multi-node cluster can seem daunting when you’re a developer who just wants to give the latest release a try. The wiki page is long and complex, and configuring nodes manually is laborious and error-prone. And while your OS has ZooKeeper/Solr packages, they are probably outdated. But it doesn’t have to be a lot of work: in this post I will show you how to deploy and test a Solr 4 cluster using just a few commands, using mechanisms you can easily adjust for your own deployments.

I am using a cluster consisting of a virtual machines running Ubuntu 12.04 64bit and I am controlling them from my MacBook Pro. The Solr configuration will mimic the Two shard cluster with shard replicas and zookeeper ensemble example from the wiki.

You can run this on AWS EC2, but some special considerations apply, see the footnote.

We’ll use Fabric, a light-weight deployment tool that is basically a Python library to easily execute commands on remote nodes over ssh. Compared to Chef/Puppet it is simpler to learn and use, and because it’s an imperative approach it makes sequential orchestration of dependencies more explicit. Most importantly, it does not require a separate server or separate node-side software installation.

DISCLAIMER: these instructions and associated scripts are released under the Apache License; use at your own risk.

I strongly recommend you use disposable virtual machines to experiment with.

Something to get you excited about the upcoming weekend!


MongoDB: The Definitive Guide 2nd Edition is Out!

Thursday, May 23rd, 2013

MongoDB: The Definitive Guide 2nd Edition is Out! by Kristina Chodorow.

From the webpage:

The second edition of MongoDB: The Definitive Guide is now available from O’Reilly! It covers both developing with and administering MongoDB. The book is language-agnostic: almost all of the examples are in JavaScript.

Looking forward to enjoying the second edition as much as the first!

Although, I am not really sure that always using JavaScript means you are “language-agnostic.” 😉

Probabilistic Programming and Bayesian Methods for Hackers

Thursday, May 23rd, 2013

Probabilistic Programming and Bayesian Methods for Hackers by Cam Davidson-Pilon and others.

From the website:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author’s own prior opinion.

After some recent success of Bayesian methods in machine-learning competitions, I decided to investigate the subject again. Even with my mathematical background, it took me three straight-days of reading examples and trying to put the pieces together to understand the methods. There was simply not enough literature bridging theory to practice. The problem with my misunderstanding was the disconnect between Bayesian mathematics and probabilistic programming. That being said, I suffered then so the reader would not have to now. This book attempts to bridge the gap.

If Bayesian inference is the destination, then mathematical analysis is a particular path towards it. On the other hand, computing power is cheap enough that we can afford to take an alternate route via probabilistic programming. The latter path is much more useful, as it denies the necessity of mathematical intervention at each step, that is, we remove often-intractable mathematical analysis as a prerequisite to Bayesian inference. Simply put, this latter computational path proceeds via small intermediate jumps from beginning to end, whereas the first path proceeds by enormous leaps, often landing far away from our target. Furthermore, without a strong mathematical background, the analysis required by the first path cannot even take place.

Bayesian Methods for Hackers is designed as an introduction to Bayesian inference from a computational/understanding-first, and mathematics-second, point of view. Of course as an introductory book, we can only leave it at that: an introductory book. For the mathematically trained, they may cure the curiosity this text generates with other texts designed with mathematical analysis in mind. For the enthusiast with less mathematical-background, or one who is not interested in the mathematics but simply the practice of Bayesian methods, this text should be sufficient and entertaining.
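As a taste of the "computational path," here is a small sketch of my own, not taken from the book (which works with a probabilistic programming library). It estimates the bias of a coin after 7 heads in 10 flips by simple rejection sampling with a uniform prior, using only the standard library. The function name, the data, and the number of draws are all assumptions for illustration.

```python
import random


def posterior_samples(heads, flips, draws=50_000, seed=0):
    """Rejection sampling: keep prior draws that reproduce the observed data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        p = rng.random()  # candidate bias drawn from a uniform prior
        simulated = sum(rng.random() < p for _ in range(flips))
        if simulated == heads:  # keep only candidates that match the data
            accepted.append(p)
    return accepted


# Hypothetical observation: 7 heads in 10 flips.
samples = posterior_samples(heads=7, flips=10)
estimate = sum(samples) / len(samples)

print(f"kept {len(samples)} samples, posterior mean is about {estimate:.2f}")
```

No integrals are required. The analytic answer for this case is a Beta(8, 4) posterior with mean 2/3, and the simulated mean should land close to it.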

Not yet complete but what is there you will find very useful.

Data Science eBook by Analyticbridge – 2nd Edition

Thursday, May 23rd, 2013

Data Science eBook by Analyticbridge – 2nd Edition by Vincent Granville.

From the post:

This 2nd edition has more than 200 pages of pure data science, far more than the first edition. This new version of our very popular book will soon be available for download: we will make an announcement when it is officially published.

Sixty-two (62) new contributions split between data science recipes, data science discussions, data science resources.

If you can’t wait for the ebook, links to the contributions are given at Vincent’s post.

One post in particular caught my attention: How to reverse engineer Google?

The project sounds interesting but why not reverse engineer CNN or WSJ or NYT coverage?

Watch the stories that appear most often and the most visibly to determine what you need to do for coverage.

It may not have anything to do with your core competency, but then neither does gaming page rankings by Google.

It is just that gaming the rankings becomes your business model, and then you are selling your service to people even less informed than you are.

Do be careful because some events covered by CNN, WSJ and the NYT are considered illegal in some jurisdictions.

Subway Maps and Visualising Social Equality

Wednesday, May 22nd, 2013

Subway Maps and Visualising Social Equality by James Cheshire.

From the post:

Most government statistics are mapped according to official geographical units. Whilst such units are essential for data analysis and making decisions about, for example, government spending, they are hard for many people to relate to and they don’t particularly stand out on a map. This is why I tried a new method back in July 2012 to show life expectancy statistics in a fresh light by mapping them on to London Tube stations. The resulting “Lives on the Line” map has been really popular with many people surprised at the extent of the variations in the data across London and also grateful for the way that it makes seemingly abstract statistics more easily accessible. To find out how I did it (and read some of the feedback) you can see here.

James gives a number of examples of the use of transportation lines making “abstract statistics more easily accessible.”

Worth a close look if you are interested in making dry municipal statistics part of the basis for social change.

US rendition map: what it means, and how to use it

Wednesday, May 22nd, 2013

US rendition map: what it means, and how to use it by James Ball.

From the post:

The Rendition Project, a collaboration between UK academics and the NGO Reprieve, has produced one of the most detailed and illuminating research projects shedding light on the CIA’s extraordinary rendition project to date. Here’s how to use it.

Truly remarkable project to date, but could be even more successful with your assistance.

It is not likely that any of the principals will wind up in the dock at The Hague.

On the other hand, exposing their crimes may deter others from similar adventures.

Integrating the US’ Documents

Wednesday, May 22nd, 2013

Integrating the US’ Documents by Eric Mill.

From the post:

A few weeks ago, we integrated the full text of federal bills and regulations into our alert system, Scout. Now, if you visit CISPA or a fascinating cotton rule, you’ll see the original document – nicely formatted, but also well-integrated into Scout’s layout. There are a lot of good reasons to integrate the text this way: we want you to see why we alerted you to a document without having to jump off-site, and without clunky iframes.

As importantly, we wanted to do this in a way that would be easily reusable by other projects and people. So we built a tool called us-documents that makes it possible for anyone to do this with federal bills and regulations. It’s available as a Ruby gem, and comes with a command line tool so that you can use it with Python, Node, or any other language. It lives inside the unitedstates project at unitedstates/documents, and is entirely public domain.

This could prove to be really interesting, both as a matter of content and as a technique to replicate elsewhere.

I first saw this at: Mill: Integrating the US’s Documents.