Archive for August, 2013


Saturday, August 31st, 2013


From the webpage:

Python module which implements a template based state machine for parsing semi-formatted text. Originally developed to allow programmatic access to information returned from the command line interface (CLI) of networking devices.

TextFSM was developed internally at Google and released under the Apache 2.0 licence for the benefit of the wider community.

See: TextFSMHowto for details.

TextFSM looks like a useful Python module for extracting data from “semi-structured” text.

I first saw this in Nat Torkington’s Four short links: 29 August 2013.
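To make "template based parsing of semi-formatted text" concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not TextFSM's API or template syntax. The `TEMPLATE` rules, the field names and the sample CLI output are all made up for illustration.

```python
import re

# Hypothetical "template": one named-group regex per line type of interest.
TEMPLATE = [
    re.compile(r"^Interface (?P<name>\S+) is (?P<status>up|down)"),
    re.compile(r"^\s+MTU (?P<mtu>\d+) bytes"),
]

# Made-up sample of semi-formatted CLI output.
SAMPLE = """\
Interface eth0 is up
  MTU 1500 bytes
Interface eth1 is down
  MTU 9000 bytes
"""


def parse(text):
    """Walk the text line by line and emit one record per interface."""
    records, current = [], {}
    for line in text.splitlines():
        for rule in TEMPLATE:
            match = rule.match(line)
            if not match:
                continue
            fields = match.groupdict()
            # A new "Interface" line closes the record collected so far.
            if "name" in fields and current:
                records.append(current)
                current = {}
            current.update(fields)
            break
    if current:
        records.append(current)
    return records


rows = parse(SAMPLE)
for row in rows:
    print(row)
```

The real module keeps the rules in a separate template file, so the parsing code stays the same while the templates vary by device.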

The PieMaster

Saturday, August 31st, 2013

pie chart

Just too bizarre to pass on re-posting.

I found this, and other material suitable for teaching students what not to do, at: WTF Visualizations.

Do You Mansplain Topic Maps?

Saturday, August 31st, 2013

Selling Data Science: Common Language by Sean Gonzalez.

From the post:

What do you think of when you say the word “data”? For data scientists this means SO MANY different things from unstructured data like natural language and web crawling to perfectly square excel spreadsheets. What do non-data scientists think of? Many times we might come up with a slick line for describing what we do with data, such as, “I help find meaning in data” but that doesn’t help sell data science. Language is everything, and if people don’t use a word on a regular basis it will not have any meaning for them. Many people aren’t sure whether they even have data let alone if there’s some deeper meaning, some insight, they would like to find. As with any language barrier the goal is to find common ground and build from there.

You can’t blame people, the word “data” is about as abstract as you can get, perhaps because it can refer to so many different things. When discussing data casually, rather than mansplain what you believe data is or what it could be, it’s much easier to find examples of data that they are familiar with and preferably are integral to their work. (emphasis added)

Well? Your answer here:______.

Let’s recast that last clause to read:

…it’s much easier to find examples of subjects they are familiar with and preferably are integral to their work.

So that the conversation is about their subjects and what they want to say about them.

As a potential customer, I would find that more compelling.


Working with PDFs…

Saturday, August 31st, 2013

Working with PDFs Using Command Line Tools in Linux by William J. Turkel.

From the post:

We have already seen that the default assumption in Linux and UNIX is that everything is a file, ideally one that consists of human- and machine-readable text. As a result, we have a very wide variety of powerful tools for manipulating and analyzing text files. So it makes sense to try to convert our sources into text files whenever possible. In the previous post we used optical character recognition (OCR) to convert pictures of text into text files. Here we will use command line tools to extract text, images, page images and full pages from Adobe Acrobat PDF files.

A great post if you are working with PDF files.
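If you would rather drive such tools from a script, here is a minimal sketch that wraps one of them. It assumes the `pdftotext` utility (from Poppler or Xpdf) is installed; whether that matches the exact tools used in the post is an assumption, and the file names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None):
    """Build the argument list for pdftotext, keeping the page layout."""
    cmd = ["pdftotext", "-layout"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [pdf_path, txt_path]


def extract_text(pdf_path, txt_path, **pages):
    """Run pdftotext, failing early if the tool is not on the PATH."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, **pages), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(pdftotext_command("source.pdf", "source.txt", first_page=1, last_page=2))
```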

Google goes back to the future…

Saturday, August 31st, 2013

Google goes back to the future with SQL F1 database by Jack Clark.

From the post:

The tech world is turning back toward SQL, bringing to a close a possibly misspent half-decade in which startups courted developers with promises of infinite scalability and the finest imitation-Google tools available, and companies found themselves exposed to unstable data and poor guarantees.

The shift has been going on quietly for some time, and tech leader Google has been tussling with the drawbacks of non-relational and non ACID-compliant systems for years. That struggle has demanded the creation of a new system to handle data at scale, and on Tuesday at the Very Large Data Base (VLDB) conference, Google delivered a paper outlining its much-discussed “F1” system, which has replaced MySQL as the distributed heart of the company’s hugely lucrative AdWords platform.

The AdWords system includes “100s of applications and 1000s of users,” which all share a database over 100TB serving up “hundreds of thousands of requests per second, and runs SQL queries that scan tens of trillions of data rows per day,” Google said. And it’s got five nines of availability.


F1 uses some of Google’s most advanced technologies, such as BigTable and the planet-spanning “Spanner” database, which F1 servers are co-located with for optimum use. Google describes it as a “a hybrid, combining the best aspects of traditional relational databases and scalable NoSQL systems”.

I am wondering what the “…RDBMS doesn’t do X well parrots…” are going to say now?

The authors admit up front “trade-offs and sacrifices” were made. But when you meet your requirements while processing trillions of rows of data daily, you are entitled to “trade-offs and sacrifices.”

A very deep paper that will require background reading for most of us.

Looking forward to it.

OpenAGRIS 0.9 released:…

Friday, August 30th, 2013

OpenAGRIS 0.9 released: new functionalities, resources & look by Fabrizio Celli.

From the post:

The AGRIS team has released OpenAGRIS 0.9, a new version of the Web application that aggregates information from different Web sources to expand the AGRIS knowledge, providing as much data as possible about a topic or a bibliographical resource within the agricultural domain.

OpenAGRIS 0.9 contains new functionalities and resources, and received a new interface in English and Spanish, with French, Arabic, Chinese and Russian translations on their way.

Mission: To make information on agricultural research globally available, interlinked with other data resources (e.g. DBPedia, World Bank, Geopolitical Ontology, FAO fisheries dataset, AGRIS serials dataset etc.) following Linked Open Data principles, allowing users to access the full text of a publication and all the information the Web holds about a specific research area in the agricultural domain (1).

Curious what agricultural experts make of this resource?

As of today, the site claims 5,076,594 records. And with all the triple bulking up, some 134,276,804 triples based on those records.

What, roughly 26 triples per record?
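A quick check of that ratio, using only the two counts the site reports:

```python
records = 5_076_594
triples = 134_276_804

# Average number of triples generated per bibliographic record.
ratio = triples / records
print(f"{ratio:.2f} triples per record")
```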

Which is no mean feat but I wonder about the granularity of the information being offered?

That is, how useful is it to find 10,000 resources when each will take an hour to read?

More granular retrieval, that is, retrieval far below the level of a file or document, is going to be necessary to avoid repetition of human data mining.

Repetitive human data mining being one of the hallmarks of today’s search technology.

An ignored issue in Big Data analysis

Friday, August 30th, 2013

An ignored issue in Big Data analysis by Kaiser Fung.

Kaiser debunks a couple of recent stories that were powered, so it was said, by “analysis” of “big data.”

Short, highly amusing and worth your time to read.

If you practice this type of statistical analysis (or lack thereof) you need to also be using Bible codes. Or a Ouija Board.

Statistical Thinking: [free book]

Friday, August 30th, 2013

Statistical Thinking: A Simulation Approach to Modeling Uncertainty

From the post:

Catalyst Press has just released the second edition of the book Statistical Thinking: A Simulation Approach to Modeling Uncertainty. The material in the book is based on work related to the NSF-funded CATALST Project (DUE-0814433). It makes exclusive use of simulation to carry out inferential analyses. The material also builds on best practices and materials developed in statistics education, research and theory from cognitive science, as well as materials and methods that are successfully achieving parallel goals in other disciplines (e.g., mathematics and engineering education).

The materials in the book help students:

  • Build a foundation for statistical thinking through immersion in real world problems and data
  • Develop an appreciation for the use of data as evidence
  • Use simulation to address questions involving statistical inference including randomization tests and bootstrap intervals
  • Model and simulate data using TinkerPlots™ software

Definitely a volume for the short reading list.

Applicable in a number of areas, from debunking statistical arguments in public debates to developing useful models for your clients.
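For a taste of the simulation approach, here is a minimal sketch of a percentile bootstrap interval for a mean, one of the techniques the list above names. The book uses TinkerPlots, not Python; the data values, the number of resamples and the function name are all made up for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up measurements.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 12.4, 10.1]
low, high = bootstrap_mean_interval(sample)
print(f"sample mean {statistics.fmean(sample):.2f}, interval ({low:.2f}, {high:.2f})")
```

No formulas or distribution tables are involved: the interval comes entirely from resampling the observed data.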

Choosing a PostgreSQL text search method

Friday, August 30th, 2013

Choosing a PostgreSQL text search method by Craig Ringer.

From the post:

(This article is written with reference to PostgreSQL 9.3. If you’re using a newer version please check to make sure any limitations described remain in place.)

PostgreSQL offers several tools for searching and pattern matching text. The challenge is choosing which to use for a job. There’s:

There’s also SIMILAR TO, but we don’t speak of that in polite company, and PostgreSQL turns it into a regular expression anyway.

If you are thinking about running a PostgreSQL backend and need text searching, this will be a useful post for you.

I really appreciated Craig’s closing paragraph:

At no point did I try to determine whether LIKE or full-text search is faster for a given query. That’s because it usually doesn’t matter; they have different semantics. Which goes faster, a car or a boat? In most cases it doesn’t matter because speed isn’t your main selection criteria, it’s “goes on water” or “goes on land”.

Something to keep in mind when the “web scale” chorus comes along.

Most of the data of interest to me (not all) isn’t of web scale.

How about you?
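Craig's point about different semantics can be illustrated without a database. The sketch below is an analogy in plain Python, not PostgreSQL's actual behavior: `like_contains` mimics a substring match in the spirit of `ILIKE '%term%'`, while `token_match` mimics word-based matching in the spirit of full-text search. The documents and the toy stemmer are made up.

```python
import re

DOCS = ["The cat sat.", "Concatenate the lists.", "Two cats ran."]


def like_contains(doc, pattern):
    """Substring semantics: any occurrence of the characters counts."""
    return pattern.lower() in doc.lower()


def crude_stem(word):
    """Toy stemmer that strips a trailing 's'. Real stemmers do far more."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def token_match(doc, term):
    """Word semantics: compare normalized word tokens, not raw characters."""
    tokens = {crude_stem(w) for w in re.findall(r"[a-z]+", doc.lower())}
    return crude_stem(term.lower()) in tokens


substring_hits = [d for d in DOCS if like_contains(d, "cat")]
token_hits = [d for d in DOCS if token_match(d, "cat")]
print(substring_hits)
print(token_hits)
```

The substring search picks up "Concatenate", which the word-based search skips. Neither answer is wrong; they answer different questions, which is the car-versus-boat point.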

Parsing arbitrary Text-based Guitar Tab…

Thursday, August 29th, 2013

RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode” for ElasticSearch by Ryan Robitalle.

Guitar tab is a form of tablature, a form of music notation that records finger positions.

Surfing just briefly, there appears to be a lot of music available in “tab” format.

Deeply interesting post that will take some time to work through.

It is one of those odd things that may suddenly turn out to be very relevant (or not) in another domain.

Looking forward to spending some time with tablature data.
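As a first step with tablature data, here is a minimal sketch that turns a small ASCII tab into a time-ordered list of finger positions. It is not the RiffBank parser or its "RiffCode" format. The tab fragment is made up, and the sketch assumes well-formed lines of the shape `string|frets`.

```python
import re

# Made-up three-string fragment: digits are fret numbers, dashes are time.
TAB = """\
e|-----0-----|
B|---1---1---|
G|-2-------2-|
"""


def parse_tab(tab):
    """Return (column, string, fret) tuples ordered by position in time."""
    events = []
    for line in tab.splitlines():
        if "|" not in line:
            continue  # skip anything that is not a tab line
        string_name, body = line.split("|", 1)
        for match in re.finditer(r"\d+", body):
            events.append((match.start(), string_name.strip(), int(match.group())))
    return sorted(events)


notes = parse_tab(TAB)
for column, string_name, fret in notes:
    print(column, string_name, fret)
```

Real-world tab is far messier (chord names, lyrics, bends, uneven line lengths), which is presumably where most of the work in the post goes.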

Data Mining with Weka [Free MOOC]

Thursday, August 29th, 2013

Data Mining with Weka

From the webpage:

Welcome to the free online course Data Mining with Weka

This 5 week MOOC will introduce data mining concepts through practical experience with the free Weka tool.

The course features:

The course will start September 9, 2013, with enrolments now open.

An opportunity to both keep your mind in shape and learn something useful.

The need for data intuits who also know machine learning is increasing.

Are you going to be the pro from Dover or not?

Neo4j Cypher Refcard 2.0

Thursday, August 29th, 2013

Neo4j Cypher Refcard 2.0

This looks very useful.

If nobody else does, I will cast this into a traditional refcard format.

DSLs and Towers of Abstraction

Thursday, August 29th, 2013

DSLs and Towers of Abstraction by Gershom Bazerman.

From the description:

This talk will sketch some connections at the foundations of semantics (of programming languages, logics, formal systems in general). In various degrees of abbreviation, we will present Galois Connections, Lawvere Theories, adjoint functors and their relationship to syntax and semantics, and the core notion behind abstract interpretation. At each step we’ll draw connections, trying to show why these are good tools to think with even as we’re solving real world problems and building tools and libraries others will find simple and elegant to use.

Further reading:

If your mind has gotten flabby over the summer, this presentation will start to get it back in shape.

You may get swept along in the speaker’s enthusiasm.

Very high marks!

A Set of Hadoop-related Icons

Wednesday, August 28th, 2013

A Set of Hadoop-related Icons by Marc Holmes.

From the post:

The best architecture diagrams are those that impart the intended knowledge with maximum efficiency and minimum ambiguity. But sometimes there’s a need to add a little pizazz, and maybe even draw a picture or two for those Powerpoint moments.

Marc introduces a small set of Hadoop-related icons.

It will be interesting to see if these icons catch on as the defaults for Hadoop-related presentations.

Would be nice to have something similar for topic maps, if there are any artistic topic mappers in the audience.

BASE indexed 50 million OAI-records

Wednesday, August 28th, 2013

BASE indexed 50 million OAI-records by Sarah Dister.

From the post:

BASE, a search engine for academic open access web resources, has indexed more than 50,000,000 OAI-records. The records are provided by about 2,700 repositories among which many are related to agriculture.

BASE is a multi-disciplinary search engine for academically relevant OAI-Sources worldwide, which was created and developed by Bielefeld University Library.

Take a few minutes (or longer) to explore BASE.

It is a remarkable resource. For example, users can invoke the Eurovoc Thesaurus as part of their search query.

the BOMB in the GARDEN

Wednesday, August 28th, 2013

the BOMB in the GARDEN by Matthew Butterick.

From the post:

It’s now or never for the web. The web is a medium for creators, including designers. But after 20 years, the web still has no culture of design excellence. Why is that? Because design excellence is inhibited by two structural flaws in the web. First flaw: the web is good at making information free, but terrible at making it expensive. So the web has had to rely largely on an advertising economy, which is weakening under the strain. Second flaw: the process of adopting and enforcing web standards, as led by the W3C, is hopelessly broken. Evidence of both these flaws can be seen in a) the low design quality across the web, and b) the speed with which publishers, developers, and readers are migrating away from the web, and toward app platforms and media platforms. This evidence strongly suggests that the web is on its way to becoming a second-class platform. To address these flaws, I propose that the W3C be disbanded, and that the leadership of the web be reorganized around open-source software principles. I also encourage designers to advocate for a better web, lest they find themselves confined to a shrinking territory of possibilities.

Apologies to Matthew for my mangling of the typography of his title.

This rocks!

This is one of those rare, read-this-at-least-once-a-month posts.

That is, if you want to see a Web that supports high quality design and content.

If you like the current low quality, ad driven Web, just ignore it.

Computer Music Journal

Wednesday, August 28th, 2013

Computer Music Journal

After seeing Chris Ford’s presentation, I went looking for other computer music related material.

The Computer Music Journal is a pay-per-view journal out of MIT.

The Computer Music Journal link at the top of this post is a companion site that has a computer music biography and computer music links, organized by subjects.

If you are interested in computer music, this could be a very rich resource.

Functional Composition [Coding and Music]

Wednesday, August 28th, 2013

Functional Composition by Chris Ford.

From the summary:

Chris Ford shows how to make music starting with the basic building block of sound, the sine wave, and gradually accumulating abstractions culminating in a canon by Johann Sebastian Bach.

You can grab the source on Github.

Truly a performance presentation!


Chris not only plays music with an instrument, he also writes code to alter music as it is being played on a loop.

Steady hands if nothing else in front of a live audience!

Perhaps a great way to interest people in functional programming.

Certainly a great way to encode historical music that is hard to find performed.
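The "start from a sine wave and accumulate abstractions" idea can be sketched in a few lines. This is not Chris's code (his talk uses a different language and libraries); the sample rate, the function names and the choice of notes are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)


def sine(freq_hz, seconds, amplitude=1.0):
    """The basic building block: samples of a pure sine tone."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def note(semitones_from_a4, seconds):
    """First abstraction: a pitch in equal temperament relative to A4 = 440 Hz."""
    return sine(440.0 * 2 ** (semitones_from_a4 / 12), seconds)


def mix(*waves):
    """Second abstraction: play several waves at once by averaging samples."""
    return [sum(samples) / len(waves) for samples in zip(*waves)]


# Notes in sequence (a melody) and in parallel (an A major triad).
melody = note(0, 0.25) + note(4, 0.25) + note(7, 0.25)
chord = mix(note(0, 0.5), note(4, 0.5), note(7, 0.5))
print(len(melody), len(chord))
```

The resulting sample lists could be written out with the standard `wave` module to hear them.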

NoSQL Listener

Wednesday, August 28th, 2013

NoSQL Listener

From the webpage:

Aggregating NoSQL news from Twitter, from your friends at Cloudant

What Twitter streams do you want to capture and post online (or process into a topic map)?

You can fork this project at GitHub.

Here’s a research idea:

Capture tweets on a possible U.S.-led conflict and separate out those from a geographic plot around the Pentagon.

Do the tweet levels or tone track U.S. military action?
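The geographic separation could start with something as simple as a distance filter. A minimal sketch, assuming each captured tweet carries a latitude/longitude pair: the coordinates for the Pentagon are approximate, the 5 km radius is arbitrary, and the sample tweets are made up.

```python
import math

PENTAGON = (38.8719, -77.0563)  # approximate latitude, longitude
EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def near(tweets, center, radius_km):
    """Keep only the tweets whose coordinates fall within the radius."""
    return [t for t in tweets if haversine_km(t["coords"], center) <= radius_km]


# Made-up geotagged tweets.
tweets = [
    {"text": "late night at the office again", "coords": (38.8700, -77.0550)},
    {"text": "coffee in Manhattan", "coords": (40.7128, -74.0060)},
]
nearby = near(tweets, PENTAGON, radius_km=5)
print([t["text"] for t in nearby])
```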

Casting SPELs In LISP

Wednesday, August 28th, 2013

Casting SPELs In LISP by Conrad Barski, M.D.

From the homepage:

Anyone who has ever learned to program in Lisp will tell you it is very different from any other programming language. It is different in lots of surprising ways- This comic book will let you find out how Lisp’s unique design makes it so powerful!

There are other language versions, Emacs Lisp, Clojure Lisp and Turkish.

Understand I am just taking Dr. Barski’s word for the Turkish version being the same as the original text. I don’t read Turkish.

If you prefer playful ways to learn a computer language, this should be a winner for you!


Wednesday, August 28th, 2013


From the about:

CORE (COnnecting REpositories) aims to facilitate free access to scholarly publications distributed across many systems. As of today, CORE gives you access to millions of scholarly articles aggregated from many Open Access repositories.

We believe in free access to information. The mission of CORE is to:

  • Support the right of citizens and general public to access the results of research towards which they contributed by paying taxes.
  • Facilitate access to Open Access content for all by targeting general public, software developers, researchers, etc., by improving search and navigation using state-of-the-art technologies in the field of natural language processing and the Semantic Web.
  • Provide support to both content consumers and content providers by working with digital libraries and institutional repositories.
  • Contribute to a cultural change by promoting Open Access.

BTW, CORE also allows you to harvest their data.

As of today, August 28, 2013, 13,639,485 articles.

Excellent resource for scholarly publications!

Not to mention a useful yardstick for other publication indexing projects.

What does your indexing project offer that CORE does not?

That is, rather than duplicating indexing we already possess, where is the value-add of your indexing?

Building a distributed search system

Wednesday, August 28th, 2013

Building a distributed search system with Apache Hadoop and Lucene by Mirko Calvaresi.

From the preface:

This work analyses the problem coming from the so called Big Data scenario, which can be defined as the technological challenge to manage and administer quantity of information with global dimension in the order of Terabyte (10^12 bytes) or Petabyte (10^15 bytes) and with an exponential growth rate. We’ll explore a technological and algorithmic approach to handle and calculate theses amounts of data that exceed the limit of computation of a traditional architecture based on real-time request processing: in particular we’ll analyze a singular open source technology, called Apache Hadoop, which implements the approach described as Map and Reduce.

We’ll describe also how to distribute a cluster of common server to create a Virtual File System and use this environment to populate a centralized search index (realized using another open source technology, called Apache Lucene). The practical implementation will be a web based application which offers to the user a unified searching interface against a collection of technical papers. The scope is to demonstrate that a performant search system can be obtained pre-processing the data using the Map and Reduce paradigm, in order to obtain a real time response, which is independent to the underlying amount of data. Finally we’ll compare this solutions to different approaches based on clusterization or No SQL solutions, with the scope to describe the characteristics of concrete scenarios, which suggest the adoption of those technologies.

Fairly complete (75 pages) report on a project indexing academic papers with Lucene and Hadoop.
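The core idea, pre-processing documents with a map step and a reduce step so that queries become cheap lookups, can be sketched in memory. This is plain Python, not Hadoop or Lucene and not the report's implementation; the two "papers" and the function names are made up.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word.strip(".,;:!?"), doc_id


def reduce_phase(pairs):
    """Reduce: group the pairs by word into an inverted index."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        if word:
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Made-up document collection.
papers = {
    1: "Distributed search with Lucene.",
    2: "Search at scale, with Hadoop.",
}
pairs = [pair for doc_id, text in papers.items() for pair in map_phase(doc_id, text)]
index = reduce_phase(pairs)


def search(term):
    """Query time: a dictionary lookup, independent of collection size."""
    return index.get(term.lower(), [])


print(search("search"), search("hadoop"))
```

The expensive work happens once, ahead of time, which is why the query response does not depend on the amount of underlying data.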

I would like to see treatment of the voiced demand for “real-time processing” versus the need for “real-time processing.”

When I started using research tools, indexes like the Readers’ Guide to Periodical Literature were, at a minimum, two (2) weeks behind popular journals.

Academic indexes ran that far behind if not a good bit longer.

The timeliness of indexing journal articles is now nearly simultaneous with publication.

Has the quality of our research improved due to faster access?

I can imagine use cases, drug interactions for example, the discovery of which should be streamed out as soon as practical.

But drug interactions are not the average case.

It would be very helpful to see research on which factors favor “real-time” solutions and which cases are quite sufficiently served by “non-real-time” solutions.

Selling Data Science [Topic Maps]

Tuesday, August 27th, 2013

Selling Data Science by Sean Gonzalez.

From the post:

Data Science is said to include statisticians, mathematicians, machine learning experts, algorithm experts, visualization ninjas, etc., and while these objective theories may be useful in recognizing necessary skills, selling our ideas is about execution. Ironically there are plenty of sales theories and guidelines, such as SPIN selling, the iconic ABC scene from boiler room, or my personal favorite from Glengarry Glenross, that tell us what we should be doing, what questions we should be asking, how a sale should progress, and of course how to close, but none of these address the thoughts we may be wrestling with as we navigate conversations. We don’t necessarily mean to complicate things, we just become accustomed to working with other data science types, but we still must reconcile how we communicate with our peers versus people in other walks of life who are often geniuses in their own right.

First in what Sean promises is a series of posts on how to sell data science.

I am sure the lessons will be equally applicable to selling topic maps.

I am not expecting magic bullets but it is a series of posts that I will follow.


The Blue Obelisk Data Repository’s 10 release

Tuesday, August 27th, 2013

The Blue Obelisk Data Repository’s 10 release by Egon Willighagen.

From the post:

The Blue Obelisk Data Repository (BODR) is not so high profile as other Blue Obelisk projects, but equally important. Well, maybe a tid bit more important: it’s a collection of core chemical and physical data, supporting computation chemistry and cheminformatics resources. For example, it is used by at least the CDK, Kalzium, and Bioclipse, but possibly more. Also, it’s packages for major Linux distributions, such as Debian (btw, congrats to their 20th birthday!) and Ubuntu.

It doesn’t change so often, but just has seen its 10th release. Actually, it was the first release in more than three years. But, fortunately, core chemical facts do not change often, nor much. So, this release has a number of data fixes, a few recent experimental isotope measurements, and also includes the new official names of the livermorium and flerovium elements. There is a full overview of changes.


If this is one of the lesser known Blue Obelisk projects, I have to take a look at the other ones!

Astropy: A Community Python Package for Astronomy

Tuesday, August 27th, 2013

Astropy: A Community Python Package for Astronomy by Bruce Berriman.

From the post:

The rapid adoption of Python by the astronomical community was starting to make it a victim of its own success, with fragmented development of Python packages across different groups. Thus began the Astropy project began in 2011, with an ambitious goal to coordinate Python development across various groups and simplify installation and usage for astronomers. These ambitious goals have been met and are summarized in the paper Astropy: A Community Python Package for Astronomy, prepared by the Astropy Collaboration. The Astropy webpage provides download and build instructions for the current release, version 0.2.4, and complete documentation. It is released under a “3-clause” BSD-type license – the package may be used for any purpose, as long as the copyright is acknowledged and warranty disclaimers are given.

Get the paper and the code. Both will repay your study well.

The only good Python story I know was from a programmer who lamented the ability of Python to scale.

He wrote a sample program in Python for a customer, anticipating they would return for the production version.

But the sample program handled their needs so well, they had no need for the production version.

I am sure Python was due some of the credit, but the programmer is a James Clark level programmer, so his skills contributed to the result as well.

Analytics and Machine Learning at Scale [Aug. 29-30]

Tuesday, August 27th, 2013

AMP Camp Three – Analytics and Machine Learning at Scale

From the webpage:

AMP Camp Three – Analytics and Machine Learning at Scale will be held in Berkeley California, August 29-30, 2013. AMP Camp 3 attendees and online viewers will learn to solve big data problems using components of the Berkeley Data Analytics Stack (BDAS) and cutting edge machine learning algorithms.

Live streaming!

Sessions will cover (among other things): Mesos, Spark, Shark, Spark Streaming, BlinkDB, MLbase, Tachyon and GraphX.

Talk about a jolt before the weekend!

Classification of handwritten digits

Tuesday, August 27th, 2013

Classification of handwritten digits

From the post:

In this blog post I show some experiments with algorithmic recognition of images of handwritten digits.

I followed the algorithm described in Chapter 10 of the book “Matrix Methods in Data Mining and Pattern Recognition” by Lars Elden.

The algorithm described uses the so called thin Singular Value Decomposition (SVD).

An interesting introduction to a traditional machine learning exercise.

Not to mention the use of Mathematica, a standard tool for mathematical analysis.

You do know they have a personal version for home use? List price as of today is $295 to purchase a copy.
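If you have neither Mathematica nor a linear algebra library at hand, the flavor of the method can still be sketched in plain Python. The sketch below is a simplification, not the algorithm as given in the book or the post: it keeps only the first left singular vector per class (found by power iteration) instead of a thin SVD basis of several vectors, and the 3x3 "images" are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def dominant_direction(images, iterations=50):
    """First left singular vector of A (images as columns), via power
    iteration on A A^T."""
    u = normalize([sum(pixel) for pixel in zip(*images)])
    for _ in range(iterations):
        coeffs = [dot(img, u) for img in images]  # A^T u
        u = normalize([sum(c * img[i] for c, img in zip(coeffs, images))
                       for i in range(len(u))])   # A (A^T u)
    return u


def residual(z, u):
    """Distance from z to its projection onto the line spanned by u."""
    proj = dot(z, u)
    return math.sqrt(sum((zi - proj * ui) ** 2 for zi, ui in zip(z, u)))


def classify(z, bases):
    """Pick the class whose basis leaves the smallest residual."""
    return min(bases, key=lambda label: residual(z, bases[label]))


# Made-up 3x3 training images, flattened row by row.
training = {
    "vertical": [[0, 1, 0, 0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0, 0, 1, 1]],
    "horizontal": [[0, 0, 0, 1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1, 0, 0, 1]],
}
bases = {label: dominant_direction(imgs) for label, imgs in training.items()}

print(classify([0, 1, 0, 0, 1, 0, 1, 1, 0], bases))
print(classify([0, 0, 1, 1, 1, 1, 0, 0, 0], bases))
```

With real digit images you would keep several singular vectors per class and measure the residual against that whole subspace.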

MongoDB Training

Tuesday, August 27th, 2013

Free Online MongoDB Training

Classes include:

  • MongoDB for Java Developers
  • MongoDB for Node.js Developers
  • MongoDB for Developers
  • MongoDB for DBAs

The Fall semester is getting closer and you are thinking about classes, football, dates, parties, ….

MongoDB University can’t help you with the last three but it does have free classes.

You have to handle the other stuff on your own. 😉

PS: What books do you see next to the programmer in the picture? I see a C++ Nutshell book next to “The C Programming Language.” Anything else you recognize?

Why Computer Security Fails

Tuesday, August 27th, 2013

I was reading the source document in: DHS Bridging Siloed Databases [Comments?] when I encountered a possible reason for the Snowden security breach.

Records in this system are stored electronically in secure facilities in a locked drawer behind a locked door. The records may be stored on magnetic disc, tape, or digital media.

You might want to read that again:

Records in this system are stored electronically in secure facilities in a locked drawer behind a locked door. The records may be stored on magnetic disc, tape, or digital media.

Something about storing records electronically “…in a locked drawer behind a locked door” tips me off to the writer not having a clear idea about computer security.

Here is one document that has this language:

DEPARTMENT OF THE TREASURY Fiscal Service Privacy Act of 1974, as Amended; System of Records Notice AGENCY: Financial Management Service, Fiscal Service, Treasury. ACTION: Notice of systems of records.

Which covered:

CATEGORIES OF RECORDS IN THE SYSTEM: (1) Motor Vehicle Accident Reports. (2) Parking Permits. (3) Distribution lists of individuals requesting various Treasury publications. (4) Treasury Credentials.

And it reads:

Records in this system are stored electronically or on paper in secure facilities in a locked drawer behind a locked door. (emphasis added)

For paper records, ok. For electronic records, not so hot.

I’m not real sure what “a locked drawer behind a locked door” would mean for electronic records. Assuming anyone wanted to use or search the records. Maybe you could put them on a thumb-drive. ;-)

Update: One of my regular correspondents will accuse me of being obscure: Why Computer Security Fails? Ignorance. It’s just that simple.

Video lectures & presentations about Clojure

Monday, August 26th, 2013

Video lectures & presentations about Clojure by Alex Ott.

From the webpage:

On this page I tried to collect links to all existing video materials about Clojure — video-lectures & tutorials, presentations at conferences, etc. if you have more links to video materials, please leave them in comments to this page!

You will find the following types of content:

  • Lectures, Tutorials and Screencasts
  • Videos from Clojure user groups
  • ClojureScript
  • Clojure-related
  • Datomic

Bookmark this one!