Big data in minutes with the ELK Stack

November 21st, 2014

Big data in minutes with the ELK Stack by Philippe Creux.

From the post:

We’ve built a data analysis and dashboarding infrastructure for one of our clients over the past few weeks. They collect about 10 million data points a day. Yes, that’s big data.

My highest priority was to allow them to browse the data they collect so that they can ensure that the data points are consistent and contain all the attributes required to generate the reports and dashboards they need.

I chose to give the ELK stack a try: ElasticSearch, logstash and Kibana.

Is it just me or does processing “big data” seem to have gotten easier over the past several years?

But however easy or hard the processing, the value-add question is what do we know post data processing that we didn’t know before?

Building an AirPair Dashboard Using Apache Spark and Clojure

November 21st, 2014

Building an AirPair Dashboard Using Apache Spark and Clojure by Sébastien Arnaud.

From the post:

Have you ever wondered how much you should charge per hour for your skills as an expert on How many opportunities are available on this platform? Which skills are in high demand? My name is Sébastien Arnaud and I am going to introduce you to the basics of how to gather, refine and analyze publicly available data using Apache Spark with Clojure, while attempting to generate basic insights about the AirPair platform.

1.1 Objectives

Here are some of the basic questions we will be trying to answer through this tutorial:

  • What is the lowest and the highest hourly rate?
  • What is the average hourly rate?
  • What is the most sought out skill?
  • What is the skill that pays the most on average?

1.2 Chosen tools

In order to make this journey a lot more interesting, we are going to step out of the usual comfort zone of using a standard iPython notebook using pandas with matplotlib. Instead we are going to explore one of the hottest data technologies of the year: Apache Spark, along with one of the up and coming functional languages of the moment: Clojure.

As it turns out, these two technologies can work relatively well together thanks to Flambo, a library developed and maintained by YieldBot's engineers who have improved upon clj-spark (an earlier attempt from the Climate Corporation). Finally, in order to share the results, we will attempt to build a small dashboard using DucksBoard, which makes pushing data to a public dashboard easy.

In addition to illustrating the use of Apache Spark and Clojure, Sébastien also covers harvesting data from Twitter and processing it into a useful format.

Definitely worth some time over a weekend!

Deep Visual-Semantic Alignments for Generating Image Descriptions

November 21st, 2014

Deep Visual-Semantic Alignments for Generating Image Descriptions by Andrej Karpathy and Li Fei-Fei.

From the webpage:

We present a model that generates free-form natural language descriptions of image regions. Our model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data. Our approach is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate the effectiveness of our alignment model with ranking experiments on Flickr8K, Flickr30K and COCO datasets, where we substantially improve on the state of the art. We then show that the sentences created by our generative model outperform retrieval baselines on the three aforementioned datasets and a new dataset of region-level annotations.

Excellent examples with generated text. Code and other predictions “coming soon.”

For the moment you can also read the research paper: Deep Visual-Semantic Alignments for Generating Image Descriptions

Serious potential in any event but even more so if the semantics of the descriptions could be captured and mapped across natural languages.

Land Matrix

November 21st, 2014

Land Matrix: The Online Public Database on Land Deals

From the webpage:

The Land Matrix is a global and independent land monitoring initiative that promotes transparency and accountability in decisions over land and investment.

This website is our Global Observatory – an open tool for collecting and visualising information about large-scale land acquisitions.

The data represented here is constantly evolving; to make this resource more accurate and comprehensive, we encourage your participation.

The deals collected as data must meet the following criteria:

  • Entail a transfer of rights to use, control or ownership of land through sale, lease or concession;
  • Have been initiated since the year 2000;
  • Cover an area of 200 hectares or more;
  • Imply the potential conversion of land from smallholder production, local community use or important ecosystem service provision to commercial use.

FYI, 200 hectares = 2 square kilometers.

Land ownership and its transfer are matters of law and law means government.

The project describes its data this way:

The dataset is inherently unreliable, but over time it is expected to become more accurate. Land deals are notoriously un-transparent. In many countries, established procedures for decision-making on land deals do not exist, and negotiations and decisions do not take place in the public realm. Furthermore, a range of government agencies and levels of government are usually responsible for approving different kinds of land deals. Even official data sources in the same country can therefore vary, and none may actually reflect reality on the ground. Decisions are often changed, and this may or may not be communicated publically.

I would start earlier than the year 2000 but the same techniques could be applied along the route of the Keystone XL pipeline. I am assuming that you are aware that pipelines, roads and other public works are not located purely for physical or aesthetic reasons. Yes?

Please take the time to view and support the Land Matrix project and consider similar projects in your community.

If the owners can be run to ground, you may find the parties to the transactions are linked by other “associations.”

October 2014 Crawl Archive Available

November 21st, 2014

October 2014 Crawl Archive Available by Stephen Merity.

From the post:

The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-42/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or to each line, you end up with the S3 and HTTP paths respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

Just in time for weekend exploration! ;-)


CERN frees LHC data

November 21st, 2014

CERN frees LHC data

From the post:

Today CERN launched its Open Data Portal, which makes data from real collision events produced by LHC experiments available to the public for the first time.

“Data from the LHC program are among the most precious assets of the LHC experiments, that today we start sharing openly with the world,” says CERN Director General Rolf Heuer. “We hope these open data will support and inspire the global research community, including students and citizen scientists.”

The LHC collaborations will continue to release collision data over the coming years.

The first high-level and analyzable collision data openly released come from the CMS experiment and were originally collected in 2010 during the first LHC run. Open source software to read and analyze the data is also available, together with the corresponding documentation. The CMS collaboration is committed to releasing its data three years after collection, after they have been thoroughly studied by the collaboration.

“This is all new and we are curious to see how the data will be re-used,” says CMS data preservation coordinator Kati Lassila-Perini. “We’ve prepared tools and examples of different levels of complexity from simplified analysis to ready-to-use online applications. We hope these examples will stimulate the creativity of external users.”

In parallel, the CERN Open Data Portal gives access to additional event data sets from the ALICE, ATLAS, CMS and LHCb collaborations that have been prepared for educational purposes. These resources are accompanied by visualization tools.

All data on are shared under a Creative Commons CC0 public domain dedication. Data and software are assigned unique DOI identifiers to make them citable in scientific articles. And software is released under open source licenses. The CERN Open Data Portal is built on the open-source Invenio Digital Library software, which powers other CERN Open Science tools and initiatives.

Awesome is the only term for this data release!

But, when you dig just a little bit further, you discover that embargoes still exist on three (3) out of (4) experiments. Both on data and software.

Disappointing but hopefully a dying practice when it comes to publicly funded data.

I first saw this in a tweet by Ben Evans.

Apache Hive on Apache Spark: The First Demo

November 21st, 2014

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and a provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up on your technology to master list!

Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use

November 21st, 2014

Beyond You’re vs. Your: A Grammar Cheat Sheet Even The Pros Can Use by Hayley Mullen.

From the post:

Grammar is one of those funny things that sparks a wide range of reactions from different people. While one person couldn’t care less about colons vs. semicolons, another person will have a visceral reaction to a misplaced apostrophe or a “there” where a “their” is needed (if you fall into the latter category, hello and welcome).

I think we can still all agree on one thing: poor grammar and spelling takes away from your message and credibility. In the worst case, a blog post rife with errors will cause you to think twice about how knowledgeable the person who wrote it really is. In lesser cases, a “then” where a “than” should be is just distracting and reflects poorly on your editing skills. Which is a bummer.

More than the ills mentioned by Hayley, poor writing is hard to understand. Using standards or creating topic maps is hard enough without having to decipher poor writing.

If you already write well, a refresher never hurts. If you don’t write so well, take Hayley’s post to heart and learn from it.

There are errors in standards that tend to occur over and over again. Perhaps I should write a cheat sheet for common standard writing errors. Possible entries: Avoiding Definite Article Abuse, Saying It Once Is Enough, etc.

Big bang of npm

November 21st, 2014

Big bang of npm

From the webpage:

npm is the largest package manager for javascript. This visualization gives you a small spaceship to explore the universe from inside. 106,317 stars (packages), 235,887 connections (dependencies).

Use WASD keys to move around. If you are browsing this with a modern smartphone – rotate your device around to control the camera (WebGL is required).

Navigation and other functions weren’t intuitive, at least not to me:

W – zooms in.

A – pans left.

S – zooms out.

D – pans right.

L – toggles links.

Choosing dependencies or dependents (lower left) filters the current view to show only dependencies or dependents of the chosen package.

Choosing a package name on lower left take you to the page for that package.

Search box at the top has a nice drop down of possible matches and displays dependencies or dependents by name, when selected below.

I would prefer more clues on the main display but given the density of the graph, that would quickly render it unusable.

Perhaps a way to toggle package names when displaying only a portion of the graph?

Users would have to practice with it but this technique could be very useful for displaying dense graphs. Say a map of the known contributions by lobbyists to members of Congress for example. ;-)

I first saw this in a tweet by Lincoln Mullen.

Clojure/conf 2014 Videos!

November 21st, 2014

As of November 21, 2014, 09:00 AM EST, the following Clojure/conf 2014 videos have been posted to ClojureTV:

Bozhidar Batsov – The evolution of the Emacs tooling for Clojure

Paul deGrandis – Unlocking data-driven systems

Colin Fleming – Cursive: A different type of IDE

Rich Hickey – Inside Transducers

Steve Miner – Generating Generators

Anna Pawlicka – Om nom nom nom

Ghadi Shayban – JVM Creature Comforts

Steven Yi – Developing Music Systems on the JVM with Pink and Score

As more videos are posted, I will update this listing.


FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

November 20th, 2014

FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm

From the post:

A legal battle between Yahoo and the government over the Protect America Act took place in 2008, but details (forced from the government’s Top Secret file folders by FISA Judge Reggie Walton) are only emerging now. A total of 1,500 pages will eventually make their way into the public domain once redactions have been applied. The most recent release is a transcript [pdf link] of oral arguments presented by Yahoo’s counsel (Mark Zwillinger) and the US Solicitor General (Gregory Garre).

Cutting to the chase:

But the most surprising assertions made in these oral arguments don’t come from the Solicitor General. They come from Judge Morris S. Arnold, who shows something nearing disdain for the privacy of the American public and their Fourth Amendment rights.

In the first few pages of the oral arguments, while discussing whether or not secret surveillance actually harms US citizens (or the companies forced to comply with government orders), Arnold pulls a complete Mike Rogers:

If this order is enforced and it’s secret, how can you be hurt? The people don’t know that — that they’re being monitored in some way. How can you be harmed by it? I mean, what’s –what’s the — what’s your — what’s the damage to your consumer?

By the same logic, all sorts of secret surveillance would be OK — like watching your neighbor’s wife undress through the window, or placing a hidden camera in the restroom — as long as the surveilled party is never made aware of it. If you don’t know it’s happening, then there’s nothing wrong with it. Right? [h/t to Alex Stamos]

In the next astounding quote, Arnold makes the case that the Fourth Amendment doesn’t stipulate the use of warrants for searches because it’s not written right up on top in bold caps… or something.

The whole thrust of the development of Fourth Amendment law has sort of emphasized the watchdog function of the judiciary. If you just look at the Fourth Amendment, there’s nothing in there that really says that a warrant is usually required. It doesn’t say that at all, and the warrant clause is at the bottom end of the Fourth Amendment, and — but that’s the way — that’s the way it has been interpreted.

What’s standing between US citizens and unconstitutional acts by their government is a very thin wall indeed.

Bear in mind that you are not harmed if you don’t know you are being spied upon.

I guess the new slogan is: Don’t Ask, Don’t Look, Don’t Worry.


UK seeks to shutter Russian site streaming video from webcams

November 20th, 2014

UK seeks to shutter Russian site streaming video from webcams by Barb Darrow.

From the post:

If you feel like someone’s watching you, you might be right.

A mega peeping Tom site out of Russia is collecting video and images from poorly secured webcams, closed-circuit TV cameras and even baby monitors around the world and is streaming the results. And now Christopher Graham, the U.K.’s information commissioner, wants to shut it down, according to this Guardian report.

According to the Guardian, Graham wants the Russian government to put the kibosh on the site and if that doesn’t happen will work with other regulators, including the U.S. Federal Trade Commission, to step in.

Earlier this month a NetworkWorld blogger wrote about a site, presumably the same one mentioned by Graham, with a Russian IP address that accesses some 73,000 unsecured security cameras.

The site has a pretty impressive inventory of images it said were gleaned from Foscam, Linksys, Panasonic security cameras, other unnamed “IP cameras” and AvTech and Hikvision DVRs, according to that post. The site was purportedly set up to illustrate the importance of updating default security passwords.

Apologies but it looks like the site is offline at the moment. Perhaps overload from visitors given the publicity.

An important reminder that security begins at home and with the most basic steps, such as changing default passwords.

Only if you access the site and find out that you have been spied upon will you suffer any harm.

I am completely serious, only if you discover you have been spied upon can you suffer any harm.

Authority for that statement? FISA Judge To Yahoo: If US Citizens Don’t Know They’re Being Surveilled, There’s No Harm.

Over 1,000 research data repositories indexed in

November 20th, 2014

Over 1,000 research data repositories indexed in

From the post:

In August 2012 – the Registry of Research Data Repositories went online with 23 entries. Two years later the registry provides researchers, funding organisations, libraries and publishers with over 1,000 listed research data repositories from all over the world making it the largest and most comprehensive online catalog of research data repositories on the web. provides detailed information about the research data repositories, and its distinctive icons help researchers easily identify relevant repositories for accessing and depositing data sets.

To more than 5,000 unique visitors per month offers reliable orientation in the heterogeneous landscape of research data repositories. An average of 10 repositories are added to the registry every week. The latest indexed data infrastructure is the new CERN Open Data Portal.

Add to your short list of major data repositories!

Senate Republicans are getting ready to declare war on patent trolls

November 20th, 2014

Senate Republicans are getting ready to declare war on patent trolls by Timothy B. Lee

From the post:

Republicans are about to take control of the US Senate. And when they do, one of the big items on their agenda will be the fight against patent trolls.

In a Wednesday speech on the Senate floor, Sen. Orrin Hatch (R-UT) outlined a proposal to stop abusive patent lawsuits. “Patent trolls – which are often shell companies that do not make or sell anything – are crippling innovation and growth across all sectors of our economy,” Hatch said.

Hatch, the longest-serving Republican in the US Senate , is far from the only Republican in Congress who is enthusiastic about patent reform. The incoming Republican chairmen of both the House and Senate Judiciary committees have signaled their support for patent legislation. And they largely see eye to eye with President Obama, who has also called for reform.

“We must improve the quality of patents issued by the U.S. Patent and Trademark Office,” Hatch said. “Low-quality patents are essential to a patent troll’s business model.” His speech was short on specifics here, but one approach he endorsed was better funding for the patent office. That, he argued, would allow “more and better-trained patent examiners, more complete libraries of prior art, and greater access to modern information technologies to address the agency’s growing needs.”

I would hate to agree with Senator Hatch on anything but there is no doubt that low-quality patents are rife at the U.S. Patent and Trademark Office. Whether patent trolls simply took advantage of the quality of patents or are responsible for low quality patents it’s hard to say.

In any event, the call for “…more complete libraries of prior art, and greater access to modern information technologies…” sounds like a business opportunity for topic maps.

After all, we all know that faster, more comprehensive search engines of the patent literature only gives you more material to review. It doesn’t give you more relevant material to review. Or give you material you did not know to look for. Only additional semantics has the power to accomplish either of those tasks.

There are those who will keep beating bags of words in hopes that semantics will appear.

Don’t be one of those. Choose an area of patents of interest and use interactive text mining to annotate existing terms with semantics (subject identity) which will reduce misses and increase the usefulness of “hits.”

That isn’t a recipe for mining all existing patents but who wants to do that? If you gain a large enough semantic advantage in genomics, semiconductors, etc., the start-up cost to catch up will be a tough nut to crack. Particularly since you are already selling a better product for a lower price than a start-up can match.

I first saw this in a tweet by Tim O’Reilly.

PS: A better solution for software patent trolls would be a Supreme Court ruling that eliminates all software patents. Then Congress could pass a software copyright bill that grants copyright status on published code for three (3) years, non-renewable. If that sounds harsh, consider the credibility impact of nineteen year old bugs.

If code had to be recast every three years and all vendors were on the same footing, there would be a commercial incentive for better software. Yes? If I had the coding advantages of a major vendor, I would start lobbying for three (3) year software copyrights tomorrow. Besides, it would make software piracy a lot easier to track.

Conflict of Interest – Reversing the Definition

November 20th, 2014

Just a quick heads up that the semantics of “conflict of interest” has changed, at least in the context of the US House of Representatives.

Traditionally, the meaning of “conflict of interest” is captured by Wikipedia’s one-liner:

A conflict of interest (COI) is a situation occurring when an individual or organization is involved in multiple interests, one of which could possibly corrupt the motivation.

That seems fairly straight forward.

However, in H.R.1422 — 113th Congress (2013-2014), passed on 11/18/2014, the House authorized paid representatives of industry interest to be appointed to the EPA Advisory Board, saying:

SEC. 2. SCIENCE ADVISORY BOARD.(b)(2)(C) – persons with substantial and relevant expertise are not excluded from the Board due to affiliation with or representation of entities that may have a potential interest in the Board’s advisory activities, so long as that interest is fully disclosed to the Administrator and the public and appointment to the Board complies with section 208 of title 18, United States Code;

So, the House of Representatives has just reversed the standard definition of “conflict of interest” to say that hired guns of industry players have no “conflict of interest” sitting on the EPA Science Board, so long as they say they are hired guns.

I thought I was fairly hardened to hearing bizarre things out of government but reversing the definition of “conflict of interest” is a new one on me.

The science board is supposed to be composed of scientists, unsurprisingly. Scientists, by the very nature of their profession, do science. Experiments, reports, projects, etc. And no surprise the scientists on the EPA science panel work on … that’s right, environment science.

Care to guess who H.R.1422 prohibits from certain advisory activities?


Board members may not participate in advisory activities that directly or indirectly involve review or evaluation of their own work;

Scientists are excluded from advisory activities where they have expertise.

Being an expert in a field is a “conflict of interest” and being a hired gun is not (so long as being a hired gun is disclosed).

So the revision of “conflict of interest” is even worse than I thought.

I don’t have the heart to amend the Wikipedia article on conflict of interest. Would someone do that for me?

PS: I first saw this at House Republicans just passed a bill forbidding scientists from advising the EPA on their own research by Lindsay Abrams. Lindsay does a great job summarizing the issues and links to the legislation. I followed her link to the bill and reported just the legislative language. I think that is chilling enough.

PPS: Did your member of the House of Representative vote for this bill?

I first saw this in a tweet by Wilson da Silva.

Video: Experts Share Perspectives on Web Annotation

November 20th, 2014

Video: Experts Share Perspectives on Web Annotation by Gary Price.

From the post:

The topic of web annotation continues to grow in interest and importance.

Here’s how the World Wide Web Consortium (W3C) describes the topic:

Web annotations are an attempt to recreate and extend that functionality as a new layer of interactivity and linking on top of the Web. It will allow anyone to annotate anything anywhere, be it a web page, an ebook, a video, an image, an audio stream, or data in raw or visualized form. Web annotations can be linked, shared between services, tracked back to their origins, searched and discovered, and stored wherever the author wishes; the vision is for a decentralized and open annotation infrastructure.

A Few Examples

In recent weeks and months a WC3 Web Annotation working group got underway,, a company that has been working in this area for several years (and one we’ve mentioned several times on infoDOCKET) formally launched a web annotation extension for Chrome, the Mellon Foundation awarded $750,000 in research funding, and The Journal of Electronic Publishing began offering annotation for each article in the publication.

New Video

Today, posted a 15 minute video (embedded below) where several experts share some of their perspectives (Why the interest in the topic? Biggest Challenges, Future Plans, etc.) on the topic of web annotation.

The video was recorded at the recent W3C TPAC 2014 Conference in Santa Clara, CA.

I am puzzled by more than one speaker on the video referring to the lack of robust addressing as a reason why annotation has not succeeded in the past. Perhaps they are unaware of the XLink and XPointer work at the W3C? Or HyTime for that matter?

True, none of those efforts were widely supported but that doesn’t mean that robust addressing wasn’t available.

I for one will be interested in comparing the capabilities of prior efforts against what is marketed as “new web annotation” capabilities.

Annotation, particularly what was known as “extended XLinks” is very important for the annotation of materials to which you don’t have read/write access. Think about annotating texts distributed by a vendor on DVD. Or annotating text that are delivered over a web stream. A separate third-party value-add product. Like a topic map for instance.

See videos from I Annotate 2014

Python Multi-armed Bandits (and Beer!)

November 20th, 2014

Python Multi-armed Bandits (and Beer!) by Eric Chiang.

From the post:

There are many ways to evaluate different strategies for solving different prediction tasks. In our last post, for example, we discussed calibration and descrimination, two measurements which assess the strength of a probabilistic prediciton. Measurements like accuracy, error, and recall among others are useful when considering whether random forest “works better” than support vector machines on a problem set. Common sense tells us that knowing which analytical strategy “does the best” is important, as it will impact the quality of our decisions downstream. The trick, therefore, is in selecting the right measurement, a task which isn’t always obvious.

There are many prediction problems where choosing the right accuracy measurement is particularly difficult. For example, what’s the best way to know whether this version of your recommendation system is better than the prior version? You could – as was the case with the Netflix Prize – try to predict the number of stars a user gave to a movie and measure your accuracy. Another (simpler) way to vet your recommender strategy would be to roll I out to users and watch before and after behaviors.

So by the end of this blog post, you (the reader) will hopefully be helping me improve our beer recommender through your clicks and interactions.

The final application which this blog will explain can be found at The original post explaining beer recommenders can be found here.

I have friend who programs in Python (as well as other languages) and they are or are on their way to becoming a professional beer taster.

Given a choice, I think I would prefer to become a professional beer drinker but each to their own. ;-)

The discussion of measures of distances between beers in this post is quite good. When reading it, think about beers (or other beverages) you have had and try to pick between Euclidean distance, distance correlation, and cosine similarity in discussing how you evaluate those beverages to each other.

What? That isn’t how you evaluate your choices between beverages?

Yet, those “measures” have proven to be effective (effective != 100%) at providing distances between individual choices.

The “mapping” between the unknown internal scale of users and the metric measures used in recommendation systems is derived from a population of users. The resulting scale may or may not be an exact fit for any user in the tested group.

The usefulness of any such scale depends on the similarity of the population over which it was derived and the population where you want to use it. Not to mention how you validated the answers. (Users are reported to give the “expected” response as opposed to their actual choices in some scenarios.)

Geospatial Data in Python

November 20th, 2014

Geospatial Data in Python by Carson Farmer.

Materials for the tutorial: Geospatial Data in Python: Database, Desktop, and the Web by Carson Farmer (Associate Director of CARSI lab).

Important skills if you are concerned about projects such as the Keystone XL Pipeline:

keystone pipeline route

This is an instance where having the skills to combine geospatial, archaeological, and other data together will empower local communities to minimize the damage they will suffer from this project.

Having a background in the processing geophysical data is the first step in that process.

Indexed Database API Proposed Recommendation Published (Review Opportunity)

November 20th, 2014

Indexed Database API Proposed Recommendation Published

From the post:

The Web Applications Working Group has published a Proposed Recommendation of Indexed Database API. This document defines APIs for a database of records holding simple values and hierarchical objects. Each record consists of a key and some value. Moreover, the database maintains indexes over records it stores. An application developer directly uses an API to locate records either by their key or by using an index. A query language can be layered on this API. An indexed database can be implemented using a persistent B-tree data structure. Comments are welcome through 18 December. Learn more about the Rich Web Client Activity.

If you have the time between now and 18 December, this is a great opportunity to “get your feet wet” reviewing W3C recommendations.

The Indexed Database API document isn’t long (43 pages approximately) and you are no doubt already familiar with databases in general.

An example to get you started:

3.2 APIs

The API methods return without blocking the calling thread. All asynchronous operations immediately return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

  1. When you read:

    The API methods return without blocking the calling thread.

    How do you answer the question: What is being returned by an API method?

  2. What is your answer after reading the next sentence?

    All asynchronous operations immediately return an IDBRequest instance.

  3. Does this work?

    API methods return an IDBRequest instance without blocking the calling thread.

    (Also correcting the unnecessary definite article “The.”)

  4. One more issue with the first sentence is:

    …without blocking the calling thread.

    If you search the document, there is no other mention of calling threads.

    I suspect this is unnecessary and would ask for its removal.

    So, revised the first sentence would read:

    API methods return an IDBRequest instance.

  5. Maybe, except that the second sentence says “All asynchronous operations….”

    When you see a statement of “All …. operations…,” you should look for a definition of those operations.

    I have looked and while “asynchronous” is used thirty-four (34) times, “asynchronous operations” is used only once.

    (more comments on “asynchronous” below)

  6. I am guessing you caught the “immediately” problem on your own. Undefined and what other response would there be?

    If we don’t need “asynchronous” and the first sentence is incomplete, is this a good suggestion for 3.2 APIs?

    (Proposed Edit)

    3.2 APIs

    API methods return an IDBRequest instance. This object does not initially contain any information about the result of the operation. Once information becomes available, an event is fired on the request and the information becomes available through the properties of the IDBRequest instance.

    There, your first edit to a W3C Recommendation removed twelve (12) words and made the text clearer.

    Plus there are the other thirty-three (33) instances of “asynchronous” to investigate.

    Not bad for your first effort!

    After looking around the rest of the proposed recommendation, I suspect that “asynchronous” is used to mean that results and requests can arrive and be processed in any order (except for some cases of overlapping scope of operations). It’s simpler just to say that once and not go about talking about “asynchronous requests,” “opening databases asynchronously,” etc.

Cytoscape.js != Cytoscape (desktop)

November 20th, 2014


From the webpage:

Cytoscape.js is an open-source Cytoscape.jsgraph theory (a.k.a. network) library written in JavaScript. You can use Cytoscape.js for graph analysis and visualisation.

Cytoscape.js allows you to easily display and manipulate rich, interactive graphs. Because Cytoscape.js allows the user to interact with the graph and the library allows the client to hook into user events, Cytoscape.js is easily integrated into your app, especially since Cytoscape.js supports both desktop browsers, like Chrome, and mobile browsers, like on the iPad. Cytoscape.js includes all the gestures you would expect out-of-the-box, including pinch-to-zoom, box selection, panning, et cetera.

Cytoscape.js also has graph analysis in mind: The library contains many useful functions in graph theory. You can use Cytoscape.js headlessly on Node.js to do graph analysis in the terminal or on a web server.

Cytoscape.js is an open-source project, and anyone is free to contribute. For more information, refer to the Cytoscape.jsGitHub README.

The library was developed at the Cytoscape.jsDonnelly Centre at the University of Toronto. It is the successor of Cytoscape.jsCytoscape Web.

Cytoscape.js & Cytoscape

Though Cytoscape.js shares its name with Cytoscape, Cytoscape.js is not exactly the same as Cytoscape desktop. Cytoscape.js is a JavaScript library for programmers. It is not an app for end-users, and developers need to write code around Cytoscape.js to build graphcentric apps.

Cytoscape.js is a JavaScript library: It gives you a reusable graph widget that you can integrate with the rest of your app with your own JavaScript code. The keen members of the audience will point out that this means that Cytoscape plugins/apps — written in Java — will obviously not work in Cytoscape.js — written in JavaScript. However, Cytoscape.js supports its own ecosystem of extensions.

We are trying to make the two projects intercompatible as possible, and we do share philosophies with Cytoscape: Graph style and data should be separate, the library should provide core functionality with extensions adding functionality on top of the library, and so on.

Great demo graphs!

High marks on the documentation and its TOC. Generous use of examples.

One minor niggle on the documentation:

Note that metacharacters need to be escaped:


I think the full set of metacharacters for JavaScript reads:

^ $ \ / ( ) | ? + * [ ] { } , .

Given that metacharacters vary between regex languages (unfortunately), it would be clearer to list the full set of JavaScript metacharacters and use only a few in the examples.


Note that metacharacters ( ^ $ \ / ( ) | ? + * [ ] { } , . )need to be escaped:


Overall a graph theory library that deserves your attention.

I first saw this in a tweet by Friedrich Lindenberg.

Update: I submitted a ticket on the metacharacters this morning and it was fixed shortly thereafter. Hard problems will likely take longer but definitely a responsive project!

Carnegie Mellon Machine Learning Dissertations

November 19th, 2014

Carnegie Mellon Machine Learning Dissertations

Forty-nine (49) dissertations on machine learning from Carnegie Mellon as of today.

Cold weather is setting in (in the Northern Hemisphere) so take it as additional reading material.

ClojureTV & Clojure/conf 2014

November 19th, 2014

I saw a tweet today from Clojure/conf:

We will be doing same day conference video publishing this week at #clojure_conj! Watch for updates

Now there’s a great idea!


Mining Idioms from Source Code

November 19th, 2014

Mining Idioms from Source Code by Miltiadis Allamanis and Charles Sutton.


We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic role. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present HAGGIS, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply HAGGIS to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.

A deeply interesting paper that identifies code idioms without the idioms being specified in advance.

Opens up a path to further investigation of programming idioms and annotation of such idioms.

I first saw this in: Mining Idioms from Source Code – Miltiadis Allamanis a review of a presentation by Felienne Hermans.

Science fiction fanzines to be digitized as part of major UI initiative

November 19th, 2014

Science fiction fanzines to be digitized as part of major UI initiative by Kristi Bontrager.

From the post:

The University of Iowa Libraries has announced a major digitization initiative, in partnership with the UI Office of the Vice President for Research and Economic Development. 10,000 science fiction fanzines will be digitized from the James L. “Rusty” Hevelin Collection, representing the entire history of science fiction as a popular genre and providing the content for a database that documents the development of science fiction fandom.

Hevelin was a fan and a collector for most of his life. He bought pulp magazines from newsstands as a boy in the 1930s, and by the early 1940s began attending some of the first organized science fiction conventions. He remained an active collector, fanzine creator, book dealer, and fan until his death in 2011. Hevelin’s collection came to the UI Libraries in 2012, contributing significantly to the UI Libraries’ reputation as a major international center for science fiction and fandom studies.

Interesting content for many of us but an even more interesting work flow model for the content:

Once digitized, the fanzines will be incorporated into the UI Libraries’ DIY History interface, where a select number of interested fans (up to 30) will be provided with secure access to transcribe, annotate, and index the contents of the fanzines. This group will be modeled on an Amateur Press Association (APA) structure, a fanzine distribution system developed in the early days of the medium that required contributions of content from members in order to qualify for, and maintain, membership in the organization. The transcription will enable the UI Libraries to construct a full-text searchable fanzine resource, with links to authors, editors, and topics, while protecting privacy and copyright by limiting access to the full set of page images.

The similarity between the Amateur Press Association (APA) structure and modern open source projects is interesting. I checked the APA’s homepage, they are have a more traditional membership fee now.

The Hevelin Collection homepage.

Less Than Universal & Uniform Indexing

November 19th, 2014

In Suffix Trees and their Applications in String Algorithms, I pointed out that a subset of the terms for “suffix tree” resulted in About 1,830,000 results (0.22 seconds).

Not a very useful result, even for the most dedicated of graduate students. ;-)

A better result would be an indexing entry for “suffix tree,” included results using its alternative names and enabled the user to quickly navigate to sub-entries under “suffix tree.”

To illustrate the benefit from actual indexing, consider that “Suffix Trees and their Applications in String Algorithms” lists only three keywords: “Pattern matching, String algorithms, Suffix tree.” Would you look at this paper for techniques on software maintenance?

Probably not, which would be a mistake. The section 4 covers the use of “parameterized pattern matching” for software maintenance of large programs in a fair amount of depth. Certainly more so than it covers “multidimensional pattern matching,” which is mentioned in the abstract and in the conclusion but not elsewhere in the paper. (“Higher dimensions” is mentioned on page 3 but only in two sentences with references.) Despite being mentioned in the abstract and conclusion as major theme of the paper.

A properly constructed index would break out both “parameterized pattern matching” and “software maintenance” as key subjects that occur in this paper. A bit easier to find than wading through 1,830,000 “results.”

Before anyone comments that such granular indexing would be too time consuming or expensive, recall the citation rates for computer science, 2000 – 2010:

Field 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 All years
Computer science 7.17 7.66 7.93 5.35 3.99 3.51 2.51 3.26 2.13 0.98 0.15 3.75

From: Citation averages, 2000-2010, by fields and years

The reason for the declining numbers is that citations to papers from the year 2000 decline over time.

But the highest percentage rate, 7.93 in 2002, is far less than the total number of papers published in 2000.

At one point in journal publication history, manual indexing was universal. But that was before full text searching became a reality and the scientific publication rate exploded.


The STM Report by Mark Ware and Michael Mabe.

Rather than an all human indexing model (not possible due to the rate of publication, costs) or an all computer-based searching model (leads to poor results as described above), why not consider a bifurcated indexing/search model?

The well over 90% of CS publications that aren’t cited should be subject to computer-based indexing and search models. On the other hand, the meager 8% that are cited, perhaps subject to some scale of citation, could be curated by human/machine assisted indexing.

Human/machine assisted indexing would increase access to material already selected by other readers. Perhaps even as a value-add product as opposed to take your chances with search access.

Suffix Trees and their Applications in String Algorithms

November 19th, 2014

Suffix Trees and their Applications in String Algorithms by Roberto Grossi and Giuseppe F. Italiano.


The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others.

In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching.

The authors point out that “suffix tree” is only one of the names for this subject:

The importance of the suffix tree is underlined by the fact that it has been rediscovered many times in the scientific literature, disguised under different names, and that it has been studied under numerous variations. Just to mention a few appearances of the suffix tree, we cite the compacted bi-tree [101], the prefix tree [24], the PAT tree [50], the position tree [3, 65, 75], the repetition finder [82], and the subword tree [8, 24]….

Which is an advantage if you are researching another survey paper and tracing every thread on suffix trees by whatever name, not so much of an advantage if you miss this paper or an application under name other than “suffix tree.”

Of course, a search with:

“suffix tree” OR “impacted by-tree” OR “prefix tree” OR “PAT tree” OR “position tree” OR “repetition finder” OR “subword tree”

that returns About 1,830,000 results (0.22 seconds), isn’t very helpful.

In part because no one is going to examine 1,830,000 results and that is a subset of all the terms for suffix trees.

I think we can do better than that and without an unreasonable expenditure of resources. (See: Less Than Universal & Uniform Indexing)

#shirtgate, #shirtstorm, and the rhetoric of science

November 18th, 2014

Unless you have been in a coma or just arrived from off-world, you have probably heard about #shirtgate/#shirtstorm. If not, take a minute to search on those hash tags to come up to speed.

During the ensuing flood of posts, tweets, etc., I happened to stumble upon To the science guys who want to understand #shirtstorm by Janet D. Stemwedel.

It is impressive because despite the inability of men and women to fully appreciate the rhetoric of the other gender, Stemwedel finds a third rhetoric, that of science, in which to conduct her argument.

Not that the rhetoric of science is a perfect fit for either gender but it is a rhetoric in which both genders share some assumptions and methods of reasoning. Those partially shared assumptions and methods make Stemwedel’s argument effective.

Take her comments on data gathering (formatted on her blog as tweets):

So, first big point: women’s accounts of their own experiences are better data than your preexisting hunches about their experiences.

Another thing you science guys know: sometimes we observe unexpected outcomes. We don’t say, That SHOULDN’T happen! but, WHY did it happen?

Imagine, for sake of arg, that women’s rxn to @mggtTaylor’s porny shirt was a TOTAL surprise. Do you claim that rxn shouldn’t hv happened?

Or, do you think like a scientist & try to understand WHY it happened? Do you stay stuck in your hunches or get some relevant data?

Do you recognize that women’s experiences in & with science (plus larger society) may make effect of porny shirt on #Rosetta publicity…

…on those women different than effect of porny shirt was on y’all science guys? Or that women KNOW how they feel about it better than you?

Science guys telling women “You shouldn’t be mad about porny shirt on #Rosetta video because…” is modeling bad scientific method!

Finding a common rhetoric is at the core of creating sustainable mappings between differing semantics. Stemwedel illustrates the potential for such a rhetoric even in a highly charged situation.

PS: You need to read Stemwedel’s post in the original.

Positions in the philosophy of science

November 18th, 2014

positions in philosophy of science

If you want to start a debate among faculty this holiday season, print this graphic out and leave it laying around with one or two local names penciled in.

For example, I would not list naive realism as a “philosophy of science” as much as an error, taken for a “philosophy of science.” ;-)


I first saw this as Positions in the philosophy of science by Chris Blattman.

When Information Design is a Matter of Life or Death

November 18th, 2014

When Information Design is a Matter of Life or Death by Thomas Bohm.

From the post:

In 2008, Lloyds Pharmacy conducted 20 minute interviews1 with 1,961 UK adults. Almost one in five people admitted to having taken prescription medicines incorrectly; more than eight million adults have either misread medicine labels or misunderstood the instructions, resulting in them taking the wrong dose or taking medication at the wrong time of day. In addition, the overall problem seemed to be more acute among older patients.

Almost one in five people admitted to having taken prescription medicines incorrectly; more than eight million adults have either misread medicine labels or misunderstood the instructions.

Medicine or patient information leaflets refer to the document included inside medicine packaging and are typically printed on thin paper (see figures 1.1–1.4). They are essential for the safe use of medicines and help answer people’s questions when taking the medicine.

If the leaflet works well, it can lead to people taking the medicine correctly, hopefully improving their health and wellness. If it works poorly, it can lead to adverse side effects, harm, or even death. Subsequently, leaflets are heavily regulated in the way they need to be designed, written, and produced. European2 and individual national legislation sets out the information to be provided, in a specific order, within a medicine information leaflet.

A good reminder that failure to communicate in some information systems has more severe penalties than others.

I was reminded while reading the “thin paper” example:

Medicine information leaflets are often printed on thin paper and folded many times to fit into the medicine package. There is a lot of show-through from the information printed on the back of the leaflet, which decreases readability. When the leaflet is unfolded, the paper crease marks affect the readability of the text (see figures 1.3 and 1.4). A possible improvement would be to print the leaflet on a thicker paper.

of a information leaflet that unfolded to be 18 inches wide and 24 inches long. A real tribute to the folding art. The typeface was challenging even with glasses and a magnifying glass. Too tiring to read much of it.

I don’t think thicker paper would have helped, unless the information leaflet became an information booklet.

What are the consequences if someone misreads your interface?

MarkLogic® 8…

November 18th, 2014

MarkLogic® 8 Evolves Database Technology to Solve Heterogeneous Data Integration Problems with the Power of Search, Semantics and Bitemporal Features All in One System

From the post:

MarkLogic Corporation, the leading Enterprise NoSQL database platform provider, today announced the availability of MarkLogic® Version 8 Early Access Edition. MarkLogic 8 brings together advanced search, semantics, bitemporal and native JavaScript support into one powerful, agile and trusted database platform. Companies can now:

  • Get better answers faster through integrated search and query of all of their data, metadata, and relationships, regardless of the data type or source;
  • Lower costs and increase agility by easily integrating heterogeneous data, including relational, unstructured, and richly structured data, across silos and at massive scale;
  • Rapidly build production-ready applications in weeks versus months or years to address the needs of the business or organization.

For enterprise customers who value agility but can’t compromise on resiliency, MarkLogic software is the only database platform that integrates Google-like search with rich query and semantics into an intelligent and extensible data layer that works equally well in a data center or in the cloud. Unlike other NoSQL solutions, MarkLogic provides ACID transactions, HA, DR, and other hardened features that enterprises require, along with the scalability and agility they need to accelerate their business.

“As more complex data, much of it semi-structured, becomes increasingly important to businesses’ daily operations, enterprises are realizing that they must look beyond relational databases to help them understand, integrate, and manage all of their data, deriving maximum value in a simple, yet sophisticated manner,” said Carl Olofson, research vice president at IDC. “MarkLogic has a history of bringing advanced data management technology to market and many of their customers and partners are accustomed to managing complex data in an agile manner. As a result, they have a more mature and creative view of how to manage and use data than do mainstream database users. MarkLogic 8 offers some very advanced tools and capabilities, which could expand the market’s definition of enterprise database technology.”

I’m not in the early release program but if you are, heads up!

By “semantics,” MarkLogic means RDF triples and the ability to query those triples with text, values, etc.

Since we can all see triples, text and values with different semantics, your semantic mileage with MarkLogic may vary greatly.