Pig is Flying: Apache Pig on Apache Spark (aka “Spork”)

September 14th, 2014

Pig is Flying: Apache Pig on Apache Spark by Mayur Rustagi.

From the post:

Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.

We focus our efforts on three simple initiatives:

  • Make data processing more powerful
  • Make data processing more simple
  • Make data processing 100x faster than before

As a data mashing platform, the first key initiative is to combine the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark.

DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for non-symmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.

For the uninitiated, Spark is open source Big Data infrastructure that enables distributed fault-tolerant in-memory computation. As the kernel for the distributed computation, it empowers developers to write testable, readable, and powerful Big Data applications in a number of languages including Python, Java, and Scala.

Introduction to and how to get started using Spork (Pig-on-Spark).

I know, more proof that Phil Karlton was correct in saying:

There are only two hard things in Computer Science: cache invalidation and naming things.


Astropy v0.4 Released

September 14th, 2014

Astropy v0.4 Released by Erik Tollerud.

From the post:

This July, we performed the third major public release (v0.4) of the astropy package, a core Python package for Astronomy. Astropy is a community-driven package intended to contain much of the core functionality and common tools needed for performing astronomy and astrophysics with Python.

New and improved major functionality in this release includes:

  • A new astropy.vo.samp sub-package adapted from the previously standalone SAMPy package
  • A re-designed astropy.coordinates sub-package for celestial coordinates
  • A new ‘fitsheader’ command-line tool that can be used to quickly inspect FITS headers
  • A new HTML table reader/writer
  • Improved performance for Quantity objects
  • A re-designed configuration framework

Erik goes on to say that Astropy 1.0 should arrive by the end of the year!


Forty-four More Greek Manuscripts Online

September 14th, 2014

Forty-four More Greek Manuscripts Online by James Freeman.

From the post:

We are delighted to announce another forty-four Greek manuscripts have been digitised. As always, we are most grateful to the Stavros Niarchos Foundation, the A. G. Leventis Foundation, Sam Fogg, the Sylvia Ioannou Foundation, the Thriplow Charitable Trust, the Friends of the British Library, and our other generous benefactors for contributing to the digitisation project. Happy exploring!

A random sampling:

Add MS 31921, Gospel Lectionary with ekphonetic notation (Gregory-Aland l 336), imperfect, 12th century, with some leaves supplied in the 14th century. Formerly in Blenheim Palace Library.

Add MS 34059, Gospel Lectionary (Gregory-Aland l 939), with ekphonetic neumes. 12th century.

Add MS 36660, Old Testament lectionary with ekphonetic notation, and fragments from a New Testament lectionary (Gregory-Aland l 1490). 12th century.

Add MS 37320, Four Gospels (Gregory-Aland 2290). 10th century, with additions from the 16th-17th century.


Burney MS 106, Sophocles, Ajax, Electra, Oedipus Tyrannus, Antigone; [Aeschylus], Prometheus Vinctus; Pindar, Olympia. End of the 15th century.

Burney MS 108, Aelian, Tactica; Leo VI, Tactica; Heron of Alexandria, Pneumatica, De automatis, with numerous diagrams. 1st quarter of the 16th century, possibly written at Venice.

Burney MS 109, Works by Theocritus, Hesiod, Pindar, Pythagoras and Aratus. 2nd half of the 14th century, Italy.

And many more!

Given the complex histories of the texts witnessed by these Greek manuscripts, their interpretations and commentaries, to say nothing of the history of the manuscripts per se, they are rich subjects that merit treatment with a topic map.

Be sure to visit the other treasures of the British Library. It is an exemplar of how an academic institution should function.

Army can’t track spending on $4.3b system to track spending, IG finds

September 14th, 2014

Army can’t track spending on $4.3b system to track spending, IG finds by Mark Flatten.

From the post:

More than $725 million was spent by the Army on a high-tech network for tracking supplies and expenses that failed to comply with federal financial reporting rules meant to allow auditors to track spending, according to an inspector general’s report issued Wednesday.

The Global Combat Support System-Army, a logistical support system meant to track supplies, spare parts and other equipment, was launched in 1997. In 2003, the program switched from custom software to a web-based commercial software system.

About $95 million was spent before the switch was made, according to the report from the Department of Defense IG.

As of this February, the Army had spent $725.7 million on the system, which is ultimately expected to cost about $4.3 billion.

The problem, according to the IG, is that the Army has failed to comply with a variety of federal laws that require agencies to standardize reporting and prepare auditable financial statements.

The report is full of statements like this one:

PMO personnel provided a system change request, which they indicated would correct four account attributes in July 2014. In addition, PMO personnel provided another system change request they indicated would correct the remaining account attribute (Prior Period Adjustment) in late FY 2015.

PMO = Project Management Office (in this case, of GCSS–Army).

The lack of identification of personnel speaking on behalf of the project or various offices pervades the report. Moreover, the same is true for twenty-seven (27) other reports on issues with this project.

If the sources of statements and information were identified in these reports, then it would be possible to track people across reports and to identify who has failed to follow up on representations made in the reports.

The first step towards accountability is identification of decision makers in audit reports.

Tracking decision makers from one position to another and linking them to specific decisions is a natural application of topic maps.

I first saw this in Links I Liked by Chris Blattman, September 7, 2014.

Cassandra Performance Testing with cstar_perf

September 14th, 2014

Cassandra Performance Testing with cstar_perf by Ryan Mcguire.

From the post:

It’s frequently been reiterated on this blog that performance testing of Cassandra is often done incorrectly. In my role as a Cassandra test engineer at DataStax, I’ve certainly done it incorrectly myself, numerous times. I’m convinced that the only way to do it right, consistently, is through automation – there’s simply too many variables to keep track of when doing things by hand.

cstar_perf is an easy to use tool to run performance tests on Cassandra clusters. A brief outline of what it does for you:

  • Downloads and builds Cassandra source code.
  • Configures your cassandra.yaml and environment settings.
  • Bootstraps nodes on a real cluster.
  • Runs a series of test operations on multiple versions or configs.
  • Collects and aggregates cluster performance metrics.
  • Creates easy to read performance charts comparing multiple test configurations in one view.
  • Runs a web frontend for convenient test scheduling, monitoring and reporting.

A great tool for Cassandra developers and a reminder of the first requirement for performance testing, automation. How’s your performance testing?

I first saw this in a tweet by Jason Brown.

Why Use Google Maps When You Can Get GPS Directions On The Death Star Instead?

September 13th, 2014

Why Use Google Maps When You Can Get GPS Directions On The Death Star Instead? by John Brownlee.

From the post:

Mapbox Studio is a toolkit that allows apps and websites to serve up their own custom-designed maps to users. Companies like Square, Pinterest, Foursquare, and Evernote can provide custom-skinned Mapboxes instead, changing map elements to better fit in with their brand.

But Mapbox can do far cooler stuff. It can blast you to Space Station Earth, a Mapbox that makes the entire planet look like the blinking, slate gray skin of the Star Wars Death Star.

Great if your target audience are Star Wars or similar science fiction fans or you can convince management that it will hold the attention of users longer.

Even routine tasks, like logging service calls answered, would be more enjoyable using an X-Wing fighter to destroy the location of the call after service has been completed. ;-)

Open AI Resources

September 13th, 2014

Open AI Resources

From the about page:

We all go further when we all work together. That’s the promise of Open AIR, an open source collaboration hub for AI researchers. With the decline of university- and government-sponsored research and the rise of large search and social media companies insistence on proprietary software, the field is quickly privatizing. Open AIR is the antidote: it’s important for leading scientists and researchers to keep our AI research out in the open, shareable, and extensible by the community. Join us in our goal to keep the field moving forward, together, openly.

An impressive collection of open source AI software and data.

The categories are:

A number of the major players in AI research are part of this project, which bodes well for it being maintained into the future.

If you create or encounter any open AI resources not listed at Open AI Resources, please Submit a Resource.

I first saw this in a tweet by Ana-Maria Popescu.

CQL Under the Hood

September 13th, 2014

CQL Under the Hood by Robbie Strickland.


As a reformed CQL critic, I’d like to help dispel the myths around CQL and extol its awesomeness. Most criticism comes from people like me who were early Cassandra adopters and are concerned about the SQL-like syntax, the apparent lack of control, and the reliance on a defined schema. I’ll pop open the hood, showing just how the various CQL constructs translate to the underlying storage layer–and in the process I hope to give novices and old-timers alike a reason to love CQL.

Slides from CassandraSummit 2014

Best viewed with a running instance of Cassandra.

Deep dive into understanding human language with Python

September 13th, 2014

Deep dive into understanding human language with Python by Alyona Medelyan.


Whenever your data is text and you need to analyze it, you are likely to need Natural Language Processing algorithms that help make sense of human language. They will help you answer questions like: Who is the author of this text? What is his or her attitude? What is it about? What facts does it mention? Do I have similar texts like this one already? Where does it belong to?

This tutorial will cover several open-source Natural Language Processing Python libraries such as NLTK, Gensim and TextBlob, show you how they work and how you can use them effectively.

Level: Intermediate (knowledge of basic Python language features is assumed)

Pre-requisites: a Python environment with NLTK, Gensim and TextBlob already installed. Please make sure to run nltk.download() and install movie_reviews and stopwords (under Corpora), as well as POS model (under Models).

Code examples, data and slides from Alyona’s NLP tutorial at KiwiPyCon 2014.

Introduction to NLTK, Gensim and TextBlob.

Not enough to make you dangerous but enough to get you interested in natural language processing.
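
One of the questions in the abstract, "Do I have similar texts like this one already?", can be sketched without any of the three libraries. The example below is a minimal standard-library stand-in, not the tutorial's code: it assumes plain word counts and cosine similarity, and the three short documents are toy strings invented for illustration. NLTK, Gensim and TextBlob offer far better tokenization and weighting than this.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, invented for this sketch.
docs = {
    "review_a": "a great film with a great cast",
    "review_b": "great cast and a great story",
    "manual": "install the package before running the script",
}

query = bag_of_words("the cast was great")

# Rank the stored documents from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query, bag_of_words(docs[name])),
    reverse=True,
)
```

Raw counts let common words such as "the" contribute to the score, which is one reason the tutorial asks you to install the stopwords corpus.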

Apache Kafka for Beginners

September 13th, 2014

Apache Kafka for Beginners by Gwen Shapira and Jeff Holoman.

From the post:

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and also compare Kafka to existing solutions.

What is Kafka?

Kafka is one of those systems that is very simple to describe at a high level, but has an incredible depth of technical detail when you dig deeper. The Kafka documentation does an excellent job of explaining the many design and implementation subtleties in the system, so we will not attempt to explain them all here. In summary, Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. (emphasis in original)
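
The publish-subscribe idea in that summary can be sketched in a few lines. This is an in-memory toy under stated assumptions, not Kafka's API: the class name `TinyLog`, its methods, and the topic and group names are all hypothetical. It shows only the shape of the model, where publishers append to a per-topic log and each consumer group tracks its own read offset, so several groups can read the same messages independently.

```python
from collections import defaultdict


class TinyLog:
    """In-memory stand-in for a publish-subscribe log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of messages
        self._offsets = defaultdict(int)   # (group, topic) -> index of next unread message

    def publish(self, topic, message):
        """Append a message to a topic and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, group, topic, max_messages=10):
        """Return unread messages for this consumer group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


log = TinyLog()
for event in ("click", "view", "purchase"):
    log.publish("events", event)

first = log.poll("analytics", "events", max_messages=2)   # first two messages
second = log.poll("analytics", "events")                  # the remaining message
audit = log.poll("audit", "events")                       # a second group sees all three
```

Nothing here is distributed, durable, or fast; those are the parts the Kafka documentation covers.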

A great reference to use when making your case for Kafka to technical management. In particular the line:

even a small three-node cluster can process close to a million events per second with an average latency of 3ms.

Sure, there are applications with more stringent processing requirements, but there are far more applications with less than a million events per second.

Does your topic map system get updated more than a million times a second?

First map of Rosetta’s comet

September 13th, 2014

First map of Rosetta’s comet

From the webpage:

Scientists have found that the surface of comet 67P/Churyumov-Gerasimenko — the target of study for the European Space Agency’s Rosetta mission — can be divided into several regions, each characterized by different classes of features. High-resolution images of the comet reveal a unique, multifaceted world.

ESA’s Rosetta spacecraft arrived at its destination about a month ago and is currently accompanying the comet as it progresses on its route toward the inner solar system. Scientists have analyzed images of the comet’s surface taken by OSIRIS, Rosetta’s scientific imaging system, and defined several different regions, each of which has a distinctive physical appearance. This analysis provides the basis for a detailed scientific description of 67P’s surface. A map showing the comet’s various regions is available at: http://go.nasa.gov/1pU26L2

“Never before have we seen a cometary surface in such detail,” says OSIRIS Principal Investigator Holger Sierks from the Max Planck Institute for Solar System Science (MPS) in Germany. In some of the images, one pixel corresponds to a scale of 30 inches (75 centimeters) on the nucleus. “It is a historic moment — we have an unprecedented resolution to map a comet,” he says.

The comet has areas dominated by cliffs, depressions, craters, boulders and even parallel grooves. While some of these areas appear to be quiet, others seem to be shaped by the comet’s activity, in which grains emitted from below the surface fall back to the ground in the nearby area.


The Rosetta mission:

Rosetta launched in 2004 and will arrive at comet 67P/Churyumov-Gerasimenko on 6 August. It will be the first mission in history to rendezvous with a comet, escort it as it orbits the Sun, and deploy a lander to its surface. Rosetta is an ESA mission with contributions from its member states and NASA. Rosetta’s Philae lander is provided by a consortium led by DLR, MPS, CNES and ASI.

Not to mention being your opportunity to watch semantic diversity develop from a known starting point.

Already the comet has two names: (1) 67P/Churyumov-Gerasimenko and (2) Rosetta’s comet. Can you guess which one will be used in the popular press?

Surface features will be described in different languages, which have different terms for features and the processes that formed them. Not to mention that even within natural languages there can be diversity as well.

Semantic diversity is our natural state. Normalization is an abnormal state; perhaps that is why it is so elusive on a large scale.

A Greater Voice for Individuals in W3C – Tell Us What You Would Value [Deadline: 30 Sept 2014]

September 12th, 2014

A Greater Voice for Individuals in W3C – Tell Us What You Would Value by Coralie Mercier.

From the post:

How is the W3C changing as the world evolves?

Broadening in recent years the W3C focus on industry is one way. Another was the launch in 2011 of W3C Community Groups to make W3C the place for new standards. W3C has heard the call for increased affiliation with W3C, and making W3C more inclusive of the web community.

W3C responded through the development of a program for increasing developer engagement with W3C. Jeff Jaffe is leading a public open task force to establish a program which seeks to provide individuals a greater voice within W3C, and means to get involved and help shape web technologies through open web standards.

Since Jeff announced the version 2 of the Webizen Task Force, we focused on precise goals, success criteria and a selection of benefits, and we built a public survey.

The W3C is a membership based organisation supported by way of membership fees, as to form a common set of technologies, written to the specifications defined through the W3C, which the web is built upon.

The proposal (initially called Webizen but that name may change and we invite your suggestions in the survey), seeks to extend participation beyond the traditional forum of incorporated entities with an interest in supporting open web standards, through new channels into the sphere of individual participation, already supported through the W3C community groups.

Today the Webizen Task Force is releasing a survey which will identify whether or not sufficient interest exists. The survey asks if you are willing to become a W3C Webizen. It offers several candidate benefits and sees which ones are of interest; which ones would make it worthwhile to become Webizens.

I took the survey today and suggest that you do the same before 30 September 2014.

In part I took the survey because of one comment on the original post that reads:

What a crock of shit! The W3C is designed to not be of service to individuals, but to the corporate sponsors. Any ideas or methods to improve web standards should not be taken from sources other then the controlling corporate powers.

I do think that as a PR stunt the Webizen concept could be a good ploy to allow individuals to think they have a voice, but the danger is that they may be made to feel as if they should have a voice.

This could prove detrimental in the future.

I believe the focus of the organization should remain the same, namely as a organization that protects corporate interests and regulates what aspects of technology can be, and should be, used by individuals.

The commenter apparently believes in a fantasy world where those with the gold don’t make the rules.

I am untroubled by those with the gold making the rules, so long as the rest of us have the opportunity for persuasion, that is to be heard by those making the rules.

My suggestion at #14 of the survey reads:

The anti-dilution of “value of membership” position creates a group of second class citizens, which can only lead to ill feelings and no benefit to the W3C. It is difficult to imagine that IBM, Oracle, HP or any of the other “members” of the W3C are all that concerned with voting on W3C specifications. They are likely more concerned with participating in the development of those standards. Which they could do without being members should they care to submit public comments, etc.

In fact, “non-members” can contribute to any work currently under development. If their suggestions have merit, I rather doubt their lack of membership is going to impact acceptance of their suggestions.

Rather than emphasizing the “member” versus “non-member” distinction, I would create “voting member” and “working member” categories, with different membership requirements. “Voting members” would carry on as they are presently and vote on the administrative aspects of the W3C. “Working members” would consist of employees of “voting members,” “invited experts,” and individuals who meet some criteria for interest in and expertise at a particular specification activity. Like an “invited expert” but without the heavyweight machinery.

Emphasis on the different concerns of different classes of membership would go a long way to not creating a feeling of second class citizenship. Or at least it would minimize it more than the “in your face” type approach that appears to be the present position.

Being able to participate in teleconferences for example, should be sufficient for most working members. After all, if you have to win votes for a technical position, you haven’t been very persuasive in presenting your position.

Nothing against “voting members” at the W3C but I would rather be a “working member” any day.

How about you?

Take the Webizen survey.

Connected Histories: British History Sources, 1500-1900

September 12th, 2014

Connected Histories: British History Sources, 1500-1900

From the webpage:

Connected Histories brings together a range of digital resources related to early modern and nineteenth century Britain with a single federated search that allows sophisticated searching of names, places and dates, as well as the ability to save, connect and share resources within a personal workspace. We have produced this short video guide to introduce you to the key features.

Twenty-two remarkable resources can be searched by place, person, or keyword. Some of the sources require subscriptions but the vast majority do not. A summary of the resources would fail to do them justice so here is a list of the currently searchable resources:

As you probably assume, there is no binding point for any person, object, date or thing across all twenty-two resources with its associations to other persons, objects, dates or things.

As you explore Connected Histories, keep track of where you found information on a person, object, date or thing. Depending on the granularity of pointing, you might want to create a topic map to capture that information.

Want to see how #SchemaOrg #Dbpedia and #SKOS taxonomies can be seamlessly integrated?

September 12th, 2014

Want to see how #SchemaOrg #Dbpedia and #SKOS taxonomies can be seamlessly integrated? Register for our webinar: http://www.poolparty.biz/webinar-taxonomy-management-content-management-well-integrated/

is how the tweet read.

From the seminar registration page:

With the arrival of semantic web standards and linked data technologies, new options for smarter content management and semantic search have become available. Taxonomies and metadata management shall play a central role in your content management system: By combining text mining algorithms with taxonomies and knowledge graphs from the web a more accurate annotation and categorization of documents and more complex queries over text-oriented repositories like SharePoint, Drupal, or Confluence are now possible.

Nevertheless, the predominant opinion that taxonomy management is a tedious process currently impedes a widespread implementation of professional metadata strategies.

In this webinar, key people from the Semantic Web Company will describe how content management and collaboration systems like SharePoint, Drupal or Confluence can benefit from professional taxonomy management. We will also discuss why taxonomy management is not necessarily a tedious process when well integrated into content management workflows.

I’ve had mixed luck with webinars this year. Some were quite good and others were equally bad.

I have fairly firm opinions about #Schema.org, #Dbpedia and #SKOS taxonomies but tedium isn’t one of them. ;-)

You can register for free for: Webinar “Taxonomy management & content management – well integrated!”, October 8th, 2014.

Usual marketing harvesting of contact information. Linux users will have to use VMs for PCs or Mac.

If you attend, be sure to look for my post reviewing the webinar and post your comments there.

Bokeh 0.6 release

September 12th, 2014

Bokeh 0.6 release by Bryan Van de Ven.

From the post:

Bokeh is a Python library for visualizing large and realtime datasets on the web. Its goal is to provide developers (and domain experts) with capabilities to easily create novel and powerful visualizations that extract insight from local or remote (possibly large) data sets, and to easily publish those visualizations to the web for others to explore and interact with.

This release includes many bug fixes and improvements over our most recent 0.5.2 release:

  • Abstract Rendering recipes for large data sets: isocontour, heatmap
  • New charts in bokeh.charts: Time Series and Categorical Heatmap
  • Full Python 3 support for bokeh-server
  • Much expanded User and Dev Guides
  • Multiple axes and ranges capability
  • Plot object graph query interface
  • Hit-testing (hover tool support) for patch glyphs

See the CHANGELOG for full details.

I’d also like to announce a new Github Organization for Bokeh: https://github.com/bokeh. Currently it is home to Scala and Julia language bindings for Bokeh, but the Bokeh project itself will be moved there before the next 0.7 release. Any implementors of new language bindings who are interested in hosting your project under this organization are encouraged to contact us.

In upcoming releases, you should expect to see more new layout capabilities (colorbar axes, better grid plots and improved annotations), additional tools, even more widgets and more charts, R language bindings, Blaze integration and cloud hosting for Bokeh apps.

Don’t forget to check out the full documentation, interactive gallery, and tutorial at


as well as the Bokeh IPython notebook nbviewer index (including all the tutorials) at:


One of the examples from the gallery:

[Image: plot graphic from the Bokeh gallery]

reminds me of U.S. foreign policy. The unseen attractors are defense contractors and other special interests.

The Lesser Known Normal Forms of Database Design

September 12th, 2014

The Lesser Known Normal Forms of Database Design by John Myles White.

A refreshing retake on normal forms of database design!


MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program

September 11th, 2014

MRAPs And Bayonets: What We Know About The Pentagon’s 1033 Program by Arezou Rezvani, Jessica Pupovac, David Eads, and Tyler Fisher. (NPR)

From the post:

Amid widespread criticism of the deployment of military-grade weapons and vehicles by police officers in Ferguson, Mo., President Obama recently ordered a review of federal efforts supplying equipment to local law enforcement agencies across the country.

So, we decided to take a look at what the president might find.

NPR obtained data from the Pentagon on every military item sent to local, state and federal agencies through the Pentagon’s Law Enforcement Support Office — known as the 1033 program — from 2006 through April 23, 2014. The Department of Defense does not publicly report which agencies receive each piece of equipment, but they have identified the counties that the items were shipped to, a description of each, and the amount the Pentagon initially paid for them.

We took the raw data, analyzed it and have organized it to make it more accessible. We are making that data set available to the public today.

This is a data set that raises more questions than it answers, as the post points out.

The top ten categories of items distributed (valued in the $millions): vehicles, aircraft, comm. & detection, clothing, construction, fire control, weapons, electric wire, medical equipment, and tractors.

Tractors? I can understand the military having tractors since it is entirely self-reliant during military operations. Why any local law enforcement office needs a tractor is less clear. Or bayonets (11,959 of them).

The NPR post does a good job of raising questions but since there are 3,143 counties or their equivalents in the United States, connecting the dots with particular local agencies, uses, etc. falls on your shoulders.

Could be quite interesting. Is your local sheriff “training” on an amphibious vehicle to reach his deer blind during hunting season? (Utter speculation on my part. I don’t know if your local sheriff likes to hunt deer.)

How is a binary executable organized? Let’s explore it!

September 10th, 2014

How is a binary executable organized? Let’s explore it! by Julia Evans.

From the post:

I used to think that executables were totally impenetrable. I’d compile a C program, and then that was it! I had a Magical Binary Executable that I could no longer read.

It is not so! Executable file formats are regular file formats that you can understand. I’ll explain some simple tools to start! We’ll be working on Linux, with ELF binaries. (binaries are kind of the definition of platform-specific, so this is all platform-specific.) We’ll be using C, but you could just as easily look at output from any compiled language.
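
To make the "regular file format" point concrete, here is a minimal sketch that decodes the first 20 bytes of an ELF file with the standard `struct` module. It is not taken from Julia's post, which uses command-line tools. The function name and the lookup tables are this sketch's own; the byte offsets follow the published ELF layout, and the sample header is synthetic so the example runs on any platform.

```python
import struct

ELF_MAGIC = b"\x7fELF"
CLASSES = {1: "32-bit", 2: "64-bit"}            # EI_CLASS, byte 4
ENDIANNESS = {1: "little", 2: "big"}            # EI_DATA, byte 5
TYPES = {1: "relocatable", 2: "executable", 3: "shared object", 4: "core"}


def parse_elf_header(data):
    """Decode the identification bytes plus the e_type and e_machine fields."""
    if len(data) < 20 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    byte_order = "<" if data[5] == 1 else ">"
    # e_type and e_machine are two 16-bit fields starting at offset 16.
    e_type, e_machine = struct.unpack_from(byte_order + "HH", data, 16)
    return {
        "class": CLASSES.get(data[4], "unknown"),
        "endianness": ENDIANNESS.get(data[5], "unknown"),
        "type": TYPES.get(e_type, "other"),
        "machine": e_machine,
    }


# Synthetic header: 64-bit, little-endian, executable, machine 62 (x86-64).
sample = ELF_MAGIC + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)
header = parse_elf_header(sample)
```

On a Linux machine the same function can be pointed at a real binary, for example `parse_elf_header(open("/bin/ls", "rb").read(20))`; what it reports depends on how that binary was built.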

I’ll be the first to admit that following Julia’s blog too closely carries the risk of changing you into a *nix kernel hacker.

I get a UTF-8 encoding error from her RSS feed so I have to follow her posts manually. Maybe the only thing that has saved me thus far. ;-)

Seriously, Julia’s posts help you expand your knowledge of what is on the other side of the screen.


PS: Julia is demonstrating a world of subjects that are largely unknown to the casual user. Not looking for a subject does not protect you from a defect in that subject.

Where Does Scope Come From?

September 10th, 2014

Where Does Scope Come From? by Michael Robert Bernstein.

From the post:

After several false starts, I finally sat down and watched the first of Frank Pfenning’s 2012 “Proof theory foundations” talks from the University of Oregon Programming Languages Summer School (OPLSS). I am very glad that I did.

Pfenning starts the talk out by pointing out that he will be covering the “philosophy” branch of the “holy trinity” of Philosophy, Computer Science and Mathematics. If you want to “construct a logic,” or understand how various logics work, I can’t recommend this video enough. Pfenning demonstrates the mechanics of many notions that programmers are familiar with, including “connectives” (conjunction, disjunction, negation, etc.) and scope.

Scope is demonstrated during this process as well. It turns out that in logic, as in programming, the difference between a sensible concept of scope and a tricky one can often mean the difference between a proof that makes no sense, and one that you can rest other proofs on. I am very interested in this kind of fundamental kernel – how the smallest and simplest ideas are absolutely necessary for a sound foundation in any kind of logical system. Scope is one of the first intuitions that new programmers build – can we exploit this fact to make the connections between logic, math, and programming clearer to beginners? (emphasis in the original)

Michael promises more detail on the treatment of scope in future posts.

The lectures run four (4) hours so it is going to take a while to do all of them. I am curious whether “scope” in this context refers only to variables in programming or extends in some way to scope as used in topic maps.
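
For the programming sense of the word, a small sketch of lexical scope may help as a reference point. It is my own illustration, not drawn from the lectures or from Michael's post: a name bound inside a function is visible to the functions nested within it, and invisible outside.

```python
def make_counter():
    count = 0            # bound in make_counter's scope

    def increment():
        nonlocal count   # refers to the enclosing binding, not a new local
        count += 1
        return count

    return increment


counter = make_counter()
counter()
latest = counter()       # the enclosed binding persists between calls

try:
    count                # the name is not visible at module level
    visible = True
except NameError:
    visible = False
```

The binding for `count` outlives the call that created it, yet no code outside `make_counter` can name it. That pairing of lifetime and visibility is the intuition the post suggests new programmers already have.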

More to follow.

TinkerPop3 M2 Delay for MetaProperties

September 10th, 2014

TinkerPop3 M2 Delay for MetaProperties by Marko A. Rodriguez.

From the post:

TinkerPop3 3.0.0.M2 was supposed to be released 1.5 weeks ago. We have delayed the release because we have now introduced MetaProperties into TinkerPop3. Matthias Bröcheler of Titan-fame has been pushing TinkerPop to provide this feature for over a year now. We had numerous discussions about it over the past year, and at one point, rejected the feature request. However, recently, a solid design proposal was presented by Matthias and Stephen and I went about implementing it over the last 1.5 weeks. With that said, TinkerPop3 now has MetaProperties.

What are meta-properties?

  1. Edges have Properties
  2. Vertices have MetaProperties
  3. MetaProperties have Properties

What are the consequences of meta-properties?

  1. A vertex can have multiple “name” properties (for example).
  2. A vertex’s properties (i.e. meta-properties) can have normal key/value properties (e.g. a “name” property can have an “acl:public” property).

What are the use cases?

  1. Provenance: different users have different declarations for Marko’s name: “marko”, “marko rodriguez,” “marko a. rodriguez.”
  2. Security: you can now do property-level security. Marko’s “age” has an acl:private property and his “name”(s) have acl:public properties.
  3. History: who mutated what and when did they do it? each vertex property can have a “creator:stephen” and a “createdAt:2014” property.

If you have ever had to build a graph application that required provenance, security, history, and the like, you realized how difficult it is with the current key/value property graph model. You end up, in essence, creating vertices for properties so you can express such higher order semantics. However, maintaining that becomes a nightmare as tools like Gremlin and GraphWrappers don’t know the semantics and you basically are left to create your own GremlinDSL-extensions and tools to process such a custom representation. Well now, you get it for free and TinkerPop will be able to provide (in the future) wrappers (called strategies in TP3) for provenance, security, history, etc.
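To make the quoted model concrete, here is a minimal sketch in plain Python. It is not the TinkerPop3 API: the `Vertex` and `VertexProperty` classes, their method names, and the sample values (including the age of 29) are hypothetical, chosen only to illustrate a vertex with repeated keys whose properties carry their own key/value properties.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class VertexProperty:
    """Hypothetical meta-property: a vertex property with its own properties."""
    key: str
    value: Any
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Vertex:
    """Hypothetical vertex: properties live in a list, so a key may repeat."""
    properties: List[VertexProperty] = field(default_factory=list)

    def add(self, key: str, value: Any, **meta: Any) -> VertexProperty:
        prop = VertexProperty(key, value, dict(meta))
        self.properties.append(prop)
        return prop

    def values(self, key: str) -> List[Any]:
        """All values recorded under one key (provenance use case)."""
        return [p.value for p in self.properties if p.key == key]

    def with_acl(self, acl: str) -> List[VertexProperty]:
        """Properties whose own 'acl' property matches (security use case)."""
        return [p for p in self.properties if p.meta.get("acl") == acl]


marko = Vertex()
# Two declarations of the same key, each with history and security attached.
marko.add("name", "marko", acl="public", creator="stephen", createdAt=2014)
marko.add("name", "marko a. rodriguez", acl="public", creator="stephen", createdAt=2014)
marko.add("age", 29, acl="private")  # illustrative value only

names = marko.values("name")
public_keys = [p.key for p in marko.with_acl("public")]
print(names)
print(public_keys)
```

Note that nothing here requires a separate vertex per property, which is the workaround the post describes for the older key/value model.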

I don’t grok the reason for a distinction between properties of vertices and properties of edges so I have posted a note asking about it.

Take the quoted portion as a sample of the quality of work being done on TinkerPop3.

Taxonomies and Toolkits of Regular Language Algorithms

September 10th, 2014

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms, as opposed to incremental algorithms.

Requires updating given its age (1995), but the work merits mention.
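The first of the three problems is stated precisely enough to sketch directly. What follows is a naive baseline in Python, not a reconstruction of any algorithm from the dissertation; the function name `find_keywords` is my own, and rejecting empty keywords is an assumption the problem statement does not spell out.

```python
from typing import Iterable, Set, Tuple


def find_keywords(keywords: Iterable[str], text: str) -> Set[Tuple[int, str]]:
    """Return every (start_index, keyword) where keyword is a substring of text."""
    patterns = set(keywords)
    # Assumption: the keyword set is non-empty and contains no empty string.
    if not patterns or "" in patterns:
        raise ValueError("expected a non-empty set of non-empty keywords")
    matches = set()
    for keyword in patterns:
        start = text.find(keyword)
        while start != -1:
            matches.add((start, keyword))
            # Advance by one position so overlapping occurrences are reported.
            start = text.find(keyword, start + 1)
    return matches


matches = sorted(find_keywords({"he", "she", "his", "hers"}, "ushers"))
print(matches)
```

Each keyword is scanned independently, so the cost grows with the number of keywords. That is exactly the kind of practical running time question the dissertation says is hard to answer without implementations in a common framework.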

I first saw this in a tweet by silentbicycle.srec.

ETL: The Dirty Little Secret of Data Science

September 10th, 2014

ETL: The Dirty Little Secret of Data Science by Byron Ruth.

From the description:

“There is an adage that given enough data, a data scientist can answer the world’s questions. The untold truth is that the majority of work happens during the ETL and data preprocessing phase. In this talk I discuss Origins, an open source Python library for extracting and mapping structural metadata across heterogenous data stores.”

More than your usual ETL presentation, Byron makes several points of interest to the topic map community:

  • “domain knowledge” is necessary for effective ETL
  • “domain knowledge” changes and fades from disuse
  • ETL isn’t transparent to consumers of data resulting from ETL, a “black box”
  • Data provenance is the answer to transparency, changing domain knowledge and persisting domain knowledge
  • “Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing.”
  • Project Origins captures metadata and structures from backends and persists them to Neo4j

Great focus on provenance but given the lack of merging in Neo4j, the collation of information about a common subject, with different names, is going to be a manual process.
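A minimal sketch of what that manual collation amounts to, in plain Python. It uses neither the Origins library nor Neo4j; the store names, field names, and the `collate` helper are all hypothetical. The point is only that someone with domain knowledge has to write the alias table by hand.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical structural metadata harvested from two backends.
records = [
    {"store": "postgres", "name": "patient_id", "type": "integer"},
    {"store": "mongodb", "name": "patientId", "type": "int"},
    {"store": "postgres", "name": "dob", "type": "date"},
    {"store": "mongodb", "name": "birth_date", "type": "string"},
]

# The manual step: a person declares which names identify the same subject.
aliases = {
    "patient_id": "patient identifier",
    "patientId": "patient identifier",
    "dob": "date of birth",
    "birth_date": "date of birth",
}


def collate(records: List[Dict[str, str]],
            aliases: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Group records by subject; names without an alias stay on their own."""
    merged = defaultdict(list)
    for record in records:
        subject = aliases.get(record["name"], record["name"])
        merged[subject].append(record)
    return dict(merged)


merged = collate(records, aliases)
for subject, members in sorted(merged.items()):
    print(subject, [m["store"] + "." + m["name"] for m in members])
```

When domain knowledge changes, the alias table is what has to change with it, which is why recording who declared each alias, and when, matters as much as the merge itself.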

Follow @thedevel.

What’s in a Name?

September 10th, 2014

What’s in a Name?

From the webpage:

What will be covered? The meeting will focus on the role of chemical nomenclature and terminology in open innovation and communication. A discussion of areas of nomenclature and terminology where there are fundamental issues, how computer software helps and hinders, the need for clarity and unambiguous definitions for application to software systems.

How can you contribute? As well as the talks from expert speakers there will be plenty of opportunity for discussion and networking. A record will be made of the meeting, including the discussion, and will be made available initially to those attending the meeting. The detailed programme and names of speakers will be available closer to the date of the meeting.

Date: 21 October 2014

Event Subject(s): Industry & Technology


The Royal Society of Chemistry
Burlington House
United Kingdom

Contact for Event Information

Name: Prof Jeremy Frey

University of Southampton
United Kingdom

Email: j.g.frey@soton.ac.uk

Now there’s an event worth the hassle of overseas travel during these paranoid times! Alas, I will have to wait for the conference record to be released to non-attendees. The event is a good example of the work going on at the Royal Society of Chemistry.

I first saw this in a tweet by Open PHACTS.

iCloud: Leak for Less

September 10th, 2014

Apple rolls out iCloud pricing cuts by Jonathan Vanian.

Jonathan details the new Apple pricing schedule for the iCloud.

Now you can leak your photographs for less!

Cheap storage = Cheap security.

Is there anything about that statement that is unclear?

QPDF – PDF Transformations

September 10th, 2014

QPDF – PDF Transformations

From the webpage:

QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work.

QPDF is capable of creating linearized (also known as web-optimized) files and encrypted files. It is also capable of converting PDF files with object streams (also known as compressed objects) to files with no compressed objects or to generate object streams from files that don’t have them (or even those that already do). QPDF also supports a special mode designed to allow you to edit the content of PDF files in a text editor….

Government agencies often publish information in PDF, a format that often carries restrictions on copying and printing.

I have briefly tested QPDF and it does take care of copying and printing restrictions. Be aware that QPDF has many other capabilities as well.
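For anyone who wants to script it, here is a small Python wrapper around the command line. The helper names `qpdf_command` and `run_qpdf` and the file names are my own; the flags (`--decrypt`, `--linearize`, `--qdf`, `--object-streams=`) are qpdf command-line options, but treat this as a sketch and check the manual for your installed version before relying on it.

```python
import shutil
import subprocess
from typing import List, Optional


def qpdf_command(source: str, target: str, *, decrypt: bool = False,
                 linearize: bool = False, qdf: bool = False,
                 object_streams: Optional[str] = None) -> List[str]:
    """Build a qpdf argument list: options first, then input and output files."""
    if object_streams not in (None, "disable", "preserve", "generate"):
        raise ValueError("object_streams must be disable, preserve or generate")
    args = ["qpdf"]
    if decrypt:
        args.append("--decrypt")      # write the output without encryption
    if linearize:
        args.append("--linearize")    # web-optimized output
    if qdf:
        args.append("--qdf")          # the mode meant for editing in a text editor
    if object_streams is not None:
        args.append(f"--object-streams={object_streams}")
    return args + [source, target]


def run_qpdf(args: List[str]) -> None:
    """Run the command; raises CalledProcessError on any non-zero exit status."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("qpdf is not installed or not on PATH")
    subprocess.run(args, check=True)


# Hypothetical file names; call run_qpdf(command) to actually execute it.
command = qpdf_command("restricted.pdf", "unrestricted.pdf", decrypt=True)
print(" ".join(command))
```

Building the argument list separately from running it keeps the command easy to inspect before anything touches a file.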

Recursive Deep Learning For Natural Language Processing And Computer Vision

September 10th, 2014

Recursive Deep Learning For Natural Language Processing And Computer Vision by Richard Socher.

From the abstract:

As the amount of unstructured text data that humanity produces overall and on the Internet grows, so does the need to intelligently process it and extract different types of knowledge from it. My research goal in this thesis is to develop learning models that can automatically induce representations of human language, in particular its structure and meaning in order to solve multiple higher level language tasks.

There has been great progress in delivering technologies in natural language processing such as extracting information, sentiment analysis or grammatical analysis. However, solutions are often based on different machine learning models. My goal is the development of general and scalable algorithms that can jointly solve such tasks and learn the necessary intermediate representations of the linguistic units involved. Furthermore, most standard approaches make strong simplifying language assumptions and require well designed feature representations. The models in this thesis address these two shortcomings. They provide effective and general representations for sentences without assuming word order independence. Furthermore, they provide state of the art performance with no, or few manually designed features.

The new model family introduced in this thesis is summarized under the term Recursive Deep Learning. The models in this family are variations and extensions of unsupervised and supervised recursive neural networks (RNNs) which generalize deep and feature learning ideas to hierarchical structures. The RNN models of this thesis obtain state of the art performance on paraphrase detection, sentiment analysis, relation classification, parsing, image-sentence mapping and knowledge base completion, among other tasks.

Socher’s models offer two significant advances:

  • No assumption of word order independence
  • No or few manually designed features

Of the two, I am more partial to elimination of the assumption of word order independence. I suppose in part because I see that leading to abandoning the assumption that words have some fixed meaning separate and apart from the other words used to define them.
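A toy illustration of the word order point, in plain Python. It uses a common form of recursive composition, a parent vector computed as tanh(W [left; right] + b) over a binary tree, but the two-dimensional word vectors, the weights, and the function names are made up for the example; a real model learns its parameters, and nothing here is taken from the thesis code.

```python
import math
from typing import List, Tuple, Union

Vector = List[float]
Tree = Union[str, Tuple["Tree", "Tree"]]

# Illustrative word vectors and weights; a trained model would learn these.
WORDS = {"not": [1.0, 0.0], "good": [0.0, 1.0]}
W = [[0.5, -0.3, 0.8, 0.1],
     [0.2, 0.7, -0.6, 0.4]]
B = [0.0, 0.0]


def compose(left: Vector, right: Vector) -> Vector:
    """Parent vector p = tanh(W [left; right] + b)."""
    children = left + right  # list concatenation, so position matters
    return [math.tanh(sum(w * c for w, c in zip(row, children)) + bias)
            for row, bias in zip(W, B)]


def encode(tree: Tree) -> Vector:
    """Encode a word (leaf) or a (left, right) pair by recursive composition."""
    if isinstance(tree, str):
        return WORDS[tree]
    left, right = tree
    return compose(encode(left), encode(right))


def bag_of_words(words: List[str]) -> Vector:
    """Order-independent baseline: element-wise sum of the word vectors."""
    return [sum(WORDS[w][i] for w in words) for i in range(2)]


print(bag_of_words(["not", "good"]) == bag_of_words(["good", "not"]))
print(encode(("not", "good")) == encode(("good", "not")))
```

The bag-of-words sum cannot tell the two orderings apart, while the recursive encoding gives each ordering its own vector, because the left and right children meet different columns of W.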

Or in topic maps parlance, identifying a subject always involves the use of other subjects, which are themselves capable of being identified. Think about it. When was the last time you were called upon to identify a person, object or thing and you uttered an IRI? Never, right?

That certainly works, at least in closed domains, in some cases, but other than simply repeating the string, you have no basis on which to conclude that is the correct IRI. Nor does anyone else have a basis to accept or reject your IRI.

I suppose that is another one of those “simplifying” assumptions. Useful in some cases but not all.

OceanColor Web

September 9th, 2014

OceanColor Web

A remarkable source for ocean color data and software for analysis of that data.

From the webpage:

This project creates a variety of established and new ocean color products for evaluation as candidates to become Earth Science Data Records.

Not directly relevant to anything I’m working on but I don’t know what environmental or oceanography projects you are pursuing.

I first saw this in a tweet by Rob Simmon.

PLOS Resources on Ebola

September 9th, 2014

PLOS Resources on Ebola by Virginia Barbour and PLOS Collections.

From the post:

The current Ebola outbreak in West Africa probably began in Guinea in 2013, but it was only recognized properly in early 2014 and shows, at the time of writing, no sign of subsiding. The continuous human-to-human transmission of this new outbreak virus has become increasingly worrisome.

Analyses thus far of this outbreak mark it as the most serious in recent years and the effects are already being felt far beyond those who are infected and dying; whole communities in West Africa are suffering because of its negative effects on health care and other infrastructures. Globally, countries far removed from the outbreak are considering their local responses, were Ebola to be imported; and the ripple effects on the normal movement of trade and people are just becoming apparent.

A great collection of PLOS resources on Ebola.

Even usually closed sources are making Ebola information available for free:

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak (Science DOI: 10.1126/science.1259657) This is the gene sequencing report suggesting that a single transmission from the natural reservoir into the human population is the source of all the following Ebola infections.

So much for needing highly specialized labs to “weaponize” biological agents. One infection is likely to result in > 20,000 deaths. You do the math.

I first saw this in a tweet by Alex Vespignani.