Archive for September, 2014

New Wandora Release 2014-09-25

Tuesday, September 30th, 2014

New Wandora Release 2014-09-25

This release features:

Sounds good to me!

Download the latest release today!

Neo4j: Generic/Vague relationship names

Tuesday, September 30th, 2014

Neo4j: Generic/Vague relationship names by Mark Needham.

From the post:

An approach to modelling that I often see while working with Neo4j users is creating very generic relationships (e.g. HAS, CONTAINS, IS) and filtering on a relationship property or on a property/label at the end node.

Intuitively this doesn’t seem to make best use of the graph model as it means that you have to evaluate many relationships and nodes that you’re not interested in whereas if you use a more specific relationship type that isn’t the case.

However, I’ve never actually tested the performance differences between the approaches so I thought I’d try it out.

I created 4 different databases which had one node with 60,000 outgoing relationships – 10,000 which we wanted to retrieve and 50,000 that were irrelevant.

I modelled the ‘relationship’ in 4 different ways…

  • Filter by relationship type
  • Filter by end node label
  • Filter by relationship property
    (node)-[:HAS {type: "address"}]->(address)
  • Filter by end node
    (node)-[:HAS]->(address {type: "address"})

…and then measured how long it took to retrieve the ‘has address’ relationships.

See Mark’s post for the test results, but the punch line is simple: the less filtering required, the faster the result.
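
To make that punch line concrete, here is a minimal Python sketch. It is not Neo4j code and not Mark’s benchmark. It only counts how many relationships each model has to examine for one node with 60,000 outgoing relationships, 10,000 of them relevant. The type names HAS_ADDRESS and HAS_OTHER and the dictionary `by_type` are assumptions made for illustration, standing in for a store that can jump straight to relationships of one type.

```python
from collections import defaultdict

# Hypothetical data: 10,000 relevant and 50,000 irrelevant relationships,
# all modelled with the generic HAS type plus a "type" property.
relationships = (
    [{"type": "HAS", "props": {"type": "address"}, "end": i}
     for i in range(10_000)]
    + [{"type": "HAS", "props": {"type": "other"}, "end": i}
       for i in range(10_000, 60_000)]
)


def filter_by_property(rels, wanted):
    """Generic model: every relationship must be inspected."""
    examined = 0
    hits = []
    for rel in rels:
        examined += 1
        if rel["props"]["type"] == wanted:
            hits.append(rel["end"])
    return hits, examined


# Specific-type model: relationships are grouped by type up front
# (an assumption about the store, used here only for illustration).
by_type = defaultdict(list)
for rel in relationships:
    specific = "HAS_ADDRESS" if rel["props"]["type"] == "address" else "HAS_OTHER"
    by_type[specific].append(rel["end"])


def filter_by_type(index, rel_type):
    """Specific model: only relationships of the wanted type are touched."""
    hits = index[rel_type]
    return hits, len(hits)


generic_hits, generic_examined = filter_by_property(relationships, "address")
typed_hits, typed_examined = filter_by_type(by_type, "HAS_ADDRESS")
print(generic_examined, typed_examined)
```

Both approaches return the same end nodes; the difference is only in how much has to be looked at along the way.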

Designing data structures for eventual queries seems sub-optimal to me.


Open Sourcing ml-ease

Tuesday, September 30th, 2014

Open Sourcing ml-ease by Deepak Agarwal.

From the post:

LinkedIn data science and engineering is happy to release the first version of ml-ease, an open-source large scale machine learning library. ml-ease supports model fitting/training on a single machine, a Hadoop cluster and a Spark cluster with emphasis on scalability, speed, and ease-of-use. ml-ease is a useful tool for developers working on big data machine learning applications, and we’re looking forward to feedback from the open-source community. ml-ease currently supports ADMM logistic regression for binary response prediction with L1 and L2 regularization on Hadoop clusters.

See Deepak’s post for more details and news of future machine learning algorithms to be released!

Sudoku, Linear Optimization, and the Ten Cent Diet

Tuesday, September 30th, 2014

Sudoku, Linear Optimization, and the Ten Cent Diet by Jon Orwant.

From the post:

In 1945, future Nobel laureate George Stigler wrote an essay in the Journal of Farm Economics titled The Cost of Subsistence about a seemingly simple problem: how could a soldier be fed for as little money as possible?

The “Stigler Diet” became a classic problem in the then-new field of linear optimization, which is used today in many areas of science and engineering. Any time you have a set of linear constraints such as “at least 50 square meters of solar panels” or “the amount of paint should equal the amount of primer” along with a linear goal (e.g., “minimize cost” or “maximize customers served”), that’s a linear optimization problem.

At Google, our engineers work on plenty of optimization problems. One example is our YouTube video stabilization system, which uses linear optimization to eliminate the shakiness of handheld cameras. A more lighthearted example is in the Google Docs Sudoku add-on, which instantaneously generates and solves Sudoku puzzles inside a Google Sheet, using the SCIP mixed integer programming solver to compute the solution.

(image omitted)

Today we’re proud to announce two new ways for everyone to solve linear optimization problems. First, you can now solve linear optimization problems in Google Sheets with the Linear Optimization add-on written by Google Software Engineer Mihai Amarandei-Stavila. The add-on uses Google Apps Script to send optimization problems to Google servers. The solutions are displayed inside the spreadsheet. For developers who want to create their own applications on top of Google Apps, we also provide an API to let you call our linear solver directly.

(image omitted)

Second, we’re open-sourcing the linear solver underlying the add-on: Glop (the Google Linear Optimization Package), created by Bruno de Backer with other members of the Google Optimization team. It’s available as part of the or-tools suite and we provide a few examples to get you started. On that page, you’ll find the Glop solution to the Stigler diet problem. (A Google Sheets file that uses Glop and the Linear Optimization add-on to solve the Stigler diet problem is available here. You’ll need to install the add-on first.)

For a fuller introduction to linear programming: Practical Optimization: A Gentle Introduction by John W. Chinneck. Online, draft chapters.
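
As a toy illustration of the “linear constraints plus a linear goal” definition quoted above, here is a small sketch in plain Python. It is not Glop and not the Stigler diet data: the two “foods”, the two nutrient constraints and the prices are invented numbers. The method is brute-force vertex enumeration, which is only practical for tiny problems like this one and assumes the optimum lies at a corner of the feasible region.

```python
from itertools import combinations

# Each constraint reads a*x + b*y >= r. All numbers are made up.
constraints = [
    (1.0, 2.0, 4.0),  # hypothetical nutrient A requirement
    (3.0, 1.0, 6.0),  # hypothetical nutrient B requirement
    (1.0, 0.0, 0.0),  # x >= 0
    (0.0, 1.0, 0.0),  # y >= 0
]
cost = (2.0, 3.0)  # hypothetical price per unit of food x and food y


def intersect(c1, c2):
    """Solve the 2x2 system where both constraints hold with equality."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None  # parallel boundary lines, no single intersection
    x = (r1 * b2 - r2 * b1) / det
    y = (a1 * r2 - a2 * r1) / det
    return x, y


def feasible(point):
    """Check the point against every constraint, with a small tolerance."""
    x, y = point
    return all(a * x + b * y >= r - 1e-9 for a, b, r in constraints)


def solve():
    """Return (cost, point) for the cheapest feasible corner."""
    best = None
    for c1, c2 in combinations(constraints, 2):
        point = intersect(c1, c2)
        if point is None or not feasible(point):
            continue
        value = cost[0] * point[0] + cost[1] * point[1]
        if best is None or value < best[0]:
            best = (value, point)
    return best


best_cost, best_point = solve()
print(best_cost, best_point)
```

A production solver handles thousands of variables with far more scalable algorithms; the sketch only shows what “minimize cost subject to linear constraints” means in practice.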

I would say more about the utility of linear optimization in the subject identity space but it might violate an NDA I signed many years ago. Sorry.

LaTeX (Wikibooks)

Tuesday, September 30th, 2014

LaTeX (Wikibooks)

I mention LaTeX because it is a featured book on Wikibooks, no mean feat, and I am going to be consulting it on several writing projects. I have several other LaTeX references, but it never hurts to have one more.

BTW, if your interests run in that direction, consider contributing to LaTeX on some of the more advanced topics.


Neo4j 2.1.5

Tuesday, September 30th, 2014

Neo4j 2.1.5

From the post:

Neo4j 2.1.5 is a maintenance release, with critical improvements.

Notably, this release addresses the following:

  • Corrects a Cypher compiler error introduced only in Neo4j 2.1.4, which caused Cypher queries containing nested maps to fail type checking.
  • Resolves a critical error, where discrete remove+add operations on properties could result in a new property being added, without the old property being correctly removed.
  • Corrects an issue causing significantly degraded write performance in larger transactions.
  • Improves memory use in Cypher queries containing OPTIONAL MATCH.
  • Resolves an issue causing failed index lookups for some newly created integer properties.
  • Fixes an issue which could cause excessive store growth in some clustered environments (Neo4j Enterprise).
  • Adds additional metadata (label and ID) to node and relationship representations in JSON responses from the REST API.
  • Resolves an issue with extraneous remove commands being added to the legacy auto-index transaction log.
  • Resolves an issue preventing the lowest ID cluster member from successfully leaving and rejoining the cluster, in cases where it was not the master (Neo4j Enterprise).

All Neo4j 2.x users are recommended to upgrade to this release. Upgrading to Neo4j 2.1 requires a migration to the on-disk store and can not be reversed. Please ensure you have a valid backup before proceeding, then use on a test or staging server to understand any changed behaviors before going into production.

Neo4j 1.9 users may upgrade directly to this release, and are recommended to do so carefully. We strongly encourage verifying the syntax and validating all responses from your Cypher scripts, REST calls, and Java code before upgrading any production system. For information about upgrading from Neo4j 1.9, please see our Upgrading to Neo4j 2 FAQ.

Do you remember which software company had the “We are holding the gun but you decide whether to pull the trigger” type upgrade warning? There are so many legendary upgrade stories that it is hard to remember them all. Is there a collection of upgrade warnings and/or stories on the Net? Thanks!

BTW, if you are running Neo4j 2.x, upgrade. No comment on Neo4j 1.9.

IPython Cookbook released

Tuesday, September 30th, 2014

IPython Cookbook released by Cyrille Rossant.

From the post:

My new book, IPython Interactive Computing and Visualization Cookbook, has just been released! A sequel to my previous beginner-level book on Python for data analysis, this new 500-page book is a complete advanced-level guide to Python for data science. The 100+ recipes cover not only interactive and high-performance computing topics, but also data science methods in statistics, data mining, machine learning, signal processing, image processing, network analysis, and mathematical modeling.

Here is a glimpse of the topics addressed in this book:

  • IPython notebook, interactive widgets in IPython 2+
  • Best practices in interactive computing: version control, workflows with IPython, testing, debugging, continuous integration…
  • Data analysis with pandas, NumPy/SciPy, and matplotlib
  • Advanced data visualization with seaborn, Bokeh, mpld3, d3.js, Vispy
  • Code profiling and optimization
  • High-performance computing with Numba, Cython, GPGPU with CUDA/OpenCL, MPI, HDF5, Julia
  • Statistical data analysis with SciPy, PyMC, R
  • Machine learning with scikit-learn
  • Signal processing with SciPy, image processing with scikit-image and OpenCV
  • Analysis of graphs and social networks with NetworkX
  • Geographic Information Systems in Python
  • Mathematical modeling: dynamical systems, symbolic mathematics with SymPy

All of the code is freely available as IPython notebooks on the book’s GitHub repository. This repository is also the place where you can signal errata or propose improvements to any part of the book.

It’s never too early to work on your “wish list” for the holidays! 😉

Or to be the person who tweaks the code (or data).

Lingo of Lambda Land

Tuesday, September 30th, 2014

Lingo of Lambda Land by Katie Miller.

From the post:

Comonads, currying, compose, and closures
This is the language of functional coders
Equational reasoning, tail recursion
Lambdas and lenses and effect aversion
Referential transparency and pure functions
Pattern matching for ADT deconstructions
Functors, folds, functions that are first class
Monoids and monads, it’s all in the type class
Infinite lists, so long as they’re lazy
Return an Option or just call it Maybe
Polymorphism and those higher kinds
Monad transformers, return and bind
Catamorphisms, like from Category Theory
You could use an Either type for your query
Arrows, applicatives, continuations
IO actions and partial applications
Higher-order functions and dependent types
Bijection and bottom, in a way that’s polite
Programming of a much higher order
Can be found just around the jargon corner

I posted about Katie Miller’s presentation, Coder Decoder: Functional Programmer Lingo Explained, with Pictures, but wanted to draw your attention to the poem she wrote to start the presentation.

In part because it is an amusing poem, but also so that you can attempt an experiment that Stanley Fish reports on the interpretation of poems.

Stanley’s experiment is recounted in “How to Recognize a Poem When You See One,” which appears as chapter 14 in Is There A Text In This Class? The Authority of Interpretive Communities by Stanley Fish.

As functional programmers or wannabe functional programmers, you are probably not the “right” audience for this experiment. (But, feel free to try it.)

Stanley’s experiment grew out of a list of authors given to one class, written centered on a blackboard (yes, many years ago). For the second class, Stanley drew a box around the list of names and added “p. 43” to the board. Those were the only changes between the classes.

The second class was one on interpretation of religious poetry. The students were told that this list was a religious poem and that they should begin applying the techniques learned in the class to its interpretation.

Stanley’s account of this experiment is masterful and I urge you to read his account in full.

At the same time, you will learn a lot about semantics if you ask a poetry professor to have one of their classes produce an interpretation of this poem. You will discover that “not knowing the meaning of the terms” is no barrier to the production of an interpretation. Sit in the back of the classroom and don’t betray the experiment by offering explanations of the terms.

The question to ask yourself at the end of the experiment is: Where did the semantics of the poem originate? Did Katie Miller imbue it with semantics that would be known to all readers? Or do the terms themselves carry semantics and Katie just selected them? If either answer is yes, how did the poetry class arrive at its rather divergent and colorful explanation of the poem?

Hmmm, if you were scanning this text with a parser, whose semantics would your parser attribute to the text? Katie’s? Any programmer’s? The class’s?

Worthwhile to remember that data processing chooses “a” semantic, not “the” semantic in any given situation.

Coder Decoder: Functional Programmer Lingo Explained, with Pictures

Tuesday, September 30th, 2014

Coder Decoder: Functional Programmer Lingo Explained, with Pictures by Katie Miller.

From the description:

For the uninitiated, a conversation with functional programmers can feel like ground zero of a jargon explosion. This talk will help you to defend against the blah-blah blast by demystifying several terms commonly used by FP fans with bite-sized Haskell examples and friendly pictures. The presentation will also offer a glimpse of how some of these concepts can be applied in a simple Haskell web application. Expect appearances by Curry, Lens, and the infamous M-word, among others.


Haskell demo source code:

Informative and entertaining presentation on functional programming lingo.

Not all functional programming lingo, but enough to make you wish your presentations were this clear.

Clojure Cup 2014 – Voting Ends October 06 23:59 UTC

Tuesday, September 30th, 2014

Clojure Cup 2014 – Voting Ends October 06 23:59 UTC

From the webpage:

Here are the apps competing in Clojure Cup 2014. You may vote for as many of them as you like, using your Twitter account. At the end of the voting period the App with the most votes will receive the Public Favorite award.

Voting ends October 06 23:59 UTC

Don’t bother trying to discern your local time from 23:59 UTC. Vote as soon as you read this post!

If you want to encourage functional programming in general and Clojure in particular, vote for your favorite.

Looking at the prizes, you will want to start working on your Clojure chops for Clojure Cup 2015!

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

Monday, September 29th, 2014

The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project

From the post:

The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 200 Open Source projects and initiatives, announced today that Apache™ Storm™ has graduated from the Apache Incubator to become a Top-Level Project (TLP), signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

“Apache Storm’s graduation is not only an indication of its maturity as a technology, but also of the robust, active community that develops and supports it,” said P. Taylor Goetz, Vice President of Apache Storm. “Storm’s vibrant community ensures that Storm will continue to evolve to meet the demands of real-time stream processing and computation use cases.”

Apache Storm is a high-performance, easy-to-implement distributed real-time computation framework for processing fast, large streams of data, adding reliable data processing capabilities to Apache Hadoop. Using Storm, a Hadoop cluster can efficiently process a full range of workloads, from real-time to interactive to batch.

As with all Apache products, Apache Storm software is released under the Apache License v2.0, and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project’s day-to-day operations, including community development and product releases. For documentation and ways to become involved with Apache Storm, visit and @Apache_Storm on Twitter.

You will see many notices of Apache™ Storm™’s graduation to a Top-Level Project. Odds are you have already seen one. But, like the weather channel reporting rain at your location, someone may have missed the news. 😉

How does SQLite work? Part 1: pages!

Monday, September 29th, 2014

How does SQLite work? Part 1: pages! by Julia Evans.

From the post:

This evening the fantastic Kamal and I sat down to learn a little more about databases than we did before.

I wanted to hack on SQLite, because I’ve used it before, it requires no configuration or separate server process, I’d been told that its source code is well-written and approachable, and all the data is stored in one file. Perfect!

Following Julia down a rabbit hole to program internals encourages you to venture out on your own!

I can’t say why her posts have that quality, but they do.


I first saw this in a tweet by FoundationDB.

Peaxy Hyperfiler Redefines Data Management to Deliver on the Promise of Advanced Analytics

Monday, September 29th, 2014

Peaxy Hyperfiler Redefines Data Management to Deliver on the Promise of Advanced Analytics

From the post:

Peaxy, Inc. today announced general availability of the Peaxy Hyperfiler, its hyperscale data management system that enables enterprises to access and manage massive amounts of unstructured data without disrupting business operations. For engineers and researchers who must search for datasets across multiple geographies, platforms and drives, accessing all the data necessary to inform the product lifecycle, from design to predictive maintenance, presents a major challenge. By making all data, regardless of quantity or location, immediately accessible via a consistent data path, companies will be able to dramatically accelerate their highly technical, data-intensive initiatives. These organizations will be able to manage data in a way that allows them to employ advanced analytics that have been promised for years but never truly realized.

…Key product features include:

  • Scalability to tens of thousands of nodes enabling the creation of an exabyte-scale data infrastructure in which performance scales in parallel with capacity
  • Fully distributed namespace and data space that eliminate data silos to make all data easily accessible and manageable
  • Simple, intuitive user interface built for engineers and researchers as well as for IT
  • Data tiered in storage classes based on performance, capacity and replication factor
  • Automated, policy-based data migration
  • Flexible, customizable data management
  • Remote, asynchronous replication to facilitate disaster recovery
  • Call home remote monitoring
  • Software-based, hardware-agnostic architecture that eliminates proprietary lock-in
  • Addition or replacement of hardware resources with no down time
  • A version of the Hyperfiler that has been successfully beta tested on Amazon Web Services (AWS)

I would not say that the “how it works” page is opaque but it does remind me of the Grinch telling Cindy Lou that he was taking their Christmas tree to be repaired. Possible but lacking in detail.

What do you think?


Do you see:

  1. Any mention of mapping multiple sources of data into a consolidated view?
  2. Any mention of managing changing terminology over a product history?
  3. Any mention of indexing heterogeneous data?
  4. Any mention of natural language processing unstructured data?
  5. Any mention of machine learning over unstructured data?
  6. Anything beyond an implied “a miracle occurs” between data and Hyperfiler?

The documentation promises “data filters” but is also short on specifics.

A safe bet that mapping of terminology and semantics, for an enterprise and/or long product history, remains fertile ground for topic maps.

I first saw this in a tweet by Gregory Piatetsky.

PS: Answers to the questions I raise may exist somewhere but I warrant they weren’t posted on September 29, 2014 at the locations listed in this post.

Big Data – A curated list of big data frameworks, resources and tools

Sunday, September 28th, 2014

Big Data – A curated list of big data frameworks, resources and tools by Andrea Mostosi.

From the post:

“Big-data” is one of the most inflated buzzword of the last years. Technologies born to handle huge datasets and overcome limits of previous products are gaining popularity outside the research environment. The following list would be a reference of this world. It’s still incomplete and always will be.

Four hundred and eighty-four (484) resources by my count.

An impressive collection but HyperGraphDB is missing from this list.

Others that you can name off hand?

I don’t think the solution to the many partial “Big Data” lists of software, techniques and other resources is to create yet another list of the same. That would be a duplicated (and doomed) effort.



Scientists confess to sneaking Bob Dylan lyrics into their work for the past 17 years

Sunday, September 28th, 2014

Scientists confess to sneaking Bob Dylan lyrics into their work for the past 17 years by Rachel Feltman.

From the post:

While writing an article about intestinal gasses 17 years ago, Karolinska Institute researchers John Lundberg and Eddie Weitzberg couldn’t resist a punny title: “Nitric Oxide and inflammation: The answer is blowing in the wind”.

Thus began their descent down the slippery slope of Bob Dylan call-outs. While the two men never put lyrics into their peer-reviewed studies, The Local Sweden reports, they started a personal tradition of getting as many Dylan quotes as possible into everything else they wrote — articles about other peoples’ work, editorials, book introductions, and so on.

An amusing illustration of one difficulty in natural language processing, allusion.

The Wikipedia article on allusion summarizes one typology of allusion (R. F. Thomas, “Virgil’s Georgics and the art of reference” Harvard Studies in Classical Philology 90 (1986) pp 171–98) as:

  1. Casual Reference, “the use of language which recalls a specific antecedent, but only in a general sense” that is relatively unimportant to the new context;
  2. Single Reference, in which the hearer or reader is intended to “recall the context of the model and apply that context to the new situation”; such a specific single reference in Virgil, according to Thomas, is a means of “making connections or conveying ideas on a level of intense subtlety”;
  3. Self-Reference, where the locus is in the poet’s own work;
  4. Corrective Allusion, where the imitation is clearly in opposition to the original source’s intentions;
  5. Apparent Reference, “which seems clearly to recall a specific model but which on closer inspection frustrates that intention”; and
  6. Multiple Reference or Conflation, which refers in various ways simultaneously to several sources, fusing and transforming the cultural traditions.

(emphasis in original)

Allusion is a sub-part of the larger subject of intertextuality.

Think of the difficulties that allusions introduce into NLP. With “Dylan lyrics meaning” as a quoted search string, I get over 60,000 “hits” consisting of widely varying interpretations. Add to that the interpretation of a Dylan allusion in a different context and you have a truly worthy NLP problem.

Two questions:

The Dylan post is one example of allusion. Is there any literature or sense of how much allusion occurs in specific types of writing?

Any literature on NLP techniques for dealing with allusion in general?

I first saw this in a tweet by Carl Anderson.

Machine Learning Is Way Easier Than It Looks

Sunday, September 28th, 2014

Machine Learning Is Way Easier Than It Looks by Ben McRedmond.

From the post:

It’s easy to believe that machine learning is hard. An arcane craft known only to a select few academics.

After all, you’re teaching machines that work in ones and zeros to reach their own conclusions about the world. You’re teaching them how to think! However, it’s not nearly as hard as the complex and formula-laden literature would have you believe.

Like all of the best frameworks we have for understanding our world, e.g. Newton’s Laws of Motion, Jobs to be Done, Supply & Demand — the best ideas and concepts in machine learning are simple. The majority of literature on machine learning, however, is riddled with complex notation, formulae and superfluous language. It puts walls up around fundamentally simple ideas.

Let’s take a practical example. Say we wanted to include a “you might also like” section at the bottom of this post. How would we go about that? (emphasis in the original)

Yes, Ben uses a simple example. Yes, Ruby isn’t an appropriate language for machine learning. Yes, there are far more complex techniques in common use for machine learning. Just to cover a few of the comments made in response to Ben’s post.

However, Ben does illustrate that it is possible to clearly communicate the essential principles in a machine learning example. And to provide simple code that implements those principles.

That does not take anything away from more complex techniques or more complex code to implement any machine learning approach.

If you are writing about machine learning in professional literature, don’t use this approach as “clarity” there has a different meaning than when writing for non-specialists.

On the other hand, when writing for non-specialists, do use Ben’s approach as “clarity” there isn’t the same as in professional literature.

Neither approach is more correct than the other; they are addressed to different audiences.

Ben’s style of explanation is one that is worthy of emulation, at least in non-professional literature.

I first saw this in a tweet by Carl Anderson.

Native Actors – A Scalable Software Platform for Distributed, Heterogeneous Environments

Saturday, September 27th, 2014

Native Actors – A Scalable Software Platform for Distributed, Heterogeneous Environments by Dominik Charousset, Thomas C. Schmidt, Raphael Hiesgen, and Matthias Wählisch.

From the abstract:


Writing concurrent software is challenging, especially with low-level synchronization primitives such as threads or locks in shared memory environments. The actor model replaces implicit communication by an explicit message passing in a ‘shared-nothing’ paradigm. It applies to concurrency as well as distribution, but has not yet entered the native programming domain. This paper contributes the design of a native actor extension for C++, and the report on a software platform that implements our design for (a) concurrent, (b) distributed, and (c) heterogeneous hardware environments. GPGPU and embedded hardware components are integrated in a transparent way. Our software platform supports the development of scalable and efficient parallel software. It includes a lock-free mailbox algorithm with pattern matching facility for message processing. Thorough performance evaluations reveal an extraordinary small memory footprint in realistic application scenarios, while runtime performance not only outperforms existing mature actor implementations, but exceeds the scaling behavior of low-level message passing libraries such as OpenMPI.
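
For readers new to the actor model, here is a minimal Python sketch of the “shared-nothing” idea the abstract describes: each actor owns its state, and other threads can only send it messages. This is not libcppa and not the paper’s design. The class names `Actor` and `CounterActor` are invented for illustration, and the mailbox is Python’s standard `queue.Queue`, which uses locks internally rather than the lock-free algorithm the paper reports.

```python
import queue
import threading


class Actor:
    """Owns private state and reacts only to messages in its mailbox."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        """The only way other threads interact with the actor."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish queued messages, then wait for it."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class CounterActor(Actor):
    """Hypothetical actor that sums the values it is sent."""

    def __init__(self):
        self.total = 0  # touched only by the actor's own thread
        super().__init__()

    def receive(self, message):
        kind, value = message
        if kind == "add":
            self.total += value


def produce(actor, count):
    for _ in range(count):
        actor.send(("add", 1))


counter = CounterActor()
senders = [threading.Thread(target=produce, args=(counter, 1000))
           for _ in range(4)]
for sender in senders:
    sender.start()
for sender in senders:
    sender.join()
counter.stop()
print(counter.total)
```

Four threads send messages concurrently, yet no explicit lock guards `total`, because only the actor’s own thread ever reads or writes it.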

When I read Stroustrup: Why the 35-year-old C++ still dominates ‘real’ dev, I started to post a comment asking why there were no questions about functional programming languages. But the interview is a “puff” piece and not a serious commentary on programming.

Then I ran across this work on implementing actors in C++. Maybe Stroustrup was correct without being aware of it.

Bundled with the C++ library libcppa, available at:

Apologies for Missing Yesterday

Saturday, September 27th, 2014

I have posted every day to this blog except for yesterday and one day when a network failure prevented my posting.

Carol, my wife, fell Thursday at work and broke her left arm just above the wrist. Rather badly. Yesterday we spent all morning trying to coordinate between insurers, the surgeon and hospital.

We drove to the hospital just after lunch yesterday, and so began “hospital” time. Clock time reports the trip took over nine (9) hours but “hospital” time has a quality all its own. I don’t have the words to describe it but if you have ever waited for a loved one in a hospital, you will know what I mean. To others who haven’t had that experience, I hope you never do.

The full arm cast is going to be a nuisance for at least six (6) weeks but so far as we know, the surgery was successful (pins and all that stuff).

I thought about posting late last night but knew that whatever I posted would either be inconsequential and/or incoherent. Not the level of content that I strive for on this blog.

In addition to being Carol’s left arm today, I should find some time for a couple of posts I have been meaning to make. One of which is on actors in C++ that outperform Erlang. Yes! And a couple of other treats that I spotted this week.

Clojure in Unity 3D: Functional Video Game Development

Thursday, September 25th, 2014

Clojure in Unity 3D: Functional Video Game Development by Ramsey Nasser and Tims Gardner.

I had never considered computer games from this perspective:

You have to solve every hard problem in computer science, 60 times a second. Brandon Bloom.

Great presentation, in part because of its focus on demonstrating results. Interested viewers are left to consult the code for the details.

Combines Clojure with Unity (a game engine) that can export to PS4.

Is being enjoyable the primary difference between video games and most program interfaces?

A project to watch!

Useful links:

@timsgardner, Tims Gardner

@ra, Ramsey Nasser

@jplur_, Joseph Parker

Unity (Game Engine) (Windows/Mac OS)

PS: I need to get a PS4 in order to track game development with Clojure. If you want to donate one to that cause, contact me for a shipping address.

I won’t spend countless hours playing games that are not Clojure related. I am juggling enough roles without adding any fantasy (computer-based anyway) ones. 😉

ML Pipelines

Wednesday, September 24th, 2014

ML Pipelines

From the post:

Recently at the AMP Lab, we’ve been focused on building application frameworks on top of the BDAS stack. Projects like GraphX, MLlib/MLI, Shark, and BlinkDB have leveraged the lower layers of the stack to provide interactive analytics at unprecedented scale across a variety of application domains. One of the projects that we have focused on over the last several months we have been calling “ML Pipelines”, an extension of our earlier work on MLlib and is a component of MLbase.

In real-world applications – both academic and industrial – use of a machine learning algorithm is only one component of a predictive analytic workflow. Pre-processing steps and considerations about production deployment must also be taken into account. For example, in text classification, preprocessing steps like n-gram extraction, and TF-IDF feature weighting are often necessary before training of a classification model like an SVM. When it comes time to deploy the model, your system must not only know the SVM weights to apply to input features, but also how to get your raw data into the same format that the model is trained on.

The simple example above is typical of a task like text categorization, but let’s take a look at a typical pipeline for image classification:


This more complicated pipeline, inspired by this paper, is representative of what is done commonly done in practice. More examples can be found in this paper. The pipeline consists of several components. First, relevant features are identified after whitening via K-means. Next, featurization of the input images happens via convolution, rectification, and summarization via pooling. Then, the data is in a format ready to be used by a machine learning algorithm – in this case a simple (but extremely fast) linear solver. Finally, we can apply the model to held-out data to evaluate its effectiveness.

Inspirational isn’t it?

Certainly a project to watch for machine learning in particular but also for data processing pipelines in general.
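As a rough sketch of the text classification workflow the AMP Lab post describes, the Python below chains n-gram extraction, TF-IDF weighting and a linear classifier so that training and deployment share the same preprocessing. It is not the ML Pipelines API: every name here (ngrams, TfIdf, Perceptron, TextPipeline) is made up for illustration, the smoothed IDF formula is just one common variant, and a plain perceptron stands in for the SVM to keep the example short.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, return all 1..n_max word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


class TfIdf:
    """Hypothetical TF-IDF weighter; IDF is learned once, at training time."""

    def fit(self, docs):
        df = Counter()
        for doc in docs:
            df.update(set(doc))          # document frequency per term
        n = len(docs)
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        return self

    def transform(self, doc):
        # Terms never seen during fit() are dropped.
        counts = Counter(t for t in doc if t in self.idf)
        return {t: c * self.idf[t] for t, c in counts.items()}


class Perceptron:
    """Simple linear classifier standing in for the SVM; labels are +1 / -1."""

    def __init__(self):
        self.w, self.b = {}, 0.0

    def score(self, x):
        return self.b + sum(self.w.get(t, 0.0) * v for t, v in x.items())

    def fit(self, xs, ys, epochs=10):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                if y * self.score(x) <= 0:       # misclassified: nudge weights
                    for t, v in x.items():
                        self.w[t] = self.w.get(t, 0.0) + y * v
                    self.b += y
        return self

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1


class TextPipeline:
    """Bundles preprocessing with the model so both are deployed together."""

    def __init__(self):
        self.tfidf, self.model = TfIdf(), Perceptron()

    def featurize(self, text):
        return self.tfidf.transform(ngrams(text))

    def fit(self, texts, labels):
        self.tfidf.fit([ngrams(t) for t in texts])
        self.model.fit([self.featurize(t) for t in texts], labels)
        return self

    def predict(self, text):
        return self.model.predict(self.featurize(text))


texts = ["cheap pills buy now", "buy cheap watches now",
         "meeting notes attached", "lunch meeting tomorrow"]
labels = [1, 1, -1, -1]
pipeline = TextPipeline().fit(texts, labels)
print(pipeline.predict("buy cheap pills"), pipeline.predict("meeting tomorrow"))
```

The point of wrapping everything in one object is the one the post makes: raw text goes in at prediction time and passes through exactly the transformations that were fitted during training.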

I first saw this in a tweet by Peter Bailis.

Twitter open sourced a recommendation algorithm for massive datasets

Wednesday, September 24th, 2014

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.
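To put numbers on the handshake problem from the quote, here is a small Python sketch of the pair count n(n-1)/2. The function name is mine, and the 271,000,000 figure is simply the "271 million" user claim above written out in full.

```python
def handshakes(n):
    """Number of distinct pairs among n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for people in (2, 10, 1_000_000, 271_000_000):
    print(f"{people:>11,} people -> {handshakes(people):,} pairs")
```

At Twitter's claimed user count that is on the order of 3.7 × 10^16 pairs, which is why trimming the candidate list before comparing anything matters.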

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if they have:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal” and not “similar” but the point is that you can arbitrarily decide how “similar” two or more topics must be in order to trigger merging.

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.
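A minimal Python sketch of the contrast, assuming topics are reduced to sets of identifier strings. The Topic class, the pooled Jaccard measure and the threshold values are all my own illustration, not anything the TMDM defines; only tmdm_equal follows the five rules listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical, stripped-down topic item."""
    subject_identifiers: set = field(default_factory=set)
    item_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)
    reified: object = None


def tmdm_equal(a, b):
    """True/false test following the five equality rules quoted above."""
    return bool(
        a.subject_identifiers & b.subject_identifiers
        or a.item_identifiers & b.item_identifiers
        or a.subject_locators & b.subject_locators
        or a.subject_identifiers & b.item_identifiers
        or a.item_identifiers & b.subject_identifiers
        or (a.reified is not None and a.reified == b.reified)
    )


def identifier_similarity(a, b):
    """Assumed relaxation: Jaccard overlap of all identifier strings pooled."""
    ids_a = a.subject_identifiers | a.item_identifiers | a.subject_locators
    ids_b = b.subject_identifiers | b.item_identifiers | b.subject_locators
    union = ids_a | ids_b
    return len(ids_a & ids_b) / len(union) if union else 0.0


def should_merge(a, b, threshold=0.5):
    """Merge only when the similarity score reaches the chosen threshold."""
    return identifier_similarity(a, b) >= threshold


a = Topic(subject_identifiers={"http://example.org/a", "http://example.org/b"})
b = Topic(subject_identifiers={"http://example.org/b", "http://example.org/c"})
print(tmdm_equal(a, b))            # one shared subject identifier is enough
print(should_merge(a, b, 0.5))     # but a 0.5 similarity threshold is not met
```

Any other similarity measure could be dropped in for the Jaccard score; the only change from the TMDM is replacing a yes/no test with a score and a cutoff.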

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute force topic by topic comparison for merging purposes. Topics that cannot meet a merging “threshold” are never considered by the merging routines.

Of course, with the TMDM, merging being either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. 😉

Update: All-pairs similarity via DIMSUM. DIMSUM has been implemented in Spark!

Hewlett Foundation extends CC BY policy to all grantees

Wednesday, September 24th, 2014

Hewlett Foundation extends CC BY policy to all grantees by Timothy Vollmer.

From the post:

Last week the William and Flora Hewlett Foundation announced that it is extending its open licensing policy to require that all content (such as reports, videos, white papers) resulting from project grant funds be licensed under the most recent Creative Commons Attribution (CC BY) license. From the Foundation’s blog post: “We’re making this change because we believe that this kind of broad, open, and free sharing of ideas benefits not just the Hewlett Foundation, but also our grantees, and most important, the people their work is intended to help.” The change is explained in more detail on the foundation’s website.

The foundation had a long-standing policy requiring that recipients of its Open Educational Resources grants license the outputs of those grants; this was instrumental in the creation and growth of the OER field, which continues to flourish and spread. Earlier this year, the license requirement was extended to all Education Program grants, and as restated, the policy will now be rolled out to all project-based grants under any foundation program. The policy is straightforward: it requires that content produced pursuant to a grant be made easily available to the public, on the grantee’s website or otherwise, under the CC BY 4.0 license — unless there is some good reason to use a different license.

For a long time Creative Commons has been interested in promoting open licensing policies within philanthropic grantmaking. We received a grant from the Hewlett Foundation to survey the licensing policies of private foundations, and to work toward increasing the free availability of foundation-supported works. We wrote about the progress of the project in March, and we’ve been maintaining a spreadsheet of foundation IP policies, and a model IP policy.

We urge other foundations and funding bodies to emulate the outstanding leadership demonstrated by the William and Flora Hewlett Foundation and commit to making open licensing an essential component of their grantmaking strategy.

Not only is a wave of big data approaching but it will be more available than data has been at any time in history.

As funders require open access to funded content, arguments for restricted access will simply disappear from even the humanities.

If you want to change behavior, principled arguments won’t get you as far as changing the reward system.

Intelligence Community On/Off The Record

Wednesday, September 24th, 2014

While looking up a particular NSA leak today I discovered:

IC On The Record

Direct access to factual information related to the lawful foreign surveillance activities of the U.S. Intelligence Community.

Created at the direction of the President of the United States and maintained by the Office of the Director of National Intelligence.


IC Off The Record

Direct access to leaked information related to the surveillance activities of the U.S. Intelligence Community and their partners.

IC Off The Record points to IC On The Record but the reverse isn’t true.

When you visit IC On The Record, tweet about IC Off The Record. Help everyone come closer to a full understanding of the intelligence community.

King James Programming

Wednesday, September 24th, 2014

King James Programming

From the webpage:

Posts generated by a Markov chain trained on the King James Bible, Structure and Interpretation of Computer Programs, and Why’s Poignant Guide to Ruby.

A sampling from the main page:

(For six months did Joab remain there with all Israel, until he had utterly destroyed them and their contents with an empty x.)

3:10 And the LORD said unto Moses, Depart, and go up to build the initial instruction list

You may also appreciate the KJP Rejects page:

32:31 Pharaoh shall see them, and shall enjoy her sabbaths, as long as all references to Project Gutenberg are removed.

7:6 Then Jesus went thence, and departed into a mountain to pray, and not to me only, but unto all them that sold and bought in the temple

What texts would you use?

Markov chains have a serious side as well.
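For anyone who wants to try their own texts, here is a minimal word-level Markov chain generator in Python. It is a sketch of the general technique, not the King James Programming code (which I have not seen); the function names, the order parameter and the toy corpus are all mine.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the list of words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Random walk over the chain, starting from a randomly chosen state."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:               # dead end: no observed continuation
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


corpus = "and the lord said unto moses and the lord spake unto aaron"
chain = build_chain(corpus, order=1)
print(generate(chain, length=8, seed=1))
```

To mix sources the way King James Programming does, concatenate the texts before calling build_chain. A higher order gives more coherent but less surprising output.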

In-depth introduction to machine learning in 15 hours of expert videos

Wednesday, September 24th, 2014

In-depth introduction to machine learning in 15 hours of expert videos by Kevin Markham.

From the post:

In January 2014, Stanford University professors Trevor Hastie and Rob Tibshirani (authors of the legendary Elements of Statistical Learning textbook) taught an online course based on their newest textbook, An Introduction to Statistical Learning with Applications in R (ISLR). I found it to be an excellent course in statistical learning (also known as “machine learning”), largely due to the high quality of both the textbook and the video lectures. And as an R user, it was extremely helpful that they included R code to demonstrate most of the techniques described in the book.

If you are new to machine learning (and even if you are not an R user), I highly recommend reading ISLR from cover-to-cover to gain both a theoretical and practical understanding of many important methods for regression and classification. It is available as a free PDF download from the authors’ website.

Kevin provides links to the slides for each chapter and the videos with timings, so you can fit them in as time allows.


I first saw this in a tweet by Christophe Lalanne.

CERMINE: Content ExtRactor and MINEr

Wednesday, September 24th, 2014

CERMINE: Content ExtRactor and MINEr

From the webpage:

CERMINE is a Java library and a web service for extracting metadata and content from scientific articles in born-digital form. The system analyses the content of a PDF file and attempts to extract information such as:

  • Title of the article
  • Journal information (title, etc.)
  • Bibliographic information (volume, issue, page numbers, etc.)
  • Authors and affiliations
  • Keywords
  • Abstract
  • Bibliographic references

CERMINE at Github

I used the following three files for a very subjective test of the online interface:

I am mostly interested in extraction of bibliographic entries and can report that while CERMINE made some mistakes, it is quite useful.

I first saw this in a tweet by Docear.

Tor users could be FBI’s main target if legal power grab succeeds

Tuesday, September 23rd, 2014

Tor users could be FBI’s main target if legal power grab succeeds by Lisa Vaas.

From the post:

The US Department of Justice (DOJ) is proposing a power grab that would make it easier for domestic law enforcement to break into computers of people trying to protect their anonymity via Tor or other anonymizing technologies.

That’s according to a law professor and litigator who deals with constitutional issues that arise in espionage, cybersecurity and counterterrorism prosecutions.

Ahmed Ghappour, a visiting professor at UC Hastings College of the Law, San Francisco, explained the potential ramifications of the legal maneuver in a post published last week.

I dislike government surveillance as much as anyone but let’s get the facts about surveillance straight before debating it.

For example, Lisa says:

…make it easier for domestic law enforcement to break into computers of people trying to protect their anonymity via Tor… (emphasis added)

Certainly gets your attention but I’m with Bill Clinton, it depends on what you mean by “easier.”

If you mean “easier,” as in breaking Tor or other technologies, in a word: NO.

If you mean “easier,” as in issuance of search warrants, YES.

The “…power grab….” concerns re-wording of Rule 41 Search and Seizure of the Federal Rules of Criminal Procedure (hereinafter, Rule 41).

Section (b) of Rule 41 sets out who can issue a search and seizure warrant and just as importantly, where the person or evidence can be located. The present rules of section (b) can be summarized as:

  1. Person or property located within a district
  2. Person or property outside a district, if located within the district when issued but might move before execution of the warrant
  3. Person or property within or outside a district (terrorism)
  4. Person or property to be tracked within a district, outside a district, or both
  5. Person or property located outside a district or state but within (A) US territory, possession, or commonwealth; (diplomatic/consular locations)

(There are other nuances I have omitted in order to focus on location of the person and property to be seized.)

Rule 41 (b) defines where the person or property to be seized may be located.

With that background, consider the proposed amendment to Rule 41:

(6) a magistrate judge with authority in any district where activities related to a crime may have occurred has authority to issue a warrant to use remote access to search electronic storage media and to seize or copy electronically stored information located within or outside the district if:

(A) the district where the media or information is located has been concealed through technological means; or

(B) in an investigation of a violation of 18 U.S.C. Sec. 1030(a)(5), the media are protected computers that have been damaged without authorization and are located in five or more districts.

The issue is whether the same terms of present Rule 41 (b) (3) in terrorism cases should be expanded to other cases where the location of “media or information…has been concealed through technological means.”

Professor Ahmed Ghappour, in Justice Department Proposal Would Massively Expand FBI Extraterritorial Surveillance, is concerned that searches for electronic media at unknown locations will of necessity result in searches of computers located in foreign jurisdictions. No doubt that is the case because to “not know the location of media or information” means just that, you don’t know. Could be on a domestic computer or a foreign one. Unless and until you find the “media or information,” its location will remain unknown.

In the interest of cooperation with foreign law enforcement and some lingering notion of “jurisdiction” of a court being tied to physical boundaries (true historically speaking), Professor Ghappour would resist expanding the same jurisdiction in Rule 41 (b)(3) to non-terrorism crimes under proposed Rule 41 (b)(6)(A).

The essence of the “unknown server location” argument is that United States courts can issue search warrants, if the government can identify the location of a target server, subject to the other provisions of Rule 41. But since Tor prevents discovery of a server location, ipso facto, no search warrant.

To be fair to the government, a physical notion of jurisdiction for search and seizure warrants, as embodied in Rule 41, is a historical artifact and not essential to the Fourth Amendment for U.S. citizens:

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

The government’s often flat-footed response to technology is a common topic of conversation. Here an attempt by government to adapt to modern computer network reality is said to go too far, too fast.

Despite my sympathies being with the hare and not the hounds, I don’t think the law should foster an evidentiary shell game based upon antiquated notions of physical jurisdiction. (Leaving it to the government to procure the information it seeks without assistance from innocent bystanders. See Note 1)

Note 1: I don’t see this as contrary to my position in Resisting Tyranny – Customer-Centric-Cloud (CCCl). The issue there was a subpoena to Microsoft for data held in a foreign server. I think Cloud operators have a fiduciary duty to their customers that is prior and superior to the claims of any particular court. If the FBI can obtain the information on such servers with a warrant, on its own, then it should do so. But courts should not be able to press gang others to assist in local law enforcement activities.

Note 2: You may want to review the Advisory Committee on Criminal Rules, New Orleans, April 7-8, 2014 for background materials on the proposed change to Rule 41. Review the Annotated Constitution chapter on Search and Seizure for Fourth Amendment issues.

Note 3: If you are looking for an amusing example for parsing, try 18 U.S.C. Sec. 1030. Far clearer than any part of the Internal Revenue Code or its regulations but still complicated enough to be amusing.

BANNED BOOKS WEEK 2014: September 21-27

Tuesday, September 23rd, 2014

BANNED BOOKS WEEK 2014: September 21-27

From the webpage:

The ALA [American Library Association] promotes the freedom to choose or the freedom to express one’s opinions even if that opinion might be considered unorthodox or unpopular and stresses the importance of ensuring the availability of those viewpoints to all who wish to read them.

A challenge is an attempt to remove or restrict materials, based upon the objections of a person or group. A banning is the removal of those materials. Challenges do not simply involve a person expressing a point of view; rather, they are an attempt to remove material from the curriculum or library, thereby restricting the access of others. As such, they are a threat to freedom of speech and choice.

The ALA has numerous resources that focus on U.S.-centric issues on banning books.

For a more international perspective, see: List of books banned by governments at Wikipedia. The list of one hundred and seventeen (117) entries there is illustrative and not exhaustive in terms of books banned in any particular country. Check entries for specific countries and/or with government representatives for a specific country if in doubt.

I was surprised to find that Australia banned The Anarchist Cookbook (1971), which on initial publication needed serious editing and now needs revision and updating. (Caveat: I haven’t seen the 2002 revision.) On the other hand, it is one of the few non-sexual titles you will find at: Banned Books in Australia: A Selection.

If you want an erotica reading list, the two hundred and fifty (250) titles banned by Australia are a good starting point. The Aussies omit Catullus for some unknown reason so you will have to pencil him into the list.

Censorship is proof positive of a closed mind.

Do you have a closed mind?

Project Paradox

Monday, September 22nd, 2014

project decisions

Care to name projects and standards that suffered from the project paradox?

I first saw this in a tweet by Tobias Fors.

Algorithms and Data – Example

Monday, September 22nd, 2014

People's Climate

AJ+ was all over the #OurClimate march in New York City.

Let’s be generous and say the march attracted 400,000 people.

At approximately 10:16 AM Eastern time this morning, the world population clock reported a population of 7,262,447,500.

About 0.0055% of the world’s population (400,000 divided by 7,262,447,500, times 100) expressed an opinion on climate change in New York yesterday.

I mention that calculation, disclosing both data and the algorithm, to point out the distortion between the number of people driving policy versus the number of people impacted.
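In the spirit of disclosing both, here is the calculation as a few lines of Python, using only the two figures quoted above (the generous 400,000 marchers and the population clock reading). The function and variable names are mine.

```python
def share_percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return part / whole * 100


marchers = 400_000                  # generous estimate of march attendance
world_population = 7_262_447_500    # population clock reading quoted above

print(f"{share_percent(marchers, world_population):.4f} % of the world's population")
```

Swap in any other attendance estimate and the result scales linearly, so even doubling the crowd leaves the share at roughly a hundredth of one percent.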

Other minority opinions promoted by AJ+ include that of the United States (population: 318,776,000) on what role Iran (population: 77,176,930) should play in the Middle East (population: 395,133,109) and the world (population: 7,262,447,500), on issues such as the Islamic State. BBC News: Islamic State crisis: Kerry says Iran can help defeat IS.

Isn’t that “the tail wagging the dog?”

Is there any wonder why international decision making departs from the common interests of the world’s population?

Hopefully AJ+ will stop beating the drum quite so loudly for minority opinions and seek out more representative ones, even if not conveniently located in New York City.