Data Workflows for Machine Learning by Paco Nathan.
Excellent presentation on data workflows, at least if you think of them as being primarily from one machine or process to another. Hence the closing emphasis on PMML – Predictive Model Markup Language.
Although Paco alludes to the organizational/social side of data flow, that gets lost in the thicket of technical options.
For example, at slide 25, Paco talks about using Cascading to combine the workflows from multiple departments into an integrated app.
I am certain that is within the capabilities of Cascading, but it does not address the social or organizational difficulties of getting that to happen.
One of the main problems in the recent U.S. health care exchange debacle was the interchange of data between two of the vendors.
I suppose in recent management lingo, no one took “ownership” of that problem. 😉
Data interchange isn’t new technical territory, but failure to cooperate is as deadly to a data processing project as a melting CPU.
The technical side of data workflows is necessary for success, but so is avoiding any beaver dams across the data stream.
Dealt with any beavers lately?