Introducing Drake, a kind of ‘make for data’, by Aaron Crow.
From the post:
Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:
- a multitude of steps, with complicated dependencies
- code and input can change frequently – it’s tiring and error-prone to figure out what needs to be rebuilt
- inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability
Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some of Make’s limitations, for example its assumption that all data is local.
We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflows.
A really nice introduction to Drake, with a simple example and pointers to more complete resources.
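For a flavor of what this looks like in practice, here is a minimal sketch of a Drake workflow file, following the step syntax shown in Drake’s README (output, then inputs after a `<-` arrow, then indented shell commands); the file names and commands are illustrative, not from the post:

```
; Drake rebuilds a target only when its inputs have changed,
; much like make. Shell is the default protocol for commands.

; Step 1: pull the error lines out of a raw log
errors.txt <- raw.log
  grep error $INPUT > $OUTPUT

; Step 2: count them; depends on step 1's output
error-count.txt <- errors.txt
  wc -l < $INPUT > $OUTPUT
```

Running `drake` in the workflow’s directory should then execute only the steps whose outputs are missing or stale.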
Not hard to see how Drake could fit into a topic map authoring workflow.
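As a purely hypothetical sketch (the converter script and file names below are invented for illustration), topic map generation could slot in as just another Drake step, so the map is rebuilt only when the extracted data actually changes:

```
; Hypothetical: regenerate the XTM topic map when the extracted
; entities change; csv2xtm.py is an invented placeholder converter.
topics.xtm <- entities.csv
  python csv2xtm.py $INPUT > $OUTPUT
```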