Archive for the ‘Workflow’ Category

Introducing Drake, a kind of ‘make for data’

Sunday, April 28th, 2013

Introducing Drake, a kind of ‘make for data’ by Aaron Crow.

From the post:

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

Not hard to see how Drake could fit into a topic map authoring workflow.

Project Falcon…

Wednesday, April 3rd, 2013

Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source by Venkatesh Seetharam.

From the post:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

[Image: Falcon workflow]

I am certain a topic map based workflow solution could be created.

However, using a solution being promoted by others removes one thing from the topic map “to do” list.

Not to mention giving topic maps an introduction to other communities.

Drake [Data Processing Workflow]

Wednesday, March 27th, 2013

Drake

From the webpage:

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.
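The two calculations in that description (what to run, and in what order) can be sketched in a few lines of Python. This is not Drake's implementation or its workflow file format. The `plan` function, the `steps` mapping, and the `mtimes` dictionary are hypothetical names made up for illustration, timestamps are passed in as plain numbers rather than read from disk, and I am assuming a step is also rebuilt whenever one of its inputs is being rebuilt.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan(steps, mtimes):
    """Return the outputs that need rebuilding, in dependency order.

    steps:  {output: [inputs]}  (hypothetical structure, not Drake's format)
    mtimes: {path: modification time}; a missing key means the file is absent
    """
    # TopologicalSorter takes {node: predecessors}, so inputs come out first.
    order = TopologicalSorter(steps).static_order()
    stale = []
    for target in order:
        if target not in steps:
            continue  # a raw input: nothing to build
        inputs = steps[target]
        out_time = mtimes.get(target)
        needs_build = (
            out_time is None                       # output does not exist yet
            or any(i in stale for i in inputs)     # an input is being rebuilt (assumption)
            or any(mtimes.get(i, float("-inf")) > out_time for i in inputs)  # input is newer
        )
        if needs_build:
            stale.append(target)
    return stale


# Hypothetical two-step workflow: raw.csv -> clean.csv -> report.txt
steps = {"clean.csv": ["raw.csv"], "report.txt": ["clean.csv"]}

# raw.csv changed after clean.csv was built, so both steps must run again.
print(plan(steps, {"raw.csv": 10, "clean.csv": 5, "report.txt": 20}))
```

With every output newer than its inputs, `plan` returns an empty list, which is the "nothing to do" case familiar from Make.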

The video demonstrating Drake is quite good.

Granted, my opinion may be influenced by the use of awk in the early examples. ;-)

Definitely a tool for scripted production of topic maps.

I first saw this in a tweet by Chris Diehl.

The Kepler Project

Wednesday, October 19th, 2011

The Kepler Project

From the website:

The Kepler Project is dedicated to furthering and supporting the capabilities, use, and awareness of the free and open source, scientific workflow application, Kepler. Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”—an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

The Kepler software is developed and maintained by the cross-project Kepler collaboration, which is led by a team consisting of several of the key institutions that originated the project: UC Davis, UC Santa Barbara, and UC San Diego. Primary responsibility for achieving the goals of the Kepler Project reside with the Leadership Team, which works to assure the long-term technical and financial viability of Kepler by making strategic decisions on behalf of the Kepler user community, as well as providing an official and durable point-of-contact to articulate and represent the interests of the Kepler Project and the Kepler software application. Details about how to get more involved with the Kepler Project can be found in the developer section of this website.

Kepler is a java-based application that is maintained for the Windows, OSX, and Linux operating systems. The Kepler Project supports the official code-base for Kepler development, as well as provides materials and mechanisms for learning how to use Kepler, sharing experiences with other workflow developers, reporting bugs, suggesting enhancements, etc.

I found this through an announcement of an NSF grant for a bioKepler project.

Questions:

  1. Review the Kepler project and prepare a short summary of it. (3 – 5 pages)
  2. Workflow by its very nature involves subjects moving from one process or user to another. How is that handled by Kepler in general?
  3. Can you intersect the workflow of Kepler with other workflow management software? If not, why not? (research project)