Archive for the ‘Workflow’ Category

Computational Data Analysis Workflow Systems

Friday, October 6th, 2017

Computational Data Analysis Workflow Systems

An incomplete list of existing workflow systems. As of today, approximately 17:00 EST, 173 systems in no particular order.

I first saw this mentioned in a tweet by Michael R. Crusoe.

One of the many resources found at: Common Workflow Language.

From the webpage:

The Common Workflow Language (CWL) is a specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, Physics, and Chemistry.

You should take a quick look at: Common Workflow Language User Guide to get a feel for CWL.

Try to avoid thinking of CWL as “documenting” your workflow if that is an impediment to using it. That’s a side effect but its main purpose is to make your more effective.

A Reproducible Workflow

Friday, August 26th, 2016

The video is 104 seconds and highly entertaining!

From the description:

Reproducible science not only reduce errors, but speeds up the process of re-runing your analysis and auto-generate updated documents with the results. More info at:

How are you making your data analysis reproducible?


Why Use Make

Thursday, November 12th, 2015

Why Use Make by Mike Bostock.

From the post:

I love Make. You may think of Make as merely a tool for building large binaries or libraries (and it is, almost to a fault), but it’s much more than that. Makefiles are machine-readable documentation that make your workflow reproducible.

To illustrate with a recent example: yesterday Kevin and I needed to update a six-month old graphic on drought to accompany a new article on thin snowpack in the West. The article was already on the homepage, so the clock was ticking to republish with new data as soon as possible.

Shamefully, I hadn’t documented the data-transformation process, and it’s painfully easy to forget details over six months: I had a mess of CSV and GeoJSON data files, but not the exact source URL from the NCDC; I was temporarily confused as to the right Palmer drought metric (Drought Severity Index or Z Index?) and the corresponding categorical thresholds; finally, I had to resurrect the code to calculate drought coverage area.

Despite these challenges, we republished the updated graphic without too much delay. But I was left thinking how much easier it could have been had I simply recorded the process the first time as a makefile. I could have simply typed make in the terminal and be done!

Remember how science has been losing the ability to replicate experiments due to computers? How Computers Broke Science… [Soon To Break Businesses …]

So you are trying to remember and explain to an opponent’s attorney the process you went through in processing data, after about 3 hours of sharp questioning, how clear do you think you will be? Will you really remember every step? The source of every file?

Had you documented your workflow you can read from your Make file and say exactly what happened, in what order and with what sources. You do need to do that every time if you want anyone to believe the make file represents what actually happened.

You will be on more solid ground than trying to remember which files, the dates on those files, their content, etc.

Mike concludes his post with:

So do your future self and coworkers a favor, and use Make!

Let’s modify that to read:

So do your future self, coworkers, and lawyer a favor, and use Make!

I first saw this in a tweet by Christophe Lalanne.


Saturday, January 3rd, 2015


From the webpage:


UI testing with decision trees. An experimental approach to UI testing, based on Decision Trees. A NodeJS wrapper for PhantomJS, CasperJS and PhantomCSS, PhantomFlow enables a fluent way of describing user flows in code whilst generating structured tree data for visualisation.

PhantomFlow Report: Test suite overview with radial Dendrogram and pie visualisation

The above visualisation is a real-world example, showing the complexity of visual testing at Huddle.


  • Enable a more expressive way of describing user interaction paths within tests
  • Fluently communicate UI complexity to stakeholders and team members through generated visualisations
  • Support TDD and BDD for web applications and responsive web sites
  • Provide a fast feedback loop for UI testing
  • Raise profile of visual regression testing
  • Support misual regression workflows, quick inspection & rebasing via UI.

If you are planning on being more user focused (translation: successful in gaining users) this year, PhantomFlow may be the tool for you!

It strikes me as a tool that can present the workflow differently than you are accustomed to seeing it. I find that helpful because I will overlook potential difficulties because I already know how some function works.

The red button labeled STOP! may mean to a user that the application stops. Not that the decryption key on the hard drive is trashed to prevent decryption even if I give up the key under torture. That may not occur to them. If that happens on their hard drive, they may be rather miffed.

COSMOS: Python library for massively parallel workflows

Friday, August 1st, 2014

COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385 )


Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at and

Contact: or

Supplementary information: Supplementary data are available at Bioinformatics online.

A very good abstract but for pitching purposes, I would have chosen the first paragraph of the introduction:

The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.

That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ details but the need to efficiently and effectively define them is a common problem.

Human Sense Making

Saturday, May 3rd, 2014

Scientists’ sense making when hypothesizing about disease mechanisms from expression data and their needs for visualization support by Barbara Mirel and Carsten Görg.


A common class of biomedical analysis is to explore expression data from high throughput experiments for the purpose of uncovering functional relationships that can lead to a hypothesis about mechanisms of a disease. We call this analysis expression driven, -omics hypothesizing. In it, scientists use interactive data visualizations and read deeply in the research literature. Little is known, however, about the actual flow of reasoning and behaviors (sense making) that scientists enact in this analysis, end-to-end. Understanding this flow is important because if bioinformatics tools are to be truly useful they must support it. Sense making models of visual analytics in other domains have been developed and used to inform the design of useful and usable tools. We believe they would be helpful in bioinformatics. To characterize the sense making involved in expression-driven, -omics hypothesizing, we conducted an in-depth observational study of one scientist as she engaged in this analysis over six months. From findings, we abstracted a preliminary sense making model. Here we describe its stages and suggest guidelines for developing visualization tools that we derived from this case. A single case cannot be generalized. But we offer our findings, sense making model and case-based tool guidelines as a first step toward increasing interest and further research in the bioinformatics field on scientists’ analytical workflows and their implications for tool design.

From the introduction:

In other domains, improvements in data visualization designs have relied on models of analysts’ actual sense making for a complex analysis [2]. A sense making model captures analysts’ cumulative, looped (not linear) “process [es] of searching for a representation and encoding data in that representation to answer task-specific questions” relevant to an open-ended problem [3]: 269. As an end-to-end flow of application-level tasks, a sense making model may portray and categorize analytical intentions, associated tasks, corresponding moves and strategies, informational inputs and outputs, and progression and iteration over time. The importance of sense making models is twofold: (1) If an analytical problem is poorly understood developers are likely to design for the wrong questions, and tool utility suffers; and (2) if developers do not have a holistic understanding of the entire analytical process, developed tools may be useful for one specific part of the process but will not integrate effectively in the overall workflow [4,5].

As the authors admit, one case isn’t enough to be generalized but their methodology, with its focus on the work flow of a scientist, is a refreshing break from imagined and/or “ideal” work flows for scientists.

Until now semantic software has followed someone’s projection of an “ideal” work flow.

The next generation of semantic software should follow the actual work flows of people working with their data.

I first saw this in a tweet by Neil Saunders

Data Workflows for Machine Learning:

Sunday, February 2nd, 2014

Data Workflows for Machine Learning: by Paco Nathan.

Excellent presentation on data workflows, at least if you think of them as being primarily from one machine or process to another. Hence the closing emphasis on PMML – Predictive Model Markup Language.

Although Paco alludes to the organizational/social side of data flow, that gets lost in the thicket of technical options.

For example, at slide 25, Paco talks about using Cascading to combing the workflow from multiple departments into an integrated app.

Which I am certain is withing the capabilities of Cascading, but that does not address the social or organizational difficulties of getting that to happen.

One of the main problems in the recent U.S. health care exchange debacle was the interchange of data between two of the vendors.

I suppose in recent management lingo, no one took “ownership” of that problem. 😉

Data interchange isn’t new technical territory but failure to cooperate is as deadly to a data processing project as a melting CPU.

The technical side of data workflows is necessary for success, but so is avoiding any beaver dams across the data stream.

Dealt with any beavers lately?

Introducing Drake, a kind of ‘make for data’

Sunday, April 28th, 2013

Introducing Drake, a kind of ‘make for data’ by Aaron Crow.

From the post:

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists“, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

Not hard to see how Drake could fit into a topic map authoring work flow.

Project Falcon…

Wednesday, April 3rd, 2013

Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source by Venkatesh Seetharam.

From the post:

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

Falcon workflow

I am certain a topic map based workflow solution could be created.

However, using a solution being promoted by others removes one thing from the topic map “to do” list.

Not to mention giving topic maps an introduction to other communities.

Drake [Data Processing Workflow]

Wednesday, March 27th, 2013


From the webpage:

Drake is a simple-to-use, extensible, text-based data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs and Drake automatically resolves their dependencies and calculates:

  • which commands to execute (based on file timestamps)
  • in what order to execute the commands (based on dependencies)

Drake is similar to GNU Make, but designed especially for data workflow management. It has HDFS support, allows multiple inputs and outputs, and includes a host of features designed to help you bring sanity to your otherwise chaotic data processing workflows.

The video demonstrating Drake is quite good.

Granting my opinion may be influenced by the use of awk in the early examples. 😉

Definitely a tool for scripted production of topic maps.

I first saw this in a tweet by Chris Diehl.

The Kepler Project

Wednesday, October 19th, 2011

The Kepler Project

From the website:

The Kepler Project is dedicated to furthering and supporting the capabilities, use, and awareness of the free and open source, scientific workflow application, Kepler. Kepler is designed to help scien­tists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”—an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and compo­nents developed by the scientific community to address common needs.

The Kepler software is developed and maintained by the cross-project Kepler collaboration, which is led by a team consisting of several of the key institutions that originated the project: UC Davis, UC Santa Barbara, and UC San Diego. Primary responsibility for achieving the goals of the Kepler Project reside with the Leadership Team, which works to assure the long-term technical and financial viability of Kepler by making strategic decisions on behalf of the Kepler user community, as well as providing an official and durable point-of-contact to articulate and represent the interests of the Kepler Project and the Kepler software application. Details about how to get more involved with the Kepler Project can be found in the developer section of this website.

Kepler is a java-based application that is maintained for the Windows, OSX, and Linux operating systems. The Kepler Project supports the official code-base for Kepler development, as well as provides materials and mechanisms for learning how to use Kepler, sharing experiences with other workflow developers, reporting bugs, suggesting enhancements, etc.

I found this from an announcement of an NSF grant for a bioKepler project.


  1. Review the Kepler project and prepare a short summary of it. (3 – 5 pages)
  2. Workflow by its very nature involves subjects moving from one process or user to another. How is that handled by Kepler in general?
  3. Can you use intersect the workflow of Kepler with other workflow management software? If not, why not? (research project)