COSMOS: Python library for massively parallel workflows by Erik Gafni, et al. (Bioinformatics (2014) doi: 10.1093/bioinformatics/btu385 )
Abstract:
Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.
Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.
Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
A very good abstract but for pitching purposes, I would have chosen the first paragraph of the introduction:
The growing deluge of data from next-generation sequencers leads to analyses lasting hundreds or thousands of compute hours per specimen, requiring massive computing clusters or cloud infrastructure. Existing computational tools like Pegasus (Deelman et al., 2005) and more recent efforts like Galaxy (Goecks et al., 2010) and Bpipe (Sadedin et al., 2012) allow the creation and execution of complex workflows. However, few projects have succeeded in describing complicated workflows in a simple, but powerful, language that generalizes to thousands of input files; fewer still are able to deploy workflows onto distributed resource management systems (DRMs) such as Platform Load Sharing Facility (LSF) or Sun Grid Engine that stitch together clusters of thousands of compute cores. Here we describe COSMOS, a Python library developed to address these and other needs.
That paragraph highlights the bioinformatics aspects of COSMOS but also hints at a language that might be adapted to other “massively parallel workflows.” Workflows may differ details but the need to efficiently and effectively define them is a common problem.