Archive for the ‘Data Pipelines’ Category

The Data Engineering Ecosystem: An Interactive Map

Saturday, March 14th, 2015

The Data Engineering Ecosystem: An Interactive Map by David Drummond and John Joo.

From the post:

Companies, non-profit organizations, and governments are all starting to realize the huge value that data can provide to customers, decision makers, and concerned citizens. What is often neglected is the amount of engineering required to make that data accessible. Simply using SQL is no longer an option for large, unstructured, or real-time data. Building a system that makes data usable becomes a monumental challenge for data engineers.

There is no plug and play solution that solves every use case. A data pipeline meant for serving ads will look very different from a data pipeline meant for retail analytics. Since there are unlimited permutations of open-source technologies that can be cobbled together, it can be overwhelming when you first encounter them. What do all these tools do and how do they fit into the ecosystem?

Insight Data Engineering Fellows face these same questions when they begin working on their data pipelines. Fortunately, after several iterations of the Insight Data Engineering Program, we have developed this framework for visualizing a typical pipeline and the various data engineering tools. Along with the framework, we have included a set of tools for each category in the interactive map.

This looks quite handy if you are studying for a certification test and need to know the components and a brief bit about each one.

For engineering purposes, it would be even better if you could connect the pieces together and then map the data flows through the pipeline. That is: where did the data previously held in table X go during each step, and what operations were performed on it? Not to mention being able to track an individual datum through the process.

Is there a tool I have overlooked that allows that type of insight into a data pipeline? With subject identities, of course, for the various subjects along the way.
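I don't know of such a tool off-hand, but the kind of record-level lineage I have in mind can be sketched in a few lines. The following is a minimal, hypothetical illustration, not the API of any tool mentioned above: the `Lineage` class and its method names are my own invention. Each transformation records which inputs and which operation produced its output, so any datum can be traced back through the pipeline.

```python
# Minimal sketch of record-level lineage tracking in a pipeline.
# All names here are hypothetical illustrations, not a real tool's API.

class Lineage:
    """Records, for every derived value, which inputs and operation produced it."""

    def __init__(self):
        self.history = {}  # output id -> (operation name, input ids)

    def apply(self, op_name, func, *inputs):
        """Run func on (id, value) pairs and remember the provenance of the result."""
        result = func(*[value for _, value in inputs])
        out_id = f"{op_name}:{len(self.history)}"
        self.history[out_id] = (op_name, [input_id for input_id, _ in inputs])
        return out_id, result

    def trace(self, out_id):
        """Walk backwards from a datum to the raw inputs that produced it."""
        if out_id not in self.history:
            return [out_id]  # a raw input, e.g. a row in table X
        op, inputs = self.history[out_id]
        lineage = [f"{out_id} <- {op}"]
        for input_id in inputs:
            lineage.extend(self.trace(input_id))
        return lineage


lin = Lineage()
joined_id, joined = lin.apply("join", lambda a, b: a + b,
                              ("table_X:row42", [1, 2]), ("table_Y:row7", [3]))
total_id, total = lin.apply("sum", sum, (joined_id, joined))
print(lin.trace(total_id))
# Shows the sum came from the join, which came from rows in tables X and Y.
```

The point of the sketch is only that provenance has to be recorded at the moment each operation runs; bolting it on afterwards is what makes most pipelines opaque.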

MemSQL releases a tool to easily ship big data into its database

Thursday, January 1st, 2015

MemSQL releases a tool to easily ship big data into its database by Jordan Novet.

From the post:

Like other companies pushing databases, San Francisco startup MemSQL wants to solve low-level problems, such as easily importing data from critical sources. Today MemSQL is acting on that impulse by releasing a tool to send data from the S3 storage service on the Amazon Web Services cloud and from the Hadoop open-source file system into its proprietary in-memory SQL database — or the open-source MySQL database.

Engineers can try out the new tool, named MemSQL Loader, today, now that it’s been released under an open-source MIT license.

The existing “LOAD DATA” command in MemSQL and MySQL can bring data in, although it has its shortcomings, as Wayne Song, a software engineer at the startup, wrote in a blog post today. Song and his colleagues ran into those snags and started coding.
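For context, the "LOAD DATA" command the post mentions loads one file per statement, so loading many S3 or HDFS objects means generating and issuing a statement for each file, which is roughly the chore a loader tool automates. A hedged sketch of that chore (the table and file names are made up for illustration and are not taken from the MemSQL Loader source):

```python
# Sketch: generating one MySQL/MemSQL LOAD DATA statement per input file.
# Table and file names are illustrative only.

def load_data_stmt(path, table="events", delimiter=","):
    """Build a LOAD DATA statement for one local file."""
    return (
        f"LOAD DATA INFILE '{path}' "
        f"INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{delimiter}'"
    )

files = ["/data/part-00000.csv", "/data/part-00001.csv"]
statements = [load_data_stmt(f) for f in files]
for stmt in statements:
    print(stmt)
```

Each statement would then be executed over a database connection; retries, parallelism, and de-duplication across files are exactly the "snags" a dedicated loader has to handle.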

How very cool!

Not every database project seeks to “easily import… data from critical sources,” but I am very glad to see MemSQL take up the challenge.

Reducing the friction between data stores and tools will make data pipelines more robust, reducing the amount of time spent troubleshooting routine data traffic issues and increasing the time spent on the analysis that fuels your ROI from data science.

True enough, if you want to make ASCII importing a task that requires custom assistance from your staff, that is one business model. On the whole, I would not say it is a very viable one, particularly with more production-minded folks like MemSQL around.

What database are you going to extend MemSQL Loader to support?

BigDataScript: a scripting language for data pipelines

Friday, December 19th, 2014

BigDataScript: a scripting language for data pipelines by Pablo Cingolani, Rob Sladek, and Mathieu Blanchette.

Abstract:

Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.
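The "lazy processing" idea in the abstract, re-running a pipeline step only when its output is missing or stale relative to its inputs, is the familiar make-style dependency check that lets a pipeline recover from a crash by skipping finished steps. A conceptual illustration in Python (not BDS syntax, and the function names are mine):

```python
import os

def needs_run(inputs, output):
    """Make-style staleness check: a step must run if its output file
    is missing or older than any of its input files. On a re-run after
    a failure, steps whose outputs survived are skipped."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def run_step(name, inputs, output, action):
    """Execute one pipeline step lazily."""
    if needs_run(inputs, output):
        print(f"running {name}")
        action()
    else:
        print(f"skipping {name} (up to date)")
```

Chaining such steps, where each step's output is the next step's input, gives you the restartable pipeline behavior the abstract describes, without saying anything about the hardware it runs on.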

Availability and implementation: BigDataScript is available under open-source license at http://pcingola.github.io/BigDataScript.

How would you compare this pipeline proposal to: XProc 2.0: An XML Pipeline Language?

I prefer XML solutions because I can reliably point to an element or attribute to endow it with explicit semantics.

While explicit semantics is my hobby horse, it may not be yours. Curious how you view this specialized language for bioinformatics pipelines?

I first saw this in a tweet by Pierre Lindenbaum.