Archive for the ‘Flow-Based Programming (FBP)’ Category

When All The Program’s A Graph…

Thursday, February 14th, 2013

When All The Program’s A Graph – Prismatic’s Plumbing Library

From the post:

At some point as a programmer you might have the insight/fear that all programming is just doing stuff to other stuff.

Then you may observe after coding the same stuff over again that stuff in a program often takes the form of interacting patterns of flows.

Then you may think hey, a program isn’t only useful for coding datastructures, but a program is a kind of datastructure and that with a meta level jump you could program a program in terms of flows over data and flow over other flows.

That’s the kind of stuff Prismatic is making available in the Graph extension to their plumbing package (code examples), which is described in an excellent post: Graph: Abstractions for Structured Computation.

Formalizing the structure of FP code. Who could argue with that?

Read the first post as a quick introduction to the second.

Crunch for Dummies

Friday, December 9th, 2011

Crunch for Dummies by Brock Noland

From the post:

This guide is intended to be an introduction to Crunch.


Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:

Interesting coverage of Crunch.

I don’t know that I agree with the characterization:

… using Hadoop …. require[s] learning a complex process called MapReduce or a higher level language such as Apache Hive or Apache Pig.

True, to use Hadoop means learning MapReduce or Hive or PIg but I don’t think of them as being all that complex. Besides, once you have learned them, the benefits are considerable.

But, to each his own.

You might also be interested in: Introducing Crunch: Easy MapReduce Pipelines for Hadoop.

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Tuesday, October 11th, 2011

Introducing Crunch: Easy MapReduce Pipelines for Hadoop

Josh Wills writes:

As a data scientist at Cloudera, I work with customers across a wide range of industries that use Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Pig and Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.

As Hadoop has spread from web companies to other industries, the variety of data that is stored in HDFS has expanded dramatically. Hadoop clusters are being used to process satellite images, time series data, audio files, and seismograms. These formats are not a natural fit for the data schemas imposed by Pig and Hive, in the same way that structured binary data in a relational database can be a bit awkward to work with. For these use cases, we either end up writing large, custom libraries of user-defined functions in Pig or Hive, or simply give up on our high-level tools and go back to writing MapReduces in Java. Either of these options is a serious drain on developer productivity.

Today, we’re pleased to introduce Crunch, a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava, focusing on a small set of simple primitive operations and lightweight user-defined functions that can be combined to create complex, multi-stage pipelines. At runtime, Crunch compiles the pipeline into a sequence of MapReduce jobs and manages their execution.

Sounds like DataFlow Programming… or Flow-Based Programming (FBP) to me. In which case the claim that:

It’s just Java. Crunch shares a core philosophical belief with Google’s FlumeJava: novelty is the enemy of adoption.

must be true, as FBP is over forty years old now. I doubt programmers involved in Crunch would be aware of it. Programming history started with their first programming language, at least for them.

From a vendor perspective, I would turn the phrase a bit to read: novelty is the enemy of market/mind share.

Unless you are a startup, in which case, novelty is good until you reach critical mass and then novelty loses its luster.

Unnecessary novelty, like new web programming languages for their own sake, can also be a bid for market/mind share.

Interesting to see both within days of each other.

Dataflow Programming:…

Monday, October 10th, 2011

Dataflow Programming: Handling Huge Data Loads Without Adding Complexity by Jim Falgout.

From the post:

Because the dataflow operators in a graph work in parallel, the model allows overlapping I/O operations with computation. This is a “whole application” approach to parallelization as opposed to many thread-oriented performance frameworks that focus on hot sections of code such as for loops. This addresses a key problem in processing “big data” for today’s many-core processors: feeding data fast enough to the processors.

While dataflow does this, it scales down easily as well. This distinguishes it from technologies, such as Hadoop and to a lesser extent, Map Reduce, which don’t scale downward well due to their innate complexity.

We’ve discussed how a dataflow architecture exploits multicore. These same principles can be applied to multi-node clusters by extending dataflow queues over networks with a dataflow graph executed on multiple systems in parallel. The compositional model of building dataflow graphs allows for replication of pieces of the graph across multiple nodes. Scaling out extends the reach of dataflow to solve large data problems.

Read the first comment. By J.P. Morrison, author of Flow Based Programming, the inspiration for Pipes. (Shamelessly repeated from a post by Marko Rodriguez on the gremlin-users list, Achim first noticed the article.)

Be aware that Amazon lists the Kindle edition for \$29.00 and a hardback edition for \$69.00. Sadly one reader reports the book has no index?