Archive for the ‘Cascading’ Category

Availability Cascades [Activists Take Note, Big Data Project?]

Saturday, February 25th, 2017

Availability Cascades and Risk Regulation by Timur Kuran and Cass R. Sunstein, Stanford Law Review, Vol. 51, No. 4, 1999, U of Chicago, Public Law Working Paper No. 181, U of Chicago Law & Economics, Olin Working Paper No. 384.


An availability cascade is a self-reinforcing process of collective belief formation by which an expressed perception triggers a chain reaction that gives the perception of increasing plausibility through its rising availability in public discourse. The driving mechanism involves a combination of informational and reputational motives: Individuals endorse the perception partly by learning from the apparent beliefs of others and partly by distorting their public responses in the interest of maintaining social acceptance. Availability entrepreneurs – activists who manipulate the content of public discourse – strive to trigger availability cascades likely to advance their agendas. Their availability campaigns may yield social benefits, but sometimes they bring harm, which suggests a need for safeguards. Focusing on the role of mass pressures in the regulation of risks associated with production, consumption, and the environment, Professor Timur Kuran and Cass R. Sunstein analyze availability cascades and suggest reforms to alleviate their potential hazards. Their proposals include new governmental structures designed to give civil servants better insulation against mass demands for regulatory change and an easily accessible scientific database to reduce people’s dependence on popular (mis)perceptions.

Not recent, 1999, but a useful starting point for the study of availability cascades.

The authors want to insulate civil servants where I want to exploit availability cascades to drive their responses but that’a question of perspective and not practice.

Google Scholar reports 928 citations of Availability Cascades and Risk Regulation, so it has had an impact on the literature.

However, availability cascades are not a recipe science but Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg, especially chapters 16 and 17, provide a background for developing such insights.

I started to suggest this would make a great big data project but big data projects are limited to where you have, well, big data. Certainly have that with Facebook, Twitter, etc., but that leaves a lot of the world’s population and social activity on the table.

That is to avoid junk results, you would need survey instruments to track any chain reactions outside of the bots that dominate social media.

Very high end advertising, which still misses with alarming regularity, would be a good place to look for tips on availability cascades. They have a profit motive to keep them interested.

PredictionIO Guide

Sunday, October 20th, 2013

PredictionIO Guide

From the webpage:

PredictionIO is an open source Machine Learning Server. It empowers programmers and data engineers to build smart applications. With PredictionIO, you can add the following features to your apps instantly:

  • predict user behaviors
  • offer personalized video, news, deals, ads and job openings
  • help users to discover interesting events, documents, apps and restaurants
  • provide impressive match-making services
  • and more….

PredictionIO is built on top of solid open source technology. We support Hadoop, Mahout, Cascading and Scalding natively.

PredictionIO looks interesting in general but especially its Item Similarity Engine.

From the Item Similarity: Overview:

People who like this may also like….

This engine tries to suggest N items that are similar to a targeted item. By being ‘similar’, it does not necessarily mean that the two items look alike, nor they share similar attributes. The definition of similarity is independently defined by each algorithm and is usually calculated by a distance function. The built-in algorithms assume that similarity between two items means the likelihood any user would like (or buy, view etc) both of them.

The example that comes to mind is merging all “shoes” from any store and using the resulting price “occurrences” to create a price range and average for each store.

Cascading and Scalding

Tuesday, May 28th, 2013

Cascading and Scalding by Danny Bickson.

Danny has posted some links for Cascading and Scalding, alternatives to Pig.

I continue to be curious about documentation of semantics for Pig scripts or any of its alternatives.

Or for that matter, in any medium to large-sized mapreduce shop, how do you index those semantics?

A Newspaper Clipping Service with Cascading

Friday, April 5th, 2013

A Newspaper Clipping Service with Cascading by Sujit Pal.

From the post:

This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.

The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.

The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.

Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.

Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.

Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it.

Where “home” is the article on the front page.

Not only more prose but a web of connections to material you might not even know existed.

For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:

Under Lao norms, law enforcement activity is not done after dark, (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)

Could be important information, depending upon your reasons for being in Laos.

“Functional Programming for…Big Data”

Wednesday, March 20th, 2013

“Functional Programming for optimization problems in Big Data” by Paco Nathan.

Interesting slide deck, even if it doesn’t start with high drama. 😉


  1. Data Science
  2. Functional Programming
  3. Workflow Abstraction
  4. Typical Use Cases
  5. Open Data Example

The reading list mentioned in these slides makes a nice self-review course in data science.

The Open Data Example is for Palo Alto but you can substitute a city with open data closer to home.

Sinking Data to Neo4j from Hadoop with Cascading

Tuesday, March 12th, 2013

Sinking Data to Neo4j from Hadoop with Cascading by Paul Ingles.

From the post:

Recently, I worked with a colleague (Paul Lam, aka @Quantisan on building a connector library to let Cascading interoperate with Neo4j: cascading.neo4j. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs and we wanted an easy way to flow our existing data on Hadoop into Neo4j.

The data processing pipeline we’ve been growing at is built around Cascalog, Hive, Hadoop and Kafka.

Once the data has been aggregated and stored a lot of our ETL is performed upon Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive. This layer is built upon our long-term data storage system: Hadoop; this, all combined, lets us store high-resolution data immutably at a much lower cost than uSwitch’s previous platform.

As Paul notes later in his post, this isn’t a fast solution, about 20,000 nodes a second.

But if that fits your requirements, could be a good place to start.

Cascading into Hadoop with SQL

Wednesday, February 20th, 2013

Cascading into Hadoop with SQL by Nicole Hemsoth.

From the post:

Today Concurrent, the company behind the Cascading Hadoop abstraction framework, announced a new trick to help developers tame the elephant.

The company, which is focused on simplifying Hadoop, has introduced a SQL parser that sits on top of Cascading with a JDBC Interface. Concurrent says that they’ll be pushing out over the next couple of weeks with hopes that developers will take it under their wing and support the project.

According to the company’s CTO and founder, Chris Wensel, the goal is to get the commuity to rally around a new way to let non-programmers make use of data that’s locked in Hadoop clusters and let them more easily move applications onto Hadoop clusters.

The newly-announced approach to extending the abstraction is called Lingual, which is aimed at putting Hadoop within closer sights for those familiar with SQL, JDBC and traditional BI tools. It provides what the company calls “true SQL for Cascading and Hadoop” to enable easier creation and running of applications on Hadoop and again, to tap into that growing pool of Hadoop-seekers who lack the expertise to back mission-critical apps on the platform.

Wensel says that Lingual’s goal is to provide an ANSI-standard SQL interface that is designed to play well with all of the big name distros running on site or in cloud environments. This will allow a “cut and paste” capability for existing ANSI SQL code from traditional data warehouses so users can access data that’s locked away on a Hadoop cluster. It’s also possible to query and export data from Hadoop right into a wide range of BI tools.

Another example of meeting a large community of uses where they are, not where you would like for them to be.

Targeting a market that already exists is easier than building a new one from the ground up.

Apache Crunch

Saturday, January 5th, 2013

Apache Crunch: A Java Library for Easier MapReduce Programming by Josh Wills.

From the post:

Apache Crunch (incubating) is a Java library for creating MapReduce pipelines that is based on Google’s FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig, and Cascading, Crunch provides a library of patterns to implement common tasks like joining data, performing aggregations, and sorting records. Unlike those other tools, Crunch does not impose a single data type that all of its inputs must conform to. Instead, Crunch uses a customizable type system that is flexible enough to work directly with complex data such as time series, HDF5 files, Apache HBase tables, and serialized objects like protocol buffers or Avro records.

Crunch does not try to discourage developers from thinking in MapReduce, but it does try to make thinking in MapReduce easier to do. MapReduce, for all of its virtues, is the wrong level of abstraction for many problems: most interesting computations are made up of multiple MapReduce jobs, and it is often the case that we need to compose logically independent operations (e.g., data filtering, data projection, data transformation) into a single physical MapReduce job for performance reasons.

Essentially, Crunch is designed to be a thin veneer on top of MapReduce — with the intention being not to diminish MapReduce’s power (or the developer’s access to the MapReduce APIs) but rather to make it easy to work at the right level of abstraction for the problem at hand.

Although Crunch is reminiscent of the venerable Cascading API, their respective data models are very different: one simple common-sense summary would be that folks who think about problems as data flows prefer Crunch and Pig, and people who think in terms of SQL-style joins prefer Cascading and Hive.

Brief overview of Crunch and an example (word count) application.

Definitely a candidate for your “big data” tool belt.

Cascading 2.1

Thursday, November 8th, 2012

Cascading 2.1

Cascading 2.1 was released October 30, 2012. (Apologies for missing the release.)

If you don’t know Cascading, it self describes as:

Big Data Application Development

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

Data Processing API

At it’s core, Cascading is a rich Java API for defining complex data flows and creating sophisticated data oriented frameworks. These frameworks can be Maven compatible libraries, or Domain Specific Languages (DSLs) for scripting.

Data Integration API

Cascading allows developers to create and test rich functionality before tackling complex integration problems. Thus integration points can be developed and tested before plugging them into a production data flow.

Process Scheduler API

The Process Scheduler coupled with the Riffle lifecycle annotations allows Cascading to schedule unit of work from any third-party application.

Enterprise Development

Cascading was designed to fit into any Enterprise Java development environment. With its clear distinction between “data processing” and “data integration”, its clean Java API, and JUnit testing framework, Cascading can be easily tested at any scale. Even the core Cascading development team runs 1,500 tests daily on an Continuous Integration server and deploys all the tested Java libraries into our own public Maven repository,

Data Science

Because Cascading is Java based, it naturally fits into all of the JVM based languages available. Notably Scala, Clojure, Jruby, Jython, and Groovy. Within a many of these languages, scripting and query languages have been created by the Cascading community to simplify ad-hoc and production ready analytics and machine learning applications. See the extensions page for more information.

Homepage link?

Scalding for the Impatient

Sunday, August 12th, 2012

Scalding for the Impatient by Sujit Pal.

From the post:

Few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues). But there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I’ve provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.

A very nice side by side comparison and likely to make you interested in Scalding.

Cascading 2.0

Thursday, June 7th, 2012

Cascading 2.0

From the post:

We are happy to announce that Cascading 2.0 is now publicly available for download.

This release includes a number of new features. Specifically:

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

We have also created a new top-level project on GitHub for all community sponsored Cascading projects:

From the documentation:

What is Cascading?

Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a single node, Cascading’s “local mode” can be used to efficiently test code and process local files before being deployed on a cluster. On a distributed computing cluster using Apache Hadoop platform, Cascading adds an abstraction layer over the Hadoop API, greatly simplifying Hadoop application development, job creation, and job scheduling.

Cascading homepage.

Don’t miss the extensions to Cascading: Cascading Extensions. Any summary would be unfair. Take a look for yourself. Coverage of any of these you would like to point out?

I first spotted Cascading 2.0 at Alex Popescu’s myNoSQL.


Friday, February 17th, 2012

Scalding by Patrick Oscar Boykin.

From the blog:

Today, we’re excited to open source Scalding, a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop’s MapReduce layer. Scalding is comprised of two main components:

  • a DSL to make MapReduce computations look very similar to Scala’s collection API
  • A wrapper for Cascading to make it simpler to define the typical use cases of jobs, tests and describing data sources on a Hadoop Distributed File System (HDFS) or local disk

Interesting find since I just mentioned Cascading yesterday.


Thursday, February 16th, 2012


Since Cascading got called out today in the graph partitioning posts, thought it would not hurt to point it out.

From the webpage:

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Cascading is a thin Java library and API that sits on top of Hadoop’s MapReduce layer and is executed from the command line like any other Hadoop application.

As a library and API that can be driven from any JVM based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are “operationalized”. That is, a single deployable Jar can be used to encapsulate a series of complex and dynamic processes all driven from the command line or a shell. Instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.

The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business critical applications on both Amazon Web Services (like Elastic MapReduce) or on dedicated hardware.

Graph partitioning in MapReduce with Cascading

Thursday, February 16th, 2012

Graph partitioning in MapReduce with Cascading in two parts by Friso van Vollenhoven.

Graph partitioning in MapReduce with Cascading (Part 1).

From the post:

I have recently had the joy of doing MapReduce based graph partitioning. Here’s a post about how I did that. I decided to use Cascading for writing my MR jobs, as it is a lot less verbose than raw Java based MR. The graph algorithm consists of one step to prepare the input data and then a iterative part, that runs until convergence. The program uses a Hadoop counter to check for convergence and will stop iterating once there. All code is available. Also, the explanation has colorful images of graphs. (And everything is written very informally and there is no math.)

Graph partitioning part 2: connected graphs.

From the post:

In a previous post, we talked about finding the partitions in a disconnected graph using Cascading. In reality, most graphs are actually fully connected, so only being able to partition already disconnected graphs is not very helpful. In this post, we’ll take a look at partitioning a connected graph based on some criterium for creating a partition boundary.

Very accessible explanations complete with source code (github).

What puzzles me about the efforts to develop an algorithm to automatically partition a graph database is that there is no corresponding effort to develop an algorithm to automatically partition relational databases. Yet we know that relational databases can be represented as graphs. So what’s the difference?

Conceding that graphs such as Facebook, the WWW, etc., have grown without planning and so aren’t subject to the same partitioning considerations as relational databases. But isn’t there a class of graphs that are closer to relational databases than Facebook?

Consider that diverse research facilities for a drug company could use graph databases for research purposes but that doesn’t mean that any user can create edges between nodes at random. Any more than a user of a sharded database can create arbitrary joins.

I deeply enjoy graph posts such as these by Friso van Vollenhoven but the “cool” aspects of using MapReduce should not keep us from seeing heuristics we can use to enhance the performance of graph databases.