Archive for the ‘GraphBuilder’ Category

Pigs can build graphs too for graph analytics

Thursday, December 19th, 2013

Pigs can build graphs too for graph analytics by Ted Willke.

From the post:

Today, my team is announcing a major update to Intel® Graph Builder for Apache Hadoop* software, our open source library that structures big data for graph-based machine learning and data mining. This update will help data scientists accelerate their time-to-insight by making graph analytics easier to work with on big data systems. We believe that graph analytics will be a key tool for realizing value from big data once a few key hurdles are cleared, and, in this blog, my engineers and I would like to share our perspective on why we decided to tackle graph construction first and what we’re doing to make it easier.


Additional resources:

Intel® Graph Builder for Apache Hadoop* Software v2

GraphBuilder Community

Oddly enough, version 2.0 doesn’t show up on github.

Check back early next week.

I first saw this in a tweet by aurelius.

Tittel [Merry Christmas Marko!]

Thursday, December 19th, 2013

Intel Goes Graph with Hadoop Distro by Alex Woodie.

From the post:

Intel will be targeting big retail operations with a new graph database that it unveiled today as part of its Intel Distribution for Apache Hadoop version 3 announcement. The graph engine will enable customers to make product or customer recommendations in real time, a la Netflix or Amazon, based on existing data. The chip giant also fleshed out its Hadoop distro with a 20x speedup in encryption functions, a data tokenization option, and a handful of new machine learning algorithms aimed at solving common problems.

Intel got its feet wet with graph analytics a year ago when it released into the open source arena Graph Builder, a set of libraries designed to help developers create graphs based on real world models. Since that first alpha release, Intel developers have streamlined the software and made it easier for users to import, clean, and transform large amounts of data sitting in the graph database. These enhancements will ship in early 2014 as Intel Graph Builder for Apache Hadoop software version 2.

Intel Graph Builder is based on the open source Titan distributed graph database, and uses Pig scripts to trigger queries on top of the graph, says Ritu Kama, director of product management in Intel’s Big Data group. The graph engine adds another analytical option for Intel Hadoop customers, in addition to MapReduce, HBase, Hive, and Mahout, which are all bundled with the distribution.

Yes, Titan, whose development has been led by Marko A. Rodriguez.

I can’t think of a better Christmas present!

Will Tittel be the successor to Wintel?

When you tire of the shallow end of the graph pool, you can answer that question for yourself with Titan and/or the Intel® Distribution.

PS: The download page says:

Download the Intel® Distribution to experience the power of hardware assisted security & enterprise grade performance for Apache Hadoop* big data processing. This 100% Apache Hadoop* open source download delivers core project capabilities with value added Intel® Manager: auto-tuning for hadoop clusters, role based access control for HBase, multi-site scalability and adaptive replication in HBase, and many other features to ease deployment of Hadoop in the enterprise. After registration you will be presented to download TAR or Virtual Machine versions, gain access to online help documentation, and receive a link to Community Forums.

It’s 90-day unrestricted evaluation software.

I’m going to wait until after the holidays to grab a copy.

Data Quality, Feature Engineering, GraphBuilder

Wednesday, November 27th, 2013

Avoiding Cluster-Scale Headaches with Better Tools for Data Quality and Feature Engineering by Ted Willke.

Ted’s second slide reads:

Machine Learning may nourish the soul…

…but Data Preparation will consume it.

Ted starts with the problems of data preparation but quickly narrows his focus to property graphs and using Pig for ETL.

He also outlines outstanding problems with Pig ETL (slides 29-32).

Nothing surprising, but it is good news that Graph Builder 2 Alpha is due out in December 2013.

BTW, GraphBuilder 1.0 can be found at:

Graph Landscape Survey

Monday, May 20th, 2013

Improving options for unlocking your graph data by Ben Lorica.

From the post:

The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers, raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data, is at the upcoming GraphLab Workshop (on July 1st in SF).

Ben summarizes graph resources for:

  • Data wrangling: creating graphs
  • Data management and search
  • Graph-parallel frameworks
  • Machine-learning and analytics
  • Visualization

It would be hard to find a better starting place for investigating the buzz about graphs.

I first saw this in An Overview of Graph Processing Frameworks by Danny Bickson.

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™

Monday, March 4th, 2013

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™ by Theodore L. Willke, Nilesh Jain and Haijie Gu. (whitepaper)


The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these frameworks is a huge challenge in itself. Since graph construction is a data-parallel problem, Hadoop is well-suited for this task but lacks some elements that would make things easier for data scientists that do not have domain expertise in distributed systems engineering. We developed GraphBuilder, a scalable graph construction software library for Apache Hadoop, to address this gap. GraphBuilder offloads many of the complexities of graph construction, including graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization. It is written in Java for ease of programming and scales using the MapReduce parallel programming model. We describe the motivation for GraphBuilder, its architecture, and present two case studies that provide a preliminary evaluation.

The “whitepaper” introduction to GraphBuilder.
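To make a few of the steps named in the abstract concrete, here is a minimal in-memory sketch of formation, tabulation, and partitioning. It is written in Python for brevity and is not GraphBuilder's API (GraphBuilder is Java on MapReduce). The function names, the sample records, the choice to drop self-loops, and the CRC32 hash on the source vertex are all assumptions made for illustration.

```python
from collections import Counter
import zlib


def form_edges(records):
    """Formation: turn each (source, target) record into a directed edge."""
    for source, target in records:
        if source != target:  # assumption: self-loops are dropped in this sketch
            yield (source, target)


def tabulate(edges):
    """Tabulation: merge duplicate edges, keeping the count as an edge weight."""
    return Counter(edges)


def partition(weighted_edges, num_parts):
    """Partitioning: place each edge in a part chosen by a stable hash of its source."""
    parts = [dict() for _ in range(num_parts)]
    for (source, target), weight in weighted_edges.items():
        index = zlib.crc32(source.encode("utf-8")) % num_parts
        parts[index][(source, target)] = weight
    return parts


# Hypothetical input: raw relationship records with a duplicate and a self-loop.
records = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "c"), ("c", "a")]
weighted = tabulate(form_edges(records))
parts = partition(weighted, 2)
```

Hashing on the source vertex keeps all of a vertex's outgoing edges in the same part. That is only one of several possible partitioning strategies and is not claimed to be the one the paper uses.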

Building graphs with Hadoop

Friday, December 7th, 2012

Building graphs with Hadoop

From the post:

Faced with a mass of unstructured data, the first step of analysing it should be to organise it, and the first step of that process should be working out in what way it should be organised. But then that mass of data has to be fed into the graph which can take a long time and may be inefficient. That’s why Intel has announced the release of the open source GraphBuilder library, a tool that is meant to help scientists and developers working with large amounts of data build applications that make sense of this data.

The library plugs into Apache Hadoop and is designed to create graphs from big data sets which can then be used in applications. GraphBuilder is written in Java using the MapReduce parallel programming model and takes care of many of the complexities of graph construction. According to the developers, this makes it easier for scientists and developers who do not necessarily have skills in distributed systems engineering to make use of large data sets in their Hadoop applications. They can focus on writing the code that breaks the data up into meaningful nodes and useful edge information which can be run across the distributed architecture where the library also performs a wide range of other useful processes to optimise the data for later analysis.
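As a rough illustration of "code that breaks the data up into meaningful nodes and useful edge information," here is a small Python sketch that imitates the map, shuffle, and reduce phases in memory. It does not use Hadoop or GraphBuilder. The tab-separated input format, the function names, and the sample data are assumptions made for the example.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one 'source<TAB>target' line and emit a (source, target) pair."""
    source, target = line.rstrip("\n").split("\t")
    yield source, target


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_vertex(source, targets):
    """Reduce: collapse duplicate targets into a sorted adjacency list."""
    return source, sorted(set(targets))


# Hypothetical input: one edge per line, with one duplicate edge.
lines = ["alice\tbob", "alice\tcarol", "bob\tcarol", "alice\tbob"]
pairs = [pair for line in lines for pair in map_record(line)]
graph = dict(reduce_vertex(s, ts) for s, ts in shuffle(pairs).items())
```

Only `map_record` knows anything about the raw data. Grouping and deduplication are generic, which is the kind of work the excerpt says the library takes off the developer's hands.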

A nice way to re-use those Hadoop skills you have been busy acquiring!

Definitely on the weekend schedule!