Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 12, 2013

Bond Percolation in GraphLab

Filed under: GraphLab,Graphs,Networks — Patrick Durusau @ 4:11 pm

Bond Percolation in GraphLab by Danny Bickson.

From the post:

I was asked by Prof. Scott Kirkpatrick to help implement bond percolation in GraphLab. It is an oldie-but-goodie problem which is closely related to the connected components problem.

Here is an explanation about bond percolation from Wikipedia:

A representative question (and the source of the name) is as follows. Assume that some liquid is poured on top of some porous material. Will the liquid be able to make its way from hole to hole and reach the bottom? This physical question is modelled mathematically as a three-dimensional network of n × n × n vertices, usually called “sites”, in which the edges or “bonds” between each two neighbors may be open (allowing the liquid through) with probability p, or closed with probability 1 − p, and they are assumed to be independent. Therefore, for a given p, what is the probability that an open path exists from the top to the bottom? The behavior for large n is of primary interest. This problem, now called bond percolation, was introduced in the mathematics literature by Broadbent & Hammersley (1957), and has been studied intensively by mathematicians and physicists since.

Percolation Graph

In social networks, Danny notes this algorithm is used to find groups of friends.

Similar mazes appear in puzzle books.

My curiosity is about finding groups of subject identity properties.
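Percolation is also easy to prototype outside GraphLab. Here is a minimal Python sketch (mine, not Danny's GraphLab code) that opens each bond of an n × n grid with probability p and uses union-find to test whether an open path crosses from the top row to the bottom row:

import random

def bond_percolation(n, p, seed=0):
    """Open each bond of an n x n grid with probability p, then
    test for a top-to-bottom crossing via union-find."""
    rng = random.Random(seed)
    parent = list(range(n * n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n and rng.random() < p:  # bond to the right
                union(v, v + 1)
            if r + 1 < n and rng.random() < p:  # bond below
                union(v, v + n)

    # does any open cluster touch both the top row and the bottom row?
    top = {find(c) for c in range(n)}
    bottom = {find((n - 1) * n + c) for c in range(n)}
    return bool(top & bottom)

print(bond_percolation(50, 0.6))  # above the 2-D threshold p = 0.5, usually True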

A couple of other percolation resources of interest:

Percolation Exercises by Eric Mueller.

PercoVis (Mac), visualization of percolation by Daniel B. Larremore.

April 30, 2013

GraphLab Workshop 2013 (Update)

Filed under: Conferences,GraphLab,Machine Learning — Patrick Durusau @ 2:46 pm

GraphLab Workshop 2013 Confirmed Agenda

You probably already have your plane tickets and hotel reservation but have you registered for GraphLab Workshop 2013?

Not just a select few graph databases for comparison but:

We have secured talks and demos about the hottest graph processing systems out there: GraphLab (CMU/UW), Pregel (Google), Giraph (Facebook), Cassovary (Twitter), Grappa (UW), Combinatorial BLAS (LBNL/UCSB), AllegroGraph (Franz), Neo4j, Titan (Aurelius), DEX (Sparsity Technologies), YarcData and others!

Registration.

2013 Graphlab Workshop on Large Scale Machine Learning
Sessions Events LLC
Monday, July 1, 2013 from 8:00 AM to 7:00 PM (PDT)
San Francisco, CA

I know, I know, 8 AM is an unholy time to be anywhere (other than on your way home) on the West Coast.

Just pull an all-dayer for a change. 😉

Expecting to see lots of posts and tweets from the conference!

April 18, 2013

Distributed Dual Decomposition (DDD) in GraphLab

Filed under: GraphLab,Operations Research,Relaxation — Patrick Durusau @ 1:38 pm

Distributed Dual Decomposition (DDD) in GraphLab by Danny Bickson.

From the post:

Our collaborator Dhruv Batra, from Virginia Tech, has kindly contributed DDD code for GraphLab. Here is an explanation of the method and how to deploy it.

The full documentation is found here.

Distributed Dual Decomposition

Dual Decomposition (DD), also called Lagrangian Relaxation, is a powerful technique with a rich history in Operations Research. DD solves a relaxation of difficult optimization problems by decomposing them into simpler subproblems, solving these simpler subproblems independently and then combining these solutions into an approximate global solution.
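As a toy illustration of the technique (my sketch, not the contributed GraphLab code): to minimize (x-1)^2 + (x-3)^2, give each subproblem its own copy of x, dualize the agreement constraint x1 = x2 with a multiplier, and run subgradient ascent on the dual:

def dual_decomposition(steps=100, alpha=0.4):
    """Toy dual decomposition: minimize (x-1)^2 + (x-3)^2 by splitting x
    into two copies and enforcing agreement through a multiplier lam."""
    lam = 0.0
    for _ in range(steps):
        x1 = 1.0 - lam / 2.0      # closed-form argmin of (x1-1)^2 + lam*x1
        x2 = 3.0 + lam / 2.0      # closed-form argmin of (x2-3)^2 - lam*x2
        lam += alpha * (x1 - x2)  # subgradient ascent on the dual
    return x1, x2

print(dual_decomposition())  # both copies converge to the consensus optimum x = 2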

More details about DD for solving Maximum A Posteriori (MAP) inference problems in Markov Random Fields (MRFs) can be found in the following:

D. Sontag, A. Globerson, T. Jaakkola. Introduction to Dual Decomposition for Inference. Optimization for Machine Learning, editors S. Sra, S. Nowozin, and S. J. Wright: MIT Press, 2011.

Always bear in mind that string matching is only one form of subject identification.

April 3, 2013

2nd GraphLab workshop [Early Bird Registration]

Filed under: Conferences,GraphLab,Graphs,Networks — Patrick Durusau @ 10:14 am

The 2nd GraphLab workshop is coming up! by Danny Bickson.

Danny also says there is a 30% discount if you email him: danny.bickson@gmail.com. Don’t know when that runs out but worth a try.

From the post:

Following the great success of the first GraphLab workshop, we have started to organize this year's event, to be held in July in the Bay Area. To remind you, last year we planned a 15-20 person event, which eventually drew 300+ researchers from 100+ companies.

The main aim of this year's workshop is to bring together top researchers from academia as well as top data scientists from industry, with a special focus on large scale machine learning on sparse graphs.

The event will take place Monday July 1st, 2013 in San Francisco. Early bird registration is now open!

Preliminary agenda.

Definitely one to have on your calendar!

March 11, 2013

GraphLab: A Distributed Abstraction…

Filed under: Graph Partitioning,GraphLab,Graphs,Networks — Patrick Durusau @ 7:32 pm

GraphLab: A Distributed Abstraction for Machine Learning in the Cloud by Carlos Guestrin. (video)

Takeaway line: “How does a vertex think?”

Deeply impressive presentation and performance numbers!

Resources:

GraphLab 2.1: http://graphlab.org

GraphChi 0.1: http://graphchi.org

Slides from the talk.

This needs to be very high on your graph reading/following list.

I first saw this at: SFBayACM talk: GraphLab framework for Machine Learning in the Cloud.

December 8, 2012

GraphLab vs. Piccolo vs. Spark

Filed under: GraphLab,Graphs,Networks,Piccolo,Spark — Patrick Durusau @ 7:26 pm

GraphLab vs. Piccolo vs. Spark by Danny Bickson.

From the post:

I got an interesting case study from Cui Henggang, a first-year graduate student at the CMU Parallel Data Lab. Cui implemented GMM on GraphLab to compare its performance to Piccolo and Spark. His collaborators on this project were Jinliang Wei and Wei Dai. The algorithm is described in Chris Bishop, Pattern Recognition and Machine Learning, Chapter 9.2, page 438.

Danny reports Cui will be releasing his report and posting his GMM code to the graphical models toolkit (GraphLab).

I will post a pointer when the report appears, here and probably in a new post as well.
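Until the report and code land, here is a plain-Python sketch of EM for a 1-D Gaussian mixture, the algorithm from Bishop's chapter 9.2 (an illustration of what is being benchmarked, not Cui's GraphLab code):

import math, random

def em_gmm_1d(data, k=2, iters=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (Bishop ch. 9.2)."""
    rng = random.Random(seed)
    mu = rng.sample(list(data), k)          # initialize means from the data
    var = [1.0] * k
    pi = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities r[i][j] = p(component j | x_i)
        r = []
        for x in data:
            w = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(k)]
            s = sum(w)
            r.append([wj / s for wj in w])
        # M-step: re-estimate weights, means, variances
        for j in range(k):
            nj = sum(r[i][j] for i in range(len(data)))
            pi[j] = nj / len(data)
            mu[j] = sum(r[i][j] * x for i, x in enumerate(data)) / nj
            var[j] = sum(r[i][j] * (x - mu[j]) ** 2
                         for i, x in enumerate(data)) / nj
    return pi, mu, var

data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]
print(em_gmm_1d(data))  # the estimated means should land near 0 and 5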

October 23, 2012

Deploying a GraphLab/Spark/Mesos cluster on EC2

Filed under: Amazon Web Services AWS,Clustering (servers),GraphLab,Spark — Patrick Durusau @ 10:10 am

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu), who spent some time learning Spark cluster deployment and adapted those useful scripts for use in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run an alternating least squares task, collect the results and shut down the cluster.

This tutorial is a very new beta release. Please contact me if you are brave enough to try it out.

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

October 3, 2012

Big Learning with Graphs by Joey Gonzalez (Video Lecture)

Filed under: GraphLab,Graphs,Machine Learning — Patrick Durusau @ 5:00 am

Big Learning with Graphs by Joey Gonzalez by Marti Hearst.

From the post:

For those of you who follow the latest developments in the Big Data technology stack, you’ll know that GraphLab is the hottest technology for processing huge graphs fast. We got to hear the algorithms behind GraphLab 2 even before the OSDI crowd! Check it out:

Slides.

GraphLab homepage.

For when you want to move up to parallel graph processing.

September 11, 2012

Item based similarity with GraphChi

Filed under: GraphChi,GraphLab,Similarity — Patrick Durusau @ 10:15 am

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item-based collaborative filtering is one of the most popular collaborative filtering methods, used by more than 70% of the companies I talk to. Following my mega-collaborator Justin Yan‘s advice, I have started to implement some item-based similarity methods in GraphChi.

Item-based methods compare all pairs of items to compute a similarity metric for each pair. This task very quickly becomes computationally intensive. For example, the Netflix data has around 17K movies. If we want to compare all pairs of movies to find the most similar ones, we need to examine around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and movie B is the same as the distance between movie B and movie A, so for the Netflix example we have around 145M pairs to compute. To reduce the work further, we only compare movies which were watched together by at least X users, for example X=5; otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory on a single machine. That is where GraphChi comes in. My preliminary implementation of item similarity computes the similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items, comparing the pivots to each of them.
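To fix ideas, here is a minimal in-memory sketch of item-based cosine similarity with a co-rating cutoff (the (user, item, value) layout is my assumption; and GraphChi's whole point is to avoid holding everything in memory, which this toy does not attempt):

from collections import defaultdict
from itertools import combinations
import math

def item_similarities(ratings, min_common=5):
    """Cosine similarity between item pairs co-rated by >= min_common users.
    ratings: iterable of (user, item, value) triples."""
    by_item = defaultdict(dict)           # item -> {user: value}
    for user, item, value in ratings:
        by_item[item][user] = value
    sims = {}
    for a, b in combinations(by_item, 2):
        common = by_item[a].keys() & by_item[b].keys()
        if len(common) < min_common:      # skip rarely co-rated pairs
            continue
        dot = sum(by_item[a][u] * by_item[b][u] for u in common)
        na = math.sqrt(sum(v * v for v in by_item[a].values()))
        nb = math.sqrt(sum(v * v for v in by_item[b].values()))
        sims[(a, b)] = dot / (na * nb)
    return sims

ratings = [("u%d" % u, item, 1.0) for u in range(6) for item in ("A", "B")]
print(item_similarities(ratings))  # {('A', 'B'): 1.0}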

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

GraphChi parsers toolkit

Filed under: GraphChi,GraphLab,Graphs,Latent Dirichlet Allocation (LDA),Parsers,Tweets — Patrick Durusau @ 9:53 am

GraphChi parsers toolkit by Danny Bickson.

From the post:

At the request of Baoqiang Cao I have started a parsers toolkit in GraphChi, to be used for preparing data for GraphLab/GraphChi. The parsers should be used as templates which can be easily customized to user-specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.
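For a flavor of what such a parser does, here is a tiny Python sketch that turns tweets into retweet edges (the tab-separated "user<TAB>tweet" input format is my assumption, not the toolkit's):

import re

RT = re.compile(r'\bRT @(\w+)')

def retweet_edges(lines):
    """Yield (retweeter, original_author) edges from 'user<TAB>tweet' lines.
    Hypothetical input format; adapt to your own dump."""
    for line in lines:
        user, _, text = line.partition('\t')
        m = RT.search(text)
        if m:
            yield user, m.group(1)

print(list(retweet_edges(["alice\tRT @bob: graphs are everywhere"])))
# [('alice', 'bob')]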

Enjoy!

August 22, 2012

Distributed GraphLab: …

Filed under: Amazon Web Services AWS,GraphLab,Graphs — Patrick Durusau @ 6:44 pm

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud by Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein.

Abstract:

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.

A gem from my first day as a member of the GraphLab and GraphChi group on LinkedIn!

This rocks!

August 21, 2012

GraphLab and GraphChi group on LinkedIn

Filed under: GraphChi,GraphLab — Patrick Durusau @ 1:09 pm

GraphLab and GraphChi group on LinkedIn by Igor Carron.

From the post:

Danny just started the GraphLab and GraphChi group on LinkedIn. If you want to be part of that disruptive discussion, you may want to join.

OK, I just hit “join.” What about you?

August 5, 2012

Algorithms for Modern Massive Data Sets [slides]

Filed under: Algorithms,BigData,GraphLab,Sparse Data — Patrick Durusau @ 3:40 pm

Algorithms for Modern Massive Data Sets [slides]

Igor Carron writes:

In case you have to take your mind off tomorrow’s suspense-filled and technologically challenging landing of Curiosity on Mars (see 7 Minutes of Terror, a blockbuster taking place on Mars this Summer), Michael Mahoney, Alex Shkolnik, Gunnar Carlsson, and Petros Drineas, the organizers of the Workshop on Algorithms for Modern Massive Data Sets (MMDS 2012), have just made available the slides of the meeting. Other relevant meeting slides include those of the Coding, Complexity, and Sparsity Workshop and of the GraphLab workshop. Don’t grind your teeth too much tomorrow and Go Curiosity!

Igor has helpfully listed four conference days of links for the algorithms workshop. When you finish there, take a look at the other two slide listings.

July 19, 2012

GraphLab 2.1 [New Release]

Filed under: GraphLab,Graphs,Machine Learning,Networks — Patrick Durusau @ 10:43 am

GraphLab 2.1

A new release (July 10, 2012) of GraphLab!

From the webpage:

Overview

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

The new GraphLab 2.1 features:

  • a new GraphLab 2 abstraction
  • Fully Distributed with HDFS integration
  • New toolkits
    • Collaborative Filtering
    • Clustering
    • Text Modeling
    • Computer Vision
  • Improved Build system and documentation

Go to http://graphlab.org for details.

If you want to get started with GraphLab today, download the source or clone us on Google Code. We recommend cloning to get the latest features and bug fixes.

hg clone https://code.google.com/p/graphlabapi/

If you don't have mercurial (hg) you can get it from http://mercurial.selenic.com/.

I almost didn’t find the download link, even though it is larger than anything else on the page: white letters on a black background at http://www.graphlab.org. I kept looking for a drop-down menu item, etc.

Shows even the clearest presentation can be missed by a user. 😉

Now to get this puppy running on my local box.

March 18, 2012

GraphLab workshop, Why should you care?

Filed under: GraphLab,Graphs — Patrick Durusau @ 8:51 pm

GraphLab workshop, Why should you care?

Danny Bickson has announced the first GraphLab workshop.

The “…Why should you care?” post reads in part as follows:

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

In short, it is a way to run iterative algorithms on sparse graphs (parallel processing included). With the advent of cheap cloud computing, and the underlying need for post-processing in sparse recovery or advanced matrix factorizations like dictionary learning, robust PCA and the like, it might be interesting to investigate the matter and even present something at this workshop….

Read the rest of “…Why should you care?” for links to resources and examples. You will care. Promise.

And, if that doesn’t completely convince you, try:

A small Q&A with Danny Bickson on GraphLab.

Me? I am just hopeful for a small video cam somewhere in the audience with the slides/resources being posted.

October 9, 2011

Large Scale Machine Learning and Other Animals

Filed under: GraphLab — Patrick Durusau @ 6:41 pm

Large Scale Machine Learning and Other Animals

From About Me:

Danny Bickson

I am a project scientist at Carnegie Mellon University, Machine Learning Department. I am interested in distributed/parallel large scale machine learning algorithms and applications on various computing platforms such as clouds and clusters. Check out our GraphLab project page. I am quite excited about the multicore matrix factorization code that I recently wrote, which has been downloaded about 1500 times and helped us win 5th place in the ACM KDD CUP 2011 contest (out of more than 1000 participants). We are looking for industrial collaboration around GraphLab. Contact me if you are interested in hearing more.

I encountered this blog due to its current post on installing GraphLab on Ubuntu.

I am sure I will be pointing out other specific posts in the future but thought I should direct your attention to his blog and the GraphLab project while it was on my mind.

September 6, 2011

Hadoop Fatigue — Alternatives to Hadoop

Filed under: GraphLab,Hadoop,HPCC,MapReduce,Spark,Storm — Patrick Durusau @ 7:15 pm

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or articulate why you chose Hadoop over those alternatives?

From the post:

After working extensively with (vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are lucky enough to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep them running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Preview of Storm: The Hadoop of Realtime Processing.

It is a pet peeve of mine that some authors force me to search for links they could just as well have entered. The New York Times, of all places, refers to websites and does not include the URLs. And that is for paid subscribers.

June 16, 2011

GraphLab Abstraction

Filed under: GraphLab,Graphs — Patrick Durusau @ 3:41 pm

GraphLab Abstraction

Abstract:

The GraphLab abstraction is the product of several years of research in designing and implementing systems for statistical inference in probabilistic graphical models. Early in our work [12], we discovered that the high-level parallel abstractions popular in the ML community such as MapReduce [2, 13] and parallel BLAS [14] libraries are unable to express statistical inference algorithms efficiently. Our work revealed that an efficient algorithm for graphical model inference should explicitly address the sparse dependencies between random variables and adapt to the input data and model parameters.

Guided by this intuition we spent over a year designing and implementing various machine learning algorithms on top of low-level threading primitives and distributed communication frameworks such as OpenMP [15], CILK++ [16] and MPI [1]. Through this process, we discovered the following set of core algorithmic patterns that are common to a wide range of machine learning techniques. Following, we detail our findings and motivate why a new framework is needed (see Table 1).

See also our prior post on GraphLab.

March 5, 2011

GraphLab

Filed under: GraphLab,Graphs,Machine Learning,Visualization — Patrick Durusau @ 2:51 pm

GraphLab

Progress on graph processing continues.

From the website:

A New Parallel Framework for Machine Learning

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

The popular MapReduce abstraction is defined in two parts: a Map stage, which performs computation on independent problems which can be solved in isolation, and a Reduce stage, which combines the results.

GraphLab provides a similar analog to the Map in the form of an Update Function. The Update Function, however, is able to read and modify overlapping sets of data (program state) in a controlled fashion, as defined by the user-provided data graph. The user-provided data graph represents the program state, with arbitrary blocks of memory associated with each vertex and edge. In addition, update functions can be recursively triggered, with one update function spawning the application of update functions to other vertices in the graph, enabling dynamic iterative computation. GraphLab uses powerful scheduling primitives to control the order in which update functions are executed.

The GraphLab analog to Reduce is the Sync Operation. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function, sync operations can look at multiple records simultaneously, providing the ability to operate on larger dependent contexts.
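To make the model concrete, here is a toy, single-threaded imitation of an update function with dynamic scheduling: PageRank where each update recomputes one vertex and reschedules its out-neighbors only if the value moved. This is a Python sketch of the abstraction, not the GraphLab C++ API:

from collections import deque

def pagerank_update(v, graph, rank, scheduler, damping=0.85, tol=1e-4):
    """GraphLab-style update function: read in-neighbors, recompute this
    vertex, and reschedule its out-neighbors if the change was significant."""
    total = sum(rank[u] / len(graph[u]) for u in graph if v in graph[u])
    new = (1 - damping) + damping * total
    if abs(new - rank[v]) > tol:
        rank[v] = new
        for w in graph[v]:           # dynamic scheduling: only touched work
            scheduler.append(w)
    else:
        rank[v] = new

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}  # adjacency (out-edges)
rank = {v: 1.0 for v in graph}
scheduler = deque(graph)             # start with every vertex scheduled
while scheduler:
    pagerank_update(scheduler.popleft(), graph, rank, scheduler)
print(rank)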

See also: GraphLab: A New Framework For Parallel Machine Learning (The original paper.)

This is a project that bears close watching.

