Archive for the ‘Sparse Data’ Category

Deep learning made easy

Friday, May 3rd, 2013

Deep learning made easy by Zygmunt Zając.

From the post:

As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal.

There are a couple benchmarks for this competition and the best one is unusually hard to beat – only less than a fourth of those taking part managed to do so. We’re among them. Here’s how.

The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning, called sparse filtering. Actually, it’s not secret. It’s available at Github, and has one or two very appealling properties. Let us explain.

The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications are in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for the computers*.

Geoff Hinton from Toronto talks about two ends of spectrum in machine learning: one is statistics and getting rid of noise, the other one – AI, or the things that humans are good at but computers are not. Deep learning proponents say that deep, that is, layered, architectures, are the way to solve AI kind of problems.

The idea might have something to do with an inspiration from how the brain works. Each layer is supposed to extract higher-level features, and these features are supposed to be more useful for the task at hand.

Rather say layered architectures are observed to mimic human results.

Just as a shovel mimics and exceeds a human hand for digging.

But you would not say operation of a shovel gives us insight into the operation of a human hand.

Or would you?

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Wednesday, April 24th, 2013

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices by Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber.

Abstract:

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.

Your mileage may vary but the paper reports that for PageRank, Presto is 40X faster than Hadoop and 15X Spark.

Unfortunately I can’t point you to any binary or source code for Presto.

Still, the description is an interesting one at a time of rapid development of computing power.

A Fast Parallel Maximum Clique Algorithm…

Sunday, March 3rd, 2013

A Fast Parallel Maximum Clique Algorithm for Large Sparse Graphs and Temporal Strong Components by Ryan A. Rossi, David F. Gleich, Assefaw H. Gebremedhin, Md. Mostofa Ali Patwary.

Abstract:

We propose a fast, parallel, maximum clique algorithm for large, sparse graphs that is designed to exploit characteristics of social and information networks. We observe roughly linear runtime scaling over graphs between 1000 vertices and 100M vertices. In a test with a 1.8 billion-edge social network, the algorithm finds the largest clique in about 20 minutes. For social networks, in particular, we found that using the core number of a vertex in combination with a good heuristic clique finder efficiently removes the vast majority of the search space. In addition, we parallelize the exploration of the search tree. In the algorithm, processes immediately communicate changes to upper and lower bounds on the size of maximum clique, which occasionally results in a super-linear speedup because vertices with especially large search spaces can be pruned by other processes. We use this clique finder to investigate the size of the largest temporal strong components in dynamic networks, which requires finding the largest clique in a particular temporal reachability graph.

Thirty-two networks are reported in this paper and a promised online appendix as around eighty (80).

The online appendix is live but as of today (March 2, 2013), it has no content.

No matter, the paper should keep you busy for more than a little while. ;-)

I am interested in parallel graph processing in general but the concept of communicating “…changes to upper and lower bounds on the size of maximum clique…” seems applicable to “merging” in topic maps.

That is if some set of topics share some common characteristic that exclude them from consideration for merging, why apply the merging test at all?

Will have to think about that.

Algorithms for Modern Massive Data Sets [slides]

Sunday, August 5th, 2012

Algorithms for Modern Massive Data Sets [slides]

Igor Carron writes:

In case you have to take your mind off tomorrow’s suspense-filled and technologically challenging landing of Curiosity on Mars (see 7 minutes of Terror, a blockbuster taking place on Mars this SummerMichael Mahoney, Alex Shkolnik, Gunnar Carlsson, Petros Drineas, the organizers of Workshop on Algorithms for Modern Massive Data Sets (MMDS 2012), just made available the slides of the meeting. Other relevant meeting slides include that of the Coding, Complexity, and Sparsity Workshop and that of the GraphLab workshop. Don’t grind your teeth too much tomorrow and Go Curiosity !

Igor has helpfully listed four conference days of links for the algorithms workshop. When you finish there, take a look at the other two slide listings.

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases

Sunday, July 8th, 2012

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases by Michael Finkelstein and Bruce Levin.

Abstract:

Combining the results of multiple small trials to increase accuracy and statistical power, a technique called meta-analysis has become well established and increasingly important in medical studies, particularly in connection with new drugs. When the data are sparse, as they are in many such cases, certain accepted practices, applied reflexively by researchers, may be misleading because they are biased and for other reasons. We illustrate some of the problems by examining a meta-analysis of the connection between the diabetes drug Avandia (rosiglitazone) and myocardial infarction that was strongly criticized as misleading, but led to thousands of lawsuits being filed against the manufacturer and the FDA acting to restrict access to the drug. Our scrutiny of the Avandia meta-analysis is particularly appropriate because it plays an important role in ongoing litigation, has been sharply criticized, and has been subject to a more searching review in court than meta-analyses of other drugs.

A good introduction to the issues of meta-analysis, where the stakes for drug companies, can be quite high.

All clinical trials vary in some respects, the question with meta-analysis being is the variance enough (enough heterogeneity) to make meta-analysis invalid?

How would you measure heterogeneity or perhaps an experts claim of heterogeneity?