Archive for the ‘Sparse Data’ Category

PySparNN [nearest neighbors in sparse, high dimensional spaces (like text documents).]

Thursday, April 7th, 2016


From the post:

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 – cosine_similarity).

PySparNN benefits:

  • Designed to be efficent on sparse data (memory & cpu).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidian, Jaccard, etc.
  • Work in progress – Min, Max distance thresholds can be set at query time (not index time). Example: return the k closest items on the interval [0.8, 0.9] from a query point.

If your data is NOT SPARSE – please consider annoy. Annoy uses a similar-ish method and I am a big fan of it. As of this writing, annoy performs ~8x faster on their introductory example.
General rule of thumb – annoy performs better if you can get your data to fit into memory (as a dense vector).

The most comparable library to PySparNN is scikit-learn’s LSHForrest module. As of this writing, PySparNN is ~1.5x faster on the 20newsgroups dataset. A more robust benchmarking on sparse data is desired. Here is the comparison.

I included the text snippet in the title because PySparNN isn’t clueful, at least not at first glance.

I looked for a good explanation on nearest neighbors and encountered this lecture by Pat Wilson’s (MIT OpenCourseWare):

The lecture has a number of gems, including the observation that:

Town and Country readers tend to be social parasites.

Observations on text and nearest neightbors, time marks 17:30 – 24:17.

You should make an effort to watch the entire video. You will have a broader appreciate for the sheer power of nearest neighbor analysis and as a bonus, some valuable insights on why going without sleep is a very bad idea.

I first saw this in a tweet by Lynn Cherny.

Classifying Shakespearean Drama with Sparse Feature Sets

Tuesday, October 14th, 2014

Classifying Shakespearean Drama with Sparse Feature Sets by Douglas Duhaime.

From the post:

In her fantastic series of lectures on early modern England, Emma Smith identifies an interesting feature that differentiates the tragedies and comedies of Elizabethan drama: “Tragedies tend to have more streamlined plots, or less plot—you know, fewer things happening. Comedies tend to enjoy a multiplication of characters, disguises, and trickeries. I mean, you could partly think about the way [tragedies tend to move] towards the isolation of a single figure on the stage, getting rid of other people, moving towards a kind of solitude, whereas comedies tend to end with a big scene at the end where everybody’s on stage” (6:02-6:37). 

The distinction Smith draws between tragedies and comedies is fairly intuitive: tragedies isolate the poor player that struts and frets his hour upon the stage and then is heard no more. Comedies, on the other hand, aggregate characters in order to facilitate comedic trickery and tidy marriage plots. While this discrepancy seemed promising, I couldn’t help but wonder whether computational analysis would bear out the hypothesis. Inspired by the recent proliferation of computer-assisted genre classifications of Shakespeare’s plays—many of which are founded upon high dimensional data sets like those generated by DocuScope—I was curious to know if paying attention to the number of characters on stage in Shakespearean drama could help provide additional feature sets with which to carry out this task.

A quick reminder that not all text analysis is concerned with 140 character strings. 😉

Do you prefer:

high dimensional

where every letter in “high dimensional” is a hyperlink with an unknown target or a fuller listing:

Allison, Sarah, and Ryan Heuser, Matthew Jockers, Franco Moretti, Michael Witmore. Quantitative Formalism: An Experiment

Jockers, Matthew. Machine-Classifying Novels and Plays by Genre

Hope, Jonathan and Michael Witmore. “The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches Shakespeare’s Language of Genre

Hope, Jonathan. Shakespeare by the numbers: on the linguistic texture of the Late Plays

Hope, Jonathan and Michael Witmore. The Very Large Textual Object: A Prosthetic Reading of Shakespeare

Lenthe, Victor. Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

Stumpf, Mike. How Quickly Nature Falls Into Revolt: On Revisiting Shakespeare’s Genres

Stumpf, Mike. This Thing of Darkness (Part III)

Tootalian, Jacob A. Shakespeare, Without Measure: The Rhetorical Tendencies of Renaissance Dramatic Prose

Ullyot, Michael. Encoding Shakespeare

Witmore, Michael. A Genre Map of Shakespeare’s Plays from the First Folio (1623)

Witmore, Michael. Shakespeare Out of Place?

Witmore, Michael. Shakespeare Quarterly 61.3 Figures

Witmore, Michael. Visualizing English Print, 1530-1800, Genre Contents of the Corpus

Decompiling Shakespeare (Site is down. Was also down when the WayBack machine tried to archive the site in July of 2014)

I prefer the longer listing.

If you are interested in Shakespeare, Folger Digital Texts has free XML and PDF versions of his work.

I first saw this in a tweet by Gregory Piatetsky

Deep learning made easy

Friday, May 3rd, 2013

Deep learning made easy by Zygmunt Zając.

From the post:

As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal.

There are a couple benchmarks for this competition and the best one is unusually hard to beat – only less than a fourth of those taking part managed to do so. We’re among them. Here’s how.

The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning, called sparse filtering. Actually, it’s not secret. It’s available at Github, and has one or two very appealling properties. Let us explain.

The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications are in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for the computers*.

Geoff Hinton from Toronto talks about two ends of spectrum in machine learning: one is statistics and getting rid of noise, the other one – AI, or the things that humans are good at but computers are not. Deep learning proponents say that deep, that is, layered, architectures, are the way to solve AI kind of problems.

The idea might have something to do with an inspiration from how the brain works. Each layer is supposed to extract higher-level features, and these features are supposed to be more useful for the task at hand.

Rather say layered architectures are observed to mimic human results.

Just as a shovel mimics and exceeds a human hand for digging.

But you would not say operation of a shovel gives us insight into the operation of a human hand.

Or would you?

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Wednesday, April 24th, 2013

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices by Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber.


It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.

Your mileage may vary but the paper reports that for PageRank, Presto is 40X faster than Hadoop and 15X Spark.

Unfortunately I can’t point you to any binary or source code for Presto.

Still, the description is an interesting one at a time of rapid development of computing power.

A Fast Parallel Maximum Clique Algorithm…

Sunday, March 3rd, 2013

A Fast Parallel Maximum Clique Algorithm for Large Sparse Graphs and Temporal Strong Components by Ryan A. Rossi, David F. Gleich, Assefaw H. Gebremedhin, Md. Mostofa Ali Patwary.


We propose a fast, parallel, maximum clique algorithm for large, sparse graphs that is designed to exploit characteristics of social and information networks. We observe roughly linear runtime scaling over graphs between 1000 vertices and 100M vertices. In a test with a 1.8 billion-edge social network, the algorithm finds the largest clique in about 20 minutes. For social networks, in particular, we found that using the core number of a vertex in combination with a good heuristic clique finder efficiently removes the vast majority of the search space. In addition, we parallelize the exploration of the search tree. In the algorithm, processes immediately communicate changes to upper and lower bounds on the size of maximum clique, which occasionally results in a super-linear speedup because vertices with especially large search spaces can be pruned by other processes. We use this clique finder to investigate the size of the largest temporal strong components in dynamic networks, which requires finding the largest clique in a particular temporal reachability graph.

Thirty-two networks are reported in this paper and a promised online appendix as around eighty (80).

The online appendix is live but as of today (March 2, 2013), it has no content.

No matter, the paper should keep you busy for more than a little while. 😉

I am interested in parallel graph processing in general but the concept of communicating “…changes to upper and lower bounds on the size of maximum clique…” seems applicable to “merging” in topic maps.

That is if some set of topics share some common characteristic that exclude them from consideration for merging, why apply the merging test at all?

Will have to think about that.

Algorithms for Modern Massive Data Sets [slides]

Sunday, August 5th, 2012

Algorithms for Modern Massive Data Sets [slides]

Igor Carron writes:

In case you have to take your mind off tomorrow’s suspense-filled and technologically challenging landing of Curiosity on Mars (see 7 minutes of Terror, a blockbuster taking place on Mars this SummerMichael Mahoney, Alex Shkolnik, Gunnar Carlsson, Petros Drineas, the organizers of Workshop on Algorithms for Modern Massive Data Sets (MMDS 2012), just made available the slides of the meeting. Other relevant meeting slides include that of the Coding, Complexity, and Sparsity Workshop and that of the GraphLab workshop. Don’t grind your teeth too much tomorrow and Go Curiosity !

Igor has helpfully listed four conference days of links for the algorithms workshop. When you finish there, take a look at the other two slide listings.

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases

Sunday, July 8th, 2012

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases by Michael Finkelstein and Bruce Levin.


Combining the results of multiple small trials to increase accuracy and statistical power, a technique called meta-analysis has become well established and increasingly important in medical studies, particularly in connection with new drugs. When the data are sparse, as they are in many such cases, certain accepted practices, applied reflexively by researchers, may be misleading because they are biased and for other reasons. We illustrate some of the problems by examining a meta-analysis of the connection between the diabetes drug Avandia (rosiglitazone) and myocardial infarction that was strongly criticized as misleading, but led to thousands of lawsuits being filed against the manufacturer and the FDA acting to restrict access to the drug. Our scrutiny of the Avandia meta-analysis is particularly appropriate because it plays an important role in ongoing litigation, has been sharply criticized, and has been subject to a more searching review in court than meta-analyses of other drugs.

A good introduction to the issues of meta-analysis, where the stakes for drug companies, can be quite high.

All clinical trials vary in some respects, the question with meta-analysis being is the variance enough (enough heterogeneity) to make meta-analysis invalid?

How would you measure heterogeneity or perhaps an experts claim of heterogeneity?