Archive for the ‘High Dimensionality’ Category

Visualizing High-Dimensional Data…

Monday, July 28th, 2014

Visualizing High-Dimensional Data in the Browser with SVD, t-SNE and Three.js by Nicolas Kruchten.

From the post:

Data visualization, by definition, involves making a two- or three-dimensional picture of data, so when the data being visualized inherently has many more dimensions than two or three, a big component of data visualization is dimensionality reduction. Dimensionality reduction is also often the first step in a big-data machine-learning pipeline, because most machine-learning algorithms suffer from the Curse of Dimensionality: more dimensions in the input means you need exponentially more training data to create a good model. Datacratic’s products operate on billions of data points (big data) in tens of thousands of dimensions (big problem), and in this post, we show off a proof of concept for interactively visualizing this kind of data in a browser, in 3D (of course, the images on the screen are two-dimensional but we use interactivity, motion and perspective to evoke a third dimension).

Both the post and the demo are very impressive!

For a compelling review, see Dimension Reduction: A Guided Tour by Christopher J.C. Burges.

Christopher captures my concern with dimensional reduction in the first sentence of the introduction:

Dimension reduction1 is the mapping of data to a lower dimensional space such that uninformative variance in the data is discarded, or such that a subspace in which the data lives is detected.

I understand the need for dimensional reduction and that it can produce useful results. But what is being missed in the “…uniformative variance in the data…” is unknown.

Not an argument against dimensional reduction but a caution to avoid quickly dismissing variation in data as “uninformative.”

Foundations of Data Science

Sunday, September 29th, 2013

Foundations of Data Science by John Hopcroft and Ravindran Kannan.

From the introduction:

Computer science as an academic discipline began in the 60’s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered nite automata, regular expressions, context free languages, and computability. In the 70’s, algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect and store data in the natural sciences, in commerce, and in other elds calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just
how to make computers useful on specifi c well-defi ned problems. With this in mind we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

In draft form but impressive!

Current chapters:

  1. Introduction
  2. High-Dimensional Space
  3. Random Graphs
  4. Singular Value Decomposition (SVD)
  5. Random Walks and Markov Chains
  6. Learning and the VC-dimension
  7. Algorithms for Massive Data Problems
  8. Clustering
  9. Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
  10. Other Topics [Rankings, Hare System for Voting, Compressed Sensing and Sparse Vectors]
  11. Appendix

I am certain the authors would appreciate comments and suggestions concerning the text.

I first saw this in a tweet by CompSciFact.

Class-imbalanced classifiers for high-dimensional data

Tuesday, January 22nd, 2013

Class-imbalanced classifiers for high-dimensional data by Wei-Jiun Lin and James J. Chen. (Brief Bioinform (2013) 14 (1): 13-26. doi: 10.1093/bib/bbs006)


A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered. The four classifiers include three standard classification algorithms each coupled with an ensemble correction strategy and one support vector machines (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (ii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte–Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.

At least the “big data” folks are right on one score: We are going to need help sorting out all the present and future information.

Not that we will ever attempt to sort it all out, as was reported in: The Untapped Big Data Gap (2012) [Merry Christmas Topic Maps!], only 23% of “big data” is going to be valuable if we do analyze it.

And your enterprise’s part of that 23% is even smaller.

Enough that your users will need help dealing with it, but not nearly the deluge that is being predicted.

Animating Random Projections of High Dimensional Data [“just looking around a bit”]

Tuesday, October 9th, 2012

Animating Random Projections of High Dimensional Data by Andreas Mueller.

From the post:

Recently Jake showed some pretty cool videos in his blog.

This inspired me to go back to an idea I had some time ago, about visualizing high-dimensional data via random projections.

I love to do exploratory data analysis with scikit-learn, using the manifold, decomposition and clustering module. But in the end, I can only look at two (or three) dimensions. And I really like to see what I am doing.

So I go and look at the first two PCA directions, than at the first and third, than at the second and third… and so on. That is a bit tedious and looking at more would be great. For example using time.

There is a software out there, called ggobi, which does a pretty good job at visualizing  high dimensional data sets. It is possible to take interactive tours of your high dimensions, set projection angles and whatnot. It has a UI and tons of settings.

I used it a couple of times and I really like it. But it doesn’t really fit into my usual work flow. It  has good R integration, but not Python integration that I know of. And it also seems a bit overkill for “just looking around a bit”.

It’s hard to over estimate the value of “just looking around a bit.”

As opposed to defending a fixed opinion about data, data structures, or processing.

Who knows?

Practice at “just looking around a bit,” may make your opinions less fixed.

Chance you will have to take.

High Dimensional Undirected Graphical Models

Monday, September 24th, 2012

High Dimensional Undirected Graphical Models by Larry Wasserman.

Larry discusses uncertainty in high dimensional graphs. No answers but does illustrate the problem.

Clustering high dimensional data

Thursday, June 28th, 2012

Clustering high dimensional data by Ira Assent. (Assent, I. (2012), Clustering high dimensional data. WIREs Data Mining Knowl Discov, 2: 340–350. doi: 10.1002/widm.1062)


High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to render traditional clustering algorithms ineffective. The curse of dimensionality, among other effects, means that with increasing number of dimensions, a loss of meaningful differentiation between similar and dissimilar objects is observed. As high-dimensional objects appear almost alike, new approaches for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster groups objects that are similar to one another, whereas dissimilar objects are assigned to different clusters, possibly separating out noise. In this manner, clusters describe the data structure in an unsupervised manner, i.e., without the need for class labels. A number of clustering paradigms exist that provide different cluster models and different algorithmic approaches for cluster detection. Common to all approaches is the fact that they require some underlying assessment of similarity between data objects. In this article, we provide an overview of the effects of high-dimensional spaces, and their implications for different clustering paradigms. We review models and algorithms that address clustering in high dimensions, with pointers to the literature, and sketch open research issues. We conclude with a summary of the state of the art.

The author has a clever example (figure 4) of why adding dimensions can decrease the discernment of distinct groups in data. A problem that worsens as the number of dimensions increases.

Or does it? Or is it the case that by weighting all dimensions equally we get the result we deserve?

My counter-example would be introducing you to twin sisters. As the number of dimensions increased, so would the similarity that would befoul any clustering algorithm.

But the important dimension, their names, is sufficient to cluster attributes around the appropriate data points.

Is the “curse of dimensionality” rather a “failure to choose dimensions wisely?”

Visual and semantic interpretability of projections of high dimensional data for classification tasks

Thursday, May 24th, 2012

Visual and semantic interpretability of projections of high dimensional data for classification tasks by Ilknur Icke and Andrew Rosenberg.

A number of visual quality measures have been introduced in visual analytics literature in order to automatically select the best views of high dimensional data from a large number of candidate data projections. These methods generally concentrate on the interpretability of the visualization and pay little attention to the interpretability of the projection axes. In this paper, we argue that interpretability of the visualizations and the feature transformation functions are both crucial for visual exploration of high dimensional labeled data. We present a two-part user study to examine these two related but orthogonal aspects of interpretability. We first study how humans judge the quality of 2D scatterplots of various datasets with varying number of classes and provide comparisons with ten automated measures, including a number of visual quality measures and related measures from various machine learning fields. We then investigate how the user perception on interpretability of mathematical expressions relate to various automated measures of complexity that can be used to characterize data projection functions. We conclude with a discussion of how automated measures of visual and semantic interpretability of data projections can be used together for exploratory analysis in classification tasks.

Rather small group of test subjects (20) so I don’t think you can say much other than more work is needed.

Then it occurred to me that I often speak of studies applying to “users” without stopping to remember that for many tasks, I fall into that self-same category. Subject to the same influences, fatigues and even mistakes.

Anyone know of research by researchers being applied to the same researchers?

Representing word meaning and order information in a composite holographic lexicon

Saturday, November 19th, 2011

Representing word meaning and order information in a composite holographic lexicon by Michael N. Jones , Douglas J. K. Mewhort.


The authors present a computational model that builds a holographic lexicon representing both word meaning and word order from unsupervised experience with natural language. The model uses simple convolution and superposition mechanisms (cf. B. B. Murdock, 1982) to learn distributed holographic representations for words. The structure of the resulting lexicon can account for empirical data from classic experiments studying semantic typicality, categorization, priming, and semantic constraint in sentence completions. Furthermore, order information can be retrieved from the holographic representations, allowing the model to account for limited word transitions without the need for built-in transition rules. The model demonstrates that a broad range of psychological data can be accounted for directly from the structure of lexical representations learned in this way, without the need for complexity to be built into either the processing mechanisms or the representations. The holographic representations are an appropriate knowledge representation to be used by higher order models of language comprehension, relieving the complexity required at the higher level.

More reading along the lines of higher-dimensional representation techniques. Almost six (6) pages of references to run backwards and forwards so this is going to take a while.

Hyperdimensional Computing: An introduction…

Saturday, November 19th, 2011

Hyperdimensional Computing: An introduction to computing in distributed representation with high-dimensional random vectors by Pentti Kanerva (Cognitive Computation 1(2): 139-159)

You know it is going to be a Jack Park sort of day when the morning email has a notice about a presentation entitled: “Hyperdimensional Computing for Modeling How Brains Compute”. With the link you see above in the email. 😉

What’s a Jack Park sort of day like? Well, it certainly throws you off the track you were on before you sat down at the monitor. And pretty soon you are adding things like: Holographic Reduced Representation: Distributed Representation for Cognitive Structures by Tony Plate to your Amazon wish list. And adding to an already desk busting pile of papers that need your attention.

In short: A day that suits me just fine!

Suggest you read the paper, whether you add Tony’s book to your wish list or not. (Just in case you are interested, Patrick’s wish list. Or related items. If I get/have duplicates I can donate them to the library.)

I think the hyperdimensional computing approach may be one way to talk about the present need to make representations unnaturally precise so that our computing systems can deal with them. I say “unnaturally precise” because the descriptions or definitions needed by our computers aren’t necessary outside their context. If I call Jack up on the phone, I don’t say: “” identifer for user who wishes to speak to (identifier for Jack), etc.” No, I say: “This is Patrick.” Trusting that if Jack is awake and I don’t have a cold, Jack will recognize my voice.

There are any number of reasons why Jack will recognize it is me and not some other Patrick he knows. I will be calling for Georgia, which as a particular area code, the time of day will be right, I have a Southern accent, any number of clues, even before we get to my self-identification.

To increase the usefulness of our information systems, they need to become a lot more like us and not have us flattening our worlds to become a lot more like them.

Jack’s “Hyperdimensional Computing” may be one path in that direction.

COMTOP: Applied and Computational Algebraic Topology

Tuesday, September 13th, 2011

COMTOP: Applied and Computational Algebraic Topology

From the website:

The overall goal of this project is to develop flexible topological methods which will allow the analysis of data which is difficult to analyze using classical linear methods. Data obtained by sampling from highly curved manifolds or singular algebraic varieties in Euclidean space are typical examples where our methods will be useful. We intend to develop and refine two pieces of software which have been written by members of our research group, ISOMAP (Tenenbaum) and PLEX (de Silva-Carlsson). ISOMAP is a tool for dimension reduction and parameterization of high dimensional data sets, and PLEX is a homology computing tool which we will use in locating and analyzing singular points in data sets, as well as estimating dimension in situations where standard methods do not work well. We plan to extend the range of applicability of both tools, in the case of ISOMAP by studying embeddings into spaces with non-Euclidean metrics, and in the case of PLEX by building in the Mayer-Vietoris spectral sequence as a tool. Both ISOMAP and PLEX will be adapted for parallel computing. We will also begin the theoretical study of statistical questions relating to topology. For instance, we will initiate the study of higher dimensional homology of subsets sampled from Euclidean space under various sampling hypotheses. The key object of study will be the family of Cech complexes constructed using the distance function in Euclidean space together with a randomly chosen finite set of points in Euclidean space.

The goal of this project is to develop tools for understanding data sets which are not easy to understand using standard methods. This kind of data might include singular points, or might be strongly curved. The data is also high dimensional, in the sense that each data point has many coordinates. For instance, we might have a data set whose points each of which is an image, which has one coordinate for each pixel. Many standard tools rely on linear approximations, which do not work well in strongly curved or singular problems. The kind of tools we have in mind are in part topological, in the sense that they measure more qualitative properties of the spaces involved, such as connectedness, or the number of holes in a space, and so on. This group of methods has the capability of recognizing the number of parameters required to describe a space, without actually parameterizing it. These methods also have the capability of recognizing singular points (like points where two non-parallel planes or non-parallel lines intersect), without actually having to construct coordinates on the space. We will also be further developing and refining methods we have already constructed which can actually find good parameterizations for many high dimensional data sets. Both projects will involve the adaptation for the computer of many methods which have heretofore been used in by-hand calculations for solving theoretical problems. We will also initiate the theoretical development of topological tools in a setting which includes errors and sampling.

Are you still using “classical linear methods” to analyze your data sets?

Could be the right choice but it may be your only choice if you never consider the alternatives.

LSH Algorithm and Implementation (E2LSH)

Tuesday, January 25th, 2011

LSH Algorithm and Implementation (E2LSH) Authors: Alexandr Andoni and Piotr Indyk

Andoni and Indyk aren’t any better with titles than they were in Whose Your Nearest Neighbor but here you will find additional information on their high dimension nearest neighbor algorithm as well as an implementation of the algorithm.

NAQ Tree in Your Forest?

Tuesday, January 25th, 2011

Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space Authors: Ming Zhang and Reda Alhajj Keywords: Knn search, High dimensionality, Dimensionality reduction, Indexing, Similarity search


Similarity search (e.g., k-nearest neighbor search) in high-dimensional metric space is the key operation in many applications, such as multimedia databases, image retrieval and object recognition, among others. The high dimensionality and the huge size of the data set require an index structure to facilitate the search. State-of-the-art index structures are built by partitioning the data set based on distances to certain reference point(s). Using the index, search is confined to a small number of partitions. However, these methods either ignore the property of the data distribution (e.g., VP-tree and its variants) or produce non-disjoint partitions (e.g., M-tree and its variants, DBM-tree); these greatly affect the search efficiency. In this paper, we study the effectiveness of a new index structure, called Nested-Approximate-eQuivalence-class tree (NAQ-tree), which overcomes the above disadvantages. NAQ-tree is constructed by recursively dividing the data set into nested approximate equivalence classes. The conducted analysis and the reported comparative test results demonstrate the effectiveness of NAQ-tree in significantly improving the search efficiency.

Although I think the following paragraph from the paper is more interesting:

Consider a set of objects O = {o1 , o2 , . . . , on } and a set of attributes A = {a1 , a2 , . . . , ad }, we first divide the objects into groups based on the first attribute a1 , i.e., objects with same value of a1 are put in the same group; each group is an equivalence class [23] with respect to a1 . In other words, all objects in a group are indistinguishable by attribute a1 . We can refine the equivalence classes further by dividing each existing equivalence class into groups based on the second attribute a2 ; all objects in a refined equivalence class are indistinguishable by attributes a1 and a2 . This process may be repeated by adding one more attribute at a time until all the attributes are considered. Finally, we get a hierarchical set of equivalence classes, i.e., a hierarchical partitioning of the objects. This is roughly the basic idea of NAQ-tree, i.e., to partition the data space in our similarity search method. In other words, given a query object o, we can gradually reduce the search space by gradually considering the most relevant attributes.

With the caveat that this technique is focused on metric spaces.

But I rather like the idea of reducing the search space by the attributes under consideration. Replace search space with similarity/sameness space and you will see what I mean. Still relevant for searching as well.

Whose Your Nearest Neighbor?

Tuesday, January 25th, 2011

Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions Authors: Alexandr Andoni and Piotr Indyk

OK, I lied about the title.

You would think there would be short courses on title writing. Maybe I should start offering a one-day seminar in effective title writing.

Anyway, whatever the title issues, this is a deeply fascinating work on detection of nearest neighbors.

The short version is that really close neighbors bump into each other when hashing. So collisions become a way to detect neighbors. Rather clever.

I think of collisions as a basis for identify the same subjects.

Works in metric spaces but topic maps apply to metric spaces as well. After all, subjects are what define and occupy metric spaces.

For the longer explanation, read the paper.

Fuzzy Table

Thursday, November 25th, 2010

Tackling Large Scale Data In Government.

OK, but I cite the post because of its coverage of Fuzzy Table:

FuzzyTable is a large-scale, low-latency, parallel fuzzy-matching database built over Hadoop. It can use any matching algorithm that can compare two often high-dimensional items and return a similarity score. This makes it suitable not only for comparing fingerprints but other biometric modalities, images, audio, and anything that can be represented as a vector of features.

Hmmm, “anything that can be represented as a vector of features?”

Did someone mention subject identity? 😉

Worth a very close read. Software release coming.

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz

Monday, October 11th, 2010

Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz Authors: Luigi Caro, Vanessa Frias-Martinez, Enrique Frias-Martinez Keywords: Radial Coordinate Visulalization, Radviz, Dimension arrangement


The Radial Coordinate Visualization (Radviz) technique has been widely used to effectively evaluate the existence of patterns in highly dimensional data sets. A crucial aspect of this technique lies in the arrangement of the dimensions, which determines the quality of the posterior visualization. Dimension arrangement (DA) has been shown to be an NP-problem and different heuristics have been proposed to solve it using optimization techniques. However, very little work has focused on understanding the relation between the arrangement of the dimensions and the quality of the visualization. In this paper we first present two variations of the DA problem: (1) a Radviz independent approach and (2) a Radviz dependent approach. We then describe the use of the Davies-Bouldin index to automatically evaluate the quality of a visualization i.e., its visual usefulness. Our empirical evaluation is extensive and uses both real and synthetic data sets in order to evaluate our proposed methods and to fully understand the impact that parameters such as number of samples, dimensions, or cluster separability have in the relation between the optimization algorithm and the visualization tool.

Interesting both for exploration of data sets for constructing topic maps and quite possibly for finding the “best” visualizations for a topic map deliverable.

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction

Sunday, October 10th, 2010

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction Authors: Panagis Magdalinos, Michalis Vazirgiannis, Dialecti Valsamou Keywords: distributed non linear dimensionality reduction, NLDR, distributed dimensionality reduction, DDR, distributed data mining, DDM, dimensionality reduction, DR, Distributed Isomap, D-Isomap, C-Isomap, L-Isomap


Data mining tasks results are usually improved by reducing the dimensionality of data. This improvement however is achieved harder in the case that data lay on a non linear manifold and are distributed across network nodes. Although numerous algorithms for distributed dimensionality reduction have been proposed, all assume that data reside in a linear space. In order to address the non-linear case, we introduce D-Isomap, a novel distributed non linear dimensionality reduction algorithm, particularly applicable in large scale, structured peer-to-peer networks. Apart from unfolding a non linear manifold, our algorithm is capable of approximate reconstruction of the global dataset at peer level a very attractive feature for distributed data mining problems. We extensively evaluate its performance through experiments on both artificial and real world datasets. The obtained results show the suitability and viability of our approach for knowledge discovery in distributed environments.

Data mining in peer-to-peer networks will face topic map authors sooner or later.

Not only a useful discussion of the issues, but, the authors have posted source code and data sets used in the article as well:

Similarity Indexing: Algorithms and Performance (1996)

Monday, September 20th, 2010

Similarity Indexing: Algorithms and Performance (1996) Authors: David A. White , Ramesh Jain KeyWords Similarity Indexing, High Dimensional Feature Vectors, Approximate k-Nearest Neighbor Searching, Closest Points, Content-Based Retrieval, Image and Video Database Retrieval.

The authors of this paper coined the phrase “similarity indexing.”

A bit dated now but interesting as a background to techniques currently in use.

A topic map tracing the development of one of the “similarity” techniques would make an excellent thesis project.

The TV-tree — an index structure for high-dimensional data (1994)

Saturday, September 18th, 2010

The TV-tree — an index structure for high-dimensional data (1994) Authors: King-ip Lin , H. V. Jagadish , Christos Faloutsos Keywords:Spatial Index, Similarity Retrieval, Query by Context, R*-Tree, High-Dimensionality Feature Spaces.


We propose a file structure to index high-dimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such `varying length’ feature vectors. Finally we report simulation results, comparing the proposed structure with the R -tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.

The notion of “…utilizing additional features whenever the additional discriminatory power is absolutely necessary…” is an important one.

Compare to fixed simplistic discrimination and/or fixed complex, high-overhead, discrimination between subject representatives.

Either one represents a failure of imagination.