Archive for the ‘Topological Data Analysis’ Category

Extracting insights from the shape of complex data using topology

Thursday, November 6th, 2014

Extracting insights from the shape of complex data using topology by P. Y. Lum, et al. (Scientific Reports 3, Article number: 1236 doi:10.1038/srep01236)

Abstract:

This paper applies topological methods to study complex high dimensional data sets by extracting shapes (patterns) and obtaining insights about them. Our method combines the best features of existing standard methodologies such as principal component and cluster analyses to provide a geometric representation of complex data sets. Through this hybrid method, we often find subgroups in data sets that traditional methodologies fail to find. Our method also permits the analysis of individual data sets as well as the analysis of relationships between related data sets. We illustrate the use of our method by applying it to three very different kinds of data, namely gene expression from breast tumors, voting data from the United States House of Representatives and player performance data from the NBA, in each case finding stratifications of the data which are more refined than those produced by standard methods.

In order to identify subjects you must first discover them.

Does the available financial contribution data on members of the United States House of Representatives correspond with the clustering analysis here? (Asking because I don’t know but would be interested in finding out.)

I first saw this in a tweet by Stian Danenbarger.

Computational Topology and Data Analysis

Wednesday, November 13th, 2013

Computational Topology and Data Analysis by Tamal K Dey.

Course syllabus:

Computational topology has played a synergistic role in bringing together research work from computational geometry, algebraic topology, data analysis, and many other related scientific areas. In recent years, the field has undergone particular growth in the area of data analysis. The application of topological techniques to traditional data analysis, which before has mostly developed on a statistical setting, has opened up new opportunities. This course is intended to cover this aspect of computational topology along with the developments of generic techniques for various topology-centered problems.

A course outline on computational topology with a short reading list, papers and notes on various topics.

I found this while looking up references on Tackling some really tough problems….

Tackling some really tough problems…

Wednesday, November 13th, 2013

Tackling some really tough problems with machine learning by Derrick Harris.

From the post:

Machine learning startup Ayasdi is partnering with two prominent institutions — Lawrence Livermore National Laboratory and the Texas Medical Center — to help advance some of their complicated data challenges. At LLNL, the company will collaborate on research in energy, climate change, medical technology, and national security, while its work with the Texas Medical Center will focus on translational medicine, electronic medical records and finding new uses for existing drugs.

Ayasdi formally launched in January after years researching its core technology, called topological data analysis. Essentially, the company’s software, called Iris, uses hundreds of machine learning algorithms to analyze up to tens of billions of data points and identify the relationships among them. The topological part comes from the way the results of this analysis are visually mapped into a network that places similar or tightly connected points near one another so users can easily spot collections of variables that appear to affect each other.

Tough problems:

At LLNL, the company will collaborate on research in energy, climate change, medical technology, and national security, while its work with the Texas Medical Center will focus on translational medicine, electronic medical records and finding new uses for existing drugs.

I would say so but that wasn’t the “tough” problem I was expecting.

The “tough” problem I had in mind was taking data with no particular topology and mapping it to a topology.

I ask because “similar or tightly connected points” depend upon a notion of “similarity” that is not inherent in most data points.

For example, how “similar” are you to a leaker because you work in the same office? How does that “similarity” compare to the “similarity” of other relationships?
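To make the point concrete, here is a minimal sketch showing that "who is most similar to the leaker" changes with the measure you pick. The people, the two features (hours in the same office per week, shared projects) and the numbers are all made up for illustration.

```python
import math

# Hypothetical records: (hours in the same office per week, shared projects).
people = {
    "leaker": (40.0, 1.0),
    "alice": (38.0, 9.0),
    "bob": (4.0, 0.1),
}

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

target = people["leaker"]
others = {name: vec for name, vec in people.items() if name != "leaker"}

# Smallest distance wins under the Euclidean measure.
nearest_by_distance = min(others, key=lambda n: euclidean_distance(target, others[n]))
# Largest cosine wins under the cosine measure.
nearest_by_cosine = max(others, key=lambda n: cosine_similarity(target, others[n]))

print(nearest_by_distance, nearest_by_cosine)
```

With these invented numbers the Euclidean measure picks alice (similar raw hours) while the cosine measure picks bob (same proportions, much smaller scale). Neither answer is inherent in the data; each follows from the chosen notion of similarity.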


Original text (which I have corrected above):

I ask because “similar or tightly connected points” depend upon a notion of “distance” that is not inherent in most data points.

For example, how “near” or “far” are you from a leaker by working in the same office? How does that “near” or “far” compare to the nearness or farness of other relationships?

I corrected the original post to remove the implication of a metric distance.

The Filtering vs. Clustering Dilemma

Tuesday, June 11th, 2013

Hierarchical information clustering by means of topologically embedded graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

Abstract:

We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an unsupervised and deterministic manner, without the use of any prior information. This is achieved by building topologically embedded networks containing the subset of most significant links and analyzing the network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which describes the way clusters are composed, and the inter-cluster hierarchy which describes how clusters gather together. We discuss performance, robustness and reliability of this method by first investigating several artificial data-sets, finding that it can outperform significantly other established approaches. Then we show that our method can successfully differentiate meaningful clusters and hierarchies in a variety of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma samples uncovers biologically significant groups of genes which play key-roles in diagnosis, prognosis and treatment of some of the most relevant human lymphoid malignancies.

I like the framing of the central issue a bit further in the paper:

Filtering information out of complex datasets is becoming a central issue and a crucial bottleneck in any scientific endeavor. Indeed, the continuous increase in the capability of automatic data acquisition and storage is providing an unprecedented potential for science. However, the ready accessibility of these technologies is posing new challenges concerning the necessity to reduce data-dimensionality by filtering out the most relevant and meaningful information with the aid of automated systems. In complex datasets information is often hidden by a large degree of redundancy and grouping the data into clusters of elements with similar features is essential in order to reduce complexity [1]. However, many clustering methods require some a priori information and must be performed under expert supervision. The requirement of any prior information is a potential problem because often the filtering is one of the preliminary processing on the data and therefore it is performed at a stage where very little information about the system is available. Another difficulty may arise from the fact that, in some cases, the reduction of the system into a set of separated local communities may hide properties associated with the global organization. For instance, in complex systems, relevant features are typically both local and global and different levels of organization emerge at different scales in a way that is intrinsically not reducible. We are therefore facing the problem of catching simultaneously two complementary aspects: on one side there is the need to reduce the complexity and the dimensionality of the data by identifying clusters which are associated with local features; but, on the other side, there is a need of keeping the information about the emerging global organization that is responsible for cross-scale activity. It is therefore essential to detect clusters together with the different hierarchical gatherings above and below the cluster levels.
(emphasis added)

Simplification of data is always lossy. The proposed technique does not avoid all loss but hopes to mitigate its consequences.

Briefly, the technique relies upon building a network of the “most significant” links and analyzing the network structure. The synthetic and real data sets show that the technique works quite well, at least for data sets where we can judge the outcome.
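For a feel of what keeping only the “most significant” links can look like, here is a minimal sketch that uses a maximum spanning tree as a stand-in. That is a simpler filter than the planar embedding the paper describes, not the authors' method, and the node names and similarity values are made up.

```python
def maximum_spanning_tree(nodes, similarity):
    """Kruskal's algorithm, taking the strongest links first.

    `similarity` maps an unordered pair (a, b) to a weight; a link is kept
    only if it joins two groups that are not yet connected.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    kept = []
    for (a, b), weight in sorted(similarity.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            kept.append((a, b, weight))
    return kept

# Hypothetical similarities between four items.
nodes = ["g1", "g2", "g3", "g4"]
similarity = {
    ("g1", "g2"): 0.9,
    ("g1", "g3"): 0.2,
    ("g1", "g4"): 0.1,
    ("g2", "g3"): 0.3,
    ("g2", "g4"): 0.15,
    ("g3", "g4"): 0.8,
}

tree = maximum_spanning_tree(nodes, similarity)
for a, b, weight in tree:
    print(a, b, weight)
```

Of the six possible links only three survive: the two strong pairs and the single weaker link that joins them. Even this toy filter shows the local grouping and the global connection at the same time.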

What of larger data sets, where algorithmic approaches are the only feasible means of analysis? How do we judge accuracy in those cases?

A revised version of this paper appears as: Hierarchical Information Clustering by Means of Topologically Embedded Graphs by Won-Min Song, T. Di Matteo, and Tomaso Aste.

The original development of the technique used here can be found in: A tool for filtering information in complex systems by M. Tumminello, T. Aste, T. Di Matteo, and R. N. Mantegna.

“Almost there….” (Computing Homology)

Friday, April 12th, 2013

We all remember the pilot in Star Wars who kept saying, “Almost there….” Jeremy Kun has us “almost there…” in his latest installment: Computing Homology.

To give you some encouragement, Jeremy concludes the post saying:

The reader may be curious as to why we didn’t come up with a more full-bodied representation of a simplicial complex and write an algorithm which accepts a simplicial complex and computes all of its homology groups. We’ll leave this direct approach as a (potentially long) exercise to the reader, because coming up in this series we are going to do one better. Instead of computing the homology groups of just one simplicial complex using by repeating one algorithm many times, we’re going to compute all the homology groups of a whole family of simplicial complexes in a single bound. This family of simplicial complexes will be constructed from a data set, and so, in grandiose words, we will compute the topological features of data.

If it sounds exciting, that’s because it is! We’ll be exploring a cutting-edge research field known as persistent homology, and we’ll see some of the applications of this theory to data analysis. (bold emphasis added)
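To give a rough idea of what “a whole family of simplicial complexes” built from a data set can mean, here is a minimal sketch of the simplest case only: tracking when connected components merge as a distance threshold grows. It is my own toy illustration, not code from Jeremy's series, and the points are made up.

```python
import math
from itertools import combinations

def component_merge_scales(points):
    """Return the distance thresholds at which two components merge.

    Every point starts as its own component. Edges are added from shortest
    to longest; each edge that joins two different components records the
    scale at which one of them stops being separate.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    merge_scales = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            merge_scales.append(length)
    return merge_scales

# Two visibly separate groups of hypothetical points.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
deaths = component_merge_scales(points)
print(deaths)
```

Three merges happen at scale 1.0 and the last one only at 9.0. That long gap is the kind of feature that persists across scales, here signalling two clusters.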

Data analysts are needed at all levels.

Do you want to be a spreadsheet data analyst or something a bit harder to find?

…topological data analysis

Sunday, January 20th, 2013

New big data firm to pioneer topological data analysis by John Burn-Murdoch.

From the post:

A US big data firm is set to establish algebraic topology as the gold standard of data science with the launch of the world’s leading topological data analysis (TDA) platform.

Ayasdi, whose co-founders include renowned mathematics professor Gunnar Carlsson, launched today in Palo Alto, California, having secured $10.25m from investors including Khosla Ventures in the first round of funding.

The funds will be used to build on its Insight Discovery platform, the culmination of 12 years of research and development into mathematics, computer science and data visualisation at Stanford.

Ayasdi’s work prior to launching as a company has already yielded breakthroughs in the pharmaceuticals industry. In one case it revealed new insights in eight hours – compared to the previous norm of over 100 hours – cutting the turnaround from analysis to clinical trials in the process.

The project? CompTop, which I covered here.

Does topological data analysis sound more interesting now than before?

Algebraic Topology and Machine Learning

Saturday, August 25th, 2012

Algebraic Topology and Machine Learning – In conjunction with Neural Information Processing Systems (NIPS 2012)

September 16, 2012 – Submissions Due
October 7, 2012 – Acceptance Notices
December 7 or 8 (TBD), 2012, Lake Tahoe, Nevada, USA.

From the call for papers:

Topological methods and machine learning have long enjoyed fruitful interactions as evidenced by popular algorithms like ISOMAP, LLE and Laplacian Eigenmaps which have been borne out of studying point cloud data through the lens of topology/geometry. More recently several researchers have been attempting to understand the algebraic topological properties of data. Algebraic topology is a branch of mathematics which uses tools from abstract algebra to study and classify topological spaces. The machine learning community thus far has focused almost exclusively on clustering as the main tool for unsupervised data analysis. Clustering however only scratches the surface, and algebraic topological methods aim at extracting much richer topological information from data.

The goals of this workshop are:

  1. To draw the attention of machine learning researchers to a rich and emerging source of interesting and challenging problems.
  2. To identify problems of interest to both topologists and machine learning researchers and areas of potential collaboration.
  3. To discuss practical methods for implementing topological data analysis methods.
  4. To discuss applications of topological data analysis to scientific problems.

We also invite submissions in a variety of areas, at the intersection of algebraic topology and learning, that have witnessed recent activity. Areas of focus for submissions include but are not limited to:

  1. Statistical approaches to robust topological inference.
  2. Novel applications of topological data analysis to problems in machine learning.
  3. Scalable methods for topological data analysis.

NIPS2012 site. You will appreciate the “dramatization.” 😉

Put it on your calendar and/or watch for papers!

Topological Data Analysis

Monday, July 2nd, 2012

Topological Data Analysis by Larry Wasserman.

From the post:

Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computation geometry.

The basic idea of TDA is to describe the “shape of the data” by finding clusters, holes, tunnels, etc. Cluster analysis is special case of TDA. I’m not an expert on TDA but I do find it fascinating. I’ll try to give a flavor of what this subject is about.
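As a small illustration of counting “clusters” and “holes,” here is a minimal sketch that computes the first two Betti numbers (connected components and one-dimensional holes) of a tiny simplicial complex, using ranks of boundary matrices with arithmetic mod 2. The example complexes are made up and the function names are my own, not from Larry's post.

```python
def rank_mod_2(rows):
    """Rank of a 0/1 matrix over the field with two elements."""
    rows = [row[:] for row in rows]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                # Adding rows mod 2 is a bitwise XOR.
                rows[r] = [x ^ y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def betti_0_and_1(n_vertices, edges, triangles=()):
    """Return (b0, b1): the number of components and of 1-dimensional holes."""
    # Boundary of each edge: its two end points.
    d1 = [[1 if v in e else 0 for e in edges] for v in range(n_vertices)]
    # Boundary of each filled triangle: its three edges.
    d2 = [[1 if set(e) <= set(t) else 0 for t in triangles] for e in edges]
    rank_d1 = rank_mod_2(d1) if edges else 0
    rank_d2 = rank_mod_2(d2) if triangles else 0
    b0 = n_vertices - rank_d1
    b1 = len(edges) - rank_d1 - rank_d2
    return b0, b1

hollow_square = betti_0_and_1(4, [(0, 1), (1, 2), (2, 3), (0, 3)])
filled_triangle = betti_0_and_1(3, [(0, 1), (1, 2), (0, 2)], [(0, 1, 2)])
two_segments = betti_0_and_1(4, [(0, 1), (2, 3)])
print(hollow_square, filled_triangle, two_segments)
```

The hollow square has one component and one hole, the filled triangle has one component and no hole, and the two disjoint segments have two components. The first number alone is the cluster count, which is the sense in which cluster analysis shows up as a special case.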

Just in case you want to get in on the ground floor of a new area of research.

Larry has citations to the literature in case you need to pick up beach reading.

Applied topology and Dante: an interview with Robert Ghrist [Sept., 2010]

Monday, May 28th, 2012

Applied topology and Dante: an interview with Robert Ghrist by John D. Cook. (September 13, 2010)

From the post:

A few weeks ago I discovered Robert Ghrist via his web site. Robert is a professor of mathematics and electrical engineering. He describes his research as applied topology, something I’d never heard of. (Topology has countless applications to other areas of mathematics, but I’d not heard of much work directly applying topology to practical physical problems.) In addition to his work in applied topology, I was intrigued by Robert’s interest in old books.

The following is a lightly-edited transcript of a phone conversation Robert and I had September 9, 2010.

If the interview sounds interesting, you may want to read/skim:

[2008] R. Ghrist, “Three examples of applied and computational homology,” Nieuw Archief voor Wiskunde 5/9(2).

or,

[2010] R. Ghrist, “Applied Algebraic Topology & Sensor Networks,” a manuscript text. (caveat! file>50megs!)

Applied Algebraic Topology & Sensor Networks is the set of notes for an AMS short course. Ghrist recommends continuing with Algebraic Topology by Allen Hatcher. (Let me know if you need my shipping address.)

Q: Are sensors always mechanical sensors? We speak of them as though that were the case.

What if I can’t afford unmanned drones (to say nothing of their pilots) and have $N$ people with cellphones?

How does a more “discriminating” “sensor” impact the range of capabilities/solutions?

Inquiry: Algebraic Geometry and Topology

Monday, October 31st, 2011

Inquiry: Algebraic Geometry and Topology

Speaking of money and such matters, a call for assistance from Quantivity:

Algebraic geometry and topology traditionally focused on fairly pure math considerations. With the rise of high-dimensional machine learning, these fields are increasing being pulled into interesting computational applications such as manifold learning. Algebraic statistics and information geometry offer potential to help bridge these fields with modern statistics, especially time-series and random matrices.

Early evidence suggests potential for significant intellectual cross-fertilization with finance, both mathematical and computational. Geometrically, richer modeling and analysis of latent geometric structure than available from classic linear algebraic decomposition (e.g. PCA, one of the main workhorses of modern $\mathbb{P}$ finance); for example, cumulant component analysis. Topologically, more effective qualitative analysis of data sampled from manifolds or singular algebraic varieties; for example, persistent homology (see CompTop).

As evidence by Twitter followers, numerous Quantivity readers are experts in these fields. Thus, perhaps the best way to explore is to seek insight from readers.

Readers: please use comments to suggest applied literature from these fields; ideally, although not required, that of potential relevance to finance modeling. All types of literature are requested, from intro texts to survey articles to preprint working papers on specific applications.

These suggestions will be synthesized into one or more subsequent posts, along with appropriate additions to People of Quant Research.

If you or a member of your family knows of any relevant resources, please go to: Inquiry: Algebraic Geometry and Topology and volunteer those resources as comments. You might even make the People of Quant Research list!

I wonder if there would be any interest in tracking bundled instruments using topic maps? That could be an interesting question.