Archive for the ‘Matrix’ Category

The non-negative matrix factorization toolbox for biological data mining

Tuesday, April 16th, 2013

The non-negative matrix factorization toolbox for biological data mining by Yifeng Li and Alioune Ngom. (Source Code for Biology and Medicine 2013, 8:10 doi:10.1186/1751-0473-8-10)

From the post:

Background: Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though there currently exists packages implemented in R and other programming languages, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community, and in order to perform various data mining tasks on biological data.

Results: We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison.

Conclusions: A series of analysis such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.

Written in a bioinformatics context but also used in text data mining (Enron emails), spectral analysis and other data mining fields. (See Non-negative matrix factorization)

Twitter’s Scalding and Algebird: Matrix and Lighweight Algebra Library

Tuesday, September 25th, 2012

Twitter’s Scalding and Algebird: Matrix and Lighweight Algebra Library by Alex Popescu.

Alex points out:

  1. Scalding now includes a type-safe Matrix API
  2. In the familiar Fields API, we’ve added the ability to add type information to fields which allows scalding to pick up Ordering instances so that grouping on almost any scala collection becomes easy.
  3. Algebird is our lightweight abstract algebra library for Scala and is targeted for building aggregation systems (such as Storm).

Of the three, I am going to take a look at Algebird first.

3 surprising facts about the computation of scalar products

Monday, November 28th, 2011

3 surprising facts about the computation of scalar products by Daniel Lemire.

From the post:

The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 × 5 + 2 × 3. We have strong incentives to compute the scalar product as quickly as possible.

Sorry, can’t tell you the three things because that would ruin the surprise. ;-) See Daniel’s blog for the details.

Approaching optimality for solving SDD systems

Sunday, September 18th, 2011

In October 2010, this paper was presented by the authors.

Approaching optimality for solving SDD systems by Ioannis Koutis, Gary L. Miller, and Richard Peng.

Public reports on that paper can be found at: A Breakthrough in Algorithm Design in the September 2011 issue of CACM and PC Pro in: Algorithm sees massive jump in complex number crunching.

The claim is that the new approach will be a billion times faster than traditional techniques.

In February of 2011, the authors have posted a new and improved version of their algorithm in:

A nearly-mlogn time solver for SDD linear systems.

Koutis has written a MATLAB implementation at: CMG: Combinatorial Multigrid

For further background, see: Combinatorial Preconditioning, sparsification, local clustering, low-stretch trees, etc. by Spielman, one of the principal researchers in this area.

The most obvious application in topic maps would be recommender systems that bring possible merges to a topic map author’s attention or even perform merging on specified conditions. (If the application doesn’t seem obvious, read the post I refer to in: Text Feature Extraction (tf-idf) – Part 1 , again. Will also give you some ideas about scaleable merging tests as well.)

Years ago Lars Marius told me that topic maps needed to scale on laptops to be successful. It looks like algorithms are catching up to meet his requirement.

Welcome to The Matrix Factorization Jungle

Friday, September 2nd, 2011

Welcome to The Matrix Factorization Jungle [ A living documention on the state of the art algorithms dedicated to matrix factorization ]

From the webpage:

Matrix Decompositions has a long history and generally centers around a set of known factorizations such as LU, QR, SVD and eigendecompositions. With the advent of new methods based on random projections and convex optimization that started in part in the compressive sensing literature, we are seeing a surge of very diverse algorithms dedicated to many different kinds of matrix factorizations with constraints based on rank, positivity, sparsity,… As a result of this large increase in interest, I have decided to keep a list of them here following the success of the big picture in compressive sensing.

If you are unfamiliar with the use of matrices in data mining, consider Non-negative matrix factorization and the examples cited under Text mining.

University of Florida Sparse Matrix Collection

Monday, April 11th, 2011

University of Florida Sparse Matrix Collection

It’s not what you think. Well, it is but it is so much more. You really have to see the images at this site.

Abstract (from the paper by the same title):

We describe the University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications. The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms. It allows for robust and repeatable experiments: robust because performance results with artificially-generated matrices can be misleading, and repeatable because matrices are curated and made publicly available in many formats. Its matrices cover a wide spectrum of domains, include those arising from problems with underlying 2D or 3D geometry (as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs). We provide software for accessing and managing the Collection, from MATLAB, Mathematica, Fortran, and C, as well as an online search capability. Graph visualization of the matrices is provided, and a new multilevel coarsening scheme is proposed to facilitate this task.

A Java viewer for matrices is also found here.

The Matrix Cookbook – Post

Tuesday, February 8th, 2011

The Matrix Cookbook by Bob Carpenter (LingPipe Blog) points to a “matrix cheatsheet” by K. B. Petersen and M. S. Pedersen.

Good for those of you starting to use matrix methods for IR and hence building topic maps.

Decomposer

Saturday, December 11th, 2010

Decomposer

From the website:

Matrix algebra underpins the way many Big Data algorithms and data structures are composed: full-text search can be viewed as doing matrix multiplication of the term-document matrix by the query vector (giving a vector over documents where the components are the relevance score), computing co-occurrences in a collaborative filtering context (people who viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest) is taking the squaring the user-item interation matrix, calculating users who are k-degrees separated from each other in a social network or web-graph can be found by looking at the k-fold product of the graph adjacency matrix, and the list goes on (and these are all cases where the linear structure of the matrix is preserved!)
….
Currently implemented: Singular Value Decomposition using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb’s paper and there is a Lanczos implementation, both single-threaded, and in the contrib/hadoop subdirectory, as a hadoop map-reduce (series of) job(s). Coming soon: stochastic decomposition.

This code is in the process of being absorbed into the Apache Mahout Machine Learning Project.

Useful in learning to use search technology but also for recognizing at a very fundamental level, the limitations of that technology.

Document and query vectors are constructed without regard to the semantics of their components.

Using co-occurrence, for example, doesn’t give a search engine greater access to the semantics of the terms in question.

It simply makes the vector longer and so matches are less frequent and hopefully, less frequent = more precise.

That may or may not be the case. It also doesn’t account for case where the vectors are different but the subject in question is the same.