Archive for the ‘Linear Regression’ Category

Predictive Analytics: Generalized Linear Regression [part 3]

Sunday, June 3rd, 2012

Predictive Analytics: Generalized Linear Regression by Ricky Ho.

From the post:

In the previous 2 posts, we have covered how to visualize input data to explore strong signals as well as how to prepare input data to a form that is situation for learning. In this and subsequent posts, I’ll go through various machine learning techniques to build our predictive model.

  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees

There are two general types of problems that we are interested in this discussion; Classification is about predicting a category (value that is discrete, finite with no ordering implied) while Regression is about predicting a numeric quantity (value is continuous, infinite with ordering).

For classification problem, we use the “iris” data set and predict its “species” from its “width” and “length” measures of sepals and petals. Here is how we setup our training and testing data.

Ricky walks you through linear regression, logistic regression, and regularized versions of both.
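For readers who want something to poke at before clicking through, here is a minimal sketch of the "set up training and testing data, then fit a model" step. It is not taken from Ricky's post: the synthetic data and the helper names `train_test_split` and `fit_line` are assumptions made up for illustration, and the model is a one-variable ordinary least squares fit.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data (made up for illustration): y = 2x + 1 plus small noise.
noise = random.Random(42)
data = [(x, 2.0 * x + 1.0 + noise.uniform(-0.5, 0.5)) for x in range(50)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)

# Evaluate on the held-out rows only, never on the rows used for fitting.
test_mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
print(f"intercept={intercept:.2f} slope={slope:.2f} test MSE={test_mse:.3f}")
```

The split is seeded so the same rows land in the test set on every run, which makes comparisons between models repeatable.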

Skytree: Big Data Analytics

Saturday, March 3rd, 2012

Skytree: Big Data Analytics

Released this past week, Skytree offers both local and cloud-based data analytics.

From the website:

Skytree Server can accurately perform machine learning on massive datasets at high speed.

In the same way a relational database system (or database accelerator) is designed to perform SQL queries efficiently, Skytree Server is designed to efficiently perform machine learning on massive datasets.

Skytree Server’s scalable architecture performs state-of-the-art machine learning methods on data sets that were previously too big for machine learning algorithms to process. Leveraging advanced algorithms implemented on specialized systems and dedicated data representations tuned to machine learning, Skytree Server delivers up to 10,000 times performance improvement over existing approaches.

Currently supported machine learning methods:

  • Neighbors (Nearest, Farthest, Range, k, Classification)
  • Kernel Density Estimation and Non-parametric Bayes Classifier
  • K-Means
  • Linear Regression
  • Support Vector Machines (SVM)
  • Fast Singular Value Decomposition (SVD)
  • The Two-point Correlation

There is a “free” local version with a data limit (100,000 records) and, of course, commercial local and cloud versions.

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis

Wednesday, September 7th, 2011

A Performance Study of Data Mining Techniques: Multiple Linear Regression vs. Factor Analysis by Abhishek Taneja and R. K. Chauhan.

From the abstract:

The growing volume of data usually creates an interesting challenge for the need of data analysis tools that discover regularities in these data. Data mining has emerged as disciplines that contribute tools for data analysis, discovery of hidden knowledge, and autonomous decision making in many application domains. The purpose of this study is to compare the performance of two data mining techniques viz., factor analysis and multiple linear regression for different sample sizes on three unique sets of data. The performance of the two data mining techniques is compared on following parameters like mean square error (MSE), R-square, R-Square adjusted, condition number, root mean square error(RMSE), number of variables included in the prediction model, modified coefficient of efficiency, F-value, and test of normality. These parameters have been computed using various data mining tools like SPSS, XLstat, Stata, and MS-Excel. It is seen that for all the given dataset, factor analysis outperform multiple linear regression. But the absolute value of prediction accuracy varied between the three datasets indicating that the data distribution and data characteristics play a major role in choosing the correct prediction technique.

I had to do a double-take when I saw “factor analysis” in the title of this article. I remember factor analysis from Schubert’s The Judicial Mind Revisited: Psychometric Analysis of Supreme Court Ideology, where Schubert used factor analysis to model the relative positions of the Supreme Court Justices. Schubert taught himself factor analysis on a Friden rotary calculator. (I had one of those too, but that’s a different story.)

The real lesson of this article comes at the end of the abstract: the data distribution and data characteristics play a major role in choosing the correct prediction technique.
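Several of the comparison measures named in the abstract (MSE, RMSE, R-square, and adjusted R-square) are easy to compute by hand. The sketch below is not from the paper: the function name `regression_metrics` and the sample numbers are made up for illustration, and it assumes MSE is the residual sum of squares divided by the number of observations (some packages divide by the residual degrees of freedom instead).

```python
import math

def regression_metrics(actual, predicted, n_predictors):
    """Return MSE, RMSE, R-square and adjusted R-square for a fitted model."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual and total sums of squares.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n  # assumption: divide by n, not by n - n_predictors - 1
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-square penalizes each extra predictor in the model.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2, "adj_r2": adj_r2}

# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 9.5, 10.5]

metrics = regression_metrics(actual, predicted, n_predictors=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R-square is the measure to watch when two models use different numbers of variables, since plain R-square never decreases when a predictor is added.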

Combining Pattern Classifiers: Methods and Algorithms

Saturday, March 12th, 2011

Combining Pattern Classifiers: Methods and Algorithms, Ludmila I. Kuncheva (2004)

WorldCat entry: Combining Pattern Classifiers: Methods and Algorithms

From the preface:

Everyday life throws at us an endless number of pattern recognition problems: smells, images, voices, faces, situations, and so on. Most of these problems we solve at a sensory level or intuitively, without an explicit method or algorithm. As soon as we are able to provide an algorithm the problem becomes trivial and we happily delegate it to the computer. Indeed, machines have confidently replaced humans in many formerly difficult or impossible, now just tedious pattern recognition tasks such as mail sorting, medical test reading, military target recognition, signature verification, meteorological forecasting, DNA matching, fingerprint recognition, and so on.

In the past, pattern recognition focused on designing single classifiers. This book is about combining the “opinions” of an ensemble of pattern classifiers in the hope that the new opinion will be better than the individual ones. “Vox populi, vox Dei.”

The field of combining classifiers is like a teenager: full of energy, enthusiasm, spontaneity, and confusion; undergoing quick changes and obstructing the attempts to bring some order to its cluttered box of accessories. When I started writing this book, the field was small and tidy, but it has grown so rapidly that I am faced with the Herculean task of cutting out a (hopefully) useful piece of this rich, dynamic, and loosely structured discipline. This will explain why some methods and algorithms are only sketched, mentioned, or even left out and why there is a chapter called “Miscellanea” containing a collection of important topics that I could not fit anywhere else.
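The idea of pooling the “opinions” of several classifiers can be sketched with a plain majority vote, which is only one of many possible combination rules. The code below is not from the book: the three threshold classifiers and the `majority_vote` helper are hypothetical, made up for illustration.

```python
from collections import Counter

def by_length(sample):
    """Toy classifier: looks only at the first measurement."""
    return "large" if sample[0] > 5.0 else "small"

def by_width(sample):
    """Toy classifier: looks only at the second measurement."""
    return "large" if sample[1] > 3.0 else "small"

def by_sum(sample):
    """Toy classifier: looks at the sum of both measurements."""
    return "large" if sample[0] + sample[1] > 7.0 else "small"

def majority_vote(classifiers, sample):
    """Return the label predicted by the most classifiers.

    Ties go to the label that was voted for first.
    """
    votes = [classify(sample) for classify in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label

ensemble = [by_length, by_width, by_sum]
for sample in [(6.0, 2.0), (4.0, 3.5), (4.0, 2.0)]:
    print(sample, "->", majority_vote(ensemble, sample))
```

With an odd number of two-class voters there are no ties, which is why three classifiers are used here.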

I appreciate the author’s suggestion of older material for seeing how pattern recognition developed.

Suggestions/comments on this or later literature on pattern recognition?

Machine Learning Ex2 – Linear Regression – Post

Friday, February 25th, 2011

Machine Learning Ex2 – Linear Regression

A useful exercise on linear regression using R.