Class-imbalanced classifiers for high-dimensional data by Wei-Jiun Lin and James J. Chen. (Brief Bioinform (2013) 14 (1): 13-26. doi: 10.1093/bib/bbs006)
A class-imbalanced classifier is a decision rule to predict the class membership of new samples from an available data set where the class sizes differ considerably. When the class sizes are very different, most standard classification algorithms may favor the larger (majority) class resulting in poor accuracy in the minority class prediction. A class-imbalanced classifier typically modifies a standard classifier by a correction strategy or by incorporating a new strategy in the training phase to account for differential class sizes. This article reviews and evaluates some most important methods for class prediction of high-dimensional imbalanced data. The evaluation addresses the fundamental issues of the class-imbalanced classification problem: imbalance ratio, small disjuncts and overlap complexity, lack of data and feature selection. Four class-imbalanced classifiers are considered. The four classifiers include three standard classification algorithms each coupled with an ensemble correction strategy and one support vector machines (SVM)-based correction classifier. The three algorithms are (i) diagonal linear discriminant analysis (DLDA), (ii) random forests (RFs) and (ii) SVMs. The SVM-based correction classifier is SVM threshold adjustment (SVM-THR). A Monte–Carlo simulation and five genomic data sets were used to illustrate the analysis and address the issues. The SVM-ensemble classifier appears to perform the best when the class imbalance is not too severe. The SVM-THR performs well if the imbalance is severe and predictors are highly correlated. The DLDA with a feature selection can perform well without using the ensemble correction.
At least the “big data” folks are right on one score: We are going to need help sorting out all the present and future information.
Not that we will ever attempt to sort it all out, as was reported in: The Untapped Big Data Gap (2012) [Merry Christmas Topic Maps!], only 23% of “big data” is going to be valuable if we do analyze it.
And your enterprise’s part of that 23% is even smaller.
Enough that your users will need help dealing with it, but not nearly the deluge that is being predicted.