Combining Heterogeneous Classifiers for Relational Databases by Geetha Manjunatha, M Narasimha Murty and Dinkar Sitaram.
Most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a ‘flat’ form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets, namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy.
When I read:
So, a typical enterprise dataset resides in such expert-designed multiple relational database tables. On the other hand, as known, most traditional classication algorithms still assume that the input dataset is available in a single table – a flat representation of data attributes. So, for applying these state-of-art single-table data mining techniques to enterprise data, one needs to convert the distributed relational data into a flat form.
a couple of things dropped into place.
First, the problem being described, the production of a flat form for analysis reminds me of the problem of record linkage in the late 1950′s (predating relational databases). There records were regularized to enable very similar analysis.
Second, as the authors state in a paragraph or so, conversion to such a format is not possible in most cases. Interesting that the choice of relational database table design has the impact of limiting the type of analysis that can be performed on the data.
Therefore, knowledge mining over real enterprise data using machine learning techniques is very valuable for what is called an intelligent enterprise. However, application of state-of-art pattern recognition techniques in the mainstream BI has not yet taken o [Gartner report] due to lack of in-memory analytics among others. The key hurdle to make this possible is the incompatibility between the input data formats used by most machine learning techniques and the formats used by real enterprises.
If freeing data from its relational prison is a key aspect to empowering business intelligence (BI), what would you suggest as a solution?