Graph Based Classification Methods Using Inaccurate External Classifier Information by Sundararajan Sellamanickam and Sathiya Keerthi Selvaraj.
Abstract:
In this paper we consider the problem of collectively classifying entities where relational information is available across the entities. In practice inaccurate class distribution for each entity is often available from another (external) classifier. For example this distribution could come from a classifier built using content features or a simple dictionary. Given the relational and inaccurate external classifier information, we consider two graph based settings in which the problem of collective classification can be solved. In the first setting the class distribution is used to fix labels to a subset of nodes and the labels for the remaining nodes are obtained like in a transductive setting. In the other setting the class distributions of all nodes are used to define the fitting function part of a graph regularized objective function. We define a generalized objective function that handles both the settings. Methods like harmonic Gaussian field and local-global consistency (LGC) reported in the literature can be seen as special cases. We extend the LGC and weighted vote relational neighbor classification (WvRN) methods to support usage of external classifier information. We also propose an efficient least squares regularization (LSR) based method and relate it to information regularization methods. All the methods are evaluated on several benchmark and real world datasets. Considering together speed, robustness and accuracy, experimental results indicate that the LSR and WvRN-extension methods perform better than other methods.
Doesn’t read like a page-turner, does it? 😉
An example from the paper helps illustrate why it is important:
In this paper we consider a related relational learning problem where, instead of a subset of labeled nodes, we have inaccurate external label/class distribution information for each node. This problem arises in many web applications. Consider, for example, the problem of identifying pages about Public works, Court, Health, Community development, Library etc. within the web site of a particular city. The link and directory relations contain useful signals for solving such a classification problem. Note that this relational structure will be different for different city web sites. If we are only interested in a small number of cities then we can afford to label a number of pages in each site and then apply transductive learning using the labeled nodes. But, if we want to do the classification on hundreds of thousands of city sites, labeling on all sites is expensive and we need to take a different approach. One possibility is to use a selected set of content dictionary features together with the labeling of a small random sample of pages from a number of sites to learn an inaccurate probabilistic classifier, e.g., logistic regression. Now, for any one city web site, the output of this initial classifier can be used to generate class distributions for the pages in the site, which can then be used together with the relational information in the site to get accurate classification.
In topic map parlance, we would say that identity is being established by the associations in which a topic participates, but that is a matter of terminology and not a substantive difference.
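To make the city web site example a little more concrete, here is a minimal sketch of the general idea: take the inaccurate class distributions from an external classifier and smooth them over the link graph. It uses a plain LGC-style iteration, with the external distributions as the fitting term. This is my own illustration and not the authors' LSR method or their exact LGC extension. The function name `propagate`, the parameters `alpha` and `iterations`, and the six-page toy graph are all assumptions made up for the example.

```python
import numpy as np


def propagate(adjacency, prior, alpha=0.8, iterations=200):
    """Smooth external class distributions over a graph (LGC-style sketch).

    adjacency: symmetric (n, n) matrix of non-negative link weights.
    prior:     (n, k) matrix; row i is the external classifier's
               (possibly inaccurate) class distribution for node i.
    alpha:     hypothetical trade-off; higher values trust the graph more.
    """
    W = np.asarray(adjacency, dtype=float)
    P = np.asarray(prior, dtype=float)

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]

    # Repeatedly mix each node's neighbors with its own external distribution.
    F = P.copy()
    for _ in range(iterations):
        F = alpha * (S @ F) + (1.0 - alpha) * P

    # Rescale rows so each one reads as a class distribution again.
    return F / F.sum(axis=1, keepdims=True)


# Toy site: pages 0-2 link to each other, pages 3-5 link to each other,
# and one weaker link joins page 2 to page 3.
adjacency = np.zeros((6, 6))
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.5)]
for i, j, weight in edges:
    adjacency[i, j] = adjacency[j, i] = weight

# External classifier output; page 2 leans toward the wrong class.
prior = np.array([[0.9, 0.1],
                  [0.9, 0.1],
                  [0.4, 0.6],
                  [0.1, 0.9],
                  [0.1, 0.9],
                  [0.1, 0.9]])

smoothed = propagate(adjacency, prior)
labels = smoothed.argmax(axis=1)
print(labels)
```

On this toy graph the external classifier alone would put page 2 in the second class, while the smoothed result pulls it back toward the class of the two pages it is most strongly linked to. Real sites would need a sparse matrix and some care in choosing `alpha`, neither of which this sketch attempts.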