Outlier detection in two review articles (Part 2) by Sandro Saitta.
From the post:
Here we go with the second review article about outlier detection (this post is the continuation of Part I).
A Survey of Outlier Detection Methodologies
This paper, by Hodge and Austin, is also an excellent review of the field. The authors give a list of keywords for the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several applications in the field, the authors mention that an outlier can be “surprising veridical data”: it may simply be situated in the wrong class.
An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in the population, fraudulent behavior, changes in system behavior and faults in the system. As in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). For the last, they mention that some algorithms can attach a confidence to the judgment that an observation is an outlier. The main drawback of the supervised approach is its inability to discover new types of outliers.
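To get a concrete feel for the unsupervised case, here is a minimal sketch (mine, not from the paper) that flags observations deviating markedly from the rest of the sample, in the spirit of Grubbs' definition quoted above. The 3-sigma threshold and the data are arbitrary choices for illustration:

```python
# Minimal unsupervised outlier check: flag observations that deviate
# markedly from the rest of the sample (in the spirit of Grubbs, 1969).
# The 3-sigma threshold is an illustrative choice, not from the paper.

def zscore_outliers(sample, threshold=3.0):
    n = len(sample)
    mean = sum(sample) / n
    std = (sum((x - mean) ** 2 for x in sample) / n) ** 0.5
    if std == 0:
        return []  # all values identical: nothing deviates
    return [x for x in sample if abs(x - mean) / std > threshold]

# Twenty inliers around 10.0 plus one clear deviant.
data = [10.0 + 0.1 * (i % 5) for i in range(20)] + [25.0]
print(zscore_outliers(data))  # [25.0]
```

A supervised approach would instead learn from labeled examples of outliers, which is exactly why it cannot discover new types of them.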
While you are examining the techniques, do note the alternative ways of identifying the same problem: six different keywords for one subject.
Can you say topic map? 😉
Simple query expansion, assuming that any single term returns hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.
But that isn’t an indictment of alternative identifications of subjects; it is a problem of granularity.
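To put the granularity problem in numbers, here is a sketch of naive OR-expansion over the six keywords quoted above. The corpus and per-term hit counts are entirely invented for illustration; the point is only that expansion unions the result sets:

```python
import random

# The six alternative identifications quoted from Hodge and Austin.
SYNONYMS = ["outlier detection", "novelty detection", "anomaly detection",
            "noise detection", "deviation detection", "exception mining"]

# Hypothetical per-term result sets over 100,000 paper ids; the sizes
# are invented purely for illustration.
random.seed(0)
hits = {term: set(random.sample(range(100_000), 400)) for term in SYNONYMS}

# Naive query expansion: OR the synonyms together, i.e. union the results.
expanded = set().union(*hits.values())

print(len(hits["outlier detection"]))  # several hundred papers
print(len(expanded))                   # several thousand
```

Recall grows, but the unit returned is still a whole document.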
Returning documents forces users to wade through large amounts of potentially irrelevant content.
The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?
Suggestions?