Archive for the ‘Outlier Detection’ Category

NASA Support – Dr. Kirk Borne of George Mason University

Saturday, January 19th, 2013

The Arts and Entertainment Magazine (an unlikely source for me) published TAEM Interview with Dr. Kirk Borne of George Mason University, which is a delightful interview to generate support for NASA.

Of particular interest, Dr. Kirk Borne says:

My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data. These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks.

Topic maps are known for encoding knowns and known patterns in data.

How would you explore a topic map to find “…unknown unknowns and the unexpected patterns in the data?”

BTW, Dr. Borne invented the term “astroinformatics.”

Outlier Analysis

Sunday, January 13th, 2013

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years.

Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts. Therefore, in spite of its deep coverage, it can also provide a good introduction to the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased. ;-)

Outlier detection in two review articles (Part 2) (TM use case on Identifiers)

Saturday, May 26th, 2012

Outlier detection in two review articles (Part 2) by Sandro Saitta.

From the post:

Here we go with the second review article about outlier detection (this post is the continuation of Part I).

A Survey of Outlier Detection Methodologies

This paper, from Hodge and Austin, is also an excellent review of the field. Authors give a list of keywords in the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several application in the field, authors mention that an outlier can be “surprising veridical data“. It may only be situated in the wrong class.

An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in population, fraudulent behavior, changes in behavior of system and faults in system. Like in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). In the last one, they mention that some algorithms can allow a confidence in the fact that the observation is an outlier. Main drawback of the supervised approach is its inability to discover new types of outliers.

While you are examining the techniques, do note the alternative ways to identify the problem.

Can you say topic map? ;-)

Simple query expansion, assuming that any single term return hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.

But that isn’t an indictment of alternative identifications of subjects, that is a problem of granularity.

Returning documents forces users to wade through large amounts of potentially irrelevant content.

The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?

Suggestions?

Outlier detection in two review articles (Part 1)

Saturday, May 12th, 2012

Outlier detection in two review articles (Part 1) by Sandro Saitta.

Sandro writes:

The first one, Outlier Detection: A Survey, is written by Chandola, Banerjee and Kumar. They define outlier detection as the problem of “[...] finding patterns in data that do not conform to expected normal behavior“. After an introduction to what outliers are, authors present current challenges in this field. In my experience, non-availability of labeled data is a major one.

One of their main conclusions is that “[...] outlier detection is not a well-formulated problem“. It is your job, as a data miner, to formulate it correctly.

The final quote seems particularly well suited to subject identity issues. While any one subject identity may be well defined, the question is how to find and manage other subject identifications that may not be well defined.

As Sandro points out, it has nineteen (19) pages of references. However, only nine of those are as recent at 2007. All the rest are older references. I am sure it remains an excellent reference source but suspect more recent review articles on outlier detection exist.

Suggestions?