Archive for the ‘Outlier Detection’ Category

Why Big Data Fails to Detect Terrorists

Thursday, December 17th, 2015

Kirk Borne tweeted a link to his presentation, Big Data Science for Astronomy & Space, and more specifically to slides 24 and 25 on novelty detection (surprise discovery).

Casting about for more resources to point out, I found Novelty Detection in Learning Systems by Stephen Marsland.

The abstract for Stephen’s paper:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set. This means that the performance of the network will be poor for those classes. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fit into the model that it has acquired, that is, members of the novel class.

This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.
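The abstract's recipe — train only on the "negative" (normal) class, then flag inputs that do not fit the acquired model — can be sketched in a few lines. This is a deliberately minimal, stdlib-only illustration; the Gaussian model and three-sigma threshold are my assumptions, not Marsland's method:

```python
import statistics

def fit_normal(samples):
    """Model the 'negative' (normal) class with a mean and standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_novel(x, mean, stdev, k=3.0):
    """Flag inputs more than k standard deviations from the normal model."""
    return abs(x - mean) > k * stdev

# Train only on examples of normal behavior...
normal = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1]
mean, stdev = fit_normal(normal)

# ...then detect inputs that do not fit the model.
print(is_novel(10.05, mean, stdev))  # False: fits the model
print(is_novel(25.0, mean, stdev))   # True: a novel input
```

Note the dependency: the detector is only as good as the baseline of normal examples it was fitted to — which is exactly the problem discussed below.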

The rest of the paper is very good and worth your time to read but we need not venture beyond the abstract to demonstrate why big data cannot, by definition, detect terrorists.

The root of the terrorist detection problem summarized in the first sentence:

Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen.

So, what are the inputs of a terrorist that differ from the inputs usually seen?

That’s a simple enough question.

Previously committing a terrorist suicide attack is a definite tell but it isn’t a useful one.

Obviously the TSA doesn’t know because it has never caught a terrorist, despite its profile and wannabe psychics watching travelers.

You can churn big data 24×7 but if you don’t have a baseline of expected inputs, no input is going to stand out from the others.

The San Bernardino attackers were not detected because their inputs didn't vary enough for the couple to stand out.

Even if they had been selected for close and unconstitutional monitoring of their electronic traffic, bank accounts, social media, phone calls, etc., there is no evidence that current data techniques would have detected them.

Before you invest or continue paying for big data to detect terrorists, ask the simple questions:

What is your baseline from which variance will signal a terrorist?

How often has it worked?

Once you have a dead terrorist, you can start from the dead terrorist and search your big data, but that’s an entirely different starting point.

Given the weeks, months and years of finger pointing following a terrorist attack, speed really isn’t an issue.

RAD – Outlier Detection on Big Data

Monday, March 2nd, 2015

RAD – Outlier Detection on Big Data by Jeffrey Wong, Chris Colburn, Elijah Meeks, and Shankar Vedaraman.

From the post:

Outlier detection can be a pain point for all data driven companies, especially as data volumes grow. At Netflix we have multiple datasets growing by 10B+ record/day and so there’s a need for automated anomaly detection tools ensuring data quality and identifying suspicious anomalies. Today we are open-sourcing our outlier detection function, called Robust Anomaly Detection (RAD), as part of our Surus project.

As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”

  • High cardinality dimensions: High cardinality data sets – especially those with large combinatorial permutations of column groupings – make human inspection impractical.
  • Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
  • Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
  • Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.

In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.
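Netflix's RAD itself is built around a more sophisticated algorithm, but the last bullet's point — that mean/variance methods break on non-normally distributed data — can be illustrated with a much simpler robust detector based on the median and MAD (median absolute deviation). This sketch is my own illustration, not part of RAD:

```python
import statistics

def mad_outliers(values, k=3.5):
    """Flag points whose modified z-score exceeds k.

    Median and MAD are robust statistics: unlike mean/stdev, they are
    not dragged toward the very outliers they are meant to detect."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing stands out
    # 0.6745 rescales MAD to be comparable to a standard deviation
    return [v for v in values if 0.6745 * abs(v - med) / mad > k]

data = [12, 11, 13, 12, 11, 12, 120]  # one suspicious spike
print(mad_outliers(data))  # [120]
```

A mean/stdev rule on the same data would be pulled toward the spike and is far more likely to miss it; that resistance to contamination is what "robust" means in this context.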

Looking for “suspicious anomalies” is always popular, in part because it implies someone has deliberately departed from “normal” behavior.

Certainly important, but as the FBI's staging of terror plots (which we discussed earlier today) shows, the normal FBI m.o. is to stage terror plots; an anomaly would be a real terror plot, one not staged by the FBI.

The lesson: don't assume outliers are departures from a desired norm. They can be, but they aren't always.

NASA Support – Dr. Kirk Borne of George Mason University

Saturday, January 19th, 2013

The Arts and Entertainment Magazine (an unlikely source for me) published TAEM Interview with Dr. Kirk Borne of George Mason University, which is a delightful interview to generate support for NASA.

Of particular interest, Dr. Kirk Borne says:

My current research is focused on outlier detection, which I prefer to call Surprise Discovery – finding the unknown unknowns and the unexpected patterns in the data. These discoveries may reveal data quality problems (i.e., problems with the experiment or data processing pipeline), but they may also reveal totally new astrophysical phenomena: new types of galaxies or stars or whatever. That discovery potential is huge within the huge data collections that are being generated from the large astronomical sky surveys that are taking place now and will take place in the coming decades. I haven’t yet found that one special class of objects or new type of astrophysical process that will win me a Nobel Prize, but you never know what platinum-plated needles may be hiding in those data haystacks.

Topic maps are known for encoding knowns and known patterns in data.

How would you explore a topic map to find “…unknown unknowns and the unexpected patterns in the data?”

BTW, Dr. Borne invented the term “astroinformatics.”

Outlier Analysis

Sunday, January 13th, 2013

Outlier Analysis by Charu Aggarwal (Springer, January 2013). Post by Gregory Piatetsky.

From the post:

This is an authored text book on outlier analysis. The book can be considered a first comprehensive text book in this area from a data mining and computer science perspective. Most of the earlier books in outlier detection were written from a statistical perspective, and precede the emergence of the data mining field over the last 15-20 years.

Each chapter contains carefully organized content on the topic, case studies, extensive bibliographic notes and the future direction of research in this field. Thus, the book can also be used as a reference aid. Emphasis was placed on simplifying the content, so that the material is relatively easy to assimilate. The book assumes relatively little prior background, other than a very basic understanding of probability and statistical concepts. Therefore, in spite of its deep coverage, it can also provide a good introduction to the beginner. The book includes exercises as well, so that it can be used as a teaching aid.

Table of Contents and Introduction. Includes exercises and a 500+ reference bibliography.

Definitely a volume for the short reading list.

Caveat: As an outlier by any measure, my opinions here may be biased. 😉

Outlier detection in two review articles (Part 2) (TM use case on Identifiers)

Saturday, May 26th, 2012

Outlier detection in two review articles (Part 2) by Sandro Saitta.

From the post:

Here we go with the second review article about outlier detection (this post is the continuation of Part I).

A Survey of Outlier Detection Methodologies

This paper, from Hodge and Austin, is also an excellent review of the field. Authors give a list of keywords in the field: outlier detection, novelty detection, anomaly detection, noise detection, deviation detection and exception mining. For the authors, “An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Grubbs, 1969)”. Before listing several applications in the field, authors mention that an outlier can be “surprising veridical data“. It may only be situated in the wrong class.

An interesting list of possible reasons for outliers is given: human error, instrument error, natural deviations in population, fraudulent behavior, changes in behavior of system and faults in system. Like in the first article, Hodge and Austin define three types of approaches to outlier detection (unsupervised, supervised and semi-supervised). In the last one, they mention that some algorithms can allow a confidence in the fact that the observation is an outlier. Main drawback of the supervised approach is its inability to discover new types of outliers.

While you are examining the techniques, do note the alternative ways to identify the problem.

Can you say topic map? 😉

Simple query expansion, assuming that any single term returns hundreds of papers, isn’t all that helpful. Instead of several hundred papers you get several thousand. Gee, thanks.

But that isn’t an indictment of alternative identifications of subjects, that is a problem of granularity.

Returning documents forces users to wade through large amounts of potentially irrelevant content.

The question is how to retain alternative identifications of subjects while returning a manageable (or configurable) amount of content?
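One possible answer, sketched under my own assumptions (the `search` function and alias table below are hypothetical stand-ins for a topic-map lookup): keep all the alternative identifications, but cap and deduplicate results per identifier so the total returned stays manageable and configurable.

```python
def expand_and_cap(terms, aliases, search, per_alias_limit=5):
    """Query every alternative identification of each subject,
    but cap results per alias so expansion stays manageable."""
    seen, ranked = set(), []
    for term in terms:
        for alias in [term] + aliases.get(term, []):
            for doc in search(alias)[:per_alias_limit]:
                if doc not in seen:  # dedupe across aliases
                    seen.add(doc)
                    ranked.append(doc)
    return ranked

# Toy corpus: each identifier maps to a (possibly long) list of documents.
corpus = {
    "outlier": ["d1", "d2", "d3", "d4", "d5", "d6", "d7"],
    "anomaly": ["d3", "d8", "d9"],
}
docs = expand_and_cap(["outlier"], {"outlier": ["anomaly"]},
                      lambda t: corpus.get(t, []), per_alias_limit=3)
print(docs)  # ['d1', 'd2', 'd3', 'd8', 'd9']
```

The per-alias cap is the granularity knob: alternative identifications still contribute results, but none of them can flood the user.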

Suggestions?

Outlier detection in two review articles (Part 1)

Saturday, May 12th, 2012

Outlier detection in two review articles (Part 1) by Sandro Saitta.

Sandro writes:

The first one, Outlier Detection: A Survey, is written by Chandola, Banerjee and Kumar. They define outlier detection as the problem of “[…] finding patterns in data that do not conform to expected normal behavior“. After an introduction to what outliers are, authors present current challenges in this field. In my experience, non-availability of labeled data is a major one.

One of their main conclusions is that “[…] outlier detection is not a well-formulated problem“. It is your job, as a data miner, to formulate it correctly.

The final quote seems particularly well suited to subject identity issues. While any one subject identity may be well defined, the question is how to find and manage other subject identifications that may not be well defined.

As Sandro points out, it has nineteen (19) pages of references. However, only nine of those are as recent as 2007. All the rest are older references. I am sure it remains an excellent reference source but suspect more recent review articles on outlier detection exist.

Suggestions?