RAD – Outlier Detection on Big Data by Jeffrey Wong, Chris Colburn, Elijah Meeks, and Shankar Vedaraman.
From the post:
Outlier detection can be a pain point for all data driven companies, especially as data volumes grow. At Netflix we have multiple datasets growing by 10B+ record/day and so there’s a need for automated anomaly detection tools ensuring data quality and identifying suspicious anomalies. Today we are open-sourcing our outlier detection function, called Robust Anomaly Detection (RAD), as part of our Surus project.
As we built RAD we identified four generic challenges that are ubiquitous in outlier detection on “big data.”
- High cardinality dimensions: High cardinality data sets – especially those with large combinatorial permutations of column groupings – makes human inspection impractical.
- Minimizing False Positives: A successful anomaly detection tool must minimize false positives. In our experience there are many alerting platforms that “sound an alarm” that goes ultimately unresolved. The goal is to create alerting mechanisms that can be tuned to appropriately balance noise and information.
- Seasonality: Hourly/Weekly/Bi-weekly/Monthly seasonal effects are common and can be mis-identified as outliers deserving attention if not handled properly. Seasonal variability needs to be ignored.
- Data is not always normally distributed: This has been a particular challenge since Netflix has been growing over the last 24 months. Generally though, an outlier tool must be robust so that it works on data that is not normally distributed.
In addition to addressing the challenges above, we wanted a solution with a generic interface (supporting application development). We met these objectives with a novel algorithm encased in a wrapper for easy deployment in our ETL environment.
…
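To make the seasonality and non-normality challenges concrete, here is a minimal sketch in Python. It is not the RAD algorithm, which the excerpt describes only as "novel"; it swaps in a much simpler technique, a modified z-score based on the median and the median absolute deviation (MAD), computed separately for each position in the seasonal cycle. The function name `seasonal_mad_outliers`, the `period` parameter, and the 3.5 cutoff are assumptions chosen for illustration.

```python
from statistics import median


def seasonal_mad_outliers(values, period, threshold=3.5):
    """Return sorted indices of points far from their seasonal peers.

    Illustrative only: each point is compared with the other points at the
    same phase of the cycle (e.g. the same weekday when period=7), using
    median and MAD, which a few extreme values cannot easily distort.
    """
    flagged = []
    for phase in range(period):
        idx = list(range(phase, len(values), period))
        group = [values[i] for i in idx]
        if len(group) < 3:
            continue  # too few peers to estimate spread
        med = median(group)
        mad = median(abs(v - med) for v in group)
        if mad == 0:
            continue  # no spread at this phase; skip to avoid dividing by zero
        for i, v in zip(idx, group):
            # 0.6745 rescales MAD so the score is comparable to a z-score
            # when the data happen to be normally distributed.
            score = 0.6745 * (v - med) / mad
            if abs(score) > threshold:
                flagged.append(i)
    return sorted(flagged)


# Six weeks of made-up daily counts with a regular weekend peak.
base = [10, 12, 11, 13, 12, 30, 32]
values = [base[d] + (w % 3) for w in range(6) for d in range(7)]
values[17] = 60  # inject one suspicious reading on a weekday
print(seasonal_mad_outliers(values, period=7))
```

Because each reading is judged only against readings from the same weekday, the regular weekend peaks are not flagged, while the injected value is. Raising `threshold` trades missed anomalies for fewer false alarms, which is the tuning knob the excerpt asks for.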
Looking for “suspicious anomalies” is always popular, in part because it implies someone has deliberately departed from “normal” behavior.
That is certainly important, but as the FBI-staged terror plots we discussed earlier today show, the normal FBI “mo” is to stage terror plots, so the anomaly would be a real terror plot, one not staged by the FBI.
The lesson: don’t assume outliers are departures from a desired norm. They can be, but they are not always.