Kirk Borne tweeted a link to his presentation, Big Data Science for Astronomy & Space, and more specifically to slides 24 and 25 on novelty detection / surprise discovery.
Casting about for more resources to point out, I found Novelty Detection in Learning Systems by Stephen Marsland.
The abstract for Stephen’s paper:
Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen. It is a useful technique in cases where an important class of data is under-represented in the training set. This means that the performance of the network will be poor for those classes. In some circumstances, such as medical data and fault detection, it is often precisely the class that is under-represented in the data, the disease or potential fault, that the network should detect. In novelty detection systems the network is trained only on the negative examples where that class is not present, and then detects inputs that do not fit into the model that it has acquired, that is, members of the novel class.
This paper reviews the literature on novelty detection in neural networks and other machine learning techniques, as well as providing brief overviews of the related topics of statistical outlier detection and novelty detection in biological organisms.
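To make the abstract's setup concrete, here is a minimal sketch in Python using scikit-learn's OneClassSVM. The library, parameters, and synthetic data are my choices for illustration, not anything from Marsland's paper: the model is fit only on "normal" examples, then asked whether new inputs fit what it has learned.

```python
# A minimal sketch of the novelty-detection setup from the abstract:
# train only on negative ("usually seen") examples, then flag inputs
# that do not fit the learned model. Data is synthetic and illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Training set: negative examples only -- the "usually seen" inputs.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(normal)

# New inputs: one drawn from the same distribution, one far outside it.
usual = np.array([[0.2, -0.3]])
novel = np.array([[6.0, 6.0]])

print(model.predict(usual))  # [1]  -> fits the model of "normal"
print(model.predict(novel))  # [-1] -> flagged as novel
```

Notice that the model never sees a positive example. Everything rests on how well the training data represents the "usual" inputs, which is exactly where the argument below comes in.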
The rest of the paper is very good and worth your time to read, but we need not venture beyond the abstract to demonstrate why big data cannot, by definition, detect terrorists.
The root of the terrorist detection problem is summarized in the first sentence:
Novelty detection is concerned with recognising inputs that differ in some way from those that are usually seen.
So, what are the inputs of a terrorist that differ from the inputs usually seen?
That’s a simple enough question.
Having previously committed a suicide attack is a definite tell, but it isn't a useful one.
Obviously the TSA doesn't know, because it has never caught a terrorist, despite its profiling and wannabe psychics watching travelers.
You can churn big data 24×7, but if you don't have a baseline of expected inputs, no input is going to stand out from the others.
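Here is what that looks like in practice, sketched with scikit-learn's IsolationForest on invented data (the detector, population size, and threshold are all my assumptions): an unsupervised detector will dutifully flag its configured fraction of any population as "anomalous," whether or not a meaningful baseline exists.

```python
# A sketch of the baseline problem: with contamination=0.01, the
# detector flags ~1% of ANY population as anomalous. The flagged set
# tells you nothing unless you already know what a true positive
# looks like. Data is synthetic and purely illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# 100,000 "travelers," all drawn from the same distribution:
# by construction, none of them is a terrorist.
travelers = rng.normal(size=(100_000, 5))

detector = IsolationForest(contamination=0.01, random_state=1)
flags = detector.fit_predict(travelers)  # -1 = flagged as anomalous

print((flags == -1).sum())  # ~1,000 flagged -- all false positives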
The San Bernardino attackers were not detected because their inputs didn't vary enough for the couple to stand out.
Even if they had been selected for close and unconstitutional monitoring of their internet traffic, bank accounts, social media, phone calls, etc., there is no evidence that current data techniques would have detected them.
Before you invest in or continue paying for big data to detect terrorists, ask these simple questions:
What is your baseline from which variance will signal a terrorist?
How often has it worked?
Once you have a dead terrorist, you can start from the dead terrorist and search your big data, but that’s an entirely different starting point.
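That different starting point is, in machine learning terms, similarity search rather than novelty detection: you retrieve the records most like a known example instead of waiting for something unusual to stand out. A minimal sketch, again with invented data and scikit-learn as my assumed tooling:

```python
# A sketch of the retrospective starting point: given one known record,
# pull the most similar records out of the data already collected.
# Data is synthetic and illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
records = rng.normal(size=(100_000, 8))  # the "big data" already on hand
known = records[42:43]                   # the record of a known attacker

index = NearestNeighbors(n_neighbors=5).fit(records)
dist, idx = index.kneighbors(known)

# The first hit is the known record itself; the rest are the
# most similar records in the collection.
print(idx[0])
```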
Given the weeks, months, and years of finger-pointing following a terrorist attack, speed really isn't an issue.