Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2015

Darpa Is Developing a Search Engine for the Dark Web

Filed under: Dark Data,DARPA,Search Engines — Patrick Durusau @ 7:19 pm

Darpa Is Developing a Search Engine for the Dark Web by Kim Zetter.

From the post:

A new search engine being developed by Darpa aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity.

The project, dubbed Memex, has been in the works for a year and is being developed by 17 different contractor teams who are working with the military’s Defense Advanced Research Projects Agency. Google and Bing, with search results influenced by popularity and ranking, are only able to capture approximately five percent of the internet. The goal of Memex is to build a better map of more internet content.

Reading how Darpa is building yet another bigger dirt pile, I was reminded of Rick Searle saying:


Rather than think of Big Data as somehow providing us with a picture of reality, “naturally emerging” as Mayer-Schönberger quoted above suggested we should start to view it as a way to easily and cheaply give us a metric for the potential validity of a hypothesis. And it’s not only the first step that continues to be guided by old fashioned science rather than computer driven numerology but the remaining steps as well, a positive signal followed up by actual scientist and other researchers doing such now rusting skills as actual experiments and building theories to explain their results. Big Data, if done right, won’t end up making science a form of information promising, but will instead be used as the primary tool for keeping scientist from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old school human intelligence even if it now needs to be empowered by new digital tools. Rather than Big Data being used to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted and based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence and towards more targeted surveillance rather than the mass data grab symbolized by Bluffdale may be a reality forced on the NSA et al by events. In part due to the Snowden revelations terrorist and criminal networks have already abandoned the non-secure public networks which the rest of us use. Mass surveillance has lost its raison d’etre.

In particular because the project is designed to automatically discover “relationships:”


But the creators of Memex don’t want just to index content on previously undiscovered sites. They also want to use automated methods to analyze that content in order to uncover hidden relationships that would be useful to law enforcement, the military, and even the private sector. The Memex project currently has eight partners involved in testing and deploying prototypes. White won’t say who the partners are but they plan to test the system around various subject areas or domains. The first domain they targeted were sites that appear to be involved in human trafficking. But the same technique could be applied to tracking Ebola outbreaks or “any domain where there is a flood of online content, where you’re not going to get it if you do queries one at a time and one link at a time,” he says.

I for one am very sure the new system (I refuse to sully the historical name by associating it with this doomed DARPA project) will find relationships, many relationships in fact. Too many relationships for any organization, no matter how large, to sort the relevant from the irrelevant.
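To put a rough number on "too many," here is a minimal back-of-the-envelope sketch. It assumes every pair of indexed entities is a candidate relationship and that some small fraction of pairs look linked by coincidence alone. The function names and the one-in-a-million chance rate are my own illustrative assumptions, not anything published about the DARPA project.

```python
from math import comb


def candidate_pairs(n_entities: int) -> int:
    """Number of distinct unordered pairs among n_entities."""
    return comb(n_entities, 2)


def expected_spurious_links(n_entities: int, chance_rate: float) -> float:
    """Expected number of pairs that appear linked purely by chance.

    chance_rate is a hypothetical probability that any given pair
    shares some incidental feature (same IP block, same phrase, etc.).
    """
    return candidate_pairs(n_entities) * chance_rate


# Candidate pairs grow quadratically with the number of entities indexed.
for n in (1_000, 100_000, 10_000_000):
    pairs = candidate_pairs(n)
    spurious = expected_spurious_links(n, chance_rate=1e-6)
    print(f"{n:>12,} entities  {pairs:>22,} pairs  ~{spurious:,.1f} chance links")
```

Even with a tiny chance rate, the count of coincidental links scales with the square of the collection size, while the number of analysts available to review them does not.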

If you want to segment the task, you could say that data mining is charged with finding relationships.

However, the next step in data analysis is to evaluate the evidence for the relationships found in the preceding step.

The step after evaluating the evidence for relationships discovered is to determine what, if anything, those relationships mean to some question at hand.

In all but the simplest of cases, there will be even more steps than the ones I listed. All of which must occur before you have extracted reliable intelligence from the data mining exercise.

Having data to search is a first step. Searching for and finding relationships in data is another step. But if that is where the search trail ends, you have just wasted another $10 to $20 million that could have gone for worthwhile intelligence gathering.

2 Comments

  1. It is a waste of money for the publicly stated purpose. But, if the military wants to be able to show evidence of wrongdoing for any selected target, the bigger the data the easier it is. As an example of our near-term future, look at the North Korea/Sony political theater. We don’t like North Korea and we can find a North Korea IP address in the server log for one of the many servers involved in the Sony hack so they must be guilty. Nevermind that the logs from those servers probably contained IP addresses from most of the countries on Earth. They were able to find the evidence that fit their storyline and, the bigger the data, the more likely that will be.

    Comment by clemp — February 12, 2015 @ 8:12 pm

  2. Important point! Thanks, I had missed the “ability to deceive” aspect of large data collections.

    I wonder if there is a model for the size of data versus the range of deceptions possible? A Data to Lies or D/L ratio?

    Here’s a thought experiment that may turn into a project: Create a data set from the FEC data on campaign contributions, reasoning that anyone who ever shared a zip code knows everyone else from within that zip code donating to the same candidate. Then use criminal conviction databases (sex offenders?) to create contact graphs with the individuals in the first data set. Suggested title: Social Networks of Sex Offenders and Elected Officials. Should be able to generate lots of spurious connections. Not to rival the NSA since they have bigger computers but respectable.

    Comment by Patrick Durusau — February 14, 2015 @ 10:43 am
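The first step of that thought experiment can be sketched in a few lines. This is only a sketch on synthetic records: it does not touch real FEC or conviction data, and the record layout (person, zip code, candidate) and both function names are hypothetical. It applies the deliberately bad rule from the comment: anyone who shares a zip code and a candidate "knows" everyone else who does.

```python
import random
from collections import defaultdict
from itertools import combinations


def zip_code_links(records):
    """Link every pair of people sharing the same (zip code, candidate).

    records: iterable of (person, zip_code, candidate) tuples.
    Returns a set of (person_a, person_b) pairs in sorted order.
    """
    groups = defaultdict(set)
    for person, zip_code, candidate in records:
        groups[(zip_code, candidate)].add(person)

    links = set()
    for members in groups.values():
        # Every pair inside a group becomes a "relationship".
        for a, b in combinations(sorted(members), 2):
            links.add((a, b))
    return links


def synthetic_donations(n_people, n_zips, n_candidates, seed=0):
    """Generate made-up donation records for experimentation only."""
    rng = random.Random(seed)
    return [
        (f"person{i}", rng.randrange(n_zips), rng.randrange(n_candidates))
        for i in range(n_people)
    ]


records = synthetic_donations(n_people=2000, n_zips=50, n_candidates=5)
print(len(zip_code_links(records)), "links 'discovered' among 2000 random donors")
```

The records are random, so every link reported is spurious by construction. Joining the resulting graph against a second data set would multiply those connections rather than validate them.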
