Archive for the ‘Dark Data’ Category

Dark Matter: Driven by Data

Monday, May 9th, 2016

A delightful keynote by Dan Geer, presented at the 2015 LangSec Workshop at the IEEE Symposium on Security & Privacy Workshops, May 21, 2015, San Jose, CA.

Prepared text for the presentation.

A quote to interest you in watching the video:

Workshop organizer Meredith Patterson gave me a quotation from Taylor Hornby that I hadn’t seen. In it, Hornby succinctly states the kind of confusion we are in and which LANGSEC is all about:

The illusion that your program is manipulating its data is powerful. But it is an illusion: The data is controlling your program.

It almost appears that we are building weird machines on purpose, almost the weirder the better. Take big data and deep learning. Where data science spreads, a massive increase in tailorability to conditions follows. But even if Moore’s Law remains forever valid, there will never be enough computing hence data driven algorithms must favor efficiency above all else, yet the more efficient the algorithm, the less interrogatable it is,[MO] that is to say that the more optimized the algorithm is, the harder it is to know what the algorithm is really doing.[SFI]

And there is a feedback loop here: The more desirable some particular automation is judged to be, the more data it is given. The more data it is given, the more its data utilization efficiency matters. The more its data utilization efficiency matters, the more its algorithms will evolve to opaque operation. Above some threshold of dependence on such an algorithm in practice, there can be no going back. As such, if science wishes to be useful, preserving algorithm interrogatability despite efficiency-seeking, self-driven evolution is the research grade problem now on the table. If science does not pick this up, then Lessig’s characterization of code as law[LL] is fulfilled. But if code is law, what is a weird machine?

If you can’t interrogate an algorithm, could you interrogate a topic map that is an “inefficient” implementation of the algorithm?

Or put differently, could there be two representations of the same algorithm, one that is “efficient,” and one that can be “interrogated?”
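A toy sketch of that idea, and nothing more: the leap-year rule written twice, once as a compact expression and once as an ordered rule list that reports which rule decided the answer. The names (is_leap_fast, RULES, is_leap_explained) are my own for illustration. Nothing here is drawn from Geer's talk or from any topic map software, and a learned model is a far harder case than a calendar rule.

```python
def is_leap_fast(year: int) -> bool:
    # Compact form: correct, but it does not say which rule decided the answer.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Interrogatable form: ordered (description, predicate, result) rules.
# The first rule whose predicate matches wins.
RULES = [
    ("divisible by 400", lambda y: y % 400 == 0, True),
    ("divisible by 100", lambda y: y % 100 == 0, False),
    ("divisible by 4", lambda y: y % 4 == 0, True),
    ("no rule matched", lambda y: True, False),
]


def is_leap_explained(year: int):
    """Return (answer, description of the rule that produced it)."""
    for description, predicate, result in RULES:
        if predicate(year):
            return result, description
    raise AssertionError("unreachable: the last rule always matches")


# The two representations should agree wherever they are compared.
assert all(is_leap_fast(y) == is_leap_explained(y)[0] for y in range(1, 3000))

print(is_leap_explained(1900))  # (False, 'divisible by 100')
```

The second form is slower and longer, but you can ask it why it answered as it did, and you can test it against the first.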

Read the paper version but be aware the video has a very rich Q&A session that follows the presentation.

4,000 Deep Web Links [but no AshleyMadison]

Wednesday, August 19th, 2015

4,000 Deep Web Links by Nikoloz Kokhreidze.

This listing was generated in early June, 2015, so it doesn’t include the most recent AshleyMadison data dump.

Apparently an authentic (according to some commentators) data dump from AshleyMadison was posted yesterday but I haven’t been able to find a report with an address for the dump.

One commentator, with a major technical site, sniffed that the dump was news but:

I’m not really interested in actively outing anyone’s private information

What a twit!

Why should I take even the New York Times’ word for the contents of the dump when the data should be searchable by anyone?

It may be that N email addresses end in .mil, but how many end in nytimes.com? Unlikely to see that reported in the New York Times. Yes?

Summarizing data and attempting to persuade me your summary is both useful and accurate is great. Saves me the time and trouble of wrangling the data.

However, the raw data that you have summarized should be available for verification by others. That applies to data held by Wikileaks, the New York Times or any government.
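Checking a published summary against raw data need not be elaborate. A minimal sketch, assuming the raw data is simply a list of address strings; the sample addresses are invented placeholders and the function names are mine, not taken from any report on the dump.

```python
from collections import Counter


def domain_counts(addresses):
    """Count addresses per domain, ignoring case and skipping malformed entries."""
    counts = Counter()
    for address in addresses:
        local, sep, domain = address.strip().rpartition("@")
        if sep and local and domain:
            counts[domain.lower()] += 1
    return counts


def count_with_suffix(counts, suffix):
    """Total addresses whose domain equals the suffix or ends with '.' + suffix."""
    suffix = suffix.lower().lstrip(".")
    return sum(n for d, n in counts.items() if d == suffix or d.endswith("." + suffix))


# Invented sample data, standing in for a raw listing.
sample = ["a@unit.example.mil", "b@Example.MIL", "c@news.example.com", "not-an-address"]
counts = domain_counts(sample)

print(count_with_suffix(counts, ".mil"))              # 2
print(count_with_suffix(counts, "news.example.com"))  # 1
```

Anyone with the raw listing can rerun a count like this and compare it to the reported number, which is the whole point of asking for raw data.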

The history of government and corporate leaks has one overriding lesson: Deception is universal.

Suggested motto for data geeks:

In God We Trust, All Others Must Provide Raw Data.

PS: A good place to start touring the Deep/Dark Web: http://7g5bqm7htspqauum.onion/ – The Hidden Wiki (requires Tor for access).

What’s all the fuss about Dark Data?…

Wednesday, March 11th, 2015

What’s all the fuss about Dark Data? Big Data’s New Best Friend by Martyn Jones.

From the post:

Dark data, what is it and why all the fuss?

First, I’ll give you the short answer. The right dark data, just like its brother right Big Data, can be monetised – honest, guv! There’s loadsa money to be made from dark data by ‘them that want to’, and as value propositions go, seriously, what could be more attractive?

Let’s take a look at the market.

Gartner defines dark data as "the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes" (IT Glossary – Gartner)

Techopedia describes dark data as being data that is "found in log files and data archives stored within large enterprise class data storage locations. It includes all data objects and types that have yet to be analyzed for any business or competitive intelligence or aid in business decision making." (Techopedia – Cory Jannsen)

Cory also wrote that "IDC, a research firm, stated that up to 90 percent of big data is dark data."

In an interesting whitepaper from C2C Systems it was noted that "PST files and ZIP files account for nearly 90% of dark data by IDC Estimates." and that dark data is "Very simply, all those bits and pieces of data floating around in your environment that aren’t fully accounted for:" (Dark Data, Dark Email – C2C Systems)

Elsewhere, Charles Fiori defined dark data as "data whose existence is either unknown to a firm, known but inaccessible, too costly to access or inaccessible because of compliance concerns." (Shedding Light on Dark Data – Michael Shashoua)

Not quite the last insight, but in a piece published by Datameer, John Nicholson wrote that "Research firm IDC estimates that 90 percent of digital data is dark." And went on to state that "This dark data may come in the form of machine or sensor logs" (Shine Light on Dark Data – Joe Nicholson via Datameer)

Finally, Lug Bergman of NGDATA wrote this in a sponsored piece in Wired: "It" – dark data – "is different for each organization, but it is essentially data that is not being used to get a 360 degree view of a customer."

Well, I would say that 90% of 2.7 zettabytes (as of last October) of data being dark is a reason to be concerned.

But like the Wizard of Oz, Martyn knows what you are lacking, a data inventory:


You don’t need a Chief Data Officer in order to be able to catalogue all your data assets. However, it is still good idea to have a reliable inventory of all your business data, including the euphemistically termed Big Data and dark data.

If you have such an inventory, you will know:

What you have, where it is, where it came from, what it is used in, what qualitative or quantitative value it may have, and how it relates to other data (including metadata) and the business.

Really? A data inventory? Relief to know the MDM (master data management) folks have been struggling for the past two decades for no reason. All they needed was a data inventory!

You might want to recall AnHai Doan’s observation for a single enterprise mapping project:

…the manual creation of semantic mappings has long been known to be extremely laborious and error-prone. For example, a recent project at the GTE telecommunications company sought to integrate 40 databases that have a total of 27,000 elements (i.e., attributes of relational tables) [LC00]. The project planners estimated that, without the database creators, just finding and documenting the semantic mappings among the elements would take more than 12 person years.

That’s right. One enterprise, 40 databases, 12 person years.

How that works out: PersonYears x 2.7 zettabytes = ???, no one knows.
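The only rate the quoted GTE figures support is elements mapped per person year. A small sketch of that arithmetic, using only the numbers in the quotation; the helper name person_years_for is mine, and the rate is a ceiling because the estimate was "more than 12 person years."

```python
ELEMENTS = 27_000    # schema elements, from the quoted GTE project
DATABASES = 40       # databases in that project
PERSON_YEARS = 12    # quoted as "more than 12", so a lower bound on effort

# Because effort was at least 12 person years, this rate is an upper bound.
elements_per_person_year = ELEMENTS / PERSON_YEARS   # 2250.0
elements_per_database = ELEMENTS / DATABASES         # 675.0


def person_years_for(elements):
    """Lower-bound effort estimate, assuming the GTE rate carried over unchanged."""
    return elements / elements_per_person_year


print(elements_per_person_year)   # 2250.0
print(person_years_for(54_000))   # 24.0
```

Note the unit: the rate is per schema element, not per byte. The quotation offers no way to convert zettabytes into elements, which is why the question marks above stay question marks.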

Oh, why did I lose the 90% as “dark data?” Simple enough, the data AnHai was mapping wasn’t entirely “dark.” At least it had headers that were meaningful to someone. Unstructured data has no headers at all.

What is Martyn missing?

What is known about data is the measure of its darkness, not usage.

But supplying opaque terms (all terms are opaque to someone) for data only puts you into the AnHai situation. Either you enlist people who know the meanings of the terms or you create new meanings for them from scratch. Hopefully, in the latter case, you approximate the original meanings assigned to the terms.

If you want to improve on opaque terms, you need to provide alternative opaque terms that may be recognized by some future user instead of the primary opaque term you would use otherwise.

Make no mistake, it isn’t possible to escape opacity, but you can increase the odds that your data will be useful at some future point in time. How many alternative terms it takes to reach a given degree of future usefulness isn’t known.

So far as I know, the question hasn’t been researched. Every new set of opaque terms (read ontology, classification, controlled vocabulary) presents itself as possessing semantics for the ages. Given the number of such efforts, I find their confidence misplaced.
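Supplying alternative terms can be sketched as an index in which several labels lead to the same field. This is a minimal illustration, assuming simple case-insensitive string labels; the class TermIndex, the field identifiers and the labels are all hypothetical, and it makes no claim about how many alternatives are enough.

```python
class TermIndex:
    """Maps any of several alternative labels to one or more field identifiers."""

    def __init__(self):
        self._by_label = {}   # normalized label -> set of field ids
        self._labels = {}     # field id -> set of labels as supplied

    def add(self, field_id, *labels):
        for label in labels:
            key = label.casefold().strip()
            self._by_label.setdefault(key, set()).add(field_id)
            self._labels.setdefault(field_id, set()).add(label)

    def lookup(self, label):
        """Field ids known under this label (empty list if the term is unknown)."""
        return sorted(self._by_label.get(label.casefold().strip(), set()))

    def labels_for(self, field_id):
        return sorted(self._labels.get(field_id, set()))


index = TermIndex()
index.add("col_17", "cust_no", "customer number", "account id")
index.add("col_42", "account id", "ledger account")

print(index.lookup("Customer Number"))  # ['col_17']
print(index.lookup("account id"))       # ['col_17', 'col_42']
print(index.lookup("client"))           # []
```

The last two lookups show both limits: a shared label is ambiguous, and a term nobody thought to record finds nothing. More alternatives improve the odds without removing the opacity.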

In Praise of CSV

Wednesday, March 11th, 2015

In Praise of CSV by Waldo Jaquith

From the post:

Comma Separated Values is the file format that open data advocates love to hate. Compared to JSON, CSV is clumsy; compared to XML, CSV is simplistic. Its reputation is as a tired, limited, it’s-better-than-nothing format. Not only is that reputation is undeserved, but CSV should often be your first choice when publishing data.

It’s true—CSV is tired and limited, though superior to not having data, but there’s another side to those coins. One man‘s tired is another man’s established. One man’s limited is another man’s focused. And “better than nothing” is in, fact, better than nothing, which is frequently the alternative to producing CSV.

A bit further on:


The lack of typing makes schemas generally impractical, and as a result validation of field contents is also generally impractical.

There is ongoing work to improve that situation at the CSV on the Web Working Group (W3C). As of today, see: Metadata Vocabulary for Tabular Data, W3C Editor’s Draft 11 March 2015.

The W3C work is definitely a step in the right direction but even if you “know” a field heading or its data type, do you really “know” the semantics of that field? Assume you have a floating point number, is that “pound-seconds” or “newton-seconds?” Mars orbiters really need to know.

Perhaps CSV files are nearly the darkest dark data with a structure. Even with field names and data types, the semantics of any field, and its relationship to other fields, remain a mystery.

Within a week, month or year, someone may still remember the field semantics, but what of ten (10) years or even one hundred (100) years from now?
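The pound-seconds versus newton-seconds problem is easy to show. In this sketch the CSV parses cleanly and every value converts to a float, yet nothing in the file says what the number measures. The units dictionary is a hypothetical sidecar of my own invention (not the format of the W3C vocabulary), and the column names and values are made up.

```python
import csv
import io

LBF_S_TO_N_S = 4.4482216152605  # one pound-force second expressed in newton-seconds

# Invented sample file: the header names the field but not its unit.
raw = "burn_id,impulse\n1,10.0\n2,2.5\n"

# Hypothetical sidecar metadata; the CSV itself carries none of this.
units = {"impulse": "lbf*s"}


def impulses_in_newton_seconds(text, units):
    """Read the impulse column, converting to N*s only when the unit is known."""
    unit = units.get("impulse")
    if unit not in ("lbf*s", "N*s"):
        raise ValueError("unit of 'impulse' is unknown; refusing to guess")
    factor = LBF_S_TO_N_S if unit == "lbf*s" else 1.0
    return [float(row["impulse"]) * factor for row in csv.DictReader(io.StringIO(text))]


print(impulses_in_newton_seconds(raw, {"impulse": "N*s"}))  # [10.0, 2.5]
```

With the sidecar the same rows come out roughly 4.45 times larger; with no sidecar the function refuses to answer. The file is byte-for-byte identical in all three cases.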

Darpa Is Developing a Search Engine for the Dark Web

Wednesday, February 11th, 2015

Darpa Is Developing a Search Engine for the Dark Web by Kim Zetter.

From the post:

A new search engine being developed by Darpa aims to shine a light on the dark web and uncover patterns and relationships in online data to help law enforcement and others track illegal activity.

The project, dubbed Memex, has been in the works for a year and is being developed by 17 different contractor teams who are working with the military’s Defense Advanced Research Projects Agency. Google and Bing, with search results influenced by popularity and ranking, are only able to capture approximately five percent of the internet. The goal of Memex is to build a better map of more internet content.

Reading how Darpa is building yet another, bigger dirt pile, I was reminded of Rick Searle saying:


Rather than think of Big Data as somehow providing us with a picture of reality, “naturally emerging” as Mayer-Schönberger quoted above suggested we should start to view it as a way to easily and cheaply give us a metric for the potential validity of a hypothesis. And it’s not only the first step that continues to be guided by old fashioned science rather than computer driven numerology but the remaining steps as well, a positive signal followed up by actual scientist and other researchers doing such now rusting skills as actual experiments and building theories to explain their results. Big Data, if done right, won’t end up making science a form of information promising, but will instead be used as the primary tool for keeping scientist from going down a cul-de-sac.

The same principle applied to mass surveillance means a return to old school human intelligence even if it now needs to be empowered by new digital tools. Rather than Big Data being used to hoover up and analyze all potential leads, espionage and counterterrorism should become more targeted and based on efforts to understand and penetrate threat groups themselves. The move back to human intelligence and towards more targeted surveillance rather than the mass data grab symbolized by Bluffdale may be a reality forced on the NSA et al by events. In part due to the Snowden revelations terrorist and criminal networks have already abandoned the non-secure public networks which the rest of us use. Mass surveillance has lost its raison d’etre.

In particular because the project is designed to automatically discover “relationships:”


But the creators of Memex don’t want just to index content on previously undiscovered sites. They also want to use automated methods to analyze that content in order to uncover hidden relationships that would be useful to law enforcement, the military, and even the private sector. The Memex project currently has eight partners involved in testing and deploying prototypes. White won’t say who the partners are but they plan to test the system around various subject areas or domains. The first domain they targeted were sites that appear to be involved in human trafficking. But the same technique could be applied to tracking Ebola outbreaks or “any domain where there is a flood of online content, where you’re not going to get it if you do queries one at a time and one link at a time,” he says.

I for one am very sure the new system (I refuse to sully the historical name by associating it with this doomed DARPA project) will find relationships, many relationships in fact. Too many relationships for any organization, no matter how large, to sort the relevant from the irrelevant.

If you want to segment the task, you could say that data mining is charged with finding relationships.

However, the next step, data analysis, is to evaluate the evidence for the relationships found in the preceding step.

The step after evaluating the evidence for relationships discovered is to determine what, if anything, those relationships mean to some question at hand.

In all but the simplest of cases, there will be even more steps than the ones I listed. All of which must occur before you have extracted reliable intelligence from the data mining exercise.

Having data to search is a first step. Searching for and finding relationships in data is another step. But if that is where the search trail ends, you have just wasted another $10 to $20 million that could have gone for worthwhile intelligence gathering.
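The three steps can be sketched as separate functions so that no one mistakes the first for the last. This is a deliberately crude illustration, assuming "relationship" means two entities appearing in the same record and "evidence" means a simple count. The function names, the threshold and the sample records are mine and have nothing to do with how Memex works.

```python
from collections import Counter
from itertools import combinations


def find_relationships(records):
    """Step 1 (mining): count every pair of entities seen in the same record."""
    pairs = Counter()
    for entities in records:
        for pair in combinations(sorted(set(entities)), 2):
            pairs[pair] += 1
    return pairs


def evaluate_evidence(pairs, min_support):
    """Step 2 (analysis): keep only pairs seen at least min_support times."""
    return {pair: n for pair, n in pairs.items() if n >= min_support}


def relevant_to(question_entities, supported):
    """Step 3: keep supported pairs that touch the entities the question is about."""
    wanted = set(question_entities)
    return {pair: n for pair, n in supported.items() if wanted & set(pair)}


# Invented records; each is a list of entities mentioned together.
records = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "D"]]

pairs = find_relationships(records)                  # 4 candidate relationships
supported = evaluate_evidence(pairs, min_support=2)  # 2 survive the evidence check
answer = relevant_to(["C"], supported)               # 1 bears on the question

print(len(pairs), len(supported), answer)  # 4 2 {('B', 'C'): 2}
```

Even in this toy, mining produces the most output and the least value. Real evidence evaluation and relevance judgments need people, which is where the money should go.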

Searching for Dark Data

Tuesday, February 19th, 2013

Searching for Dark Data by Paul Doscher.

From the post:

We live in a highly connected world where every digital interaction spawns chain reactions of unfathomable data creation. The rapid explosion of text messaging, emails, video, digital recordings, smartphones, RFID tags and those ever-growing piles of paper – in what was supposed to be the paperless office – has created a veritable ocean of information.

Welcome to the world of Dark Data

Welcome to the world of Dark Data, the humongous mass of constantly accumulating information generated in the Information Age. Whereas Big Data refers to the vast collection of the bits and bytes that are being generated each nanosecond of each day, Dark Data is the enormous subset of unstructured, untagged information residing within it.

Research firm IDC estimates that the total amount of digital data, aka Big Data, will reach 2.7 zettabytes by the end of this year, a 48 percent increase from 2011. (One zettabyte is equal to one billion terabytes.) Approximately 90 percent of this data will be unstructured – or Dark.

Dark Data has thrown traditional business intelligence and reporting technologies for a loop. The software that countless executives have relied on to access information in the past simply cannot locate or make sense of the unstructured data that comprises the bulk of content today and tomorrow. These tools are struggling to tap the full potential of this new breed of data.

The good news is that there’s an emerging class of technologies that is ready to pick up where traditional tools left off and carry out the crucial task of extracting business value from this data.

Effective exploration of Dark Data will require something different from search tools that depend upon:

  • Pre-specified semantics (RDF) because Dark Data has no pre-specified semantics.
  • Structure because Dark Data has no structure.

Effective exploration of Dark Data will require:

Machine-assisted, interactive searching with gifted and grounded semantic comparators (people) creating pathways, tunnels and signposts into the wilderness of Dark Data.

I first saw this at: Delving into Dark Data.