Top Three Technologies to Tame the Big Data Beast by Steve Hamby.
I would re-order some of Steve’s remarks. For example, why not put the Semantic Web paragraphs first:
The first technology needed to tame Big Data, derived from the “memex” concept, is semantic technology, which loosely implements the concept of associative indexing. Dr. Bush is generally considered the godfather of hypertext, based on the associative indexing concept in his 1945 article. The Semantic Web, paraphrased from a definition by the World Wide Web Consortium (W3C), extends hyperlinked Web pages by adding machine-readable metadata about each page, including relationships across pages, thus allowing machine agents to process the hyperlinks automatically. The W3C provides a series of standards to implement the Semantic Web, such as the Web Ontology Language (OWL), the Resource Description Framework (RDF), the Rule Interchange Format (RIF), and several others.
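Just so we are clear about what “machine-readable metadata about the Web page” looks like in practice, here is a minimal sketch in Python using the rdflib library. The ex: namespace, the page names, and the cites property are my inventions for illustration, not anything from Steve’s post or a W3C vocabulary:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Hypothetical namespace for this sketch; not a published vocabulary.
EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)

# Describe two Web pages and a machine-readable relationship between them.
g.add((EX.page1, RDF.type, EX.WebPage))
g.add((EX.page2, RDF.type, EX.WebPage))
g.add((EX.page1, EX.cites, EX.page2))  # the link itself becomes typed data
g.add((EX.page1, RDFS.label, Literal("A page about the memex")))

# Serialize as Turtle, one of the W3C's RDF syntaxes.
print(g.serialize(format="turtle"))
```

The point is only that the hyperlink has been promoted to a typed statement a machine agent can process; whether that amounts to indexing is taken up below.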
The May 2001 Scientific American article “The Semantic Web” by Tim Berners-Lee, Jim Hendler, and Ora Lassila described the Semantic Web as agents that query ontologies representing human knowledge to find information requested by a human. OWL ontologies are based on Description Logics, which are both expressive and decidable, and which provide a foundation for developing precise models of various domains of knowledge. These ontologies provide the “memory index” that enables searches across vast amounts of data to return relevant, actionable information, while also addressing key data trust challenges. The ability to deliver semantics to a mobile device, as the recently released iPhone 4S does with Siri, is an excellent step toward taming the Big Data beast, since users can get the data they need when and where they need it. Big Data continues to grow, but semantic technologies provide the needed checkpoints to properly index vital information in ways that imitate how humans think, as Dr. Bush aptly noted.
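And here is a minimal sketch of the “agents that query ontologies” idea, again with rdflib: a two-triple “ontology” (Sedan is a subclass of Car) plus a SPARQL 1.1 property-path query that finds everything that is a Car through the subclass chain. The class and instance names are invented for illustration:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# A tiny "ontology": Sedan is a subclass of Car.
g.add((EX.Sedan, RDFS.subClassOf, EX.Car))
# A fact about the world: myCar is a Sedan.
g.add((EX.myCar, RDF.type, EX.Sedan))

# Find everything that is a Car, directly or via any chain of subclasses.
query = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing WHERE {
  ?thing rdf:type/rdfs:subClassOf* <http://example.org/Car> .
}
"""
for row in g.query(query):
    print(row.thing)  # -> http://example.org/myCar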
Follow that with the recitation of data-growth figures and the comments about Vannevar Bush:
Dr. Vannevar Bush’s famous essay “As We May Think,” published in the July 1945 issue of The Atlantic Monthly, was one of the first articles to address Big Data, information overload, or, as the essay puts it, the “growing mountain of research.” The 2010 IOUG Database Growth Survey, conducted in July and August of 2010, estimates that more than a zettabyte (a trillion gigabytes) of data exists in databases, and that 16 percent of organizations surveyed reported data growth in excess of 50 percent annually. A Gartner survey, also conducted in July-August 2010, reported that 47 percent of IT staffers surveyed ranked data growth among the top three challenges faced by their IT organization. Based on two recent IBM articles derived from its CIO Survey, one in three CIOs makes decisions based on untrusted data; one in two feels they do not have the data needed to make an informed decision; and 83 percent cite better analytics as a top concern. A recent survey conducted for MarkLogic reports that 35 percent of respondents believe their unstructured data sources will surpass their structured data sources in size within the next 36 months, while 86 percent of respondents say that unstructured data is important to their organization. The survey further reports that only 11 percent of those who consider unstructured data important have an infrastructure that addresses it.
Dr. Bush conceptualized a “private library,” coined “memex” (mem[ory ind]ex), in his essay, which could ingest the “mountain of research” and use associative indexing (how we think) to correlate trusted data in support of human decision making. Although Dr. Bush conceptualized the “memex” as a desk-based device complete with levers, buttons, and microfilm-based storage, he recognized that future mechanisms and gadgetry would enhance the basic concepts. The core capabilities of the “memex” were needed to allow man to “encompass the great record and to grow in the wisdom of race experience.”
That would allow exploration of questions and comments like:
1) With a zettabyte of data and more arriving every day, precisely how are we going to create and impose OWL ontologies to develop “…precise models about various domains of knowledge”?
2) I am curious on what grounds hyperlinking is considered the equivalent of associative indexing. Hyperlinks can be used by indexes, but hyperlinking isn’t indexing. Wasn’t then, isn’t now.
3) The act of indexing is collecting references to a list of subjects. Imposing RDF/OWL may be a preparatory step toward indexing, but it is not indexing in and of itself. (A minimal sketch of what indexing actually is follows this list.)
4) Description Logics are decidable, but why does Steve think human knowledge can be expressed in decidable fashion? There is a vast amount of human knowledge in religion, philosophy, politics, ethics, economics, etc., that cannot be expressed in decidable fashion. Parking regulations can be expressed in decidable fashion, I think, but I don’t know if they are worth the trouble of RDF/OWL. (The second sketch after this list shows what a decidable rule looks like.)
5) For that matter, where does Steve get the idea that human knowledge is precise? I suppose you could have made that argument in the 1890s, when, except for some odd cases, classical physics seemed sufficient. At least until 1905. (Hint: Think of Albert Einstein.) Human knowledge is always provisional, uncertain, and subject to revision. Researchers at CERN, for example, have reported apparently observing neutrinos traveling faster than the speed of light. More revisions of physics are on the way.
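On point 3, indexing really is that plain. A minimal sketch in Python of collecting references to a list of subjects, with invented documents and terms:

```python
from collections import defaultdict

def build_index(documents):
    """Collect, for each subject term, references to the documents that mention it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index

# Hypothetical documents for the sketch.
docs = {
    "bush-1945": "memex associative indexing microfilm",
    "w3c-2001": "semantic web rdf owl ontology",
}

index = build_index(docs)
print(index["indexing"])  # -> ['bush-1945']
```

Nothing here requires RDF or OWL. They may help carry or exchange the references, but the collecting is the indexing.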
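On point 4, “decidable” just means that every question put to the system terminates with a definite yes or no. A parking regulation can be written that way; here is a sketch with an invented rule (no parking 8 a.m. to 6 p.m., Monday through Friday). Try writing ethics or economics as such a function:

```python
from datetime import datetime

def parking_allowed(when: datetime) -> bool:
    """Invented regulation: no parking 8:00-18:00, Monday through Friday.
    Every input yields a definite True or False; that is decidability."""
    weekday = when.weekday() < 5        # Monday=0 ... Friday=4
    restricted = 8 <= when.hour < 18
    return not (weekday and restricted)

print(parking_allowed(datetime(2011, 10, 18, 9, 30)))  # Tuesday morning -> False
print(parking_allowed(datetime(2011, 10, 22, 9, 30)))  # Saturday -> True
```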
Part of what we need to tame the big data “beast” is acceptance that we need information systems that are like ourselves.
That is to say, information systems that are tolerant of imprecision, perhaps even of inconsistency, and that don’t offer a false sense of decidability and omniscience. Then at least we can talk about and recognize the parts of big data that remain to be tackled.