Big Data is More Than Hadoop by David Menninger.
From the post:
We recently published the results of our benchmark research on Big Data to complement the previously published benchmark research on Hadoop and Information Management. Ventana Research undertook this research to acquire real-world information about levels of maturity, trends and best practices in organizations’ use of large-scale data management systems now commonly called Big Data. The results are illuminating.
Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data. Our research offers insight into each of these three categories. Regarding volume, over half the participating organizations process more than 10 terabytes of data, and 10% process more than 1 petabyte of data. In terms of velocity, 30% are producing more than 100 gigabytes of data per day. In terms of the variety of data, the most common types of big data are structured, containing information about customers and transactions.
However, one-third (31%) of participants are working with large amounts of unstructured data. Of the three V’s, nine out of 10 participants rate scalability and performance as the most important evaluation criteria, suggesting that volume and velocity of big data are more important concerns than variety.
This research shows that big data is not a single thing with one uniform set of requirements. Hadoop, a well-publicized technology for dealing with big data, gets a lot of attention (including from me), but there are other technologies being used to store and analyze big data.
Interesting work, but especially for what it shows the surveyed enterprises are missing about Big Data.
When I read “Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data,” I thought “variety” meant the varying semantics of the data, as is natural when collecting data from a variety of sources.
Nope. Completely off-base. “Variety” in the three V’s, at least for Ventana Research, means:
The data being analyzed consists of a variety of data types. Rapidly increasing unstructured data and social media receive much of the attention in the big-data market, and the research shows these types of data are common among Hadoop users.
While the Ventana work is useful, at least for the variety leg of the Big Data stool, you will be better off with Edd Dumbill’s What is Big Data?, where he says of variety:
A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don’t want to be guessing.
While data type variety is an issue, it isn’t one that is difficult to solve. Semantic variety, on the other hand, is an issue that keeps on giving.
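To make the London example concrete, here is a minimal sketch of why names alone are not enough. It is an illustration only, not anything from the Ventana or Dumbill posts: the identifiers (geo:london-england, crm:city-17), the Subject class and the merge rule (two records describe the same subject when they share an identifier) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """One real-world subject with every identifier and name known for it."""
    identifiers: set
    names: set = field(default_factory=set)


def merge_subjects(records):
    """Merge records that share at least one identifier.

    Hypothetical rule for this sketch: names are never used for matching,
    because different subjects can carry the same name.
    """
    subjects = []
    for identifiers, name in records:
        ids = set(identifiers)
        # Find every existing subject that overlaps with this record.
        matches = [s for s in subjects if s.identifiers & ids]
        merged = Subject(identifiers=set(ids), names={name})
        for s in matches:
            merged.identifiers |= s.identifiers
            merged.names |= s.names
            subjects.remove(s)
        subjects.append(merged)
    return subjects


# Made-up records from three imaginary sources: (identifiers, name).
records = [
    ({"geo:london-england"}, "London"),
    ({"geo:london-texas"}, "London"),
    ({"geo:london-england", "crm:city-17"}, "Londres"),
]

subjects = merge_subjects(records)
for s in subjects:
    print(sorted(s.identifiers), sorted(s.names))
```

The two records named “London” stay apart because they share no identifier, while “Londres” merges with London, England because it does. Matching on the name string would get both cases wrong.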
I think the promotional question for topic maps with regard to Big Data is: Do you still like the answer you got yesterday?
Topic maps can keep not only the question you asked yesterday and its answer, but also the new question you want to ask today (and its answer). (Try that with fixed schemas.)