Big Data and Text by Bill Inmon.
From the post:
Let’s take a look at big data. Corporations have discovered that there is a lot more data out there than they had ever imagined. There are log tapes, emails and tweets. There are registration records, phone records and TV log records. There are images and medical images. In short, there is an amazing amount of data.
Back in the good old days, there was just plain old transaction data. Bank teller machines. Airline reservation data. Point of sale records. We didn’t know how good we had it in those days. Why back in the good old days, a designer could create a data model and expect the data to fit reasonably well into the data model. Or the designer could define a record type to the database management system. The system would capture and store huge numbers of records that had the same structure. The only thing that was different was the content of the records.
Ah, the good old days – where there was at least a semblance of order when it came to managing and understanding data.
Take a look at the world now. There just is no structure to some of the big data types. Or if there is an order, it is well hidden. Really messing things up is the fact that much of big data is in the form of text. And text defies structure. Trying to put text into a standard database management system is like trying to put a really square peg into a really round hole.
While reading this post (only part of which appears here), it occurred to me that “unstructured data” is being used to mean data that lacks the outward appearance of semantics. That is, you can show any database table to a variety of users and all of them will claim to understand its meanings, both explicit and implicit. At least until they are asked to merge databases as part of reorganizing a business operation. Then out come old notebooks, emails, guesses and questions for older staff.
True, having outward structure can help, but the divide really isn’t between structured and unstructured data, mostly because both normally lack any explicit semantics.
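To make the merge problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `OrderRow` record, the two departments, and the meanings assigned to `amount` and `date` are assumptions, not anything taken from Inmon's post. It shows two tables with identical structure whose meanings differ, and what changes once those meanings are written down.

```python
from dataclasses import dataclass


@dataclass
class OrderRow:
    """Hypothetical record type shared by two departments."""
    customer: str
    amount: int
    date: str


# Same structure, same values, different (unwritten) meanings.
sales_rows = [OrderRow("ACME", 1250, "2012-03-01")]     # assumed: cents, order date
shipping_rows = [OrderRow("ACME", 1250, "2012-03-01")]  # assumed: dollars, ship date

# The semantics made explicit. In practice this is the part that lives
# in old notebooks and in the memories of older staff.
SEMANTICS = {
    "sales": {"amount_unit": "cents", "date_meaning": "ordered"},
    "shipping": {"amount_unit": "dollars", "date_meaning": "shipped"},
}


def naive_total(rows_a, rows_b):
    """Merge on structure alone: add the amounts as if they meant the same thing."""
    return sum(row.amount for row in rows_a + rows_b)


def amount_in_cents(row, source):
    """Convert an amount to cents using the recorded unit for its source table."""
    unit = SEMANTICS[source]["amount_unit"]
    return row.amount if unit == "cents" else row.amount * 100


def explicit_total(rows_by_source):
    """Merge using the explicit semantics; the result is in cents."""
    return sum(
        amount_in_cents(row, source)
        for source, rows in rows_by_source.items()
        for row in rows
    )


naive = naive_total(sales_rows, shipping_rows)
explicit = explicit_total({"sales": sales_rows, "shipping": shipping_rows})
```

The structure-only merge runs without complaint and produces a number. Nothing in the tables themselves signals that the number is meaningless, which is the sense in which the structured data here has no more explicit semantics than text does.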