Modern Data Integration Paradigms by Matthew D. Sarrel, The Bloor Group.
Introduction:
Businesses of all sizes and industries are rapidly transforming to make smarter, data-driven decisions. To accomplish this transformation to digital business, organizations are capturing, storing, and analyzing massive amounts of structured, semi-structured, and unstructured data from a large variety of sources. The rapid explosion in data types and data volume has left many IT and data science/business analyst leaders reeling.
Digital transformation requires a radical shift in how a business marries technology and processes. This isn’t merely improving existing processes, but rather redesigning them from the ground up and tightly integrating technology. The end result can be a powerful combination of greater efficiency, insight and scale that may even lead to disrupting existing markets. The shift towards reliance on data-driven decisions requires coupling digital information with powerful analytics and business intelligence tools in order to yield well-informed reasoning and business decisions. The greatest value of this data can be realized when it is analyzed rapidly to provide timely business insights. Any process can only be as timely as the underlying technology allows it to be.
Even data produced on a daily basis can exceed the capacity and capabilities of many pre-existing database management systems. This data can be structured or unstructured, static or streaming, and can undergo rapid, often unanticipated, change. It may require real-time or near-real-time transformation to be read into business intelligence (BI) systems. For these reasons, data integration platforms must be flexible and extensible enough to accommodate the business’s data types and usage patterns.
There’s the usual homage to the benefits of data integration:
…
IT leaders should therefore try to integrate data across systems in a way that exposes it using standard and commonly implemented technologies such as SQL and REST. Integrating data and exposing it to applications, analytics, and reporting improves productivity, simplifies maintenance, and decreases the amount of time and effort required to make data-driven decisions.
…
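As an aside, “expose it using SQL and REST” can be smaller than vendors make it sound. A minimal sketch, assuming a local SQLite file and a hypothetical trades table; none of these names come from the paper:

```python
# Minimal sketch: expose a SQL-queryable table over REST.
# "warehouse.db", the "trades" table, and the endpoint path are all
# illustrative assumptions, not anything the paper specifies.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/trades/<int:trade_id>")
def get_trade(trade_id):
    conn = sqlite3.connect("warehouse.db")
    conn.row_factory = sqlite3.Row  # rows become dict-like
    row = conn.execute(
        "SELECT * FROM trades WHERE id = ?", (trade_id,)
    ).fetchone()
    conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(dict(row))

if __name__ == "__main__":
    app.run()
```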
The paper covers, lightly, Operational Data Store (ODS) / Enterprise Data Hub (EDH), Enterprise Data Warehouse (EDW), Logical Data Warehouse (LDW), and Data Lake as data integration options.
Having found existing systems deficient in one or more ways, the report goes on to recommend replacement with Voracity.
To be fair, as described, all four systems plus Voracity are deficient in the same way. The hard part of data integration, the rub that lies at the heart of the task, is passed over as ETL.
Efficient and correct ETL requires knowledge of what column headers identify. From the Enron spreadsheets, for instance, can you specify the transformation of the data in the columns “A, B, C, D, E, F…” from andrea_ring_15_IFERCnov.xlsx, or “A, B, C, D, E,…” from andy_zipper__129__Success-TradeLog.xlsx?
With enough effort, no doubt you could go through the spreadsheets of interest and create a mapping sufficient to transform the data, but where are you going to write down the facts you established for each column, the facts that underlie your transformation?
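To see the problem concretely, here is a hedged sketch of the kind of transform an ETL specialist might write for such a column. The assumed year, the date format, the calendar: every one of those facts lives only in the code and in the programmer’s head.

```python
# Hypothetical transform for an unlabeled spreadsheet column of
# "month/day" dates with no year. Illustrative only.
from datetime import date

ASSUMED_YEAR = 2001  # assumption: the sheet covers a single, known year

def transform_column_f(cell: str) -> date:
    """Parse a 'month/day' cell into a full date."""
    # That the cell is a date at all, that the calendar is Gregorian,
    # and that the year is 2001 are facts recorded nowhere but here.
    month, day = (int(part) for part in cell.split("/"))
    return date(ASSUMED_YEAR, month, day)

print(transform_column_f("11/15"))  # -> 2001-11-15
```

When this code is thrown away or its author moves on, so does every fact it embodied.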
In topic maps, we made the mistake of mystifying the facts for each column by claiming to talk about subject identity, which has heavy ontological overtones.
What we should have asked was: where do we write down subject identifications?
Thus:
1. What do you want to talk about?
2. Data in column F in andrea_ring_15_IFERCnov.xlsx.
3. Do you want to talk about each entry separately?
4. What subject is each entry? (A date written month/day, with no year.)
5. What calendar system was used for the date?
6. Who created that date entry? (If you want to talk about them as well, create a separate topic and an association to the spreadsheet.)
7. The date is the date of … ?
8. Conversion rules for dates in column F, such as supplying the year (see the sketch after this list).
9. Merging rules for #2? (Date comparison.)
10. Do you want relationships between #2 and the other data in each row? (More associations.)
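Here is one hedged sketch of where the answers could be written down: as plain data, per column. The field names are illustrative, not any standard:

```python
# Illustrative record of the answers above for one column. The field
# names are made up for this sketch; the point is that the facts live
# in data, not only in an ETL programmer's head.
column_f = {
    "subject_locator": "andrea_ring_15_IFERCnov.xlsx#F",
    "subject": "a date written month/day, year omitted",
    "calendar": "Gregorian",
    "created_by": None,       # unknown; could become its own topic + association
    "date_is_date_of": None,  # the question above leaves this open
    "conversion_rules": {"assumed_year": 2001},  # assumption for the sketch
    "merging_rules": "merge entries equal on (year, month, day)",
    "row_associations": "relate each entry to the other data in its row",
}
```

Nothing here requires a special query language; it is the same facts the questions elicited, made durable.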
With simple questions, we have documented column F of a particular spreadsheet for any present or future ETL operation. No magic, no logical conundrums, no special query language, just asking what an author or ETL specialist knew but didn’t write down.
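For contrast with the earlier hard-coded transform, a sketch of the same conversion driven by the recorded facts, so a future ETL run reads the assumptions instead of rediscovering them:

```python
from datetime import date

# Minimal slice of the hypothetical column F record from the sketch above.
column_f = {"conversion_rules": {"assumed_year": 2001}}

def transform_from_record(cell: str, record: dict) -> date:
    """Parse a 'month/day' cell using the facts recorded for its column."""
    year = record["conversion_rules"]["assumed_year"]  # read, not hard-coded
    month, day = (int(part) for part in cell.split("/"))
    return date(year, month, day)

print(transform_from_record("11/15", column_f))  # -> 2001-11-15
```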
There are subtleties, such as distinguishing between subject identifiers (which identify a subject, like a wiki page about it) and subject locators (which point to the subject we want to talk about, like a particular spreadsheet), but identifying what you want to talk about, that is, subject identifications and where to write them down, is more familiar ground than our prior obscurities.
Once those identifications are written down, you can search them to discover the same subjects identified differently, or with properties recorded in one identification and not another. Think of it as capturing the human knowledge that resides in the brains of your staff and ETL experts.
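A hedged sketch of that payoff: once identifications are data, finding the same subject identified in two places is an ordinary search and merge. Matching on a shared subject identifier is deliberately the simplest possible rule:

```python
# Sketch: detect and merge two identification records for the same
# subject. Real topic map merging is richer; this shows the idea only.
def same_subject(a: dict, b: dict) -> bool:
    return bool(set(a["identifiers"]) & set(b["identifiers"]))

def merge(a: dict, b: dict) -> dict:
    return {
        "identifiers": sorted(set(a["identifiers"]) | set(b["identifiers"])),
        # Union the properties; one record may carry facts the other lacks.
        "properties": {**a["properties"], **b["properties"]},
    }

rec1 = {"identifiers": ["http://example.com/id/trade-date"],
        "properties": {"calendar": "Gregorian"}}
rec2 = {"identifiers": ["http://example.com/id/trade-date"],
        "properties": {"format": "month/day, no year"}}

if same_subject(rec1, rec2):
    print(merge(rec1, rec2))
```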
The ETL assumed by the Bloor Group should be written ETLD: Extract, Transform, Load, Dump (knowledge). That seems remarkably inefficient and costly to me. You?