MIT Group Cites “Data Prep” as a Data Science Bottleneck
The bottleneck is varying data semantics, a problem familiar to anyone interested in topic maps. The traditional way of solving it is to clean the data for one purpose, which, unless the basis for the cleaning is recorded, leaves the data dirty for the next round of integration.
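A minimal sketch of what I mean by recording the basis for cleaning. The rules, field names and data here are invented for illustration; the point is only that the reason for each change survives alongside the cleaned value.

```python
# Minimal sketch: record *why* each cleaning rule fired, so the next
# integration effort can reuse (or reject) the basis for the cleanup.
# All field names and rules here are hypothetical.

from dataclasses import dataclass, field


@dataclass
class CleaningStep:
    rule: str      # human-readable basis for the change
    column: str    # which attribute was touched
    before: str
    after: str


@dataclass
class CleanedRecord:
    values: dict
    provenance: list = field(default_factory=list)   # the recorded basis


def normalize_country(record: CleanedRecord) -> CleanedRecord:
    raw = record.values.get("country", "")
    if raw.strip().upper() in {"USA", "U.S.", "US"}:
        record.provenance.append(
            CleaningStep("map US variants to ISO 'US'", "country", raw, "US")
        )
        record.values["country"] = "US"
    return record


rec = CleanedRecord({"name": "Acme Corp", "country": "U.S."})
rec = normalize_country(rec)
print(rec.values)       # {'name': 'Acme Corp', 'country': 'US'}
print(rec.provenance)   # the basis survives for the next round of integration
```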
What do you think is being described in this text?
Much of Veeramachaneni’s recent research has focused on how to automate this lengthy data prep process. “Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say ‘Great, but we’re still stuck with the problem of getting the raw data to a place where we can use all these tools,’” Veeramachaneni says.
The proliferation of data sources and the time it takes to prepare these massive reserves of data are the core problems Tamr is attacking. The knee-jerk reaction to this next-gen integration and preparation problem tends to be “Machine Learning” — a cure for all ills. But as Veeramachaneni points out, machine learning can’t resolve all data inconsistencies:
Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, “so it won’t take up too much of their time,” he says. “Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?”
Tamr has built the very sort of collaborative framework Veeramachaneni mentions, drawing from the best of machine and human learning to connect hundreds or thousands of data sources.
Top-down, deterministic data unification approaches (such as ETL, ELT and MDM) were not designed to scale to the variety of hundreds or thousands or even tens of thousands of data silos (perpetual and proliferating). Traditional deterministic systems depend on a highly trained architect developing a “master” schema — “the one schema to rule them all” — which we believe is a red herring. Embracing the fundamental diversity and ever-changing nature of enterprise data and semantics leads you towards a bottom-up, probabilistic approach to connecting data sources from various enterprise silos.
You also have to engage the source owners collaboratively to curate the variety of data at scale, which is Tamr’s core design pattern. Advanced algorithms automatically connect the vast majority of the sources while resolving duplications, errors and inconsistencies among sources, attributes and records — a bottom-up, probabilistic solution that is reminiscent of Google’s full-scale approach to web search and connection. When the Tamr system can’t resolve connections automatically, it calls for human expert guidance, using people in the organization familiar with the data to weigh in on the mapping and improve its quality and integrity.
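Not having seen Tamr's internals, here is only a rough sketch of the general pattern being described: score candidate matches probabilistically, auto-accept the confident ones, and route the rest to a domain expert. The similarity measure and the threshold are arbitrary choices for illustration, not Tamr's actual method.

```python
# Rough sketch of the pattern described above (not Tamr's actual method):
# probabilistically score candidate attribute matches, auto-accept the
# confident ones, and queue the ambiguous ones for a domain expert.

from difflib import SequenceMatcher

AUTO_ACCEPT = 0.9   # arbitrary threshold for illustration


def similarity(a: str, b: str) -> float:
    """Crude stand-in for a real probabilistic matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(source_attrs, target_attrs):
    accepted, needs_review = [], []
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= AUTO_ACCEPT:
            accepted.append((s, best, score))
        else:
            needs_review.append((s, best, score))   # escalate to an expert
    return accepted, needs_review


accepted, needs_review = match_attributes(
    ["cust_name", "phone_no", "zip"],
    ["customer_name", "telephone", "postal_code"],
)
print("auto-connected:", accepted)
print("for expert review:", needs_review)
```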
Offhand, I would say it is a topic map authoring solution with algorithms that assist the authors, where the authoring has been crowd-sourced.
What I don’t know is whether the experts’ insight is captured as dark data (A matches B) or whether their identifications are preserved so they can be reused in the future (the properties of A that result in a match with the properties of B).
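To make the distinction concrete, here is a toy contrast between the two ways of recording an expert's judgment. The structures and field names are mine, not Tamr's.

```python
# Toy contrast between two ways of recording an expert's match judgment.
# The structures and field names are invented for illustration.

# 1. Dark data: only the bare assertion survives.
dark_match = {"subject_a": "record:123", "subject_b": "record:987", "match": True}

# 2. Preserved identification: the *basis* for the match survives,
#    so it can be reapplied when new records with the same properties appear.
reusable_match = {
    "subject_a": "record:123",
    "subject_b": "record:987",
    "match": True,
    "basis": [
        {"property": "tax_id", "a": "94-1234567", "b": "94-1234567",
         "rule": "exact"},
        {"property": "name", "a": "Acme Corp.", "b": "ACME Corporation",
         "rule": "normalized string match"},
    ],
}


def reapply(basis, new_a, new_b):
    """Reuse a preserved basis on a new pair of records (exact rules only here)."""
    return all(
        new_a.get(step["property"]) == new_b.get(step["property"])
        for step in basis
        if step["rule"] == "exact"
    )
```

The first form answers "did these two match?"; only the second form lets future integrations ask "why did they match, and does that reasoning apply here?"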
I didn’t register, so I can’t see the “white paper.” Let me know how close I came if you decide to get the “white paper.” Scientists are donating research data in the name of open science, but startups are still farming registration data.