ETL: The Dirty Little Secret of Data Science by Byron Ruth.
From the description:
“There is an adage that given enough data, a data scientist can answer the world’s questions. The untold truth is that the majority of work happens during the ETL and data preprocessing phase. In this talk I discuss Origins, an open source Python library for extracting and mapping structural metadata across heterogeneous data stores.”
More than your usual ETL presentation, Byron makes several points of interest to the topic map community:
- “domain knowledge” is necessary for effective ETL
- “domain knowledge” changes and fades from disuse
- ETL is a “black box”: it isn’t transparent to the consumers of the data it produces
- Data provenance is the answer to transparency, to changing domain knowledge, and to persisting domain knowledge
- “Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.”
- Project Origins captures metadata and structures from backends and persists them to Neo4j
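The quoted provenance definition maps cleanly onto the W3C PROV categories of entity, activity, and agent. As a minimal sketch of what such a record might hold, assuming nothing about the Origins API itself (every name below is illustrative):

```python
# Illustrative provenance record following the PROV categories quoted
# above: an entity (the data), the activity that produced it, the agent
# responsible, and the upstream entities used. Hypothetical names only;
# this is not the Origins API.
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    entity: str                               # the piece of data or thing
    activity: str                             # what produced or delivered it
    agent: str                                # person or institution involved
    used: list = field(default_factory=list)  # upstream entities consumed


# Example: a cleaned table derived from a raw database dump by an ETL step.
record = ProvenanceRecord(
    entity="patients_clean.csv",
    activity="etl:normalize-demographics",
    agent="analyst:example",
    used=["raw/patients_dump.sql"],
)
```

A record like this is what makes the ETL step inspectable after the fact: a consumer can walk from the output entity back through the activity to its inputs.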
A great focus on provenance, but given the lack of merging in Neo4j, collating information about a common subject that appears under different names is going to be a manual process.
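That manual collation step might look something like the following sketch, where a hand-maintained alias map groups metadata records that name the same subject differently. Everything here (the alias table, the helper, the field names) is hypothetical, illustrating the manual work rather than any Origins or Neo4j feature:

```python
# Hypothetical sketch of the manual collation described above: records
# from different backends refer to one subject under different names,
# so a human-curated alias map is needed to group them.
ALIASES = {
    "patient_id": "subject:patient-id",  # SQL column name
    "PatientID": "subject:patient-id",   # spreadsheet header
    "pid": "subject:patient-id",         # API field name
}


def collate(records):
    """Group metadata records by canonical subject; names with no known
    alias fall through unchanged, left for a human to resolve."""
    merged = {}
    for rec in records:
        key = ALIASES.get(rec["name"], rec["name"])
        merged.setdefault(key, []).append(rec)
    return merged


grouped = collate([
    {"name": "patient_id", "source": "postgres"},
    {"name": "PatientID", "source": "excel"},
    {"name": "visit_date", "source": "postgres"},
])
```

The alias map is exactly the piece that a merging-aware store would maintain for you; without it, every new backend means another round of hand curation.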
Follow @thedevel.