Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science. (Matt Wood, principal data scientist for Amazon Web Services)
Think about that for a moment.
Snowflakes are unique. Can the same be said about your data science projects?
Would that explain the figure that 80% of data science time is spent on cleaning, ETL, and similar data tasks?
Is it that data never gets clean, or are you cleaning the same data over and over again?
Barb Darrow reported in "From Amazon’s top data geek: data has got to be big — and reproducible":
The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.
In short, it’s great to get a result from your number crunching, but if the result is different next time out, there’s a problem. No self-respecting scientist would think of submitting the findings of a trial or experiment unless she is able to show that the results will be the same after multiple runs.
“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science,” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”
The next frontier is making sure that people can reproduce, reuse and remix their data, which provides a “tremendous amount of value,” Wood noted. (emphasis added)
I like that: Reproduce, Reuse, Remix data.
That’s going to require robust and granular handling of subject identity.
The three R’s of topic maps.
Yes?
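To make the point concrete, here is a minimal sketch (in Python, with hypothetical names of my own choosing) of what “robust and granular handling of subject identity” might look like in practice: every identifier known to refer to a subject is recorded against one canonical subject, so the cleaning decision is made once and can be replayed against the next data set rather than re-discovered each time.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """A single subject plus every identifier known to refer to it."""
    canonical_id: str
    identifiers: set = field(default_factory=set)


class SubjectRegistry:
    """Records identifier -> subject mappings so that merge/cleaning
    decisions can be reproduced, reused, and remixed later.
    (Illustrative only; not an implementation of any particular topic map API.)"""

    def __init__(self):
        self._by_identifier = {}  # identifier -> Subject

    def register(self, canonical_id, *identifiers):
        # Reuse the subject if any of its identifiers are already known.
        subject = self._by_identifier.get(canonical_id) or Subject(canonical_id, {canonical_id})
        for ident in identifiers:
            subject.identifiers.add(ident)
            self._by_identifier[ident] = subject
        self._by_identifier[canonical_id] = subject
        return subject

    def resolve(self, identifier):
        """Return the canonical id for any known identifier, or None."""
        subject = self._by_identifier.get(identifier)
        return subject.canonical_id if subject else None


# Usage: record the merge decision once...
registry = SubjectRegistry()
registry.register("aws", "Amazon Web Services", "AWS", "amazon-web-services")

# ...then replay it against new data instead of cleaning by hand again.
assert registry.resolve("AWS") == "aws"
assert registry.resolve("Amazon Web Services") == "aws"
```

The point of the sketch is not the code but the record: once the identity decisions are captured in a granular, replayable form, the same cleaning work does not have to be done twice.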