Data scientists: Question the integrity of your data by Rebecca Merrett.
From the post:
If there’s one lesson website traffic data can teach you, it’s that information is not always genuine. Yet, companies still base major decisions on this type of data without questioning its integrity.
At ADMA’s Advancing Analytics in Sydney this week, Claudia Perlich, chief scientist of Dstillery, a marketing technology company, spoke about the importance of filtering out noisy or artificial data that can skew an analysis.
“Big data is killing your metrics,” she said, pointing to the large portion of bot traffic on websites.
“If the metrics are not really well aligned with what you are truly interested in, they can find you a lot of clicking and a lot of homepage visits, but these are not the people who will buy the product afterwards because they saw the ad.”
Predictive models that look at which users go to some brands’ home pages, for example, are open to being completely flawed if data integrity is not called into question, she said.
“It turns out it is much easier to predict bots than real people. People write apps that skim advertising, so a model can very quickly pick up what that traffic pattern of bots was; it can predict very, very well who would go to these brands’ homepages as long as there was bot traffic there.”
The predictive model in this case will deliver accurate results when testing its predictions. However, that doesn’t bring marketers or the business closer to reaching its objective of real human ad conversions, Perlich said.
…
The online Merriam-Webster dictionary defines “integrity” as:
- firm adherence to a code of especially moral or artistic values : incorruptibility
- an unimpaired condition : soundness
- the quality or state of being complete or undivided : completeness
None of those definitions of “integrity” apply to the data Perlich describes.
What Perlich criticizes is measuring data with no relationship to the goal of the analysis, “…human ad conversions.”
That’s not “integrity” of data. Perhaps appropriateness, fitness for use, or relevance, but not “integrity.”
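To make the distinction concrete, here is a minimal sketch of the situation Perlich describes. It is my own toy illustration, not anything from her talk or from Dstillery. The visit counts, the `requests_per_minute` feature, and the threshold rule standing in for a "model" are all invented assumptions. The data is perfectly sound and complete; it simply measures the wrong thing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    requests_per_minute: int  # hypothetical feature, invented for this sketch
    visited_homepage: bool    # what the model is trained to predict
    converted: bool           # what the business actually cares about
    is_bot: bool


def build_synthetic_visits():
    """Return 100 made-up visits: 80 bots and 20 humans (assumed numbers)."""
    bots = [Visit(120, True, False, True) for _ in range(80)]
    buyers_via_homepage = [Visit(2, True, True, False) for _ in range(2)]
    homepage_browsers = [Visit(2, True, False, False) for _ in range(3)]
    buyers_elsewhere = [Visit(2, False, True, False) for _ in range(1)]
    other_humans = [Visit(2, False, False, False) for _ in range(14)]
    return (bots + buyers_via_homepage + homepage_browsers
            + buyers_elsewhere + other_humans)


def predicts_homepage_visit(visit, threshold=30):
    """Toy stand-in for a predictive model: fast clients 'will visit'."""
    return visit.requests_per_minute > threshold


def homepage_accuracy(visits):
    """Share of visits where the prediction matches the homepage label."""
    correct = sum(predicts_homepage_visit(v) == v.visited_homepage
                  for v in visits)
    return correct / len(visits)


def conversion_rate_of_targeted(visits):
    """Share of model-selected visits that ended in a real conversion."""
    targeted = [v for v in visits if predicts_homepage_visit(v)]
    if not targeted:
        return 0.0
    return sum(v.converted for v in targeted) / len(targeted)


if __name__ == "__main__":
    visits = build_synthetic_visits()
    print("homepage prediction accuracy:", homepage_accuracy(visits))
    print("conversion rate of targeted:", conversion_rate_of_targeted(visits))
```

On this invented data the rule predicts homepage visits with 95% accuracy, yet none of the visits it selects converts. Nothing about the records is corrupt or incomplete. The metric is just irrelevant to the goal.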
Avoid vague and moralizing terminology when discussing data and data science.
Discussions of ethics are difficult enough without introducing confusion with unrelated issues.
I first saw this in a tweet by Data Science Renee.