Archive for the ‘Data Quality’ Category

A different take on data skepticism

Thursday, April 25th, 2013

A different take on data skepticism by Beau Cronin.

From the post:

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

…Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?

Beau makes several good points on questioning data methods.
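The k-means point in particular is easy to see in a few lines. What follows is only a minimal sketch, not anything from Beau's post: `kmeans_1d` is a hypothetical helper (a bare-bones one-dimensional Lloyd's algorithm with a deterministic start), and the evenly spaced test data is an assumption chosen because it has no natural grouping at all.

```python
def kmeans_1d(values, k, iterations=50):
    """Bare-bones Lloyd's algorithm on 1-D data; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic start (an assumption of this sketch): spread the
    # initial centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members (leave it alone if empty).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Evenly spaced points: "truly undifferentiated" data.
undifferentiated = [float(i) for i in range(100)]

for k in (4, 7):
    labels = kmeans_1d(undifferentiated, k)
    print(k, "clusters requested,", len(set(labels)), "clusters returned")
```

Ask for four buckets and you get four; ask for seven and you get seven. The method never reports that the question was a poor one.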

I would extend those “…more fundamental questions…” to data as well.

Data, at least as far as I know, doesn’t drop from the sky. It is collected, generated, sometimes both, by design.

That design had some reason for collecting that data, in some particular way and in a given format.

Like methods, data stands mute with regard to those designs: what choices were made, by whom, and for what reason.

Giving voice to what can be known about methods and data falls to human users.

The Costs and Profits of Poor Data Quality

Tuesday, April 16th, 2013

The Costs and Profits of Poor Data Quality by Jim Harris.

From the post:

Continuing the theme of my two previous posts, which discussed when it’s okay to call data quality as good as it needs to get and when perfect data quality is necessary, in this post I want to briefly discuss the costs — and profits — of poor data quality.

Loraine Lawson interviewed Ted Friedman of Gartner Research about How to Measure the Cost of Data Quality Problems, such as the costs associated with reduced productivity, redundancies, business processes breaking down because of data quality issues, regulatory compliance risks, and lost business opportunities. David Loshin blogged about the challenge of estimating the cost of poor data quality, noting that many estimates, upon close examination, seem to rely exclusively on anecdotal evidence.

As usual, Jim does a very good job of illustrating costs and profits from poor data quality.

I have a slightly different question:

What could you know about data to spot that it is of poor quality?

It is one thing to find out after a space ship crashes that poor data quality was responsible, but it would be better to spot the error beforehand. As in before the launch.

The answer is probably data specific, but are there any general types of information that would help you spot poor quality data?

Before you are 1,000 meters off the lunar surface. ;-)

Why Data Lineage is Your Secret … Weapon [Auditing Topic Maps]

Sunday, March 10th, 2013

Why Data Lineage is Your Secret Data Quality Weapon by Dylan Jones.

From the post:

Data lineage means many things to many people but it essentially refers to provenance – how do you prove where your data comes from?

It’s really a simple exercise. Just pull an imaginary string of data from where the information presents itself, back through the labyrinth of data stores and processing chains, until you can go no further.

I’m constantly amazed by why so few organisations practice sound data lineage management despite having fairly mature data quality or even data governance programs. On a side note, if ever there was a justification for the importance of data lineage management then just take a look at the brand damage caused by the recent European horse meat scandal.

But I digress. Why is data lineage your secret data quality weapon?

The simple answer is that data lineage forces your organisation to address two big issues that become all too apparent:

  • Lack of ownership
  • Lack of formal information chain design

Or to put it into a topic map context, can you trace what topics merged to create the topic you are now viewing?

And if you can’t trace, how can you audit the merging of topics?

And if you can’t audit, how do you determine the reliability of your topic map?

That is reliability in terms of date (freshness), source (reliable or not), evaluation (by screeners), comparison (to other sources), etc.

Same questions apply to all data aggregation systems.

Or as Mrs. Weasley tells Ginny:

“Never trust anything that can think for itself if you can’t see where it keeps its brain.”


Correction: Wesley -> Weasley. We had a minister friend over Sunday and were discussing the former, not the latter. ;-)
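The audit questions above (which topics merged to create this one, from which sources, and how fresh) can be sketched in a few lines. This is a minimal illustration only: the `Topic` class, its fields, and the `merge` and `lineage` helpers are all hypothetical names, not part of any topic map standard or software.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    source: str = ""      # where an unmerged topic came from (hypothetical field)
    retrieved: str = ""   # ISO date the source was read (hypothetical field)
    merged_from: list = field(default_factory=list)


def merge(name, *topics):
    """Create a merged topic that remembers the topics it was built from."""
    return Topic(name=name, merged_from=list(topics))


def lineage(topic):
    """Walk back through every merge to the original, unmerged topics."""
    if not topic.merged_from:
        return [topic]
    found = []
    for parent in topic.merged_from:
        found.extend(lineage(parent))
    return found


a = Topic("Molly Weasley", source="catalog-a", retrieved="2013-03-01")
b = Topic("Mrs. Weasley", source="catalog-b", retrieved="2012-11-15")
c = Topic("M. Weasley", source="catalog-c", retrieved="2013-02-20")

merged = merge("Molly Weasley", merge("Molly Weasley", a, b), c)

for origin in lineage(merged):
    print(origin.source, origin.retrieved)
```

If the merge step discards `merged_from`, there is nothing left to audit. Keeping it costs one list per topic.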

Applying “Lateral Thinking” to Data Quality

Saturday, December 8th, 2012

Applying “Lateral Thinking” to Data Quality by Ken O’Connor.

From the post:

I am a fan of Edward De Bono, the originator of the concept of Lateral Thinking. One of my favourite examples of De Bono’s brilliance, relates to dealing with the worldwide problem of river pollution.

[Image: River Discharge Pipe]

De Bono suggested “each factory must be downstream of itself” – i.e. Require factories’ water inflow pipes to be just downstream of their outflow pipes.

Suddenly, the water quality in the outflow pipe becomes a lot more important to the factory. Apparently several countries have implemented this idea as law.

What has this got to do with data quality?

By applying the same principle to data entry, all downstream data users will benefit, and information quality will improve.

How could this be done?

So how do you move the data input pipe just downstream of the data outflow pipe?

Before you take a look at Ken’s solution, take a few minutes to brainstorm about how you would do it.

Important for semantic technologies because there aren’t enough experts to go around. Meaning non-expert users will do a large portion of the work.

Comments/suggestions?

The Seventh Law of Data Quality

Saturday, November 24th, 2012

The Seventh Law of Data Quality by Jim Harris.

Jim’s series on the “laws” of data quality can be recommended without reservation. There are links to each one in his coverage of the seventh law.

The seventh law of data quality reads:

Determine the business impact of data quality issues BEFORE taking any corrective action in order to properly prioritize data quality improvement efforts.

I would modify that slightly to make it applicable to data issues more broadly as:

Determine the business impact of a data issue BEFORE addressing it at all.

Your data may be completely isolated in silos, but without a business purpose to be served by freeing them, why bother?

And that purpose should have a measurable ROI.

In the absence of a business purpose and a measurable ROI, keep both hands on your wallet.

Acknowledging Errors in Data Quality

Sunday, October 28th, 2012

Acknowledging Errors in Data Quality by Jim Harris.

From the post:

The availability heuristic is a mental shortcut that occurs when people make judgments based on the ease with which examples come to mind. Although this heuristic can be beneficial, such as when it helps us recall examples of a dangerous activity to avoid, sometimes it leads to availability bias, where we’re affected more strongly by the ease of retrieval than by the content retrieved.

In his thought-provoking book “Thinking, Fast and Slow,” Daniel Kahneman explained how availability bias works by recounting an experiment where different groups of college students were asked to rate a course they had taken the previous semester by listing ways to improve the course — while varying the number of improvements that different groups were required to list.

Jim applies the result of Kahneman’s experiment to data quality issues and concludes:

  • Isolated errors – Management chooses one-time data cleaning projects.
  • Ten errors – Management concludes overall data quality must not be too bad (availability heuristic).

I need to re-read Kahneman but have you seen suggestions for overcoming the availability heuristic?

Data Preparation: Know Your Records!

Thursday, October 25th, 2012

Data Preparation: Know Your Records! by Dean Abbott.

From the post:

Data preparation in data mining and predictive analytics (dare I also say Data Science?) rightfully focuses on how the fields in ones data should be represented so that modeling algorithms either will work properly or at least won’t be misled by the data. These data preprocessing steps may involve filling missing values, reigning in the effects of outliers, transforming fields so they better comply with algorithm assumptions, binning, and much more. In recent weeks I’ve been reminded how important it is to know your records. I’ve heard this described in many ways, four of which are:
  • the unit of analysis
  • the level of aggregation
  • what a record represents
  • unique description of a record

A bit further on Dean reminds us:

What isn’t always obvious is when our assumptions about the data result in unexpected results. What if we expect the unit of analysis to be customerID/Session but there are duplicates in the data? Or what if we had assumed customerID/Session data but it was in actuality customerID/Day data (where ones customers typically have one session per day, but could have a dozen)? (emphasis added)

Obvious once Dean says it, but how often do you question assumptions about data?

Do you know what impact incorrect assumptions about data will have on your operations?

If you investigate your assumptions about data, where do you record your observations?

Or will you repeat the investigation with every data dump from a particular source?

Describing data “in situ” could benefit you six months from now or your successor. (The data and/or its fields would be treated as subjects in a topic map.)
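Dean's duplicate question is cheap to ask of any data dump. A minimal sketch, assuming records arrive as dictionaries; the `duplicate_keys` helper and the field names `customer_id`, `session_id` and `day` are hypothetical and stand in for whatever your source actually calls them.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return the key tuples that occur more than once, with their counts."""
    counts = Counter(tuple(rec[f] for f in key_fields) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}


records = [
    {"customer_id": "c1", "session_id": "s1", "day": "2012-10-01"},
    {"customer_id": "c1", "session_id": "s2", "day": "2012-10-01"},
    {"customer_id": "c2", "session_id": "s1", "day": "2012-10-01"},
    {"customer_id": "c1", "session_id": "s1", "day": "2012-10-01"},  # repeated session row
]

# Is customer/session really the unit of analysis?
print(duplicate_keys(records, ["customer_id", "session_id"]))

# Would customer/day have been a safe assumption instead?
print(duplicate_keys(records, ["customer_id", "day"]))
```

An empty result supports the assumed unit of analysis; anything else is an observation worth recording next to the data.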

Working More Effectively With Statisticians

Sunday, September 23rd, 2012

Working More Effectively With Statisticians by Deborah M. Anderson. (Fall 2012 Newsletter of Society for Clinical Data Management, pages 5-8)

Abstract:

The role of the clinical trial biostatistician is to lend scientific expertise to the goal of demonstrating safety and efficacy of investigative treatments. Their success, and the outcome of the clinical trial, is predicated on adequate data quality, among other factors. Consequently, the clinical data manager plays a critical role in the statistical analysis of clinical trial data. In order to better fulfill this role, data managers must work together with the biostatisticians and be aligned in their understanding of data quality. This article proposes ten specific recommendations for data managers in order to facilitate more effective collaboration with biostatisticians.

See the article for the details but the recommendations are generally applicable to all data collection projects:

Recommendation #1: Communicate early and often with the biostatistician and provide frequent data extracts for review.

Recommendation #2: Employ caution when advising sites or interactive voice/web recognition (IVR/IVW) vendors on handling of randomization errors.

Recommendation #3: Collect the actual investigational treatment and dose group for each subject.

Recommendation #4: Think carefully and consult the biostatistician about the best way to structure investigational treatment exposure and accountability data.

Recommendation #5: Clarify in electronic data capture (EDC) specifications whether a question is only a “prompt” screen or whether the answer to the question will be collected explicitly in the database.

Recommendation #6: Recognize the most critical data items from a statistical analysis perspective and apply the highest quality standards to them.

Recommendation #7: Be alert to protocol deviations/violations (PDVs).

Recommendation #8: Plan for a database freeze and final review before database lock.

Recommendation #9: Archive a snapshot of the clinical database at key analysis milestones and at the end of the study.

Recommendation #10: Educate yourself about fundamental statistical principles whenever the opportunity arises.

I first saw this at John Johnson’s Data cleaning is harder than statistical analysis.

Living with Imperfect Data

Wednesday, July 4th, 2012

Living with Imperfect Data by Jim Ericson.

From the post:

In a keynote at our MDM & Data Governance conference in Toronto a few days ago, an executive from a large analytical software company said something interesting that stuck with me. I am paraphrasing from memory, but it was very much to the effect of, “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

Let that sink in for a moment.

After I did, the very idea of this comment struck me at a few levels. It might have the same effect on you.

In one sense, admitting there is an acceptable level of shared inaccuracy is anathema to the way we like to describe data governance. It was especially so at a MDM-centric conference where people are pretty single-minded about what constitutes “truth.”

As a decision support philosophy, it wouldn’t fly at a health care conference.

I rather like that: “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

I suspect that is because it is the opposite of how I really like to see data. I don’t want rough results, say in a citation network, but rather all the relevant citations. Even if it isn’t possible to review all the relevant citations, the set still needs to be complete.

But completeness is the enemy of results or at least published results. Sure, eventually, assuming a small enough data set, it is possible to map it in its entirety. But that means that whatever good would have come from it being available sooner, has been lost.

I don’t want to lose the sense of rough agreement posed here, because that is important as well. There are many cases where, despite the protests of the Fed and economists to the contrary, the numbers are almost fictional anyway. Pick some, they will be different soon enough. What counts is that we have agreed on numbers for planning purposes. Can always pick new ones.

The same is true, perhaps even more so, for topic maps. They are a view into an infoverse, fixed at a moment in time by authoring decisions.

Don’t like the view? Create another one.

Are You a Bystander to Bad Data?

Tuesday, June 5th, 2012

Are You a Bystander to Bad Data? by Jim Harris.

From the post:

In his recent Harvard Business Review blog post “Break the Bad Data Habit,” Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.

“At a minimum,” Redman explained, “others using the erred data may not spot the error. There is no telling where it might turn up or who might be victimized.” And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause. The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all. People and departments must continue to seek out and correct errors. They must also provide feedback and communicate requirements to their data sources.”

In his blog post, “The Secret to an Effective Data Quality Feedback Loop,” Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

[I removed two incorrect links in the quoted portion of Jim's article. They were pointers to the rapper "Redman" and not Tom Redman. I also posted a comment on Jim's blog about the error.]

Take the time to think about providing feedback on bad data.

Would bad data get corrected more often if correction was easier?

What if a data stream could be intercepted and corrected? Would that make correction easier?
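One way to picture intercepting a stream, offered as a sketch under stated assumptions rather than a design: records are dictionaries, known fixes live in a lookup table, and every fix is also logged so it can be sent back to the source, which is Redman's point about feedback. The `corrected` function, the `corrections` table and the `feedback` list are all hypothetical.

```python
def corrected(stream, corrections, feedback):
    """Yield records with known fixes applied; log each fix for the data source."""
    for record in stream:
        fixed = dict(record)  # leave the incoming record untouched
        for field_name, fixes in corrections.items():
            value = fixed.get(field_name)
            if value in fixes:
                fixed[field_name] = fixes[value]
                # Remember what was changed so the source can be told.
                feedback.append((field_name, value, fixes[value]))
        yield fixed


corrections = {"country": {"U.S.": "US", "Untied States": "US"}}
feedback = []
incoming = [
    {"id": 1, "country": "Untied States"},
    {"id": 2, "country": "CA"},
]

cleaned = list(corrected(incoming, corrections, feedback))
print(cleaned)
print(feedback)
```

The correction and the feedback happen in the same step, so fixing the data quietly and never telling the source is no longer the path of least resistance.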

Crowdsourcing – A Solution to your “Bad Data” Problems

Friday, May 11th, 2012

Crowdsourcing – A Solution to your “Bad Data” Problems by Hollis Tibbetts.

Hollis writes:

Data problems – whether they be inaccurate data, incomplete data, data categorization issues, duplicate data, data in need of enrichment – are age-old.

IT executives consistently agree that data quality/data consistency is one of the biggest roadblocks to them getting full value from their data. Especially in today’s information-driven businesses, this issue is more critical than ever.

Technology, however, has not done much to help us solve the problem – in fact, technology has resulted in the increasingly fast creation of mountains of “bad data”, while doing very little to help organizations deal with the problem.

One “technology” holds much promise in helping organizations mitigate this issue – crowdsourcing. I put the word technology in quotation marks – as it’s really people that solve the problem, but it’s an underlying technology layer that makes it accurate, scalable, distributed, connectable, elastic and fast. In an article earlier this week, I referred to it as “Crowd Computing”.

Crowd Computing – for Data Problems

The Human “Crowd Computing” model is an ideal approach for newly entered data that needs to either be validated or enriched in near-realtime, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:

  • Verification of correctness
  • Data conflict and resolution between different data sources
  • Judgment calls (such as determining relevance, format or general “moderation”)
  • “Fuzzy” referential integrity judgment
  • Data error corrections
  • Data enrichment or enhancement
  • Classification of data based on attributes into categories
  • De-duplication of data items
  • Sentiment analysis
  • Data merging
  • Image data – correctness, appropriateness, appeal, quality
  • Transcription (e.g. hand-written comments, scanned content)
  • Translation

In areas such as the Data Warehouse, Master Data Management or Customer Data Management, Marketing databases, catalogs, sales force automation data, inventory data – this approach is ideal – or any time that business data needs to be enriched as part of a business process.

Hollis has a number of good points. But the choice doesn’t have to be “big data/iron” versus “crowd computing.”

More likely to get useful results out of some combination of the two.

Make “big data/iron” responsible for raw access, processing, visualization in an interactive environment with semantics supplied by the “crowd computers.”

And vet participants on both sides in real time. Would be a novel thing to have firms competing to supply the interactive environment and being paid on the basis of the “crowd computers” that preferred it or got better results.

That is a ways past where Hollis is going but I think it leads naturally in that direction.

Data and the Liar’s Paradox

Sunday, April 8th, 2012

Data and the Liar’s Paradox by Jim Harris.

Jim writes:

“This statement is a lie.”

That is an example of what is known in philosophy and logic as the Liar’s Paradox because if “this statement is a lie” is true, then the statement is false, which would in turn mean that it’s actually true, but this would mean that it’s false, and so on in an infinite, and paradoxical, loop of simultaneous truth and falsehood.

I have never been a fan of the data management concept known as the Single Version of the Truth, and I often quote Bob Kotch, via Tom Redman’s excellent book, Data Driven: “For all important data, there are too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone’s part; it is simply a fact of life. Getting everyone to work from a Single Version of the Truth may be a noble goal, but it is better to call this the One Lie Strategy than anything resembling truth.”

More business/data quality reading.

Imagine my chagrin after years of studying literary criticism in graduate seminary classes (don’t ask, it’s a long and boring story) to discover that business types already know “truth” is a relative thing.

What does that mean for topic maps?

I would argue with careful design we can capture several points of view, using a point of view as our vantage point.

As opposed to strategies that can only capture a single point of view, their own.

Capturing multiple viewpoints will be a hot topic when “big data” starts to hit the “big fan.”

Books That Influenced My Thinking: Quality, Productivity and Competitive Position

Sunday, April 8th, 2012

Books That Influenced My Thinking: Quality, Productivity and Competitive Position by Thomas Redman.

From the post:

I recently learned that Technics Publications, led by Steve Hoberman, is re-issuing one of my favorites, Data and Reality by William Kent. It led me to conclude I ought to review some of the books that most influenced my thinking about data quality. (I’ll include Data and Reality, when the re-issue appears). I am explicitly excluding books on data quality per se.

First up is Dr. Deming’s Quality, Productivity and Competitive Position (QPC). First published in 1982, to me this is Deming at his finest. The more famous Out of The Crisis came out about the same time and the two cover much the same material. But QPC is raw, powerful Deming. He is fed up the economic malaise of corporate America at the time and he rails against top management for simply not understanding the role of quality in marketplace competition.

Data quality is a “hot” topic these days. I thought it might be useful to see what business perspective resources were available on the topic.

Both to learn management “speak” about data quality and how solutions are evaluated.

QPC sounds a bit dated (1982) but I rather doubt management has changed that much, albeit the terms by which management is described have probably changed a lot. Not the terms used by their employees but the terms used by consultants who are being paid by management. ;-)

Not to forget that topic maps as information products, information services or software, all face the same issues of quality, productivity and competitive position.

Designing User Experiences for Imperfect Data

Wednesday, March 28th, 2012

Designing User Experiences for Imperfect Data by Matthew Hurst.

Matthew writes:

Any system that uses some sort of inference to generate user value is at the mercy of the quality of the input data and the accuracy of the inference mechanism. As neither of these can be guaranteed to by perfect, users of the system will inevitably come across incorrect results.

In web search we see this all the time with irrelevant pages being surfaced. In the context of track // microsoft, I see this in the form of either articles that are incorrectly added to the wrong cluster, or articles that are incorrectly assigned to no cluster, becoming orphans.

It is important, therefore, to take these imperfections into account when building the interface. This is not necessarily a matter of pretending that they don’t exist, or tricking the user. Rather it is a problem of eliciting an appropriate reaction to error. The average user is not conversant in error margins and the like, and thus tends to over-weight errors leading to the perception of poorer quality in the good stuff.

I am not real sure how Matthew finds imperfect data but I guess I will just have to take his word for it. ;-)

Seriously, I think he is spot on in observing that expecting users to hunt-n-peck through search results is wearing a bit thin. That is going to be particularly so when better search systems make the hidden cost of hunt-n-peck visible.

Do take the time to visit his track // microsoft site.

Now imagine your own subject specific and dynamic website. Or even search engine. Could be that search engines for “everything” are the modern day dinosaurs. Big, clumsy, fairly crude.

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6)

Saturday, March 10th, 2012

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6) by Richard Trapp.

From the post:

I regularly receive questions regarding the types of skills data quality analysts should have in order to be effective. In my experience, regardless of scope, high performing data quality analysts need to possess a well-rounded, balanced skill set – one that marries technical “know how” and aptitude with a solid business understanding and acumen. But, far too often, it seems that undue importance is placed on what I call the data quality “hard skills”, which include; a firm grasp of database concepts, hands on data analysis experience using standard analytical tool sets, expertise with commercial data quality technologies, knowledge of data management best practices and an understanding of the software development life cycle.

Read Richard’s post to get the listing of “soft skills” and evaluate yourself.

I am going to track this series and will post updates here.

Being successful with “big data,” semantic integration, whatever the next buzz words are, will require a mix of hard and soft skills.

Success has always required both hard and soft skills, but it doesn’t hurt to repeat the lesson.