Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 11, 2012

Crowdsourcing – A Solution to your “Bad Data” Problems

Filed under: Crowd Sourcing,Data Quality — Patrick Durusau @ 3:11 pm

Crowdsourcing – A Solution to your “Bad Data” Problems by Hollis Tibbetts.

Hollis writes:

Data problems – whether they be inaccurate data, incomplete data, data categorization issues, duplicate data, data in need of enrichment – are age-old.

IT executives consistently agree that data quality/data consistency is one of the biggest roadblocks to them getting full value from their data. Especially in today’s information-driven businesses, this issue is more critical than ever.

Technology, however, has not done much to help us solve the problem – in fact, technology has resulted in the increasingly fast creation of mountains of “bad data”, while doing very little to help organizations deal with the problem.

One “technology” holds much promise in helping organizations mitigate this issue – crowdsourcing. I put the word technology in quotation marks – as it’s really people that solve the problem, but it’s an underlying technology layer that makes it accurate, scalable, distributed, connectable, elastic and fast. In an article earlier this week, I referred to it as “Crowd Computing”.

Crowd Computing – for Data Problems

The Human “Crowd Computing” model is an ideal approach for newly entered data that needs to either be validated or enriched in near-realtime, or for existing data that needs to be cleansed, validated, de-duplicated and enriched. Typical data issues where this model is applicable include:

  • Verification of correctness
  • Data conflict and resolution between different data sources
  • Judgment calls (such as determining relevance, format or general “moderation”)
  • “Fuzzy” referential integrity judgment
  • Data error corrections
  • Data enrichment or enhancement
  • Classification of data based on attributes into categories
  • De-duplication of data items
  • Sentiment analysis
  • Data merging
  • Image data – correctness, appropriateness, appeal, quality
  • Transcription (e.g. hand-written comments, scanned content)
  • Translation

In areas such as the Data Warehouse, Master Data Management or Customer Data Management, Marketing databases, catalogs, sales force automation data, inventory data – this approach is ideal – or any time that business data needs to be enriched as part of a business process.
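
Before turning to Hollis’s larger point, a concrete note: on the platform side, most of the tasks in that list reduce to asking several workers the same question and aggregating their answers. A minimal sketch of that aggregation step in Python (illustrative names only, not any particular vendor’s API):

    from collections import Counter

    def majority_vote(judgments, min_agreement=0.66):
        """Accept the crowd's answer only if enough workers agree.
        Returns (answer, confidence); answer is None below the bar."""
        counts = Counter(judgments)
        answer, votes = counts.most_common(1)[0]
        confidence = votes / len(judgments)
        return (answer if confidence >= min_agreement else None), confidence

    # Three workers judge whether two records describe the same customer:
    print(majority_vote(["duplicate", "duplicate", "distinct"]))
    # -> ('duplicate', 0.666...)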

Hollis has a number of good points. But the choice doesn’t have to be “big data/iron” versus “crowd computing.”

You are more likely to get useful results from some combination of the two.

Make “big data/iron” responsible for raw access, processing, and visualization in an interactive environment, with semantics supplied by the “crowd computers.”

And vet participants on both sides in real time. It would be a novel thing to have firms competing to supply the interactive environment, paid on the basis of how many “crowd computers” preferred it or how much better their results were.
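
As a rough sketch of that routing (Python; every name and threshold below is my assumption, not from Hollis’s article): the “big iron” side scores each record, high-confidence results pass straight through, and the rest go only to workers whose accuracy on known-answer (“gold”) tasks holds up.

    GOLD_THRESHOLD = 0.8      # minimum accuracy on gold tasks to stay vetted
    MACHINE_THRESHOLD = 0.9   # machine confidence above this skips the crowd

    def route(record, machine_confidence, workers):
        # Confident machine results bypass the crowd entirely.
        if machine_confidence >= MACHINE_THRESHOLD:
            return ("machine", record)
        # Otherwise send the record only to currently vetted workers.
        vetted = [w for w in workers if w["gold_accuracy"] >= GOLD_THRESHOLD]
        return ("crowd", record, vetted)

    workers = [{"id": "w1", "gold_accuracy": 0.95},
               {"id": "w2", "gold_accuracy": 0.60}]  # w2 fails vetting
    print(route({"name": "ACME Corp"}, 0.55, workers))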

That is a ways past where Hollis is going, but I think it leads naturally in that direction.

April 8, 2012

Data and the Liar’s Paradox

Filed under: Data,Data Quality,Marketing — Patrick Durusau @ 4:20 pm

Data and the Liar’s Paradox by Jim Harris.

Jim writes:

“This statement is a lie.”

That is an example of what is known in philosophy and logic as the Liar’s Paradox because if “this statement is a lie” is true, then the statement is false, which would in turn mean that it’s actually true, but this would mean that it’s false, and so on in an infinite, and paradoxical, loop of simultaneous truth and falsehood.

I have never been a fan of the data management concept known as the Single Version of the Truth, and I often quote Bob Kotch, via Tom Redman’s excellent book, Data Driven: “For all important data, there are too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. This does not imply malfeasance on anyone’s part; it is simply a fact of life. Getting everyone to work from a Single Version of the Truth may be a noble goal, but it is better to call this the One Lie Strategy than anything resembling truth.”

More business/data quality reading.

Imagine my chagrin, after years of studying literary criticism in graduate seminary classes (don’t ask, it’s a long and boring story), to discover that business types already know “truth” is a relative thing.

What does that mean for topic maps?

I would argue that, with careful design, we can capture several points of view, using one point of view as our vantage point.

As opposed to strategies that can capture only a single point of view: their own.
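
In data terms, a minimal sketch of the difference (Python; the structure is loosely modeled on topic map scopes, not a full TMDM implementation, and the values are made up): the same subject carries several scoped assertions rather than one “true” value, and a viewpoint is just the vantage point you read it from.

    subject = {
        "id": "customer-42",
        "names": [
            {"value": "ACME Corporation", "scope": "legal-department"},
            {"value": "Acme",             "scope": "sales-team"},
        ],
    }

    def names_for(subject, viewpoint):
        """Read the subject from one declared vantage point."""
        return [n["value"] for n in subject["names"] if n["scope"] == viewpoint]

    print(names_for(subject, "sales-team"))  # -> ['Acme']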

Capturing multiple viewpoints will be a hot topic when “big data” starts to hit the “big fan.”

Books That Influenced My Thinking: Quality, Productivity and Competitive Position

Filed under: Data Quality,Marketing — Patrick Durusau @ 4:19 pm

Books That Influenced My Thinking: Quality, Productivity and Competitive Position by Thomas Redman.

From the post:

I recently learned that Technics Publications, led by Steve Hoberman, is re-issuing one of my favorites, Data and Reality by William Kent. It led me to conclude I ought to review some of the books that most influenced my thinking about data quality. (I’ll include Data and Reality, when the re-issue appears). I am explicitly excluding books on data quality per se.

First up is Dr. Deming’s Quality, Productivity and Competitive Position (QPC). First published in 1982, to me this is Deming at his finest. The more famous Out of The Crisis came out about the same time and the two cover much the same material. But QPC is raw, powerful Deming. He is fed up with the economic malaise of corporate America at the time and he rails against top management for simply not understanding the role of quality in marketplace competition.

Data quality is a “hot” topic these days. I thought it might be useful to see what business perspective resources were available on the topic.

Both to learn management “speak” about data quality and to see how solutions are evaluated.

QPC sounds a bit dated (1982), but I rather doubt management has changed that much, although the terms by which management is described have probably changed a lot. Not the terms used by their employees, but the terms used by consultants who are being paid by management. 😉

Not to forget that topic maps, whether as information products, information services, or software, face the same issues of quality, productivity and competitive position.

March 28, 2012

Designing User Experiences for Imperfect Data

Filed under: Data Quality,Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 4:21 pm

Designing User Experiences for Imperfect Data by Matthew Hurst.

Matthew writes:

Any system that uses some sort of inference to generate user value is at the mercy of the quality of the input data and the accuracy of the inference mechanism. As neither of these can be guaranteed to be perfect, users of the system will inevitably come across incorrect results.

In web search we see this all the time with irrelevant pages being surfaced. In the context of track // microsoft, I see this in the form of either articles that are incorrectly added to the wrong cluster, or articles that are incorrectly assigned to no cluster, becoming orphans.

It is important, therefore, to take these imperfections into account when building the interface. This is not necessarily a matter of pretending that they don’t exist, or tricking the user. Rather it is a problem of eliciting an appropriate reaction to error. The average user is not conversant in error margins and the like, and thus tends to over-weight errors leading to the perception of poorer quality in the good stuff.

I am not really sure how Matthew finds imperfect data, but I guess I will just have to take his word for it. 😉

Seriously, I think he is spot on in observing that expecting users to hunt-n-peck through search results is wearing a bit thin. That is going to be particularly so when better search systems make the hidden cost of hunt-n-peck visible.
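
One way an interface might elicit Matthew’s “appropriate reaction to error” is to present only high-confidence inferences as settled and hedge the rest. A small sketch (Python; the thresholds, labels, and sample article title are my assumptions, not from Matthew’s post):

    def present(article_title, cluster, confidence):
        # State confident assignments plainly, hedge the middle band,
        # and admit ignorance rather than guess on the rest.
        if confidence >= 0.9:
            return f"{article_title} (story: {cluster})"
        if confidence >= 0.6:
            return f"{article_title} (possibly related: {cluster})"
        return f"{article_title} (unclustered)"

    print(present("Surface keyboard teardown", "Microsoft hardware", 0.72))
    # -> Surface keyboard teardown (possibly related: Microsoft hardware)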

Do take the time to visit his track // microsoft site.

Now imagine your own subject-specific and dynamic website. Or even search engine. It could be that search engines for “everything” are the modern-day dinosaurs. Big, clumsy, fairly crude.

March 10, 2012

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6)

Filed under: Data Management,Data Quality — Patrick Durusau @ 8:20 pm

When It Comes to Data Quality Delivery, the Soft Stuff is the Hard Stuff (Part 1 of 6) by Richard Trapp.

From the post:

I regularly receive questions regarding the types of skills data quality analysts should have in order to be effective. In my experience, regardless of scope, high performing data quality analysts need to possess a well-rounded, balanced skill set – one that marries technical “know how” and aptitude with a solid business understanding and acumen. But, far too often, it seems that undue importance is placed on what I call the data quality “hard skills”, which include: a firm grasp of database concepts, hands-on data analysis experience using standard analytical tool sets, expertise with commercial data quality technologies, knowledge of data management best practices and an understanding of the software development life cycle.

Read Richard’s post to get the listing of “soft skills” and evaluate yourself.

I am going to track this series and will post updates here.

Being successful with “big data,” semantic integration, or whatever the next buzzwords are, will require a mix of hard and soft skills.

Success has always required both hard and soft skills, but it doesn’t hurt to repeat the lesson.
