Archive for the ‘Data Governance’ Category

Hortonworks Establishes Data Governance Initiative

Monday, February 2nd, 2015

Hortonworks Establishes Data Governance Initiative

From the post:

Hortonworks® (NASDAQ:HDP), the leading contributor to and provider of enterprise Apache™ Hadoop®, today announced the creation of the Data Governance Initiative (DGI). DGI will develop an extensible foundation that addresses enterprise requirements for comprehensive data governance. In addition to Hortonworks, the founding members of DGI are Aetna, Merck, and Target and Hortonworks’ technology partner SAS.

Enterprises adopting a modern data architecture must address certain realities when legacy and new data from disparate platforms are brought under management. DGI members will work with the open source community to deliver a comprehensive solution; offering fast, flexible and powerful metadata services, deep audit store and an advanced policy rules engine. It will also feature deep integration with Apache Falcon for data lifecycle management and Apache Ranger for global security policies. Additionally, the DGI solution will interoperate with and extend existing third-party data governance and management tools by shedding light on the access of data within Hadoop. Further DGI investment roadmap phases will be released in the coming weeks.

Supporting quotes

“This joint engineering initiative is another pillar in our unique open source development model,” said Tim Hall, vice president, product management at Hortonworks. “We are excited to partner with the other DGI members to build a completely open data governance foundation that meets enterprise requirements.”

“As customers are moving Hadoop into corporate data and processing environments, metadata and data governance are much needed capabilities. SAS participation in this initiative strengthens the integration of SAS data management, analytics and visualization into the HDP environment and more broadly it helps advance the Apache Hadoop project. This additional integration will give customers better ability to manage big data governance within the Hadoop framework,” said SAS Vice President of Product Management Randy Guard.


Quite possibly an opportunity to push for topic map-like capabilities in an enterprise setting.

That will require affirmative action on the part of members of the TM community, as it is unlikely Hortonworks and others will educate themselves on topic maps.


Good Open Data… by design

Wednesday, November 5th, 2014

Good Open Data… by design by Victoria L. Lemieux, Oleg Petrov, and Roger Burks.

From the post:

An unprecedented number of individuals and organizations are finding ways to explore, interpret and use Open Data. Public agencies are hosting Open Data events such as meetups, hackathons and data dives. The potential of these initiatives is great, including support for economic development (McKinsey, 2013), anti-corruption (European Public Sector Information Platform, 2014) and accountability (Open Government Partnership, 2012). But is Open Data’s full potential being realized?

A news item from Computer Weekly casts doubt. A recent report notes that, in the United Kingdom (UK), poor data quality is hindering the government’s Open Data program. The report goes on to explain that – in an effort to make the public sector more transparent and accountable – UK public bodies have been publishing spending records every month since November 2010. The authors of the report, who conducted an analysis of 50 spending-related data releases by the Cabinet Office since May 2010, found that the data was of such poor quality that using it would require advanced computer skills.

Far from being a one-off problem, research suggests that this issue is ubiquitous and endemic. Some estimates indicate that as much as 80 percent of the time and cost of an analytics project is attributable to the need to clean up “dirty data” (Dasu and Johnson, 2003).

In addition to data quality issues, data provenance can be difficult to determine. Knowing where data originates and by what means it has been disclosed is key to being able to trust data. If end users do not trust data, they are unlikely to believe they can rely upon the information for accountability purposes. Establishing data provenance does not “spring full blown from the head of Zeus.” It entails a good deal of effort undertaking such activities as enriching data with metadata – data about data – such as the date of creation, the creator of the data, who has had access to the data over time and ensuring that both data and metadata remain unalterable.
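The provenance activities described here, recording who created data, when, and who has accessed it, while keeping that record unalterable, can be sketched as a hash-chained metadata log. This is an illustrative sketch only, not any particular tool’s implementation; the record fields and actor names are invented.

```python
import hashlib
import json
from datetime import datetime, timezone

def _digest(record):
    # Hash every field except the hash itself, in a stable key order.
    payload = {k: v for k, v in record.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()

def provenance_entry(prev_hash, actor, action, detail):
    """Build one tamper-evident provenance record.

    Each record embeds the hash of the previous record, so altering any
    earlier entry invalidates every hash that follows it.
    """
    record = {
        "prev_hash": prev_hash,
        "actor": actor,
        "action": action,  # e.g. "created", "accessed", "published"
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    record["hash"] = _digest(record)
    return record

def verify_chain(entries):
    """An altered entry breaks its own hash; a reordered one breaks the links."""
    if any(e["hash"] != _digest(e) for e in entries):
        return False
    return all(c["prev_hash"] == p["hash"] for p, c in zip(entries, entries[1:]))
```

Verifying the chain catches both tampering with an entry and reordering of entries, which is the “unalterable” property the post asks for.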

What is it worth to you to use good open data rather than dirty open data?

Take the costs of your analytics projects for the past year and multiply them by eighty (80) percent. That is just an estimate, and the actual cost will vary from project to project, but did the result get your attention?
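For instance, using invented project costs and the 80 percent estimate quoted above:

```python
# Invented project costs; the 80 percent share is the Dasu and Johnson
# estimate quoted above, so treat the result as a rough ceiling.
project_costs = {"churn-model": 250_000, "spend-dashboard": 90_000}
dirty_data_share = 0.80

cleanup_cost = {name: cost * dirty_data_share for name, cost in project_costs.items()}
total = sum(cleanup_cost.values())  # roughly 272,000 spent just cleaning data
```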

If so, contact your sources for open data and lobby for clean open data.

PS: You may find the World Bank’s Open Data Readiness Assessment Tool useful.

Don’t Create A Data Governance Hairball

Wednesday, May 7th, 2014

Don’t Create A Data Governance Hairball by John Schmidt.

From the post:

Are you in one of those organizations that wants one version of the truth so badly that you have five of them? If so, you’re not alone. How does this happen? The same way the integration hairball happened; point solutions developed without a master plan in a culture of management by exception (that is, address opportunities as exceptions and deal with them as quickly as possible without consideration for broader enterprise needs). Developing a master plan to avoid a data governance hairball is a better approach – but there is a right way and a wrong way to do it.

As you probably can guess, I think John does a great job describing the “data governance hairball,” but not quite such high marks on avoiding the data governance hairball.

Not that I prefer some solution over John’s suggestions, but data governance hairballs are an essential characteristic of shared human knowledge. Human knowledge can, within some semantic locality, avoid the data governance hairball, but that is always an accidental property.

An “essential” property is a property a subject must have to be that subject. The semantic differences even within domains, to say nothing of between domains, make it clear that master data governance is only possible within a limited semantic locality. An “accidental” property is a property a subject may or may not have but it is still the same subject.

The essential vs. accidental property distinction is useful in data integration/governance. If we recognize that unbounded human knowledge is always subject to the data governance hairball description, then we can begin to look for John’s right level of “granularity.” That is, we can create an accidental property: within a particular corporate context we govern some data quite closely, while other data we don’t attempt to govern at all.

The difference between data we govern and data we don’t? What ROI can be derived from the data we govern.

If data has no ROI and doesn’t enable ROI from other data, why bother?

Are you governing data with no established ROI?
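The selective approach above, governing closely only where ROI justifies it, can be sketched as a simple triage. The datasets, ROI figures, threshold and the “enables” field are all invented for illustration.

```python
def triage(datasets, roi_threshold):
    """Split datasets into governed and ungoverned.

    Govern data with direct ROI at or above the threshold, or data that
    enables ROI from some other dataset; leave the rest ungoverned.
    """
    governed, ungoverned = [], []
    for d in datasets:
        if d["roi"] >= roi_threshold or d.get("enables"):
            governed.append(d["name"])
        else:
            ungoverned.append(d["name"])
    return governed, ungoverned

# Invented inventory: names, ROI figures and dependencies are made up.
datasets = [
    {"name": "customer-master", "roi": 500_000},
    {"name": "clickstream-raw", "roi": 0},
    {"name": "product-codes", "roi": 10_000},
    {"name": "reference-geo", "roi": 0, "enables": "customer-master"},
]
governed, ungoverned = triage(datasets, roi_threshold=50_000)
# governed: customer-master, reference-geo; ungoverned: the other two
```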

Ad for Topic Maps

Monday, February 17th, 2014

Imagine my surprise at finding an op-ed piece in Information Management flogging topic maps!

Karen Heath writes in: Is it Really Possible to Achieve a Single Version of Truth?:

There is a pervasive belief that a single version of truth – eliminating data siloes by consolidating all enterprise data in a consistent, non-redundant form – remains the technology-equivalent to the Holy Grail. And, the advent of big data is making it even harder to realize. However, even though SVOT is difficult or impossible to achieve today, beginning the journey is still a worthwhile business goal.

The road to SVOT is paved with very good intentions. SVOT has provided the major justification over the past 20 years for building enterprise data warehouses, and billions of dollars have been spent on relational databases, ETL tools and BI technologies. Millions of resource hours have been expended in construction and maintenance of these platforms, yet no organization is able to achieve SVOT on a sustained basis. Why? Because new data sources, either sanctioned or rogue, are continually being introduced, and existing data is subject to decay of quality over time. As much as 25 percent of customer demographic data, including name, address, contact info, and marital status changes every year. Also, today’s data is more dispersed and distributed and even “bigger” (volume, variety, velocity) than it has ever been.

Karen does a brief overview of why so many SVOT projects have failed (think lack of imagination and insight for starters) but then concludes:

As soon as MDM and DG are recognized as having equal standing with other programs in terms of funding and staffing, real progress can be made toward realization of a sustained SVOT. It takes enlightened management and a committed workforce to understand that successful MDM and DG programs are typically multi-year endeavors that require a significant commitment of people, processes and technology. MDM and DG are not something that organizations should undertake with a big-bang approach, assuming that there is a simple end to a single project. SVOT is no longer dependent on all data being consolidated into a single physical platform. With effective DG, a federated architecture and robust semantic layer can support a multi-layer, multi-location, multi-product organization that provides its business users the sustained SVOT. That is the reward. (emphasis added)

In case you aren’t “in the know,” DG – data governance, MDM – master data management, SVOT – single version of truth.

The bolded line about the “robust semantic layer” is obviously something topic maps can do quite well. But that’s not where I saw the topic map ad.

I saw the topic map ad being highlighted by:

As soon as MDM and DG are recognized as having equal standing with other programs in terms of funding and staffing

Because that’s never going to happen.

And why should it? GM, for example, has legendary data management issues, but its primary business, MDM and DG people to one side, is making and financing automobiles. It could divert enormous resources to obtain an across-the-board SVOT, but why?

Rather than an across-the-board SVOT, GM is going to want something more selective: an MVOT (My Version Of Truth) application, so it can be applied where it returns the greatest ROI for the investment.

With topic maps as “a federated architecture and robust semantic layer [to] support a multi-layer, multi-location, multi-product organization,” then accounting can have its MVOT, production its MVOT, shipping its MVOT, management its MVOT, regulators their MVOT.
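In topic map terms, one way to deliver each department’s MVOT is with scoped names: one subject, several names, each valid in a department’s scope. A minimal sketch, with an invented part and invented department scopes:

```python
# One subject, several department-scoped names: each department gets its
# MVOT without forking the underlying record. All values are invented.
subject = {
    "id": "part-8839",
    "names": [
        {"value": "Bracket, rear door", "scope": "engineering"},
        {"value": "SKU 88-39-B", "scope": "accounting"},
        {"value": "Item 8839 (pallet qty 40)", "scope": "shipping"},
    ],
}

def name_for(subject, scope):
    """Return the name valid in the given scope, falling back to the first."""
    for name in subject["names"]:
        if name["scope"] == scope:
            return name["value"]
    return subject["names"][0]["value"]
```

Accounting sees its SKU, shipping its pallet quantity, yet both names belong to the same subject, which is what lets the map present an SVOT when anyone asks for one.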

Given the choice between a Single Version Of Truth and your My Version Of Truth, which one would you choose?

That’s what I thought.

PS: Topic maps can also present an SVOT, just in case its advocates come around.

Data Governance needs Searchers, not Planners

Wednesday, March 6th, 2013

Data Governance needs Searchers, not Planners by Jim Harris.

From the post:

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance. A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible. Change management efforts are resisted when they impose new methods by emphasizing bad business and technical processes, as well as bad data-related employee behaviors, while ignoring unheralded processes and employees whose existing methods are preventing other problems from happening.

If you don’t remember any line from any post you read here or elsewhere, remember this one:

“…they rely on their own common sense to reason about the behavior of people who are different from them.”

Whenever you encounter a situation where that description fits, you will find failed projects, waste and bad morale.

Living with Imperfect Data

Wednesday, July 4th, 2012

Living with Imperfect Data by Jim Ericson.

From the post:

In a keynote at our MDM & Data Governance conference in Toronto a few days ago, an executive from a large analytical software company said something interesting that stuck with me. I am paraphrasing from memory, but it was very much to the effect of, “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

Let that sink in for a moment.

After I did, the very idea of this comment struck me at a few levels. It might have the same effect on you.

In one sense, admitting there is an acceptable level of shared inaccuracy is anathema to the way we like to describe data governance. It was especially so at an MDM-centric conference where people are pretty single-minded about what constitutes “truth.”

As a decision support philosophy, it wouldn’t fly at a health care conference.

I rather like that: “Sometimes it’s better to have everyone agreeing on numbers that aren’t entirely accurate than having everyone off doing their own numbers.”

I suspect because it is the opposite of how I really like to see data. I don’t want rough results in, say, a citation network, but rather all the relevant citations. Even if it isn’t possible to review them all, the map still needs to be complete.

But completeness is the enemy of results or at least published results. Sure, eventually, assuming a small enough data set, it is possible to map it in its entirety. But that means that whatever good would have come from it being available sooner, has been lost.

I don’t want to lose the sense of rough agreement posed here, because that is important as well. There are many cases where, despite the protests of the Fed and economists to the contrary, the numbers are almost fictional anyway. Pick some; they will be different soon enough. What counts is that we have agreed on numbers for planning purposes. We can always pick new ones.

The same is true of topic maps, perhaps even more so. They are a view into an infoverse, fixed at a moment in time by authoring decisions.

Don’t like the view? Create another one.

Breaking Silos – Carrot or Stick?

Thursday, June 7th, 2012

Alex Popescu, in Silos Are Built for a Reason quotes Greg Lowe saying:

In a typical large enterprise, there are competitions for resources and success, competing priorities and lots of irrelevant activities that are happening that can become distractions from accomplishing the goals of the teams.

Another reason silos are built has to do with affiliation. This is by choice, not by edict. By building groups around a shared set of goals, you effectively have an area of focus with a group of people interested in the same area and/or outcome.

There are many more reasons and impacts of why silos are built, but I simply wanted to establish that silos are built for a purpose with legitimate business needs in mind.

Alex then responds:

Legitimate? Maybe. Productive? I don’t really think so.

Greg’s original post is: Breaking down silos, what does that mean?

Greg asks about the benefits of breaking down silos:

  • Are the silos mandatory?
  • What would breaking down silos enable in the business?
  • What do silos do to your business today?
  • What incentive is there for these silos to go away?
  • Is your company prepared for transparency?
  • How will leadership deal with “Monday morning quarterbacks?”

As you can see, there are many benefits to silos as well as challenges. By developing a deeper understanding of the silos and why they get created, you can then have a better handle on whether the silos are beneficial or detrimental to the organization.

I would add to Greg’s question list:

  • Which stakeholders benefit from the silos?
  • What is that benefit?
  • Is there a carrot or stick that outweighs that benefit (in the view of the stakeholder)?
  • Do you have the political capital to take the stakeholders on and win?

If your answers are:

  • List of names
  • List of benefits
  • Yes, list of carrots/sticks
  • No

Then you are in good company.

Intelligence silos persist despite the United States being at war with identifiable terrorist groups.

A generalized benefit, or penalty for failure, isn’t a winning argument for breaking a data silo.

Specific benefits and penalties must matter to stakeholders. Then you have a chance to break a data silo.

Good luck!

Who Do You Say You Are?

Friday, May 11th, 2012

In Data Governance in Context, Jim Ericson outlines several paths of data governance, or as I put it: Who Do You Say You Are?:

On one path, more enterprises are dead serious about creating and using data they can trust and verify. It’s a simple equation. Data that isn’t properly owned and operated can’t be used for regulatory work, won’t be trusted to make significant business decisions and will never have the value organizations keep wanting to ascribe it on the balance sheet. We now know instinctively that with correct and thorough information, we can jump on opportunities, unite our understanding and steer the business better than before.

On a similar path, we embrace tested data in the marketplace (see Experian, D&B, etc.) that is trusted for a use case even if it does not conform to internal standards. Nothing wrong with that either.

And on yet another path (and areas between) it’s exploration and discovery of data that might engage huge general samples of data with imprecise value.

It’s clear that we cannot and won’t have the same governance standards for all the different data now facing an enterprise.

For starters, crowd sourced and third party data bring a new dimension, because “fitness for purpose” is by definition a relative term. You don’t need or want the same standard for how many thousands or millions of visitors used a website feature or clicked on a bundle in the way you maintain your customer or financial info.

Do mortgage-backed securities fall into the “…huge general samples of data with imprecise value?” I ask because I don’t work in the financial industry. Or do they not practice data governance, except to generate numbers for the auditors?

I mention this because I suspect that subject identity governance would be equally useful for topic map authoring.

Some topic maps, say on drug trials, need a high degree of reliability and auditability, as well as precise identification (even if double-blind) of the test subjects.

Or there may be different tests for subject identity, some of which appear to be less precise than others.

For example, merging all the topics entered by a particular operator in a day to look for patterns that may indicate they are not following data entry protocols. (It is hard to be as random as real data.)
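A crude version of that check might group a day’s entries by operator and flag anyone whose values repeat far more than real data tends to. Everything here, data and threshold alike, is invented for illustration.

```python
from collections import Counter

def repeat_ratio(values):
    """Fraction of entries that duplicate an earlier entry in the list.

    Fabricated entries tend to repeat more than real data does.
    """
    counts = Counter(values)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(values) if values else 0.0

def flag_operators(entries_by_operator, threshold=0.5):
    """Return operators whose repeat ratio is suspiciously high."""
    return sorted(
        op for op, values in entries_by_operator.items()
        if repeat_ratio(values) > threshold
    )
```

A real protocol audit would use a stronger statistical test, but the shape of the merge-then-inspect step is the same.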

As with most issues, there isn’t any hard and fast rule that works for all cases. You do need to document the rules you are following and for how long. It will help you test old rules and to formulate new ones. (“Document” meaning to write down. The vagaries of memory are insufficient.)

Data Governance Next Practices: The 5 + 2 Model

Tuesday, January 17th, 2012

Data Governance Next Practices: The 5 + 2 Model by Jill Dyché.

From the post:

If you’re a regular reader of this newsletter or one of my blogs, odds are I’ve already ripped the doors off of some of your closely held paradigms about data governance. I’d like to flatter myself and say that this is because I enjoy being provocative and I’m a little sassy. Though both of these statements are factual, indeed empirically tested, the real reason is because I’m with clients all the time and I see what does and doesn’t work. And one thing I’ve learned from overseeing dozens of client engagements is this: There’s no single right way to deliver data governance.

Companies that have succeeded with data governance have deliberately designed their data governance efforts. They’ve assembled a core working group, normally comprised of half a dozen or so people from both business and IT functions, who have taken the time to envision what data governance will look like before deploying it. These core teams then identify the components, putting them into place like a Georges Seurat painting, the small pieces comprising the larger landscape.

Well, I don’t have any real loyalty to one particular approach to data governance over another. And I am very much interested in data governance approaches that succeed as opposed to those that don’t.

What Jill says makes sense, at least to me but I do have one question that perhaps one of you can answer. (I am asking at her blog as well and will report back with her response.)

In the illustrations there is a circle that surrounds the entire process, labeled “Continuous Measurement.” It appears twice. I searched the text for some explanation of “continuous measurement,” or even “measurement,” and came up empty.

So, “continuous measurement” of what?

I ask because if I were using this process with a topic map, I would be interested in measuring how closely the mapping of subject identity was meeting the needs of the various groups.

Particularly since semantics change over time, some more quickly than others. That is, the data governance project would never be completed, although it might be more or less active depending upon the rate of semantic change.
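One concrete “continuous measurement” for the subject identity case would be a resolution rate: what fraction of the identifiers users look up actually resolve to a mapped subject, tracked per period. The mapping and lookups below are invented.

```python
def resolution_rate(lookups, mapping):
    """Fraction of requested identifiers that resolve to a known subject."""
    if not lookups:
        return 1.0
    hits = sum(1 for identifier in lookups if identifier in mapping)
    return hits / len(lookups)

# Invented identifier mapping and one month of invented lookups.
mapping = {"hdp": "subject-1", "hadoop": "subject-1", "mdm": "subject-2"}
january = ["hdp", "mdm", "svot", "hadoop"]
rate = resolution_rate(january, mapping)  # 0.75, since "svot" is unmapped
```

A falling rate over successive periods would signal exactly the semantic drift described above: the mapping no longer matches how the groups identify their subjects.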

I am sure there are other aspects of data governance besides semantic identity, but it happens to be the one of greatest interest to me.

Talend 5

Monday, December 5th, 2011

Talend 5

Talend 5 consists of:

  • Talend Open Studio for Data Integration (formerly Talend Open Studio), the most widely used open source data integration/ETL tool in the world.
  • Talend Open Studio for Data Quality (formerly Talend Open Profiler), the only open source enterprise data profiling tool.
  • Talend Open Studio for MDM (formerly Talend MDM Community Edition), the first – and only – open source MDM solution.
  • Talend Open Studio for ESB (formerly Talend ESB Studio Standard Edition), the easy to use open source ESB based on leading Apache Software Foundation integration projects.

From BusinessWire article.

Massive file downloads running now.

Are you using Talend? Thoughts/suggestions on testing/comparisons?


DataCleaner

Monday, October 3rd, 2011

DataCleaner
From the website:

DataCleaner is an Open Source application for analyzing, profiling, transforming and cleansing data. These activities help you administer and monitor your data quality. High quality data is key to making data useful and applicable to any modern business.

DataCleaner is the free alternative to software for master data management (MDM) methodologies, data warehousing (DW) projects, statistical research, preparation for extract-transform-load (ETL) activities and more.

Err, “…cleansing data”? Did someone just call topic maps’ name? 😉

If it is important to eliminate duplicate data, then everyone using the duplicated data needs its updates and relationships, unless the duplication was merely the result of poor design or wasted drive space.
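As a toy illustration of the duplicate problem (not DataCleaner’s actual algorithm, and the records are invented): normalize each record into a comparison key and group the collisions.

```python
import re

def normalize(name):
    """Collapse case, punctuation and runs of whitespace into one key."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def find_duplicates(records):
    """Group records whose normalized keys collide."""
    groups = {}
    for record in records:
        groups.setdefault(normalize(record), []).append(record)
    return [group for group in groups.values() if len(group) > 1]

dupes = find_duplicates(["ACME Corp.", "Acme  Corp", "Widgets Ltd"])
# one group: ["ACME Corp.", "Acme  Corp"]
```

Each collision group is, in topic map terms, a merge candidate: several representatives of what may be one subject.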

This looks like an interesting project and certainly one where topic maps are clearly relevant as one possible output.

Data Integration: Moving Beyond ETL

Wednesday, March 16th, 2011

Data Integration: Moving Beyond ETL

A sponsored white paper by DataFlux.

Where ETL = Extract Transform Load
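For readers new to the term, a minimal ETL pass might look like the following sketch. The source rows, cleaning rules and target table are all invented.

```python
import sqlite3

def extract():
    # Invented source rows; in practice this reads a file, database or API.
    return [("  Alice ", "1000"), ("Bob", "2500"), ("  Alice ", "1000")]

def transform(rows):
    # Trim whitespace, cast types and drop exact duplicates.
    seen, clean = set(), []
    for name, amount in rows:
        row = (name.strip(), int(amount))
        if row not in seen:
            seen.add(row)
            clean.append(row)
    return clean

def load(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE spend (name TEXT, amount INTEGER)")
    db.executemany("INSERT INTO spend VALUES (?, ?)", rows)
    return db

db = load(transform(extract()))
total = db.execute("SELECT SUM(amount) FROM spend").fetchone()[0]  # 3500
```

Note that the transform step silently discards the semantics of what it drops and reshapes, which is exactly where the paper’s “beyond ETL” argument, and topic maps, come in.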

Many of the arguments made in this paper fit quite easily with topic map solutions.

DataFlux appears to be selling data governance based solutions, although it appears to take an evolutionary approach to implementing such solutions.

It occurs to me that topic maps could be one stage in the documentation and evolution of data governance solutions.

High marks for a white paper that doesn’t claim IT salvation from a particular approach.

Data Governance, Data Architecture and Metadata Essentials – Webinar

Wednesday, February 2nd, 2011

Data Governance, Data Architecture and Metadata Essentials

Date: February 24, 2011 Time: 9:00AM PT

Speaker: David Loshin

From the website:

The absence of data governance standards is a critical failure point for enterprise data repurposing. As the rate of data volume growth increases, you want to make sure you are employing the correct practices and standards to make the most of this volume of information. Data can be your company’s best or worst asset. Join David Loshin, industry expert on data governance, for this informative webcast.

I suppose it goes without saying that an absence of data governance means that a topic map effort to use outside data is going to be even more expensive. Or perhaps not.

People have been urging documentation of data practices since before the advent of the digital computer. That is still the starting point for any data governance.

What you don’t know about, you can’t govern. It’s just that simple. (You can’t merge it with outside data either. But if your internal systems are toast, topic maps aren’t going to save you.)