Archive for the ‘Data Virtualization’ Category

Slaying Data Silos?

Tuesday, July 1st, 2014

Krishnan Subramanian’s Modern Enterprise: Slaying the Silos with Data Virtualization keeps coming up in my Twitter feed.

In speaking of breaking down data silos, Krishnan says:

A much better approach to solving this problem is abstraction through data virtualization. It is a powerful tool, well suited for the loose coupling approach prescribed by the Modern Enterprise Model. Data virtualization helps applications retrieve and manipulate data without needing to know technical details about each data store. When implemented, organizational data can be easily accessed using a simple REST API.

Data Virtualization (or an abstracted Database as a Service) plugs into the Modern Enterprise Platform as a higher-order layer, offering the following advantages:

  • Better business decisions due to organization wide accessibility of all data
  • Higher organizational agility
  • Loosely coupled services making future proofing easier
  • Lower cost

I find that troubling because there is no mention of data integration.

In fact, in more balanced coverage of data virtualization, which recites the same advantages as Krishnan, we read:

For some reason there are those who sell virtualization software and cloud computing enablement platforms who imply that data integration is something that comes along for the ride. However, nothing gets less complex and data integration still needs to occur between the virtualized data stores as if they existed on their own machines. They are still storing data in different physical data structures, and the data must be moved or copied, and the difference with the physical data structures dealt with, as well as data quality, data integrity, data validation, data cleaning, etc. (The Pros and Cons of Data Virtualization)
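A minimal sketch of the point in that passage, in Python. The two stores, their field names and the units are all invented for illustration; nothing here comes from a real product. Uniform access returns every record, but the records only become comparable after an explicit integration step that someone had to write.

```python
# Hypothetical records from two "virtualized" stores.
# Field names and units are assumptions made up for this sketch.
crm_store = [{"cust_id": "17", "revenue_usd": 1500.0}]
erp_store = [{"CustomerNo": 17, "Revenue": 1.5, "RevenueUnit": "kUSD"}]

def fetch_all(*stores):
    """Uniform access: returns every record, but does nothing about meaning."""
    return [record for store in stores for record in store]

def integrate(record):
    """Explicit integration step: map each source's structure onto one shape."""
    if "cust_id" in record:
        return {"customer": int(record["cust_id"]),
                "revenue_usd": record["revenue_usd"]}
    scale = {"USD": 1.0, "kUSD": 1000.0}[record["RevenueUnit"]]
    return {"customer": record["CustomerNo"],
            "revenue_usd": record["Revenue"] * scale}

raw = fetch_all(crm_store, erp_store)      # same access path, different shapes
unified = [integrate(r) for r in raw]      # same shape only after integration
```

The `integrate` function is where the differing keys, types and units get reconciled, and that knowledge does not come along for the ride with the access layer.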

Krishnan begins his post:

There’s a belief that cloud computing breaks down silos inside enterprises. Yes, the use of cloud and DevOps breaks down organizational silos between different teams but it only solves part of the problem. The bigger problem is silos between data sources. Data silos, as I would like to refer the problem, is the biggest bottlenecks enterprises face as they try to modernize their IT infrastructure. As I advocate the Modern Enterprise Model, many people ask me what problems they’ll face if they embrace it. Today I’ll do a quick post to address this question at a more conceptual level, without getting into the details.

If data silos are the biggest bottleneck enterprises face, why is the means to address that, data integration, a detail?

Every hand-waving approach to data integration fuels unrealistic expectations, even among people who should know better.

There are no free lunches and there are no free avenues for data integration.

Pfizer swaps out ETL for data virtualization tools

Thursday, February 21st, 2013

Pfizer swaps out ETL for data virtualization tools by Nicole Laskowski.

From the post:

Pfizer Inc.’s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.

At the time, the pharmaceutical company was using one of the most common integration practices known as extract, transform, load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.

But that’s not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company’s Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.

“Back then, we were basically kind of in this data warehousing information factory mode,” said Michael Linhares, a research fellow and the BIS team leader.

Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.

Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn’t outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.

“That’s just a sign that something takes too long,” he said.

Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field and introduce a new data source. Instead, they would have to build another ETL for that data source to be added to the data mart.

Bear in mind that we were just reminded, in Leveraging Ontologies for Better Data Integration, that you have to understand data to integrate data.

That lesson holds true for integrating data after data virtualization.

Where are you going to write down your understanding of the meaning of the data you virtualize?

So subsequent users can benefit from your understanding of that data?

Or perhaps add their understanding to yours?

Or to have the capacity to merge collections of such understandings?

I would say a topic map.


Merging Data Virtualization?

Thursday, January 3rd, 2013

I saw some ad-copy from a company that “wrote the book” on data virtualization (well, “a” book on data virtualization anyway).

I searched a bit in their documentation and elsewhere, but could not find an answer to my questions (below).

Assume departments 1 and 2, each with a data virtualization layer between their apps and the same backend resources:

[Figure: Data Virtualization, Two Separate Layers]

Requirement: Don’t maintain two separate data virtualization layers for the same resources.

Desired result is:

[Figure: Data Virtualization, One Layer]

Questions: Must I return to the data resources to discover their semantics? To merge the two data virtualization layers?

Some may object there should only be one data virtualization layer.

OK, so we have Department 1 – circa 2013 and Department 1 – circa 2015, different data virtualization requirements:

[Figure: Data Virtualization, Future Layer]

Desired result:

[Figure: Data Virtualization, Future One Layer]

Same Questions:

Question: Must I return to the data resources to discover their semantics? To merge the existing and proposed data virtualization layers?

The semantics of each item in the data sources (one hopes) was determined for the original data virtualization layer.

It’s wasteful to re-discover the same semantics for changes in data virtualization layers.

I am curious: how is rediscovery of semantics avoided in data virtualization software?

Or for that matter, how do you interchange data virtualization layer mappings?
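To make the questions concrete, here is a rough Python sketch of what an interchangeable mapping could look like. It is not any vendor's format and not a topic map syntax; `Mapping`, `merge_layers` and the `subject:` identifiers are hypothetical names invented for this example. The idea is only that each layer records what it decided a source field means, so two layers can be merged by comparing those records instead of going back to the data resources.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    source_field: str   # field in a backend resource, e.g. "orders.cust_no"
    subject: str        # identifier for what the field was judged to mean
    note: str = ""      # who recorded the judgment, and when

def merge_layers(layer_a, layer_b):
    """Merge two layers' mappings, keyed by source field.

    Fields mapped to the same subject merge silently. Fields the two
    layers disagree about are returned as conflicts for a person to resolve.
    """
    merged, conflicts = {}, []
    for mapping in list(layer_a) + list(layer_b):
        seen = merged.get(mapping.source_field)
        if seen is None:
            merged[mapping.source_field] = mapping
        elif seen.subject != mapping.subject:
            conflicts.append((seen, mapping))
    return merged, conflicts

# Hypothetical mappings recorded by two departments over the same backend.
dept1 = [Mapping("orders.cust_no", "subject:customer", "Dept 1, 2013"),
         Mapping("orders.amt", "subject:order-total-usd", "Dept 1, 2013")]
dept2 = [Mapping("orders.cust_no", "subject:customer", "Dept 2, 2013"),
         Mapping("orders.amt", "subject:order-total-eur", "Dept 2, 2013")]

merged, conflicts = merge_layers(dept1, dept2)
```

In this example only `orders.amt` needs a second look, because the two departments recorded different meanings for it. Everything else merges without rediscovery.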

Data Virtualization

Tuesday, April 24th, 2012

David Loshin has a series of excellent posts on data virtualization:

Fundamental Challenges in Data Reusability and Repurposing (Part 1 of 3)

Simplistic Approaches to Data Federation Solve (Only) Part of the Puzzle – We Need Data Virtualization (Part 2 of 3)

Key Characteristics of a Data Virtualization Solution (Part 3 of 3)

In part 3, David concludes:

In other words, to truly provision high quality and consistent data with minimized latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:

  • Access methods for a broad set of data sources, both persistent and streaming
  • Early involvement of the business user to create virtual views without help from IT
  • Software caching to enable rapid access in real time
  • Consistent views into the underlying sources
  • Query optimizations to retain high performance
  • Visibility into the enterprise metadata and data architectures
  • Views into shared reference data
  • Accessibility of shared business rules associated with data quality
  • Integrated data profiling for data validation
  • Integrated application of advanced data transformation rules that ensure consistency and accuracy

What differentiates a comprehensive data virtualization framework from simplistic layering of access and caching services via data federation is that the comprehensive data virtualization solution goes beyond just data federation. It is not only about heterogeneity and latency, but must incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, you need to have aspects of the meaning and differentiation between use of the data engineered directly into the implementation. And most importantly, also make sure the business user signs-off on the data that is being virtualized for consumption. (emphasis added)

David makes explicit a number of issues, such as integration architectures needing to peer into enterprise metadata and data structures, making it plain that not only data but also the ways we contain and store data have semantics.

I would add: Consistency and accuracy should be checked on a regular basis with specified parameters for acceptable correctness.
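As a sketch of what such a check might look like, assuming a virtual view can be compared row by row against its source on a shared key: the function name `check_view`, the sample rows and the 1% threshold are illustrative assumptions, not anything from David's posts.

```python
def check_view(source_rows, view_rows, key, field, max_mismatch_rate=0.01):
    """Compare a virtual view against its source on one field.

    Returns (passed, mismatch_rate). max_mismatch_rate is the specified
    parameter for acceptable correctness; 0.01 is an arbitrary example.
    """
    source = {row[key]: row[field] for row in source_rows}
    if not source:
        return True, 0.0
    view = {row[key]: row[field] for row in view_rows}
    # A key missing from the view counts as a mismatch.
    mismatches = sum(1 for k, v in source.items() if view.get(k) != v)
    rate = mismatches / len(source)
    return rate <= max_mismatch_rate, rate

# Hypothetical sample data: the view disagrees with the source on id 2.
source_rows = [{"id": 1, "dose_mg": 50}, {"id": 2, "dose_mg": 75}]
view_rows = [{"id": 1, "dose_mg": 50}, {"id": 2, "dose_mg": 70}]

passed, rate = check_view(source_rows, view_rows, key="id", field="dose_mg")
```

Run on a schedule, a check like this turns "acceptable correctness" into a number someone has to choose and defend.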

The heterogeneous data sources that David speaks of are ever changing, both in form and semantics. If you need proof of that, consider the history of ETL at your company. If either form or semantics were stable, that would be a once-or-twice-in-a-career event. I think we all know that is not the case.

Topic maps can disclose the data and rules for the virtualization decisions that David enumerates, which has the potential to make those decisions themselves auditable and reusable.

Reuse is an advantage in a constantly changing and heterogeneous semantic environment. Semantics seen once are very likely to be seen again. (Patterns, anyone?)