David Loshin has a series of excellent posts on data virtualization.
In part 3, David concludes:
In other words, to truly provision high quality and consistent data with minimized latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:
- Access methods for a broad set of data sources, both persistent and streaming
- Early involvement of the business user to create virtual views without help from IT
- Software caching to enable rapid access in real time
- Consistent views into the underlying sources
- Query optimizations to retain high performance
- Visibility into the enterprise metadata and data architectures
- Views into shared reference data
- Accessibility of shared business rules associated with data quality
- Integrated data profiling for data validation
- Integrated application of advanced data transformation rules that ensure consistency and accuracy
What differentiates a comprehensive data virtualization framework from simplistic layering of access and caching services via data federation is that the comprehensive data virtualization solution goes beyond just data federation. It is not only about heterogeneity and latency, but must incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, you need to have aspects of the meaning and differentiation between use of the data engineered directly into the implementation. And most importantly, also make sure the business user signs-off on the data that is being virtualized for consumption. (emphasis added)
David makes explicit a number of issues, such as integration architectures needing to peer into enterprise metadata and data structures, making it plain that not only data but also the ways we contain and store data have semantics.
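To make that concrete, here is a minimal Python sketch of a virtual view that goes beyond federation by applying a shared business rule inside the virtualization layer. Every field name, code set, and rule below is invented for illustration and is not drawn from David's posts.

```python
# Hypothetical illustration: two source systems report customer status
# with different field names and different code sets -- the containers
# themselves carry semantics.
CRM_RECORD = {"cust_id": "C-100", "status_cd": "A"}        # "A" = active
BILLING_RECORD = {"customer": "C-100", "state": "OPEN"}    # "OPEN" = active

# Shared business rule, maintained once and applied inside the
# virtualization layer rather than in each consuming application.
STATUS_RULE = {
    "crm": {"A": "active", "I": "inactive"},
    "billing": {"OPEN": "active", "CLOSED": "inactive"},
}

def virtual_customer_view(crm, billing):
    """Federate the two records AND normalize their semantics."""
    return {
        "customer_id": crm["cust_id"],
        "status": STATUS_RULE["crm"][crm["status_cd"]],
        "billing_status": STATUS_RULE["billing"][billing["state"]],
    }

print(virtual_customer_view(CRM_RECORD, BILLING_RECORD))
# {'customer_id': 'C-100', 'status': 'active', 'billing_status': 'active'}
```

Simple federation would have returned "A" and "OPEN" side by side and left each consumer to reconcile them; engineering the rule into the view is what keeps the semantics consistent for the business.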
I would add: consistency and accuracy should be checked on a regular basis, against specified parameters that define acceptable correctness.
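A rough sketch of what such a recurring check might look like, assuming the virtualized view is periodically compared to a system of record. The field names and the tolerance value are illustrative assumptions, not a prescription.

```python
# Hypothetical recurring check: compare the virtualized view against the
# system of record and flag the view if mismatches exceed an agreed tolerance.
ACCEPTABLE_MISMATCH_RATE = 0.01  # illustrative parameter for "acceptable correctness"

def consistency_check(virtual_rows, source_rows, key="customer_id", field="status"):
    """Return (passed, mismatch_rate) for one field of the virtual view."""
    source_by_key = {r[key]: r[field] for r in source_rows}
    mismatches = sum(
        1 for r in virtual_rows if source_by_key.get(r[key]) != r[field]
    )
    rate = mismatches / len(virtual_rows) if virtual_rows else 0.0
    return rate <= ACCEPTABLE_MISMATCH_RATE, rate

ok, rate = consistency_check(
    [{"customer_id": "C-100", "status": "active"}],
    [{"customer_id": "C-100", "status": "active"}],
)
print(ok, rate)  # True 0.0
```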
The heterogeneous data sources that David speaks of are ever changing, both in form and semantics. If you need proof of that, consider the history of ETL at your company. If either form or semantics were stable, changing your ETL would be a once-or-twice-in-a-career event. I think we all know that is not the case.
Topic maps can disclose the data and rules behind the virtualization decisions that David enumerates, which has the potential to make those decisions themselves auditable and reusable.
Reuse is an advantage in a constantly changing and heterogeneous semantic environment. Semantics seen once are very likely to be seen again. (Patterns, anyone?)
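A toy sketch of what disclosing one virtualization decision might look like. The subjects, identifiers, and sign-off below are invented for illustration, and this is a topic-map-flavored data structure in plain Python, not any particular topic map API.

```python
# Hypothetical record of a single virtualization decision: the subject
# (the virtual "status" field), the source fields it merges, the rule
# applied, and the business sign-off -- enough to audit the decision or
# reuse it the next time the same semantics turn up.
decision = {
    "subject": "virtual-view/customer/status",
    "merges": ["crm.status_cd", "billing.state"],
    "rule": "map codes {A, OPEN} -> active; {I, CLOSED} -> inactive",
    "signed_off_by": "business-owner@example.com",
    "valid_from": "2011-01-01",
}

def reusable_for(decision, source_field):
    """Check whether a prior decision already covers a source field, so the
    mapping can be reused rather than re-invented in a new integration."""
    return source_field in decision["merges"]

print(reusable_for(decision, "billing.state"))  # True
```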