Pfizer swaps out ETL for data virtualization tools, by Nicole Laskowski.
From the post:
Pfizer Inc.’s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.
At the time, the pharmaceutical company was using one of the most common integration practices, known as extract, transform, load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.
But that’s not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company’s Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.
“Back then, we were basically kind of in this data warehousing information factory mode,” said Michael Linhares, a research fellow and the BIS team leader.
Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.
Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn’t outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.
“That’s just a sign that something takes too long,” he said.
Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field or introduce a new data source. Instead, they would have to build another ETL process to add that data source to the data mart.
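To make the contrast in the excerpt concrete, here is a minimal sketch of the two approaches, with SQLite standing in for both the source system and the data mart. The table names, values and transform are hypothetical, and real data virtualization products federate many heterogeneous sources rather than defining a view inside one database, but the core difference is the same: physically copying data versus computing the integrated result at query time.

```python
import sqlite3

# Hypothetical source database standing in for a research system.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE assays (compound_id TEXT, result REAL);
    INSERT INTO assays VALUES ('PF-001', 0.82), ('PF-002', 0.35);
""")

# ETL style: extract from the source, transform, load into a data mart.
# The rows are physically copied; adding a field or a new source later
# means building and re-running another ETL job against the mart.
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE mart_assays (compound_id TEXT, pct REAL)")
extracted = src.execute("SELECT compound_id, result FROM assays")
mart.executemany(
    "INSERT INTO mart_assays VALUES (?, ?)",
    ((cid, round(r * 100, 1)) for cid, r in extracted),  # the transform
)

# Virtualization style: a view over the source, evaluated on demand.
# Nothing is copied; the integrated result is computed at query time,
# so a new field is an edit to one query, not a new pipeline.
src.execute("""
    CREATE VIEW assays_pct AS
    SELECT compound_id, ROUND(result * 100, 1) AS pct FROM assays
""")

print(mart.execute("SELECT * FROM mart_assays").fetchall())
print(src.execute("SELECT * FROM assays_pct").fetchall())
```

Note what the second approach does not buy you: the view is just as silent as the ETL copy about what "result" actually means. Which brings us to the point below.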
Bear in mind that we were just reminded, in Leveraging Ontologies for Better Data Integration, that you have to understand data in order to integrate it.
That lesson holds true for integrating data after data virtualization.
Where are you going to write down your understanding of the meaning of the data you virtualize?
So subsequent users can benefit from your understanding of that data?
Or perhaps add their understanding to yours?
Or merge collections of such understandings?
I would say a topic map.
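For concreteness, here is a deliberately toy sketch of that answer: topics keyed by a subject identifier, carrying names and notes, with a merge that keeps every contributor's understanding. Everything in it, the identifiers, the field, the notes, is made up for illustration; a real topic map would follow the Topic Maps data model rather than a pair of Python dicts.

```python
# Toy, topic-map-flavored record of what a virtualized field means.
# Subject identifiers, names, and notes below are illustrative only.

def topic(subject_id, names, notes):
    """One subject: an identifier plus the names and notes attached to it."""
    return {"id": subject_id, "names": set(names), "notes": set(notes)}

def merge(*topic_maps):
    """Merge topics that share a subject identifier, keeping every
    contributor's names and notes, so understandings accumulate."""
    merged = {}
    for tm in topic_maps:
        for t in tm:
            tgt = merged.setdefault(t["id"], topic(t["id"], (), ()))
            tgt["names"] |= t["names"]
            tgt["notes"] |= t["notes"]
    return list(merged.values())

# My understanding of the virtualized field assays.result ...
mine = [topic("http://example.com/subject/assay-result",
              {"result"},
              {"fraction in [0, 1]; multiply by 100 for percent"})]

# ... and a later user's, merged on the shared subject identifier.
theirs = [topic("http://example.com/subject/assay-result",
                {"assay outcome"},
                {"originally kept in an Excel sheet"})]

print(merge(mine, theirs))
```

The point is not the code but the contract: so long as two users record their understanding against the same subject identifier, those understandings can be combined mechanically instead of being rediscovered request by request.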
You?