Archive for the ‘Data Warehouse’ Category

Data Warehousing and Big Data Papers by Peter Bailis

Thursday, March 7th, 2013

Quick and Dirty (Incomplete) List of Interesting, Mostly Recent Data Warehousing and Big Data Papers by Peter Bailis

Alex Popescu reports some twenty-seven (27) papers and links gathered by Peter Bailis on Data Warehousing and Big Data!


Pfizer swaps out ETL for data virtualization tools

Thursday, February 21st, 2013

Pfizer swaps out ETL for data virtualization tools by Nicole Laskowski.

From the post:

Pfizer Inc.’s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.

At the time, the pharmaceutical company was using one of the most common integration practices known as extract, transform, load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.
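The extract, transform, load pattern described above is easy to sketch. A minimal example (the table names and schemas here are hypothetical, not Pfizer's actual systems):

```python
import sqlite3

def etl(source_db: str, mart_db: str) -> int:
    """Minimal ETL: extract rows from a source, transform them,
    and load the result into a data mart table."""
    src = sqlite3.connect(source_db)
    mart = sqlite3.connect(mart_db)
    mart.execute("CREATE TABLE IF NOT EXISTS mart_sales (region TEXT, total REAL)")
    # Extract: copy the requested data set out of the source.
    rows = src.execute("SELECT region, amount FROM sales").fetchall()
    # Transform: aggregate amounts per region.
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    # Load: physically move the transformed copy into the mart.
    mart.executemany("INSERT INTO mart_sales VALUES (?, ?)", totals.items())
    mart.commit()
    return len(totals)
```

Note that the data is physically copied and reshaped in transit, which is exactly why the article's complaints about staleness and repetition arise: the mart holds a transformed snapshot, not the source itself.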

But that’s not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company’s Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.

“Back then, we were basically kind of in this data warehousing information factory mode,” said Michael Linhares, a research fellow and the BIS team leader.

Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.

Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn’t outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.

“That’s just a sign that something takes too long,” he said.

Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field and introduce a new data source. Instead, they would have to build another ETL for that data source to be added to the data mart.

Bear in mind that we were just reminded, in Leveraging Ontologies for Better Data Integration, that you have to understand data in order to integrate data.

That lesson holds true for integrating data after data virtualization.

Where are you going to write down your understanding of the meaning of the data you virtualize?

So subsequent users can benefit from your understanding of that data?

Or perhaps add their understanding to yours?

Or to have the capacity to merge collections of such understandings?

I would say a topic map.


Amazon Web Services Announces Amazon Redshift

Saturday, February 16th, 2013

Amazon Web Services Announces Amazon Redshift

From the post:

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re:Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resources to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.
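Columnar storage, one of the techniques the announcement mentions, can be illustrated simply: storing values column by column lets an aggregate read only the column it needs, and repeated values compress well. A toy sketch, not Redshift's actual format:

```python
def to_columns(rows: list[dict]) -> dict:
    """Pivot row-oriented records into column-oriented arrays."""
    return {key: [row[key] for row in rows] for key in rows[0]}

def run_length_encode(column: list) -> list[tuple]:
    """Compress a low-cardinality column as (value, count) runs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs
```

With rows pivoted this way, `sum(columns["amount"])` scans one contiguous array instead of touching every field of every record, which is the core advantage for analytics workloads.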

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Amazon Web Services

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well-known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. 😉

Sigh, semantic confusion continues unabated.

Big Data Reference Model (And Panopticons)

Tuesday, April 10th, 2012

Big Data Reference Model

Michael Nygard writes:

A project that approaches Big Data as a purely technical challenge will not deliver results. It is about more than just massive Hadoop clusters and number-crunching. In order to deliver value, a Big Data project has to enable change and adaptation. This requires that there are known problems to be solved. Yet, identifying the problem can be the hardest part. It’s often the case that you have to collect some information to even discover what problem to solve. Deciding how to solve that problem creates a need for more information and analysis. This is an empirical discovery loop similar to that found in any research project or Six Sigma initiative.

Michael takes you on a sensible loop of discover and evaluation, making you more likely (no guarantees) to succeed with your next “big data” project. In particular see the following caution:

… it is tempting to think that we could build a complete panopticon: a universal data warehouse with everything in the company. This is an expensive endeavor, and not a historically successful path. Whether structured or unstructured, any data store is suited to answer some questions but not others. No matter how much you invest in building the panopticon, there will be dimensions you don’t think to support. It is better to skip the massive up-front time and expense, focusing instead on making it very fast and easy to add new data sources or new elements to existing sources.

I like the term panopticon. In part because of its historical association with prisons.

Data warehouses/structures are prisons and suited better for one purpose (or group of purposes) than another.

We must build prisons for today and leave tomorrow’s prisons for tomorrow.

The problem that topic maps try to address is how to safely transfer prisoners from today’s prisons to tomorrow’s. Which is made more complicated by some people still using old prisons, sometimes generations of prisons older than most people use. Not to mention the variety of prisons across businesses, governments, nationalities.

All of them have legitimate purposes and serve some purpose now, else their users would have migrated their prisoners to a new prison.

I will have to think about the prison metaphor. I think it works fairly well.


Hadapt is moving forward

Friday, November 25th, 2011

Hadapt is moving forward

A bullet-point type review, mostly a summary of information from the vendor. Not a bad thing; it can be useful. But you would think that when reviewing a vendor or their product, there would be a link to the vendor/product. Yes? None that I can find in that post.

Let me make it easy for you: How hard was that? Maybe 10 seconds of my time, and that is because I have gotten slow. The point of the WWW, at least as I understand it, is to make information more accessible to users. But it doesn’t happen by itself. Put in hyperlinks where appropriate.

There is a datasheet on the Adaptive Analytic Platform™.

You can follow the link for the technical report and register, but it is little more than a sales brochure.

More informative is: Efficient Processing of Data Warehousing Queries in a Split Execution Environment.

I don’t have a local setup that would exercise Hadapt. If you do or if you are using it in the cloud, would appreciate any comments or pointers you have.

Tiny Trilogy

Wednesday, November 9th, 2011

Tiny Trilogy

Peter Thomas writes:

Although was a pioneer in URL shortening, it seems to have been overtaken by a host of competing services. For example I tend to use most of the time. However I still rather like the option to create your own bespoke shortened URLs.

This feature rather came into its own recently when I was looking for a concise way to share my recent trilogy focusing on the use of historical data to justify BI/DW investments in Insurance.

Good series of posts on historical data and business intelligence. I suspect many of these lessons could be applied fairly directly to using historical data to justify semantic integration projects.

Such as showing what sharing would have meant as far as information on terrorists prior to 9/11.

Our big data/total data survey is now live [the 451 Group]

Monday, October 3rd, 2011

Our big data/total data survey is now live [the 451 Group]

The post reads in part:

The 451 Group is conducting a survey into end user attitudes towards the potential benefits of ‘big data’ and new and emerging data management technologies.

In return for your participation, you will receive a copy of a forthcoming long-format report introducing Total Data, The 451 Group’s concept for explaining the changing data management landscape, which will include the results. Respondents will also have the opportunity to become members of TheInfoPro’s peer network.

Just a word about the survey.

Question 10 reads:

What is the primary platform used for storing and querying from each of the following types of data?

Good question, but you have to choose one of the answers (or select “other” and say what “other” means) for each type; you are not allowed to skip any type of data.

Data types are:

  • Customer Data
  • Transactional Data
  • Online Transaction Data
  • Domain-specific Application Data (e.g., Trade Data in Financial Services, and Call Data in Telecoms)
  • Application Log Data
  • Web Log Data
  • Network Log Data
  • Other Log Files
  • Social Media/Online Data
  • Search Log
  • Audio/Video/Graphics
  • Other Documents/Content

Same thing happens for Question 11:

What is the primary platform used for each of the following analytics workloads?

Eleven required answers that I won’t bother to repeat here.

As a consultant I really don’t have serious iron/data on the premises, but that doesn’t seem to have occurred to the survey designers. Nor has the possibility that even a major IT installation might not have all forms of data or analytics.

My solution? I just marked Hadoop on Questions 10 and 11 so I could get to the rest of the survey.

Q12. Which are the top three benefits associated with each of the following data management technologies?

Q13. Which are the top three challenges associated with each of the following data management technologies?

Q14. To what extent do you agree with the following statements? (Which includes: “The enterprise data warehouse is the single version of the truth for business intelligence.”)

Questions 12 – 14 all require answers to all options.

Note the clever first agree/disagree statement for Q.14.

Someone will conduct a useful survey of business opinions about big data and likely responses to it.

Hopefully with a technical survey of the various options and their advantages/disadvantages.

Please let me know when you see it, I would like to point people to it.

(I completed this form on Sunday, October 2, 2011, around 11 AM Eastern time.)

It’s official — the grand central EDW will never happen

Friday, June 24th, 2011

It’s official — the grand central EDW will never happen

Curt Monash cites presentations at the Enzee Universe conference by IBM, Merv Adrian (Gartner) and Forrester Research panning the idea of a grand central EDW (Enterprise Data Warehouse).

If that isn’t going to happen for any particular enterprise, does that mean no universal data warehouse, a/k/a, the Semantic Web?

Even if Linked Data were to succeed in linking all data together, that’s the easy part. Useful access has always been a question of mapping semantics and that’s the hard part. The part that requires people in the loop. People like librarians.