Archive for the ‘ETL’ Category

Stealth-mode 28msec wants to build a Tower of Babel for databases

Wednesday, June 12th, 2013

Stealth-mode 28msec wants to build a Tower of Babel for databases by Derrick Harris.

From the post:

28msec is not your average database startup but, then again, neither is its mission. The company — still in stealth mode (until our Structure Launchpad event on June 20) after about seven years of existence — has created a data-processing platform that it says can take and analyze data from any source, and then deliver the results in real time.

The company took so long to officially launch, CEO Eric Kish told me, because it took such a long time to build. The 28msec history goes like this: The early investors are database industry veterans (one was employee No. 6 at Oracle) who, at some point in 2006, envisioned an explosion in data formats and databases. Their solution was to create a platform able to extract data from any of these sources, transform it into a standard format, and then let users analyze it using a single query language that looks a lot like the SQL they already know. 28msec is based on the open source JSONiq and Zorba query languages and will be available as a cloud service.

Alex Popescu points to his: Main difference between Hadapt and Microsoft Polybase, HAWQ, SQL-H to underline the point that we all know ETL works, the question is what is required to optimize it.

I first saw this at Alex Popescu’s 28msec – query data from any source in real time.

PS: Should I send a note along to the NSA or just assume they are listening? ;-)

Take DMX-h ETL Pre-Release for a Test Drive!

Sunday, June 2nd, 2013

Take DMX-h ETL Pre-Release for a Test Drive! by Keith Kohl.

From the post:

Last Monday, we announced two new DMX-h Hadoop products, DMX-h Sort Edition and DMX-h ETL Edition. Several Blog posts last week included why I thought the announcement was cool and also some Hadoop benchmarks on both TeraSort and also running ETL.

Part of our announcement was the DMX-h ETL Pre-Release Test Drive. The test drive is a trial download of our DMX-h ETL software. We have installed our software on our partner Cloudera’s VM (VMware) image complete with the user case accelerators, sample data, documentation and even videos. While the download is a little large ─ ok it’s over 3GB─ it’s a complete VM with Linux and Cloudera’s CDH 4.2 Hadoop release (the DMX-h footprint is a mere 165MB!).

Cool!

Then Keith asks later in the post:

The test drive is not your normal download. This is actually a pre-release of our DMX-h ETL product offering. While we have announced our product, it is not generally available (GA) yet…scheduled for end of June. We are offering a download of a product that isn’t even available yet…how many vendors do that?!

Err, lots of them? It’s call a beta/candidate/etc release?

;-)

Marketing quibbles aside, it does sound quite interesting.

In some ways I would like to see the VM release model become more common.

Test driving software should not be a install/configuration learning experience.

That should come after users are interested in the software.

BTW, interesting approach, at least reading the webpages/documentation.

Doesn’t generate code for conversion/ETL so there is no code to maintain. Written against the DMX-h engine.

Need to think about what that means in terms of documenting semantics.

Or reconciling different ETL approaches in the same enterprise.

More to follow.

Pfizer swaps out ETL for data virtualization tools

Thursday, February 21st, 2013

Pfizer swaps out ETL for data virtualization tools by Nicole Laskowski.

From the post:

Pfizer Inc.’s Worldwide Pharmaceutical Sciences division, which determines what new drugs will go to market, was at a technological fork in the road. Researchers were craving a more iterative approach to their work, but when it came to integrating data from different sources, the tools were so inflexible that work slowdowns were inevitable.

At the time, the pharmaceutical company was using one of the most common integration practices known as extract, transform, load (ETL). When a data integration request was made, ETL tools were used to reach into databases or other data sources, copy the requested data sets and transfer them to a data mart for users and applications to access.

But that’s not all. The Business Information Systems (BIS) unit of Pfizer, which processes data integration requests from the company’s Worldwide Pharmaceutical Sciences division, also had to collect specific requirements from the internal customer and thoroughly investigate the data inventory before proceeding with the ETL process.

“Back then, we were basically kind of in this data warehousing information factory mode,” said Michael Linhares, a research fellow and the BIS team leader.

Requests were repetitious and error-prone because ETL tools copy and then physically move the data from one point to another. Much of the data being accessed was housed in Excel spreadsheets, and by the time that information made its way to the data mart, it often looked different from how it did originally.

Plus, the integration requests were time-consuming since ETL tools process in batches. It wasn’t outside the realm of possibility for a project to take up to a year and cost $1 million, Linhares added. Sometimes, his team would finish an ETL job only to be informed it was no longer necessary.

“That’s just a sign that something takes too long,” he said.

Cost, quality and time issues aside, not every data integration request deserved this kind of investment. At times, researchers wanted quick answers; they wanted to test an idea, cross it off if it failed and move to the next one. But ETL tools meant working under rigid constraints. Once Linhares and his team completed an integration request, for example, they were unable to quickly add another field and introduce a new data source. Instead, they would have to build another ETL for that data source to be added to the data mart.

Bear in mind that we were just reminded, Leveraging Ontologies for Better Data Integration, that you have to understand data to integrate data.

That lesson holds true for integrating data after data virtualization.

Where are you going to write down your understanding of the meaning of the data you virtualize?

So subsequent users can benefit from your understanding of that data?

Or perhaps add their understanding to yours?

Or to have the capacity to merge collections of such understandings?

I would say a topic map.

You?

JMyETL

Friday, September 14th, 2012

JMyETL, an easy to use ETL tool that supports 10 different RDBMS by Esen Sagynov.

From the post:

JMyETL is a very useful and simple Java based application for Windows OS which allows users to import and export data from/to various database systems. For example:

  • CUBRID –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, SQLite
  • MySQL –> Sybase ASE/ASA, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASE –> Sybase ASA, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASA –> Sybase ASE, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Oracle –> Sybase ASA, Sybase ASE, MySQL, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Access –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • PostgreSQL –> Sybase ASE, Sybase ASA, MySQL, Oracle, Access, SQL Server, DB2, SQLite, CUBRID
  • SQL Server –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, Access, DB2, SQLite, CUBRID
  • DB2 –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, Access, SQLite, CUBRID
  • SQLite –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, CUBRID

Just in case you need a database to database ETL utility.

I first saw this at DZone.

ETL 2.0 – Data Integration Comes of Age

Monday, May 14th, 2012

ETL 2.0 – Data Integration Comes of Age by Robin Bloor PhD & Rebecca Jozwiak.

Well…., sort of.

It is a “white paper” and all that implies but when you read:

Versatility of Transformations and Scalability

All ETL products provide some transformations but few are versatile. Useful transformations may involve translating data formats and coded values between the data sources and the target (if they are, or need to be, different). They may involve deriving calculated values, sorting data, aggregating data, or joining data. They may involve transposing data (from columns to rows) or transposing single columns into multiple columns. They may involve performing look-ups and substituting actual values with looked-up values accordingly, applying validations (and rejecting records that fail) and more. If the ETL tool cannot perform such transformations, they will have to be hand coded elsewhere – in the database or in an application.

It is extremely useful if transformations can draw data from multiple sources and data joins can be performed between such sources “in flight,” eliminating the need for costly and complex staging. Ideally, an ETL 2.0 product will be rich in transformation options since its role is to eliminate the need for direct coding all such data transformations.

you start to lose what little respect you had for industry “white papers.”

Not once in this white paper is the term “semantics” used. It is also innocent of using the term “documentation.”

Don’t you think an ETL 2.0 application should enable re-use of “useful transformations?”

Wouldn’t that be a good thing?

Instead of IT staff starting from zero with every transformation request?

Failure to capture the semantics of data leaves you at ETL 2.0, while everyone else is at ETL 3.0.

Where does your business sense tell you about that choice?

(ETL 3.0 – Documented, re-usable, semantics for data and data structures. Enables development of transformation modules for particular data sources.)

ETL Demo with Data From Data.Gov

Friday, December 23rd, 2011

ETL Demo with Data From Data.Gov by Kevin E. Kline.

From the post:

A little over a month ago, I wrote an article (Is There Such a Thing as Easy ETL) about expressor software and their desktop ETL application, expressor Studio. I wrote about how it seemed much easier than the native ETL tools in SQL Server when I was reading up on the tool, but that the “proof would be in the pudding” so to speak when I actually tried it out loading some free (and incredibly useful) data from the US federal data clearinghouse, Data.Gov.

If you’d rather not read my entire previous article – quick recap, expressor Studio uses “semantic types” to manage and abstract mappings between sources and targets. In essence, these types are used for describing data in terms that humans can understand—instead of describing data in terms that computers can understand. The idea of semantic abstraction is quite intriguing and it gave me an excuse to use data from data.gov to build a quick demo. You can download the complete data set I used from the following location: International Statistics. (Note: I have this dream that I’m going to someday download all of this free statistical data sets, build a bunch of amazing and high-value analytics, and make a mint. If, instead, YOU do all of those things, then please pay to send at least one of my seven kids to college in repayment for the inspiration. I’m not kidding. I have SEVEN kids. God help me).

The federal government, to their credit, has made great progress in making data available. However, there is a big difference between accessing data and understanding data. When I first looked at one of the data files I downloaded, I figured it was going to take me years to decrypt the field names. Luckily, I did notice an Excel file with field names and descriptions. Seriously, there are single letter field names in these files where the field name “G” has a description of “Age group indicator” (Oh Wow). See the figure below.

I like Kevin’s point about the difference between “accessing data and understanding data.”