Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 23, 2011

ETL Demo with Data From Data.Gov

Filed under: ETL,Expressor — Patrick Durusau @ 4:28 pm

ETL Demo with Data From Data.Gov by Kevin E. Kline.

From the post:

A little over a month ago, I wrote an article (Is There Such a Thing as Easy ETL?) about expressor software and their desktop ETL application, expressor Studio. Reading up on the tool, it seemed much easier than the native ETL tools in SQL Server, but the "proof would be in the pudding," so to speak, when I actually tried it out by loading some free (and incredibly useful) data from the US federal data clearinghouse, Data.Gov.

If you’d rather not read my entire previous article, here is a quick recap: expressor Studio uses “semantic types” to manage and abstract mappings between sources and targets. In essence, these types are used for describing data in terms that humans can understand, instead of in terms that computers can understand. The idea of semantic abstraction is quite intriguing, and it gave me an excuse to use data from data.gov to build a quick demo. You can download the complete data set I used from the following location: International Statistics. (Note: I have this dream that I’m going to someday download all of these free statistical data sets, build a bunch of amazing and high-value analytics, and make a mint. If, instead, YOU do all of those things, then please pay to send at least one of my seven kids to college in repayment for the inspiration. I’m not kidding. I have SEVEN kids. God help me.)

The federal government, to its credit, has made great progress in making data available. However, there is a big difference between accessing data and understanding data. When I first looked at one of the data files I downloaded, I figured it was going to take me years to decrypt the field names. Luckily, I noticed an Excel file with field names and descriptions. Seriously, there are single-letter field names in these files where the field name “G” has a description of “Age group indicator” (Oh wow). See the figure below.

I like Kevin’s point about the difference between “accessing data and understanding data.”
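The difference is easy to show in code. Here is a quick Python sketch (mine, not expressor’s semantic types) that applies a data dictionary like the Excel sheet Kevin found, renaming cryptic columns into readable ones. The field names and file name are made up for illustration:

    import csv

    # Hypothetical data dictionary, standing in for the Excel sheet of
    # field names and descriptions: cryptic name -> readable name.
    FIELD_NAMES = {
        "G": "age_group",
        "CTRY": "country_code",
        "YR": "year",
    }

    def readable_rows(path):
        """Yield CSV rows with cryptic column names replaced by readable ones."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield {FIELD_NAMES.get(k, k): v for k, v in row.items()}

    # Usage (hypothetical file name):
    # for row in readable_rows("international_stats.csv"):
    #     print(row["age_group"])

Until you have the data dictionary in hand, none of those renames are possible. That is Kevin’s point in a nutshell: access without understanding.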

December 18, 2011

Error Handling, Validation and Cleansing with Semantic Types and Mappings

Filed under: Expressor,Mapping,Semantics,Types — Patrick Durusau @ 8:41 pm

Error Handling, Validation and Cleansing with Semantic Types and Mappings by Michael Tarallo.

From the post:

expressor ETL applications can set up data validation rules and error handling in a few ways. The traditional approach with many ETL tools is to build in the rules using the various ETL operators. A more streamlined approach is to also use the power of expressor Semantic Mappings and Semantic Types.

  • Semantic Mappings specify how a variety of characteristics are to be handled when string, number, and date-time data types are mapped from the physical schema (your source) to the logical semantic layer known as the Semantic Type.
  • Semantic Types allow you to define, in business terms, how you want the data and the data model to be represented.

Both methods provide a means of validating attribute data and invoking corrective actions if rules are violated.

  • Data Validation rules can be in the form of pattern matching, value ranges, character lengths, formatting, currency and other specific data type constraints.
  • Corrective Actions can be in the form of null, default and correction value replacements as well as specific operator handling to either skip records or reject them to another operator.

NOTE: Semantic Mapping rules are applied first before Semantic Type rules.

Read more here.
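The general pattern Michael describes, a declared type that validates a value and then either substitutes a default or rejects the record, looks roughly like this Python sketch. It is my illustration of the idea, not expressor’s actual API:

    import re

    class SemanticType:
        """A declared business type: a validation rule plus a corrective action."""
        def __init__(self, pattern, default=None, reject_on_fail=False):
            self.pattern = re.compile(pattern)
            self.default = default
            self.reject_on_fail = reject_on_fail

        def apply(self, value):
            if self.pattern.fullmatch(str(value)):
                return value, None
            if self.reject_on_fail:
                return None, "rejected"       # route the record to a reject handler
            return self.default, "corrected"  # substitute a default value

    # Hypothetical types: a 4-digit year and a non-negative amount.
    TYPES = {
        "year": SemanticType(r"\d{4}", default="0000"),
        "amount": SemanticType(r"\d+(\.\d+)?", reject_on_fail=True),
    }

    def validate(record):
        """Apply each field's semantic type; return (clean_record, rejects)."""
        clean, rejects = {}, []
        for field, value in record.items():
            stype = TYPES.get(field)
            if stype is None:
                clean[field] = value
                continue
            new_value, status = stype.apply(value)
            if status == "rejected":
                rejects.append((field, value))
            else:
                clean[field] = new_value
        return clean, rejects

The attraction is that the rule lives with the type declaration instead of being scattered across individual ETL operators.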

I am still trying to find time to test at least the community edition of the software.

What “extra” time I have now is being soaked up configuring/patching Eclipse to build Nutch, to correct a known problem between Nutch and Solr. I suspect you could sell a packaged version of open source software with all the paths and classes hard-coded. No more setting paths, no more inconsistent library versions. Just unpack and run, with data stored in a separate directory. When a new version comes out, simply rm -R the install directory and unpack the new one. (That should also include the “.” files.) Configuration/patching isn’t a good use of anyone’s time. (Unless you want to sell the results. 😉 )
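Something like this sketch, with made-up paths, is the model I have in mind:

    import shutil
    import tarfile
    from pathlib import Path

    INSTALL_DIR = Path("/opt/nutch")    # code and config: disposable
    DATA_DIR = Path("/srv/nutch-data")  # crawl data: survives upgrades

    def upgrade(new_tarball):
        """Wipe the install directory and unpack the new release in its place."""
        if INSTALL_DIR.exists():
            shutil.rmtree(INSTALL_DIR)  # the "rm -R" step, dot files included
        INSTALL_DIR.mkdir(parents=True)
        # Assumes the tarball unpacks its files directly, without a
        # top-level version directory.
        with tarfile.open(new_tarball) as tar:
            tar.extractall(INSTALL_DIR)
        # DATA_DIR is untouched, so there is nothing to migrate.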

But I will get to it! Unless someone beats me to it and wants to send me a link to their post that I can cite and credit on my blog.


Two things I would change about Michael’s blog:

Prerequisite: Knowledge of expressor Studio and dataflows. You can find tutorials and documentation here

To read:

Prerequisites:

  • expressor software (community or 30-day free trial) here.
  • Knowledge of expressor Studio and dataflows. You can find tutorials and documentation here

And, well, this one isn’t about Michael’s blog but about the expressor download page: if the desktop/community edition is “expressor Studio,” then call it that on the download page.

Don’t use different names for a software package and expect users to sort it out. Not if you want to encourage downloads and sales anyway. Surveys show you have to wait until they are paying customers to start abusing them. 😉

December 12, 2011

Extracting data from the Facebook social graph with expressor, a Tutorial

Filed under: Expressor,Facebook,Social Graphs — Patrick Durusau @ 10:21 pm

Extracting data from the Facebook social graph with expressor, a Tutorial by Michael Tarallo.

From the post:

In my last article, Enterprise Application Integration with Social Networking Data, I described how social networking sites, such as Facebook and Twitter, provide APIs to communicate with the various components available in these applications. One in particular is their “social graph” API, which enables software developers to create programs that interface with the many “objects” stored within these graphs.

In this article, I will briefly review the Facebook social graph and provide a simple tutorial with an expressor downloadable project. I will cover how expressor can extract data using the Facebook graph API and flatten it by using the provided reusable Datascript Module. I will also demonstrate how to add new user defined attributes to the expressor Dataflow so one can customize the output needed.

Looks interesting.
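For a sense of what the graph API returns and what “flattening” means, here is a rough Python sketch (mine, not the Datascript module from Michael’s post). The graph API serves JSON for an object at graph.facebook.com/<id>; flattening collapses the nested structure into a single level of key/value pairs:

    import json
    from urllib.request import urlopen

    def fetch_graph_object(object_id, access_token=None):
        """Fetch one object from the Facebook graph API as parsed JSON."""
        url = "https://graph.facebook.com/" + object_id
        if access_token:
            url += "?access_token=" + access_token
        with urlopen(url) as response:
            return json.load(response)

    def flatten(obj, prefix=""):
        """Collapse nested JSON objects into a single level of dotted keys."""
        flat = {}
        for key, value in obj.items():
            name = prefix + "." + key if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, name))
            else:
                flat[name] = value
        return flat

    # Usage (object id is hypothetical; many objects require an access token):
    # print(flatten(fetch_graph_object("cocacola")))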

Seems appropriate after starting today’s posts with work on the ODP files.

As you know, I am not a big fan of ETL but it has been a survivor. And if the folks who are signing off on the design want ETL, maybe it isn’t all that weird. 😉
