Archive for the ‘Data Conversion’ Category

Data Shaping in Google Refine

Tuesday, July 31st, 2012

Data Shaping in Google Refine by AJ Hirst.

From the post:

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

A bit practical, but I was in a conversation earlier today about reshaping a topic map, so “practical” things are on my mind.

With the amount of poorly structured data on the web, you will find this useful.
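For readers who want to see the shape of the first recipe outside of Refine, here is a minimal sketch in plain Python of turning columns into rows. It is not the procedure from the quoted post (which works inside Google Refine); the function name `columns_to_rows`, the field names, and the placeholder data are all assumptions made up for illustration.

```python
# Hypothetical wide-format data: one row per event, one column per medal.
# The event names and country codes are placeholders, not real results.
wide_rows = [
    {"Event": "Event 1", "Gold": "AAA", "Silver": "BBB", "Bronze": "CCC"},
    {"Event": "Event 2", "Gold": "BBB", "Silver": "CCC", "Bronze": "AAA"},
]


def columns_to_rows(rows, key_column, value_columns,
                    name_field="Medal", value_field="Country"):
    """Emit one output row per (input row, value column) pair."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                key_column: row[key_column],  # repeated on every new row
                name_field: column,           # the old column header
                value_field: row[column],     # the old cell value
            })
    return long_rows


long_rows = columns_to_rows(wide_rows, "Event", ["Gold", "Silver", "Bronze"])
for r in long_rows:
    print(r)
```

Each event row becomes three rows, one per medal column, with the event name repeated on each.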

I first saw this at DZone.

An Asymmetric Data Conversion Scheme based on Binary Tags

Wednesday, March 21st, 2012

An Asymmetric Data Conversion Scheme based on Binary Tags by Zhu Wang, Chonglei Mei, Hai Jiang, and G. A. Wilkin.

Abstract:

In distributed systems with homogeneous or heterogeneous computers, data generated on one machine might not always be used by another machine directly. For a particular data type, its endianness, size and padding situation cause incompatibility issue. Data conversion procedure is indispensable, especially in open systems. So far, there is no widely accepted data format standard in high performance computing community. Most time, programmers have to handle data formats manually. In order to achieve high programmability and efficiency in both homogeneous and heterogeneous open systems, a novel asymmetric binary-tag-based data conversion scheme (BinTag) is proposed to share data smoothly. Each data item carries one binary tag generated by BinTag’s parser without much programmer’s involvement. Data conversion only happens when it is absolutely necessary. Experimental results have demonstrated its effectiveness and performance gains in terms of productivity and data conversion speed. BinTag can be used in both memory and secondary storage systems.

Homogeneous and heterogeneous in the sense of padding, size, and endianness? A serious issue for high performance computing.
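To make the problem concrete, here is a small Python sketch of the general idea of tagging data with its byte order so that a reader only has to swap bytes when its own order differs. This is not BinTag's actual tag format or parser, which the abstract does not describe; the one-byte tag, the constants, and the functions `pack_tagged` and `unpack_tagged` are assumptions invented for illustration.

```python
import struct
import sys

# Hypothetical one-byte tag values (not BinTag's real encoding).
LITTLE, BIG = 0, 1
NATIVE = LITTLE if sys.byteorder == "little" else BIG
_PREFIX = {LITTLE: "<", BIG: ">"}

# The same 32-bit integer has different byte layouts on different machines.
print(struct.pack("<i", 1))  # little-endian bytes
print(struct.pack(">i", 1))  # big-endian bytes

# Padding also changes the size of a record (a char followed by an int).
print(struct.calcsize("=ci"))  # standard sizes, no padding
print(struct.calcsize("@ci"))  # native alignment, platform dependent


def pack_tagged(value, order=NATIVE):
    """Writer side: keep the writer's own byte order and prepend a tag."""
    return bytes([order]) + struct.pack(_PREFIX[order] + "i", value)


def unpack_tagged(blob, reader_order=NATIVE):
    """Reader side: decode using the tag.

    Returns (value, swap_needed), where swap_needed is True only when the
    writer's byte order differs from the reader's.
    """
    tag, payload = blob[0], blob[1:]
    swap_needed = tag != reader_order
    (value,) = struct.unpack(_PREFIX[tag] + "i", payload)
    return value, swap_needed


blob = pack_tagged(1, order=LITTLE)
print(unpack_tagged(blob, reader_order=LITTLE))  # same order on both sides
print(unpack_tagged(blob, reader_order=BIG))     # orders differ
```

The writer never converts; the cost is paid by the reader, and only when the tag says the layouts disagree. That is one reading of "asymmetric" and "only when absolutely necessary" in the abstract, offered here as an interpretation rather than a description of the paper's design.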

Are there lessons to be taught or learned here for other notions of homogeneous/heterogeneous data?

Do we need binary tags to track semantics at a higher level?

Or can we view data as though it had particular semantics? At higher and lower levels?