Archive for the ‘Data Conversion’ Category

brename – data munging tool

Thursday, August 31st, 2017

brename — a practical cross-platform command-line tool for safely batch renaming files/directories via regular expression

Renaming files is a daily activity when data munging. Wei Shen has created a batch renaming tool with these features:

  • Cross-platform. Supporting Windows, Mac OS X and Linux.
  • Safe. By checking potential conflicts and errors.
  • File filtering. Supporting including and excluding files via regular expression.
    No need to run commands like find ./ -name "*.html" -exec CMD.
  • Renaming submatch with corresponding value via key-value file.
  • Renaming via ascending integer.
  • Recursively renaming both files and directories.
  • Supporting dry run.
  • Colorful output.

Binaries are available for Linux, OS X and Windows, both 32 and 64-bit versions.
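For readers who want to see the moving parts, here is a minimal Python sketch of the same general idea: regex-based renaming with file filtering, a conflict check and a dry run. It is not brename's implementation or its command-line interface. The function names (plan_renames, find_conflicts, apply_renames) and their parameters are invented for illustration.

```python
import re
from pathlib import Path


def plan_renames(directory, pattern, replacement, include=None):
    """Return (old, new) path pairs without touching the file system."""
    regex = re.compile(pattern)
    include_re = re.compile(include) if include else None
    plan = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if include_re and not include_re.search(path.name):
            continue  # file filtering via regular expression
        new_name = regex.sub(replacement, path.name)
        if new_name and new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan


def find_conflicts(plan):
    """Return targets that collide with each other or with existing files."""
    seen, conflicts = set(), set()
    for _, new in plan:
        if new in seen or new.exists():
            conflicts.add(new)
        seen.add(new)
    return conflicts


def apply_renames(plan, dry_run=True):
    """Print the plan; rename only if dry_run is False and nothing conflicts."""
    conflicts = find_conflicts(plan)
    if conflicts:
        names = sorted(str(c) for c in conflicts)
        raise ValueError(f"conflicting targets: {names}")
    for old, new in plan:
        print(f"{old.name} -> {new.name}")
        if not dry_run:
            old.rename(new)
```

Calling apply_renames(plan_renames(".", r"\.jpeg$", ".jpg")) only prints what would change. Passing dry_run=False performs the renames, and only after the conflict check passes.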

Linux has a variety of batch file renaming options, but I didn’t see any shortcomings in brename that jumped out at me.

You?

HT, Stephen Turner.

Are You Investing in Data Prep or Technology Skills?

Wednesday, August 30th, 2017

Kirk Borne posted for #wisdomwednesday:

New technologies are my weakness.

What about you?

What if we used data driven decision making?

Different result?

Import Table Into Google Spreadsheet – Worked Example – Baby Blue’s

Wednesday, February 24th, 2016

I encountered a post by Zach Klein with the title: You can automatically scrape and import any table or list from any URL into Google Spreadsheets.

As an image of his post:

[Image: zach-klein]

Despite it having 1,844 likes and 753 retweets, I had to test it before posting it here. 😉

An old habit born of not citing anything I haven’t personally checked. It means more reading, but you get to commit your own mistakes and are not limited to the mistakes made by others.

Anyway, I thought of the HTML version of Baby Blue’s Manual of Legal Citation as an example.

After loading that URL, view the source of the page because we want to search for table elements in the text. There are display artifacts that look like tables but are lists, etc.

The table I chose was #11, which appears in Baby Blue’s as:

[Image: bb-court]

So I opened up a blank Google Spreadsheet and entered:

=ImportHTML("https://law.resource.org/pub/us/code/blue/BabyBlue.20160205.html", "table", 11)

in the top left cell.

The results:

[Image: bb-court-gs]

I’m speculating but Google Spreadsheets appears to have choked on the entities used around “name” in the entry for Borough court.

If you’re not fluent with XSLT or XQuery, importing tables and lists into Google Spreadsheets is an easy way to capture information.
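If you would rather do the same thing outside a spreadsheet, here is a rough Python sketch of "give me the Nth table on the page" using only the standard library. It is not how Google implements ImportHTML. The class and function names (TableExtractor, import_table) are mine, and it assumes reasonably well-formed, non-nested tables. The index is 1-based to mirror the spreadsheet formula.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &quot; for us.
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None
        self._cell = None

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._flush_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def import_table(html, index):
    """Return the index-th table (1-based) found in an HTML string."""
    if index < 1:
        raise ValueError("index is 1-based")
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[index - 1]
```

To run it against a live page you would first fetch the HTML, for example with urllib.request.urlopen, and pass the decoded text to import_table. Because entities are decoded before the cell text is stored, quoted words inside a cell should come through as plain quotation marks.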

DataGraft: Initial Public Release

Monday, September 7th, 2015

DataGraft: Initial Public Release

As a former resident of Louisiana and given my views on the endemic corruption in government contracts, putting “graft” in the title of anything is like waving a red flag at a bull!

From the webpage:

We are pleased to announce the initial public release of DataGraft – a cloud-based service for data transformation and data access. DataGraft is aimed at data workers and data developers interested in simplified and cost-effective solutions for managing their data. This initial release provides capabilities to:

  • Transform tabular data and share transformations: Interactively edit, host, execute, and share data transformations
  • Publish, share, and access RDF data: Data hosting and reliable RDF data access / data querying

Sign up for an account and try DataGraft now!

You may want to check out our FAQ, documentation, and the APIs. We’d be glad to hear from you – don’t hesitate to get in touch with us!

I followed a tweet from Kirk Borne recently to a demo of Pentaho on data integration. I mention that because Pentaho is a good representative of the commercial end of data integration products.

Oh, the demo was impressive, a visual interface selecting nicely styled icons from different data sources, integration, visualization, etc.

But the one characteristic it shares with DataGraft is that I would be hard pressed to follow or verify your reasoning for integrating that particular data.

If it happens that both files have customerID and, by some chance, both use it with the same semantics, then you can glibly talk about integrating data from diverse resources. If not, well, then your mileage will vary a great deal.

The important point that is dropped by both Pentaho and DataGraft is that data integration isn’t just an issue for today. That same data integration must remain robust long after I have moved on to another position.

As with spreadsheets, the next person in my position could just run the process blindly and hope that no one ever asks for a substantive change, but that sounds terribly inefficient.

Why not provide users with the ability to disclose the properties they “see” in the data sources and to indicate why they made the mappings they did?

That is, make the mapping process more transparent.
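To make that concrete, here is a minimal sketch of what recording the "why" behind a mapping could look like. Nothing here comes from DataGraft or Pentaho. FieldMapping, MappingLog and every field name are hypothetical, chosen only to show that a mapping can carry the properties its author relied on and a stated rationale.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldMapping:
    """One claim that two fields mean the same thing, plus the reasons."""
    source_a: str      # e.g. "crm.customerID" (hypothetical)
    source_b: str      # e.g. "billing.customerID" (hypothetical)
    properties: tuple  # properties the author "saw" in both sources
    rationale: str     # why the author believes the fields match
    author: str


@dataclass
class MappingLog:
    """A reviewable record of mappings for whoever inherits the process."""
    entries: list = field(default_factory=list)

    def record(self, mapping):
        # Refuse undocumented mappings: the rationale is the whole point.
        if not mapping.rationale.strip():
            raise ValueError("a mapping needs a stated rationale")
        self.entries.append(mapping)

    def explain(self, field_name):
        """Return every recorded mapping that involves the given field."""
        return [m for m in self.entries
                if field_name in (m.source_a, m.source_b)]
```

The next person in the position can then call explain("billing.customerID") and read the reasoning, rather than inferring it from the output.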

Lab Report: The Final Grade [Normalizing Corporate Small Data]

Sunday, December 7th, 2014

Lab Report: The Final Grade by Dr. Geoffrey Malafsky.

From the post:

We have completed our TechLab series with Cloudera. Its objective was to explore the ability of Hadoop in general, and Cloudera’s distribution in particular, to meet the growing need for rapid, secure, adaptive merging and correction of core corporate data. I call this Corporate Small Data which is:

“Structured data that is the fuel of an organization’s main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. This is Small Data relative to the much ballyhooed Big Data of the Terabyte range.” [1]

Corporate Small Data does not include the predominant Big Data examples which are almost all stochastic use cases. These can succeed even if there is error in the source data and uncertainty in the results since the business objective is getting trends or making general associations. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative, such as for executive decision making, accounting, risk management, regulatory compliance, and security.

Dr. Malafsky gives Cloudera high marks (A-) for use in enterprises and what he describes as “data normalization.” Not in the relational database sense but more in the data cleaning sense.

While testing a Cloudera distribution at your next data cleaning exercise, ask yourself this question: OK, the processing worked great, but how do I avoid collecting all the information I needed for this project again in the future?

Data Shaping in Google Refine

Tuesday, July 31st, 2012

Data Shaping in Google Refine by AJ Hirst.

From the post:

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

A bit practical but I was in a conversation earlier today about re-shaping a topic map so “practical” things are on my mind.

With the amount of poorly structured data on the web, you will find this useful.
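The reshaping described in the quoted passage, turning medal columns into rows, can be sketched outside Refine in a few lines of Python. This is not AJ Hirst's recipe and not Refine's transpose operation. The function name columns_to_rows, its parameters and the sample values in the usage note are all made up for illustration.

```python
def columns_to_rows(rows, key, value_columns,
                    variable="medal", value="country"):
    """Turn one wide row per event into one long row per (event, medal)."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            cell = row.get(column)
            if cell in (None, ""):
                continue  # skip empty cells rather than emit blank rows
            long_rows.append({key: row[key], variable: column, value: cell})
    return long_rows
```

Given a hypothetical row such as {"event": "Event 1", "Gold": "Country A", "Silver": "Country B", "Bronze": "Country C"}, calling columns_to_rows(rows, "event", ["Gold", "Silver", "Bronze"]) yields three rows, one per medal type, each carrying the event name.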

I first saw this at: Dzone.

An Asymmetric Data Conversion Scheme based on Binary Tags

Wednesday, March 21st, 2012

An Asymmetric Data Conversion Scheme based on Binary Tags by Zhu Wang; Chonglei Mei; Hai Jiang; G.A. Wilkin.

Abstract:

In distributed systems with homogeneous or heterogeneous computers, data generated on one machine might not always be used by another machine directly. For a particular data type, its endianness, size and padding situation cause incompatibility issue. Data conversion procedure is indispensable, especially in open systems. So far, there is no widely accepted data format standard in high performance computing community. Most time, programmers have to handle data formats manually. In order to achieve high programmability and efficiency in both homogeneous and heterogeneous open systems, a novel asymmetric binary-tag-based data conversion scheme (BinTag) is proposed to share data smoothly. Each data item carries one binary tag generated by BinTag’s parser without much programmer’s involvement. Data conversion only happens when it is absolutely necessary. Experimental results have demonstrated its effectiveness and performance gains in terms of productivity and data conversion speed. BinTag can be used in both memory and secondary storage systems.

Homogeneous and heterogeneous in the sense of padding, size, endianness? Serious issue for high performance computing.
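A small Python illustration of the endianness and padding problem, using the standard struct module. This is not BinTag's tag format or parser, which the abstract does not spell out. The one-byte tag, the LITTLE and BIG constants and the function names are assumptions made purely to show how a tag lets a reader decide whether conversion is needed.

```python
import struct
import sys

# Hypothetical one-byte tag values; not taken from the BinTag paper.
LITTLE, BIG = 0, 1
_PREFIX = {LITTLE: "<", BIG: ">"}


def pack_tagged(values, byteorder=sys.byteorder):
    """Pack 32-bit signed ints in the given byte order, prefixed by a tag."""
    tag = LITTLE if byteorder == "little" else BIG
    fmt = f"{_PREFIX[tag]}{len(values)}i"
    return bytes([tag]) + struct.pack(fmt, *values)


def needs_conversion(blob):
    """True when the data was written in a byte order foreign to this host."""
    native = LITTLE if sys.byteorder == "little" else BIG
    return blob[0] != native


def unpack_tagged(blob):
    """Read the tag, then decode the payload in the byte order it names."""
    tag, payload = blob[0], blob[1:]
    count = len(payload) // 4
    return list(struct.unpack(f"{_PREFIX[tag]}{count}i", payload))
```

The same integer 1 becomes the bytes 01 00 00 00 in little-endian order and 00 00 00 01 in big-endian order, which is why the tag has to travel with the data. Padding is the other half of the problem: struct.calcsize("<bi") is 5 because standard sizes add no alignment, while the native layout "@bi" is typically larger on common platforms.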

Are there lessons to be taught or learned here for other notions of homogeneous/heterogeneous data?

Do we need binary tags to track semantics at a higher level?

Or can we view data as though it had particular semantics? At higher and lower levels?