Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 11, 2013

Flatten entire HBase column families… [Mixing Labels and Data]

Filed under: HBase,Pig,Python — Patrick Durusau @ 4:24 pm

Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.

From the post:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.

How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.

Now there’s an ugly problem.

You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.
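To make the splitting step concrete, here is a minimal sketch in plain Python rather than Pig. It is not the code from Chase's post. The column-name layout (`YYYYMMDD:counter_type`), the separator, the row key and the function name `flatten_row` are all assumptions made up for illustration.

```python
from typing import Dict, Iterator, Tuple


def flatten_row(
    row_key: str, columns: Dict[str, int], sep: str = ":"
) -> Iterator[Tuple[str, str, str, int]]:
    """Yield one (row_key, date, counter_type, value) tuple per column.

    Assumes every column name looks like "<date><sep><counter_type>",
    which is a hypothetical layout, not one taken from the original post.
    """
    for name, value in sorted(columns.items()):
        # Split only on the first separator so the counter type may contain it.
        date, counter_type = name.split(sep, 1)
        yield (row_key, date, counter_type, value)


# A made-up wide row: the column names carry both data (date) and label (counter).
row = {"20130210:clicks": 7, "20130210:views": 40, "20130211:clicks": 3}
records = list(flatten_row("user42", row))
```

Each wide row becomes several narrow records, but the original column names in the table are untouched, which is exactly the in situ problem.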

Saying: “Don’t do that!” doesn’t help because it is already being done.

If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.

Curious: would you write an identifier as a regex that parses such a mix of label and data and hands each part off for further processing?

Suggestions?
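One possible shape for such an identifier, offered only as a sketch: a regex with named groups, where each group is handed to its own handler. The pattern, the `YYYYMMDD:counter` layout, and the names `COLUMN_ID`, `HANDLERS` and `identify` are hypothetical, not drawn from HBase, Pig or the original post.

```python
import re
from datetime import date, datetime

# Hypothetical identifier: eight digits of date, a colon, then a counter label.
COLUMN_ID = re.compile(r"^(?P<date>\d{8}):(?P<counter>[a-z_]+)$")

# One handler per named group; each decides how its part is processed further.
HANDLERS = {
    "date": lambda text: datetime.strptime(text, "%Y%m%d").date(),
    "counter": lambda text: text,
}


def identify(column_name):
    """Return the parsed parts of a column name, or None if it does not match."""
    match = COLUMN_ID.match(column_name)
    if match is None:
        return None
    try:
        return {part: HANDLERS[part](text) for part, text in match.groupdict().items()}
    except ValueError:
        # Eight digits that are not a real calendar date, e.g. "20131345".
        return None


parsed = identify("20130211:clicks")
unparsed = identify("last_login")
```

Column names that do not fit the pattern come back as `None`, so plain labels and label/data composites can live side by side and be told apart.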

I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.
