Archive for the ‘Structured Data’ Category

Infinit.e Overview

Monday, December 15th, 2014

Infinit.e Overview by Alex Piggott.

From the webpage:

Infinit.e is a scalable framework for collecting, storing, processing, retrieving, analyzing, and visualizing unstructured documents and structured records.

[Image omitted. Too small in my theme to be useful.]

Let’s provide some clarification on each of the often overloaded terms used in that previous sentence:

  • It is a "framework" (or "platform") because it is configurable and extensible by configuration (DSLs) or by various plug-in types – the default configuration is expected to be useful for a range of typical analysis applications but to get the most out of Infinit.e we anticipate it will usually be customized.
    • Another element of being a framework is being designed to integrate with existing infrastructures as well as run standalone.
  • By "scalable" we mean that new nodes (or even more granular: new components) can be added to meet increasing workload (either more users or more data), and that provision of new resources is near real-time.
    • Further, the use of fundamentally cloud-based components means that there are no bottlenecks at least to the ~100 node scale.
  • By "unstructured documents" we mean anything from a mostly-textual database record to a multi-page report – but Infinit.e’s "sweet spot" is in the range of database records that would correspond to a paragraph or more of text ("semi-structured records"), through web pages, to reports of 10 pages or less.
    • Smaller "structured records" are better handled by structured analysis tools (a very saturated space), though Infinit.e has the ability to do limited aggregation, processing and integration of such datasets. Larger reports can still be handled by Infinit.e, but will be most effective if broken up first.
  • By "processing" we mean the ability to apply complex logic to the data. Infinit.e provides some standard "enrichment", such as extraction of entities (people/places/organizations/etc) and simple statistics; and also the ability to "plug in" domain-specific processing modules using the Hadoop API.
  • By "retrieving" we mean the ability to search documents and return them in ranking order, but also to be able to retrieve "knowledge" aggregated over all documents matching the analyst’s query.
    • By "query"/"search" we mean the ability to form complex "questions about the data" using a DSL (Domain Specific Language).
  • By "analyzing" we mean the ability to apply domain-specific logic (visual/mathematical/heuristic/etc) to "knowledge" returned from a query.

We refer to the processing/retrieval/analysis/visualization chain as document-centric knowledge discovery:

  • "document-centric": means the basic unit of storage is a generically-formatted document (e.g., useful without knowledge of the specific data format in which it was encoded)
  • "knowledge discovery": means using statistical and text parsing algorithms to extract useful information from a set of documents that a human can interpret in order to understand the most important knowledge contained within that dataset.

One important aspect of Infinit.e is our generic data model. Data from all sources (from large unstructured documents to small structured records) is transformed into a single, simple data model that allows common queries, scoring algorithms, and analytics to be applied across the entire dataset. …
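
To make the idea of a single data model concrete, here is a minimal sketch of my own. It is not Infinit.e's actual schema or API: the `Document` class, the `from_report` and `from_record` helpers, the `search` function, and the sample data are all hypothetical names invented for illustration. It only shows how a long report and a small structured record can be mapped to one shape so that one query runs over both.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical common shape; NOT Infinit.e's real schema."""
    source: str
    title: str
    text: str
    metadata: dict = field(default_factory=dict)


def from_report(path: str, body: str) -> Document:
    """Wrap a long unstructured report; its first line serves as the title."""
    first_line = body.strip().splitlines()[0]
    return Document(source=path, title=first_line, text=body)


def from_record(table: str, row: dict) -> Document:
    """Flatten a small structured record into the same shape."""
    text = "; ".join(f"{key}: {value}" for key, value in row.items())
    title = str(row.get("name", table))
    return Document(source=table, title=title, text=text, metadata=dict(row))


def search(docs: list, term: str) -> list:
    """One query function that works across every source."""
    needle = term.lower()
    return [d for d in docs if needle in d.text.lower()]


# Invented sample data: one report and two database-style records.
docs = [
    from_report("reports/q3.txt", "Quarterly review\nAcme opened an office in Boston."),
    from_record("customers", {"name": "Acme", "city": "Boston"}),
    from_record("customers", {"name": "Globex", "city": "Springfield"}),
]
hits = search(docs, "boston")
print([d.title for d in hits])
```

The point of the sketch is only that the query code never needs to know which kind of source a document came from; whatever Infinit.e really does internally is surely richer than this.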

I saw this in a tweet by Gregory Piatetsky yesterday and so haven’t had time to download or test any of the features of Infinit.e.

The list of features is a very intriguing one.

Definitely worth the time to throw another VM on the box and try it out with a dataset of interest.

Would appreciate your doing the same and sending comments and/or pointers to posts with your experiences. Suspect we will have different favorite features and hit different limitations.


PS: Downloads.

Integrating Structured and Unstructured Data

Thursday, February 21st, 2013

Integrating Structured and Unstructured Data by David Loshin.

It’s a checklist report but David comes up with useful commentary on the following seven points:

  1. Document clearly defined business use cases.
  2. Employ collaborative tools for the analysis, use, and management of semantic metadata.
  3. Use pattern-based analysis tools for unstructured text.
  4. Build upon methods to derive meaning from content, context, and concept.
  5. Leverage commodity components for performance and scalability.
  6. Manage the data life cycle.
  7. Develop a flexible data architecture.

It’s not going to save you planning time but may keep you from overlooking important issues.

My only quibble is that David doesn’t call out data structures as needing defined and preserved semantics.

Data is a no-brainer but the containers of data, dare I say “Hadoop silos,” need to have semantics defined as well.

Data or data containers without defined and preserved semantics are much more costly in the long run.

Both in lost opportunity costs and in after-the-fact integration costs.

Is Google Hijacking Semantic Markup/Structured Data? [FALSE]

Saturday, January 26th, 2013

Is Google Hijacking Semantic Markup/Structured Data? by Barbara Starr.

From the post:

On December 12, 2012, Google rolled out a new tool, called the Google Data Highlighter for event data. Upon a cursory read, it seems to be a tagging tool, where a human trains the Data Highlighter using a few pages on their website, until Google can pick up enough of a pattern to do the remainder of the site itself.

Better yet, you can see all of these results in the structured data dashboard. It appears as if event data is marked up and is compatible with However, there is a caveat here that some folks may not notice.

No actual markup is placed on the page, meaning that none of the semantic markup using this Data Highlighter tool is consumable by Bing, Yahoo or any other crawler on the Web; only Google can use it!

Google is essentially hi-jacking semantic markup so only Google can take advantage of it. Google has the global touch and the ability to execute well-thought-out and brilliantly strategic plans.

Let’s do this by the numbers:

  1. Google develops a service for webmasters to add semantic annotations to their webpages.
  2. Google allows webmasters to use that service at no charge.
  3. Google uses those annotations to improve the search results it provides users (for free).

Google used its own resources to develop a valuable service for webmasters that enhances their websites and user experience with Google, for free.

Perhaps there is a new definition of hijacking?

Webster says the traditional definition includes “to steal or rob as if by hijacking.”

The Semantic Web:


(a) Failing to whitewash the Semantic Web’s picket fence while providing free services to webmasters and users to enhance searching of web content.

(b) Failing to give away data from free services to webmasters and users to those who did not plant, reap, spin, weave or sew.

I don’t find the Semantic Web’s definition of “hijacking” persuasive.


I first saw this at: Google’s Structured Data Take Over by Angela Guess.

Are Texts Unstructured Data? [Text Series]

Monday, November 5th, 2012

I ask because there is a pejorative tinge to “unstructured” when applied to texts. As though texts lack structure and can be improved by various schemes and designs.

Before reaching other aspects of such claims, I wanted to test the notion that texts are “unstructured.” If that is the case, then the Gettysburg Address, written:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

Should be equivalent to the Gettysburg Address Scrambled (via the 3by3by3 Text Scrambler):

not are or freedom—and great a consecrated gave by as devotion they created in portion great and fought forth lives that are should we have that is nor died our to not a struggled the testing here, equal.

Now nation, little this to all civil highly we in these remaining say who we work perish nation, endure. engaged brave resolve that for here. the The of shall Four remember take here we fitting we liberty, forget did to It living poor that above and have honored ground. consecrate, place that is they the for have nation, the nation, for so ago what here, score not us—that it that us dead for who to we those from resting here can fathers this.

But, the that here It war. a is far never do years not what of proper new this brought and shall last earth. conceived their be of nation to detract. men dedicated here battle-field dedicated men, and in come are altogether the cause of great can which whether nobly living, this to which dead, rather to birth that who a field, will advanced. add so proposition people, long The the they power new It any us long increased and on we met or sense, for might hallow dedicate these far devotion—that to not note, in a the We dead before on full government dedicated we have from that war, the conceived God, a thus measure can gave of be people, that here We continent so a it, world people, vain—that dedicate, under can shall live. but larger task have dedicated, unfinished final our rather, can seven

It may just be me but I don’t get the same semantics from the second version as the first.
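
If you want to repeat the experiment on a text of your own, the following is a rough sketch of a word scrambler. It is my own stand-in, not the 3by3by3 Text Scrambler, whose method I have not inspected: the `scramble` function and its `seed` parameter are assumptions made for illustration. It shuffles the words and then checks that the two versions contain exactly the same words, just in a different order.

```python
import random
from collections import Counter


def scramble(text: str, seed: int = 0) -> str:
    """Return the words of `text` in a random order (seeded for repeatability)."""
    words = text.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


original = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent a new nation, conceived in liberty, and dedicated to the "
    "proposition that all men are created equal."
)
scrambled = scramble(original)

# The two versions hold exactly the same multiset of words...
same_words = Counter(original.split()) == Counter(scrambled.split())
# ...while the sequence of those words differs.
same_order = original.split() == scrambled.split()

print(scrambled)
print("same words:", same_words)
print("same order:", same_order)
```

A bag-of-words comparison cannot tell the two versions apart, yet a reader can at once. Whatever separates them is the structure of the text.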


My premise going forward is that texts are structured.