Archive for the ‘Unstructured Data’ Category

Expand Your Big Data Capabilities With Unstructured Text Analytics

Wednesday, May 6th, 2015

Expand Your Big Data Capabilities With Unstructured Text Analytics by Boris Evelson.

From the post:

Beware of insights! Real danger lurks behind the promise of big data to bring more data to more people faster, better and cheaper. Insights are only as good as how people interpret the information presented to them.

When looking at a stock chart, you can’t even answer the simplest question — “Is the latest stock price move good or bad for my portfolio?” — without understanding the context: Where you are in your investment journey and whether you’re looking to buy or sell.

While structured data can provide some context — like checkboxes indicating your income range, investment experience, investment objectives, and risk tolerance levels — unstructured data sources contain several orders of magnitude more context.

An email exchange with a financial advisor indicating your experience with a particular investment vehicle, news articles about the market segment heavily represented in your portfolio, and social media posts about companies in which you’ve invested or plan to invest can all generate much broader and deeper context to better inform your decision to buy or sell.

The post is a thumbnail sketch of the complexity of extracting value from unstructured data sources. As a sketch, it doesn’t offer much detail, but perhaps enough to avoid paying $2495 for the full report.

Infinit.e Overview

Monday, December 15th, 2014

Infinit.e Overview by Alex Piggott.

From the webpage:

Infinit.e is a scalable framework for collecting, storing, processing, retrieving, analyzing, and visualizing unstructured documents and structured records.

[Image omitted. Too small in my theme to be useful.]

Let’s provide some clarification on each of the often overloaded terms used in that previous sentence:

  • It is a "framework" (or "platform") because it is configurable and extensible by configuration (DSLs) or by various plug-in types – the default configuration is expected to be useful for a range of typical analysis applications but to get the most out of Infinit.e we anticipate it will usually be customized.
    • Another element of being a framework is being designed to integrate with existing infrastructures as well as run standalone.
  • By "scalable" we mean that new nodes (or even more granular: new components) can be added to meet increasing workload (either more users or more data), and that provision of new resources is near real-time.
    • Further, the use of fundamentally cloud-based components means that there are no bottlenecks at least to the ~100 node scale.
  • By "unstructured documents" we mean anything from a mostly-textual database record to a multi-page report – but Infinit.e’s "sweet spot" is in the range of database records that would correspond to a paragraph or more of text ("semi-structured records"), through web pages, to reports of 10 pages or less.
    • Smaller "structured records" are better handled by structured analysis tools (a very saturated space), though Infinit.e has the ability to do limited aggregation, processing and integration of such datasets. Larger reports can still be handled by Infinit.e, but will be most effective if broken up first.
  • By "processing" we mean the ability to apply complex logic to the data. Infinit.e provides some standard "enrichment", such as extraction of entities (people/places/organizations/etc) and simple statistics; and also the ability to "plug in" domain-specific processing modules using the Hadoop API.
  • By "retrieving" we mean the ability to search documents and return them in ranking order, but also to be able to retrieve "knowledge" aggregated over all documents matching the analyst’s query.
    • By "query"/"search" we mean the ability to form complex "questions about the data" using a DSL (Domain Specific Language).
  • By "analyzing" we mean the ability to apply domain-specific logic (visual/mathematical/heuristic/etc) to "knowledge" returned from a query.

We refer to the processing/retrieval/analysis/visualization chain as document-centric knowledge discovery:

  • "document-centric": means the basic unit of storage is a generically-formatted document (eg useful without knowledge of the specific data format in which it was encoded)
  • "knowledge discovery": means using statistical and text parsing algorithms to extract useful information from a set of documents that a human can interpret in order to understand the most important knowledge contained within that dataset.

One important aspect of Infinit.e is our generic data model. Data from all sources (from large unstructured documents to small structured records) is transformed into a single, simple data model that allows common queries, scoring algorithms, and analytics to be applied across the entire dataset. …
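To make the idea of a single data model concrete, here is a minimal Python sketch. The GenericDocument class, its fields, and the helper functions are my own hypothetical names for illustration, not Infinit.e’s actual schema or API. It only shows how a long report and a small structured record could share one shape so that one query function covers both.

```python
from dataclasses import dataclass, field


@dataclass
class GenericDocument:
    """Hypothetical common shape for any source (not Infinit.e's real schema)."""
    source: str
    title: str
    text: str
    metadata: dict = field(default_factory=dict)


def from_report(source: str, title: str, body: str) -> GenericDocument:
    """Wrap a large unstructured document as-is."""
    return GenericDocument(source=source, title=title, text=body)


def from_record(source: str, record: dict) -> GenericDocument:
    """Flatten a small structured record into the same shape."""
    text = "; ".join(f"{key}: {value}" for key, value in record.items())
    return GenericDocument(
        source=source,
        title=str(record.get("id", "")),
        text=text,
        metadata=dict(record),  # keep the original fields for later use
    )


def search(docs: list, term: str) -> list:
    """One case-insensitive query that runs across every source."""
    needle = term.lower()
    return [doc for doc in docs if needle in doc.text.lower()]


docs = [
    from_report("reports", "Q1 review", "Acme revenue rose."),
    from_record("crm", {"id": "42", "company": "Acme"}),
]
hits = search(docs, "acme")  # matches both the report and the record
```

The point of the sketch is the design choice rather than the code: once every source is mapped into one shape, queries and scoring only have to be written once.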

I saw this in a tweet by Gregory Piatetsky yesterday and so haven’t had time to download or test any of the features of Infinit.e.

The list of features is a very intriguing one.

Definitely worth the time to throw another VM on the box and try it out with a dataset of interest.

Would appreciate your doing the same and sending comments and/or pointers to posts with your experiences. Suspect we will have different favorite features and hit different limitations.


PS: Downloads.

Integrating Structured and Unstructured Data

Thursday, February 21st, 2013

Integrating Structured and Unstructured Data by David Loshin.

It’s a checklist report but David comes up with useful commentary on the following seven points:

  1. Document clearly defined business use cases.
  2. Employ collaborative tools for the analysis, use, and management of semantic metadata.
  3. Use pattern-based analysis tools for unstructured text.
  4. Build upon methods to derive meaning from content, context, and concept.
  5. Leverage commodity components for performance and scalability.
  6. Manage the data life cycle.
  7. Develop a flexible data architecture.

It’s not going to save you planning time but may keep you from overlooking important issues.

My only quibble is that David doesn’t call out data structures as needing defined and preserved semantics.

Data is a no-brainer, but the containers of data, dare I say “Hadoop silos,” need to have semantics defined as well.

Data or data containers without defined and preserved semantics are much more costly in the long run, both in lost opportunity costs and in after-the-fact integration costs.

New Query Tool Searches EHR Unstructured Data

Friday, February 15th, 2013

New Query Tool Searches EHR Unstructured Data by Ken Terry.

From the post:

A new electronic health record “intelligence platform” developed at Massachusetts General Hospital (MGH) and its parent organization, Partners Healthcare, is being touted as a solution to the problem of searching structured and unstructured data in EHRs for clinically useful information.

QPID Inc., a new firm spun off from Partners and backed by venture capital funds, is now selling its Web-based search engine to other healthcare organizations. Known as the Queriable Patient Inference Dossier (QPID), the tool is designed to allow clinicians to make ad hoc queries about particular patients and receive the desired information within seconds.

Today, 80% of stored health information is believed to be unstructured. It is trapped in free text such as physician notes and reports, discharge summaries, scanned documents and e-mail messages. One reason for the prevalence of unstructured data is that the standard methods for entering structured data, such as drop-down menus and check boxes, don’t fit into traditional physician workflow. Many doctors still dictate their notes, and the transcription goes into the EHR as free text.


QPID, which was first used in the radiology department of MGH in 2005, incorporates an EHR search engine, a library of search queries based on clinical concepts, and a programming system for application and query development. When a clinician submits a query, QPID presents the desired data in a “dashboard” format that includes abnormal results, contraindications and other alerts, Doyle said.

The core of the system is a form of natural language processing (NLP) based on a library encompassing “thousands and thousands” of clinical concepts, he said. Because it was developed collaboratively by physicians and scientists, QPID identifies medical concepts imbedded in unstructured data more effectively than do other NLP systems from IBM, Nuance and M*Modal, Doyle maintained.

Take away points for data search/integration solutions:

  1. 80% of stored health information (need)
  2. traditional methods for data entry….don’t fit into traditional physician workflow (user requirement)
  3. developed collaboratively by physicians and scientists (semantics originate with users, not top down)

I am interested in how QPID conforms (or not) to local medical terminology practices.

To duplicate their earlier success, conforming to local terminology practices is critical.

If for no other reason, it will give physicians and other health professionals “ownership” of the vocabulary and hence faith in the system.
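As a rough illustration of what conforming to local terminology could look like, here is a minimal Python sketch. It is not how QPID works; the LOCAL_TERMS table, the concept names, and the concepts_in_note function are all hypothetical. The idea is simply that each site supplies the phrases its own clinicians write, and a search over free-text notes resolves them to shared concepts.

```python
import re

# Hypothetical local vocabulary: each shared concept lists the phrases
# that clinicians at one site actually write in their notes.
LOCAL_TERMS = {
    "myocardial infarction": ["heart attack", "mi", "myocardial infarction"],
    "hypertension": ["htn", "high blood pressure", "hypertension"],
}


def concepts_in_note(note: str, vocabulary: dict) -> set:
    """Return the shared concepts whose local phrases appear in a free-text note."""
    found = set()
    for concept, phrases in vocabulary.items():
        for phrase in phrases:
            # Whole-word, case-insensitive match so "mi" does not hit "mild".
            if re.search(rf"\b{re.escape(phrase)}\b", note, flags=re.IGNORECASE):
                found.add(concept)
                break
    return found


note = "Pt with HTN, prior MI in 2009."
concepts = concepts_in_note(note, LOCAL_TERMS)
```

Because the vocabulary is plain data, the people who write the notes can maintain it themselves, which is the “ownership” I have in mind.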

Are Texts Unstructured Data? [Text Series]

Monday, November 5th, 2012

I ask because there is a pejorative tinge to “unstructured” when applied to texts. As though texts lack structure and can be improved by various schemes and designs.

Before reaching other aspects of such claims, I wanted to test the notion that texts are “unstructured.” If that is the case, then the Gettysburg Address, written:

Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

Should be equivalent to the Gettysburg Address Scrambled (via the 3by3by3 Text Scrambler):

not are or freedom—and great a consecrated gave by as devotion they created in portion great and fought forth lives that are should we have that is nor died our to not a struggled the testing here, equal.

Now nation, little this to all civil highly we in these remaining say who we work perish nation, endure. engaged brave resolve that for here. the The of shall Four remember take here we fitting we liberty, forget did to It living poor that above and have honored ground. consecrate, place that is they the for have nation, the nation, for so ago what here, score not us—that it that us dead for who to we those from resting here can fathers this.

But, the that here It war. a is far never do years not what of proper new this brought and shall last earth. conceived their be of nation to detract. men dedicated here battle-field dedicated men, and in come are altogether the cause of great can which whether nobly living, this to which dead, rather to birth that who a field, will advanced. add so proposition people, long The the they power new It any us long increased and on we met or sense, for might hallow dedicate these far devotion—that to not note, in a the We dead before on full government dedicated we have from that war, the conceived God, a thus measure can gave of be people, that here We continent so a it, world people, vain—that dedicate, under can shall live. but larger task have dedicated, unfinished final our rather, can seven

It may just be me but I don’t get the same semantics from the second version as the first.
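If you want to repeat the experiment on a text of your own, here is a minimal Python sketch. It is my own simple word shuffle, not the algorithm used by the 3by3by3 Text Scrambler, which I have not examined; the scramble_words name and the seed parameter are assumptions for illustration. It keeps every word (with its attached punctuation) and the number of words per paragraph, and destroys only the order.

```python
import random


def scramble_words(text: str, seed: int = 0) -> str:
    """Shuffle words across the whole text, keeping paragraph count and sizes."""
    # Paragraphs are assumed to be separated by one blank line.
    paragraphs = [p.split() for p in text.split("\n\n")]
    words = [word for paragraph in paragraphs for word in paragraph]

    # A seeded generator makes the scramble repeatable.
    random.Random(seed).shuffle(words)

    scrambled, start = [], 0
    for paragraph in paragraphs:
        end = start + len(paragraph)
        scrambled.append(" ".join(words[start:end]))
        start = end
    return "\n\n".join(scrambled)


sample = "Four score and seven years ago\n\nNow we are engaged in a great civil war"
shuffled = scramble_words(sample, seed=1)
```

The two versions contain exactly the same words, so whatever the scrambled one loses is carried by the order of the words, that is, by structure.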


My premise going forward is that texts are structured.