Data Integration « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 7, 2010

Datasets Galore! (Data.gov)

Filed under: Data Integration,Data Source,Linked Data,Subject Identity,Topic Maps — Patrick Durusau @ 9:56 am

Data.gov hosts 272,677 datasets.

LinkingOpenData will point you to a subset of 400 that is available as “Linked Data.”

I guess that means that the other 272,277 datasets are not “Linked Data.”

Fertile ground for topic maps.

Topic Maps don’t limit users to “[u]se URIs as names for things.” (Linked Data)

A topic map can use the identifiers that are in place in one or more of the 272,277 datasets and create mappings to one or more of the 400 datasets in “Linked Data.”

Without creating “Linked Data” or the overhead of the “303 Cloud.”
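A minimal sketch of that kind of mapping, assuming nothing more than a table of "these identifiers name the same subject" assertions. The class name `IdentityMap`, the dataset file names, and the example.org URI are all hypothetical; this illustrates the idea and is not any particular topic map engine.

```python
class IdentityMap:
    """Groups identifiers asserted to name the same subject (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Register unseen identifiers as their own group, then walk to the root.
        self._parent.setdefault(ident, ident)
        while self._parent[ident] != ident:
            ident = self._parent[ident]
        return ident

    def same_subject(self, a, b):
        """Record that identifiers a and b identify the same subject."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def identifiers_for(self, ident):
        """Return every identifier known to name the same subject as ident."""
        root = self._find(ident)
        return {i for i in list(self._parent) if self._find(i) == root}


# Hypothetical identifiers: (dataset, local code) pairs and a made-up URI.
ids = IdentityMap()
ids.same_subject(("agency_budget.csv", "EPA"), "http://example.org/id/epa")
ids.same_subject(("facility_list.xls", "Env. Protection Agency"),
                 ("agency_budget.csv", "EPA"))
```

The local identifiers stay exactly as they appear in their datasets; only the mapping is new.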

Which datasets look the most promising to you?


May 31, 2010

Semantic Web Challenge

Filed under: Conferences,Data Integration,Linked Data,Ontology,Semantic Web,Subject Identity,Topic Maps — Patrick Durusau @ 10:40 am

The Semantic Web Challenge 2010 details landed in my inbox this morning. My first reaction was to refine my spam filter. 😉 Just teasing. My second and more considered reaction was to think about the “challenge” in terms of topic maps.

Particularly because a posting from the Ontology Alignment Evaluation Initiative arrived the same day, in response to a posting from sameas.org.

I freely grant that URIs that cannot distinguish between identifiers and resources without 303 overhead are poor design. But the fact remains that there are many data sets, representing large numbers of subjects, that have even poorer subject identification practices. And there are no known approaches that are going to result in the conversion of those data sets.

Personally I am unwilling to wait until some new “perfect” language for data sweeps the planet and results in all data being converted into the “perfect” format. Anyone who thinks that is going to happen needs to stand with the end-of-the-world-in-2012 crowd. They have a lot in common. Magical thinking being one common trait.

The question for topic mappers to answer is this: how do we attribute, to whatever data language we are confronting, the characteristics that will enable us to reliably merge information about subjects in that format with other information in the same or another data language? Bear in mind that the necessary characteristics may vary from data language to data language.

Take the lack of a distinction between identifier and resource in the Semantic Web for instance. One easy step towards making use of such data would be to attribute to each URI the status of either being an identifier or a resource. I suspect, but cannot say, that the authors/users of those URIs know the answer to that question. It seems even possible that some sets of such URIs are all identifiers and if so marked/indicated in some fashion, they automatically become useful as just that, identifiers (without 303 overhead).
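As a rough sketch of what "attributing a status" to each URI could look like, assume the authors of the URIs can simply declare a role for each one. The `UriRole` enum, the `identifiers_only` helper, and the example.org URIs are hypothetical names made up for illustration.

```python
from enum import Enum


class UriRole(Enum):
    IDENTIFIER = "identifier"  # names a subject; no need to dereference it
    RESOURCE = "resource"      # addresses a retrievable document


def identifiers_only(roles):
    """Return, sorted, the URIs that have been marked as identifiers."""
    return sorted(uri for uri, role in roles.items()
                  if role is UriRole.IDENTIFIER)


# Hypothetical URIs and role assignments, as the URI authors might declare them.
declared_roles = {
    "http://example.org/id/alice": UriRole.IDENTIFIER,
    "http://example.org/doc/alice.html": UriRole.RESOURCE,
    "http://example.org/id/bob": UriRole.IDENTIFIER,
}
```

Once the role is recorded alongside the URI, a consumer can treat the marked URIs as identifiers without issuing a single HTTP request.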

As identifiers they may lack the resolution that topic maps provide to the human user, which enables them to better understand what subject is being identified. But, since topic maps can map additional identifiers together, when you encounter a deficient identifier, simply create another one for the same subject and map them together.

I think we need to view the Semantic Web data sets as opportunities to demonstrate how understanding subject identity, however that is indicated, is the linchpin to meaningful integration of data about subjects.

Bearing in mind that all our identifications, Semantic Web, topic map or otherwise, are always local, provisional and subject to improvement, in the eye of another.


May 22, 2010

Peter McBrien

Filed under: Data Integration,Heterogeneous Data,Hypergraphs,Researchers — Patrick Durusau @ 3:36 pm

Peter McBrien focuses on data modeling and integration.

Part of the AutoMed project on database integration. Recent work includes temporal constraints and P2P exchange of heterogeneous data.

Publications (dblp).

Homepage

Databases: Tools and Data for Teaching and Research: Useful collection of datasets and other materials on databases, data modeling and integration.

I first encountered Peter’s research in Comparing and Transforming Between Data Models via an Intermediate Hypergraph Data Model.

From a topic map perspective, the authors assumed the identities of the subjects to which their transformation rules were applied. Someone less familiar with the schema languages could have made other choices.

That’s the hard question, isn’t it? How to have reliable integration without presuming a common perspective/interpretation of the schema languages?

*****
PS: This is the first of many posts on researchers working in areas of interest to the topic maps community.


May 19, 2010

Context of Data?

Filed under: Context,Data Integration,Information Retrieval,Researchers — Patrick Durusau @ 6:02 am

Cristiana Bolchini and others in And What Can Context Do For Data? have started down an interesting path for exploration.

That all data exists in some context is an unremarkable observation until one considers how often that context can be stated or attributed to data, to say nothing of being used to filter or access that data.

Bolchini introduces the notion of a context dimension tree (CDT) which “models context in terms of a set of context dimensions, each capturing a different characteristic of the context.” (CACM, Nov. 2009, page 137) Note that dimensions can be decomposed into sub-trees for further analysis. Further operations combine these dimensions into the “context” of the data that is used to produce a particular view of the data.
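To make the tree shape concrete, here is a small sketch under my own assumptions: each dimension has a set of allowed values, and a value may open up further sub-dimensions. The `Dimension` class, the helper functions, and the sales-database dimensions are hypothetical and are not taken from the paper's formalism.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    # Maps each allowed value to the sub-dimensions that refine it.
    values: dict = field(default_factory=dict)


def active_dimensions(dimensions, context):
    """Yield dimensions reachable given the values chosen in context."""
    for dim in dimensions:
        yield dim
        chosen = context.get(dim.name)
        if chosen in dim.values:
            yield from active_dimensions(dim.values[chosen], context)


def is_valid_context(dimensions, context):
    """True if every chosen value belongs to a reachable dimension."""
    active = {dim.name: dim for dim in active_dimensions(dimensions, context)}
    return all(
        name in active and value in active[name].values
        for name, value in context.items()
    )


# Hypothetical dimensions for a sales database.
region = Dimension("region", {"north": [], "south": []})
tree = [
    Dimension("role", {"customer": [], "agent": [region]}),
    Dimension("time", {"today": [], "this_week": []}),
]
```

In this toy tree, `{"role": "agent", "region": "north"}` is a valid context, while `{"role": "customer", "region": "north"}` is not, because `region` only exists beneath the `agent` value.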

Not quite what is meant by scope in topic maps but something a bit more nuanced and subtle. I would argue (no surprise) that the context of a subject is part and parcel of its identity. And how much of that context we choose to represent will vary from project to project.

May 7, 2010

Cumulative Data Mining?

Filed under: Conferences,Data Integration — Patrick Durusau @ 8:12 pm

My impression is that data mining isn’t cumulative.

That is, when I read about new data mining techniques, even ones applied to a known data set like those used at TREC, they all make a fresh start on the data.

It is like having read a book and then, to find a particular passage, starting over at page one. That seems like a poor use of resources.

Another approach would be to record previously discovered relevant documents. Subsequent users can then benefit from what has been found. (Note the use of past tense.)
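A minimal sketch of such a record, assuming relevance judgments are stored per topic in a JSON file that later runs read before doing any work. The `FindingsLog` class, the file name, and the topic and document ids are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class FindingsLog:
    """Keeps relevant document ids per topic in a JSON file between runs."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.findings = json.loads(self.path.read_text())
        else:
            self.findings = {}

    def record(self, topic, doc_id):
        """Add doc_id to the topic's findings and save the log."""
        docs = self.findings.setdefault(topic, [])
        if doc_id not in docs:
            docs.append(doc_id)
        self.path.write_text(json.dumps(self.findings, indent=2))

    def known(self, topic):
        """Return the documents earlier runs already judged relevant."""
        return list(self.findings.get(topic, []))


# Usage: two "runs" sharing one log file (hypothetical topic and document ids).
log_path = Path(tempfile.mkdtemp()) / "findings.json"
first_run = FindingsLog(log_path)
first_run.record("topic-401", "doc-17")
first_run.record("topic-401", "doc-42")
second_run = FindingsLog(log_path)
```

The second run starts with what the first run found instead of starting over at page one.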

Can anyone suggest examples of cumulative data mining?


May 2, 2010

Topic Maps: A Value-Add Technology

Filed under: Data Integration,Heterogeneous Data,Marketing — Patrick Durusau @ 7:41 pm

It isn’t always clear that topic maps are a value-add, not a replacement technology.

Topic maps, by virtue of subject identity and mapping rules, can enhance existing information technologies and provide reliable interoperability between them. Without changing the underlying information technologies.

Topic maps are a value-add proposition because the structures of information technologies are subjects themselves. Database schemas and their fields, for instance, are subjects in the view of a topic map. Which means that users can map, seamlessly and reliably, between a relational database and a document archive that use completely different terminology.

Or a subscriber to several financial reporting services can create a topic map to filter and organize those reports. That is doable without a topic map, but what happens when another report service is added? What subjects were mapped together before? Topic maps are the value-add that can provide an answer to that question.
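A small sketch of fields-as-subjects, assuming the mapping is kept as explicit data rather than buried in integration code. The source names, field names, and subject labels are hypothetical, and the two sources are left untouched.

```python
from collections import defaultdict

# Hypothetical sources and fields; each (source, field) pair is mapped
# to the subject it represents.
FIELD_SUBJECTS = {
    ("crm_database", "cust_nm"): "customer name",
    ("document_archive", "Client"): "customer name",
    ("crm_database", "acct_no"): "account number",
    ("document_archive", "Account Ref"): "account number",
}


def by_subject(source, record):
    """Re-key a record by subject instead of by source-specific field name."""
    return {
        FIELD_SUBJECTS[(source, name)]: value
        for name, value in record.items()
        if (source, name) in FIELD_SUBJECTS
    }


def mapped_together():
    """Answer 'what was mapped together before?' by inverting the mapping."""
    groups = defaultdict(list)
    for source_field, subject in FIELD_SUBJECTS.items():
        groups[subject].append(source_field)
    return dict(groups)
```

Adding another source means adding entries to `FIELD_SUBJECTS`, and `mapped_together()` still reports every earlier decision.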


April 24, 2010

Explicit Semantic Analysis

Filed under: Classification,Data Integration,Information Retrieval,Semantics — Patrick Durusau @ 7:58 am

Explicit Semantic Analysis looks like another tool for the topic maps toolkit.

Not 100% accurate but close enough to give a topic map project involving a serious amount of text a running start.

Start with Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis by Evgeniy Gabrilovich and Shaul Markovitch.

There are 55 citations of this work (as of 2010-04-24), ranging from Geographic Information Retrieval and Beyond the Stars: Exploiting Free-Text User Reviews for Improving the Accuracy of Movie Recommendations (2009) to Explicit Versus Latent Concept Models for Cross-Language Information Retrieval.

I encountered this line of work while reading Combining Concept Based and Text Based Indexes for CLIR by Philipp Sorg and Philipp Cimiano (slides) from the 2009 Cross Language Evaluation Forum. (For any search engines, CLIR = Cross-Language Information Retrieval.) The Cross Language Evaluation Forum link is a general one because the site does not expose direct links to resources.

Quibble:

Evgeniy Gabrilovich and Shaul Markovitch say that:

We represent texts as a weighted mixture of a predetermined set of natural concepts, which are defined by humans themselves and can be easily explained. To achieve this aim, we use concepts defined by Wikipedia articles, e.g., COMPUTER SCIENCE, INDIA, or LANGUAGE.

and

The choice of encyclopedia articles as concepts is quite natural, as each article is focused on a single issue, which it discusses in detail.

Their use of “natural,” which I equate in academic writing to “…a miracle occurs…,” drew my attention. There are things we choose to treat as concepts or even subject representatives, but that hardly makes them “natural.” Most academic articles would claim (whether true or not) to be “…focused on a single issue, which it discusses in detail.”

Rather than “natural concepts,” describe them as the headers of Wikipedia texts. That is more accurate and lays the groundwork for investigation into the nature and length of headers and their impact on semantic mapping and information retrieval.
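For readers new to the technique, here is a toy sketch of the general idea of representing a text as a weighted vector over concepts and comparing texts by cosine similarity. It is not the authors' implementation: the three "articles" are a few made-up words each standing in for real Wikipedia text, and the TF-IDF style weighting is a simplification I chose for brevity.

```python
import math
from collections import Counter

# Toy stand-ins for Wikipedia articles (hypothetical text, keyed by header).
CONCEPTS = {
    "COMPUTER SCIENCE": "algorithm computer program software data algorithm",
    "INDIA": "india delhi mumbai country asia",
    "LANGUAGE": "language grammar word speech word",
}
TERM_COUNTS = {name: Counter(body.split()) for name, body in CONCEPTS.items()}


def concept_vector(text):
    """Weight each concept by the TF-IDF mass of the text's words in it."""
    vector = {}
    for name, counts in TERM_COUNTS.items():
        score = 0.0
        for word in text.lower().split():
            # Number of concept articles that contain this word.
            df = sum(1 for tc in TERM_COUNTS.values() if word in tc)
            if df:
                score += counts[word] * math.log(1 + len(CONCEPTS) / df)
        vector[name] = score
    return vector


def relatedness(text_a, text_b):
    """Cosine similarity of the two texts' concept vectors."""
    va, vb = concept_vector(text_a), concept_vector(text_b)
    dot = sum(va[name] * vb[name] for name in CONCEPTS)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Note that the dictionary keys here are just article headers, which is all the sketch needs them to be.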


April 20, 2010

Data Virtualization

Filed under: Data Integration,Heterogeneous Data,Subject Identity — Patrick Durusau @ 6:47 pm

I ran across a depressing quote today on data virtualization:

But data is distributed, heterogeneous, and often full of errors. Simply federating it insufficient. IT organizations must build a single, accurate, and consistent view of data, and deliver it precisely when it’s needed. Data virtualization needs to take this complexity into account.*

It is very important to have a single view of data for some purposes, but what happens when circumstances change and we need a different view than the one before?

Without explicit identification of subjects, all the IT effort that went into the first data integration project gets repeated in the next data integration project.

You would think that after sixty years of data migration, largely repeating the efforts of prior migrations, even business types would have caught on by this point.

Without explicit identification of subjects, there isn’t any way to “know” what subjects were being identified. Or to create reliable new mappings. So the cycle of data migrations goes on and on.

Break the cycle of data migrations, choose topic maps!

*Look under webinars at: http://www.informatica.com/Pages/index.aspx There wasn’t a direct link that I could post to lead you to the quote.

