Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 27, 2011

Linked Data Semantic Issues (same for topic maps?)

Filed under: Linked Data,LOD,Marketing,Merging,Topic Maps — Patrick Durusau @ 6:51 pm

Sebastian Schaffert posted a message on the public-lod@w3.org list that raised several issues about Linked Data. Issues that sound relevant to topic maps. See what you think.

From the post:

We are working together with many IT companies (with excellent software developers) and trying to convince them that Semantic Web technologies are superior for information integration. They are already overwhelmed when they have to understand that a database ID for an object is not enough. If they have to start distinguishing between the data object and the real world entity the object might be representing, they will be lost completely.

I guess being told that a “real world entity” may have different ways to be identified must seem to be the road to perdition.

Curious, because the “real world” is a messy place. Or is that the problem? That the world of developers is artificially “clean,” at least as far as identification and reference are concerned.

Perhaps CS programs need to train developers for encounters with the messy “real world.”

From the post:

> When you dereference the URL for a person (such as …/561666514#), you get back RDF. Our _expectation_, of course, is that that RDF will include some remarks about that person (…/561666514#), but there can be no guarantee of this, and no guarantee that it won’t include more information than you asked for. All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful. You add this to your knowledge of the world, and move on.

There I have my main problem. If I ask for “A”, I am not really interested in “B”. What our client implementation therefore does is to throw away everything that is about B and only keeps data about A. Which is – in case of the FB data – nothing. The reason why we do this is that often you will get back a large amount of irrelevant (to us) data even if you only requested information about a specific resource. I am not interested in the 999 other resources the service might also want to offer information about, I am only interested in the data I asked for. Also, you need to have some kind of “handle” on how to start working with the data you get back, like:
1. I ask for information about A, and the server gives me back what it knows about A (there, my expectation again …)
2. From the data I get, I specifically ask for some common properties, like A foaf:name ?N and do something with the bindings of N. Now how would I know how to even formulate the query if I ask for A but get back B?
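The client behaviour Schaffert describes (ask for A, throw away everything that is not about A) can be sketched in a few lines. The URIs and triples below are hypothetical, not taken from any real endpoint:

```python
# Keep only the statements about the resource we actually asked for,
# discarding the extra resources the service volunteered.

def filter_about(triples, subject):
    """Return the triples whose subject is the requested resource."""
    return [t for t in triples if t[0] == subject]

# A hypothetical dereferencing result mixing statements about A and B.
response = [
    ("http://example.org/A", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/B", "http://xmlns.com/foaf/0.1/name", "Bob"),
    ("http://example.org/A", "http://xmlns.com/foaf/0.1/knows", "http://example.org/B"),
]

about_a = filter_about(response, "http://example.org/A")
# If the server returned only statements about B, about_a would be
# empty -- exactly the "nothing" case complained about above.
```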

Ouch! That one cuts a little close. 😉

What about the folks who are “…not really interested in ‘B’”?

How do topic maps serve their interests?

Or have we decided for them that more information about a subject is better?

Or is that a matter of topic map design? What information to include?

That “merging” and what gets “merged” is a user/client decision?

That is how it works in practice simply due to time, resources, and other constraints.

Marketing questions:

How to discover data users would like to have appear with other data, prior to having a contract to do so?

Can we re-purpose search logs for that?

July 24, 2011

User Generated Content and the Social Graph
(thoughts on merging)

Filed under: FlockDB,Gizzard,Merging,Social Graphs — Patrick Durusau @ 6:48 pm

User Generated Content and the Social Graph by Chris Chandler.

Uses Twitter as a case study. Covers Gizzard and FlockDB, both of which were written in Scala.

Wants to coin the term “MoSQL” (More than SQL).

A couple of points of interest to topic mappers.

Relationships maintained as forward and backward edges. That is:

“A follows B” and

“B is followed by A”

Twitter design decision: Never delete edges!

Curious whether any of the current topic map implementations follow that strategy with regard to merging? Thinking that the act of creating a new item either references other existing items (in which case, create new versions of those items) or it is an entirely new item.

In either case, a subsequent call returns a set of matching items and if more than one, take the most recent one by timestamp.
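A minimal sketch of that strategy, assuming a simple keyed item store (not Twitter’s actual Gizzard/FlockDB design):

```python
# Append-only store: writes never delete or mutate, reads resolve
# multiple matching versions by taking the latest timestamp.

class AppendOnlyStore:
    def __init__(self):
        self.items = []  # every version of every item is retained

    def put(self, key, value, ts):
        # A "change" is just a new version; earlier versions remain.
        self.items.append({"key": key, "value": value, "ts": ts})

    def get(self, key):
        # A read returns the set of matching items; with more than one,
        # take the most recent by timestamp.
        matches = [i for i in self.items if i["key"] == key]
        if not matches:
            return None
        return max(matches, key=lambda i: i["ts"])["value"]
```

Deleting the edge “A follows B” then becomes writing a newer version that marks it inactive, rather than removing anything.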

As Chris says, disk space is cheap.

Works for Twitter.

May 22, 2011

reclab

Filed under: Data Mining,Merging,Topic Map Software — Patrick Durusau @ 5:34 pm

reclab

From the website:

If you can’t bring the data to the code, bring the code to the data.

How do we do this? Simple. RecLab solves the intractable problem of supplying real data to researchers by turning it on its head. Rather than attempt the impossible task of bringing sensitive, proprietary retail data to innovative code, RecLab brings the code to the data on live retailing sites. This is done via the RichRelevance cloud environment, a large-scale, distributed environment that is the backbone of the leading dynamic personalization technology solution for the web’s top retailers.

Two things occurred to me while at this site:

1) Does this foreshadow enterprises being able to conduct competitions on the analysis/mining/processing (BI) of their data, rather than buying solutions and then learning the potential of an acquired solution?

2) For topic maps, is this a way to create competition between “merging” algorithms on “sensitive, proprietary” data? After all, it is users who decide whether appropriate “merging” has taken place.

BTW, this site has links to a contest with a $1 million prize. Just in case you are using topic maps to power recommender systems.

April 30, 2011

When Data Mining Goes Horribly Wrong

Filed under: Data Mining,Merging,Search Engines — Patrick Durusau @ 10:22 am

In When Data Mining Goes Horribly Wrong, Matthew Hurst brings us a cautionary tale about what can happen when “merging” decisions are made badly.

From the blog:

Consequently, when you see a details page – either on Google, Bing or some other search engine with a local search product – you are seeing information synthesized from multiple sources. Of course, these sources may differ in terms of their quality and, as a result, the values they provide for certain attributes.

When combining data from different sources, decisions have to be made as to firstly when to match (that is to say, assert that the data is about the same real world entity) and secondly how to merge (for example: should you take the phone number found in one source or another?).

This process – the conflation of data – is where you either succeed or fail.

Read Matthew’s post for encouraging signs that there is plenty of room for the use of topic maps.

What I find particularly amusing is that repair of the merging in this case doesn’t help prevent it from happening again and again.

Not much of a repair if the problem continues to happen elsewhere.

April 16, 2011

MERGE Ahead

Filed under: Marketing,Merging,SQL — Patrick Durusau @ 2:44 pm

Merge Ahead: Introducing the DB2 for i SQL MERGE statement

Karl Hanson of IBM writes:

As any shade tree mechanic or home improvement handyman knows, you can never have too many tools. Sure, you can sometimes get by with inadequate tools on hand, but the right tools can help complete a job in a simpler, safer, and quicker way. The same is true in programming. New in DB2 for i 7.1, the MERGE statement is a handy tool to synchronize data in two tables. But as you will learn later, it can also do more. You might think of MERGE as doing the same thing you could do by writing a program, but with less work and with simpler notation.
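For readers who have not met MERGE, its core update-or-insert behaviour can be mimicked in a few lines of Python. The table contents and key column below are hypothetical, and the real statement can do more (as the article notes):

```python
# Python analogue of SQL MERGE: fold source rows into a target table,
# updating rows that match on the key and inserting those that don't.

def merge(target, source, key):
    index = {row[key]: i for i, row in enumerate(target)}
    for row in source:
        if row[key] in index:            # WHEN MATCHED THEN UPDATE
            target[index[row[key]]] = row
        else:                            # WHEN NOT MATCHED THEN INSERT
            target.append(row)
    return target
```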

Don’t panic: it isn’t merging in the topic map sense, but it does show there are market opportunities for what is a trivial task for a topic map.

That implies to me there are also opportunities for more complex tasks, suitable only for topic maps.

February 21, 2011

Introducing Vector Maps

Filed under: Merging,Vectors — Patrick Durusau @ 4:09 pm

Introducing Vector Maps

From the post:

Modern distributed data stores such as CouchDB and Riak, use variants of Multi-Version Concurrency Control to detect conflicting database updates and present these as multi-valued responses.

So, if I and my buddy Ola both update the same data record concurrently, the result may be that the data record now has multiple values – both mine and Ola’s – and it will be up to the eventual consumer of the data record to resolve the problem. The exact schemes used to manage the MVCC differs from system to system, but the effect is the same; the client is left with the turd to sort out.

This led me to an idea: trying to create a data structure which is, by its very definition, able to be merged, and then storing such data in these kinds of databases. So, if you are handed two versions, there is a reconciliation function that will take those two records and “merge” them into one sound record, by some definition of “sound”.

Seems to me that reconciliation should not be limited to records differing based on time stamps. 😉

Will have to think about this one for a while but it looks deeply similar to issues we are confronting in topic maps.
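One toy version of such a self-mergeable structure, assuming per-entry version counters and a last-writer-wins rule as the definition of “sound” (and, per the point above, versions need not be timestamps):

```python
# Each entry carries a version; reconciliation works entry-by-entry,
# keeping the higher-versioned value, so any two replicas can always
# be merged into one record.

def reconcile(a, b):
    """Merge two {key: (version, value)} maps into one record."""
    merged = dict(a)
    for key, (ver, val) in b.items():
        if key not in merged or ver > merged[key][0]:
            merged[key] = (ver, val)
    return merged

mine = {"name": (2, "Alice"), "city": (1, "Oslo")}
olas = {"name": (1, "Alicia"), "phone": (1, "555-0100")}
record = reconcile(mine, olas)
# Non-conflicting concurrent updates both survive; the conflicting
# "name" entry resolves to the higher version.
```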

BTW, saw this first at Alex Popescu’s myNoSQL blog.

February 13, 2011

Programming Scala

Filed under: Merging,Scala — Patrick Durusau @ 6:51 am

Programming Scala by Dean Wampler and Alex Payne.

Experimental book at O’Reilly Labs.

Seems to be the day for not-strictly topic map posts but I think Scala is going to be important both for topic maps as well as scalable programming in general.

I suspect that reflects my personal view that functional approaches to merging are more likely to be successful with topic maps than approaches that rely upon mutable objects.

Comments about your experience with Scala, particularly with regard to topic maps and with this book most welcome!

January 26, 2011

Dimensions to use to compare NoSQL data stores – Queries to Produce Topic Maps

Filed under: Merging,NoSQL,TMDM,Topic Map Software,Topic Maps — Patrick Durusau @ 9:08 am

Dimensions to use to compare NoSQL data stores

A post by Huan Liu to read after Billy Newport’s Enterprise NoSQL: Silver Bullet or Poison Pill? – (Unique Questions?)

A very good quick summary of the dimensions to consider. As Liu makes clear, choosing the right data store is a complex issue.

I would use this as an overview article to get everyone on a common ground for a discussion of NoSQL data stores.

At least that way, misunderstandings will be on some other topic of discussion.

BTW, if you think about Newport’s point (however correct/incorrect) that NoSQL databases enable only one query, doesn’t that fit the production of a topic map?

That is, there is a defined set of constructs with defined conditions of equivalence. So the only query in that regard has been fixed.

Questions remain about querying the data that a topic map holds, as distinct from the query that results in merged topics, associations, etc.

In some processing models, that query is performed and a merged artifact is produced.

Following the same data model rules, I would prefer to allow those queries be made on an ad hoc basis. So that users are always presented with the latest merged results.

Same rules as the TMDM, just a question of when they fire.

Questions:

  1. NoSQL – What other general compare/dimension articles would you recommend as common ground builders? (1-3 citations)
  2. Topic maps as artifacts – What other data processing approaches produce static artifacts for querying? (3-5 pages, citations)
  3. Topic maps as query results – What are the concerns and benefits of topic maps as query results? (3-5 pages, citations)

January 24, 2011

Merge Me Baby One More Time!

Filed under: Merging,R — Patrick Durusau @ 5:41 pm

Merge Me Baby One More Time!

Ok, I admit the title caught my attention. 😉

Covers the use of merge_data.r for quick and dirty merges of a data set that has diverged.

Good to know if you don’t have a situation where the full overhead of a topic map solution is required.

Do note that the article passes over the question of subject identity or the correctness of the merge without even a pause.

That works, but can also mean that when you have forgotten why the data is arranged as it is, well…, that’s life without subject identity.

January 19, 2011

NCIBI – National Center for Integrative Biomedical Informatics

Filed under: Bioinformatics,Biomedical,Heterogeneous Data,Merging — Patrick Durusau @ 2:13 pm

NCIBI – National Center for Integrative Biomedical Informatics

From the website:

The National Center for Integrative Biomedical Informatics (NCIBI) is one of seven National Centers for Biomedical Computing (NCBC) within the NIH Roadmap. The NCBC program is focused on building a universal computing infrastructure designed to speed progress in biomedical research. NCIBI was founded in September 2005 and is based at the University of Michigan as part of the Center for Computational Medicine and Bioinformatics (CCMB).

Note the use of integrative in the name of the center.

They “get” that part.

They are in fact working on mappings to support integration of data even as I write these lines.

There is a lot to be learned about their strategies for integration and to better understand the integration issues they face in this domain. This site is a good starting place to do both.

MIMI Merge Process

Filed under: Bioinformatics,Biomedical,Data Source,Merging — Patrick Durusau @ 2:01 pm

Michigan Molecular Interactions

From the website:

MiMI provides access to the knowledge and data merged and integrated from numerous protein interactions databases. It augments this information from many other biological sources. MiMI merges data from these sources with “deep integration” (see The MiMI Merge Process section) into its single database. A simple yet powerful user interface enables you to query the database, freeing you from the onerous task of having to know the data format or having to learn a query language. MiMI allows you to query all data, whether corroborative or contradictory, and specify which sources to utilize.

MiMI displays results of your queries in easy-to-browse interfaces and provides you with workspaces to explore and analyze the results. Among these workspaces is an interactive network of protein-protein interactions displayed in Cytoscape and accessed through MiMI via a MiMI Cytoscape plug-in.

MiMI gives you access to more information than you can get from any one protein interaction source such as:

  • Vetted data on genes, attributes, interactions, literature citations, compounds, and annotated text extracts through natural language processing (NLP)
  • Linkouts to integrated NCIBI tools to: analyze overrepresented MeSH terms for genes of interest, read additional NLP-mined text passages, and explore interactive graphics of networks of interactions
  • Linkouts to PubMed and NCIBI’s MiSearch interface to PubMed for better relevance rankings
  • Querying by keywords, genes, lists or interactions
  • Provenance tracking
  • Quick views of missing information across databases

I found the site looking for tracking of provenance after merging and then saw the following description of merging:

MIMI Merge Process

Protein interaction data exists in a number of repositories. Each repository has its own data format, molecule identifier, and supplementary information. MiMI assists scientists searching through this overwhelming amount of protein interaction data. MiMI gathers data from well-known protein interaction databases and deep-merges the information.

Utilizing an identity function, molecules that may have different identifiers but represent the same real-world object are merged. Thus, MiMI allows the user to retrieve information from many different databases at once, highlighting complementary and contradictory information.

There are several steps needed to create the final MiMI dataset. They are:

1. The original source datasets are obtained, and transformed into the MiMI schema, except KEGG, NCBI Gene, Uniprot, Ensembl.
2. Molecules that can be rolled into a gene are annotated to that gene record.
3. Using all known identifiers of a merged molecule, sources such as Organelle DB or miBLAST, are queried to annotate specific molecular fields.
4. The resulting dataset is loaded into a relational database.

Because this is an automated process, and no curation occurs, any errors or misnomers in the original data sources will also exist in MiMI. For example, if a source indicates that the organism is unknown, MiMI will as well.

If you find that a molecule has been incorrectly merged under a gene record, please contact us immediately. Because MiMI is completely automatically generated, and there is no data curation, it is possible that we have merged molecules with gene records incorrectly. If made aware of the error, we can and will correct the situation. Please report any problems of this kind to mimi-help@umich.edu.
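The identity-function merge described above might be sketched as follows. The identifiers and source names are invented, not actual MiMI data, and a production version would need union-find to handle identities that only become apparent transitively:

```python
# Merge records that share any identifier, on the assumption that a
# shared identifier means the same real-world molecule.

def merge_by_identity(records):
    merged = []
    for rec in records:
        ids = set(rec["ids"])
        for group in merged:
            if ids & group["ids"]:        # shared id => same molecule
                group["ids"] |= ids
                group["sources"] |= set(rec["sources"])
                break
        else:
            merged.append({"ids": ids, "sources": set(rec["sources"])})
    return merged
```

Note how merged records retain their source lists, which is what makes the later provenance question answerable at all.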

Tracking provenance is going to be a serious requirement for mission-critical, financial, and medical topic map use.

January 7, 2011

Provenance for Aggregate Queries

Filed under: Aggregation,Merging,Query Language,TMQL — Patrick Durusau @ 7:19 am

Provenance for Aggregate Queries by Yael Amsterdamer, Daniel Deutch, and Val Tannen.

Abstract:

We study in this paper provenance information for queries with aggregation. Provenance information was studied in the context of various query languages that do not allow for aggregation, and recent work has suggested to capture provenance by annotating the different database tuples with elements of a commutative semiring and propagating the annotations through query evaluation. We show that aggregate queries pose novel challenges rendering this approach inapplicable. Consequently, we propose a new approach, where we annotate with provenance information not just tuples but also the individual values within tuples, using provenance to describe the values computation. We realize this approach in a concrete construction, first for “simple” queries where the aggregation operator is the last one applied, and then for arbitrary (positive) relational algebra queries with aggregation; the latter queries are shown to be more challenging in this context. Finally, we use aggregation to encode queries with difference, and study the semantics obtained for such queries on provenance annotated databases.

Not reading for the faint of heart.

But provenance for merging is one obvious application of this paper.

For that matter, provenance should also be a consideration for TMQL.
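The paper’s central move, annotating individual values rather than whole tuples, can be caricatured in a few lines; the source labels below are hypothetical, and the real construction uses semiring-valued annotations rather than plain lists:

```python
# An aggregate that carries value-level provenance: the result records
# which annotated value contributed what to the computed total.

def annotated_sum(rows):
    """rows: list of (value, annotation) pairs. Returns (total, provenance)."""
    total = sum(value for value, _ in rows)
    provenance = list(rows)  # each value paired with its tuple's annotation
    return total, provenance
```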

October 29, 2010

Ordnance Survey Linked Data

Filed under: Authoring Topic Maps,Mapping,Merging,Topic Maps — Patrick Durusau @ 5:40 am

Ordnance Survey Linked Data.

Description:

Ordnance Survey is Great Britain’s national mapping agency, providing the most accurate and up-to-date geographic data, relied on by government, business and individuals. OS OpenData is the opening up of Ordnance Survey data as part of the drive to increase innovation and support the “Making Public Data Public” initiative. As part of this initiative Ordnance Survey has published a number of its products as Linked Data. Linked Data is a growing part of the Web where data is published on the Web and then linked to other published data in much the same way that web pages are interlinked using hypertext. The term Linked Data is used to describe a method of exposing, sharing, and connecting data via URIs on the Web….

Let’s use topic maps to connect subjects that don’t have URIs.

Subject mapping exercise:

1. Connect 5 subjects from the Domesday Book
2. Connect 5 subjects from The Shakespeare Paper Trail: The Early Years and/or The Shakespeare Paper Trail: The Later Years
3. Connect 5 subjects from WW2 People’s War (you could do occurrences but try for something more imaginative)
4. Connect 5 subjects from some other period of English history.
5. Suggest other linked data sources and sources of subjects for subject mapping (extra credit)

October 16, 2010

Incidence of Merging?

Filed under: Merging,Topic Map Software,Topic Maps — Patrick Durusau @ 10:20 am

Is there an average incidence of merging?

I know the rhetoric well enough, discover new relationships, subjects, cross domain or even semantic universe boundaries, etc., but ok, how often?

Take for example the Opera and CIA World Fact Book topic maps. When they merge, how many topics actually merge?

One expects only the geographic locations, which is useful but what percentage of the overall topics does that represent? In either map?

Questions:

1. Is incidence of merging a useful measurement? Yes/No, Why?
2. Is there something beyond incidence of merging that you would measure for merged topic maps?
3. How would you evaluate the benefits of merging two (or more) topic maps?
4. How would you plan for merging in a topic map design?

(Either of the last two questions can be expanded into design projects.)

September 10, 2010

LNCS Volume 6263: Data Warehousing and Knowledge Discovery

Filed under: Database,Graphs,Heterogeneous Data,Indexing,Merging — Patrick Durusau @ 8:20 pm

LNCS Volume 6263: Data Warehousing and Knowledge Discovery, edited by Torben Bach Pedersen, Mukesh K. Mohania, and A Min Tjoa, has a number of articles of interest to the topic map community.

Here are five (5) that caught my eye:

August 29, 2010

Journal of Artificial Intelligence Research – Journal

Filed under: Data Integration,Merging,Subject Identity — Patrick Durusau @ 7:23 pm

Journal of Artificial Intelligence Research is one of the oldest electronic journals on the Internet, not to mention that it offers free access to all its contents.

While some of the articles have titles like “The Strategy-Proofness Landscape of Merging”, P. Everaere, S. Konieczny and P. Marquis (2007), Volume 28, pages 49-105, they raise issues that sophisticated topic mappers will need to be able to discuss intelligently with data analysts.

Information Fusion – Journal

Filed under: Data Integration,Merging,Subject Identity — Patrick Durusau @ 6:59 pm

Information Fusion covers a number of areas of direct interest to topic map researchers and developers. An incomplete list includes:

• Fusion Learning In Imperfect, Imprecise And Incomplete Environments
• Intelligent Techniques For Fusion Processing
• Fusion System Design And Algorithmic Issues
• Fusion System Computational Resources and Demands Optimization
• Special Purpose Hardware Dedicated To Fusion Applications

If you are considering this as a publication venue, look into their “open access” policy (the quotes are theirs) before making that choice.

August 28, 2010

Annotated Computer Vision Bibliography

Filed under: Merging,Searching,Subject Identity — Patrick Durusau @ 5:33 am

Annotated Computer Vision Bibliography in its 17th year on the Internet!

Relevant to topic maps, among other reasons:

1. Users visually distinguishing subjects in topic map use/authoring
2. Pattern recognition, clustering, related techniques (chapter 14)
3. Subject recognition of various types

Suggestions of specific articles of interest to topic mappers greatly appreciated!

August 27, 2010

A Comparison of Merging Operators in Possibilistic Logic

Filed under: Mapping,Merging,Subject Identity — Patrick Durusau @ 7:26 am

A Comparison of Merging Operators in Possibilistic Logic by Guilin Qi, Weiru Liu and David Bell has topic maps written all over it, doesn’t it?

The article is not yet available on my university server but I will keep a watch for it and will report back when I have more details. The author links are to their DBLP records.

Try the following searches on “merging operators” in DBLP and CiteSeerX:

******

Update: 28 August 2010

A Comparison of Merging Operators in Possibilistic Logic (another source for the paper). More comments to follow.

******

Update: 28 August 2010

Qi’s PhD thesis (2006) FUSION OF UNCERTAIN INFORMATION IN THE FRAMEWORK OF POSSIBILISTIC LOGIC starts with:

Possibilistic logic provides a good framework for dealing with merging problems when information is pervaded with uncertainty and inconsistency. Many merging operators in possibilistic logic have been proposed. However, there are still some important problems left unsolved.

Makes me curious about the “Many merging operators….” No promises of when but it would be interesting to start a list of those both within and without possibilistic logic.

July 12, 2010

Set-Similarity and Topic Maps

Filed under: Mapping,Merging,TMRM,Topic Maps — Patrick Durusau @ 7:09 pm

Set-similarity offers a useful way to think about merging in a topic maps context. The measure of self-similarity that we want for merging in topic maps is “same subject.”

In the TMDM, self-similarity for topics is:

• at least one equal string in their [subject identifiers] properties,
• at least one equal string in their [item identifiers] properties,
• at least one equal string in their [subject locators] properties,
• an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
• the same information item in their [reified] properties.
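A sketch (not a conformant TMDM implementation) of those conditions as a merge predicate over two topic items, with each topic represented as a plain dict of identifier sets:

```python
# True when two topic items should merge under the equality rules above.

def should_merge(a, b):
    """a, b: dicts of identifier sets plus an optional 'reified' item."""
    return bool(
        a["subject_identifiers"] & b["subject_identifiers"]
        or a["item_identifiers"] & b["item_identifiers"]
        or a["subject_locators"] & b["subject_locators"]
        or a["subject_identifiers"] & b["item_identifiers"]
        or a["item_identifiers"] & b["subject_identifiers"]
        or (a["reified"] is not None and a["reified"] == b["reified"])
    )
```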

The research literature makes it clear that judging self-similarity isn’t subject to one test or even a handful of them for all purposes. Not to mention that more often than not, self-similarity is being judged on high dimensional data.

Despite clever approaches and quite frankly amazing results, I have yet to run across sustained discussion of how to interchange self-similarity tests. Perhaps it is my markup background but that seems like the sort of capability that would be widely desired.

The issue of interchangeable self-similarity tests looks like an area where JTC 1/SC 34/WG 3 could make a real contribution.

May 11, 2010

Topic Maps Are…

Filed under: Marketing,Merging,Topic Maps — Patrick Durusau @ 6:27 pm

…the results of searching.

The Watching the Watchers topic map is the result of searching. Information I gained by searching is recorded in the topic map.

Does that seem trivial?

Can you name one major search engine that preserves your analysis of search results?

Or that makes it possible to reliably merge your analysis with that of a co-worker?

Maybe being the result of searching isn’t a trivial thing.

April 21, 2010

Complex Merging Conditions In XTM

Filed under: Merging,Subject Identifiers,TMDM,Topic Maps — Patrick Durusau @ 6:09 pm

We need a way to merge topics for reasons that are not specified by the TMDM.

For example, I want to merge topics that have equivalent occurrences of type ISBN. Library catalogs in different languages may only share the ISBN of an item as a common characteristic. A topic map generated from each of them could have the ISBN as an occurrence on each topic.

I am assuming each topic map relies upon library identifiers for “standard” merging because that is typically how library systems bind the information for a particular item together.

So, how to make merging occur when there are equivalent occurrences of type ISBN?

Solution: As part of the process of creating the topics, add a subject identifier based on the occurrences of type ISBN that results in equivalent subject identifiers when the ISBN numbers are equivalent. That results in topics that share equivalent occurrences of type ISBN merging.

While the illustration is with one occurrence, there is no limit on the number of properties of a topic that can be considered in the creation of a subject identifier that will result in merging. Such subject identifiers, when resolved, should document the basis for their assignment to a topic.
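The construction might be sketched as follows; the URI prefix, record layout, and normalization step are assumptions for illustration, not part of any standard:

```python
# Derive a subject identifier deterministically from an ISBN occurrence,
# so topics from different catalogs that share an ISBN end up with equal
# subject identifiers and merge under the standard TMDM rules.

def isbn_subject_identifier(isbn):
    # Normalize so hyphenation and case differences don't defeat equality.
    normalized = isbn.replace("-", "").replace(" ", "").upper()
    return "http://example.org/isbn/" + normalized

def add_isbn_identifiers(topic):
    for occ in topic["occurrences"]:
        if occ["type"] == "ISBN":
            topic["subject_identifiers"].add(
                isbn_subject_identifier(occ["value"]))
    return topic
```

The same pattern extends to identifiers derived from several properties at once: concatenate the normalized values into one string before minting the URI.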

BTW, assuming a future TMQL that enables such merging, note this technique will work with XTM 1.0 topic map engines.

Caution: This solution does not work for properties that can be determined only after the topic map has been constructed. Such as participation in particular associations or the playing of particular roles.

PS: There is a modification of this technique to deal with participation in associations or the playing of particular roles. More on that in another post.
