Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 10, 2010

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction

Filed under: Data Mining,Dimension Reduction,Heterogeneous Data,High Dimensionality — Patrick Durusau @ 9:43 am

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction Authors: Panagis Magdalinos, Michalis Vazirgiannis, Dialecti Valsamou Keywords: distributed non linear dimensionality reduction, NLDR, distributed dimensionality reduction, DDR, distributed data mining, DDM, dimensionality reduction, DR, Distributed Isomap, D-Isomap, C-Isomap, L-Isomap

Abstract:

Data mining tasks results are usually improved by reducing the dimensionality of data. This improvement however is achieved harder in the case that data lay on a non linear manifold and are distributed across network nodes. Although numerous algorithms for distributed dimensionality reduction have been proposed, all assume that data reside in a linear space. In order to address the non-linear case, we introduce D-Isomap, a novel distributed non linear dimensionality reduction algorithm, particularly applicable in large scale, structured peer-to-peer networks. Apart from unfolding a non linear manifold, our algorithm is capable of approximate reconstruction of the global dataset at peer level a very attractive feature for distributed data mining problems. We extensively evaluate its performance through experiments on both artificial and real world datasets. The obtained results show the suitability and viability of our approach for knowledge discovery in distributed environments.

Data mining in peer-to-peer networks will face topic map authors sooner or later.

Not only a useful discussion of the issues, but, the authors have posted source code and data sets used in the article as well:

http://www.db-net.aueb.gr/panagis/PAKDD2010/
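To get a feel for what Isomap does on a non-linear manifold (the centralized computation that D-Isomap distributes across peers), here is a minimal sketch using scikit-learn's Isomap on the classic swiss-roll dataset. The dataset and parameter choices are mine, for illustration only, not taken from the paper.

# Minimal, illustrative sketch of (centralized) Isomap on a non-linear manifold.
# D-Isomap distributes this kind of computation across network nodes; this only
# shows the basic unfolding step, with parameters chosen arbitrarily.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)

# Build the k-nearest-neighbor graph, estimate geodesic distances,
# and embed the data in 2 dimensions.
embedding = Isomap(n_neighbors=10, n_components=2)
X_2d = embedding.fit_transform(X)

print(X.shape, "->", X_2d.shape)  # (1000, 3) -> (1000, 2)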

October 9, 2010

Evolutionary Clustering and Analysis of Heterogeneous Information Networks

Filed under: Clustering,Evoluntionary,Heterogeneous Data,Networks — Patrick Durusau @ 4:48 pm

Evolutionary Clustering and Analysis of Heterogeneous Information Networks Authors: Manish Gupta; Charu Aggarwal; Jiawei Han; Yizhou Sun Keywords: ENetClus, evolutionary clustering, typed-clustering, DBLP, bibliographic networks

Abstract:

In this paper, we study the problem of evolutionary clustering of multi-typed objects in a heterogeneous bibliographic network. The traditional methods of homogeneous clustering methods do not result in a good typed-clustering. The design of heterogeneous methods for clustering can help us better understand the evolution of each of the types apart from the evolution of the network as a whole. In fact, the problem of clustering and evolution diagnosis are closely related because of the ability of the clustering process to summarize the network and provide insights into the changes in the objects over time. We present such a tightly integrated method for clustering and evolution diagnosis of heterogeneous bibliographic information networks. We present an algorithm, ENetClus, which performs such an agglomerative evolutionary clustering which is able to show variations in the clusters over time with a temporal smoothness approach. Previous work on clustering networks is either based on homogeneous graphs with evolution, or it does not account for evolution in the process of clustering heterogeneous networks. This paper provides the first framework for evolution-sensitive clustering and diagnosis of heterogeneous information networks. The ENetClus algorithm generates consistent typed-clusterings across time, which can be used for further evolution diagnosis and insights. The framework of the algorithm is specifically designed in order to facilitate insights about the evolution process. We use this technique in order to provide novel insights about bibliographic information networks.

Exploring heterogeneous information networks is a first step towards discovery/recognition of new subjects. What other novel insights will emerge from work on heterogeneous information networks is a question only future research can answer.
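ENetClus itself is not, as far as I know, publicly packaged, but the flavor of temporal smoothness, letting the clustering at time t be influenced by the clustering at time t-1, can be sketched generically. The snippet below only illustrates that idea by warm-starting k-means on each snapshot with the previous snapshot's centroids; it is not the authors' algorithm and the data is fabricated.

# Illustrative sketch of temporal smoothness in evolutionary clustering:
# cluster each yearly snapshot, warm-starting from the previous year's
# centroids so clusters vary smoothly over time. This is NOT ENetClus,
# just a toy stand-in for the general idea, on made-up data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
snapshots = [rng.normal(loc=t * 0.1, size=(200, 5)) for t in range(3)]  # fake yearly feature vectors

k = 4
prev_centroids = None
for year, X in enumerate(snapshots):
    if prev_centroids is None:
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
    else:
        # temporal smoothness: start from last year's cluster centers
        km = KMeans(n_clusters=k, init=prev_centroids, n_init=1, random_state=0)
    labels = km.fit_predict(X)
    prev_centroids = km.cluster_centers_
    print(f"snapshot {year}: cluster sizes {np.bincount(labels)}")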

October 3, 2010

Designing a thesaurus-based comparison search interface for linked cultural heritage sources

Filed under: Classification,Heterogeneous Data,Interface Research/Design,Thesaurus — Patrick Durusau @ 7:15 am

Designing a thesaurus-based comparison search interface for linked cultural heritage sources Authors: Alia Amin, Michiel Hildebrand, Jacco van Ossenbruggen, Lynda Hardman Keywords: comparison search, thesauri, cultural heritage

Prototype: LISA, e-culture.multimedian.nl

Abstract:

Comparison search is an information seeking task where a user examines individual items or sets of items for similarities and differences. While this is a known information need among experts and knowledge workers, appropriate tools are not available. In this paper, we discuss comparison search in the cultural heritage domain, a domain characterized by large, rich and heterogeneous data sets, where different organizations deploy different schemata and terminologies to describe their artifacts. This diversity makes meaningful comparison difficult. We developed a thesaurus-based comparison search application called LISA, a tool that allows a user to search, select and compare sets of artifacts. Different visualizations allow users to use different comparison strategies to cope with the underlying heterogeneous data and the complexity of the search tasks. We conducted two user studies. A preliminary study identifies the problems experts face while performing comparison search tasks. A second user study examines the effectiveness of LISA in helping to solve comparison search tasks. The main contribution of this paper is to establish design guidelines for the data and interface of a comparison search application. Moreover, we offer insights into when thesauri and metadata are appropriate for use in such applications.

User-centric project that develops an interface into heterogeneous data sets.

What I would characterize as pre-mapping, that is, no “canonical” mapping has yet been established.

It is perhaps a good idea to preserve a pre-mapping stage, as any mapping represents but one choice among many.

September 30, 2010

Entity Resolution – Journal of Data and Information Quality

Filed under: Entity Resolution,Heterogeneous Data,Subject Identity — Patrick Durusau @ 5:37 am

Special Issue on Entity Resolution.

The Journal of Data and Information Quality is a new journal from the ACM.

Calls for papers should not require ACM accounts for viewing.

I have re-ordered (to put the important stuff first) and reproduced the call below:

Important Dates

  • Submissions due: December 15, 2010
  • Acceptance Notification: April 30, 2011
  • Final Paper Due: June 30, 2011
  • Target Date for Special Issue: September 2011

Resources for authors include:

Topics of interest include, but are not limited to:

  • ER impacts on Information Quality and impacts of Information Quality
    on ER
  • ER frameworks and architectures
  • ER outcome/performance assessment and metrics
  • ER in special application domains and contexts
  • ER and high-performance computing (HPC)
  • ER education
  • ER case studies
  • Theoretical frameworks for ER and entity-based integration
  • Method and techniques for
    • Entity reference extraction
    • Entity reference resolution
    • Entity identity management and identity resolution
    • Entity relationship analysis

Entity resolution (ER) is a key process for improving data quality in data integration in modern information systems. ER covers a wide range of approaches to entity-based integration, known variously as merge/purge, record de-duplication, heterogeneous join, identity resolution, and customer recognition. More broadly, ER also includes a number of important pre- and post-integration activities, such as entity reference extraction and entity relationship analysis. Based on direct record matching strategies, such as those described by the Fellegi-Sunter Model, new theoretical frameworks are evolving to describe ER processes and outcomes that include other types of inferred and asserted reference linking techniques. Businesses have long recognized that the quality of their ER processes directly impacts the overall value of their information assets and the quality of the information products they produce. Government agencies and departments, including law enforcement and the intelligence community, are increasing their use of ER as a tool for accomplishing their missions as well. Recognizing the growing interest in ER theory and practice, and its impact on information quality in organizations, the ACM Journal of Data and Information Quality (JDIQ) will devote a special issue to innovative and high-quality research papers in this area. Papers that address any aspect of entity resolution are welcome.
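For readers new to ER, a toy example of direct record matching may help fix ideas. The sketch below scores record pairs with a plain Jaccard similarity over name tokens and flags pairs above a threshold; real systems, and the Fellegi-Sunter model in particular, weight agreement and disagreement on each field probabilistically, which this deliberately does not attempt. The records and threshold are invented.

# Toy entity-resolution sketch: score candidate record pairs by Jaccard
# similarity of name tokens and flag likely duplicates above a threshold.
# Field choice, records, and threshold are arbitrary, for illustration only.

def tokens(s):
    return set(s.lower().replace(",", " ").split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

records = [
    {"id": 1, "name": "ACME Corporation", "city": "Springfield"},
    {"id": 2, "name": "Acme Corp.", "city": "Springfield"},
    {"id": 3, "name": "Widget Works", "city": "Shelbyville"},
]

THRESHOLD = 0.3
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = jaccard(tokens(records[i]["name"]), tokens(records[j]["name"]))
        if score >= THRESHOLD:
            print(f"possible match: {records[i]['id']} ~ {records[j]['id']} (score {score:.2f})")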

September 18, 2010

TREC 2010/2011

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Searching,Software — Patrick Durusau @ 7:34 am

It’s too late to become a participant in TREC 2010, but everyone interested in building topic maps should be aware of this conference.

The seven tracks for this year are blog, chemical IR, entity, legal, relevance feedback, “session,” and web.

Prior TREC conferences are online, along with a host of other materials, at the Text REtrieval Conference (TREC) site.

The 2011 cycle isn’t that far away so consider being a participant next year.

September 10, 2010

LNCS Volume 6263: Data Warehousing and Knowledge Discovery

Filed under: Database,Graphs,Heterogeneous Data,Indexing,Merging — Patrick Durusau @ 8:20 pm

LNCS Volume 6263: Data Warehousing and Knowledge Discovery edited by Torben Bach Pedersen, Mukesh K. Mohania, A Min Tjoa, has a number of articles of interest to the topic map community.

Here are five (5) that caught my eye:

August 23, 2010

KNIME – Professional Open-Source Software

Filed under: Heterogeneous Data,Mapping,Software,Subject Identity — Patrick Durusau @ 7:27 pm

KNIME – Professional Open-Source Software is another effort by domain bridging folks I mentioned yesterday.

From the homepage:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive Open-Source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is currently being used actively by over 6.000 professionals all over the world, both in industry and academia.

Read the KNIME features page for a very long list of potentially useful subject identity tests.

There is a place for string matching IRIs, but there is a world of subject identity beyond that as well.
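As a small illustration of what lies beyond string matching IRIs, the sketch below treats two topic stubs as the same subject if they share a subject identifier or if a normalized domain property (here an ISBN with hyphens stripped) agrees. The data and the rule are invented; the point is only that identity tests can be property-based and domain-specific.

# Illustrative subject-identity test that goes beyond exact IRI matching:
# two topic stubs count as the same subject if they share a subject identifier
# OR if a normalized domain property (an ISBN, stripped of hyphens and spaces)
# matches. Data and rules are made up for illustration.

def normalize_isbn(isbn):
    return isbn.replace("-", "").replace(" ", "") if isbn else None

def same_subject(a, b):
    if set(a.get("identifiers", [])) & set(b.get("identifiers", [])):
        return True
    na, nb = normalize_isbn(a.get("isbn")), normalize_isbn(b.get("isbn"))
    return na is not None and na == nb

t1 = {"identifiers": ["http://example.org/book/42"], "isbn": "978-0-13-468599-1"}
t2 = {"identifiers": ["http://example.net/item/99"], "isbn": "9780134685991"}

print(same_subject(t1, t2))  # True: different IRIs, same normalized ISBN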

August 22, 2010

Domain Bridging Associations Support Creativity

Filed under: Data Integration,Heterogeneous Data,Mapping,Semantics — Patrick Durusau @ 10:21 am

Domain Bridging Associations Support Creativity by Tobias Kötter, Kilian Thiel, and Michael R. Berthold, offers the following abstract:

This paper proposes a new approach to support creativity through assisting the discovery of unexpected associations across different domains. This is achieved by integrating information from heterogeneous domains into a single network, enabling the interactive discovery of links across the corresponding information resources. We discuss three different pattern of domain crossing associations in this context.

Does that sound familiar to anyone?

It is part of the continuing irony of semantic integration research that it suffers from a lack of semantic integration.

I am just at the tip of this particular iceberg of research so please chime in with pointers to conferences, proceedings, articles, books, etc.

The Universität Konstanz, Nycomed Chair for Bioinformatics and Data Mining, Publications page is where I found this paper and a number of other resources.

July 26, 2010

OneSource

Filed under: Heterogeneous Data,Mapping,Ontology,Semantic Diversity — Patrick Durusau @ 7:23 am

OneSource describes itself as:

OneSource is an evolving data analysis and exploration tool used internally by the USAF Air Force Command and Control Integration Center (AFC2IC) Vocabulary Services Team, and provided at no additional cost to the greater Department of Defense (DoD) community. It empowers its users with a consistent view of syntactical, lexical, and semantic data vocabularies through a community-driven web environment, directly supporting the DoD Net-Centric Data Strategy of visible, understandable, and accessible data assets.

Video guides to the site:

OneSource includes 158 vocabularies of interest to the greater U.S. Department of Defense (DoD) community. (My first post to answer Lars Heuer’s question “…where is the money?”)

Following posts will explore OneSource and what we can learn from each other.

July 13, 2010

The FLAMINGO Project on Data Cleaning – Site

The FLAMINGO Project on Data Cleaning is the other project that has influenced the self-similarity work with MapReduce.

From the project description:

Supporting fuzzy queries is becoming increasingly more important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure to support online query processing. Fuzzy join queries of data sets are more time consuming due to the computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.

See the project webpage to learn more about their work on “us[ing] limited programming primitives in the cloud to implement index structures and search algorithms.”
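The FLAMINGO work centers on approximate string search via inverted lists over q-grams. As a very small, single-machine illustration of that idea (nothing Hadoop-scale), the sketch below indexes strings by their 2-grams and answers a fuzzy query by counting shared grams; the strings and the matching threshold are invented.

# Tiny illustration of gram-based fuzzy search: index strings by their 2-grams
# in an inverted list, then answer a fuzzy query by counting shared grams.
# Deliberately naive and single-machine; FLAMINGO's contribution is doing this
# efficiently at scale (e.g., with Hadoop), which this sketch does not attempt.
from collections import defaultdict

def grams(s, q=2):
    s = f"#{s.lower()}#"           # pad so prefixes/suffixes form grams too
    return {s[i:i + q] for i in range(len(s) - q + 1)}

strings = ["durusau", "durusow", "duras", "smith"]

index = defaultdict(set)           # gram -> ids of strings containing it
for sid, s in enumerate(strings):
    for g in grams(s):
        index[g].add(sid)

def fuzzy_search(query, min_shared=4):
    counts = defaultdict(int)
    for g in grams(query):
        for sid in index[g]:
            counts[sid] += 1
    return [strings[sid] for sid, c in counts.items() if c >= min_shared]

print(fuzzy_search("durosau"))     # finds "durusau" despite the typo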

The relationship between “dirty” data and the increase in data overall is at least linear, but probably worse. Far worse. Whether data is “dirty” depends on your perspective. The more data that appears in “***” format (fill in the one you like the least), the dirtier the universe of data has become. “Dirty” data will be with you always.

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis – SITE

ASTERIX: A Highly Scalable Parallel Platform for Semistructured Data Management and Analysis is one of the projects behind the self-similarity and MapReduce posting.

From the project page:

The ASTERIX project is developing new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to vast quantities of semi-structured information. The project is combining ideas from three distinct areas – semi-structured data, parallel databases, and data-intensive computing – to create a next-generation, open source software platform that scales by running on large, shared-nothing computing clusters.

ASTERIX is also the home of Hyrax: Demonstrating a New Foundation for Data-Parallel Computation, which offers “out-of-the-box support for common distributed communication patterns and set-oriented data operators.” (Need I say more?)

July 11, 2010

Efficient Parallel Set-Similarity Joins Using MapReduce

Efficient Parallel Set-Similarity Joins Using MapReduce by Rares Vernica, Michael J. Carey, and Chen Li, Department of Computer Science, University of California, Irvine, used Citeseer (1.3M publications) and DBLP (1.2M publications) and “…increased their sizes as needed.”

The contributions of this paper are:

  • “We describe efficient ways to partition a large dataset across nodes in order to balance the workload and minimize the need for replication. Compared to the equi-join case, the set-similarity joins case requires “partitioning” the data based on set contents.
  • We describe efficient solutions that exploit the MapReduce framework. We show how to efficiently deal with problems such as partitioning, replication, and multiple
    inputs by manipulating the keys used to route the data in the framework.
  • We present methods for controlling the amount of data kept in memory during a join by exploiting the properties of the data that needs to be joined.
  • We provide algorithms for answering set-similarity self-join queries end-to-end, where we start from records containing more than just the join attribute and end with actual pairs of joined records.
  • We show how our set-similarity self-join algorithms can be extended to answer set-similarity R-S join queries.
  • We present strategies for exceptional situations where, even if we use the finest-granularity partitioning method, the data that needs to be held in the main memory of one node is too large to fit.”

A number of lessons and insights relevant to topic maps in this paper.
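To make set-similarity joins concrete, here is a small in-memory sketch of a Jaccard self-join using the standard prefix-filtering idea that the paper's MapReduce algorithms scale up: only pairs sharing a token in their prefix (their rarest tokens under a global ordering) are verified. Records and threshold are invented, and everything runs on one machine; the paper's contribution is how to partition and route this work across a cluster.

# In-memory sketch of a set-similarity self-join with prefix filtering.
# Only record pairs that share a token in their "prefix" (the rarest tokens
# under a global ordering) are verified against the Jaccard threshold.
# Records and threshold are illustrative.
from collections import defaultdict
from math import ceil

records = {
    1: {"efficient", "parallel", "set", "similarity", "joins"},
    2: {"parallel", "set", "similarity", "joins", "mapreduce"},
    3: {"topic", "maps", "semantic", "diversity"},
}
THRESHOLD = 0.6  # Jaccard threshold

freq = defaultdict(int)            # global token frequencies (rarest-first ordering)
for toks in records.values():
    for t in toks:
        freq[t] += 1

candidates = defaultdict(set)      # token -> record ids with it in their prefix
pairs = set()
for rid, toks in records.items():
    sorted_toks = sorted(toks, key=lambda t: (freq[t], t))
    prefix_len = len(toks) - ceil(THRESHOLD * len(toks)) + 1
    for t in sorted_toks[:prefix_len]:
        pairs.update((other, rid) for other in candidates[t])
        candidates[t].add(rid)

for a, b in sorted(pairs):         # verification step
    jac = len(records[a] & records[b]) / len(records[a] | records[b])
    if jac >= THRESHOLD:
        print(f"({a}, {b}) Jaccard = {jac:.2f}")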

Makes me think of domain-specific (as well as possibly one or more “general”) set-similarity join interchange languages! What are you thinking of?

NTCIR (NII Test Collection for IR Systems) Project

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 7:47 am

NTCIR (NII Test Collection for IR Systems) Project focuses on information retrieval tasks in Japanese, Chinese, Korean, English and cross-lingual information retrieval.

From the project description:

For the laboratory-typed testing, we have placed emphasis on (1) information retrieval (IR) with Japanese or other Asian languages and (2) cross-lingual information retrieval. For the challenging issues, (3) shift from document retrieval to “information” retrieval and technologies to utilizing information in the documents, and (4) investigation for realistic evaluation, including evaluation methods for summarization, multigrade relevance judgments and single-numbered averageable measures for such judgments, evaluation methods suitable for retrieval and processing of particular document-genre and its usage of the user group of the genre and so on.

I know there are active topic map communities in both Japan and Korea. Perhaps this is a place to meet researchers working on issues closely related to those in topic maps and to discuss the contribution that topic maps have to offer.

Forum for Information Retrieval Evaluation (FIRE)

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:44 am

Forum for Information Retrieval Evaluation (FIRE) aims:

  • to encourage research in South Asian language Information Access technologies by providing reusable large-scale test collections for ILIR experiments
  • to explore new Information Retrieval / Access tasks that arise as our information needs evolve, and new needs emerge
  • to provide a common evaluation infrastructure for comparing the performance of different IR systems
  • to investigate evaluation methods for Information Access techniques and methods for constructing a reusable large-scale data set for ILIR experiments.

I know there is a lot of topic map development in South Asia and this looks like a great place to meet current researchers and to interest others in topic maps.

INEX: Initiative for Evaluation of XML Retrieval

Filed under: Conferences,Heterogeneous Data,Information Retrieval,Search Engines,Software — Patrick Durusau @ 6:30 am

INEX: Initiative for Evaluation of XML Retrieval is another must-see for serious topic map researchers.

No surprise that my first stop was the INEX Publications page, with proceedings from 2002 to date.

However, INEX offers an opportunity for evaluation of topic maps in the context of other solutions, provided that one or more of us participate in the initiative.

If you or your institution decide to participate, please let others in the community know. I for one would like to join such an effort.

June 19, 2010

Demonstrating The Need For Topic Maps

Individual Differences in the Interpretation of Text: Implications for Information Science by Jane Morris demonstrates that different readers have different perceptions of lexical cohesion in a text, differing by roughly 40%. That is a difference in the meaning of the text.

Many tasks in library and information science (e.g., indexing, abstracting, classification, and text analysis techniques such as discourse and content analysis) require text meaning interpretation, and, therefore, any individual differences in interpretation are relevant and should be considered, especially for applications in which these tasks are done automatically. This article investigates individual differences in the interpretation of one aspect of text meaning that is commonly used in such automatic applications: lexical cohesion and lexical semantic relations. Experiments with 26 participants indicate an approximately 40% difference in interpretation. In total, 79, 83, and 89 lexical chains (groups of semantically related words) were analyzed in 3 texts, respectively. A major implication of this result is the possibility of modeling individual differences for individual users. Further research is suggested for different types of texts and readers than those used here, as well as similar research for different aspects of text meaning.

I won’t belabor what a 40% difference in interpretation implies for the one-interpretation-of-data crowd. At least for those who prefer evidence over ideology in their approach to IR.

What is worth belaboring is how to use Morris’ technique to demonstrate such differences in interpretation to potential topic map customers. As a community we could develop texts for use with particular market segments (business, government, legal, finance, etc.), build an interface to replace the colored pencils used to mark all words belonging to a particular group, and automate some of the calculations and other operations on the resulting data.
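The core computation of such a tool is simple enough to sketch. Given two readers' word groupings for the same text, turn each grouping into the set of word pairs the reader put in the same chain and compare the sets with Jaccard. Morris's methodology is more involved than this, so treat the snippet, with its made-up groupings, purely as a starting point.

# Rough sketch of measuring agreement between two readers' lexical chains:
# convert each reader's groupings into the set of word pairs they place in the
# same chain, then compare with Jaccard. Morris's actual methodology is more
# nuanced; the groupings here are invented.
from itertools import combinations

def same_chain_pairs(chains):
    pairs = set()
    for chain in chains:
        for a, b in combinations(sorted(chain), 2):
            pairs.add(frozenset((a, b)))
    return pairs

reader_1 = [{"bank", "money", "loan"}, {"river", "water"}]
reader_2 = [{"bank", "river", "water"}, {"money", "loan"}]

p1, p2 = same_chain_pairs(reader_1), same_chain_pairs(reader_2)
agreement = len(p1 & p2) / len(p1 | p2)
print(f"agreement: {agreement:.0%}")   # well below 100%: the readers disagree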

Sensing that interpretations of texts vary is one thing. Having an actual demonstration, possibly using texts from a potential client, is quite another.

This is a tool we should build. I am willing to help. Who else is interested?

June 8, 2010

Semantic Overlay Networks

GridVine: Building Internet-Scale Semantic Overlay Networks sounds to me like they are dealing with topic map-like issues. You be the judge:

This paper addresses the problem of building scalable semantic overlay networks. Our approach follows the principle of data independence by separating a logical layer, the semantic overlay for managing and mapping data and metadata schemas, from a physical layer consisting of a structured peer-to-peer overlay network for efficient routing of messages. The physical layer is used to implement various functions at the logical layer, including attribute-based search, schema management and schema mapping management. The separation of a physical from a logical layer allows us to process logical operations in the semantic overlay using different physical execution strategies. In particular we identify iterative and recursive strategies for the traversal of semantic overlay networks as two important alternatives. At the logical layer we support semantic interoperability through schema inheritance and semantic gossiping. Thus our system provides a complete solution to the implementation of semantic overlay networks supporting both scalability and interoperability.

The concept of “semantic gossiping” enables semantic similarity to be established through the combination of local mappings, that is, by adding the mappings together. (Similar to the set behavior of subject identifiers/locators in the TMDM. That is to say, if you merge two topic maps, any additional subject identifiers, previously unknown to the first topic map, will enable those topics to merge with topics in later merges where previously they may not have.)
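That set behavior is easy to illustrate. Merging on shared identifiers is transitive, so identifiers accumulated in one merge can trigger further merges later. A minimal union-find sketch, with made-up identifiers, follows.

# Minimal sketch of transitive merging on shared subject identifiers:
# topics that share any identifier collapse into one subject, and identifiers
# gained in one merge can trigger further merges later (union-find).
# The identifiers are made up for illustration.
topics = [
    {"http://en.example.org/Paris"},
    {"http://fr.example.org/Paris", "http://en.example.org/Paris"},
    {"http://dbpedia.example.org/Paris", "http://fr.example.org/Paris"},
]

parent = list(range(len(topics)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

seen = {}                          # identifier -> index of a topic carrying it
for idx, identifiers in enumerate(topics):
    for ident in identifiers:
        if ident in seen:
            union(idx, seen[ident])
        else:
            seen[ident] = idx

merged = {}
for idx in range(len(topics)):
    merged.setdefault(find(idx), set()).update(topics[idx])
print(merged)                      # one subject carrying all three identifiers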

Open Question: If everyone concedes that:

  • we live in a heterogeneous world
  • we have stored vast amounts of heterogeneous data
  • we are going to continue to create/store even vaster amounts of heterogeneous data
  • we keep maintaining and creating more heterogeneous data structures to store our heterogeneous data

If every starting point is heterogeneous, shouldn’t heterogeneous solutions be the goal?

Such as supporting heterogeneous mapping technologies? (Granting there will also be a limit to those supported at any one time, but it should be possible to extend to embrace others.)

Author Bibliographies:

Karl Aberer

Philippe Cudré-Mauroux

Manfred Hauswirth

Tim Van Pelt

May 22, 2010

Peter McBrien

Filed under: Data Integration,Heterogeneous Data,Hypergraphs,Researchers — Patrick Durusau @ 3:36 pm

Peter McBrien focuses on data modeling and integration.

He is part of the AutoMed project on database integration. Recent work includes temporal constraints and P2P exchange of heterogeneous data.

Publications (dblp).

Homepage

Databases: Tools and Data for Teaching and Research: Useful collection of datasets and other materials on databases, data modeling and integration.

I first encountered Peter’s research in Comparing and Transforming Between Data Models via an Intermediate Hypergraph Data Model.

From a topic map perspective, the authors assumed the identities of the subjects to which their transformation rules were applied. Someone less familiar with the schema languages could have made other choices.

That’s the hard question isn’t it? How to have reliable integration without presuming a common perspective/interpretation of the schema languages?

*****
PS: This is the first of many posts on researchers working in areas of interest to the topic maps community.

May 2, 2010

Topic Maps: A Value-Add Technology

Filed under: Data Integration,Heterogeneous Data,Marketing — Patrick Durusau @ 7:41 pm

It isn’t always clear that topic maps are a value-add, not a replacement technology.

Topic maps, by virtue of subject identity and mapping rules, can enhance existing information technologies and provide reliable interoperability between them. Without changing the underlying information technologies.

Topic maps are a value-add proposition because the structures of information technologies are subjects themselves. Database schemas and their fields, for instance, are subjects in the view of a topic map. Which means that users can map, seamlessly and reliably, between a relational database and a document archive, even when they use completely different terminology.

Or a subscriber to several financial reporting services can create a topic map to filter and organize those reports. That is doable without a topic map, but what happens when another report service is added? What subjects were mapped together before? Topic maps are the value-add that can provide an answer to that question.
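A toy illustration of the value-add idea: record that a database column and an archive metadata field name the same subject, and a query phrased in either vocabulary can be routed through that shared subject. All of the names below are invented.

# Toy illustration of topic maps as a value-add layer: a relational column and
# a document-archive field are declared names for the same subject, so a query
# in either vocabulary can be answered from both sources without changing
# either system. All identifiers and names are invented.
subject_map = {
    "http://example.org/subject/customer-surname": {
        "crm_db": "cust_lname",          # relational column name
        "doc_archive": "Family Name",    # archive metadata field
    },
}

def equivalent_names(system, name):
    """Given a field name in one system, return its equivalents elsewhere."""
    for subject, names in subject_map.items():
        if names.get(system) == name:
            return {s: n for s, n in names.items() if s != system}
    return {}

print(equivalent_names("crm_db", "cust_lname"))
# {'doc_archive': 'Family Name'}  (neither system had to change)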

April 20, 2010

Data Virtualization

Filed under: Data Integration,Heterogeneous Data,Subject Identity — Patrick Durusau @ 6:47 pm

I ran across a depressing quote today on data virtualization:

But data is distributed, heterogeneous, and often full of errors. Simply federating it is insufficient. IT organizations must build a single, accurate, and consistent view of data, and deliver it precisely when it’s needed. Data virtualization needs to take this complexity into account.*

It is very important to have a single view of data for some purposes, but what happens when circumstances change and we need a different view than the one before?

Without explicit identification of subjects, all the IT effort that went into the first data integration project gets repeated in the next data integration project.

You would think that after sixty years of data migration, largely repeating the efforts of prior migrations, even business types would have caught on by this point.

Without explicit identification of subjects, there isn’t any way to “know” what subjects were being identified. Or to create reliable new mappings. So the cycle of data migrations goes on and on.

Break the cycle of data migrations, choose topic maps!

*Look under webinars at http://www.informatica.com/Pages/index.aspx. There wasn’t a direct link that I could post to lead you to the quote.

April 19, 2010

Why Semantic Technologies Remain Orphans (Lack of Adoption)

Filed under: Data Silos,Heterogeneous Data,Mapping,Semantic Diversity,Topic Maps — Patrick Durusau @ 6:54 pm

In the debate over Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 1, Kingsley Idehen has noted the lack of widespread adoption of semantic technologies.

Everyone prefers their own world view. We see some bright, shiny future if everyone else, at their expense, would adopt our view of the world. That hasn’t been persuasive.

And why should it be? What motivation do I have to change how I process/encode my data, in the hopes that if everyone else in my field does the same thing, then at some unknown future point, I will have some unquantifiable advantage over how I process data now?

I am not advocating that everyone adopt XTM syntax or the TMDM as a data model. Just as there are an infinite number of semantics there are an infinite number of ways to map and combine those semantics. I am advocating a disclosed mapping strategy that enables others to make meaningful use of the resulting maps.

Let’s take a concrete case.

The Christmas Day “attack” by a terrorist who set his pants on fire (Christmas Day Attack Highlights US Intelligence Failures) illustrates a failure to share intelligence data.

One strategy, the one most likely to fail, is the development of a common data model for sharing intelligence data. The Guide to Sources of Information for Intelligence Officers, Analysts, and Investigators, Updated gives you a feel for the scope of such a project. (100+ pages listing sources of intelligence data)

A disclosed mapping strategy for the integration of intelligence data would enable agencies to keep their present systems, data structures, interfaces, etc.

Disclosing the basis for mapping, whatever the target (such as RDF), will mean that users can combine the resulting map with other data. Or not. But it will be a meaningful choice. A far saner (and more cost effective) strategy than a common data model.

Semantic diversity is our strength. So why not play to our strength, rather than against it?

April 12, 2010

Topic Maps and the “Vocabulary Problem”

To situate topic maps in a traditional area of IR (information retrieval), try the “vocabulary problem.”

Furnas describes the “vocabulary problem” as follows:

Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem. It is a troublesome impediment in computer interactions both simple (file access and command entry) and complex (database query and natural language dialog).

In what follows we report evidence on the extent of the vocabulary problem, and propose both a diagnosis and a cure. The fundamental observation is that people use a surprisingly great variety of words to refer to the same thing. In fact, the data show that no single access word, however well chosen, can be expected to cover more than a small proportion of user’s attempts. Designers have almost always underestimated the problem and, by assigning far too few alternate entries to databases or services, created an unnecessary barrier to effective use. Simulations and direct experimental tests of several alternative solutions show that rich, probabilistically weighted indexes or alias lists can improve success rates by factors of three to five.

The Vocabulary Problem in Human-System Communication (1987)

Substitute topic maps for probabilistically weighted indexes or alias lists. (Techniques we are going to talk about in connection with topic maps authoring.)

Three to five times greater success is a strong incentive to use topic maps.
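As a reminder of what the alias-list fix looks like in practice, here is a minimal lookup sketch: many user words, one target. A probabilistically weighted index would attach weights to each alias; this unweighted version, with invented aliases, is the simplest possible form.

# Minimal alias-list lookup: map many user words to one target term.
# Furnas et al. found that rich alias lists like this can raise success rates
# three- to five-fold. The aliases and targets here are invented.
alias_list = {
    "delete": "remove", "erase": "remove", "kill": "remove", "trash": "remove",
    "remove": "remove",
    "find": "search", "locate": "search", "lookup": "search", "query": "search",
    "search": "search",
}

def resolve(user_word):
    return alias_list.get(user_word.lower())

for word in ["Erase", "locate", "destroy"]:
    print(word, "->", resolve(word))   # "destroy" misses: alias lists need curation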

Marketing Department Summary

Customers can’t buy what they can’t find. Topic Maps help customers find purchases, which increases sales. (Be sure to track pre- and post-topic-map sales results, so marketing can’t successfully claim the increases are due to their efforts.)

April 6, 2010

Building Multilingual Topic Maps

Filed under: Conferences,Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 8:42 pm

The one article of faith shared by all topic map enthusiasts is: topic maps can express anything! But having said that, “when the rubber hits the road” (an Americanism meaning when things become real and action must be taken), the question is how to build a topic map, particularly a multilingual one.

We are all familiar with the ability of topic maps to place a “scope” on a name so that its language can be indicated. But that is only one aspect of what is expected of a modern multilingual system.

Fortunately, topic map fans don’t have to re-invent multilingual information retrieval techniques!

Bookmark and use the resources found at the Cross Language Evaluation Forum. CLEF is sponsored by TrebleCLEF, an activity of the European Commission.

CLEF has almost a decade of annual proceedings, and both sites offer link collections to other multilingual resources. I am going to start mining those proceedings and other documents for suggestions and tips on constructing topic maps.

Suggestions, comments, tips, etc., that you have found useful would be appreciated.

(PS: I am sure all this is old hat to European topic map folks but realize there are, ahem, parts of the world where multilingualism isn’t valued. I suspect many of the same techniques will work for multiple identifications in single languages.)

April 5, 2010

Are You Designing a 10% Solution?

Filed under: Full-Text Search,Heterogeneous Data,Recall,Search Engines — Patrick Durusau @ 8:28 pm

The most common feature on webpages is the search box. It is supposed to help readers find information, products, services; in other words, help the reader or your cash flow.

How effective is text searching? How often will your reader use the same word as your content authors for some object, product, service? Survey says: 10 to 20%!*

So the next time you insert a search box on a webpage, you or your client may be missing 80 to 90% of the potential readers or customers. Ouch!

Unlike the imaginary world of universal and unique identifiers, the odds of users choosing the same words have been established by actual research.

The data sets were:

  • verbs used to describe text-editing operations
  • descriptions of common objects, similar to the PASSWORD™ game
  • superordinate category names for swap-and-sale listings
  • main-course cooking recipes

There are a number of interesting aspects to the study that I will cover in future posts but the article offers the following assessment of text searching:

We found that random pairs of people use the same word for an object only 10 to 20 percent of the time.

This research is relevant to all information retrieval systems, whether online stores or library catalogs, and whether you are searching simple text, RDF or even topic maps. Ask yourself or your users: Is a 10% success rate really enough?

(There are ways to improve that 10% score. More on those to follow.)
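One of those ways is simply adding aliases. Under the strong simplifying assumption that each alias independently matches a random user's word with probability 0.15 (the midpoint of the 10 to 20 percent finding), the expected hit rate with k aliases is 1 - (1 - 0.15)^k. The back-of-the-envelope sketch below shows how quickly coverage climbs, which is consistent with the three- to five-fold improvement Furnas reports; treat it as rough arithmetic, not a result from the paper.

# Back-of-the-envelope: if each alias independently matches a random user's
# word with probability p = 0.15, the chance that at least one of k aliases
# matches is 1 - (1 - p) ** k. The independence assumption is a simplification,
# so treat these numbers as rough.
p = 0.15
for k in (1, 3, 5, 10, 15):
    hit_rate = 1 - (1 - p) ** k
    print(f"{k:2d} aliases -> ~{hit_rate:.0%} expected coverage")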

*Furnas, G. W., Landauer, T. K., Gomez, L. M., Dumais, S. T., (1983) “Statistical semantics: Analysis of the potential performance of keyword information access systems.” Bell System Technical Journal, 62, 1753-1806. Reprinted in: Thomas, J.C., and Schneider, M.L, eds. (1984) Human Factors in Computer Systems. Norwood, New Jersey: Ablex Publishing Corp., 187-242.

April 3, 2010

Source of Heterogeneous Data?

Filed under: Heterogeneous Data,Semantic Diversity — Patrick Durusau @ 7:19 pm

Topic maps are designed to deal with heterogeneous data. The question I have never heard asked (or answered) is: “Where does all this heterogeneous data come from?” Heterogeneous data is a topic of conversation in both digital and pre-digital IT literature.

You would think that question would have been asked and answered. I went out looking for it, since email is slow today. (Holy Saturday 2010)

If I can find a time when there wasn’t any heterogeneous data, then someone may have commented, “look, there’s heterogeneous data.” I could then track the cause forward. Sounds simple enough.

I have a number of specialized works on languages of the Ancient Near East but it turns out the Unicode standard has the information we need.

Chapter 14, Archaic Scripts, has entries for both Egyptian hieroglyphics and Sumero-Akkadian cuneiform. Both arose at about the same time, somewhere from the middle to near the end of the fourth millennium BCE. That’s recorded heterogeneous data, isn’t it?

For somewhere between 5,000 to 5,500 years we have had heterogeneous data. It appears to be universal, geographically speaking.

The source of heterogeneous data? That would be us. What we need is a solution that works with us and not against us. That would be topic maps.

April 2, 2010

Re-Inventing Natural Language

Filed under: Heterogeneous Data,Ontology,Semantic Diversity — Patrick Durusau @ 8:29 pm

What happens when users use ontologies? That is, when ontologies leave the rarefied air of campuses, turgid dissertations and the clutches of armchair ontologists?

Would you believe that users simply take terms from ontologies and use them as they wish? In other words, after decades of research, ontologists have re-invented natural language! With all of its inconsistent usage, etc.

I would send a fruit basket if I had their address.

For the full details, take a look at: The perceived utility of standard ontologies in document management for specialized domains. From the conclusion:

…rather than being locked into conforming to the standard, users will be free to use all or small fragments of the ontology as best suits their purpose; that is, these communities will be able to very flexibly import ontologies and make selective use of ontology resources. Their selective use and the extra terms they add will provide useful feedback on how the external ontologies could be evolved. A new ontology will emerge as the result and this itself may become a new standard ontology.

I would amend the final two sentences to read: “Their selective use and the extra terms they add will provide useful feedback on how their language is evolving. A new language will emerge as the result and this may itself become a new standard language.”

Imagine, all that effort and we are back where we started. Users using language (terms from an ontology) to mean what they want it to mean and not what was meant by the ontology.

The armchair ontologists have written down what they mean. Why don’t we ask ordinary users the same thing, and write that down?
