Archive for the ‘Heterogeneous Data’ Category

Why I Love XML (and Good Thing, It’s Everywhere) [Needs Subject Identity Too]

Sunday, March 5th, 2017

Why I Love XML (and Good Thing, It’s Everywhere) by Lee Pollington.

Lee makes a compelling argument for XML as the underlying mechanism for data integration when saying:

…Perhaps the data in your relational databases is structured. What about your knowledge management systems, customer information systems, document systems, CMS, mail, etc.? How do you integrate that data with structured data to get a holistic view of all your data? What do you do when you want to bring a group of relational schemas from different systems together to get that elusive 360 view – which is being demanded by the world’s regulators banks? Mergers and acquisitions drive this requirement too. How do you search across that data?

Sure there are solution stack answers. We’ve all seen whiteboards with ever growing number of boxes and those innocuous puny arrows between them that translate to teams of people, buckets of code, test and operations teams. They all add up to ever-increasing costs, complexity, missed deadlines & market share loss. Sound overly dramatic? Gartner calculated a worldwide spend of $5 Billion on data integration software in 2015. How much did you spend … would you know where to start calculating that cost?

While pondering what you spend on a yearly basis for data integration, contemplate two more questions from Lee:

…So take a moment to think about how you treat the data format that underpins your intellectual property? First-class citizen or after-thought?…

If you are treating your XML elements as first class citizens, do tell me that you created subject identity tests for those subjects?

So that a programmer new to your years of legacy XML will understand that <MFBM>, <MBFT> and <MBF> elements are all expressed in units of 1,000 board feet.
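
A minimal sketch of what such a subject identity test could look like, assuming the three element names really do share one unit. The `SUBJECT_OF` table, the `unit:thousand-board-feet` identifier and the sample document are invented for illustration; none of them come from Lee's post.

```python
import xml.etree.ElementTree as ET

# Hypothetical identity table: legacy element name -> subject it represents.
THOUSAND_BOARD_FEET = "unit:thousand-board-feet"  # made-up identifier
SUBJECT_OF = {
    "MFBM": THOUSAND_BOARD_FEET,
    "MBFT": THOUSAND_BOARD_FEET,
    "MBF": THOUSAND_BOARD_FEET,
}


def same_subject(tag_a, tag_b):
    """Return True when both element names are declared to mean one subject."""
    subject_a = SUBJECT_OF.get(tag_a)
    return subject_a is not None and subject_a == SUBJECT_OF.get(tag_b)


def total_by_subject(xml_text):
    """Sum element values grouped by declared subject, not by tag name."""
    totals = {}
    for element in ET.fromstring(xml_text).iter():
        subject = SUBJECT_OF.get(element.tag)
        if subject is not None:
            totals[subject] = totals.get(subject, 0.0) + float(element.text)
    return totals


# Invented sample data mixing all three legacy element names.
sample = "<orders><MFBM>2</MFBM><MBFT>3.5</MBFT><MBF>1</MBF><note>rush</note></orders>"
print(same_subject("MFBM", "MBF"))
print(total_by_subject(sample))
```

The point of the table is that the new programmer reads the declaration once instead of rediscovering it from years of legacy data.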


Reducing the cost of data integration tomorrow, next year and five years after that, requires investment in the here and now.

Perhaps that is why data integration costs continue to climb.

Why pay today for what can be put off until tomorrow? (Future conversion costs are a line item in some future office holder’s budget.)

CIDOC Conceptual Reference Model

Saturday, February 22nd, 2014

CIDOC Conceptual Reference Model (pdf)

From the “Definition of the CIDOC Conceptual Reference Model:”

This document is the formal definition of the CIDOC Conceptual Reference Model (“CRM”), a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information. The CRM is the culmination of more than a decade of standards development work by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). Work on the CRM itself began in 1996 under the auspices of the ICOM-CIDOC Documentation Standards Working Group. Since 2000, development of the CRM has been officially delegated by ICOM-CIDOC to the CIDOC CRM Special Interest Group, which collaborates with the ISO working group ISO/TC46/SC4/WG9 to bring the CRM to the form and status of an International Standard.

Objectives of the CIDOC CRM

The primary role of the CRM is to enable information exchange and integration between heterogeneous sources of cultural heritage information. It aims at providing the semantic definitions and clarifications needed to transform disparate, localised information sources into a coherent global resource, be it within a larger institution, in intranets or on the Internet. Its perspective is supra-institutional and abstracted from any specific local context. This goal determines the constructs and level of detail of the CRM.

More specifically, it defines and is restricted to the underlying semantics of database schemata and document structures used in cultural heritage and museum documentation in terms of a formal ontology. It does not define any of the terminology appearing typically as data in the respective data structures; however it foresees the characteristic relationships for its use. It does not aim at proposing what cultural institutions should document. Rather it explains the logic of what they actually currently document, and thereby enables semantic interoperability.

It intends to provide a model of the intellectual structure of cultural documentation in logical terms. As such, it is not optimised for implementation-specific storage and processing aspects. Implementations may lead to solutions where elements and links between relevant elements of our conceptualizations are no longer explicit in a database or other structured storage system. For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

The CRM aims to support the following specific functionalities:

  • Inform developers of information systems as a guide to good practice in conceptual modelling, in order to effectively structure and relate information assets of cultural documentation.
  • Serve as a common language for domain experts and IT developers to formulate requirements and to agree on system functionalities with respect to the correct handling of cultural contents.
  • To serve as a formal language for the identification of common information contents in different data formats; in particular to support the implementation of automatic data transformation algorithms from local to global data structures without loss of meaning. The latter being useful for data exchange, data migration from legacy systems, data information integration and mediation of heterogeneous sources.
  • To support associative queries against integrated resources by providing a global model of the basic classes and their associations to formulate such queries.
  • It is further believed, that advanced natural language algorithms and case-specific heuristics can take significant advantage of the CRM to resolve free text information into a formal logical form, if that is regarded beneficial. The CRM is however not thought to be a means to replace scholarly text, rich in meaning, by logical forms, but only a means to identify related data.

(emphasis in original)

Apologies for the long quote but this covers a number of important topic map issues.

For example:

For instance the birth event that connects elements such as father, mother, birth date, birth place may not appear in the database, in order to save storage space or response time of the system. The CRM allows us to explain how such apparently disparate entities are intellectually interconnected, and how the ability of the database to answer certain intellectual questions is affected by the omission of such elements and links.

In topic map terms I would say that the database omits a topic to represent “birth event” and therefore there is no role player for an association with the various role players. What subjects will have representatives in a topic map is always a concern for topic map authors.
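
To make the omission concrete, here is a rough sketch, not the CRM's or any topic map engine's actual data model. The `Topic` and `Association` classes, the `reify_birth` helper and the sample row are all hypothetical. It starts from a flattened record of the kind the quote describes and makes the missing birth event an explicit role player.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Topic:
    """A stand-in for a topic: one representative per subject."""
    name: str


@dataclass
class Association:
    """A typed relationship whose members each play a named role."""
    kind: Topic
    roles: dict = field(default_factory=dict)  # role name -> Topic


# Invented flattened record: the birth event itself has no representative.
flat_row = {
    "child": "Child A",
    "father": "Father B",
    "mother": "Mother C",
    "birth_date": "1900-01-01",
    "birth_place": "Springfield",
}


def reify_birth(row):
    """Make the omitted birth event explicit and attach the role players."""
    association = Association(kind=Topic("birth"))
    association.roles["event"] = Topic(f"birth of {row['child']}")
    for role in ("child", "father", "mother", "birth_date", "birth_place"):
        association.roles[role] = Topic(row[role])
    return association


birth = reify_birth(flat_row)
print(sorted(birth.roles))
```

Whether the event deserves its own topic is exactly the authoring decision described above; the sketch only shows what is gained when it has one.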

Helpfully, CIDOC explicitly separates the semantics it documents from data structures.

Less helpfully:

Because the CRM’s primary role is the meaningful integration of information in an Open World, it aims to be monotonic in the sense of Domain Theory. That is, the existing CRM constructs and the deductions made from them must always remain valid and well-formed, even as new constructs are added by extensions to the CRM.

Which restricts integration using CRM to systems where CRM is the primary basis for integration, as opposed to being one way to integrate several data sets.

That may not seem important in “web time,” where 3 months equals 1 Internet year. But when you think of integrating data and integration practices as they evolve over decades if not centuries, the limitations of monotonic choices come to the fore.

To take one practical discussion under way: how to handle warnings about radioactive waste, which must endure anywhere from 10,000 to 1,000,000 years? A far simpler task than preserving semantics over centuries.

If you think that is easy, remember that lots of people saw the pyramids of Egypt being built. But it was such common knowledge that no one thought to write it down.

Preservation of semantics is a daunting task.

CIDOC merits a slow read by anyone interested in modeling, semantics, vocabularies, and preservation.

PS: CIDOC: Conceptual Reference Model as a Word file.

SDM 2014 Workshop on Heterogeneous Learning

Sunday, January 5th, 2014

SDM 2014 Workshop on Heterogeneous Learning

Key Dates:

01/10/2014: Paper Submission
01/31/2014: Author Notification
02/10/2014: Camera Ready Paper Due

From the post:

The main objective of this workshop is to bring the attention of researchers to real problems with multiple types of heterogeneities, ranging from online social media analysis, traffic prediction, to the manufacturing process, brain image analysis, etc. Some commonly found heterogeneities include task heterogeneity (as in multi-task learning), view heterogeneity (as in multi-view learning), instance heterogeneity (as in multi-instance learning), label heterogeneity (as in multi-label learning), oracle heterogeneity (as in crowdsourcing), etc. In the past years, researchers have proposed various techniques for modeling a single type of heterogeneity as well as multiple types of heterogeneities.

This workshop focuses on novel methodologies, applications and theories for effectively leveraging these heterogeneities. Here we are facing multiple challenges. To name a few: (1) how can we effectively exploit the label/example structure to improve the classification performance; (2) how can we handle the class imbalance problem when facing one or more types of heterogeneities; (3) how can we improve the effectiveness and efficiency of existing learning techniques for large-scale problems, especially when both the data dimensionality and the number of labels/examples are large; (4) how can we jointly model multiple types of heterogeneities to maximally improve the classification performance; (5) how do the underlying assumptions associated with multiple types of heterogeneities affect the learning methods.

We encourage submissions on a variety of topics, including but not limited to:

(1) Novel approaches for modeling a single type of heterogeneity, e.g., task/view/instance/label/oracle heterogeneities.

(2) Novel approaches for simultaneously modeling multiple types of heterogeneities, e.g., multi-task multi-view learning to leverage both the task and view heterogeneities.

(3) Novel applications with a single or multiple types of heterogeneities.

(4) Systematic analysis regarding the relationship between the assumptions underlying each type of heterogeneity and the performance of the predictor;

Apologies but I saw this announcement too late for you to have a realistic opportunity to submit a paper. 🙁

Very unfortunate because the focus of the workshop is right up the topic map alley.

The main conference, which focuses on data mining, is April 24-26, 2014 in Philadelphia, Pennsylvania, USA.

I am very much looking forward to reading the papers from this workshop! (And looking for notice of next year’s workshop much earlier!)

Corporate Culture Clash:…

Monday, July 15th, 2013

Corporate Culture Clash: Getting Data Analysts and Executives to Speak the Same Language by Drew Rockwell

From the post:

A colleague recently told me a story about the frustration of putting in long hours and hard work, only to be left feeling like nothing had been accomplished. Architecture students at the university he attended had scrawled their frustrations on the wall of a campus bathroom…“I wanted to be an architect, but all I do is create stupid models,” wrote students who yearned to see their ideas and visions realized as staples of metropolitan skylines. I’ve heard similar frustrations expressed by business analysts who constantly face the same uphill battle. In fact, in a recent survey we did of 600 analytic professionals, some of the biggest challenges they cited were “getting MBAs to accept advanced methods”, getting executives to buy into the potential of analytics, and communicating with “pointy-haired” bosses.

So clearly, building the model isn’t enough when it comes to analytics. You have to create an analytics-driven culture that actually gets everyone paying attention, participating and realizing what analytics has to offer. But how do you pull that off? Well, there are three things that are absolutely critical to building a successful, analytics-driven culture. Each one links to the next and bridges the gap that has long divided analytics professionals and business executives.

Some snippets to attract you to this “must read:”

In the culinary world, they say you eat with your eyes before your mouth. A good visual presentation can make your mouth water, while a bad one can kill your appetite. The same principle applies when presenting data analytics to corporate executives. You have to show them something that stands out, that they can understand and that lets them see with their own eyes where the value really lies.

One option for agile integration and analytics is data discovery – a type of analytic approach that allows business people to explore data freely so they can see things from different perspectives, asking new questions and exploring new hypotheses that could lead to untold benefits for the entire organization.

If executives are ever going to get on board with analytics, the cost of their buy-in has to be significantly lowered, and the ROI has to be clear and substantial.

I did pick the most topic map “relevant” quotes but they are as valid for topic maps as any other approach.

Seeing from different perspectives sounds like on-the-fly merging to me.

How about you?

Semantics for Big Data [W3C late to semantic heterogeneity party]

Sunday, March 31st, 2013

Semantics for Big Data


Submission due: May 24, 2013

Acceptance Notification: June 21, 2013

Camera-ready Copies: June 28, 2013

Symposium: November 15-17, 2013

From the webpage:

AAAI 2013 Fall Symposium; Westin Arlington Gateway in Arlington, Virginia, November 15-17, 2013.

Workshop Description and Scope

One of the key challenges in making use of Big Data lies in finding ways of dealing with heterogeneity, diversity, and complexity of the data, while its volume and velocity forbid solutions available for smaller datasets as based, e.g., on manual curation or manual integration of data. Semantic Web Technologies are meant to deal with these issues, and indeed since the advent of Linked Data a few years ago, they have become central to mainstream Semantic Web research and development. We can easily understand Linked Data as being a part of the greater Big Data landscape, as many of the challenges are the same. The linking component of Linked Data, however, puts an additional focus on the integration and conflation of data across multiple sources.

Workshop Topics

In this symposium, we will explore the many opportunities and challenges arising from transferring and adapting Semantic Web Technologies to the Big Data quest. Topics of interest focus explicitly on the interplay of Semantics and Big Data, and include:

  • the use of semantic metadata and ontologies for Big Data,
  • the use of formal and informal semantics,
  • the integration and interplay of deductive (semantic) and statistical methods,
  • methods to establish semantic interoperability between data sources
  • ways of dealing with semantic heterogeneity,
  • scalability of Semantic Web methods and tools, and
  • semantic approaches to the explication of requirements from eScience applications.

The W3C is late to the party as evidenced by semantic heterogeneity becoming “…central to mainstream Semantic Web research and development” after the advent of Linked Data.

I suppose better late than never.

At least if they remember that:

Users experience semantic heterogeneity in data and in the means used to describe and store data.

Whatever solution is crafted, its starting premise must be to capture semantics as seen by some defined user.

Otherwise, it is capturing the semantics of designers, authors, etc., which may or may not be valuable to some particular user.

RDF is a good example of capturing someone else’s semantics.

As its uptake is evidence of the interest in someone else’s semantics. (Simple Web Semantics – The Semantic Web Is Failing — But Why?)

On ranking relevant entities in heterogeneous networks…

Tuesday, January 22nd, 2013

On ranking relevant entities in heterogeneous networks using a language-based model by Laure Soulier, Lamjed Ben Jabeur, Lynda Tamine, Wahiba Bahsoun. (Soulier, L., Jabeur, L. B., Tamine, L. and Bahsoun, W. (2013), On ranking relevant entities in heterogeneous networks using a language-based model. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22762)


A new challenge, accessing multiple relevant entities, arises from the availability of linked heterogeneous data. In this article, we address more specifically the problem of accessing relevant entities, such as publications and authors within a bibliographic network, given an information need. We propose a novel algorithm, called BibRank, that estimates a joint relevance of documents and authors within a bibliographic network. This model ranks each type of entity using a score propagation algorithm with respect to the query topic and the structure of the underlying bi-type information entity network. Evidence sources, namely content-based and network-based scores, are both used to estimate the topical similarity between connected entities. For this purpose, authorship relationships are analyzed through a language model-based score on the one hand and on the other hand, non topically related entities of the same type are detected through marginal citations. The article reports the results of experiments using the Bibrank algorithm for an information retrieval task. The CiteSeerX bibliographic data set forms the basis for the topical query automatic generation and evaluation. We show that a statistically significant improvement over closely related ranking models is achieved.

Note the “estimat[ion] of topic similarity between connected entities.”

Very good work but rather than a declaration of similarity (topic maps) we have an estimate of similarity.

Before you protest about the volume of literature/data, recall that some author wrote the documents in question. And selected the terms and references found therein.

Rather than guessing what may be similar to what the author wrote, why not devise a method to allow the author to say?

And build upon similarity/sameness declarations across heterogeneous networks of data.
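
As a sketch of what building on declarations (rather than estimates) might look like, the following uses a plain union-find structure. This is my own illustration, not anything from the BibRank paper, and the `SamenessIndex` class and the source identifiers are hypothetical.

```python
class SamenessIndex:
    """Union-find over identifiers, driven only by explicit declarations."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        # Unknown identifiers start out as their own representative.
        self._parent.setdefault(item, item)
        while self._parent[item] != item:
            # Path halving keeps later lookups short.
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def declare_same(self, a, b):
        """Record an author's statement that a and b identify one subject."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a, b):
        """True only if a chain of declarations connects a and b."""
        return self._find(a) == self._find(b)


index = SamenessIndex()
# Invented identifiers from three different sources.
index.declare_same("sourceA:author/1", "sourceB:person/42")
index.declare_same("sourceB:person/42", "sourceC:creator/x")
print(index.same("sourceA:author/1", "sourceC:creator/x"))
print(index.same("sourceA:author/1", "sourceC:creator/y"))
```

Nothing here is guessed: two identifiers are the same subject only because someone said so, and declarations made in different networks compose.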

Sky Survey Data Lacks Standardization [Heterogeneous Big Data]

Tuesday, November 27th, 2012

Sky Survey Data Lacks Standardization by Ian Armas Foster.

From the post:

The Sloan Digital Sky Survey is at the forefront of astronomical research, compiling data from observatories around the world in an effort to truly pinpoint where we lie on the universal map. In order to do that, they must aggregate data from several observatories across the world, an intensive data operation.

According to a report written by researchers at UCLA, even though the SDSS is a data intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

Per, the first two projects were responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009 with the Nobel Prize in physics awarded to the advancement of optical fibers and digital imaging detectors (or CCDs) that allowed the project to grow in scale.

The point is that the datasets that the scientists used seemed to be scattered. Some would come about through informal social contacts such as email while others would simply search for necessary datasets on Google. Further, once these datasets were found, there was even an inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be slightly unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

So, big data isn’t going to be homogeneous big data but heterogeneous big data.

That sounds like an opportunity for topic maps to me.


Mapping solution to heterogeneous data sources

Monday, September 10th, 2012

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!” 😉

Reading a bit further, you discover that text mining is used to locate sequences and that data is then mapped to “UniProtKB protein entries.”

The data set provides access to:

  • UniProt ID
  • Organism
  • Position
  • PubMed Id
  • Sequence

My concern is what happens when X is mapped to a UniProtKB protein entry to:

  • The prior identifier for X (in the article or source), and
  • The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
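
A minimal sketch of capturing both pieces, assuming a simple in-memory log. The `IdentifierMapping` and `MappingLog` names, their fields and the sample values are hypothetical and are not part of dbSNO. The naive string replacement is only good enough for illustration, since it would misfire on overlapping identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierMapping:
    """One captured mapping from an identifier as published to a later one."""
    prior_id: str   # identifier as it appeared in the article or source
    found_in: str   # where the prior identifier was used
    target_id: str  # the "new" identifier, e.g. a UniProtKB entry
    method: str     # how the mapping was made


class MappingLog:
    def __init__(self):
        self._by_prior = {}

    def record(self, mapping):
        """Keep every mapping, indexed by the prior identifier."""
        self._by_prior.setdefault(mapping.prior_id, []).append(mapping)

    def annotate(self, text):
        """Point each known prior identifier in text at its later identifier."""
        for prior_id, mappings in self._by_prior.items():
            targets = ", ".join(sorted({m.target_id for m in mappings}))
            text = text.replace(prior_id, f"{prior_id} [see {targets}]")
        return text


log = MappingLog()
# Invented sample values; a real target would be an actual database entry.
log.record(IdentifierMapping("protein-X", "article-17",
                             "ENTRY-PLACEHOLDER", "sequence identity"))
print(log.annotate("We studied protein-X in mice."))
```

With the log in hand, older literature can be annotated at rendering time instead of staying frozen.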


Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

Center for Intelligent Information Retrieval (CIIR) [University of Massachusetts Amherst]

Tuesday, August 28th, 2012

Center for Intelligent Information Retrieval (CIIR)

From the webpage:

The Center for Intelligent Information Retrieval (CIIR) is one of the leading research groups working in the areas of information retrieval and information extraction. The CIIR studies and develops tools that provide effective and efficient access to large networks of heterogeneous, multimedia information.

CIIR accomplishments include significant research advances in the areas of retrieval models, distributed information retrieval, information filtering, information extraction, topic models, social network analysis, multimedia indexing and retrieval, document image processing, search engine architecture, text mining, structured data retrieval, summarization, evaluation, novelty detection, resource discovery, interfaces and visualization, digital libraries, computational social science, and cross-lingual information retrieval.

The CIIR has published more than 900 papers on these areas, and has worked with over 90 government and industry partners on research and technology transfer. Open source software supported by the Center is being used worldwide.

Please contact us to talk about potential new projects, collaborations, membership, or joining us as a graduate student or visiting researcher.

To get an idea of the range of their activities, visit the publications page and just browse.

Tutorial on biological networks [The Heterogeneity of Nature]

Friday, July 6th, 2012

Tutorial on biological networks by Francisco G. Vital-Lopez, Vesna Memišević, and Bhaskar Dutta. (Vital-Lopez, F. G., Memišević, V. and Dutta, B. (2012), Tutorial on biological networks. WIREs Data Mining Knowl Discov, 2: 298–325. doi: 10.1002/widm.1061)


Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focused on the most widely studied types of biological networks, which are, metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In future, with further developments on experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research.

As a frozen artifact in time, I would suggest reading this article before it is too badly out of date. It will be sad to see it ravaged by time and pitted by later research that renders entire sections obsolete. Or of interest only to medical literature spelunkers of some future time.

Developers of homogeneous and “correct” models of biological networks should take warning from the closing lines of this survey article:

Currently different types of networks, such as PPI, GRN, or metabolic networks are analyzed separately. These heterogeneous networks have to be integrated systematically to generate comprehensive network, which creates a realistic representation of biological systems.[cite omitted] The integrated networks have to be combined with different types of molecular profiling data that measures different facades of the biological system. A recent multi institutional collaborative project, named The Cancer Genome Atlas,[cite omitted] has already started generating much multi-‘omics’ data for large cancer patient cohorts. Thus, we can expect to witness an exciting and fast paced growth on biological network research in the coming years.


Nature uses heterogeneous networks, with great success.

We can keep building homogeneous networks or we can start building heterogeneous networks (at least to the extent we are capable).

What do you think?

Large Heterogeneous Data 2012

Thursday, May 31st, 2012

Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012 (LHD-12)

Important Dates

  • Deadline for paper submission: July 31, 2012
  • Author notification: August 21, 2012
  • Deadline for camera-ready: September 10, 2012
  • Workshop date: November 11th or 12th, 2012

Take the time to read the workshop description.

A great summary of the need for semantic mappings, not more semantic fascism.

From the call for papers:

An interdisciplinary approach is necessary to discover and match meaning dynamically in a world of increasingly large data sources. This workshop aims to bring together practitioners from academia, industry and government for interaction and discussion. This will be a half-day workshop which primarily aims to initiate discussion and debate. It will involve

  • A panel discussion focussing on these issues from an industrial and governmental point of view. Membership to be confirmed, but we expect a representative from Scottish Government and from Google, as well as others.
  • Short presentations grouped into themed panels, to stimulate debate not just about individual contributions but also about the themes in general.

Workshop Description

The problem of semantic alignment – that of two systems failing to understand one another when their representations are not identical – occurs in a huge variety of areas: Linked Data, database integration, e-science, multi-agent systems, information retrieval over structured data; anywhere, in fact, where semantics or a shared structure are necessary but centralised control over the schema of the data sources is undesirable or impractical. Yet this is increasingly a critical problem in the world of large scale data, particularly as more and more of this kind of data is available over the Web.

In order to interact successfully in an open and heterogeneous environment, being able to dynamically and adaptively integrate large and heterogeneous data from the Web “on the go” is necessary. This may not be a precise process but a matter of finding a good enough integration to allow interaction to proceed successfully, even if a complete solution is impossible.

Considerable success has already been achieved in the field of ontology matching and merging, but the application of these techniques – often developed for static environments – to the dynamic integration of large-scale data has not been well studied.

Presenting the results of such dynamic integration to both end-users and database administrators – while providing quality assurance and provenance – is not yet a feature of many deployed systems. To make matters more difficult, on the Web there are massive amounts of information available online that could be integrated, but this information is often chaotically organised, stored in a wide variety of data-formats, and difficult to interpret.

This area has been of interest in academia for some time, and is becoming increasingly important in industry and – thanks to open data efforts and other initiatives – to government as well. The aim of this workshop is to bring together practitioners from academia, industry and government who are involved in all aspects of this field: from those developing, curating and using Linked Data, to those focusing on matching and merging techniques.

Topics of interest include, but are not limited to:

  • Integration of large and heterogeneous data
  • Machine-learning over structured data
  • Ontology evolution and dynamics
  • Ontology matching and alignment
  • Presentation of dynamically integrated data
  • Incentives and human computation over structured data and ontologies
  • Ranking and search over structured and semi-structured data
  • Quality assurance and data-cleansing
  • Vocabulary management in Linked Data
  • Schema and ontology versioning and provenance
  • Background knowledge in matching
  • Extensions to knowledge representation languages to better support change
  • Inconsistency and missing values in databases and ontologies
  • Dynamic knowledge construction and exploitation
  • Matching for dynamic applications (e.g., p2p, agents, streaming)
  • Case studies, software tools, use cases, applications
  • Open problems
  • Foundational issues

Applications and evaluations on data-sources that are from the Web and Linked Data are particularly encouraged.

Several years from now, how will you find this conference (and its proceedings)?

  • Large Heterogeneous Data 2012
  • Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012
  • LHD-12

Just curious.

Cell Architectures (adding dashes of heterogeneity)

Saturday, May 12th, 2012

Cell Architectures

From the post:

A consequence of Service Oriented Architectures is the burning need to provide services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
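
The shard-to-cell relationship described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Cell` and `CellRouter` names, the user-id ranges, and the in-memory `store` are all hypothetical and are not taken from the quoted post.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A self-contained installation serving one shard (a range of user ids)."""
    name: str
    first_user_id: int                          # inclusive lower bound of the shard
    store: dict = field(default_factory=dict)   # cell-local storage (hypothetical)
    healthy: bool = True


class CellRouter:
    """Routes each user id to the cell whose shard covers it."""

    def __init__(self, cells):
        self.cells = sorted(cells, key=lambda c: c.first_user_id)
        self._bounds = [c.first_user_id for c in self.cells]

    def cell_for(self, user_id):
        # Find the last cell whose lower bound is <= user_id.
        idx = bisect_right(self._bounds, user_id) - 1
        if idx < 0:
            raise KeyError(f"no cell covers user {user_id}")
        return self.cells[idx]

    def put(self, user_id, value):
        cell = self.cell_for(user_id)
        if not cell.healthy:
            raise RuntimeError(f"{cell.name} is down")
        cell.store[user_id] = value

    def get(self, user_id):
        cell = self.cell_for(user_id)
        if not cell.healthy:
            raise RuntimeError(f"{cell.name} is down")
        return cell.store[user_id]


router = CellRouter([Cell("cell-a", 0), Cell("cell-b", 1000)])
router.put(42, "alice")
router.put(1500, "bob")

router.cells[0].healthy = False   # simulate a failure of cell-a
print(router.get(1500))           # users in cell-b are unaffected
```

Because each cell owns its own storage, a failure in one cell only affects the users in that shard. That is the isolation property the list above claims.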

The intersection of semantic heterogeneity and scaling remains largely unexplored.

I suggest scaling in a homogeneous environment and then adding dashes of heterogeneity to see what breaks.

Adjust and try again.

“AI on the Web” 2012 – Saarbrücken, Germany

Monday, April 23rd, 2012

“AI on the Web” 2012 – Saarbrücken, Germany

Important Dates:

Deadline for Submission: July 5, 2012

Notification of Authors: August 14, 2012

Final Versions of Papers: August 28, 2012

Workshop: September 24/25, 2012

From the website:

The World Wide Web has become a unique source of knowledge on virtually any imaginable topic. It is continuously fed by companies, academia, and common people with a variety of information in numerous formats. By today, the Web has become an invaluable asset for research, learning, commerce, socializing, communication, and entertainment. Still, making full use of the knowledge contained on the Web is an ongoing challenge due to the special properties of the Web as an information source:

  • Heterogeneity: web data occurs in any kind of formats, languages, data structures and terminology one can imagine.
  • Decentrality: the Web is inherently decentralized which means that there is no central point of control that can ensure consistency or synchronicity.
  • Scale: the Web is huge and processing data at web scale is a major challenge in particular for knowledge‐intensive methods.

These characteristics make the Web a challenging but also a promising chance for AI methods that can help to make the knowledge on the Web more accessible for humans and machines by capturing, representing and using information semantics. The relevance and importance of AI methods for the Web is underlined by the fact that the AAAI – as one of the major AI conferences – has been featuring a special track “AI on the Web” for more than five years now. In line with this track and in order to stress this relevance within the German AI community, we are looking for work on relevant methods and their application to web data.

Look beyond the Web, to the larger world of information of the “deep” web or the even larger world of information, web or not, and what do you see?

Heterogeneity, Decentrality, Scale.

What we learn about AI for the Web may help us with larger information problems.

Using an RDF Data Pipeline to Implement Cross-Collection Search

Saturday, March 31st, 2012

Using an RDF Data Pipeline to Implement Cross-Collection Search by David Henry and Eric Brown.


This paper presents an approach to transforming data from many diverse sources in support of a semantic cross-collection search application. It describes the vision and goals for a semantic cross-collection search and examines the challenges of supporting search of that kind using very diverse data sources. The paper makes the case for supporting semantic cross-collection search using semantic web technologies and standards including Resource Description Framework (RDF), SPARQL Protocol and RDF Query Language (SPARQL), and an XML mapping language. The Missouri History Museum has developed a prototype method for transforming diverse data sources into a data repository and search index that can support a semantic cross-collection search. The method presented in this paper is a data pipeline that transforms diverse data into localized RDF; then transforms the localized RDF into more generalized RDF graphs using common vocabularies; and ultimately transforms generalized RDF graphs into a Solr search index to support a semantic cross-collection search. Limitations and challenges of this approach are detailed in the paper.
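
The three stages named in the abstract (localized RDF, generalized RDF, search index) can be pictured with plain Python tuples standing in for triples. This is a rough sketch and not the museum's code: the function names, the `photos:` and `objects:` prefixes, the predicate mapping, and the sample records are all made up for illustration, and the final stage only builds flat dictionaries of the kind one might hand to a search index.

```python
# Stage 1: source record -> "localized" triples with source-specific predicates.
def to_local_triples(source, record_id, record):
    subject = f"{source}:{record_id}"
    return [(subject, f"{source}:{key}", value) for key, value in record.items()]


# Stage 2: localized -> generalized triples via a hand-written predicate mapping.
# The mapping below is hypothetical; unmapped predicates are simply dropped.
COMMON_VOCAB = {
    "photos:caption": "dc:title",
    "photos:taken": "dc:date",
    "objects:name": "dc:title",
    "objects:made": "dc:date",
}


def generalize(triples, mapping=COMMON_VOCAB):
    return [(s, mapping[p], o) for s, p, o in triples if p in mapping]


# Stage 3: generalized triples -> one flat, multi-valued document per subject.
def to_search_docs(triples):
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"id": s})
        doc.setdefault(p.replace(":", "_"), []).append(o)
    return list(docs.values())


# Made-up sample records from two differently shaped collections.
local = (
    to_local_triples("photos", "17", {"caption": "Sample photo", "taken": "1904"})
    + to_local_triples("objects", "3", {"name": "Sample object", "made": "1880", "shelf": "B4"})
)
docs = to_search_docs(generalize(local))
for doc in docs:
    print(doc)
```

Both collections end up searchable through the same `dc_title` and `dc_date` fields, while the collection-specific `shelf` value is lost at stage 2. That loss is one small example of the limitations an approach like this has to manage.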

A great report on the issues you will face with diverse data resources. (And who doesn’t have those?)

The “practical considerations” section is particularly interesting and I am sure the project participants would appreciate any suggestions you may have.

LDIF – Linked Data Integration Framework (0.4)

Tuesday, January 24th, 2012

LDIF – Linked Data Integration Framework (0.4)

Version 0.4 News:

Up till now, LDIF stored data purely in-memory which restricted the amount of data that could be processed. Version 0.4 provides two alternative implementations of the LDIF runtime environment which allow LDIF to scale to large data sets: 1. The new triple store backed implementation scales to larger data sets on a single machine with lower memory consumption at the expense of processing time. 2. The new Hadoop-based implementation provides for processing very large data sets on a Hadoop cluster, for instance within Amazon EC2. A comparison of the performance of all three implementations of the runtime environment is found on the LDIF benchmark page.

From the “About LDIF:”

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up till now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI while keeping track of data provenance.
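
The two steps the LDIF authors describe, vocabulary mapping and identity resolution, can be sketched without any RDF library. This is a minimal illustration and not LDIF's API: the function names, the `ex:` target vocabulary, the `srcA`/`srcB`/`srcC` identifiers, and the rule that the lexicographically smallest alias becomes the target URI are all assumptions made for the example.

```python
# Hypothetical mapping from source predicates to a single target vocabulary.
VOCAB_MAP = {"a:name": "ex:label", "b:label": "ex:label", "c:bornIn": "ex:bornIn"}

# Hypothetical alias pairs (think owl:sameAs links) between source URIs.
SAME_AS = [("srcA:berlin", "srcB:city42")]


def canonical_uris(same_as_pairs):
    """Union-find over alias pairs; every alias maps to one target URI."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: the smallest string in a group becomes the target URI.
            parent[max(ra, rb)] = min(ra, rb)
    return {u: find(u) for u in list(parent)}


def integrate(quads, vocab_map, alias_to_target):
    """Translate predicates and replace URI aliases, keeping the source for provenance."""
    return [
        (alias_to_target.get(s, s), vocab_map.get(p, p), alias_to_target.get(o, o), source)
        for s, p, o, source in quads
    ]


# Made-up quads: (subject, predicate, object, source).
quads = [
    ("srcA:berlin", "a:name", "Berlin", "srcA"),
    ("srcB:city42", "b:label", "Berlin", "srcB"),
    ("srcC:p7", "c:bornIn", "srcB:city42", "srcC"),
]
targets = canonical_uris(SAME_AS)
integrated = integrate(quads, VOCAB_MAP, targets)
for quad in integrated:
    print(quad)
```

After integration a single query on `ex:label` and `srcA:berlin` reaches statements from all three sources, and the fourth element of each tuple still records where the statement came from.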

With the addition of Hadoop based processing, definitely worth your time to download and see what you think of it.

Ironic that the problem it solves:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

already existed, prior to Linked Data as:

  1. data sources use a wide range of different vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified differently within different data sources.

So the Linked Data drill is to convert data, which already has these problems, into Linked Data, which will still have these problems, and then solve the problem of differing identifications.


Did I miss a step?

Call for Papers on Big Data: Theory and Practice

Thursday, January 12th, 2012

SWJ 2012 : Semantic Web Journal Call for Papers on Big Data: Theory and Practice


Manuscript submission due: 13. February 2012
First notification: 26. March 2012
Issue publication: Summer 2012

From the post:

The Semantic Web journal calls for innovative and high-quality papers describing theory and practice of storing, accessing, searching, mining, processing, and visualizing big data. We especially invite papers that describe or demonstrate how ontologies, Linked Data, and Semantic Web technologies can handle the problems arising when integrating massive amounts of multi-thematic and multi-perspective information from heterogeneous sources to answer complex questions that cut through domain boundaries.

We welcome all paper categories, i.e., full research papers, application reports, systems and tools, ontology papers, as well as surveys, as long as they clearly relate to challenges and opportunities arising from processing big data – see our listing of paper types in the author guidelines. In other words, we expect all submitted manuscripts to address how the presented work can exploit massive and/or heterogeneous data.

Since Semantic Web technologies represent subjects as well as being subjects themselves, they should enable demonstrations of integrating diverse Semantic Web approaches to the same data, ideally where the underlying data is heterogeneous as well. Now that would be an interesting paper.

Proceedings…Information Heterogeneity and Fusion in Recommender Systems

Tuesday, January 10th, 2012

Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems

I am still working on the proceedings for the main conference but thought these might be of interest:

  • Information market based recommender systems fusion
    Efthimios Bothos, Konstantinos Christidis, Dimitris Apostolou, Gregoris Mentzas
    Pages: 1-8
  • A kernel-based approach to exploiting interaction-networks in heterogeneous information sources for improved recommender systems
    Oluwasanmi Koyejo, Joydeep Ghosh
    Pages: 9-16
  • Learning multiple models for exploiting predictive heterogeneity in recommender systems
    Clinton Jones, Joydeep Ghosh, Aayush Sharma
    Pages: 17-24
  • A generic semantic-based framework for cross-domain recommendation
    Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, Francesco Ricci
    Pages: 25-32
  • Hybrid algorithms for recommending new items
    Paolo Cremonesi, Roberto Turrin, Fabio Airoldi
    Pages: 33-40
  • Expert recommendation based on social drivers, social network analysis, and semantic data representation
    Maryam Fazel-Zarandi, Hugh J. Devlin, Yun Huang, Noshir Contractor
    Pages: 41-48
  • Experience Discovery: hybrid recommendation of student activities using social network data
    Robin Burke, Yong Zheng, Scott Riley
    Pages: 49-52
  • Personalizing tags: a folksonomy-like approach for recommending movies
    Alan Said, Benjamin Kille, Ernesto W. De Luca, Sahin Albayrak
    Pages: 53-56
  • Personalized pricing recommender system: multi-stage epsilon-greedy approach
    Toshihiro Kamishima, Shotaro Akaho
    Pages: 57-64
  • Matrix co-factorization for recommendation with rich side information and implicit feedback
    Yi Fang, Luo Si
    Pages: 65-69

Querying Semi-Structured Data

Friday, January 6th, 2012

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.

A bit dated, 1996, but still worth reading. Updating the paper would make a nice semester-sized project.

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people in Jan/Feb of 2011 were looking for? Perhaps they are working on an update of the paper? Or would like to collaborate on updating the paper.

Seems like a small “feature” that would allow researchers to contact others without disclosure of email addresses (other than for the sender of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)

The Future of Hadoop in Bioinformatics

Monday, July 18th, 2011

The Future of Hadoop in Bioinformatics: Hadoop and its ecosystem including MapReduce are the dominant open source Big Data solution by Bob Gourley.

From the post:

Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, on what’s changed in the half-year since its publication and what’s to come.

As Dr. Taylor expected, Hadoop and its “ecosystem” including MapReduce are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log(n), meaning, even assuming reads that are 100 base pairs in length and a human genome of 3 billion pairs, analyzing an entire human genome will take 7.5 times longer than if it scaled linearly. By dividing the task up into a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community make Hadoop and related software the parallelization solution of choice in next generation sequencing. In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months, this will change due to the following trends:
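
The "7.5 times" figure in the quoted passage is not derived there. One reading that reproduces it is to take the logarithm in base 10, which is an assumption on my part and not something the post states. The genome size and read length below are the ones given in the quote.

```python
import math

genome_bases = 3_000_000_000   # human genome size used in the quoted post
read_length = 100              # read length assumed in the quoted post
reads = genome_bases // read_length   # n = 30,000,000 reads

# Work proportional to n*log(n), divided by work proportional to n,
# leaves a factor of log(n). Base 10 is an assumption here.
slowdown_base10 = (reads * math.log10(reads)) / reads
print(round(slowdown_base10, 1))   # 7.5

# With a base-2 logarithm the same ratio is noticeably larger.
slowdown_base2 = math.log2(reads)
print(round(slowdown_base2, 1))
```

The base of the logarithm only changes the constant factor, not the growth rate, so the 7.5 is best read as a rough indication of the overhead and not as a precise prediction.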

So, over the next year to eighteen months, what do you see as the evolution of topic map software and services?

Or what problems do you see becoming apparent in bioinformatics or other areas (like the Department of Energy’s knowledgebase) that will require topic maps?

(More on the DOE project later this week.)

Information Heterogeneity and Fusion

Thursday, May 12th, 2011

2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)

Important Dates:

Paper submission deadline: 25th July 2011
Notification of acceptance: 19th August 2011
Camera-ready version due: 12th September 2011
Workshop: 23rd or 27th October 2011

Datasets are also being made available. Just in case you can’t find any heterogeneous data lying around. 😉

Looks like a perfect venue for topic map papers. (Not to mention that a re-usable mapping between recommender systems looks like a commercial opportunity.)

From the website:

In recent years, increasing attention has been given to finding ways for combining, integrating and mediating heterogeneous sources of information for the purpose of providing better personalized services in many information seeking and e-commerce applications. Information heterogeneity can indeed be identified in any of the pillars of a recommender system: the modeling of user preferences, the description of resource contents, the modeling and exploitation of the context in which recommendations are made, and the characteristics of the suggested resource lists.

Almost all current recommender systems are designed for specific domains and applications, and thus usually try to make best use of a local user model, using a single kind of personal data, and without explicitly addressing the heterogeneity of the existing personal information that may be freely available (on social networks, homepages, etc.). Recognizing this limitation, among other issues: a) user models could be based on different types of explicit and implicit personal preferences, such as ratings, tags, textual reviews, records of views, queries, and purchases; b) recommended resources may belong to several domains and media, and may be described with multilingual metadata; c) context could be modeled and exploited in multi-dimensional feature spaces; d) and ranked recommendation lists could be diverse according to particular user preferences and resource attributes, oriented to groups of users, and driven by multiple user evaluation criteria.

The aim of HetRec workshop is to bring together students, faculty, researchers and professionals from both academia and industry who are interested in addressing any of the above forms of information heterogeneity and fusion in recommender systems. We would like to raise awareness of the potential of using multiple sources of information, and look for sharing expertise and suitable models and techniques.

Another dire need is for strong datasets, and one of our aims is to establish benchmarks and standard datasets on which the problems could be investigated. In this edition, we make available on-line datasets with heterogeneous information from several social systems. These datasets can be used by participants to experiment and evaluate their recommendation approaches, and be enriched with additional data, which may be published at the workshop website for future use.

KNIME – 4th Annual User Group Meeting

Wednesday, March 16th, 2011

KNIME – 4th Annual User Group Meeting

From the website:

The 4th KNIME Workshop and Users Meeting at Technopark in Zurich, Switzerland took place between February 28th and March 4th, 2011 and was a huge success.

The meeting was very well attended by more than 130 participants. The presentations ranged from customer intelligence and applications of KNIME in soil and fuel research through to high performance data analytics and KNIME applications in the Life Science industry. The second meeting of the special interest group attracted more than 50 attendees and was filled with talks about how KNIME can be put to use in this fast growing research area.

Presentations are available.

A new version of KNIME is available for download with the features listed in ChangeLog 2.3.3.

Focused on data analytics and work flow, another software package that could benefit from an interchangeable subject-oriented approach.


Saturday, February 12th, 2011

Managing and Reasoning in the Presence of Inconsistency

The International Journal of Semantic Computing describes this Call for Papers as follows:

Inconsistency is ubiquitous in the real world, in human behaviors, and in the computing systems we build. Inconsistency manifests itself in a plethora of phenomena at different levels in the depth of knowledge, ranging from data, information, knowledge, meta-knowledge, to expertise. Data inconsistency arises when patterns in data do not conform to an established range, distribution or interpretation. The exponentially growing volumes of data stemming from almost all types of data being created in digital form, a proliferation of sensors and sensor networks, and other sources such as social networks, complex computer simulations, space explorations, and high-resolution imagery and video, have made data inconsistency an inevitability. Information inconsistency occurs when meanings of the same data values become conflicting or when the same attribute for an entity has different data values. Knowledge inconsistency happens when propositions of either declarative or procedural beliefs, in either explicit or tacit form, yield antagonistic outcomes for the same circumstance. Inconsistency can also emerge from meta-knowledge or from expertise. How to manage and reason in the presence of inconsistency in computing systems is a very important issue in semantic computing, social computing, and other data-rich or knowledge-rich computing paradigms. It requires that we understand the causes and circumstances of inconsistency, establish proper metrics for inconsistency, adopt formalisms to represent inconsistency, develop ways to recognize and analyze different types of inconsistency, and devise mechanisms and methodologies to manage and handle inconsistency.

Refreshing in that inconsistency is recognized as an omnipresent and everlasting fact of our environments. Including computing environments.

The phrase, “…establish proper metrics for inconsistency,…” betrays a world view that we can stand outside of our inconsistencies and those of others.

For all the useful work that will appear in this volume (and others like it), there is no place to stand outside of our environments and their inconsistencies.

Important Dates
Submission deadline: May 20, 2011
Review result notification: July 20, 2011
Revision due: August 20, 2011
Final version due: August 31, 2011
Tentative date of publication: September, 2011 (Vol.5, No.3)

NCIBI – National Center for Integrative Biomedical Informatics

Wednesday, January 19th, 2011

NCIBI – National Center for Integrative Biomedical Informatics

From the website:

The National Center for Integrative Biomedical Informatics (NCIBI) is one of seven National Centers for Biomedical Computing (NCBC) within the NIH Roadmap. The NCBC program is focused on building a universal computing infrastructure designed to speed progress in biomedical research. NCIBI was founded in September 2005 and is based at the University of Michigan as part of the Center for Computational Medicine and Bioinformatics (CCMB).

Note the use of integrative in the name of the center.

They “get” that part.

They are in fact working on mappings to support integration of data even as I write these lines.

There is a lot to be learned about their strategies for integration and to better understand the integration issues they face in this domain. This site is a good starting place to do both.

KNIME Version 2.3.0 released – News

Saturday, December 18th, 2010

KNIME Version 2.3.0 released

From the announcement:

The new version greatly enhances the usability of KNIME. It adds new features like workflow annotations, support for hotkeys, inclusion of R-views in reports, data flow switches, option to hide node labels, variable support in the database reader/connector and R-nodes, and the ability to export KNIME workflows as SVG Graphics.

With the 2.3 release we are also introducing a community node repository, which includes KNIME extensions for bio- and chemoinformatics and an advanced R-scripting environment.

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

Friday, December 17th, 2010

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

From the call:

Research on the Semantic Web, which is now in its second decade, has had a tremendous success in encouraging people to publish data on the Web in structured, linked, and standardized ways. The success of what has now become the Web of Data can be read from the sheer number of triples available within the Linked-Open Data, Linked Life Data and Open-Government initiatives. However, this growth in data makes many of the established assumptions inappropriate and offers a number of new research challenges.

In stark contrast to early Semantic Web applications that dealt with small, hand-crafted ontologies and data-sets, the new Web of Data comes with a plethora of contradicting world-views and contains incomplete, inconsistent, incorrect, fast-changing and opinionated information. This information not only comes from academic sources and trustworthy institutions, but is often community built, scraped or translated.

In short: the Web of Data is messy, and methods to deal with this messiness are paramount for its future.

Now, we have two choices as the topic map community:

  • congratulate ourselves for seeing this problem long ago, high five each other, etc., or
  • step up and offer topic map solutions that incorporate as much of the existing SW work as possible.

I strongly suggest the second one.

Important dates:

We will aim at an efficient publication cycle in order to guarantee prompt availability of the published results. We will review papers on a rolling basis as they are submitted and explicitly encourage submissions well before the submission deadline. Submit papers online at the journal’s Elsevier Web site.

Submission deadline: 1 February 2011
Author notification: 15 June 2011

Revisions submitted: 1 August 2011
Final decisions: 15 September 2011
Publication: 1 January 2012

Rule Synthesizing from Multiple Related Databases

Tuesday, November 9th, 2010

Rule Synthesizing from Multiple Related Databases
Author(s): Dan He, Xindong Wu, Xingquan Zhu
Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.
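
To make the idea in the abstract concrete, here is a toy sketch of grouping databases by how much their item sets overlap and then combining a rule's support within a group. It is not the authors' algorithm: Jaccard similarity, the greedy single-pass grouping, the 0.5 threshold, the size-weighted average, and all of the store data are stand-ins chosen for illustration.

```python
def jaccard(a, b):
    """Share of items two databases have in common (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_by_items(db_items, threshold=0.5):
    """Greedy grouping: a database joins the first group whose first member is similar enough."""
    groups = []
    for name, items in db_items.items():
        for group in groups:
            if jaccard(items, db_items[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def synthesize_support(rule, group, rule_support, db_size):
    """Size-weighted average support; a database lacking the rule contributes 0."""
    total = sum(db_size[d] for d in group)
    return sum(db_size[d] * rule_support[d].get(rule, 0.0) for d in group) / total


# Made-up stores: two grocers with overlapping items and one hardware store.
db_items = {
    "store1": {"milk", "bread", "eggs"},
    "store2": {"milk", "bread", "jam"},
    "store3": {"nails", "paint"},
}
db_size = {"store1": 100, "store2": 300, "store3": 50}
rule_support = {
    "store1": {"milk->bread": 0.4},
    "store2": {"milk->bread": 0.2},
    "store3": {},
}

groups = group_by_items(db_items)
print(groups)
combined = synthesize_support("milk->bread", groups[0], rule_support, db_size)
print(combined)
```

Averaging across all three stores would dilute the grocery rule with a database where its items never occur. Grouping first keeps the hardware store out of the calculation, which is the intuition behind clustering before synthesizing.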

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assume that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources

Tuesday, November 2nd, 2010

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources
Authors: Tie Hua Zhou, Byeong Mun Heo, Ling Wang, Yang Koo Lee, Duck Jin Chai and Keun Ho Ryu
Keywords: Multimedia Metadata Management Systems, Metadata, MPEG-7, TV-Anytime


With the advances in information technology, the amount of multimedia metadata captured, produced, and stored is increasing rapidly. As a consequence, multimedia content is widely used for many applications in today’s world, and hence, a need for organizing multimedia metadata and accessing it from repositories with vast amount of information has been a driving stimulus both commercially and academically. MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex multimedia types. Meanwhile, other metadata or description schemes, such as Dublin Core, XML, TV-Anytime etc., are becoming popular in different application domains. In this paper, we present a new prototype Multimedia Metadata Management System. Our system is good at sharing the integration of multimedia metadata from heterogeneous sources. This system enables the collection, analysis and integration of multimedia metadata semantic description from some different kinds of services. (UCC, IPTV, VOD and Digital TV et al.)

The details for the “Metadata Analyzer” and “Metadata Mapping” seem to be a bit sparse (as in non-existent) for a “prototype…supporting integration of heterogeneous sources.”

MPEG-7 has an important role to play in this area and topic mappers should be aware of it.

I will try to locate more useful resources on MPEG-7 and multimedia content.


Sunday, October 31st, 2010


From the website:

OpenII is a collaborative effort spearheaded by The MITRE Corporation and Google to create a suite of open-source tools for information integration. The project is leveraging the latest developments in research on information integration to create a platform on which integration applications can be built and further research can be conducted.

The motivation for OpenII is that although a significant amount of research has been conducted on information integration, and several commercial systems have been deployed, many information integration applications are still hard to build. In research, we often innovate on a specific aspect of information integration, but then spend much our time building (and rebuilding) other components that we need in order to validate our contributions. As a result, the research prototypes that have been built are generally not reusable and do not inter-operate with each other. On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts and can easily be tailored to the needs of particular domains.

OpenII tools include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching collections of schemas and extending schemas, and run-time tools for processing queries over heterogeneous data sources.

The M3 metamodel:

The fundamental building block in M3 is the entity. An entity represents information about a set of related real-world objects. Associated with each entity is a set of attributes that indicate what information is captured about each entity. For simplicity, we assume that at most one value can be associated with each attribute of an entity.
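To make the quoted description concrete, here is a minimal sketch of an entity with a declared set of attributes and at most one value per attribute. The class name `Entity`, the method `add_object` and the sample "Customer" data are hypothetical illustrations, not OpenII's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, List


@dataclass
class Entity:
    """Hypothetical model of an M3-style entity (all names are illustrative)."""
    name: str
    attributes: FrozenSet[str]  # what information is captured about each object
    objects: List[Dict[str, Any]] = field(default_factory=list)

    def add_object(self, values: Dict[str, Any]) -> None:
        # Only attributes declared for this entity may carry a value.
        unknown = set(values) - self.attributes
        if unknown:
            raise ValueError(f"unknown attributes for {self.name}: {sorted(unknown)}")
        # A dict already allows one value per key; also reject collections,
        # which would smuggle several values into a single attribute.
        for attribute, value in values.items():
            if isinstance(value, (list, tuple, set, frozenset)):
                raise ValueError(f"{attribute} may hold at most one value")
        self.objects.append(dict(values))


customer = Entity("Customer", frozenset({"id", "name", "email"}))
# "email" is omitted: "at most one value" also permits no value at all.
customer.add_object({"id": 1, "name": "Ada"})
```

Note that nothing in such a structure records whether a "Customer" entity in one schema and, say, a "Client" entity in another represent the same subject.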

The project could benefit from a strong injection of subject-identity-based thinking and design.

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

Sunday, October 17th, 2010

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

Being organized by Phillip C-Y Sheu (UC Irvine), Phone: +1 949 824 2660. Volunteers are needed for both organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining

Monday, October 11th, 2010

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining
Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Anwitaman Datta, Wee Keong Ng, Steven C. H. Hoi
Keywords: Distributed classification, P2P network, cascade SVM


Distributed classification aims to build an accurate classifier by learning from distributed data while reducing computation and communication cost. A P2P network where numerous users come together to share resources like data content, bandwidth, storage space and CPU resources is an excellent platform for distributed classification. However, two important aspects of the learning environment have often been overlooked by other works, viz., 1) location of the peers which results in variable communication cost and 2) heterogeneity of the peers’ data which can help reduce redundant communication. In this paper, we examine the properties of network and data heterogeneity and propose a simple yet efficient P2P classification approach that minimizes expensive inter-region communication while achieving good generalization performance. Experimental results demonstrate the feasibility and effectiveness of the proposed solution.

Among the other claims for Satrap:

  • achieves the best accuracy-to-communication cost ratio given that data exchange is performed to improve global accuracy.
  • allows users to control the trade-off between accuracy and communication cost with the user-specified parameters.

I find these two the most interesting, in part because semantic integration, whether explicit or not, is always a question of cost ratios and tradeoffs.

It would be refreshing to see papers that say which semantic integrations would be too costly with method X or are not possible with method Y.