Archive for the ‘Data Clustering’ Category

XML data clustering: An overview

Saturday, February 25th, 2012

XML data clustering: An overview by Alsayed Algergawy, Marco Mesiti, Richi Nayak, and Gunter Saake.

Abstract:

In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the clustering of XML data. These applications need data in the form of similar contents, tags, paths, structures, and semantics. In this article, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. In this presentation, we aim to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the article moves into the description of future trends and research issues that still need to be faced.

I thought this survey article would be of particular interest since it covers the syntax and semantics of XML that contains data.

Not to mention that our old friend, heterogeneous data, isn’t far behind:

Since XML data are engineered by different people, they often have different structural and terminological heterogeneities. The integration of heterogeneous data sources requires many tools for organizing and making their structure and content homogeneous. XML data integration is a complex activity that involves reconciliation at different levels: (1) at schema level, reconciling different representations of the same entity or property, and (2) at instance level, determining if different objects coming from different sources represent the same real-world entity. Moreover, the integration of Web data increases the integration process challenges in terms of heterogeneity of data. Such data come from different resources and it is quite hard to identify the relationship with the business subjects. Therefore, a first step in integrating XML data is to find clusters of the XML data that are similar in semantics and structure [Lee et al. 2002; Viyanon et al. 2008]. This allows system integrators to concentrate on XML data within each cluster. We remark that reconciling similar XML data is an easier task than reconciling XML data that are different in structures and semantics, since the later involves more restructuring. (emphasis added)

Two comments to bear in mind while reading this paper.

First, print our or photocopy Table II on page 35, “Features of XML Clustering Approaches.” It will be a handy reminder/guide as you read the coverage of the various techniques.

Second, on the last page, page 41, note that the article was accepted in October of 2009 but not published until October of 2011. It’s great that the ACM has an abundance of excellent survey articles but a two year delay is publication is unreasonable.

Surveys in rapidly developing fields are of most interest when they are timely. Electronic publication upon final acceptance should be the rule at an organization such as the ACM.