Archive for the ‘Heterogeneous Data’ Category

Semantics for Big Data [W3C late to semantic heterogeneity party]

Sunday, March 31st, 2013

Semantics for Big Data

Dates:

Submission due: May 24, 2013

Acceptance Notification: June 21, 2013

Camera-ready Copies: June 28, 2013

Symposium: November 15-17, 2013

From the webpage:

AAAI 2013 Fall Symposium; Westin Arlington Gateway in Arlington, Virginia, November 15-17, 2013.

Workshop Description and Scope

One of the key challenges in making use of Big Data lies in finding ways of dealing with heterogeneity, diversity, and complexity of the data, while its volume and velocity forbid solutions available for smaller datasets as based, e.g., on manual curation or manual integration of data. Semantic Web Technologies are meant to deal with these issues, and indeed since the advent of Linked Data a few years ago, they have become central to mainstream Semantic Web research and development. We can easily understand Linked Data as being a part of the greater Big Data landscape, as many of the challenges are the same. The linking component of Linked Data, however, puts an additional focus on the integration and conflation of data across multiple sources.

Workshop Topics

In this symposium, we will explore the many opportunities and challenges arising from transferring and adapting Semantic Web Technologies to the Big Data quest. Topics of interest focus explicitly on the interplay of Semantics and Big Data, and include:

  • the use of semantic metadata and ontologies for Big Data,
  • the use of formal and informal semantics,
  • the integration and interplay of deductive (semantic) and statistical methods,
  • methods to establish semantic interoperability between data sources,
  • ways of dealing with semantic heterogeneity,
  • scalability of Semantic Web methods and tools, and
  • semantic approaches to the explication of requirements from eScience applications.

The W3C is late to the party as evidenced by semantic heterogeneity becoming “…central to mainstream Semantic Web research and development” after the advent of Linked Data.

I suppose better late than never.

At least if they remember that:

Users experience semantic heterogeneity in data and in the means used to describe and store data.

Whatever solution is crafted, its starting premise must be to capture semantics as seen by some defined user.

Otherwise, it is capturing the semantics of designers, authors, etc., which may or may not be valuable to some particular user.

RDF is a good example of capturing someone else’s semantics.

As its uptake is evidence of the interest in someone else’s semantics. (Simple Web Semantics – The Semantic Web Is Failing — But Why?)

On ranking relevant entities in heterogeneous networks…

Tuesday, January 22nd, 2013

On ranking relevant entities in heterogeneous networks using a language-based model by Laure Soulier, Lamjed Ben Jabeur, Lynda Tamine, Wahiba Bahsoun. (Soulier, L., Jabeur, L. B., Tamine, L. and Bahsoun, W. (2013), On ranking relevant entities in heterogeneous networks using a language-based model. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22762)

Abstract:

A new challenge, accessing multiple relevant entities, arises from the availability of linked heterogeneous data. In this article, we address more specifically the problem of accessing relevant entities, such as publications and authors within a bibliographic network, given an information need. We propose a novel algorithm, called BibRank, that estimates a joint relevance of documents and authors within a bibliographic network. This model ranks each type of entity using a score propagation algorithm with respect to the query topic and the structure of the underlying bi-type information entity network. Evidence sources, namely content-based and network-based scores, are both used to estimate the topical similarity between connected entities. For this purpose, authorship relationships are analyzed through a language model-based score on the one hand and on the other hand, non topically related entities of the same type are detected through marginal citations. The article reports the results of experiments using the Bibrank algorithm for an information retrieval task. The CiteSeerX bibliographic data set forms the basis for the topical query automatic generation and evaluation. We show that a statistically significant improvement over closely related ranking models is achieved.

Note the “estimat[ion] of topic similarity between connected entities.”

Very good work but rather than a declaration of similarity (topic maps) we have an estimate of similarity.

Before you protest about the volume of literature/data, recall that some author wrote the documents in question and selected the terms and references found therein.

Rather than guessing what may be similar to what the author wrote, why not devise a method to allow the author to say?

And build upon similarity/sameness declarations across heterogeneous networks of data.

Sky Survey Data Lacks Standardization [Heterogeneous Big Data]

Tuesday, November 27th, 2012

Sky Survey Data Lacks Standardization by Ian Armas Foster.

From the post:

The Sloan Digital Sky Survey is at the forefront of astronomical research, compiling data from observatories around the world in an effort to truly pinpoint where we lie on the universal map. In order to do that, they must aggregate data from several observatories across the world, an intensive data operation.

According to a report written by researchers at UCLA, even though the SDSS is a data intensive astronomical mapping survey, it has yet to lay down a standardized foundation for retrieving and storing scientific data.

Per sdss.org, the first two projects were responsible for observing “a quarter of the sky” and picking out nearly a million galaxies and over 100,000 quasars. The project started at the Apache Point observatory in New Mexico and has since grown to include 25 observatories across the globe. The SDSS gained recognition in 2009 with the Nobel Prize in physics awarded to the advancement of optical fibers and digital imaging detectors (or CCDs) that allowed the project to grow in scale.

The point is that the datasets that the scientists used seemed to be scattered. Some would come about through informal social contacts such as email while others would simply search for necessary datasets on Google. Further, once these datasets were found, there was even an inconsistency in how they were stored before they could be used. However, this may have had to do with the varying sizes of the sets and how quickly the researchers wished to use the data. The entire SDSS dataset consists of over 130 TB, according to the report, and that volume can be slightly unwieldy.

“Large sky surveys, including the SDSS, have significantly shaped research practices in the field of astronomy,” the report concluded. “However, these large data sources have not served to homogenize information retrieval in the field. There is no single, standardized method for discovering, locating, retrieving, and storing astronomy data.”

So, big data isn’t going to be homogeneous big data but heterogeneous big data.

That sounds like an opportunity for topic maps to me.

You?

Mapping solution to heterogeneous data sources

Monday, September 10th, 2012

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!” ;-)

Reading a bit further, you discover that text mining is used to locate sequences and that data is then mapped to “UniProtKB protein entries.”

The data set provides access to:

  • UniProt ID
  • Organism
  • Position
  • PubMed Id
  • Sequence

My concern is what happens when X is mapped to a UniProtKB protein entry to:

  • The prior identifier for X (in the article or source), and
  • The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
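To make the point concrete, here is a minimal sketch of what capturing the prior identifier and its mapping might look like. It is not how dbSNO works internally; the `IdentifierMapping` and `MappingLog` names, the field layout, and every identifier value below are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierMapping:
    """One recorded mapping from a source identifier to a target identifier."""
    prior_id: str   # identifier as it appeared in the article or source
    source: str     # where the prior identifier was used (hypothetical PubMed ID)
    target_id: str  # the "new" identifier (hypothetical UniProtKB accession)
    method: str     # how the mapping was made (text mining, manual, ...)


class MappingLog:
    """Keeps every mapping so older literature can be annotated on rendering."""

    def __init__(self):
        self._by_prior = {}

    def record(self, mapping):
        self._by_prior.setdefault(mapping.prior_id, []).append(mapping)

    def annotate(self, text):
        """Append the target identifier(s) after each known prior identifier."""
        for prior_id, mappings in self._by_prior.items():
            targets = sorted({m.target_id for m in mappings})
            text = text.replace(prior_id, f"{prior_id} [{', '.join(targets)}]")
        return text


log = MappingLog()
# All values below are invented placeholders, not real records.
log.record(IdentifierMapping("protX", "PMID:0000000", "P00000", "text mining"))
annotated = log.annotate("Cys-42 of protX is S-nitrosylated.")
print(annotated)  # Cys-42 of protX [P00000] is S-nitrosylated.
```

The naive string replacement is only good enough for a sketch. What matters is that the log keeps the prior identifier, where it was used, and how it was mapped, so the older text can be pointed at the later aggregation.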

Abstract

Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via http://dbSNO.mbc.nctu.edu.tw. The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

Contacts: francis@saturn.yu.edu.tw or yujuchen@gate.sinica.edu.tw.

Center for Intelligent Information Retrieval (CIIR) [University of Massachusetts Amherst]

Tuesday, August 28th, 2012

Center for Intelligent Information Retrieval (CIIR)

From the webpage:

The Center for Intelligent Information Retrieval (CIIR) is one of the leading research groups working in the areas of information retrieval and information extraction. The CIIR studies and develops tools that provide effective and efficient access to large networks of heterogeneous, multimedia information.

CIIR accomplishments include significant research advances in the areas of retrieval models, distributed information retrieval, information filtering, information extraction, topic models, social network analysis, multimedia indexing and retrieval, document image processing, search engine architecture, text mining, structured data retrieval, summarization, evaluation, novelty detection, resource discovery, interfaces and visualization, digital libraries, computational social science, and cross-lingual information retrieval.

The CIIR has published more than 900 papers on these areas, and has worked with over 90 government and industry partners on research and technology transfer. Open source software supported by the Center is being used worldwide.

Please contact us to talk about potential new projects, collaborations, membership, or joining us as a graduate student or visiting researcher.

To get an idea of the range of their activities, visit the publications page and just browse.

Tutorial on biological networks [The Heterogeneity of Nature]

Friday, July 6th, 2012

Tutorial on biological networks by Francisco G. Vital-Lopez, Vesna Memišević, and Bhaskar Dutta. (Vital-Lopez, F. G., Memišević, V. and Dutta, B. (2012), Tutorial on biological networks. WIREs Data Mining Knowl Discov, 2: 298–325. doi: 10.1002/widm.1061)

Abstract:

Understanding how the functioning of a biological system emerges from the interactions among its components is a long-standing goal of network science. Fomented by developments in high-throughput technologies to characterize biomolecules and their interactions, network science has emerged as one of the fastest growing areas in computational and systems biology research. Although the number of research and review articles on different aspects of network science is increasing, updated resources that provide a broad, yet concise, review of this area in the context of systems biology are few. The objective of this article is to provide an overview of the research on biological networks to a general audience, who have some knowledge of biology and statistics, but are not necessarily familiar with this research field. Based on the different aspects of network science research, the article is broadly divided into four sections: (1) network construction, (2) topological analysis, (3) network and data integration, and (4) visualization tools. We specifically focused on the most widely studied types of biological networks, which are, metabolic, gene regulatory, protein–protein interaction, genetic interaction, and signaling networks. In future, with further developments on experimental and computational methods, we expect that the analysis of biological networks will assume a leading role in basic and translational research.

As a frozen artifact in time, I would suggest reading this article before it is too badly out of date. It will be sad to see it ravaged by time and pitted by later research that renders entire sections obsolete. Or of interest only to medical literature spelunkers of some future time.

Developers of homogeneous and “correct” models of biological networks should take warning from the closing lines of this survey article:

Currently different types of networks, such as PPI, GRN, or metabolic networks are analyzed separately. These heterogeneous networks have to be integrated systematically to generate comprehensive network, which creates a realistic representation of biological systems.[cite omitted] The integrated networks have to be combined with different types of molecular profiling data that measures different facades of the biological system. A recent multi institutional collaborative project, named The Cancer Genome Atlas,[cite omitted] has already started generating much multi-‘omics’ data for large cancer patient cohorts. Thus, we can expect to witness an exciting and fast paced growth on biological network research in the coming years.

Interesting.

Nature uses heterogeneous networks, with great success.

We can keep building homogeneous networks or we can start building heterogeneous networks (at least to the extent we are capable).

What do you think?

Large Heterogeneous Data 2012

Thursday, May 31st, 2012

Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012 (LHD-12)

Important Dates

  • Deadline for paper submission: July 31, 2012
  • Author notification: August 21, 2012
  • Deadline for camera-ready: September 10, 2012
  • Workshop date: November 11th or 12th, 2012

Take the time to read the workshop description.

A great summary of the need for semantic mappings, not more semantic fascism.

From the call for papers:

An interdisciplinary approach is necessary to discover and match meaning dynamically in a world of increasingly large data sources. This workshop aims to bring together practitioners from academia, industry and government for interaction and discussion. This will be a half-day workshop which primarily aims to initiate discussion and debate. It will involve

  • A panel discussion focussing on these issues from an industrial and governmental point of view. Membership to be confirmed, but we expect a representative from Scottish Government and from Google, as well as others.
  • Short presentations grouped into themed panels, to stimulate debate not just about individual contributions but also about the themes in general.

Workshop Description

The problem of semantic alignment – that of two systems failing to understand one another when their representations are not identical – occurs in a huge variety of areas: Linked Data, database integration, e-science, multi-agent systems, information retrieval over structured data; anywhere, in fact, where semantics or a shared structure are necessary but centralised control over the schema of the data sources is undesirable or impractical. Yet this is increasingly a critical problem in the world of large scale data, particularly as more and more of this kind of data is available over the Web.

In order to interact successfully in an open and heterogeneous environment, being able to dynamically and adaptively integrate large and heterogeneous data from the Web “on the go” is necessary. This may not be a precise process but a matter of finding a good enough integration to allow interaction to proceed successfully, even if a complete solution is impossible.

Considerable success has already been achieved in the field of ontology matching and merging, but the application of these techniques – often developed for static environments – to the dynamic integration of large-scale data has not been well studied.

Presenting the results of such dynamic integration to both end-users and database administrators – while providing quality assurance and provenance – is not yet a feature of many deployed systems. To make matters more difficult, on the Web there are massive amounts of information available online that could be integrated, but this information is often chaotically organised, stored in a wide variety of data-formats, and difficult to interpret.

This area has been of interest in academia for some time, and is becoming increasingly important in industry and – thanks to open data efforts and other initiatives – to government as well. The aim of this workshop is to bring together practitioners from academia, industry and government who are involved in all aspects of this field: from those developing, curating and using Linked Data, to those focusing on matching and merging techniques.

Topics of interest include, but are not limited to:

  • Integration of large and heterogeneous data
  • Machine-learning over structured data
  • Ontology evolution and dynamics
  • Ontology matching and alignment
  • Presentation of dynamically integrated data
  • Incentives and human computation over structured data and ontologies
  • Ranking and search over structured and semi-structured data
  • Quality assurance and data-cleansing
  • Vocabulary management in Linked Data
  • Schema and ontology versioning and provenance
  • Background knowledge in matching
  • Extensions to knowledge representation languages to better support change
  • Inconsistency and missing values in databases and ontologies
  • Dynamic knowledge construction and exploitation
  • Matching for dynamic applications (e.g., p2p, agents, streaming)
  • Case studies, software tools, use cases, applications
  • Open problems
  • Foundational issues

Applications and evaluations on data-sources that are from the Web and Linked Data are particularly encouraged.

Several years from now, how will you find this conference (and its proceedings)?

  • Large Heterogeneous Data 2012
  • Workshop on Discovering Meaning On the Go in Large Heterogeneous Data 2012
  • LHD-12

Just curious.

Cell Architectures (adding dashes of heterogeneity)

Saturday, May 12th, 2012

Cell Architectures

From the post:

A consequence of Service Oriented Architectures is the burning need to provide services at scale. The architecture that has evolved to satisfy these requirements is a little known technique called the Cell Architecture.

A Cell Architecture is based on the idea that massive scale requires parallelization and parallelization requires components be isolated from each other. These islands of isolation are called cells. A cell is a self-contained installation that can satisfy all the operations for a shard. A shard is a subset of a much larger dataset, typically a range of users, for example.

Cell Architectures have several advantages:

  • Cells provide a unit of parallelization that can be adjusted to any size as the user base grows.
  • Cells are added in an incremental fashion as more capacity is required.
  • Cells isolate failures. One cell failure does not impact other cells.
  • Cells provide isolation as the storage and application horsepower to process requests is independent of other cells.
  • Cells enable nice capabilities like the ability to test upgrades, implement rolling upgrades, and test different versions of software.
  • Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
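The shard-to-cell idea in the quoted post can be sketched roughly as follows. This is an illustration under stated assumptions, not the architecture of any particular system: the `Cell` and `Router` classes are hypothetical, and routing by a hash of the user id is my choice (the post speaks of a shard as "typically a range of users").

```python
import hashlib


class Cell:
    """A self-contained unit holding all data for the users assigned to it."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.healthy = True

    def put(self, user_id, value):
        if not self.healthy:
            raise RuntimeError(f"cell {self.name} is down")
        self.store[user_id] = value

    def get(self, user_id):
        if not self.healthy:
            raise RuntimeError(f"cell {self.name} is down")
        return self.store.get(user_id)


class Router:
    """Maps each user to exactly one cell using a stable hash (an assumption)."""

    def __init__(self, cells):
        self.cells = list(cells)

    def cell_for(self, user_id):
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return self.cells[int(digest, 16) % len(self.cells)]


router = Router([Cell(f"cell-{i}") for i in range(4)])
for uid in ("alice", "bob", "carol"):
    router.cell_for(uid).put(uid, {"name": uid})
```

Marking one cell unhealthy makes requests for its users fail while every other cell keeps answering, which is the failure isolation the list describes. A plain modulo hash reshuffles users whenever a cell is added, so a real deployment would more likely keep an explicit directory or range table.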

The intersection of semantic heterogeneity and scaling remains largely unexplored.

I suggest scaling in a homogeneous environment and then adding dashes of heterogeneity to see what breaks.

Adjust and try again.

“AI on the Web” 2012 – Saarbrücken, Germany

Monday, April 23rd, 2012

“AI on the Web” 2012 – Saarbrücken, Germany

Important Dates:

Deadline for Submission: July 5, 2012

Notification of Authors: August 14, 2012

Final Versions of Papers: August 28, 2012

Workshop: September 24/25, 2012

From the website:

The World Wide Web has become a unique source of knowledge on virtually any imaginable topic. It is continuously fed by companies, academia, and common people with a variety of information in numerous formats. By today, the Web has become an invaluable asset for research, learning, commerce, socializing, communication, and entertainment. Still, making full use of the knowledge contained on the Web is an ongoing challenge due to the special properties of the Web as an information source:

  • Heterogeneity: web data occurs in any kind of formats, languages, data structures and terminology one can imagine.
  • Decentrality: the Web is inherently decentralized which means that there is no central point of control that can ensure consistency or synchronicity.
  • Scale: the Web is huge and processing data at web scale is a major challenge in particular for knowledge‐intensive methods.

These characteristics make the Web a challenging but also a promising chance for AI methods that can help to make the knowledge on the Web more accessible for humans and machines by capturing, representing and using information semantics. The relevance and importance of AI methods for the Web is underlined by the fact that the AAAI – as one of the major AI conferences – has been featuring a special track “AI on the Web” for more than five years now. In line with this track and in order to stress this relevance within the German AI community, we are looking for work on relevant methods and their application to web data.

Look beyond the Web, to the larger world of information of the “deep” web or the even larger world of information, web or not, and what do you see?

Heterogeneity, Decentrality, Scale.

What we learn about AI for the Web may help us with larger information problems.

Using an RDF Data Pipeline to Implement Cross-Collection Search

Saturday, March 31st, 2012

Using an RDF Data Pipeline to Implement Cross-Collection Search by David Henry and Eric Brown.

Abstract:

This paper presents an approach to transforming data from many diverse sources in support of a semantic cross-collection search application. It describes the vision and goals for a semantic cross-collection search and examines the challenges of supporting search of that kind using very diverse data sources. The paper makes the case for supporting semantic cross-collection search using semantic web technologies and standards including Resource Description Framework (RDF), SPARQL Protocol and RDF Query Language (SPARQL), and an XML mapping language. The Missouri History Museum has developed a prototype method for transforming diverse data sources into a data repository and search index that can support a semantic cross-collection search. The method presented in this paper is a data pipeline that transforms diverse data into localized RDF; then transforms the localized RDF into more generalized RDF graphs using common vocabularies; and ultimately transforms generalized RDF graphs into a Solr search index to support a semantic cross-collection search. Limitations and challenges of this approach are detailed in the paper.

A great report on the issues you will face with diverse data resources. (And who doesn’t have those?)

The “practical considerations” section is particularly interesting and I am sure the project participants would appreciate any suggestions you may have.
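For readers who want to see the shape of the three-stage pipeline in the abstract, here is a rough sketch. It is not the museum's code: triples are plain Python tuples rather than a real RDF library, the `example.org` URIs, the collection and field names, and the `VOCAB_MAP` table are all hypothetical, and the last stage only builds dictionaries of the kind one might send to a search index.

```python
DCTERMS = "http://purl.org/dc/terms/"
LOCAL = "http://example.org/local/"  # hypothetical namespace for localized RDF


def to_local_triples(collection, record_id, record):
    """Stage 1: turn one source record into collection-specific triples."""
    subject = f"http://example.org/{collection}/{record_id}"
    return [(subject, f"{LOCAL}{collection}/{field}", value)
            for field, value in record.items()]


# Stage 2 mapping from localized predicates to a common vocabulary (assumed).
VOCAB_MAP = {
    LOCAL + "photos/caption": DCTERMS + "title",
    LOCAL + "photos/taken": DCTERMS + "date",
    LOCAL + "objects/object_name": DCTERMS + "title",
}


def generalize(triples):
    """Stage 2: keep only triples whose predicate has a common equivalent."""
    return [(s, VOCAB_MAP[p], o) for s, p, o in triples if p in VOCAB_MAP]


def to_search_docs(triples):
    """Stage 3: group generalized triples into one flat document per subject."""
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"id": s})
        doc.setdefault(p.rsplit("/", 1)[-1], []).append(o)
    return list(docs.values())


local = (
    to_local_triples("photos", "17",
                     {"caption": "Eads Bridge", "taken": "1904", "negative_no": "N-88"})
    + to_local_triples("objects", "3", {"object_name": "Bridge model"})
)
docs = to_search_docs(generalize(local))
```

Note that the unmapped `negative_no` field silently disappears at stage 2. Whether to drop, keep, or flag such fields is exactly the kind of practical consideration the paper discusses.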

LDIF – Linked Data Integration Framework (0.4)

Tuesday, January 24th, 2012

LDIF – Linked Data Integration Framework (0.4)

Version 0.4 News:

Up till now, LDIF stored data purely in-memory which restricted the amount of data that could be processed. Version 0.4 provides two alternative implementations of the LDIF runtime environment which allow LDIF to scale to large data sets: 1. The new triple store backed implementation scales to larger data sets on a single machine with lower memory consumption at the expense of processing time. 2. The new Hadoop-based implementation provides for processing very large data sets on a Hadoop cluster, for instance within Amazon EC2. A comparison of the performance of all three implementations of the runtime environment is found on the LDIF benchmark page.

From the “About LDIF:”

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up-till-now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI while keeping track of data provenance.
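The two tasks named in the quote, vocabulary mapping and identity resolution, can be sketched in a few lines. This is not LDIF's API or configuration format. The `AliasResolver` class, the `integrate` function, the quad layout (subject, predicate, object, source graph), and the `a.example` and `b.example` URIs are all assumptions for illustration.

```python
class AliasResolver:
    """Groups URI aliases and picks one canonical URI per group (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            uri = self._parent[uri]
        return uri

    def same_as(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller URI: arbitrary but stable.
            keep, drop = sorted((root_a, root_b))
            self._parent[drop] = keep

    def canonical(self, uri):
        # Values never declared as aliases (including literals) pass through.
        return self._find(uri) if uri in self._parent else uri


def integrate(quads, vocab_map, resolver):
    """Translate predicates and replace aliases; the graph name is kept as provenance."""
    return [
        (resolver.canonical(s), vocab_map.get(p, p), resolver.canonical(o), g)
        for s, p, o, g in quads
    ]


quads = [
    ("http://a.example/person/42", "http://a.example/vocab/fullName",
     "Ada Lovelace", "http://a.example/dump"),
    ("http://b.example/people/ada", "http://b.example/schema/born",
     "1815", "http://b.example/dump"),
]
vocab_map = {"http://a.example/vocab/fullName": "http://xmlns.com/foaf/0.1/name"}
resolver = AliasResolver()
resolver.same_as("http://a.example/person/42", "http://b.example/people/ada")
integrated = integrate(quads, vocab_map, resolver)
```

After integration both statements share one subject URI, and the fourth element still says which source each came from. Notice that somebody had to supply `vocab_map` and the `same_as` declaration, which is where the real work lies.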

With the addition of Hadoop based processing, definitely worth your time to download and see what you think of it.

Ironic that the problem it solves:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

already existed, prior to Linked Data as:

  1. data sources use a wide range of different vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified differently within different data sources.

So the Linked Data drill is to convert data, which already has these problems, into Linked Data, which will still have these problems, and then solve the problem of differing identifications.

Yes?

Did I miss a step?

Call for Papers on Big Data: Theory and Practice

Thursday, January 12th, 2012

SWJ 2012 : Semantic Web Journal Call for Papers on Big Data: Theory and Practice

Dates:

Manuscript submission due: 13. February 2012
First notification: 26. March 2012
Issue publication: Summer 2012

From the post:

The Semantic Web journal calls for innovative and high-quality papers describing theory and practice of storing, accessing, searching, mining, processing, and visualizing big data. We especially invite papers that describe or demonstrate how ontologies, Linked Data, and Semantic Web technologies can handle the problems arising when integrating massive amounts of multi-thematic and multi-perspective information from heterogeneous sources to answer complex questions that cut through domain boundaries.

We welcome all paper categories, i.e., full research papers, application reports, systems and tools, ontology papers, as well as surveys, as long as they clearly relate to challenges and opportunities arising from processing big data – see our listing of paper types in the author guidelines. In other words, we expect all submitted manuscripts to address how the presented work can exploit massive and/or heterogeneous data.

Since Semantic Web technologies both represent subjects and are subjects themselves, they should enable demonstrations of integrating diverse Semantic Web approaches to the same data, ideally where the underlying data is heterogeneous as well. Now that would be an interesting paper.

Proceedings…Information Heterogeneity and Fusion in Recommender Systems

Tuesday, January 10th, 2012

Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems

I am still working on the proceedings for the main conference but thought these might be of interest:

  • Information market based recommender systems fusion
    Efthimios Bothos, Konstantinos Christidis, Dimitris Apostolou, Gregoris Mentzas
    Pages: 1-8
    doi>10.1145/2039320.2039321
  • A kernel-based approach to exploiting interaction-networks in heterogeneous information sources for improved recommender systems
    Oluwasanmi Koyejo, Joydeep Ghosh
    Pages: 9-16
    doi>10.1145/2039320.2039322
  • Learning multiple models for exploiting predictive heterogeneity in recommender systems
    Clinton Jones, Joydeep Ghosh, Aayush Sharma
    Pages: 17-24
    doi>10.1145/2039320.2039323
  • A generic semantic-based framework for cross-domain recommendation
    Ignacio Fernández-Tobías, Iván Cantador, Marius Kaminskas, Francesco Ricci
    Pages: 25-32
    doi>10.1145/2039320.2039324
  • Hybrid algorithms for recommending new items
    Paolo Cremonesi, Roberto Turrin, Fabio Airoldi
    Pages: 33-40
    doi>10.1145/2039320.2039325
  • Expert recommendation based on social drivers, social network analysis, and semantic data representation
    Maryam Fazel-Zarandi, Hugh J. Devlin, Yun Huang, Noshir Contractor
    Pages: 41-48
    doi>10.1145/2039320.2039326
  • Experience Discovery: hybrid recommendation of student activities using social network data
    Robin Burke, Yong Zheng, Scott Riley
    Pages: 49-52
    doi>10.1145/2039320.2039327
  • Personalizing tags: a folksonomy-like approach for recommending movies
    Alan Said, Benjamin Kille, Ernesto W. De Luca, Sahin Albayrak
    Pages: 53-56
    doi>10.1145/2039320.2039328
  • Personalized pricing recommender system: multi-stage epsilon-greedy approach
    Toshihiro Kamishima, Shotaro Akaho
    Pages: 57-64
    doi>10.1145/2039320.2039329
  • Matrix co-factorization for recommendation with rich side information and implicit feedback
    Yi Fang, Luo Si
    Pages: 65-69
    doi>10.1145/2039320.2039330

Querying Semi-Structured Data

Friday, January 6th, 2012

Querying Semi-Structured Data

The Semi-structured data and P2P graph databases post I point to has a broken reference to Serge Abiteboul’s “Querying Semi-Structured Data.” Since I could not correct it there and the topic is of interest for topic maps, I created this entry for it here.

From the Introduction:

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data-formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The “lightweight” data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.

A bit dated, 1996, but still worth reading. Updating the paper would make a nice semester-size project.

BTW, note the download graphics. Makes me think that archives should have an “anonymous notice” feature that allows anyone downloading a paper to send an email to anyone who has downloaded the paper in the past, without disclosing the emails of the prior downloaders.

I would really like to know what the people in Jan/Feb of 2011 were looking for. Perhaps they are working on an update of the paper? Or would they like to collaborate on updating the paper?

Seems like a small “feature” that would allow researchers to contact others without disclosure of email addresses (other than for the sender of course).

Formal publication data:

Abiteboul, S. (1996) Querying Semi-Structured Data. Technical Report. Stanford InfoLab. (Publication Note: Database Theory – ICDT ’97, 6th International Conference, Delphi, Greece, January 8-10, 1997)

The Future of Hadoop in Bioinformatics

Monday, July 18th, 2011

The Future of Hadoop in Bioinformatics: Hadoop and its ecosystem including MapReduce are the dominant open source Big Data solution by Bob Gourley.

From the post:

Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, on what’s changed in the half-year since its publication and what’s to come.

As Dr. Taylor expected, Hadoop and its “ecosystem” including MapReduce are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log n, meaning, even assuming reads that are 100 base pairs in length and a human genome of 3 billion pairs, analyzing an entire human genome will take 7.5 times longer than if it scaled linearly. By dividing the task up into a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community make Hadoop and related software the parallelization solution of choice in next generation sequencing.

In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months, this will change due to the following trends:

So, over the next year to eighteen months, what do you see as the evolution of topic map software and services?

Or what problems do you see becoming apparent in bioinformatics or other areas (like the Department of Energy’s knowledgebase) that will require topic maps?

(More on the DOE project later this week.)
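The “7.5 times longer” figure in the quoted passage can be checked with a few lines of arithmetic. This is only a sketch: the post does not say which logarithm base it uses, so base 10 is my assumption (it happens to reproduce the figure), and the function name is made up for illustration.

```python
import math

GENOME_BASES = 3_000_000_000  # human genome size given in the quoted post
READ_LENGTH = 100             # read length assumed in the quoted post


def nlogn_overhead(genome_bases: int, read_length: int) -> float:
    """Ratio of n*log10(n) to n, i.e. log10(n), for n reads.

    Base 10 is an assumption; the post does not state the base.
    """
    n_reads = genome_bases / read_length  # 30 million reads
    return math.log10(n_reads)


overhead = nlogn_overhead(GENOME_BASES, READ_LENGTH)
print(f"{overhead:.2f}")  # prints 7.48
```

With a base-2 logarithm the ratio would be roughly 25 instead, so the quoted number only makes sense under the base-10 reading.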

Information Heterogeneity and Fusion

Thursday, May 12th, 2011

2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)

Important Dates:

Paper submission deadline: 25th July 2011
Notification of acceptance: 19th August 2011
Camera-ready version due: 12th September 2011
Workshop: 23rd or 27th October 2011

Datasets are also being made available. Just in case you can’t find any heterogeneous data lying around. ;-)

Looks like a perfect venue for topic map papers. (Not to mention that a re-usable mapping between recommender systems looks like a commercial opportunity.)

From the website:

In recent years, increasing attention has been given to finding ways for combining, integrating and mediating heterogeneous sources of information for the purpose of providing better personalized services in many information seeking and e-commerce applications. Information heterogeneity can indeed be identified in any of the pillars of a recommender system: the modeling of user preferences, the description of resource contents, the modeling and exploitation of the context in which recommendations are made, and the characteristics of the suggested resource lists.

Almost all current recommender systems are designed for specific domains and applications, and thus usually try to make best use of a local user model, using a single kind of personal data, and without explicitly addressing the heterogeneity of the existing personal information that may be freely available (on social networks, homepages, etc.). Recognizing this limitation, among other issues: a) user models could be based on different types of explicit and implicit personal preferences, such as ratings, tags, textual reviews, records of views, queries, and purchases; b) recommended resources may belong to several domains and media, and may be described with multilingual metadata; c) context could be modeled and exploited in multi-dimensional feature spaces; d) and ranked recommendation lists could be diverse according to particular user preferences and resource attributes, oriented to groups of users, and driven by multiple user evaluation criteria.

The aim of HetRec workshop is to bring together students, faculty, researchers and professionals from both academia and industry who are interested in addressing any of the above forms of information heterogeneity and fusion in recommender systems. We would like to raise awareness of the potential of using multiple sources of information, and look for sharing expertise and suitable models and techniques.

Another dire need is for strong datasets, and one of our aims is to establish benchmarks and standard datasets on which the problems could be investigated. In this edition, we make available on-line datasets with heterogeneous information from several social systems. These datasets can be used by participants to experiment and evaluate their recommendation approaches, and be enriched with additional data, which may be published at the workshop website for future use.

KNIME – 4th Annual User Group Meeting

Wednesday, March 16th, 2011

KNIME – 4th Annual User Group Meeting

From the website:

The 4th KNIME Workshop and Users Meeting at Technopark in Zurich, Switzerland took place between February 28th and March 4th, 2011 and was a huge success.

The meeting was very well attended by more than 130 participants. The presentations ranged from customer intelligence and applications of KNIME in soil and fuel research through to high performance data analytics and KNIME applications in the Life Science industry. The second meeting of the special interest group attracted more than 50 attendees and was filled with talks about how KNIME can be put to use in this fast growing research area.

Presentations are available.

A new version of KNIME is available for download with the features listed in ChangeLog 2.3.3.

Focused on data analytics and work flow, another software package that could benefit from an interchangeable subject-oriented approach.

Inconsistency?

Saturday, February 12th, 2011

Managing and Reasoning in the Presence of Inconsistency

The International Journal of Semantic Computing describes this Call for Papers as follows:

Inconsistency is ubiquitous in the real world, in human behaviors, and in the computing systems we build. Inconsistency manifests itself in a plethora of phenomena at different levels in the depth of knowledge, ranging from data, information, knowledge, meta-knowledge, to expertise. Data inconsistency arises when patterns in data do not conform to an established range, distribution or interpretation. The exponentially growing volumes of data stemming from almost all types of data being created in digital form, a proliferation of sensors and sensor networks, and other sources such as social networks, complex computer simulations, space explorations, and high-resolution imagery and video, have made data inconsistency an inevitability. Information inconsistency occurs when meanings of the same data values become conflicting or when the same attribute for an entity has different data values. Knowledge inconsistency happens when propositions of either declarative or procedural beliefs, in either explicit or tacit form, yield antagonistic outcomes for the same circumstance. Inconsistency can also emerge from meta-knowledge or from expertise. How to manage and reason in the presence of inconsistency in computing systems is a very important issue in semantic computing, social computing, and other data-rich or knowledge-rich computing paradigms. It requires that we understand the causes and circumstances of inconsistency, establish proper metrics for inconsistency, adopt formalisms to represent inconsistency, develop ways to recognize and analyze different types of inconsistency, and devise mechanisms and methodologies to manage and handle inconsistency.

Refreshing in that inconsistency is recognized as an omnipresent and everlasting fact of our environments. Including computing environments.

The phrase, “…establish proper metrics for inconsistency,…” betrays a world view that we can stand outside of our inconsistencies and those of others.

For all the useful work that will appear in this volume (and others like it), there is no place to stand outside of our environments and their inconsistencies.

Important Dates
Submission deadline: May 20, 2011
Review result notification: July 20, 2011
Revision due: August 20, 2011
Final version due: August 31, 2011
Tentative date of publication: September, 2011 (Vol.5, No.3)

NCIBI – National Center for Integrative Biomedical Informatics

Wednesday, January 19th, 2011

NCIBI – National Center for Integrative Biomedical Informatics

From the website:

The National Center for Integrative Biomedical Informatics (NCIBI) is one of seven National Centers for Biomedical Computing (NCBC) within the NIH Roadmap. The NCBC program is focused on building a universal computing infrastructure designed to speed progress in biomedical research. NCIBI was founded in September 2005 and is based at the University of Michigan as part of the Center for Computational Medicine and Bioinformatics (CCMB).

Note the use of integrative in the name of the center.

They “get” that part.

They are in fact working on mappings to support integration of data even as I write these lines.

There is a lot to be learned about their strategies for integration and to better understand the integration issues they face in this domain. This site is a good starting place to do both.

KNIME Version 2.3.0 released – News

Saturday, December 18th, 2010

KNIME Version 2.3.0 released

From the announcement:

The new version greatly enhances the usability of KNIME. It adds new features like workflow annotations, support for hotkeys, inclusion of R-views in reports, data flow switches, option to hide node labels, variable support in the database reader/connector and R-nodes, and the ability to export KNIME workflows as SVG Graphics.

With the 2.3 release we are also introducing a community node repository, which includes KNIME extensions for bio- and chemoinformatics and an advanced R-scripting environment.

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

Friday, December 17th, 2010

CFP – Dealing with the Messiness of the Web of Data – Journal of Web Semantics

From the call:

Research on the Semantic Web, which is now in its second decade, has had a tremendous success in encouraging people to publish data on the Web in structured, linked, and standardized ways. The success of what has now become the Web of Data can be read from the sheer number of triples available within the Linked-Open Data, Linked Life Data and Open-Government initiatives. However, this growth in data makes many of the established assumptions inappropriate and offers a number of new research challenges.

In stark contrast to early Semantic Web applications that dealt with small, hand-crafted ontologies and data-sets, the new Web of Data comes with a plethora of contradicting world-views and contains incomplete, inconsistent, incorrect, fast-changing and opinionated information. This information not only comes from academic sources and trustworthy institutions, but is often community built, scraped or translated.

In short: the Web of Data is messy, and methods to deal with this messiness are paramount for its future.

Now, we have two choices as the topic map community:

  • congratulate ourselves for seeing this problem long ago, high five each other, etc., or
  • step up and offer topic map solutions that incorporate as much of the existing SW work as possible.

I strongly suggest the second one.

Important dates:

We will aim at an efficient publication cycle in order to guarantee prompt availability of the published results. We will review papers on a rolling basis as they are submitted and explicitly encourage submissions well before the submission deadline. Submit papers online at the journal’s Elsevier Web site.

Submission deadline: 1 February 2011
Author notification: 15 June 2011
Revisions submitted: 1 August 2011
Final decisions: 15 September 2011
Publication: 1 January 2012

Rule Synthesizing from Multiple Related Databases

Tuesday, November 9th, 2010

Rule Synthesizing from Multiple Related Databases Authors: Dan He, Xindong Wu, Xingquan Zhu Keywords: Association rule mining, rule synthesizing, multiple databases, clustering

In this paper, we study the problem of rule synthesizing from multiple related databases where items representing the databases may be different, and the databases may not be relevant, or similar to each other. We argue that, for such multi-related databases, simple rule synthesizing without a detailed understanding of the databases is not able to reveal meaningful patterns inside the data collections. Consequently, we propose a two-step clustering on the databases at both item and rule levels such that the databases in the final clusters contain both similar items and similar rules. A weighted rule synthesizing method is then applied on each such cluster to generate final rules. Experimental results demonstrate that the new rule synthesizing method is able to discover important rules which can not be synthesized by other methods.

The authors observe:

…existing rule synthesizing methods for distributed mining commonly assume that related databases are relevant, share similar data distributions, and have identical items. This is equivalent to the assumption that all stores have the same type of business with identical meta-data structures, which is hardly the case in practice.

I should start collecting quotes that recognize semantic diversity as the rule rather than the exception.

More on that later. Enjoy the article.

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources

Tuesday, November 2nd, 2010

A Prototype of Multimedia Metadata Management System for Supporting the Integration of Heterogeneous Sources Authors: Tie Hua Zhou, Byeong Mun Heo, Ling Wang, Yang Koo Lee, Duck Jin Chai and Keun Ho Ryu Keywords: Multimedia Metadata Management Systems, Metadata, MPEG-7, TV-Anytime

Abstract:

With the advances in information technology, the amount of multimedia metadata captured, produced, and stored is increasing rapidly. As a consequence, multimedia content is widely used for many applications in today’s world, and hence, a need for organizing multimedia metadata and accessing it from repositories with vast amount of information has been a driving stimulus both commercially and academically. MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex multimedia types. Meanwhile, other metadata or description schemes, such as Dublin Core, XML, TV-Anytime etc., are becoming popular in different application domains. In this paper, we present a new prototype Multimedia Metadata Management System. Our system is good at sharing the integration of multimedia metadata from heterogeneous sources. This system enables the collection, analysis and integration of multimedia metadata semantic description from some different kinds of services. (UCC, IPTV, VOD and Digital TV et al.)

The details for the “Metadata Analyzer” and “Metadata Mapping” seem to be a bit sparse (as in non-existent) for a “prototype…supporting integration of heterogeneous sources.”

MPEG-7 has an important role to play in this area and topic mappers should be aware of it.

I will try to locate more useful resources on MPEG-7 and multimedia content.

OpenII

Sunday, October 31st, 2010

OpenII

From the website:

OpenII is a collaborative effort spearheaded by The MITRE Corporation and Google to create a suite of open-source tools for information integration. The project is leveraging the latest developments in research on information integration to create a platform on which integration applications can be built and further research can be conducted.

The motivation for OpenII is that although a significant amount of research has been conducted on information integration, and several commercial systems have been deployed, many information integration applications are still hard to build. In research, we often innovate on a specific aspect of information integration, but then spend much of our time building (and rebuilding) other components that we need in order to validate our contributions. As a result, the research prototypes that have been built are generally not reusable and do not inter-operate with each other. On the applications side, information integration comes in many flavors, and therefore it is hard for commercial products to serve all the needs. Our goal is to create tools that can be applied in a variety of architectural contexts and can easily be tailored to the needs of particular domains.

OpenII tools include, among others, wrappers for common data sources, tools for creating matches and mappings between disparate schemas, a tool for searching collections of schemas and extending schemas, and run-time tools for processing queries over heterogeneous data sources.

The M3 metamodel:

The fundamental building block in M3 is the entity. An entity represents information about a set of related real-world objects. Associated with each entity is a set of attributes that indicate what information is captured about each entity. For simplicity, we assume that at most one value can be associated with each attribute of an entity.

The project could benefit from a strong injection of subject identity based thinking and design.
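To make the quoted M3 description concrete, here is a minimal sketch of an entity with a set of attributes and at most one value per attribute. The class and method names (Entity, Record, set) are hypothetical and are not taken from the OpenII code base; I am only illustrating the three sentences quoted above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for an M3 entity: a name plus its attributes."""
    name: str
    attributes: FrozenSet[str]


@dataclass
class Record:
    """Information captured about one real-world object of an entity."""
    entity: Entity
    values: Dict[str, Any] = field(default_factory=dict)

    def set(self, attribute: str, value: Any) -> None:
        # Only attributes declared on the entity may carry a value.
        if attribute not in self.entity.attributes:
            raise KeyError(f"{attribute!r} is not an attribute of {self.entity.name}")
        # Enforce the "at most one value per attribute" simplification.
        if attribute in self.values:
            raise ValueError(f"{attribute!r} already has a value")
        self.values[attribute] = value


person = Entity("Person", frozenset({"name", "email"}))
alice = Record(person)
alice.set("name", "Alice")
```

Note what the sketch leaves out: nothing says when two records describe the same subject, which is where subject identity would come in.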

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

Sunday, October 17th, 2010

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

Being organized by Phillip C-Y Sheu (UC Irvine), psheu@uci.edu, Phone: +1 949 824 2660. Volunteers are needed for both organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining

Monday, October 11th, 2010

Satrap: Data and Network Heterogeneity Aware P2P Data-Mining Authors: Hock Hee Ang, Vivekanand Gopalkrishnan, Anwitaman Datta, Wee Keong Ng, Steven C. H. Hoi Keywords: Distributed classification, P2P network, cascade SVM

Abstract:

Distributed classification aims to build an accurate classifier by learning from distributed data while reducing computation and communication cost. A P2P network where numerous users come together to share resources like data content, bandwidth, storage space and CPU resources is an excellent platform for distributed classification. However, two important aspects of the learning environment have often been overlooked by other works, viz., 1) location of the peers which results in variable communication cost and 2) heterogeneity of the peers’ data which can help reduce redundant communication. In this paper, we examine the properties of network and data heterogeneity and propose a simple yet efficient P2P classification approach that minimizes expensive inter-region communication while achieving good generalization performance. Experimental results demonstrate the feasibility and effectiveness of the proposed solution.

Among the other claims for Satrap:

  • achieves the best accuracy-to-communication cost ratio given that data exchange is performed to improve global accuracy.
  • allows users to control the trade-off between accuracy and communication cost with the user-specified parameters.

I find these two the most interesting.

In part because semantic integration, whether explicit or not, is always a question of cost ratio and tradeoffs.

It would be refreshing to see papers that say which semantic integrations would be too costly with method X or are not possible with method Y.

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction

Sunday, October 10th, 2010

Distributed Knowledge Discovery with Non Linear Dimensionality Reduction Authors: Panagis Magdalinos, Michalis Vazirgiannis, Dialecti Valsamou Keywords: distributed non linear dimensionality reduction, NLDR, distributed dimensionality reduction, DDR, distributed data mining, DDM, dimensionality reduction, DR, Distributed Isomap, D-Isomap, C-Isomap, L-Isomap

Abstract:

Data mining tasks results are usually improved by reducing the dimensionality of data. This improvement however is achieved harder in the case that data lay on a non linear manifold and are distributed across network nodes. Although numerous algorithms for distributed dimensionality reduction have been proposed, all assume that data reside in a linear space. In order to address the non-linear case, we introduce D-Isomap, a novel distributed non linear dimensionality reduction algorithm, particularly applicable in large scale, structured peer-to-peer networks. Apart from unfolding a non linear manifold, our algorithm is capable of approximate reconstruction of the global dataset at peer level a very attractive feature for distributed data mining problems. We extensively evaluate its performance through experiments on both artificial and real world datasets. The obtained results show the suitability and viability of our approach for knowledge discovery in distributed environments.

Data mining in peer-to-peer networks will face topic map authors sooner or later.

Not only is this a useful discussion of the issues, but the authors have also posted the source code and data sets used in the article:

http://www.db-net.aueb.gr/panagis/PAKDD2010/

Evolutionary Clustering and Analysis of Heterogeneous Information Networks

Saturday, October 9th, 2010

Evolutionary Clustering and Analysis of Heterogeneous Information Networks Authors: Manish Gupta; Charu Aggarwal; Jiawei Han; Yizhou Sun Keywords: ENetClus, evolutionary clustering, typed-clustering, DBLP, bibliographic networks

Abstract:

In this paper, we study the problem of evolutionary clustering of multi-typed objects in a heterogeneous bibliographic network. The traditional methods of homogeneous clustering methods do not result in a good typed-clustering. The design of heterogeneous methods for clustering can help us better understand the evolution of each of the types apart from the evolution of the network as a whole. In fact, the problem of clustering and evolution diagnosis are closely related because of the ability of the clustering process to summarize the network and provide insights into the changes in the objects over time. We present such a tightly integrated method for clustering and evolution diagnosis of heterogeneous bibliographic information networks. We present an algorithm, ENetClus, which performs such an agglomerative evolutionary clustering which is able to show variations in the clusters over time with a temporal smoothness approach. Previous work on clustering networks is either based on homogeneous graphs with evolution, or it does not account for evolution in the process of clustering heterogeneous networks. This paper provides the first framework for evolution-sensitive clustering and diagnosis of heterogeneous information networks. The ENetClus algorithm generates consistent typed-clusterings across time, which can be used for further evolution diagnosis and insights. The framework of the algorithm is specifically designed in order to facilitate insights about the evolution process. We use this technique in order to provide novel insights about bibliographic information networks.

Exploring heterogeneous information networks is a first step towards discovery/recognition of new subjects. Only future research can say what other novel insights will emerge from work on heterogeneous information networks.

Designing a thesaurus-based comparison search interface for linked cultural heritage sources

Sunday, October 3rd, 2010

Designing a thesaurus-based comparison search interface for linked cultural heritage sources Authors: Alia Amin, Michiel Hildebrand, Jacco van Ossenbruggen, Lynda Hardman Keywords: comparison search, thesauri, cultural heritage

Prototype: LISA, e-culture.multimedian.nl

Abstract:

Comparison search is an information seeking task where a user examines individual items or sets of items for similarities and differences. While this is a known information need among experts and knowledge workers, appropriate tools are not available. In this paper, we discuss comparison search in the cultural heritage domain, a domain characterized by large, rich and heterogeneous data sets, where different organizations deploy different schemata and terminologies to describe their artifacts. This diversity makes meaningful comparison difficult. We developed a thesaurus-based comparison search application called LISA, a tool that allows a user to search, select and compare sets of artifacts. Different visualizations allow users to use different comparison strategies to cope with the underlying heterogeneous data and the complexity of the search tasks. We conducted two user studies. A preliminary study identifies the problems experts face while performing comparison search tasks. A second user study examines the effectiveness of LISA in helping to solve comparison search tasks. The main contribution of this paper is to establish design guidelines for the data and interface of a comparison search application. Moreover, we offer insights into when thesauri and metadata are appropriate for use in such applications.

User-centric project that develops an interface into heterogeneous data sets.

What I would characterize as pre-mapping, that is, no “canonical” mapping has yet been established.

Perhaps a good idea to preserve a pre-mapping stage as any mapping represents but one choice among many.

Entity Resolution – Journal of Data and Information Quality

Thursday, September 30th, 2010

Special Issue on Entity Resolution.

The Journal of Data and Information Quality is a new journal from the ACM.

Calls for papers should not require ACM accounts for viewing.

I have re-ordered (to put the important stuff first) and reproduced the call below:

Important Dates

  • Submissions due: December 15, 2010
  • Acceptance Notification: April 30, 2011
  • Final Paper Due: June 30, 2011
  • Target Date for Special Issue: September 2011

Resources for authors include:

Topics of interest include, but are not limited to:

  • ER impacts on Information Quality and impacts of Information Quality on ER
  • ER frameworks and architectures
  • ER outcome/performance assessment and metrics
  • ER in special application domains and contexts
  • ER and high-performance computing (HPC)
  • ER education
  • ER case studies
  • Theoretical frameworks for ER and entity-based integration
  • Methods and techniques for
    • Entity reference extraction
    • Entity reference resolution
    • Entity identity management and identity resolution
    • Entity relationship analysis

Entity resolution (ER) is a key process for improving data quality in data integration in modern information systems. ER covers a wide range of approaches to entity-based integration, known variously as merge/purge, record de-duplication, heterogeneous join, identity resolution, and customer recognition. More broadly, ER also includes a number of important pre- and post-integration activities, such as entity reference extraction and entity relationship analysis. Based on direct record matching strategies, such as those described by the Fellegi-Sunter Model, new theoretical frameworks are evolving to describe ER processes and outcomes that include other types of inferred and asserted reference linking techniques. Businesses have long recognized that the quality of their ER processes directly impacts the overall value of their information assets and the quality of the information products they produce. Government agencies and departments, including law enforcement and the intelligence community, are increasing their use of ER as a tool for accomplishing their missions as well. Recognizing the growing interest in ER theory and practice, and its impact on information quality in organizations, the ACM Journal of Data and Information Quality (JDIQ) will devote a special issue to innovative and high-quality research papers in this area. Papers that address any aspect of entity resolution are welcome.
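Since the call mentions the Fellegi-Sunter Model, here is a rough sketch of the record-matching idea behind it: each field contributes a log-likelihood weight depending on whether two records agree on it, and the weights are summed. The field names and the m/u probabilities below are made-up illustrative values, not taken from the call or from any dataset.

```python
import math
from typing import Dict, Tuple

# Hypothetical parameters per field:
#   m = P(fields agree | records refer to the same entity)
#   u = P(fields agree | records refer to different entities)
PARAMS: Dict[str, Tuple[float, float]] = {
    "surname": (0.95, 0.01),
    "zip": (0.90, 0.05),
}


def match_weight(a: Dict[str, str], b: Dict[str, str],
                 params: Dict[str, Tuple[float, float]] = PARAMS) -> float:
    """Sum of per-field agreement/disagreement weights for two records."""
    total = 0.0
    for name, (m, u) in params.items():
        if a.get(name) == b.get(name):
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


r1 = {"surname": "Taylor", "zip": "99352"}
r2 = {"surname": "Taylor", "zip": "99352"}
r3 = {"surname": "Gourley", "zip": "20001"}
print(match_weight(r1, r2) > 0, match_weight(r1, r3) < 0)  # prints True True
```

In practice the summed weight is compared against upper and lower thresholds to declare a match, a non-match, or a case for manual review; choosing those thresholds and estimating m and u are outside this sketch.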