Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 10, 2010

Efficient Spectral Neighborhood Blocking for Entity Resolution

Filed under: Entity Resolution — Patrick Durusau @ 6:54 am

Efficient Spectral Neighborhood Blocking for Entity Resolution
Authors: Liangcai Shu, Aiyou Chen, Ming Xiong, Weiyi Meng

Abstract:

In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services, customers in supply chain management, and users in social networks when there lacks a unique identifier across multiple data sources to represent a real-world entity. Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world. We investigate the entity resolution problem for large data sets where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, namely SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering such that real entities can be identified accurately by neighborhood records in the tree. There are two major novel aspects in our approach: 1) We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves the scalability of the standard spectral clustering algorithm; 2) We utilize a stopping criterion specified by Newman-Girvan modularity in the bipartition process. Our experimental results with both synthetic and real-world data demonstrate that SPAN is robust and outperforms other blocking algorithms in terms of accuracy while it is efficient and scalable to deal with large data sets.

Entity resolution is to identify and discover objects in the data sets that refer to the same entity in the real world.

Modulo my usual qualms about "real world" language, this sounds useful for the construction of topic maps.
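To make the blocking idea concrete, here is a minimal sketch of recursive spectral bipartitioning over q-gram record vectors. It is only an illustration of the general technique, not the authors' SPAN algorithm: it builds the pairwise similarity matrix explicitly (exactly what the paper avoids) and stops on a simple block-size cutoff rather than the Newman-Girvan modularity criterion; the q-gram featurization and the size threshold are my own assumptions.

```python
# Illustrative sketch only -- not the authors' SPAN implementation.
import numpy as np

def qgram_vectors(records, q=2):
    """Represent each record as a q-gram count vector (toy featurization)."""
    grams = sorted({r[i:i + q] for r in records for i in range(len(r) - q + 1)})
    index = {g: j for j, g in enumerate(grams)}
    X = np.zeros((len(records), len(grams)))
    for i, r in enumerate(records):
        for k in range(len(r) - q + 1):
            X[i, index[r[k:k + q]]] += 1
    return X

def spectral_bipartition(X):
    """Split rows of X into two groups using the Fiedler vector of the
    similarity graph's Laplacian (similarities computed explicitly here)."""
    S = X @ X.T
    L = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]              # eigenvector of 2nd-smallest eigenvalue
    return fiedler >= np.median(fiedler)

def block(records, max_block=4):
    """Recursively bipartition until blocks are small; return index blocks."""
    X = qgram_vectors(records)

    def rec(ids):
        if len(ids) <= max_block:
            return [ids.tolist()]
        mask = spectral_bipartition(X[ids])
        if mask.all() or (~mask).all():   # cannot split further
            return [ids.tolist()]
        return rec(ids[mask]) + rec(ids[~mask])

    return rec(np.arange(len(records)))

if __name__ == "__main__":
    recs = ["john smith", "jon smith", "jane doe", "jane d0e",
            "acme corp", "acme corporation"]
    print(block(recs))   # small candidate blocks of record indices
```

Each returned block is a small candidate set; a full entity resolution pipeline would then run detailed pairwise comparison only within blocks.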

Questions:

  1. How would you suggest integrating this methodology into a topic map construction process? (3-5 pages, no citations)
  2. How would you suggest integrating rule based entity identification into this methodology? (3-5 pages, no citations)
  3. Is precision of identification an operational requirement? (3-5 pages, no citations)

November 25, 2010

A Node Indexing Scheme for Web Entity Retrieval

Filed under: Entity Resolution,Full-Text Search,Indexing,Lucene,RDF,Topic Maps — Patrick Durusau @ 6:15 am

A Node Indexing Scheme for Web Entity Retrieval
Authors: Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello
Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats

Abstract:

Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.

Consider the requirements for this project:

  1. Support for the multiple formats which are used on the Web of Data;
  2. Support for searching an entity description given its characteristics (entity centric search);
  3. Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;
  4. Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance.
(emphasis added)

SIREn (Semantic Information Retrieval Engine)

Definitely a package to download, install and start to evaluate. More comments forthcoming.
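To give a feel for what a node indexing scheme buys over plain document-level postings, here is a toy sketch that indexes each term together with its position inside an entity description (entity, attribute, value). It is not SIREn's actual data structure or API; the class and method names are invented for illustration, and a real system would add compression, top-k scoring and incremental updates.

```python
# Toy node-level inverted index, loosely inspired by the node indexing idea.
from collections import defaultdict

class NodeIndex:
    def __init__(self):
        # term -> set of (entity_id, attribute, value_position)
        self.postings = defaultdict(set)

    def add_entity(self, entity_id, description):
        """description: mapping attribute -> list of string values."""
        for attribute, values in description.items():
            for value_pos, value in enumerate(values):
                for term in value.lower().split():
                    self.postings[term].add((entity_id, attribute, value_pos))

    def search(self, term, attribute=None):
        """Full-text lookup, optionally restricted to one attribute."""
        hits = self.postings.get(term.lower(), set())
        if attribute is not None:
            hits = {h for h in hits if h[1] == attribute}
        return sorted({entity for entity, _, _ in hits})

if __name__ == "__main__":
    idx = NodeIndex()
    idx.add_entity("e1", {"name": ["Renaud Delbru"], "affiliation": ["DERI Galway"]})
    idx.add_entity("e2", {"name": ["Giovanni Tummarello"], "affiliation": ["DERI Galway"]})
    print(idx.search("galway"))                    # ['e1', 'e2']
    print(idx.search("renaud", attribute="name"))  # ['e1']
```

Because every posting carries the attribute, the same index answers both keyword queries and simple semi-structural ones (term restricted to an attribute), which is the compromise the paper argues for.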

Questions (more for topic map researchers)

  1. To what extent can “entity description” = properties of topics, associations, occurrences?
  2. Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
  3. To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
  4. Reports on use of SIREn by topic mappers?

October 17, 2010

IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

The IEEE Computer Society Technical Committee on Semantic Computing (TCSEM)

addresses the derivation and matching of the semantics of computational content to that of naturally expressed user intentions in order to retrieve, manage, manipulate or even create content, where “content” may be anything including video, audio, text, software, hardware, network, process, etc.

The committee is being organized by Phillip C-Y Sheu (UC Irvine), psheu@uci.edu, phone: +1 949 824 2660. Volunteers are needed for both organizational and technical committees.

This is a good way to meet people, make a positive contribution, and have a lot of fun.

October 1, 2010

Tell me more, not just “more of the same”

Tell me more, not just “more of the same”
Authors: Francisco Iacobelli, Larry Birnbaum, Kristian J. Hammond
Keywords: dimensions of similarity, information retrieval, new information detection

Abstract:

The Web makes it possible for news readers to learn more about virtually any story that interests them. Media outlets and search engines typically augment their information with links to similar stories. It is up to the user to determine what new information is added by them, if any. In this paper we present Tell Me More, a system that performs this task automatically: given a seed news story, it mines the web for similar stories reported by different sources and selects snippets of text from those stories which offer new information beyond the seed story. New content may be classified as supplying: additional quotes, additional actors, additional figures and additional information depending on the criteria used to select it. In this paper we describe how the system identifies new and informative content with respect to a news story. We also show that providing an explicit categorization of new information is more useful than a binary classification (new/not-new). Lastly, we show encouraging results from a preliminary evaluation of the system that validates our approach and encourages further study.

If you are interested in the automatic extraction, classification and delivery of information, this article is for you.
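As a rough illustration of the "new information" idea, the sketch below scores candidate snippets by how much vocabulary they add beyond the seed story. This is a crude bag-of-words proxy, not the authors' Tell Me More system (which classifies the new content as quotes, actors, figures, and so on); the 0.4 threshold is an arbitrary assumption.

```python
# Crude novelty scoring of snippets relative to a seed story (illustration only).
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def novelty(snippet, seed_text):
    """Fraction of a snippet's terms that do not appear in the seed story."""
    snip, seed = tokens(snippet), tokens(seed_text)
    return len(snip - seed) / len(snip) if snip else 0.0

def tell_me_more(seed_text, snippets, threshold=0.4):
    """Return snippets ranked by novelty, keeping only sufficiently new ones."""
    scored = [(novelty(s, seed_text), s) for s in snippets]
    return [s for score, s in sorted(scored, reverse=True) if score >= threshold]

if __name__ == "__main__":
    seed = "The city council approved the new budget on Monday."
    candidates = [
        "The council approved the budget on Monday.",                # mostly repeated
        "Mayor Lopez said the budget adds $2 million for schools.",  # new actor and figure
    ]
    print(tell_me_more(seed, candidates))
```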

I think there are (at least) two interesting ways for “Tell Me More” to develop:

First, persisting the results of entity recognition, along with other data (story, author, date, etc.), in the form of associations (with appropriate roles, etc.).

Second, and perhaps more importantly, enabling users to add or correct information presented as part of a mapping of information about particular entities.
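A sketch of the first suggestion: a recognized entity persisted as an association between the entity topic and the story topic, with provenance carried alongside. The association type, role names and topic identifiers below are invented for illustration, not drawn from any particular topic map or API.

```python
# Illustrative only: recognized entity stored as a topic-map-style association.
def entity_mention_association(entity_topic, story_topic, author, date):
    return {
        "type": "mentioned-in",                 # hypothetical association type
        "roles": {
            "entity": entity_topic,             # role: the recognized entity
            "story": story_topic,               # role: the story it appears in
        },
        "provenance": {"author": author, "date": date},
    }

if __name__ == "__main__":
    assoc = entity_mention_association(
        entity_topic="topic:mayor-lopez",
        story_topic="topic:budget-story-2010-10-01",
        author="Example Reporter",
        date="2010-10-01",
    )
    print(assoc)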

September 30, 2010

Entity Resolution – Journal of Data and Information Quality

Filed under: Entity Resolution,Heterogeneous Data,Subject Identity — Patrick Durusau @ 5:37 am

Special Issue on Entity Resolution.

The Journal of Data and Information Quality is a new journal from the ACM.

Calls for papers should not require ACM accounts for viewing.

I have re-ordered (to put the important stuff first) and reproduced the call below:

Important Dates

  • Submissions due: December 15, 2010
  • Acceptance Notification: April 30, 2011
  • Final Paper Due: June 30, 2011
  • Target Date for Special Issue: September 2011

Resources for authors include:

Topics of interest include, but are not limited to:

  • ER impacts on Information Quality and impacts of Information Quality on ER
  • ER frameworks and architectures
  • ER outcome/performance assessment and metrics
  • ER in special application domains and contexts
  • ER and high-performance computing (HPC)
  • ER education
  • ER case studies
  • Theoretical frameworks for ER and entity-based integration
  • Methods and techniques for
    • Entity reference extraction
    • Entity reference resolution
    • Entity identity management and identity resolution
    • Entity relationship analysis

Entity resolution (ER) is a key process for improving data quality in data integration in modern information systems. ER covers a wide range of approaches to entity-based integration, known variously as merge/purge, record de-duplication, heterogeneous join, identity resolution, and customer recognition. More broadly, ER also includes a number of important pre- and post-integration activities, such as entity reference extraction and entity relationship analysis. Based on direct record matching strategies, such as those described by the Fellegi-Sunter Model, new theoretical frameworks are evolving to describe ER processes and outcomes that include other types of inferred and asserted reference linking techniques. Businesses have long recognized that the quality of their ER processes directly impacts the overall value of their information assets and the quality of the information products they produce. Government agencies and departments, including law enforcement and the intelligence community, are increasing their use of ER as a tool for accomplishing their missions as well. Recognizing the growing interest in ER theory and practice, and its impact on information quality in organizations, the ACM Journal of Data and Information Quality (JDIQ) will devote a special issue to innovative and high-quality research papers in this area. Papers that address any aspect of entity resolution are welcome.
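Since the call singles out direct record matching in the Fellegi-Sunter tradition, here is a minimal sketch of that scoring scheme: each field comparison contributes a log-likelihood weight, and record pairs above a threshold are declared matches. The per-field m/u probabilities and the threshold below are invented for the example, not estimated from data.

```python
# Minimal Fellegi-Sunter style scoring (illustrative parameters).
import math

# per-field (m, u): P(agree | true match), P(agree | non-match) -- assumed values
FIELD_PARAMS = {"name": (0.95, 0.05), "city": (0.90, 0.20), "zip": (0.85, 0.01)}

def match_weight(rec_a, rec_b):
    """Sum of per-field agreement/disagreement log-likelihood ratios."""
    weight = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a.get(field) == rec_b.get(field):
            weight += math.log(m / u)
        else:
            weight += math.log((1 - m) / (1 - u))
    return weight

def is_match(rec_a, rec_b, threshold=2.0):
    return match_weight(rec_a, rec_b) >= threshold

if __name__ == "__main__":
    a = {"name": "john smith", "city": "dublin", "zip": "d02"}
    b = {"name": "john smith", "city": "dublin", "zip": "d04"}
    print(match_weight(a, b), is_match(a, b))
```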
