A Node Indexing Scheme for Web Entity Retrieval
Author(s): Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello
Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats
Abstract:
Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.
Consider the requirements for this project:
- Support for the multiple formats which are used on the Web of Data;
- Support for searching an entity description given its characteristics (entity centric search);
- Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;
- Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance.
(emphasis added)
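To make the requirements above a little more concrete, here is a toy sketch of entity-centric search with context (provenance), full-text matching on an attribute, incremental additions and top-k results. It is only an illustration under my own assumptions: the class `ToyEntityIndex`, its methods, and the sample identifiers are hypothetical, and this is a plain in-memory inverted index, not SIREn's node indexing scheme or its API.

```python
import heapq
from collections import defaultdict


class ToyEntityIndex:
    """Toy in-memory inverted index (hypothetical; NOT SIREn's implementation)."""

    def __init__(self):
        # (attribute, term) -> set of (context, entity) pairs
        self.postings = defaultdict(set)

    def add(self, context, entity, attribute, value):
        """Incrementally index one attribute value of an entity seen in a context."""
        for term in value.lower().split():
            self.postings[(attribute, term)].add((context, entity))

    def search(self, attribute, text, k=3):
        """Return up to k ((context, entity), score) pairs, best match first."""
        scores = defaultdict(int)
        for term in set(text.lower().split()):
            for key in self.postings.get((attribute, term), ()):
                scores[key] += 1  # score = number of distinct query terms matched
        # Highest score first; ties broken by key so the order is deterministic.
        return heapq.nsmallest(k, scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = ToyEntityIndex()
index.add("example.org", "ex:renaud", "name", "Renaud Delbru")
index.add("example.org", "ex:giovanni", "name", "Giovanni Tummarello")
index.add("other.example", "ex:renaud2", "name", "Renaud")

results = index.search("name", "renaud delbru", k=2)
print(results)
```

Each hit carries its context alongside the entity identifier, so the same name found on two sites stays distinguishable. A real system would also need compression, sharding and a proper ranking function, none of which this sketch attempts.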
SIREn { Semantic Information Retrieval Engine }
Definitely a package to download, install and start to evaluate. More comments forthcoming.
Questions (more for topic map researchers)
- To what extent can “entity description” = properties of topics, associations, occurrences?
- Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
- To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
- Reports on use of SIREn by topic mappers?