Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 24, 2013

Hybrid Indexes for Repetitive Datasets

Filed under: Data Structures,Indexing — Patrick Durusau @ 2:09 pm

Hybrid Indexes for Repetitive Datasets by H. Ferrada, T. Gagie, T. Hirvola, and S. J. Puglisi.

Abstract:

Advances in DNA sequencing mean databases of thousands of human genomes will soon be commonplace. In this paper we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we preprocess the text with LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show this also significantly reduces query times.
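To make the abstract concrete, here is a toy sketch of the idea for exact matching only (edit distance zero). It is my own illustration, not the authors' code. The function names, the non-overlapping greedy parse, the choice to keep `max_len` characters on each side of every phrase start, and the use of `str.find` as a stand-in for a "conventional index" are all assumptions made for this post.

```python
SEP = "\x00"  # separator for removed runs; assumed not to occur in the text


def lz77_parse(text):
    """Greedy LZ77-style parse (naive, quadratic or worse).

    Returns (start, source, length) triples. source is None for a literal,
    i.e. the first occurrence of a character. Sources are not allowed to
    overlap the phrase they are copied into (a simplifying assumption).
    """
    phrases, i, n = [], 0, len(text)
    while i < n:
        length, source = 0, None
        while i + length < n:
            # Look for the candidate entirely inside text[0:i].
            s = text.find(text[i:i + length + 1], 0, i)
            if s < 0:
                break
            length, source = length + 1, s
        if length == 0:
            phrases.append((i, None, 1))
            i += 1
        else:
            phrases.append((i, source, length))
            i += length
    return phrases


def filter_text(text, phrases, max_len):
    """Keep characters within max_len of a phrase start; collapse the rest."""
    keep = [False] * len(text)
    for start, _, _ in phrases:
        for p in range(max(0, start - max_len), min(len(text), start + max_len)):
            keep[p] = True
    filtered, origin = [], []  # origin[j] = position in text, or -1 for SEP
    for p, ch in enumerate(text):
        if keep[p]:
            filtered.append(ch)
            origin.append(p)
        elif not filtered or filtered[-1] != SEP:
            filtered.append(SEP)
            origin.append(-1)
    return "".join(filtered), origin


def find_all(phrases, filtered, origin, pattern):
    """All start positions of pattern in the original text.

    Requires 0 < len(pattern) <= the max_len used to build the filtered text.
    """
    found = set()
    j = filtered.find(pattern)  # stand-in for a real index over the filtered text
    while j >= 0:
        found.add(origin[j])
        j = filtered.find(pattern, j + 1)
    m = len(pattern)
    # Walk copied phrases left to right: any match lying wholly inside a
    # phrase's source reappears at the same offset inside the phrase.
    for start, source, length in phrases:
        if source is None:
            continue
        for o in list(found):
            if source <= o and o + m <= source + length:
                found.add(start + o - source)
    return sorted(found)
```

On a highly repetitive string the filtered text is much shorter than the original, because long copied phrases contribute only the few characters near their boundaries. A real implementation would replace the linear scans with proper index structures; this sketch only shows why matches in the filtered text plus the parse are enough to recover every match.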

Need another repetitive data set?

Have you thought about topic maps?

If there is to be any merging in a topic map, there must be multiple topics that represent the same subject.

This technique may be overkill for topic maps with little merging, but for the endless repetition found in linked data versions of Wikipedia data, it would be quite useful.

Doing so might knock down the “Some-Smallish-Number” of triples count, and so would be disfavored.

On the other hand, there are other data sets with massive replication (think phone records) where fast querying could be an advantage.
