Efficient Similarity Query Processing (Previously Efficient Exact Similarity Join)
From the webpage:
Given a similarity function and two sets of objects, a similarity join returns all pairs of objects (from each set respectively) such that their similarity value satisifies a given criterion. A typical example is to find pairs of documents such that their cosine similarity is above a constant threshold (say, 0.95), as they are likely to be near duplicate documents.
In this project, we focus on efficient algorithms to perform the similarity join on (multi-) sets or strings both exactly and approximately. Commonly used similarity functions for (multi-) sets or strings are more complex than Euclidean distance functions. As a result, many previous approaches have to compute the approximate results instead.
We have developed several fast algorithms that address the above performance issue. They work for Jaccard, Dice, Cosine similarities and Edit distance constraints. We have also devised algorithms to compute top-k similar pairs progressively so that a user does not need to specify a similarity threshold for some unknown dataset. Recently, we have obtained several interesting new results on edit similarity search and joins by exploiting asymmetric signature schemes. The resulting IndexChunkTurbo algorithm can process most of the similarity queries efficiently while occupying almost the least amount of index space.
We also investigated the rpbolem of approximate entity matching. Our SIGMOD09 work can extract approximate mentions of entities in a document given an entity dictionary.
The first five (5) papers:
- Xiang Zhao, Chuan Xiao, Xuemin Lin, Wei Wang. Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012.
Summary: An efficient algorithm to compute graphs within certain (graph) edit distance away from a query graph.
- Sunanda Patro, Wei Wang. Learning Top-k Transformation Rules. DEXA 2011.
Summary: A new algorithm to learn transformation rules (e.g., VLDB = Very Large Data Base) from unannotated, potentially noisy data. Compared with previous approaches, our method can find much more valid rules in the top-k output.
- Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near Duplicate Detection. ACM Transaction of Database Systems (TODS).
Summary: This is the journal version of our WWW 2008 paper, with extension to implementing the PPJoin family of algorithms on top of relational database systems.
- Jianbin Qin, Wei Wang, Yifei Lu, Chuan Xiao, Xuemin Lin. Efficient Exact Edit Similarity Query Processing with Asymmetric Signature Schemes. SIGMOD 2011.
Summary: Two simple yet highly efficient algorithms are proposed in this paper that works very well for both edit similarity search and joins. Perhaps equally intereting is a comprehensive experiment involding Flamingo (ver 3), PartEnum, Ed-Join, Bed-tree, Trie-Join, NGPP, VGRAM, NS09.
- Chaokun Wang, Jianmin Wang, Xuemin Lin, Wei Wang, Haixun Wang, Hongsong Li, Wanpeng Tian, Jun Xu, Rui Li. MapDupReducer: Detecting Near Duplicates over Massive Datasets. SIGMOD 2010. PPT
Summary: This work essentially ports the ppjoin+ algorithm to the Map-Reduce framework in order to deal with huge volume of data.
The first paper is saying exactly what your suspect: similarity of arbitrary sub-graphs. Nothing like picking a hard problem is there? Not fully general solution but see what you think of theirs.
Did I mention there is working code at this site as well? Definitely a group to watch in the future.