Archive for the ‘Tries’ Category

TH*:Scalable Distributed Trie Hashing

Thursday, May 3rd, 2012

TH*:Scalable Distributed Trie Hashing by Aridj Mohamed and Zegour Djamel Eddine.

In today’s world of computers, dealing with huge amounts of data is not unusual. The need to distribute this data in order to increase its availability and increase the performance of accessing it is more urgent than ever. For these reasons it is necessary to develop scalable distributed data structures. In this paper we propose a TH* distributed variant of the Trie Hashing data structure. First we propose Thsw new version of TH without node Nil in digital tree (trie), then this version will be adapted to multicomputer environment. The simulation results reveal that TH* is scalable in the sense that it grows gracefully, one bucket at a time, to a large number of servers, also TH* offers a good storage space utilization and high query efficiency special for ordering operations.

I ran across this article today on tries, which dates from 2010 (original publication date).

Can anyone point me to a recent survey of literature on tries?


An Efficient Trie-based Method for Approximate Entity Extraction…

Monday, March 12th, 2012

An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints by Dong Deng, Guoliang Li, and Jianhua Feng. (PDF)


Dictionary-based entity extraction has attracted much attention from the database community recently, which locates substrings in a document into predefined entities (e.g., person names or locations). To improve extraction recall, a recent trend is to provide approximate matching between substrings of the document and entities by tolerating minor errors. In this paper we study dictionary-based approximate entity extraction with edit-distance constraints. Existing methods have several limitations. Firstly, they need to tune many parameters to achieve a high performance. Secondly, they are inefficient for large editdistance thresholds. We propose a trie-based method to address these problems. We partition each entity into a set of segments. We prove that if a substring of the document is similar to an entity, it must contain a segment of the entity. Based on this observation, we first search segments from the document, and then extend the matching segments in both entities and the document to find similar pairs. To facilitate searching segments, we use a trie structure to index segments and develop an efficient trie-based algorithm. We develop an extension-based method to efficiently find similar string pairs by extending the matching segments. We optimize our partition scheme and select the best partition strategy to improve the extraction performance. The experimental results show that our method achieves much higher performance compared with state-of-the-art studies.

Project page with author contact information. Code coming soon.

In case you are wondering about the path for the project including the word “taste:”

To address these problems, we propose a trie-based method for dictionary-based approximate entity extraction with edit distance constraints, called TASTE. TASTE does not need to tune parameters. Moreover TASTE achieves much higher performance, even for large edit-distance thresholds.

Is there a word for a person who creates acronyms? Acronymist perhaps?

Deeply interesting paper on the use of tries for entity extraction. Interesting due to its performance but also because of its approach.

You do remember that tries were what made the original e-version of the OED (Oxford English Dictionary) possible? Extremely responsive on less powerful machines than in your smart phone.

One wonders how starting from a set of entities this approach would fare against the TREC legal archives? But that would be “seeding” the application with entities. It may be that being given queries against a dark corpus isn’t all that realistic.