Archive for the ‘String Matching’ Category

Suffix Trees and their Applications in String Algorithms

Wednesday, November 19th, 2014

Suffix Trees and their Applications in String Algorithms by Roberto Grossi and Giuseppe F. Italiano.

Abstract:

The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others.

In this paper, we survey some applications of suffix trees and some algorithmic techniques for their construction. Special emphasis is given to the most recent developments in this area, such as parallel algorithms for suffix tree construction and generalizations of suffix trees to higher dimensions, which are important in multidimensional pattern matching.
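
For orientation, here is a deliberately naive suffix trie in Python. A real suffix tree compacts unary chains and can be built in linear time (McCreight's or Ukkonen's algorithm); this sketch only conveys the core idea that every substring of a text is a prefix of some suffix:

```python
# Naive suffix trie: insert every suffix of the text, O(n^2) time/space.
# A true suffix tree compacts unary chains and is built in O(n) time.
def build_suffix_trie(text):
    text += "$"                      # unique terminator
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:          # insert suffix text[i:]
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Substring query: walk the trie along the pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))  # True False
```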

The authors point out that “suffix tree” is only one of the names for this subject:

The importance of the suffix tree is underlined by the fact that it has been rediscovered many times in the scientific literature, disguised under different names, and that it has been studied under numerous variations. Just to mention a few appearances of the suffix tree, we cite the compacted bi-tree [101], the prefix tree [24], the PAT tree [50], the position tree [3, 65, 75], the repetition finder [82], and the subword tree [8, 24]….

That is an advantage if you are researching another survey paper and tracing every thread on suffix trees by whatever name; it is not much of an advantage if you miss this paper, or an application, because it appears under a name other than “suffix tree.”

Of course, a search with:

“suffix tree” OR “compacted bi-tree” OR “prefix tree” OR “PAT tree” OR “position tree” OR “repetition finder” OR “subword tree”

that returns about 1,830,000 results (in 0.22 seconds), isn’t very helpful.

In part because no one is going to examine 1,830,000 results, and because that query covers only a subset of the names used for suffix trees.

I think we can do better than that and without an unreasonable expenditure of resources. (See: Less Than Universal & Uniform Indexing)

Taxonomies and Toolkits of Regular Language Algorithms

Wednesday, September 10th, 2014

Taxonomies and Toolkits of Regular Language Algorithms by Bruce William Watson.

From 1.1 Problem Statement:

A number of fundamental computing science problems have been extensively studied since the 1950s and the 1960s. As these problems were studied, numerous solutions (in the form of algorithms) were developed over the years. Although new algorithms still appear from time to time, each of these fields can be considered mature. In the solutions to many of the well-studied computing science problems, we can identify three deficiencies:

  1. Algorithms solving the same problem are difficult to compare to one another. This is usually due to the use of different programming languages, styles of presentation, or simply the addition of unnecessary details.
  2. Collections of implementations of algorithms solving a problem are difficult, if not impossible, to find. Some of the algorithms are presented in a relatively obsolete manner, either using old notations or programming languages for which no compilers exist, making it difficult to either implement the algorithm or find an existing implementation.
  3. Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of the older algorithms, makes it difficult to determine the running time characteristics of the algorithms. A software engineer selecting one of the algorithms will usually do so on the basis of the algorithm’s theoretical running time, or simply by guessing.

In this dissertation, a solution to each of the three deficiencies is presented for each of the following three fundamental computing science problems:

  1. Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
  2. Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
  3. Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.

We do not necessarily consider all the known algorithms solving the problems. For example, we restrict ourselves to batch-style algorithms, as opposed to incremental algorithms.
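
As a baseline for problem 1, here is a naive sketch; Watson's taxonomy covers far better algorithms, such as Aho-Corasick, which find all matches in a single pass over the text:

```python
# Naive solution to problem 1: report every (position, keyword) pair.
# Real keyword matchers (e.g. Aho-Corasick) do this in one pass;
# this is only a correctness baseline.
def keyword_matches(keywords, text):
    hits = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:           # str.find reports each occurrence
            hits.append((start, kw))
            start = text.find(kw, start + 1)
    return sorted(hits)

print(keyword_matches({"he", "she", "his"}, "she sells his shells"))
# [(0, 'she'), (1, 'he'), (10, 'his'), (14, 'she'), (15, 'he')]
```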

It requires updating given its age (1995), but the work merits mention.

I first saw this in a tweet by silentbicycle.srec.

Jellyfish

Saturday, September 6th, 2014

Jellyfish by James Turk and Michael Stephens.

From the webpage:

Jellyfish is a python library for doing approximate and phonetic matching of strings.

String comparison:

  • Levenshtein Distance
  • Damerau-Levenshtein Distance
  • Jaro Distance
  • Jaro-Winkler Distance
  • Match Rating Approach Comparison
  • Hamming Distance

Phonetic encoding:

  • American Soundex
  • Metaphone
  • NYSIIS (New York State Identification and Intelligence System)
  • Match Rating Codex
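
Of the comparison measures listed, Levenshtein distance is the easiest to sketch from scratch; jellyfish ships a tuned implementation of the same classic dynamic program:

```python
# Levenshtein distance via the textbook dynamic program, keeping only
# one row of the DP table at a time for O(min space).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```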

You might want to consider the string matching offered by Duke (written on top of Lucene):

String comparators

  • Levenshtein
  • WeightedLevenshtein
  • JaroWinkler
  • QGramComparator

Simple comparators

  • ExactComparator
  • DifferentComparator

Specialized comparators

  • GeopositionComparator
  • NumericComparator
  • PersonNameComparator

Phonetic comparators

  • SoundexComparator
  • MetaphoneComparator
  • NorphoneComparator

Token set comparators

  • DiceCoefficientComparator
  • JaccardIndexComparator

Enjoy!

Tablib: Pythonic Tabular Datasets

Friday, May 30th, 2014

Tablib: Pythonic Tabular Datasets by Kenneth Reitz and Bessie Monke.

From the post:

Tablib is an MIT Licensed format-agnostic tabular dataset library, written in Python. It allows you to import, export, and manipulate tabular data sets. Advanced features include, segregation, dynamic columns, tags & filtering, and seamless format import & export.
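
The workflow the post describes can be mimicked with the standard library alone. This toy Dataset is only a sketch of the idea of one in-memory table with several export formats; Tablib's real API (tablib.Dataset and friends) is richer:

```python
import csv
import io
import json

# Toy "format-agnostic tabular dataset": one in-memory table,
# multiple export formats. Only a sketch; not Tablib's actual API.
class Dataset:
    def __init__(self, headers):
        self.headers = list(headers)
        self.rows = []

    def append(self, row):
        self.rows.append(list(row))

    def export_json(self):
        return json.dumps([dict(zip(self.headers, r)) for r in self.rows])

    def export_csv(self):
        buf = io.StringIO()
        w = csv.writer(buf)
        w.writerow(self.headers)
        w.writerows(self.rows)
        return buf.getvalue()

d = Dataset(["name", "lang"])
d.append(["tablib", "Python"])
print(d.export_json())  # [{"name": "tablib", "lang": "Python"}]
```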

Definitely an add to your Python keychain USB drive.

I first saw this in a tweet by Gregory Piatetsky.

(String/text processing)++:…

Thursday, May 15th, 2014

(String/text processing)++: stringi 0.2-3 released by Marek Gągolewski.

From the post:

A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds).

stringi is a package providing (but definitely not limited to) replacements for nearly all the character string processing functions known from base R. While developing the package we had high performance and portability of its facilities in mind.

Here is a very general list of the most important features available in the current version of stringi:

  • string searching:
    • with ICU (Java-like) regular expressions,
    • ICU USearch-based locale-aware string searching (quite slow, but working properly e.g. for non-Unicode normalized strings),
    • very fast, locale-independent byte-wise pattern matching;
  • joining and duplicating strings;
  • extracting and replacing substrings;
  • string trimming, padding, and text wrapping (e.g. with Knuth's dynamic word wrap algorithm);
  • text transliteration;
  • text collation (comparing, sorting);
  • text boundary analysis (e.g. for extracting individual words);
  • random string generation;
  • Unicode normalization;
  • character encoding conversion and detection;

and many more.
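
The point about searching non-normalized Unicode strings is easy to demonstrate. The example below uses Python's unicodedata rather than stringi, but the pitfall is identical: the same visible text can be composed or decomposed code points, so a byte-wise comparison misses it until both sides are normalized:

```python
import unicodedata

# "café" as a single composed code point vs. 'e' + combining acute.
composed = "caf\u00e9"       # U+00E9
decomposed = "cafe\u0301"    # U+0065 U+0301

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```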

Interesting, isn’t it? How CS keeps circling back to strings?

Enjoy!

…[S]uffix array construction algorithms

Tuesday, March 25th, 2014

A bioinformatician’s guide to the forefront of suffix array construction algorithms by Anish Man Singh Shrestha, Martin C. Frith, and Paul Horton.

Abstract:

The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.
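
For readers who have not met the structure before, here is a deliberately naive sketch. It is not SA-IS, only the definition the paper starts from: sort all suffixes (O(n² log n) at worst here, versus SA-IS's linear time), then binary-search over them:

```python
# Naive suffix array: sort suffix start positions lexicographically.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def find(s, sa, pattern):
    """Start position of some occurrence of pattern in s, or -1."""
    lo, hi = 0, len(sa)
    while lo < hi:                       # binary search over sorted suffixes
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and s[sa[lo]:sa[lo] + len(pattern)] == pattern:
        return sa[lo]
    return -1

s = "banana"
sa = suffix_array(s)
print(sa)                  # [5, 3, 1, 0, 4, 2]
print(find(s, sa, "nan"))  # 2
```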

If this doesn’t sound like a real page turner, consider the range of biological applications, the ‘spaced seeds’ and ‘subset seeds’ of the abstract, that depend on fast suffix array construction.

Reminds me that computer science departments need to start offering courses in “string theory,” to capitalize on the popularity of that phrase. 😉

Handling and Processing Strings in R

Friday, March 7th, 2014

Handling and Processing Strings in R by Gaston Sanchez. (free ebook)

From the post:

Many years ago I decided to apply for a job in a company that developed data mining applications for big retailers. I was invited for an on-site visit and I went through the typical series of interviews with the members of the analytics team. Everything was going smoothly and I was enjoying all the conversations. Then it came turn to meet the computer scientist. After briefly describing his role in the team he started asking me a bunch of technical questions and tests. Although I was able to answer those questions related with statistics and multivariate analysis, I had a really hard time trying to answer a series of questions related with string manipulations.

I will remember my interview with that guy as one of the most embarrassing moments of my life. That day, the first thing I did when I went back home was to open my laptop, launch R, and start reproducing the tests I failed to solve. It didn’t take me that much to get the right answers. Unfortunately, it was too late and the harm was already done. Needless to say I wasn’t offered the job. That shocking experience showed me that I was not prepared for manipulating character strings. I felt so bad that I promised myself to learn the basics of strings manipulation and text processing. “Handling and Processing Strings in R” is one of the derived results of that old promise.

The content of this ebook is the byproduct of my experience working with character string data in R. It is based on my notes, scripts, projects, and uncountable days and nights in which I’ve been struggling with text data. Briefly, I’ve tried to document and organize several topics related with manipulating character strings.

At one hundred and twelve (112) pages, “Handling and Processing Strings in R” may not answer every question you have about strings and R, but it answers a lot of them.

Enjoy and pass this along!

I first saw this in a tweet by Sharon Machlis.

Kolmogorov Complexity – A Primer

Thursday, December 6th, 2012

Kolmogorov Complexity – A Primer by Jeremy Kun.

From the post:

Previously on this blog (quite a while ago), we’ve investigated some simple ideas of using randomness in artistic design (psychedelic art, and earlier randomized css designs), and measuring the complexity of such constructions. Here we intend to give a more thorough and rigorous introduction to the study of the complexity of strings. This naturally falls into the realm of computability theory and complexity theory, and so we refer the novice reader to our other primers on the subject (Determinism and Finite Automata, Turing Machines, and Complexity Classes; but Turing machines will be the most critical to this discussion).
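
A hedged aside: Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a quick, computable upper bound on it, which is how the intuition Jeremy develops gets applied in practice (e.g. normalized compression distance):

```python
import random
import zlib

# Compressed length as a computable stand-in for description length.
def compressed_size(s: str) -> int:
    return len(zlib.compress(s.encode()))

random.seed(0)
regular = compressed_size("a" * 1000)   # highly regular: compresses well
noisy = compressed_size("".join(random.choice("ab") for _ in range(1000)))

print(regular < noisy)  # True: the noisy string needs a longer description
```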

Jeremy sets the groundwork necessary for a later post in this series. (covering machine learning)

Digest this for a couple of days and I will point out the second post.

Bash One-Liners Explained (series)

Wednesday, November 28th, 2012

Bash One-Liners Explained by Peteris Krumins.

The series page for posts by Peteris Krumins on Bash one-liners.

So far:

One real advantage to Bash scripts is the lack of a graphical interface to get in the way.

A real advantage with “data” files but many times “text” files as well.

STAR: ultrafast universal RNA-seq aligner

Sunday, November 25th, 2012

STAR: ultrafast universal RNA-seq aligner
by Stephen Turner.

From the post:

There’s a new kid on the block for RNA-seq alignment.

Dobin, Alexander, et al. “STAR: ultrafast universal RNA-seq aligner.” Bioinformatics (2012).

Aligning RNA-seq data is challenging because reads can overlap splice junctions. Many other RNA-seq alignment algorithms (e.g. Tophat) are built on top of DNA sequence aligners. STAR (Spliced Transcripts Alignment to a Reference) is a standalone RNA-seq alignment algorithm that uses uncompressed suffix arrays and a mapping algorithm similar to those used in large-scale genome alignment tools to align RNA-seq reads to a genomic reference. STAR is over 50 times faster than any other previously published RNA-seq aligner, and outperforms other aligners in both sensitivity and specificity using both simulated and real (replicated) RNA-seq data.

I had a brief exchange of comments with Lars Marius Garshol on string matching recently. Another example of a string processing approach you may adapt to different circumstances.

SMART – string matching research tool

Friday, August 31st, 2012

SMART – string matching research tool by Simone Faro and Thierry Lecroq.

From the webpage:

1 tool

smart (string matching algorithms research tool) is a tool which provides a standard framework for researchers in string matching. It helps users to test, design, evaluate and understand existing solutions for the exact string matching problem. Moreover it provides the implementation of (almost) all string matching algorithms and a wide corpus of text buffers.

40 years

In the last 40 years of research in computer science, string matching has been one of the most extensively studied problems, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Moreover, string matching algorithms are also basic components used in implementations of practical software existing under most operating systems.

85 algos

Since 1970 more than 80 string matching algorithms have been proposed, more than 50% of them in the last ten years. The smart tool provides a comprehensive collection of all string matching algorithms, implemented in the C programming language, and helps researchers perform experiments and compare them from a practical point of view. Smart provides a practical and standard platform for testing string matching algorithms and sharing results with the community.

12 texts

The smart tool provides also a corpus of 12 texts on which the string matching algorithms can be tested. Texts in the corpus are of different types, including natural language texts, genome sequences, protein sequences, and random texts with a uniform distribution of characters.

Do you know of any similar research tools for named entity recognition?

Bio-hackers will be interested in the “Complete genome of the E. Coli bacterium.”

A Fast Suffix Automata Based Algorithm for Exact Online String Matching

Friday, August 31st, 2012

A Fast Suffix Automata Based Algorithm for Exact Online String Matching by Simone Faro and Thierry Lecroq (Implementation and Application of Automata Lecture Notes in Computer Science, 2012, Volume 7381/2012, 149-158, DOI: 10.1007/978-3-642-31606-7_13)

Abstract:

Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Automata play a very important role in the design of efficient solutions for the exact string matching problem. In this paper we propose a new very simple solution which turns out to be very efficient in practical cases. It is based on a suitable factorization of the pattern and on a straightforward and light encoding of the suffix automaton. It turns out that on average the new technique leads to longer shift than that proposed by other known solutions which make use of suffix automata.

I suppose it is too much to expect writers to notice matching “…pattern[s] in a text is a fundamental problem in [topic maps]….”

Pattern matching in topic maps can extend beyond strings so perhaps the oversight is understandable. 😉

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases

Thursday, July 5th, 2012

INSTRUCT: Space-Efficient Structure for Indexing and Complete Query Management of String Databases by Sourav Dutta and Arnab Bhattacharya.

Abstract:

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure, INSTRUCT, for efficient storage and management of sequence databases. Our structure uses bit vectors for reusing the storage space for common triplets, and hence, has a very low memory requirement. INSTRUCT efficiently handles prefix and suffix search queries in addition to the exact string search operation by iteratively checking the presence of triplets. We also propose an extension of the structure to handle substring search efficiently, albeit with an increase in the space requirements. This extension is important in the context of trie-based solutions which are unable to handle such queries efficiently. We perform several experiments portraying that INSTRUCT outperforms the existing structures by nearly a factor of two in terms of space requirements, while the query times are better. The ability to handle insertion and deletion of strings in addition to supporting all kinds of queries including exact search, prefix/suffix search and substring search makes INSTRUCT a complete data structure.

From the introduction:

As all strings are composed of a defined set of characters, reusing the storage space for common characters promises to provide the most compressed form of representation. This redundancy linked with the need for extreme space-efficient index structures motivated us to develop INSTRUCT (INdexing STrings by Re-Using Common Triplets).
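
A loose sketch of the triplet idea follows; this is not the authors' bit-vector structure, whose layout is far more compact, but it shows how a shared pool of triplets can pre-filter membership queries:

```python
# Index strings by overlapping character triplets; common triplets are
# stored once, and a query is pre-filtered by checking that all of its
# triplets are present before the exact check.
def triplets(s):
    s = f"##{s}$$"                    # pad so short strings have triplets
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TripletIndex:
    def __init__(self):
        self.seen = set()             # shared pool of triplets
        self.strings = set()

    def insert(self, s):
        self.seen |= triplets(s)
        self.strings.add(s)

    def maybe_contains(self, s):      # cheap filter: no false negatives
        return triplets(s) <= self.seen

    def contains(self, s):            # filter, then exact check
        return self.maybe_contains(s) and s in self.strings

idx = TripletIndex()
idx.insert("string")
idx.insert("strong")
print(idx.contains("string"), idx.contains("spring"))  # True False
```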

By the time of a presentation (below) on the technique, the authors apparently rethought the name, settling on:

SPACE-EFFICIENT MANAGEMENT OF TEXT USING INDEXED KEYS” (SEManTIKs)

Neither one really “rolls off the tongue,” but I suspect searching for the second may be somewhat easier.

I say that, but “semantiks” turns out to be a women’s clothing line and at least one popular search engine offers to correct to “semantics.” I am sure a “gathered scoop neck” and a “flattering boot-cut silhouette” are all quite interesting but not really on point.

The slide presentation lists some fourteen (14) other approaches that can be compared to the one developed by the authors. (I am assuming the master’s thesis by Dutta has the details on the comparisons with the other techniques. I haven’t found it online but have written to request a copy.)

This work demonstrates that we are nowhere near the end of improvements for indexing and search.

See also the presentation: SPACE-EFFICIENT MANAGEMENT OF STRING DATABASES BY REUSING COMMON CHARACTERS by Sourav Dutta.

Binary Jumbled String Matching: Faster Indexing in Less Space

Saturday, June 16th, 2012

Binary Jumbled String Matching: Faster Indexing in Less Space by Golnaz Badkobeh, Gabriele Fici, Steve Kroon, and Zsuzsanna Lipták.

Abstract:

We introduce a new algorithm for the binary jumbled string matching problem, where the aim is to decide whether a given binary string of length n has a substring whose multiplicity equals a query vector (x, y). For example, for the string abaaababab, the query (3, 1) would return “yes”, and the query (5, 1) “no”. Previous solutions answered queries in constant time by creating an index of size O(n) in a pre-processing step. The fastest known approach to constructing this index is O(n^2/log n) [Burcsi et al., FUN 2010; Moosa and Rahman, IPL 2010] resp. O(n^2/log^2 n) in the word-RAM model [Moosa and Rahman, JDA, 2012]. We propose an algorithm which creates an index for an input string s by using the string’s run-length encoding. This index can be queried in logarithmic time. Our index has worst-case size n, but extensive experimentation has consistently yielded a size which is between 0.8 and 3 times the length of the run-length encoding of s. The algorithm runs in time O(r^2 log r), where r is the number of a-runs of s, i.e., half the length of the run-length encoding of the string. This is no worse than previous solutions if r = O(n/log n) and better if r = o(n/log n), which is the case for binary strings in many domains. Our experimentation further shows that in the vast majority of cases, the construction algorithm does not exceed the space needed for the index, and when it does, it does so only by a tiny constant.
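
The abstract's own example can be checked with a naive version of the index. The property it rests on: as a fixed-length window slides over a binary string, the count of 'b's changes by at most one per step, so the achievable counts for each length form a contiguous interval, and storing (min, max) per length answers any query in O(1). The naive build below is O(n²); the paper's contribution is a much faster construction over the run-length encoding:

```python
# Naive index for binary jumbled matching: per window length L, store
# the min and max number of 'b's over all windows of that length.
def build_index(s):
    n = len(s)
    ones = [0]
    for ch in s:                      # prefix sums of 'b' counts
        ones.append(ones[-1] + (ch == "b"))
    index = {}
    for L in range(1, n + 1):
        counts = [ones[i + L] - ones[i] for i in range(n - L + 1)]
        index[L] = (min(counts), max(counts))
    return index

def query(index, x, y):               # substring with x a's and y b's?
    L = x + y
    if L not in index:
        return False
    lo, hi = index[L]
    return lo <= y <= hi

idx = build_index("abaaababab")
print(query(idx, 3, 1), query(idx, 5, 1))  # True False (paper's example)
```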

Queries of binary strings in logarithmic time anyone?

I cite this in part because some topic mappers may be indexing binary strings but also as encouragement to consider the properties of your data carefully.

You may discover it has properties that lend themselves to very efficient processing.

Computer Algorithms: Morris-Pratt String Searching

Sunday, April 15th, 2012

Computer Algorithms: Morris-Pratt String Searching

From the post:

We saw that neither brute force string searching nor Rabin-Karp string searching are effective. However in order to improve some algorithm, first we need to understand its principles in detail. We know already that brute force string matching is slow and we tried to improve it somehow by using a hash function in the Rabin-Karp algorithm. The problem is that Rabin-Karp has the same complexity as brute force string matching, which is O(mn).

Obviously we need a different approach, but to come up with a different approach let’s see what’s wrong with brute force string searching. Indeed, by taking a closer look at its principles we can answer the question.

In brute force matching we checked each character of the text with the first character of the pattern. In case of a match we shifted the comparison between the second character of the pattern and the next character of the text. The problem is that in case of a mismatch we must go several positions back in the text. Well in fact this technique can’t be optimized.

Good posts on string matching!
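
The remedy the post builds toward, Morris-Pratt's failure function, can be sketched as follows: precompute, for each prefix of the pattern, the length of its longest proper border, then scan the text without ever moving the text pointer backwards:

```python
# Morris-Pratt string search, O(n + m) overall.
def mp_failure(pattern):
    fail = [0] * (len(pattern) + 1)   # fail[i] = longest border of pattern[:i]
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail

def mp_search(text, pattern):
    fail, k, hits = mp_failure(pattern), 0, []
    for i, ch in enumerate(text):     # the text pointer never backs up
        while k and ch != pattern[k]:
            k = fail[k]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k]
    return hits

print(mp_search("ababcabab", "abab"))  # [0, 5]
```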

On Approximating String Selection Problems with Outliers

Tuesday, February 14th, 2012

On Approximating String Selection Problems with Outliers by Christina Boucher, Gad M. Landau, Avivit Levy, David Pritchard and Oren Weimann.

Abstract:

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of same-length strings, and a parameter d, find a string x that maximizes the number of “non-outliers” within Hamming distance d of x. We prove this problem has no PTAS unless ZPP=NP, correcting a decade-old mistake. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe Closest to k Strings has no EPTAS unless W[1]=FPT. In sum, outliers help model problems associated with using biological data, but we show the problem of finding an approximate solution is computationally difficult.

Just in case you need a break from graph algorithms, intractable and otherwise. 😉

Bed-tree: an all-purpose index structure for string similarity search based on edit distance

Friday, December 23rd, 2011

Bed-tree: an all-purpose index structure for string similarity search based on edit distance by Zhenjie Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, and Divesh Srivastava.

Abstract:

Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.

The authors classify similarity queries as:

  • range: address in customer database
  • top-k: results of search engine
  • all-pairs joins: pairs of proteins or genes
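
The “n-gram signatures” the abstract mentions rest on a simple counting lemma: one edit can destroy at most q of a string's overlapping q-grams, so two strings within edit distance k must share at least max(|s|,|t|) − q + 1 − k·q of them. A hedged sketch of that filter (the Bed-tree itself is a B+-tree, not shown here):

```python
from collections import Counter

# q-gram count filter for edit distance: a cheap necessary condition
# checked before running the expensive exact computation.
def qgrams(s, q=2):
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def may_be_within(s, t, k, q=2):
    need = max(len(s), len(t)) - q + 1 - k * q
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= need             # no false negatives, some false positives

print(may_be_within("bed-tree", "bad-tree", 1))  # True: passes the filter
print(may_be_within("bed-tree", "inverted", 1))  # False: safely pruned
```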

There are a couple of things that trouble me about the paper.

First:

6.3 Top-K construction

In many cases, top-k queries are more practical than range queries. However, existing indexing schemes with inverted lists do not naturally support such queries. To illustrate the performance benefits of our proposals, we implemented a simple strategy with Flamingo, by increasing the range query threshold gradually until more than k string results are found. Notice that we use the same Bed-tree structures to support all different types of queries. Thus, we skip the performance comparisons on index construction but focus on query processing efficiency. (emphasis added)

I am not sure what is meant by inverted lists “…do not naturally support …[top-k queries].” Inverted list structures are wildly popular among WWW search engines so I would like to know more about this notion of “…naturally support….”

Moreover, indexes aren’t simply used; they have to be created as well. It is puzzling that we are left to wonder how long it would take to build a Bed-tree index we could actually use.

Second, there are a couple of fairly serious mis-citation errors in the paper. The authors refer to “Flamingo” and “Mismatch” (from 2008) as comparisons but the articles cited: “[15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008” and “C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008,” respectively, are innocent of any such implementations.

C. Li is the faculty adviser for the Flamingo project, which has a new release since I mentioned it at: The FLAMINGO Project on Data Cleaning, but you don’t cite a project by picking a paper at random that doesn’t mention the project. (If you haven’t looked at the FLAMINGO project, its software and papers you really should.)

C. Xiao and company have a “mismatch” filter but it isn’t ever used as the name of an implementation.

Tracing the history of advances in computer science is easier or at least less time consuming if researchers don’t have to chase rabbits in the form of bad citations. Not to mention that if you aren’t cited properly, you may not get full credit for all the work you have actually done. Good citation practices are in everyone’s interest.

Modifying a Lucene Snowball Stemmer

Tuesday, August 9th, 2011

Modifying a Lucene Snowball Stemmer

From the post:

This post is written for advanced users. If you do not know what SVN (Subversion) is or if you’re not ready to get your hands dirty, there might be something more interesting to read on Wikipedia. As usual. This is an introduction to how to get a Lucene development environment running, a Solr environment and lastly, to create your own Snowball stemmer. Read on if that seems interesting. The recipe for regenerating the Snowball stemmer (I’ll get back to that…) assumes that you’re running Linux. Please leave a comment if you’ve generated the stemmer class under another operating system.

When indexing data in Lucene (a fulltext document search library) and Solr (which uses Lucene), you may provide a stemmer (a piece of code responsible for “normalizing” words to their common form (horses => horse, indexing => index, etc)) to give your users better and more relevant results when they search. The default stemmer in Lucene and Solr uses a library named Snowball which was created to do just this kind of thing. Snowball uses a small definition language of its own to generate parsers that other applications can embed to provide proper stemming.
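
A toy suffix-stripping stemmer, with invented rules, conveys what a generated Snowball stemmer does far more carefully (the rule list here is hypothetical, not Snowball's):

```python
# Toy suffix-stripping stemmer: horses => horse, indexing => index.
# Real Snowball stemmers are generated from rule definitions and
# handle far more morphology than this.
RULES = [("sses", "ss"), ("ies", "y"), ("ing", ""), ("es", "e"), ("s", "")]

def stem(word):
    for suffix, repl in RULES:        # first matching rule wins
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word

print([stem(w) for w in ["horses", "indexing", "glasses"]])
# ['horse', 'index', 'glass']
```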

Really advanced users will want to check out Snowball’s namesake, SNOBOL, a powerful pattern matching language that was invented in 1962. (That’s not a typo, 1962.) See SNOBOL4.org for more information and resources.

The post outlines how to change the default stemmer for Lucene and Solr to improve its stemming of Norwegian words. Useful in case you want to write/improve a stemmer for another language.

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Monday, June 20th, 2011

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction by Guoliang Li, Dong Deng, and Jianhua Feng in SIGMOD ’11.

Abstract:

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.
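
To appreciate what Faerie's filters avoid, here is the redundant brute-force baseline the abstract alludes to, with hypothetical example data and Levenshtein as the character-based measure:

```python
# Brute-force approximate dictionary extraction: test every substring
# near an entity's length. Faerie's filtering shares computation across
# overlapping substrings to avoid exactly this redundancy.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def extract(doc, dictionary, max_dist=1):
    hits = set()
    for entity in dictionary:
        for L in range(len(entity) - max_dist, len(entity) + max_dist + 1):
            for i in range(len(doc) - L + 1):
                if edit_distance(doc[i:i + L], entity) <= max_dist:
                    hits.add((i, doc[i:i + L], entity))
    return sorted(hits)

print(extract("we met dr jon smith today", {"john smith"}))
```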

Entity extraction should be high on your list of topic map skills. A good set of user defined dictionaries is a good start. Creating those dictionaries as topic maps is an even better start.

On string metrics, you might want to visit Sam’s String Metrics, which lists some thirty algorithms.

String Syntax – Post

Tuesday, March 1st, 2011

String Syntax

Since TMQL discussions are starting up, it seems appropriate to point out at least one resource on string syntax.

BTW, I am assuming that the TMQL draft will not have any fewer capabilities than any of the extant implementations?

Is anyone assuming differently?

Substring search algorithm

Monday, February 14th, 2011

Substring search algorithm by Leonid Volnitsky.

From the website:

Describes a new online substring search algorithm that allows faster string traversal. The implementation presented there is claimed to be substantially faster than other online substring search algorithms.

Will be interesting to see which topic map engines incorporate this new substring search algorithm first.

Algorithms – Lecture Notes

Tuesday, January 4th, 2011

Algorithms, Jeff Erickson’s lecture notes.

Mentioned in a post on the Theoretical Computer Science blog, What Lecture Notes Should Everyone Read?.

From the introduction:

Despite several rounds of revision, these notes still contain lots of mistakes, errors, bugs, gaffes, omissions, snafus, kludges, typos, mathos, grammaros, thinkos, brain farts, nonsense, garbage, cruft, junk, and outright lies, all of which are entirely Steve Skiena’s fault. I revise and update these notes every time I teach the course, so please let me know if you find a bug. (Steve is unlikely to care.)

The notes are highly amusing and useful to anyone seeking to improve current subject identification (read searching) practices.

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation – Post

Monday, January 3rd, 2011

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

Good coverage of tokenization of tweets and the use of the Jaccard Distance measure to determine similarity.

Of course, in a topic map, similarity may not lead to a tweet being discarded but may trigger other operations instead.
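The core of the post's approach is easy to sketch: tokenize each tweet, then flag pairs whose Jaccard distance falls below a threshold. The LingPipe post uses its own tokenizer classes; the plain lowercase word tokenization and the 0.5 threshold below are stand-ins for illustration.

```python
# Minimal near-duplicate check in the spirit of the LingPipe post:
# tokenize, then compare token sets by Jaccard distance.
import re

def tokens(tweet: str) -> set:
    """Crude tokenizer: lowercase word characters (a stand-in for
    LingPipe's tokenizers)."""
    return set(re.findall(r"\w+", tweet.lower()))

def jaccard_distance(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b)

t1 = "RT @user: Suffix trees are everywhere"
t2 = "Suffix trees are everywhere!"
d = jaccard_distance(tokens(t1), tokens(t2))
is_near_duplicate = d < 0.5   # arbitrary threshold for this sketch
print(d, is_near_duplicate)
```

Retweet prefixes like “RT @user:” only add a couple of tokens, so the distance stays small and the pair is caught as a near duplicate.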

The R Journal, Issue 2/2

Friday, December 31st, 2010

The R Journal, Issue 2/2 has arrived!

Download complete issue.

Or Individual articles.

A number of topic map relevant papers are in this issue, ranging from stringr: modern, consistent string processing, Hadley Wickham; to the edgy cudaBayesreg: Bayesian Computation in CUDA, Adelino Ferreira da Silva; to a technique that started in the late 1950’s, The RecordLinkage Package: Detecting Errors in Data, Murat Sariyar and Andreas Borg.

Mining of Massive Datasets – eBook

Thursday, December 9th, 2010

Mining of Massive Datasets

Jeff Dalton, Jeff’s Search Engine Caffè reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, think “dragon book.”).

A free eBook no less.

Read Jeff’s post on your way to get a copy.

Look for more comments as I read through it.

Has anyone written a comparison of the recent search engine titles? Just curious.


Update: New version out in hard copy and e-book remains available. See: Mining Massive Data Sets – Update

Detecting “Duplicates” (same subject?)

Friday, December 3rd, 2010

A couple of interesting posts from the LingPipe blog:

Processing Tweets with LingPipe #1: Search and CSV Data Structures

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

The second one on duplicates being the one that caught my eye.

After all, what are merging conditions in the TMDM other than the detection of duplicates?

Of course, I am interested in TMDM merging but also in the detection of fuzzy subject identity.

Whether that is then represented by an IRI or kept as a native merging condition is an implementation issue.

This could be very important for some future leak of diplomatic tweets. 😉
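The hashing-and-normalization idea from the second post is easy to sketch generically: normalize each string to a canonical form, hash the canonical form, and treat anything landing in the same bucket as a merge candidate. This is an illustration of the general technique, not LingPipe's actual pipeline.

```python
# Generic duplicate detection by normalization and hashing, in the
# spirit of the LingPipe post (their pipeline differs in detail).
import hashlib
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def merge_candidates(strings):
    """Group strings whose normalized forms hash identically."""
    buckets = defaultdict(list)
    for s in strings:
        key = hashlib.sha1(normalize(s).encode("utf-8")).hexdigest()
        buckets[key].append(s)
    return [group for group in buckets.values() if len(group) > 1]

docs = ["Topic Maps!", "topic maps", "Suffix trees"]
print(merge_candidates(docs))  # [['Topic Maps!', 'topic maps']]
```

Exact hashing only catches duplicates that normalize to the identical string; fuzzier subject identity needs the similarity measures discussed elsewhere on this page.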

SecondString

Monday, November 15th, 2010

SecondString is a Java library of string matching techniques.

The Levenshtein distance test mentioned in the LIMES post is an example of a string matching technique.

The results are not normalized so compare results from the techniques cautiously.
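SecondString is a Java library, but the normalization caveat is easy to see in a short Python sketch: a raw Levenshtein distance depends on string length, so one common fix (an assumption here, not SecondString's own scaling) is to divide by the length of the longer string.

```python
# Levenshtein edit distance plus a [0, 1] similarity, since raw
# distances from strings of different lengths are hard to compare.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """One common normalization: 1 - distance / max length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))  # 3
```

Three raw edits means something very different for a pair of 7-character strings than for a pair of 70-character strings, which is exactly why unnormalized scores from different techniques should not be compared directly.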

Questions:

  1. Suggest 1 – 2 survey articles on string matching for the class. (The Navarro article cited in Wikipedia on the Levenshtein distance is almost ten years old and despite numerous exclusions, still runs 58 pages. Excellent article but needs updating with more recent material.)
  2. What one technique would you use in constructing your topic map? Why? (2-3 pages, citing examples of why it would be the best for your data)