Archive for the ‘Symbol’ Category

Finding Structure in Text, Genome and Other Symbolic Sequences

Saturday, July 14th, 2012

Finding Structure in Text, Genome and Other Symbolic Sequences by Ted Dunning. (thesis, 1998)

Abstract:

The statistical methods derived and described in this thesis provide new ways to elucidate the structural properties of text and other symbolic sequences. Generically, these methods allow detection of a difference in the frequency of a single feature, the detection of a difference between the frequencies of an ensemble of features and the attribution of the source of a text. These three abstract tasks suffice to solve problems in a wide variety of settings. Furthermore, the techniques described in this thesis can be extended to provide a wide range of additional tests beyond the ones described here.

A variety of applications for these methods are examined in detail. These applications are drawn from the area of text analysis and genetic sequence analysis. The textually oriented tasks include finding interesting collocations and cooccurent phrases, language identification, and information retrieval. The biologically oriented tasks include species identification and the discovery of previously unreported long range structure in genes. In the applications reported here where direct comparison is possible, the performance of these new methods substantially exceeds the state of the art.

Overall, the methods described here provide new and effective ways to analyse text and other symbolic sequences. Their particular strength is that they deal well with situations where relatively little data are available. Since these methods are abstract in nature, they can be applied in novel situations with relative ease.

Recently posted but dating from 1998.

Older materials are interesting because the careers of their authors can be tracked, say at DBPL Ted Dunning.

Or it can lead you to check an author in Citeseer:

Accurate Methods for the Statistics of Surprise and Coincidence (1993)

Abstract:

Much work has been done on the statistical analysis of text. In some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. In particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results.This assumption of normal distribution limits the ability to analyze rare events. Unfortunately rare events do make up a large fraction of real text.However, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. These tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. In some cases, these measures perform much better than the methods previously used. In cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical.This paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text.

Which has over 600 citations, only one of which is from the author. (I could comment about a well know self-citing ontologist but I won’t.)

The observations in the thesis about “large” data sets are dated but it merits your attention as fundamental work in the field of textual analysis.

As a bonus, it is quite well written and makes an enjoyable read.

Referential transparency

Saturday, February 11th, 2012

Referential transparency

Robert Harper writes:

After reading some of the comments on my Words Matter post, I realized that it might be worthwhile to clarify the treatment of references in PFPL. Most people are surprised to learn that I have separated the concept of a reference from the concept of mutable storage. I realize that my treatment is not standard, but I consider that one thing I learned in writing the book is how to do formulate references so that the concept of a reference is divorced from the thing to which it refers. Doing so allows for a nicely uniform account of references arising in many disparate situations, and is the basis for, among other things, a non-standard, but I think clearer, treatment of communication in process calculi.

It’s not feasible for me to reproduce the entire story here in a short post, so I will confine myself to a few remarks that will perhaps serve as encouragement to read the full story in PFPL.

There is, first of all, a general concept of a symbol, or name, on which much of the development builds. Perhaps the most important thing to realize about symbols in my account is that symbols are not forms of value. They are, rather, a indices for an infinite, open-ended (expansible) family of operators for forming expressions. For example, when one introduces a symbol a as the name of an assignable, what one is doing is introducing two new operators, get[a] and set[a], for getting and setting the contents of that assignable. Similarly, when one introduces a symbol a to be used as a channel name, then one is introducing operators send[a] and receive[a] that act on the channel named a. The point is that these operators interpret the symbol; for example, in the case of mutable state, the get[a] and set[a] operators are what make it be state, as opposed to some other use of the symbol a as an index for some other operator. There is no requirement that the state be formulated using “cells”; one could as well use the framework of algebraic effects pioneered by Plotkin and Power to give a dynamics to these operators. For present purposes I don’t care about how the dynamics is formulated; I only care about the laws that it obeys (this is the essence of the Plotkin/Power approach as I understand it).

….

Finally, let me mention that the critical property of symbols is that they admit disequality. Two different assignables, say a and b, can never, under any execution scenario, be aliases for one another. An assignment to one will never affect the content of the other. Aliasing is a property of references, because two references, say bound to variables x and y, may refer to the same underlying assignable, but two different assignables can never be aliases for one another. This is why “assignment conversion” as it is called in Scheme is not in any way an adequate response to my plea for a change of terminology! Assignables and variables are different things, and references are yet different from those.

What do you think?

How would you apply this analysis to symbols/identifiers as used in topic maps?