Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 31, 2011

The web of topics: discovering the topology of topic evolution in a corpus

Filed under: Corpus Linguistics,Graphs,Navigation — Patrick Durusau @ 3:39 pm

The web of topics: discovering the topology of topic evolution in a corpus by Yookyung Jo, John E. Hopcroft, and, Carl Lagoze, Cornell University, Ithaca, NY, USA.

Abstract:

In this paper we study how to discover the evolution of topics over time in a time-stamped document collection. Our approach is uniquely designed to capture the rich topology of topic evolution inherent in the corpus. Instead of characterizing the evolving topics at fixed time points, we conceptually define a topic as a quantized unit of evolutionary change in content and discover topics with the time of their appearance in the corpus. Discovered topics are then connected to form a topic evolution graph using a measure derived from the underlying document network. Our approach allows inhomogeneous distribution of topics over time and does not impose any topological restriction in topic evolution graphs. We evaluate our algorithm on the ACM corpus. The topic evolution graphs obtained from the ACM corpus provide an effective and concrete summary of the corpus with remarkably rich topology that are congruent to our background knowledge. In a finer resolution, the graphs reveal concrete information about the corpus that were previously unknown to us, suggesting the utility of our approach as a navigational tool for the corpus.

The term topic is being used in this paper to mean a subject in topic map parlance.

From the paper:

Our work is built on the premise that the words relevant to a topic are distributed over documents such that the distribution is correlated with the underlying document network such as a citation network. Specifically, in our topic discovery methodology, in order to test if a multinomial word distribution derived from a document constitutes a new topic, the following heuristic is used. We check that the distribution is exclusively correlated to the document network by requiring it to be significantly present in other documents that are network neighbors of the given document while suppressing the nondiscriminative words using the background model.

Navigation of a corpus on the basis of such a process would indeed be rich, but it would be even richer were multiple ways to represent the same subjects mapped together.

It would also be interesting to see how the resulting graphs, which included only the document titles and abstracts, compared to graphs constructed using the entire documents.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress