TM-Gen: A Topic Map Generator from Text Documents

TM-Gen: A Topic Map Generator from Text Documents by Angel L. Garrido, et al.

From the post:

The vast amount of text documents stored in digital format is growing at a frantic rhythm each day. Therefore, tools able to find accurate information by searching in natural language information repositories are gaining great interest in recent years. In this context, there are especially interesting tools capable of dealing with large amounts of text information and deriving human-readable summaries. However, one step further is to be able not only to summarize, but to extract the knowledge stored in those texts, and even represent it graphically.

In this paper we present an architecture to generate automatically a conceptual representation of knowledge stored in a set of text-based documents. For this purpose we have used the topic maps standard and we have developed a method that combines text mining, statistics, linguistic tools, and semantics to obtain a graphical representation of the information contained therein, which can be coded using a knowledge representation language such as RDF or OWL. The procedure is language-independent, fully automatic, self-adjusting, and it does not need manual configuration by the user. Although the validation of a graphic knowledge representation system is very subjective, we have been able to take advantage of an intermediate product of the process to make an experimental
validation of our proposal.

Of particular note on the automatic construction of topic maps:

Addition of associations:

TM-Gen adds to the topic map the associations between topics found in each sentence. These associations are given by the verbs present in the sentence. TM-Gen performs this task by searching the subject included as topic, and then it adds the verb as its association. Finally, it links its verb complement with the topic and with the association as a new topic.

Depending on the archive one would expect associations between the authors and articles but also topics within articles, to say nothing of date, the publication, etc. Once established, a user can request a view that consists of more or less detail. If not captured, however, more detail will not be available.

There is only a general description of TM-Gen but enough to put you on the way to assembling something quite similar.

Comments are closed.