Graph-based Approach to Automatic Taxonomy Generation (GraBTax) by Pucktada Treeratpituk, Madian Khabsa, C. Lee Giles.
Abstract:
We propose a novel graph-based approach for constructing concept hierarchy from a large text corpus. Our algorithm, GraBTax, incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, GraBTax first extracts topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply GraBTax to articles, primarily computer science, in the CiteSeerX digital library and search engine. The quality of the resulting concept hierarchy is assessed by both human judges and comparison with Wikipedia categories.
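To make the pipeline in the abstract concrete, here is a minimal sketch in Python. It is not the authors' code and it does not use their partitioning algorithm: it swaps in a much cruder technique, picking the best-connected term as the parent and treating the connected components left after removing it as the children. It also ignores lexical similarity and counts only co-occurrence. The function names (`build_graph`, `components`, `taxonomy`) and the toy `docs` are my own assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(docs):
    """Weighted co-occurrence graph: weight = number of docs containing both terms."""
    graph = defaultdict(dict)
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph


def components(graph, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    remaining, found = set(nodes), []
    while remaining:
        start = min(remaining)  # deterministic starting point
        stack, comp = [start], {start}
        while stack:
            for nbr in graph[stack.pop()]:
                if nbr in remaining and nbr not in comp:
                    comp.add(nbr)
                    stack.append(nbr)
        remaining -= comp
        found.append(comp)
    return found


def taxonomy(graph, nodes):
    """Return (term, children): the best-connected term becomes the parent,
    and each component left after removing it is split recursively."""
    def strength(n):
        # Total edge weight to other terms inside the current node set.
        return sum(w for nbr, w in graph[n].items() if nbr in nodes)

    # Highest strength wins; ties are broken alphabetically.
    root = min(nodes, key=lambda n: (-strength(n), n))
    children = [taxonomy(graph, comp) for comp in components(graph, nodes - {root})]
    return (root, sorted(children))


# Toy corpus (hypothetical): each document is reduced to its topical terms.
docs = [
    ["machine learning", "neural networks", "backpropagation"],
    ["machine learning", "neural networks"],
    ["neural networks", "perceptron"],
    ["machine learning", "decision trees", "pruning"],
    ["machine learning", "decision trees"],
    ["decision trees", "entropy"],
]

graph = build_graph(docs)
tree = taxonomy(graph, set(graph))
```

On this toy corpus `tree` has "machine learning" at the root, with "decision trees" and "neural networks" as its two children and the remaining terms as leaves beneath them. A real run would need term extraction from full text and a proper partitioning step, which is where the paper does its actual work.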
Interesting work.
For example:
Unfortunately, existing taxonomies for concepts in computer science such as ODP categories and the ACM Classification System are unsuitable as a gold standard. ODP categories are too broad and do not contain the majority of concepts produced by our algorithm. For instance, there are no sub-concepts for “Semantic Web” in ODP. Also some portions of ODP categories under computer science are not computer science related concepts, especially at the lower level. For example, the concepts under “Neural Networks” are Books, People, Companies, Publications, FAQs, Help and Tutorials, etc. The ACM Classification System has similar drawbacks, where its categories are too broad for comparison.
Makes me curious whether the topics extracted from articles would consistently map to the broad categories assigned by the ACM.
Also instructive for its use of graphs, which admit of no pre-determined data structure.
I say that because of an on-going discussion about alternative data models for topic maps.
As you know, I don’t think topic maps have only one data model, not even my own.
The model you construct with your topic map should meet your needs, not mine.
Graphs are a good example of interchangeable information artifacts despite no one being able to constrain the graphs of others.
XML is another, although it gets overlooked from time to time.
PS: The authors don’t say but I am assuming that ODP = Open Directory Project.