Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 6, 2015

Topic Extraction and Bundling of Related Scientific Articles

Topic Extraction and Bundling of Related Scientific Articles by Shameem A Puthiya Parambath.

Abstract:

Automatic classification of scientific articles based on common characteristics is an interesting problem with many applications in digital library and information retrieval systems. Properly organized articles can be useful for automatic generation of taxonomies in scientific writings, textual summarization, efficient information retrieval, etc. Generating article bundles from a large number of input articles, based on the associated features of the articles, is a tedious and computationally expensive task. In this report we propose an automatic two-step approach for topic extraction and bundling of related articles from a set of scientific articles in real-time. For topic extraction, we make use of Latent Dirichlet Allocation (LDA) topic modeling techniques and for bundling, we make use of hierarchical agglomerative clustering techniques.

We run experiments to validate our bundling semantics and compare it with existing models in use. We make use of an online crowdsourcing marketplace provided by Amazon called Amazon Mechanical Turk to carry out experiments. We explain our experimental setup and empirical results in detail and show that our method is advantageous over existing ones.

On “bundling” from the introduction:

Effective grouping of data requires a precise definition of closeness between a pair of data items, and the notion of closeness always depends on the data and the problem context. Closeness is defined in terms of similarity of the data pairs, which in turn is measured in terms of dissimilarity or distance between pairs of items. In this report we use the terms similarity, dissimilarity and distance to denote the measure of closeness between data items. Most bundling schemes start with identifying the common attributes (metadata) of the data set, here scientific articles, and create bundling semantics based on the combination of these attributes. Here we suggest a two-step algorithm to bundle scientific articles. In the first step we group articles based on the latent topics in the documents and in the second step we carry out agglomerative hierarchical clustering based on the inter-textual distance and co-authorship similarity between articles. We run experiments to validate the bundling semantics and to compare it with content-only based similarity. We used 19937 articles related to Computer Science from arXiv [htt12a] for our experiments.
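To make the second step concrete, here is a minimal sketch of agglomerative clustering over a distance that mixes text and co-authorship. It is not the report's implementation: the excerpt does not define its inter-textual distance, so I assume Jaccard distance over word sets as a stand-in, an equal weighting `alpha`, average linkage, and a stopping `threshold`. The function names and the toy articles are hypothetical. The first step (LDA topic grouping) is not shown; imagine this running inside one topic group.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def article_distance(x, y, alpha=0.5):
    """Weighted mix of text distance and co-authorship distance.

    Assumption: both parts are Jaccard distances and alpha is a free
    weight; the report's actual measures are not given in the excerpt.
    """
    text_d = jaccard_distance(x["words"], y["words"])
    author_d = jaccard_distance(x["authors"], y["authors"])
    return alpha * text_d + (1 - alpha) * author_d


def bundle(articles, threshold=0.7, alpha=0.5):
    """Average-linkage agglomerative clustering over article indices.

    Repeatedly merges the two closest clusters and stops once the
    closest pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(articles))]

    def linkage(c1, c2):
        # Average pairwise distance between members of two clusters.
        total = sum(
            article_distance(articles[i], articles[j], alpha)
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), d = min(
            (
                ((p, q), linkage(clusters[p], clusters[q]))
                for p, q in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two LDA-flavored papers, two graph papers.
articles = [
    {"words": {"topic", "model", "lda", "inference"}, "authors": {"A. Rao", "B. Chen"}},
    {"words": {"topic", "model", "lda", "corpus"}, "authors": {"B. Chen"}},
    {"words": {"graph", "shortest", "path", "algorithm"}, "authors": {"C. Diaz"}},
    {"words": {"graph", "path", "algorithm", "network"}, "authors": {"C. Diaz", "D. Evans"}},
]

print(bundle(articles))  # [[0, 1], [2, 3]]
```

Raising `threshold` yields fewer, larger bundles, and shifting `alpha` toward 0 lets shared authors count for more than shared vocabulary. This naive loop recomputes every linkage on each pass, which is fine for a toy example but would not scale to the 19937 articles used in the report.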

Is a “bundle” the same thing as a topic that represents “all articles on subject X?”

I have seen a number of topic map examples that use the equivalent of a proper noun: a proper subject, that is, a singular and unique subject.

But there is no reason why I could not have a topic that represents all the articles on deep learning written in 2014, for example. Methods such as the bundling techniques described here could prove to be quite useful in such cases.
