New Approach for Automated Categorizing and Finding Similarities in Online Persian News
Authors: Naser Ezzati Jivan, Mahlagha Fazeli and Khadije Sadat Yousefi
Keywords: Categorization of web pages – category – automatic categorization of Persian news – feature – similarity – clustering – structure of web pages.
Abstract:
The Web is a vast source of information in which data are stored in many formats, e.g., web pages, archive files and images. Algorithms and tools that automatically categorize web pages have wide practical application; a web site that collects news from different sources is one example. In this paper, an algorithm for categorizing news is proposed. The proposed approach is specialized for documents (news items) written in the Persian language, but it can be easily generalized to documents in other languages. There is no standard test bench or measure for evaluating the performance of this kind of algorithm, because the degree of similarity between two documents (news items) is not well defined. To test the proposed algorithm, we implemented a web site that uses it to find similar news items. Some of the similar news items found by the algorithm are reported.
Similarity: The first step towards subject identification.