Unified analysis of streaming news by Amr Ahmed, Qirong Ho, Jacob Eisenstein, and Eric Xing of Carnegie Mellon University, Pittsburgh, USA, and Alexander J. Smola and Choon Hui Teo of Yahoo! Research, Santa Clara, CA, USA.
News clustering, categorization and analysis are key components of any news portal. They require algorithms capable of dealing with dynamic data to cluster, interpret and to temporally aggregate news articles. These three tasks are often solved separately. In this paper we present a unified framework to group incoming news articles into temporary but tightly-focused storylines, to identify prevalent topics and key entities within these stories, and to reveal the temporal structure of stories as they evolve. We achieve this by building a hybrid clustering and topic model. To deal with the available wealth of data we build an efficient parallel inference algorithm by sequential Monte Carlo estimation. Time and memory costs are nearly constant in the length of the history, and the approach scales to hundreds of thousands of documents. We demonstrate the efficiency and accuracy on the publicly available TDT dataset and data of a major internet news site.
From the article:
Such an approach combines the strengths of clustering and topic models. We use topics to describe the content of each cluster, and then we draw articles from the associated story. This is a more natural fit for the actual process of how news is created: after an event occurs (the story), several journalists write articles addressing various aspects of the story. While their vocabulary and their view of the story may differ, they will by necessity agree on the key issues related to a story (at least in terms of their vocabulary). Hence, to analyze a stream of incoming news we need to infer a) which (possibly new) cluster could have generated the article and b) which topic mix describes the cluster best.
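The first inference question in the quoted passage, which (possibly new) cluster could have generated an article, can be illustrated with a toy sketch. This is not the authors' model or their sequential Monte Carlo inference: it is a greedy stand-in that scores each existing story by its size and a smoothed word likelihood, against a fixed cost for opening a new story. The function name `assign_story` and the parameters `alpha`, `beta`, and `vocab_size` are assumptions made for illustration, and the topic-mix question (b) is left out entirely.

```python
import math
from collections import Counter

def assign_story(words, stories, alpha=1.0, beta=0.01, vocab_size=1000):
    """Greedily assign an article to a story and update that story.

    words:   list of tokens in the incoming article
    stories: list of [Counter of word counts, number of articles], updated in place
    alpha, beta, vocab_size: hypothetical smoothing settings, not from the paper
    Returns the index of the chosen story (a new index if a story was opened).
    """
    scores = []
    for counts, n_docs in stories:
        total = sum(counts.values())
        # Smoothed unigram log-likelihood of the article under this story.
        loglik = sum(
            math.log((counts[w] + beta) / (total + beta * vocab_size))
            for w in words
        )
        # Larger stories are more likely a priori (rich-get-richer prior).
        scores.append(math.log(n_docs) + loglik)

    # Score for opening a new story: uniform word distribution, weight alpha.
    scores.append(math.log(alpha) + len(words) * math.log(1.0 / vocab_size))

    best = max(range(len(scores)), key=scores.__getitem__)
    if best == len(stories):
        stories.append([Counter(), 0])
    stories[best][0].update(words)
    stories[best][1] += 1
    return best
```

Fed a stream of tokenized articles, the function keeps articles with overlapping vocabulary in the same story and opens a new story when an article shares almost no words with any existing one. A real system would also have to let old stories fade over time, which this sketch ignores.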
I single out the quoted passage because the authors first say that the journalists' vocabulary for a story may differ and then, in the next breath, say that their vocabulary will by necessity agree on the key issues.
Given the success of their results, could it be that news reporting is more homogeneous in its vocabulary than other forms of writing?
Perhaps news compression, where duplicated content is suppressed but the “fact” of reportage is retained, could make an interesting topic map.
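As a rough illustration of what such compression might look like, the sketch below keeps one copy of each near-duplicate report and records every source that carried it. The word-set Jaccard similarity, the `threshold` value, and the `compress` function are all assumptions chosen for brevity, not anything proposed in the paper.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def compress(reports, threshold=0.6):
    """Suppress near-duplicate content but retain who reported it.

    reports:   iterable of (source, text) pairs
    threshold: hypothetical similarity cutoff for treating texts as duplicates
    Returns a list of {"text": ..., "sources": [...]} entries.
    """
    kept = []
    for source, text in reports:
        toks = tokens(text)
        for entry in kept:
            if jaccard(toks, entry["tokens"]) >= threshold:
                # Duplicate content: drop the text, keep the fact of reportage.
                entry["sources"].append(source)
                break
        else:
            kept.append({"text": text, "tokens": toks, "sources": [source]})
    return [{"text": e["text"], "sources": e["sources"]} for e in kept]
```

In topic map terms, each retained text could stand for a single subject, with the list of sources attached as occurrences of that subject.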