Christopher Phipps mentioned Automatic Evaluation of Text Coherence: Models and Representations by Mirella Lapata and Regina Barzilay in a tweet today. Running the article down, I discovered it was published in the Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) in 2005.
Useful but a bit dated.
A more recent resource: A Bibliography of Coherence and Cohesion, Wolfram Bublitz (Universität Augsburg). Last updated: 2010.
The Bublitz bibliography is more recent, but a current bibliography would be even more useful.
Can you suggest a more recent bibliography on text coherence/cohesion?
I ask because, while looking for such a bibliography, I encountered: Improving Topic Coherence with Regularized Topic Models by David Newman, Edwin V. Bonilla, and Wray Buntine.
The abstract reads:
Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflect broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.
I don’t think the “…small collections or noisy text (e.g. web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful” is a surprise to anyone. I take that as the traditional “garbage in, garbage out.”
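To make "topic coherence" concrete: a common automatic measure (not the one proposed in the Newman et al. paper, which uses regularized priors) is UMass coherence, which scores a topic's top words by how often they co-occur in documents. Here is a minimal sketch with a toy corpus; the documents and topic word lists are invented for illustration.

```python
# Illustrative sketch of the UMass topic-coherence score.
# Corpus and topics are toy examples, not from the paper.
from itertools import combinations
from math import log

docs = [
    {"cat", "dog", "pet", "vet"},
    {"cat", "dog", "leash"},
    {"stock", "market", "trade"},
    {"stock", "market", "cat"},   # a "noisy" document
]

def doc_freq(word):
    # Number of documents containing the word.
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    # Number of documents containing both words.
    return sum(1 for d in docs if w1 in d and w2 in d)

def umass_coherence(topic_words):
    # Sum log((D(wi, wj) + 1) / D(wj)) over ordered word pairs;
    # +1 smooths the count so the log is always defined.
    score = 0.0
    for i, j in combinations(range(len(topic_words)), 2):
        wi, wj = topic_words[i], topic_words[j]
        score += log((co_doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

# Words that co-occur often score higher than words that rarely
# appear together, matching the intuition in the abstract.
print(umass_coherence(["cat", "dog", "pet"]))     # coherent topic
print(umass_coherence(["cat", "stock", "leash"]))  # incoherent topic
```

On a small or noisy corpus like the toy one above, even a "good" topic's counts are sparse, which is exactly why the paper argues coherence degrades in that setting.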
However, the “regularizers” may be useful for automatic or assisted authoring of topics in the topic map sense of the word topic, assuming you want to mine small, or small and noisy, texts. The authors say the technique should also apply to large texts and promise future research on applying “regularizers” there.
I checked the authors’ recent publications but didn’t see anything I would call a “large” text application of “regularizers.” It remains an open area of research if you want to take the lead.