On the origin of long-range correlations in texts by Eduardo G. Altmann, Giampaolo Cristadoro, and Mirko Degli Esposti.
Abstract:
The complexity of human interactions with social and natural phenomena is mirrored in the way we describe our experiences through natural language. In order to retain and convey such a high dimensional information, the statistical properties of our linguistic output has to be highly correlated in time. An example are the robust observations, still largely not understood, of correlations on arbitrary long scales in literary texts. In this paper we explain how long-range correlations flow from highly structured linguistic levels down to the building blocks of a text (words, letters, etc..). By combining calculations and data analysis we show that correlations take form of a bursty sequence of events once we approach the semantically relevant topics of the text. The mechanisms we identify are fairly general and can be equally applied to other hierarchical settings.
Another area of arXiv.org, Physics > Data Analysis, Statistics and Probability, to monitor.
The authors used ten (10) novels from Project Gutenberg:
- Alice’s Adventures in Wonderland
- The Adventures of Tom Sawyer
- Pride and Prejudice
- Life on the Mississippi
- The Jungle
- The Voyage of the Beagle
- Moby Dick; or The Whale
- Ulysses
- Don Quixote
- War and Peace
Interesting research that will take a while to digest, but I have to wonder: why these ten (10) novels?
Or perhaps better, in an age of “big data,” why only ten (10)?
Why not the entire corpus of Project Gutenberg?
Or perhaps the texts of Wikipedia in its multitude of languages?
My reasoning: if the results represent an insight about natural language, they should be applicable beyond English. Yes?
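For anyone tempted to try a larger corpus, here is a minimal sketch of one simple way to look for correlations in a text: mark every position where a chosen letter occurs with a 1 (0 elsewhere), then compute the sample autocorrelation of that 0/1 sequence at different lags. To be clear, this is my own illustration and not the authors' method; the function names `indicator_sequence` and `autocorrelation` are made up for the example, and a serious analysis would need far longer texts and more careful estimators.

```python
def indicator_sequence(text, symbol):
    """Return a list of 0/1 values marking where `symbol` occurs in `text`."""
    return [1 if ch == symbol else 0 for ch in text]


def autocorrelation(seq, lag):
    """Sample autocorrelation of a numeric sequence at a given lag.

    Hypothetical helper for illustration only: the covariance at `lag`
    is divided by the variance, so lag 0 gives 1.0.
    """
    n = len(seq)
    if lag < 0 or lag >= n:
        raise ValueError("lag must satisfy 0 <= lag < len(seq)")
    mean = sum(seq) / n
    var = sum((x - mean) * (x - mean) for x in seq) / n
    if var == 0:
        # Constant sequence: the correlation is undefined, so report 0.0.
        return 0.0
    cov = sum(
        (seq[i] - mean) * (seq[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)
    return cov / var


if __name__ == "__main__":
    # Toy input; in practice you would read in a full novel's text.
    sample = "the quick brown fox jumps over the lazy dog " * 50
    seq = indicator_sequence(sample, "e")
    for lag in (1, 2, 5, 10, 50):
        print(lag, autocorrelation(seq, lag))
```

Values that stay away from zero as the lag grows would hint at long-range structure; the same two functions could be run unchanged over texts in other languages.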
If this is your area, comments and suggestions would be most welcome.