From the post:
After the work I did for my last post, I wanted to practice doing multiple classification. I first thought of using the famous iris dataset, but felt that was a little boring. Ideally, I wanted a practice dataset where I could successfully classify data using both categorical and numeric predictors. Unfortunately, it was tough for me to find such a dataset that was easy enough for me to understand.
The dataset I use in this post comes from a textbook called Analyzing Categorical Data by Jeffrey S. Simonoff, and lends itself to basically the same kind of analysis done by blogger “Wingfeet” in his post predicting authorship of Wheel of Time books. In this case, the dataset contains counts of stop words (function words in English, such as “as”, “also”, “even”, etc.) in chapters, or scenes, from books or plays written by Jane Austen, Jack London (I’m not sure if “London” in the dataset might actually refer to another author), John Milton, and William Shakespeare. Being a textbook example, you just know there’s something worth analyzing in it! The following table describes the numerical breakdown of books and chapters from each author:
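The feature extraction the excerpt describes — counting function-word occurrences per chapter or scene — can be sketched in a few lines. This is a minimal illustration, not the procedure Simonoff used; the stop-word list below is hypothetical and shorter than any real one would be.

```python
from collections import Counter
import re

# Illustrative stop-word (function-word) list; a real authorship study
# would use a much longer, carefully chosen list.
STOP_WORDS = ["a", "an", "also", "as", "even", "the", "this", "that", "with"]

def stop_word_counts(text: str) -> dict:
    """Count occurrences of each stop word in `text`, case-insensitively."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {w: counts[w] for w in STOP_WORDS}

sample = "Even as the rain fell, the travelers pressed on with a quiet resolve."
print(stop_word_counts(sample))
```

Applied to each chapter, these counts form the numeric feature vectors on which a multi-class classifier can be trained.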
This is an introduction to authorship studies as they were known (and may still be) in the academic circles of my youth.
I wonder if the same techniques are as viable today as they were on the Federalist Papers?
The Wheel of Time example demonstrates that the technique remains viable for novel authors.
But what about authorship more broadly?
Can we reliably distinguish between news commentary from multiple sources?
Or between statements by elected officials?
How would your topic map represent purported authorship versus attributed authorship?
Or even a common authorship for multiple purported authors? (speech writers)