Classification accuracy is not enough

Classification accuracy is not enough by Bob L. Sturm.

From the post:

Finally published is my article, Classification accuracy is not enough: On the evaluation of music genre recognition systems. I made it completely open access and free for anyone.

Some background: In my paper Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?, I perform three different experiments to determine how well two state-of-the-art systems for music genre recognition are recognizing genre. In the first experiment, I find the two systems are consistently making extremely bad misclassifications. In the second experiment, I find the two systems can be fooled by such simple transformations that they cannot possibly be listening to the music. In the third experiment, I find their internal models of the genres do not match how humans think the genres sound. Hence, it appears that the systems are not recognizing genre in the least. However, this seems to contradict the fact that they achieve extremely good classification accuracies, and have been touted as superior solutions in the literature. Turns out, Classification accuracy is not enough!


I look closely at what kinds of mistakes the systems make, and find they all make very poor yet “confident” mistakes. I demonstrate the latter by looking at the decision statistics of the systems. There is little difference for a system between making a correct classification, and an incorrect one. To judge how poor the mistakes are, I test with humans whether the labels selected by the classifiers describe the music. Test subjects listen to a music excerpt and select between two labels which they think was given by a human. Not one of the systems fooled anyone. Hence, while all the systems had good classification accuracies, good precisions, recalls, and F-scores, and confusion matrices that appeared to make sense, a deeper evaluation shows that none of them are recognizing genre, and thus that none of them are even addressing the problem. (They are all horses, making decisions based on irrelevant but confounded factors.)
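The point about "confident" mistakes can be made concrete with a minimal sketch. It is an illustration of mine, not code or data from Bob's article: the labels, the confidence values, and the helper names `accuracy` and `mean_confidence` are all hypothetical. It compares the average decision statistic on correct and incorrect classifications. When the two are nearly equal, the accuracy figure alone says little about whether the system is responding to the music.

```python
from statistics import mean

# Hypothetical (true_label, predicted_label, confidence) triples.
# These values are invented for illustration; they are not results from the paper.
predictions = [
    ("blues", "blues", 0.94),
    ("metal", "metal", 0.91),
    ("disco", "disco", 0.93),
    ("jazz", "jazz", 0.90),
    ("classical", "metal", 0.92),  # a poor yet confident mistake
]


def accuracy(rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(true == pred for true, pred, _ in rows) / len(rows)


def mean_confidence(rows, correct):
    """Mean confidence over the correct (or incorrect) predictions only."""
    values = [conf for true, pred, conf in rows if (true == pred) == correct]
    return mean(values) if values else float("nan")


acc = accuracy(predictions)
conf_right = mean_confidence(predictions, correct=True)
conf_wrong = mean_confidence(predictions, correct=False)
gap = conf_right - conf_wrong

print(f"accuracy: {acc:.2f}")
print(f"mean confidence when right: {conf_right:.2f}")
print(f"mean confidence when wrong: {conf_wrong:.2f}")
print(f"gap: {gap:.2f}")
```

In this toy data the system looks respectable on accuracy, yet it is just as sure of itself when it labels a classical excerpt as metal. That is only the first half of the evaluation described above; judging how poor a mistake is still requires the human listening test.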


If you have ever wondered what a detailed review of classification efforts would look like, you need wonder no longer!

Bob’s Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing? runs thirty-six (36) pages and examines efforts at music genre recognition (MGR) in detail.

I highly recommend this paper as a demonstration of good research technique.
