Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? by Christopher D. Manning.
Abstract:
I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
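As a rough check on how those two figures in the abstract relate, here is a back-of-the-envelope sketch. It assumes per-token errors are independent and an average sentence length of about 20 tokens; both assumptions are mine, not the paper's:

```python
# Back-of-the-envelope: how 97.3% token accuracy translates to sentence accuracy.
# Assumptions (mine, not the paper's): independent per-token errors and an
# average sentence length of roughly 20 tokens.
token_accuracy = 0.973
avg_sentence_length = 20

sentence_accuracy = token_accuracy ** avg_sentence_length
print(f"Estimated sentence accuracy: {sentence_accuracy:.1%}")
# ~57.8%, in the same ballpark as the 56% Manning reports.
```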
I was struck by Christopher’s observation:
The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
which comes up again in his final sentence:
But in such cases, we must accept that we are assigning parts of speech by convention for engineering convenience rather than achieving taxonomic truth, and there are still very interesting issues for linguistics to continue to investigate, along the lines of [27].
I suppose the observation stood out for me because: on what basis other than “convenience” would we assign properties?
When I construct a topic, I assign properties that I hope will be useful to others when they view that particular topic. I don’t assign it properties unknown to me, and I don’t necessarily assign it all the properties I know for that topic.
I may even assign it properties that I know will cause it to merge with other topics.
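To make that concrete, here is a minimal sketch of the kind of merging I have in mind. The identifiers, property names, and merge-on-shared-identifier rule are my own simplification for illustration, not the behavior of any particular topic map engine:

```python
# Minimal sketch: topics merge when they share a subject identifier.
# The identifiers and property names below are hypothetical.
def merge_topics(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []
    for topic in topics:
        for existing in merged:
            if existing["identifiers"] & topic["identifiers"]:
                # Shared identifier: fold this topic into the existing one.
                existing["identifiers"] |= topic["identifiers"]
                existing["properties"].update(topic["properties"])
                break
        else:
            merged.append({"identifiers": set(topic["identifiers"]),
                           "properties": dict(topic["properties"])})
    return merged

# Assigning a topic an identifier that another topic already carries
# causes the two to merge into a single topic.
topics = [
    {"identifiers": {"http://example.org/pos-tagging"},
     "properties": {"label": "Part-of-speech tagging"}},
    {"identifiers": {"http://example.org/pos-tagging", "http://example.org/tagging"},
     "properties": {"note": "Discussion of Manning (2011)"}},
]
print(merge_topics(topics))  # one merged topic with both sets of properties
```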
BTW, footnote [27] refers to:
Aarts, B.: Syntactic gradience: the nature of grammatical indeterminacy. Oxford University Press, Oxford (2007)
Sounds like an interesting work. I did search for “semantic indeterminacy” on Amazon, but it marked out “semantic” and returned results for indeterminacy alone. 😉
I first saw this in a tweet by the Stanford NLP Group.