I was reading a paper on natural language processing (NLP) when it occurred to me to ask:
When is parsing of any data not natural language processing?
I hear the phrase, “natural language processing,” applied to a corpus of emails, blog posts, web pages, electronic texts, transcripts of international phone calls and the like.
Other than following others out of habit, why do we say those are subject to “natural language processing?”
As opposed to, say, a database schema?
When we “process” the column headers in a database schema, aren’t we engaged in “natural language processing?” What about SGML/XML schemas or instances they govern?
Being mindful of semantics, synonymy and polysemy, it’s hard to think of examples that are not “natural language processing.”
At least for data that would be meaningful if read by a person. Streams of numbers perhaps not, but I would argue that the symbolism that defines their processing falls under natural language processing.
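To make the column header case concrete, here is a minimal sketch of treating headers as very short natural language texts. Everything in it is an assumption for illustration: the header names, the `SYNONYMS` table and the function names are hypothetical, and a real system would draw its synonyms from a thesaurus or learn them from data.

```python
import re

# Hypothetical abbreviation/synonym table, for illustration only.
SYNONYMS = {
    "cust": "customer",
    "client": "customer",
    "addr": "address",
    "tel": "phone",
    "telephone": "phone",
}

def header_tokens(header):
    """Split a column header into lowercase word tokens."""
    # Insert a space at lower-to-upper camelCase boundaries: "custAddr" -> "cust Addr".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", header)
    # Treat any run of non-alphanumeric characters (underscore, hyphen, space) as a separator.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def normalize(header):
    """Map each token to a canonical term where the table knows a synonym."""
    return tuple(SYNONYMS.get(t, t) for t in header_tokens(header))

if __name__ == "__main__":
    for h in ["cust_name", "CustomerName", "client-name"]:
        print(h, "->", normalize(h))
```

The three headers collapse to the same pair of terms only because the table asserts the synonymy. Polysemy is untouched: a lookup like this cannot tell whether "address" means a postal address or a network address, which is exactly the sort of question I would call natural language processing.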
Thoughts?
When we read emails/blogs or column headers/XML/SGML we are doing natural language processing, yes. But computers accessing a SQL database or parsing an XML document are not doing NLP. They’re processing data according to a formal structure, unlike when they read “unstructured” text.
So I think the distinction is very much meaningful.
Comment by larsga@garshol.priv.no — May 24, 2012 @ 5:39 am
OK, if we want to distinguish “natural language processing,” meaning any human reading/interpretation of symbols, from parsing, meaning computer manipulation of the same symbols, I can live with that.
But I don’t think that is the conventional usage: http://en.wikipedia.org/wiki/Natural_language_processing
I agree that distinguishing our reading from a machine’s is useful.
My point, which I didn’t make clearly, was to extend “NLP,” as conventionally understood, to non-conventional areas.
For example, imagine applying statistical NLP to the markup in documents as opposed to using a DTD or schema.
Comment by Patrick Durusau — May 24, 2012 @ 9:25 am