Understanding Natural Language with Deep Neural Networks Using Torch

Understanding Natural Language with Deep Neural Networks Using Torch by Soumith Chintala and Wojciech Zaremba.

This is a deeply impressive article and a good introduction to Torch (scientific computing package with neural network, optimization, etc.)

In the preliminary materials, the authors illustrate one of the difficulties of natural language processing by machine:

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

You can already see the problem but just to call it out, the language usage in Wikipedia, for example, may or may not match the domain of interest. You could certainly use it as a general case but it will produce very odd results when the text to be “understood” in a regional version of a language where common words have meanings other than you will find in Wikipedia.

Slang is a good example. In the 17th century for example, “cab” was a term used for a brothel. To take a “hit” has a different meaning than being struck by a boxer, would be a more recent example.

“Understanding” natural language with machines is a great leap forward but one should never leap without looking.

Comments are closed.