Archive for the ‘Segmentation’ Category

Segmenting Words and Sentences

Wednesday, March 14th, 2012

Segmenting Words and Sentences by Richard Marsden.

From the post:

Even simple NLP tasks such as tokenizing words and segmenting sentences can have their complexities. Punctuation characters could be used to segment sentences, but this requires the punctuation marks to be treated as separate tokens. This would result in abbreviations being split into separate words and sentences.

This post uses a classification approach to create a parser that returns lists of sentences of tokenized words and punctuation.

Splitting text into words and sentences seems like it should be the simplest NLP task. It probably is, but there are still a number of potential problems. For example, a simple approach could use space characters to divide words. Punctuation (full stop, question mark, exclamation mark) could be used to divide sentences. This quickly comes into problems when an abbreviation is processed. “etc.” would be interpreted as a sentence terminator, and “U.N.E.S.C.O.” would be interpreted as six individual sentences, when both should be treated as single word tokens. How should hyphens be interpreted? What about speech marks and apostrophes?

A good introduction to segmentation, but I would test the approach on a sample text of your own before trusting it too far. Writing habits vary even within languages.
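To make the abbreviation problem concrete, here is a minimal sketch of the kind of sample-text test I have in mind. It is not the classifier from the post: `naive_sentences` is the punctuation-only rule the post warns about, and `abbreviation_aware_sentences` is a simple rule-based stand-in that consults a hand-made abbreviation list. The function names, the `ABBREVIATIONS` set, and the sample sentence are all my own assumptions for illustration.

```python
import re

# Hypothetical abbreviation list, for illustration only; a real list
# would have to be built from the texts you actually process.
ABBREVIATIONS = {"etc.", "u.n.e.s.c.o."}

SAMPLE = ("U.N.E.S.C.O. funds schools, libraries, etc. in many countries. "
          "Is that enough?")


def naive_sentences(text):
    """Treat every '.', '?' or '!' as the end of a sentence."""
    pieces = re.findall(r"[^.?!]+[.?!]?", text)
    return [p.strip() for p in pieces if p.strip()]


def abbreviation_aware_sentences(text):
    """Split on whitespace, then end a sentence only at a token that
    ends in '.', '?' or '!' and is not a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a terminator
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    print(len(naive_sentences(SAMPLE)), "pieces from the naive rule")
    for sentence in abbreviation_aware_sentences(SAMPLE):
        print(sentence)
```

On this sample the naive rule produces nine fragments, six of them from “U.N.E.S.C.O.” alone, while the list-based version returns the two sentences a reader would expect. The list-based rule has its own blind spot: an abbreviation that really does end a sentence is never treated as a boundary, which is one reason to prefer a trained classifier such as the one the post describes.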

Challenges of Chinese Natural Language Processing

Sunday, March 11th, 2012

Thinkudo Labs is posting a series on Chinese natural language processing.

I will be gathering those posts here for ease of reference.

Challenges of Chinese Natural Language Processing – Segmentation

Challenges of Chinese Natural Language Processing – Homograph
(If you are betting this was the post that caught my attention, you are right in one.)

You will need help from a native Chinese speaker for serious Chinese language processing, but understanding some of the issues ahead of time won’t hurt.