German Compound Words by Brian Johnson.
From the post:
Mark Twain is quoted as having said, “Some German words are so long that they have a perspective.”
Although eBay users are unlikely to search using fearsome beasts like “rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, which stands for the “beef labeling supervision duties delegation law”, we do frequently see compound words in our users’ queries. While some might look for “damenlederhose”, others might be searching for the same thing (women’s leather pants) using the decompounded forms “damen lederhose” or “damen leder hose”. And even though a German teacher would tell you only “damenlederhose” or “damen lederhose” are correct, the users’ expectation is to see the same results regardless of which form is used.
This scenario exists on the seller side as well. That is, people selling on eBay might describe their item using one or more of these forms. In such cases, what should a search engine do? While the problem might seem simple at first, German word-breaking – or decompounding, as it is also known – is not so simple.
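To make the three query forms concrete, here is a minimal sketch of dictionary-based decompounding. It is not taken from the eBay post and is not a description of eBay's method. The tiny `LEXICON`, the `min_len` cutoff, and the function names `decompound` and `normalize` are all assumptions made for illustration. The idea is to split each token into known words and use the finest split as a shared key, so all three spellings land on the same key.

```python
from functools import lru_cache

# Hypothetical toy lexicon (an assumption for this sketch); a real system
# would need a large dictionary, and probably frequency data as well.
LEXICON = frozenset({"damen", "leder", "hose", "lederhose"})


def decompound(word, lexicon=LEXICON, min_len=3):
    """Return every way to split `word` into consecutive lexicon entries."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def splits(start):
        # Reaching the end of the word means one complete split was found.
        if start == len(word):
            return [[]]
        results = []
        for end in range(start + min_len, len(word) + 1):
            part = word[start:end]
            if part in lexicon:
                for rest in splits(end):
                    results.append([part] + rest)
        return results

    return splits(0)


def normalize(query, lexicon=LEXICON):
    """Map a query to a canonical key built from the finest split of each token."""
    parts = []
    for token in query.split():
        candidates = decompound(token, lexicon)
        if candidates:
            # Prefer the split with the most parts so every form converges.
            parts.extend(max(candidates, key=len))
        else:
            # Unknown tokens pass through unchanged (lowercased).
            parts.append(token.lower())
    return tuple(parts)


forms = ["damenlederhose", "damen lederhose", "damen leder hose"]
keys = {normalize(form) for form in forms}
print(keys)
```

Even this toy version hints at why the problem is not so simple: "damenlederhose" already has two valid splits, and the sketch ignores linking elements (such as the "s" in "Arbeitszimmer") as well as splits that are valid in the dictionary but wrong in meaning.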
And you thought all this worrying about subject identifiers was just another intellectual pose! 😉
There are serious people, spending serious money in the quest to make even more money, who worry about subject identification. They don’t discuss it in topic map terms, but it is subject identity nonetheless.
This post should get you started on some issues with German.
What other languages/scripts have the same or similar issues? Are the solutions here extensible to them, or are new solutions needed?
Pointers to similar resources most welcome!