Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 25, 2013

Duplicate News Story Detection Revisited

Filed under: Deduplication,Duplicates,News,Reporting — Patrick Durusau @ 5:34 pm

Duplicate News Story Detection Revisited by Omar Alonso, Dennis Fetterly, and Mark Manasse.

Abstract:

In this paper, we investigate near-duplicate detection, particularly looking at the detection of evolving news stories. These stories often consist primarily of syndicated information, with local replacement of headlines, captions, and the addition of locally-relevant content. By detecting near-duplicates, we can offer users only those stories with content materially different from previously-viewed versions of the story. We expand on previous work and improve the performance of near-duplicate document detection by weighting the phrases in a sliding window based on the term frequency within the document of terms in that window and inverse document frequency of those phrases. We experiment on a subset of a publicly available web collection that is comprised solely of documents from news web sites. News articles are particularly challenging due to the prevalence of syndicated articles, where very similar articles are run with different headlines and surrounded by different HTML markup and site templates. We evaluate these algorithmic weightings using human judgments to determine similarity. We find that our techniques outperform the state of the art with statistical significance and are more discriminating when faced with a diverse collection of documents.
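To make the weighting idea in the abstract concrete, here is a minimal sketch in Python. The abstract does not give the exact formula, so everything specific here is my assumption rather than the authors' method: the weight (summed within-document term frequency of the words in a window, times a smoothed inverse document frequency of the phrase), the weighted Jaccard comparison, and the function names `shingles`, `phrase_weights`, `weighted_jaccard`, and `similarity`.

```python
import math
from collections import Counter


def shingles(text, k=3):
    """Return the k-word phrases seen by a sliding window over the text."""
    words = text.lower().split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


def phrase_weights(text, doc_freq, n_docs, k=3):
    """Weight each phrase by TF of its terms times IDF of the phrase.

    Assumption: the paper's real weighting may differ; this only
    illustrates the general shape described in the abstract.
    """
    words = text.lower().split()
    tf = Counter(words)
    weights = {}
    for phrase in shingles(text, k):
        # Term frequency of the words in this window, normalized by length.
        tf_part = sum(tf[w] for w in phrase) / len(words)
        # Smoothed IDF of the whole phrase across the collection.
        idf_part = math.log((1 + n_docs) / (1 + doc_freq.get(phrase, 0))) + 1
        weights[phrase] = tf_part * idf_part
    return weights


def weighted_jaccard(a, b):
    """Sum of per-phrase minima over sum of per-phrase maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    den = sum(max(a.get(p, 0.0), b.get(p, 0.0)) for p in keys)
    return num / den if den else 0.0


def similarity(doc_a, doc_b, corpus, k=3):
    """Compare two documents using phrase statistics from a corpus."""
    doc_freq = Counter()
    for doc in corpus:
        # Count each phrase once per document.
        doc_freq.update(set(shingles(doc, k)))
    wa = phrase_weights(doc_a, doc_freq, len(corpus), k)
    wb = phrase_weights(doc_b, doc_freq, len(corpus), k)
    return weighted_jaccard(wa, wb)
```

Two versions of a syndicated story that differ only in the headline should score well above two unrelated stories, which score zero when they share no phrases. A production system would more likely sample the weighted phrases into compact sketches than compare full phrase sets.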

Detecting duplicates or near-duplicates of subjects (such as news stories) is part and parcel of a topic maps toolkit.

What I found curious about this paper was the definition of “content” to mean the news story and not online comments as well.

That’s a rather limited view of near-duplicate content. And it has a pernicious impact.

If a story quotes a lead paragraph or two from a New York Times story, comments may be made at the “near-duplicate” site, not the New York Times.

How much of a problem is that? When was the last time you saw a comment that was not in English in the New York Times?

Answer: You have very likely never seen one. The Times’ comment policy says:

“If you are writing a comment, please be thoughtful, civil and articulate. In the vast majority of cases, we only accept comments written in English; foreign language comments will be rejected.” (Comments & Readers’ Reviews)

If a story appears in the New York Times and “near-duplicates” of it appear in Arizona, Italy, and Sudan, each with its own comments, then under the authors’ approach you will never get the opportunity to see those comments.

That’s replacing American Exceptionalism with American Myopia.

Doesn’t sound like a winning solution to me.

I first saw this at Full Text Reports as Duplicate News Story Detection Revisited.
