Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 5, 2013

A Newspaper Clipping Service with Cascading

Filed under: Authoring Topic Maps,Cascading,Data Mining,News — Patrick Durusau @ 5:34 am

A Newspaper Clipping Service with Cascading by Sujit Pal.

From the post:

This post describes a possible implementation for an automated Newspaper Clipping Service. The end-user is a researcher (or team of researchers) in a particular discipline who registers an interest in a set of topics (or web-pages). An assistant (or team of assistants) then scour information sources to find more documents of interest to the researcher based on these topics identified. In this particular case, the information sources were limited to a set of “approved” newspapers, hence the name “Newspaper Clipping Service”. The goal is to replace the assistants with an automated system.

The solution I came up with was to analyze the original web pages and treat keywords extracted out of these pages as topics, then for each keyword, query a popular search engine and gather the top 10 results from each query. The search engine can be customized so the sites it looks at is restricted by the list of approved newspapers. Finally the URLs of the results are aggregated together, and only URLs which were returned by more than 1 keyword topic are given back to the user.

The entire flow can be thought of as a series of Hadoop Map-Reduce jobs, to first download, extract and count keywords from (web pages corresponding to) URLs, and then to extract and count search result URLs from the keywords. I’ve been wanting to play with Cascading for a while, and this seemed like a good candidate, so the solution is implemented with Cascading.
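To make the final aggregation step in the quoted description concrete, here is a minimal sketch in plain Python. It is not Sujit Pal's Cascading code. The names clip, fake_search and FAKE_INDEX are hypothetical, the search engine is replaced by an in-memory stand-in, and keyword extraction is assumed to have already happened.

```python
from collections import Counter
from typing import Callable, Iterable, List


def clip(keywords: Iterable[str],
         search: Callable[[str], List[str]],
         top_n: int = 10,
         min_keywords: int = 2) -> List[str]:
    """Return URLs that appear in the top results of at least
    `min_keywords` distinct keyword queries, sorted alphabetically."""
    hits = Counter()
    for keyword in set(keywords):
        # Keep only the top N results for this keyword; set() ensures a
        # URL repeated within one result list is counted once.
        for url in set(search(keyword)[:top_n]):
            hits[url] += 1
    return sorted(url for url, count in hits.items() if count >= min_keywords)


# Hypothetical stand-in for a search engine restricted to approved newspapers.
FAKE_INDEX = {
    "hadoop": ["http://a.example/1", "http://b.example/2"],
    "cascading": ["http://b.example/2", "http://c.example/3"],
    "mapreduce": ["http://c.example/3", "http://b.example/2"],
}


def fake_search(keyword: str) -> List[str]:
    """Look up canned results; a real service would call a search API here."""
    return FAKE_INDEX.get(keyword, [])


result = clip(FAKE_INDEX.keys(), fake_search)
print(result)
```

In this toy data only the URLs returned for more than one keyword survive, which mirrors the "more than 1 keyword topic" filter in the description. In a Hadoop or Cascading setting the same count-and-filter would presumably be a group-by on URL followed by a count, but that mapping is an assumption, not something taken from the original post.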

Hmmm, but an “automated system” leaves the user to sort, create associations, etc., for themselves.

Assistants with such a “clipping service” could curate the clippings by creating associations with other materials and adding non-obvious but useful connections.

Think of the front page of the New York Times as an interface to curated content behind the stories that appear on it, where “home” is the article on the front page.

Behind each article would be not only more prose but a web of connections to material you might not even know existed.

For example, in Beijing Flaunts Cross-Border Clout in Search for Drug Lord by Jane Perlez and Bree Feng (NYT) we learn that:

Under Lao norms, law enforcement activity is not done after dark. (Liu Yuejin, leader of the antinarcotics bureau of the Ministry of Public Security)

Could be important information, depending upon your reasons for being in Laos.
