Trouble at the text mine « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 23, 2012

Trouble at the text mine

Filed under: Data Mining,Search Engines,Searching — Patrick Durusau @ 7:24 pm

Trouble at the text mine by Richard Van Noorden.

From the post:

When he was a keen young biology graduate student in 2006, Max Haeussler wrote a computer program that would scan, or ‘crawl’, plain text and pull out any DNA sequences. To test his invention, the naive text-miner downloaded around 20,000 research papers that his institution had paid to access — and promptly found his IP address blocked by the papers’ publisher.

It was not until 2009 that Haeussler, then at the University of Manchester, UK, and now at the University of California, Santa Cruz, returned to the project in earnest. He had come to realize that standard site licences do not permit systematic downloads, because publishers fear wholesale theft of their content. So Haeussler began asking for licensing terms to crawl and text-mine articles. His goal was to serve science: his program is a key part of the text2genome project, which aims to use DNA sequences in research papers to link the publications to an online record of the human genome. This could produce an annotated genome map linked to millions of research articles, so that biologists browsing a genomic region could immediately click through to any relevant papers.

But Haeussler and his text2genome colleague Casey Bergman, a genomicist at the University of Manchester, have spent more than two years trying to agree terms with publishers — and often being ignored or rebuffed. “We’ve learned it’s a long, hard road with every journal,” says Bergman.
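As an aside for readers who have not written a text-miner: the kind of program the excerpt describes, one that scans plain text and pulls out DNA sequences, can be sketched in a few lines. This is only an illustration and not the text2genome code. The minimum run length of 20 letters and the names `MIN_LENGTH`, `DNA_RUN`, and `find_dna_sequences` are my assumptions, chosen so that ordinary English words are not mistaken for sequences.

```python
import re

# Assumption: treat any unbroken run of at least MIN_LENGTH letters drawn
# from A, C, G, T as a candidate DNA sequence. The threshold is illustrative.
MIN_LENGTH = 20
DNA_RUN = re.compile(r"[ACGT]{%d,}" % MIN_LENGTH, re.IGNORECASE)


def find_dna_sequences(text):
    """Return every candidate DNA sequence in text, upper-cased, in order."""
    return [match.group(0).upper() for match in DNA_RUN.finditer(text)]


if __name__ == "__main__":
    sample = "The primer ACGTACGTACGTACGTACGTAC was used; ACGT is too short."
    print(find_dna_sequences(sample))
```

A real miner would also have to cope with sequences broken across lines or interrupted by spaces and digits, which this sketch ignores.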

What Haeussler and Bergman don’t seem to “get” is that publishers have no interest in advancing science. Their sole goal is to profit from the content they have published. (I am not going to argue right or wrong; I am simply trying to call out the positions in question.)

The question that Haeussler and Bergman should answer for publishers is this one: What is in this “indexing” for the publishers?

I suspect one acceptable answer would run along the lines of:

The full content of articles cannot be reconstructed from the indexes. The largest block of content delivered will be the article abstract, along with bibliographic reference data.
Pointers to the articles will lead to the publisher’s content site, to other commercial content providers that carry the publisher’s content, or to both.
The publisher’s designated journal logo (of some specified size) will appear with every reported citation.
The indexed content will be provided to the publishers at no charge.
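To make the terms above concrete, here is a rough sketch of what a single index entry might carry under them. Everything in it is hypothetical: the `IndexEntry` type, the `build_index_entry` helper, and the dictionary keys are names I made up for illustration, not anything text2genome or a publisher has specified. The point is only that the entry holds the abstract, the citation, a pointer back to the publisher, and the logo, and never the article body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One index record (hypothetical shape): enough to find the article, not to rebuild it."""
    citation: str      # bibliographic reference data
    abstract: str      # the largest block of content delivered
    article_url: str   # points at the publisher's (or a licensed provider's) site
    logo_url: str      # the publisher's designated journal logo


def build_index_entry(article: dict) -> IndexEntry:
    """Copy only the agreed fields; the full text is deliberately never read."""
    return IndexEntry(
        citation=article["citation"],
        abstract=article["abstract"],
        article_url=article["publisher_url"],
        logo_url=article["journal_logo_url"],
    )
```

Because the helper never touches the article body, nothing downstream of the index can reconstruct the full content from it.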

Does this mean that publishers will benefit from allowing the indexing of their content? Yes. Next question.
