Data enhancing the Royal Society of…

Data enhancing the Royal Society of Chemistry publication archive by Antony Williams.

Abstract:

The Royal Society of Chemistry has an archive of hundreds of thousands of published articles containing various types of chemistry related data – compounds, reactions, property data, spectral data etc. RSC has a vision of extracting as much of these data as possible and providing access via ChemSpider and its related projects. To this end we have applied a combination of text-mining extraction, image conversion and chemical validation and standardization approaches. The outcome of this project will result in new chemistry related data being added to our chemical and reaction databases and in the ability to more tightly couple web-based versions of the articles with these extracted data. The ability to search across the archive will be enhanced as a result. This presentation will report on our progress in this data extraction project and discuss how we will ultimately use similar approaches in our publishing pipeline to enhance article markup for new publications.

The data mining Antony details for the Royal Society of Chemistry archive is impressive!

But as Antony notes at slide #30, it isn’t a long-term solution:

We should NOT be mining data out of future publications (emphasis added)

I would say the same thing for metadata/subject identities in data. For some data and some subjects, we can, after the fact, reconstruct properties to identify the subjects they represent.

Data/text mining would be more accurate and easier if subjects were identified at the time of authoring, perhaps even automatically or at least subject to a user’s approval.

That would be more accurate than having researchers, removed from an author by time, distance and even profession, try to guess what subject the author may have meant.
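To make the contrast concrete, here is a minimal sketch of what identification at authoring time might look like. Everything in it is my assumption for illustration: the `[[surface text|identifier]]` markup, the `example:` identifiers, and the function names are hypothetical and do not come from the RSC project or the slides. The point is only that once the author has recorded the subject, "mining" becomes a lookup rather than a guess.

```python
import re

# Hypothetical inline markup: [[surface text|subject identifier]].
# Neither the syntax nor the identifiers come from the RSC project.
ANNOTATION = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]")


def extract_subjects(text):
    """Return (surface text, identifier) pairs recorded by the author."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(text)]


def render(text):
    """Strip the markup, leaving the prose a reader would see."""
    return ANNOTATION.sub(lambda m: m.group(1), text)


# The author writes abbreviations but records what each one identifies.
authored = "We dissolved [[ASA|example:aspirin]] in [[EtOH|example:ethanol]]."

print(render(authored))
print(extract_subjects(authored))
```

A later reader of the rendered text alone would have to infer what "ASA" and "EtOH" stand for; with the annotations kept alongside the text, no inference is needed.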

Better semantic authoring support now will reduce the cost and improve the accuracy of data mining in the future.
