What is Scholarly HTML? by Robin Berjon and Sébastien Ballesteros.
Abstract:
Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.
The abstract is accurate enough but the “Motivation” section provides a better sense of this project:
Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.
According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.
This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.
To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «schema.org» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.
Our high-level goals were:
- Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
- Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
- Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
- Entirely built on top of open, royalty-free standards.
- Long-term viability as a data format.
Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.
An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.
Where semantic modelling is concerned, our approach is to stick as much as possible to schema.org. Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.
In a small, circumscribed number of cases we have had to depart from schema.org, using the https://ns.science.ai/
(prefixed with sa:
) vocabulary instead (Science.AI, 2015a). Our goal is to work with schema.org in order to extend their vocabulary, and we will align our usage with the outcome of these discussions.
I especially enjoyed the observation:
According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.
I don’t doubt the truth of that story but after all, a large number of people are interested in baking cupcakes. Not more than three in many cases, are interested in reading any particular academic paper.
The use of schema.org will provide advantages for common concepts but to be truly useful for scholarly writing, it will require serious extension.
Take for example my post yesterday Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]. What microdata from schema.org would help readers find Propositionalisation and Aggregates, 2001, which describes substantially the same technique, without claims of surpassing human intuition? (Uncited by the authors the paper on deep feature synthesis.)
Or the 161 papers on propositionalisation that you can find at CiteSeer?
A crude classification that can be used by search engines is very useful but falls far short of the mark in terms of finding and retrieving scholarly writing.
Semantic uniformity for classifying scholarly content hasn’t been reached by scholars or librarians despite centuries of effort. Rather than taking up that Sisyphean task, let’s map across the ever increasing universe of semantic diversity.