rOpenSci


From the webpage:

At rOpenSci we are creating packages that allow access to data repositories through the R statistical programming environment that is already a familiar part of the workflow of many scientists. We hope that our tools will not only facilitate drawing data into an environment where it can readily be manipulated, but also one in which those analyses and methods can be easily shared, replicated, and extended by other researchers. While all the pieces for connecting researchers with these data sources exist as disparate entities, our efforts will provide a unified framework that will quickly connect researchers to open data.

More than twenty (20) R packages are available today!

Great for data mining your favorite science data repository, but that isn’t the only reason I mention them.

One of the issues for topic maps has always been how to produce the grist for a topic map mill. There is a lot of data and production isn’t a thrilling task. 😉

But what if we could automate that production, at least to a degree?

The search functions in Treebase offer several examples of where auto-generation of semantics would benefit both the data set and its potential users.

In “Treebase: An R package for discovery, access and manipulation of online phylogenies,” Carl Boettiger and Duncan Temple Lang point out that Treebase has search functions for “author” and “subject.”

Err, but Dublin Core 1.1 refers to authors as “creators.” And “subject,” for Treebase means: “Matches in the subject terms.”

The ACM would say “keywords,” as would many others, instead of “subject.”

Not a great semantic speed bump,* but one that, if left unnoticed, will result in poorer, not richer, search results.

What if for an R package like Treebase, a user could request what is identified by a field?

That is, in addition to the fields being returned, one or more key/value pairs are returned for each field, defining what that field identifies.

For example, for “author” an --iden switch could return:

Author Semantics
Creator http://purl.org/dc/elements/1.1/creator
Author/Creator http://catalog.loc.gov/help/author-keyword.htm

and so on, perhaps even including identifiers in other languages.

While this step only addresses identifying what a field identifies, it would be a first step towards documenting identifiers that could be used over and over again to improve access to scientific data.

Future changes, and we know there will be future changes, are accommodated by simply appending to the currently documented identifiers.

Document identifier mappings once, reuse identifier mappings many times.
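
To make the idea concrete, here is a minimal sketch in Python (not R, and not part of the treebase package). The registry `FIELD_IDENTIFIERS` and the functions `identify` and `add_identifier` are names I made up for illustration; the only data in it is the pair of “author” identifiers listed above. It shows a lookup that returns the key/value pairs for a field and an append that leaves existing mappings untouched.

```python
# Hypothetical registry: field name -> list of (label, identifier) pairs.
# Only the two "author" entries from the example above are included.
FIELD_IDENTIFIERS = {
    "author": [
        ("Creator", "http://purl.org/dc/elements/1.1/creator"),
        ("Author/Creator", "http://catalog.loc.gov/help/author-keyword.htm"),
    ],
}


def identify(field):
    """Return a copy of the documented (label, identifier) pairs for a field."""
    return list(FIELD_IDENTIFIERS.get(field.lower(), []))


def add_identifier(field, label, identifier):
    """Append a new mapping; existing mappings are never rewritten."""
    pairs = FIELD_IDENTIFIERS.setdefault(field.lower(), [])
    if (label, identifier) not in pairs:
        pairs.append((label, identifier))


# Roughly what an --iden style request for "author" might print.
for label, identifier in identify("author"):
    print(f"{label}\t{identifier}")
```

Because additions are append-only, anything that relied on an earlier identifier keeps working when a new one is documented.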

PS: The mapping I suggest above is a blind mapping; no information is given about “why” I thought the alternatives given were alternatives to the main entry “author.”

Blind mappings are sufficient for many cases but are terribly insufficient for others. Biological taxonomies, for example, do change and capturing what characteristics underlie a particular mapping may be important in terms of looking forwards or backwards from some point in time in the development of a taxonomy.
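
The difference between a blind mapping and a documented one can be sketched as a record with an optional “why” and an optional period of validity. This is only an illustration in Python: the `Mapping` class, the `mappings_on` helper, the `urn:example:` identifiers, the basis strings, and the dates are all placeholders I invented, not real taxonomic data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    field: str
    identifier: str
    basis: Optional[str] = None        # why the mapping holds; None means blind
    valid_from: Optional[date] = None  # None means no known start
    valid_to: Optional[date] = None    # None means still current

    def is_blind(self):
        return self.basis is None

    def holds_on(self, day):
        """True if the mapping was in force on the given day."""
        after_start = self.valid_from is None or day >= self.valid_from
        before_end = self.valid_to is None or day <= self.valid_to
        return after_start and before_end


# Placeholder data: one blind mapping and two documented, dated ones.
mappings = [
    Mapping("author", "http://purl.org/dc/elements/1.1/creator"),
    Mapping("taxon", "urn:example:taxon:old-name",
            basis="placeholder: grouped by shared characteristics",
            valid_to=date(2000, 12, 31)),
    Mapping("taxon", "urn:example:taxon:new-name",
            basis="placeholder: regrouped after a revision",
            valid_from=date(2001, 1, 1)),
]


def mappings_on(field, day):
    """Mappings for a field as they stood on a given day."""
    return [m for m in mappings if m.field == field and m.holds_on(day)]
```

Asking `mappings_on("taxon", some_date)` then looks backwards or forwards from a point in time, and the `basis` text records which characteristics justified the mapping at that point.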

* I note for your amusement that Wikipedia offers “vertical deflection traffic calming devices,” as a class that includes “speed bump, speed hump, speed cushion, and speed table.”

Like many Library of Congress subject headings, “vertical deflection traffic calming devices” doesn’t really jump to mind when doing a search for “speed bump.” 😉
