Archive for the ‘Biodiversity’ Category

The challenge of combining 176 x #otherpeoplesdata…

Wednesday, June 10th, 2015

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster , Rich FitzJohn , Remko Duursma , Diego Barneche .

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database ^1.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice but it does mean that we cannot use the original names as recorded as search terms into literature that existed at the time of the original observations.

Normalization of data often leads to loss of information. Not necessarily but often does.

I first saw this in a tweet by Dr. Mike Whitfield.

Bringing biodiversity information to life

Thursday, December 11th, 2014

Bringing biodiversity information to life

From the post:

The inaugural GBIF Ebbe Nielsen Challenge aims to inspire scientists, informaticians, data modelers, cartographers and other experts to create innovative applications of open-access biodiversity data.


For the past 12 years, GBIF has awarded the Ebbe Nielsen Prize to recognize outstanding contributions to biodiversity informatics while honouring the legacy of Ebbe Nielsen, one of the principal founders of GBIF, who tragically died just before it came into being.

The Science Committee, working with the Secretariat, has revamped the award for 2015 as the GBIF Ebbe Nielsen Challenge. This open incentive competition seeks to encourage innovative uses of the more than half a billion species occurrence records mobilized through GBIF’s international network. These creative applications of GBIF-mediated data may come in a wide variety of forms and formats—new analytical research, richer policy-relevant visualizations, web and mobile applications, improvements to processes around data digitization, quality and access, or something else entirely. Judges will evaulate submissions on their innovation, functionality and applicability.

As a simple point of departure, participants may wish to review the visual analyses of trends in mobilizing species occurrence data at global and national scales recently unveiled on Challenge submissions may build on such creations and propose uses or extensions that make GBIF-mediated data even more useful to researchers, policymakers, educators, students and citizens alike.

A jury composed of experts from the biodiversity informatics community will judge the Round One entries collected through this ChallengePost website on their innovation, functionality and applicability, before selecting three to six finalists to compete for a €20,000 First Prize later in 2015.

You can’t argue with the judging criteria:


How novel is the submission? A significant portion of the submission should be developed for the challenge. A submission based largely (or entirely) on work published or developed prior to the challenge start date will not be eligible for submission.


Does the submission work and show or do something useful?


Can the GBIF and biodiversity informatics communities use and/or build on the submission?

Deadline: Tuesday, 3 March 2015 at 5pm CET.

An obvious opportunity to introduce the biodiversity community to topic maps!

Oh, there is a €20,000 first prize and €5,000 second prize. Just something to pique your interest. 😉

Publishing biodiversity data directly from GitHub to GBIF

Saturday, March 15th, 2014

Publishing biodiversity data directly from GitHub to GBIF by Roderic D. M. Page.

From the post:

Today I managed to publish some data from a GitHub repository directly to GBIF. Within a few minutes (and with Tim Robertson on hand via Skype to debug a few glitches) the data was automatically indexed by GBIF and its maps updated. You can see the data I uploaded here.

In case you don’t know about GBIF (I didn’t):

The Global Biodiversity Information Facility (GBIF) is an international open data infrastructure, funded by governments.

It allows anyone, anywhere to access data about all types of life on Earth, shared across national boundaries via the Internet.

By encouraging and helping institutions to publish data according to common standards, GBIF enables research not possible before, and informs better decisions to conserve and sustainably use the biological resources of the planet.

GBIF operates through a network of nodes, coordinating the biodiversity information facilities of Participant countries and organizations, collaborating with each other and the Secretariat to share skills, experiences and technical capacity.

GBIF’s vision: “A world in which biodiversity information is freely and universally available for science, society and a sustainable future.”

Roderic summarizes his post saying:

what I’m doing here is putting data on GitHub and having GBIF harvest that data directly from GitHub. This means I can edit the data, rebuild the Darwin Core Archive file, push it to GitHub, and GBIF will reindex it and update the data on the GBIF portal.

The process isn’t perfect but unlike disciplines where data sharing is the exception rather than the rule, the biodiversity community is trying to improve its sharing of data.

Every attempt at improvement will not succeed but lessons are learned from every attempt.

Kudos to the biodiversity community for a model that other communities should follow!

Biodiversity Information Standards

Tuesday, March 4th, 2014

Biodiversity Information Standards

From the webpage:

The most widely deployed formats for biodiversity occurrence data are Darwin Core (wiki) and ABCD (wiki).

The TDWG community’s priority is the deployment of Life Science Identifiers (LSID), the preferred Globally Unique Identifier technology and transitioning to RDF encoded metadata as defined by a set of simple vocabularies. All new projects should address the need for tagging their data with LSIDs and consider the use or development of appropriate vocabularies.

TDWG’s activities within the biodiversity informatics domain can be found in the Activities section of this website.

TDWG = Taxonomic Database Working Group.

I originally followed a link on “Darwin Core,” which sounded to much like another “D***** Core” to not check the reference.

Net result is two of the most popular formats used for biodiversity data.

ZooKeys 50 (2010) Special Issue

Wednesday, January 29th, 2014

Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research by Lyubomir Penev, et. al.

From the editorial:

The principles of Open Access greatly facilitate dissemination of information through the Web where it is freely accessed, shared and updated in a form that is accessible to indexing and data mining engines using Web 2.0 technologies. Web 2.0 turns the taxonomic information into a global resource well beyond the taxonomic community. A significant bottleneck in naming species is the requirement by the current Codes of biological nomenclature ruling that new names and their associated descriptions must be published on paper, which can be slow, costly and render the new information difficult to find. In order to make progress in documenting the diversity of life, we must remove the publishing impediment in order to move taxonomy “from a cottage industry into a production line” (Lane et al. 2008), and to make best use of new technologies warranting the fastest and widest distribution of these new results.

In this special edition of ZooKeys we present a practical demonstration of such a process. The issue opens with a forum paper from Penev et al. (doi: 10.3897/zookeys.50.538) that presents the landscape of semantic tagging and text enhancements in taxonomy. It describes how the content of the manuscript is enriched by semantic tagging and marking up of four exemplar papers submitted to the publisher in three different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (Stoev et al., doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads (Blagoderov et al., doi: 10.3897/zookeys.50.506 and Brake and Tschirnhaus, doi: 10.3897/zookeys.50.505); (iii) generated from an author’s database (Taekul et al., doi: 10.3897/zookeys.50.485). The latter two were submitted as XML-tagged manuscript. These examples demonstrate the suitability of the workflow to a range of possibilities that should encompass most current taxonomic efforts. To implement the aforementioned routes for XML mark up in prospective taxonomic publishing, a special software tool (Pensoft Mark Up Tool, PMT) was developed and its features were demonstrated in the current issue. The XML schema used was version #123 of TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine (NLM) (

A second forum paper from Blagoderov et al. (doi: 10.3897/zookeys.50.539) sets out a workflow that describes the assembly of elements from a Scratchpad taxon page ( to export a structured XML file. The publisher receives the submission, automatically renders the file into the journal‘s layout style as a PDF and transmits it to a selection of referees, based on the key words in the manuscript and the publisher’s database. Several steps, from the author’s decision to submit the manuscript to final publication and dissemination, are automatic. A journal editor first spends time on the submission when the referees’ reports are received, making the decision to publish, modify or reject the manuscript. If the decision is to publish, then PDF proofs are sent back to the author and, when verified, the paper is published both on paper and on-line, in PDF, HTML and XML formats. The original information is also preserved on the original Scratchpad where it may, in due course, be updated. A visitor arriving at the web site by tracing the original publication will be able to jump forward to the current version of the taxon page.

This sounds like the promise of SGML/XML made real doesn’t it?

See the rest of the editorial or ZooKeys 50 for a very good example of XML and semantics in action.

This is a long way from the “related” or “recent” article citations in most publisher interfaces. Thoughts on how to make that change?

A Semantic Web Example? Nearly a Topic Map?

Wednesday, January 29th, 2014

Morphological and Geographical Traits of the British Odonata by Gary D Powney, el. al.


Trait data are fundamental for many aspects of ecological research, particularly for modeling species response to environmental change. We synthesised information from the literature (mainly field guides) and direct measurements from museum specimens, providing a comprehensive dataset of 26 attributes, covering the 43 resident species of Odonata in Britain. Traits included in this database range from morphological traits (e.g. body length) to attributes based on the distribution of the species (e.g. climatic restriction). We measured 11 morphometric traits from five adult males and five adult females per species. Using digital callipers, these measurements were taken from dry museum specimens, all of which were wild caught individuals. Repeated measures were also taken to estimate measurement error. The trait data are stored in an online repository (, alongside R code designed to give an overview of the morphometric data, and to combine the morphometric data to the single value per trait per species data.

A great example of publishing data along with software to manipulate it.

I mention it here because the publisher, Pensoft, references the Semantic Web saying:

The Semantic Web could also be called a “linked Web” because most semantic enhancements are in fact provided through various kinds of links to external resources. The results of these linkages will be visualized in the HTML versions of the published papers through various cross-links within the text and more particularly through the Pensoft Taxon Profile (PTP) ( PTP is a web-based harvester that automatically links any taxon name mentioned within a text to external sources and creates a dynamic web-page for that taxon. PTP saves readers a great amount of time and effort by gathering for them the relevant information on a taxon from leading biodiversity sources in real time.

A substantial feature of the semantic Web is open data publishing, where not only analysed results, but original datasets can be published as citeable items so that the data authors may receive academic dredit for their efforts. For more information, please visit our detailed Data Publishing Policies and Guidelines for Biodiversity Data.

When you view the article, you will find related resources displayed next to the article. A lot of related resources.

Of course it remains to every reader to assemble data across varying semantics but this is definitely a step in the right direction.


I first saw this in a tweet by S.K. Morgan Ernest.

Biodiversity Information Serving Our Nation (BISON)

Friday, January 24th, 2014

Biodiversity Information Serving Our Nation (BISON)

From the about tab:

Researchers collect species occurrence data, records of an organism at a particular time in a particular place, as a primary or ancillary function of many biological field investigations. Presently, these data reside in numerous distributed systems and formats (including publications) and are consequently not being used to their full potential. As a step toward addressing this challenge, the Core Science Analytics and Synthesis (CSAS) program of the US Geological Survey (USGS) is developing Biodiversity Information Serving Our Nation (BISON), an integrated and permanent resource for biological occurrence data from the United States.

BISON will leverage the accumulated human and infrastructural resources of the long-term USGS investment in research and information management and delivery.

If that sounds impressive, consider the BISON statistics as of December 31, 2013:

Total Records: 126,357,352
Georeferenced: 120,394,780
Taxa: 315,663
Data Providers: 307

Searches are by scientific or common name and ITIS enabled searching is on by default. Just in case you are curious:

BISON has integrated taxonomic information provided by the Integrated Taxonomic Information System (ITIS) allowing advanced search capability in BISON. With the integration, BISON users have the ability to search more completely across species records. Searches can now include all synonyms and can be conducted hierarchically by genera and higher taxa levels using ITIS enabled queries. Binding taxonomic structure to search terms will make possible broad searches on species groups such as Salmonidae (salmon, trout, char) or Passeriformes (cardinals, tanagers, etc) as well as on all of the many synonyms and included taxa (there are 60 for Poa pratensis – Kentucky Bluegrass – alone).

Clue: With sixty (60) names, the breakfast of champions since 1875.

I wonder if Watson would have answered: “What is Kentucky Bluegrass?” on Jeopardy. The first Kentucky Derby was run on May 17, 1875.

BISON also offers developer tools and BISON Web Services.

Data sharing, OpenTree and GoLife

Monday, January 20th, 2014

Data sharing, OpenTree and GoLife

From the post:

NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL. From the GoLife text:

The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:

Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life (, ARBOR (, and Next Generation Phenomics ( is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc). (I corrected the URLs for ARBOR and Next Generation Phenomics)

What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, a publication coming soon (we will update this post when it comes out) and we will be releasing our new curation / validation tool for phylogenetic data in the next few weeks.

A great resource on the NSF GoLife proposal that I just posted about.

Some other references:

AToL – Assembling the Tree of Life

AVATOL – Assembling, Visualizing and Analyzing the Tree of Life

Be sure to contact the Open Tree of Life group if you are interested in the GoLife project.

Global Biodiversity Information Facility

Wednesday, October 9th, 2013

Global Biodiversity Information Facility

Some stats:

417,165,184 occurrences

1,426,888 species

11,976 data sets

578 data publishers

What lies at the technical heart of this beast?

Would you believe a PostgreSQL database and an embedded Apache SOLR index?

Start with the Summary of the GBIF infrastructure. The details on PostgreSQL and Solr are under the Registry tab.

BTW, the system recognizes multiple identification systems and more are to be added.

Need to read more of the documents on that part of the system.

Microbial Life Database

Tuesday, July 23rd, 2013

Microbial Life Database

From the webpage:

The Microbial Life Database (MLD) is a project under continuos development to visualize the ecological, physiological and morphological diversity of microbial life. A database is being constructed including data for nearly 600 well-known prokaryote genera mostly described in Bergey’s Manual of Determinative Bacteriology and published by the Bergey’s Trust. Correction and additions come from many other sources. The database is divided by genera but we are working on a version by species. This is the current database v02 in Google Spreadsheets format. Below is a bubble chart of the number of species included in each of the microbial groups. You can click the graph to go to an interactive version with more details. If you want to contribute to this database please send an email to Prof. Abel Mendez.

I don’t have any immediate need for this data set but it is the sort of project where semantic reefs are found. 😉

Data Visualization: Exploring Biodiversity

Saturday, May 25th, 2013

Data Visualization: Exploring Biodiversity by Sean Gonzalez.

From the post:

When you have a few hundred years worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don’t perfectly align over the years, and in some cases there is outright conflicting information. This data is important, it is our civilization’s best minds giving their all to capture and record the biological diversity of our planet. Unfortunately, as it stands today, if you or I were to decide we wanted to learn more, or if we wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career. Earlier this year an executive order was given which generally stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, event though it didn’t necessarily directly apply to them, and has embarked to make their treasure of information more easily accessible. This is a laudable goal, but how do we actually go about accomplishing this? Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever increasing sources. However, we know this curation challenge can not be met by simply wrapping the data with a single structure or taxonomy. When we search and explore the BHL data we may not know precisely what we’re looking for, and we don’t want a scavenger hunt to ensue where we’re forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out…

People see relationships in the data differently, so when we go exploring one person may do better with a tree structure, others prefer a classic title/subject style search, or we may be interested in reference types and frequencies. Why we don’t think about it as one monolithic system is akin to discussing the number of Angels that fit on the head of a pin, we’ll never be able to test our theories. Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

What would you do beyond visualization?

Biodiversity Heritage Library (BHL)

Thursday, March 28th, 2013

Biodiversity Heritage Library (BHL)

Best described by their own “about” page:

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts , the BHL has digitized millions of pages of taxonomic literature , representing tens of thousands of titles and over 100,000 volumes.

The published literature on biological diversity has limited global distribution; much of it is available in only a few select libraries in the developed world. These collections are of exceptional value because the domain of systematic biology depends, more than any other science, upon historic literature. Yet, this wealth of knowledge is available only to those few who can gain direct access to significant library collections. Literature about the biota existing in developing countries is often not available within their own borders. Biologists have long considered that access to the published literature is one of the chief impediments to the efficiency of research in the field. Free global access to digital literature repatriates information about the earth’s species to all parts of the world.

The BHL consortium members digitize the public domain books and journals held within their collections. To acquire additional content and promote free access to information, the BHL has obtained permission from publishers to digitize and make available significant biodiversity materials that are still under copyright.

Because of BHL’s success in digitizing a significant mass of biodiversity literature, the study of living organisms has become more efficient. The BHL Portal allows users to search the corpus by multiple access points, read the texts online, or download select pages or entire volumes as PDF files.

The BHL serves texts with information on over a million species names. Using UBio’s taxonomic name finding tools, researchers can bring together publications about species and find links to related content in the Encyclopedia of Life. Because of its commitment to open access, BHL provides a range of services and APIs which allow users to harvest source data files and reuse content for research purposes.

Since 2009, the BHL has expanded globally. The European Commission’s eContentPlus program has funded the BHL-Europe project, with 28 institutions, to assemble the European language literature. Additionally, the Chinese Academy of Sciences (BHL-China), the Atlas of Living Australia (BHL-Australia), Brazil (through BHL-SciELO) and the Bibliotheca Alexandrinahave created national or regional BHL nodes. Global nodes are organizational structures that may or may not develop their own BHL portals. It is the goal of BHL to share and serve content through the BHL Portal developed and maintained at the Missouri Botanical Garden. These projects will work together to share content, protocols, services, and digital preservation practices.

A truly remarkable effort!

Would you believe they have a copy of “Aristotle’s History of animals.” In ten books. Tr. by Richard Cresswell? For download as a PDF?

Tell me, how would you reconcile the terminology of Aristotle or of Cresswell for that matter in translation, with modern terminology both for species and their features?

In order to enable navigation from this work to other works in the collection?

Moreover, how would you preserve that navigation for others to use?

Document level granularity is better than not finding a document at all but it is a far cry from being efficient.

BHL-Europe web portal opens up…

Thursday, March 28th, 2013

BHL-Europe web portal opens up the world’s knowledge on biological diversity

From the post:

The goal of the Biodiversity Heritage Library for Europe (BHL-Europe) project is to make published biodiversity literature accessible to anyone who’s interested. The project will provide a multilingual access point (12 languages) for biodiversity content through the BHL-Europe web portal with specific biological functionalities for search and retrieval and through the EUROPEANA portal. Currently BHL-Europe involves 28 major natural history museums, botanical gardens and other cooperating institutions.

BHL-Europe is a 3 year project, funded by the European Commission under the eContentplus programme, as part of the i2010 policy.

Unlimited access to biological diversity information

The libraries of the European natural history museums and botanical gardens collectively hold the majority of the world’s published knowledge on the discovery and subsequent description of biological diversity. However, digital access to this knowledge is difficult.

The BHLproject, launched 2007 in the USA, is systematically attempting to address this problem. In May 2009 the ambitious and innovative EU project ‘Biodiversity Heritage Library for Europe’ (BHL-Europe) was launched. BHL-Europe is coordinated by the Museum für Naturkunde Berlin, Germany, and combines the efforts of 26 European and 2 American institutions. For the first time, the wider public, citizen scientists and decision makers will have unlimited access to this important source of information.

A project with enormous potential, although three (3) years seems a bit short.

Mentioned but without a link, the BHLproject has digitized over 100,000 volumes, with information on more than one million species names.