Archive for the ‘Biology’ Category

A Guide to Reproducible Code in Ecology and Evolution

Thursday, December 7th, 2017

A Guide to Reproducible Code in Ecology and Evolution by British Ecological Society.

Natalie Cooper, Natural History Museum, UK and Pen-Yuan Hsing, Durham University, UK, write in the introduction:

The way we do science is changing — data are getting bigger, analyses are getting more complex, and governments, funding agencies and the scientific method itself demand more transparency and accountability in research. One way to deal with these changes is to make our research more reproducible, especially our code.

Although most of us now write code to perform our analyses, it is often not very reproducible. We have all come back to a piece of work we have not looked at for a while and had no idea what our code was doing or which of the many “final_analysis” scripts truly was the final analysis! Unfortunately, the number of tools for reproducibility and all the jargon can leave new users feeling overwhelmed, with no idea how to start making their code more reproducible. So, we have put together this guide to help.

A Guide to Reproducible Code covers all the basic tools and information you will need to start making your code more reproducible. We focus on R and Python, but many of the tips apply to any programming language. Anna Krystalli introduces some ways to organise files on your computer and to document your workflows. Laura Graham writes about how to make your code more reproducible and readable. François Michonneau explains how to write reproducible reports. Tamora James breaks down the basics of version control. Finally, Mike Croucher describes how to archive your code. We have also included a selection of helpful tips from other scientists.

True reproducibility is really hard. But do not let this put you off. We would not expect anyone to follow all of the advice in this booklet at once. Instead, challenge yourself to add one more aspect to each of your projects. Remember, partially reproducible research is much better than completely non-reproducible research.

Good luck!
… (emphasis in original)

Not counting front and back matter, 39 pages total. A lot to grasp in one reading but if you don’t already have reproducible research habits, keep a copy of this publication on top of your desk. Yes, on top of the incoming mail, today’s newspaper, forms and chart requests from administrators, etc. On top means just that, on top.

At some future date, when the pages are too worn, creased, folded, dog eared and annotated to be read easily, reprint it and transfer your annotations to a clean copy.

I first saw this in David Smith’s The British Ecological Society’s Guide to Reproducible Science.

PS: The same rules apply to data science.

A Primer for Computational Biology

Thursday, November 9th, 2017

A Primer for Computational Biology by Shawn T. O’Neil.

From the webpage:

A Primer for Computational Biology aims to provide life scientists and students the skills necessary for research in a data-rich world. The text covers accessing and using remote servers via the command-line, writing programs and pipelines for data analysis, and provides useful vocabulary for interdisciplinary work. The book is broken into three parts:

  1. Introduction to Unix/Linux: The command-line is the “natural environment” of scientific computing, and this part covers a wide range of topics, including logging in, working with files and directories, installing programs and writing scripts, and the powerful “pipe” operator for file and data manipulation.
  2. Programming in Python: Python is both a premier language for learning and a common choice in scientific software development. This part covers the basic concepts in programming (data types, if-statements and loops, functions) via examples of DNA-sequence analysis. This part also covers more complex subjects in software development such as objects and classes, modules, and APIs.
  3. Programming in R: The R language specializes in statistical data analysis, and is also quite useful for visualizing large datasets. This third part covers the basics of R as a programming language (data types, if-statements, functions, loops and when to use them) as well as techniques for large-scale, multi-test analyses. Other topics include S3 classes and data visualization with ggplot2.

Pass along to life scientists and students.

This isn’t the primer that separates the CS material from domain-specific examples and prose. Adaptation to another domain is a question of re-writing.

I assume an adaptable primer wasn’t the author’s intention, so that isn’t a criticism but an observation: basic material is written over and over again, needlessly.

I first saw this in a tweet by Christophe Lalanne.

The Gene Hackers [Chaos Remains King]

Tuesday, November 10th, 2015

The Gene Hackers by Michael Specter.

From the post:

It didn’t take Zhang or other scientists long to realize that, if nature could turn these molecules into the genetic equivalent of a global positioning system, so could we. Researchers soon learned how to create synthetic versions of the RNA guides and program them to deliver their cargo to virtually any cell. Once the enzyme locks onto the matching DNA sequence, it can cut and paste nucleotides with the precision we have come to expect from the search-and-replace function of a word processor. “This was a finding of mind-boggling importance,” Zhang told me. “And it set off a cascade of experiments that have transformed genetic research.”

With CRISPR, scientists can change, delete, and replace genes in any animal, including us. Working mostly with mice, researchers have already deployed the tool to correct the genetic errors responsible for sickle-cell anemia, muscular dystrophy, and the fundamental defect associated with cystic fibrosis. One group has replaced a mutation that causes cataracts; another has destroyed receptors that H.I.V. uses to infiltrate our immune system.

The potential impact of CRISPR on the biosphere is equally profound. Last year, by deleting all three copies of a single wheat gene, a team led by the Chinese geneticist Gao Caixia created a strain that is fully resistant to powdery mildew, one of the world’s most pervasive blights. In September, Japanese scientists used the technique to prolong the life of tomatoes by turning off genes that control how quickly they ripen. Agricultural researchers hope that such an approach to enhancing crops will prove far less controversial than using genetically modified organisms, a process that requires technicians to introduce foreign DNA into the genes of many of the foods we eat.

The technology has also made it possible to study complicated illnesses in an entirely new way. A few well-known disorders, such as Huntington’s disease and sickle-cell anemia, are caused by defects in a single gene. But most devastating illnesses, among them diabetes, autism, Alzheimer’s, and cancer, are almost always the result of a constantly shifting dynamic that can include hundreds of genes. The best way to understand those connections has been to test them in animal models, a process of trial and error that can take years. CRISPR promises to make that process easier, more accurate, and exponentially faster.

Deeply compelling read on the stellar career of Feng Zhang and his use of “clustered regularly interspaced short palindromic repeats” (CRISPR) for genetic engineering.

If you are up for the technical side, try PubMed on CRISPR at 2,306 “hits” as of today.

If not, continue with Michael’s article. You will get enough background to realize this is a very profound moment in the development of genetic engineering.

A profound moment that can be made all the more valuable by linking its results to the results (not articles or summaries of articles) of prior research.

Proposals for repackaging data in some yet-to-be-invented format are a non-starter from my perspective. That is more akin to the EU science/WPA projects than a realistic prospect for value-add.

Let’s start with the assumption that when held in electronic format, data has its native format as a given. Nothing we can change about that part of the problem of access.

Whether labbooks, databases, triple stores, etc.

That one assumption reduces worries about corrupting the original data and introduces a sense of “tinkering” with existing data interfaces. (Watch for a post tomorrow on the importance of “tinkering.”)

Hmmm, nodes anyone?

PS: I am not overly concerned about genetic “engineering.” My money is riding on chaos in genetics and environmental factors.

The challenge of combining 176 x #otherpeoplesdata…

Wednesday, June 10th, 2015

The challenge of combining 176 x #otherpeoplesdata to create the Biomass And Allometry Database by Daniel Falster, Rich FitzJohn, Remko Duursma, Diego Barneche.

From the post:

Despite the hype around "big data", a more immediate problem facing many scientific analyses is that large-scale databases must be assembled from a collection of small independent and heterogeneous fragments — the outputs of many and isolated scientific studies conducted around the globe.

Collecting and compiling these fragments is challenging at both political and technical levels. The political challenge is to manage the carrots and sticks needed to promote sharing of data within the scientific community. The politics of data sharing have been the primary focus for debate over the last 5 years, but now that many journals and funding agencies are requiring data to be archived at the time of publication, the availability of these data fragments is increasing. But little progress has been made on the technical challenge: how can you combine a collection of independent fragments, each with its own peculiarities, into a single quality database?

Together with 92 other co-authors, we recently published the Biomass And Allometry Database (BAAD) as a data paper in the journal Ecology, combining data from 176 different scientific studies into a single unified database. We built BAAD for several reasons: i) we needed it for our own work ii) we perceived a strong need within the vegetation modelling community for such a database and iii) because it allowed us to road-test some new methods for building and maintaining a database.

Until now, every other data compilation we are aware of has been assembled in the dark. By this we mean, end-users are provided with a finished product, but remain unaware of the diverse modifications that have been made to components in assembling the unified database. Thus users have limited insight into the quality of methods used, nor are they able to build on the compilation themselves.

The approach we took with BAAD is quite different: our database is built from raw inputs using scripts; plus the entire work-flow and history of modifications is available for users to inspect, run themselves and ultimately build upon. We believe this is a better way for managing lots of #otherpeoplesdata and so below share some of the key insights from our experience.

The highlights of the project:

1. Script everything and rebuild from source

2. Establish a data-processing pipeline

  • Don’t modify raw data files
  • Encode meta-data as data, not as code
  • Establish a formal process for processing and reviewing each data set

3. Use version control (git) to track changes and code sharing website (github) for effective collaboration

4. Embrace Openness

5. A living database
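Points 1 and 2 can be illustrated with a minimal sketch. It is not the BAAD pipeline itself: the file contents, column names and scale factors below are invented for illustration. The raw file is only ever read, and the unit conversions live in a small metadata table instead of being hard-coded in the script.

```python
import csv
import io

# Hypothetical raw study file: read by the pipeline, never edited by hand.
RAW_STUDY = """tree_id,height,mass
t1,250,1200
t2,310,1850
"""

# Hypothetical per-study metadata, stored as data rather than as code.
# Each row maps a raw column to a standard variable name and a unit scale.
VARIABLE_MAP = """raw_name,std_name,scale
height,height_m,0.01
mass,mass_kg,0.001
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def build_study(raw_text, map_text):
    """Rebuild a standardised table from raw input plus its metadata table."""
    mapping = read_rows(map_text)
    rebuilt = []
    for row in read_rows(raw_text):
        out = {"tree_id": row["tree_id"]}
        for rule in mapping:
            value = float(row[rule["raw_name"]]) * float(rule["scale"])
            out[rule["std_name"]] = round(value, 6)
        rebuilt.append(out)
    return rebuilt


# Rebuilding from source is a function call, so it can be repeated at any time.
for record in build_study(RAW_STUDY, VARIABLE_MAP):
    print(record)
```

Correcting a unit error in this arrangement means editing one row of the metadata table and rebuilding, which leaves both the raw file and the script untouched.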

There was no mention of reconciliation of nomenclature for species. I checked some of the individual reports, such as Report for study: Satoo1968, which does mention:

Other variables: M.I. Ishihara, H. Utsugi, H. Tanouchi, and T. Hiura conducted formal search of reference databases and digitized raw data from Satoo (1968). Based on this reference, meta data was also created by M.I. Ishihara. Species name and family names were converted by M.I. Ishihara according to the following references: Satake Y, Hara H (1989a) Wild flower of Japan Woody plants I (in Japanese). Heibonsha, Tokyo; Satake Y, Hara H (1989b) Wild flower of Japan Woody plants II (in Japanese). Heibonsha, Tokyo. (Emphasis in original)

I haven’t surveyed all the reports but it appears that “conversion” of species and family names occurred prior to entering the data pipeline.

Not an unreasonable choice but it does mean that we cannot use the original names as recorded as search terms into literature that existed at the time of the original observations.

Normalization of data often leads to loss of information. Not necessarily, but often.

I first saw this in a tweet by Dr. Mike Whitfield.

World Register of Marine Introduced Species (WRIMS)

Tuesday, March 17th, 2015

World Register of Marine Introduced Species (WRIMS)

From the post:

WRIMS – a database of introduced and invasive alien marine species – has officially been released to the public. It includes more than 1,400 marine species worldwide, compiled through the collaboration with international initiatives and study of almost 2,500 publications.

WRIMS lists the known alien marine species worldwide, with an indication of the region in which they are considered to be alien. In addition, the database lists whether a species is reported to have ecological or economic impacts and thus considered invasive in that area. Each piece of information is linked to a source publication or a specialist database, allowing users to retrace the information or get access to the full source for more details.

Users can search for species within specific groups, and generate species lists per geographic region, thereby taking into account their origin (alien or origin unknown or uncertain) and invasiveness (invasive, of concern, uncertain …). For each region the year of introduction or first report has been documented where available. In the past, species have sometimes erroneously been labelled as ‘alien in region X’. This information is also stored in WRIMS, clearly indicating that this was an error. Keeping track of these kinds of errors or misidentifications can greatly help researchers and policy makers in dealing with alien species.

WRIMS is a subset of the World Register of Marine Species (WoRMS): the taxonomy of the species is managed by the taxonomic editor community of WoRMS, whereas the alien-related information is managed by both the taxonomic editors and the thematic editors within WRIMS. Just like its umbrella-database WoRMS, WRIMS is dynamic: a team of editors is not only keeping track of new reports of alien species, they also scan existing literature and databases to complete the general distribution range of each alien species in WRIMS.

Are there aliens in your midst? 😉

Exactly the sort of resource that if I don’t capture it now, I will never be able to find it again.


Databases of Biological Databases (yes, plural)

Tuesday, January 20th, 2015

Mick Watson points out in a tweet today that there are at least two databases of biological databases.


MetaBase is a user-contributed list of all the biological databases available on the internet. Currently there are 1,802 entries, each describing a different database. The databases are described in a semi-structured way by using templates and entries can carry various user comments and annotations (see a random entry). Entries can be searched, listed or browsed by category.

The site uses the same MediaWiki technology that powers Wikipedia, probably the best known user-contributed resource on the internet. The MediaWiki system allows users to participate on many different levels, ranging from authors and editors to curators and designers.

Database description

MetaBase aims to be a flexible, user-driven (user-created) resource for the biological database community.

The main focus of MetaBase is summarised below:

  • As a basic requirement, MB contains a list of databases, URLs and descriptions of the most commonly used biological databases currently available on the internet.
  • The system should be flexible, allowing users to contribute, update and maintain the data in different ways.
  • In the future we aim to generate more communication between the database developer and user communities.

A larger, more ambitious list of aims is given here.

The first point was achieved using data taken from the Molecular Biology Database Collection. Secondly, MetaBase has been implemented using MediaWiki. The final point will take longer, and is dependent on the community uptake of MB…

DBD – Database of Biological Databases

DBD: Database of Biological Database team are R.R. Siva Kiran, MVN Setty, Department of Biotechnology, MS Ramaiah Institute of Technology, MSR Nagar, Bangalore, India and G. Hanumantha Rao, Center for Biotechnology, Department of Chemical Engineering, Andhra University, Visakhapatnam-530003, India. DBD consists of 1200 Database entries covering wide range of databases useful for biological researchers.

Be aware that the DBD database reports its last update as 30-July-2008. I have written to confirm whether that is the correct date.

Assuming it is, has anyone validated the links in the DBD database and/or compared them to the links in MetaBase? That seems like a worthwhile service to the community.
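The comparison half of that service could start from something like the sketch below. It assumes each listing has already been exported as a plain list of URLs that include a scheme. The example.org addresses are placeholders, not entries from either database. It does not test whether a link is still live, which would need network requests.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a comparable key: lower-case host, no scheme,
    no leading "www.", no trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def compare_listings(list_a, list_b):
    """Report which normalised URLs appear in both listings or in only one."""
    keys_a = {normalise(u) for u in list_a}
    keys_b = {normalise(u) for u in list_b}
    return {
        "both": sorted(keys_a & keys_b),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Placeholder listings standing in for exports from the two databases.
listing_a = ["http://www.example.org/db1/", "http://example.org/db2"]
listing_b = ["https://example.org/db1", "http://example.org/db3"]

report = compare_listings(listing_a, listing_b)
print(report)
```

Normalising before comparing matters because the same database is often listed with and without "www." or a trailing slash.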

History & Philosophy of Computational and Genome Biology

Wednesday, December 17th, 2014

History & Philosophy of Computational and Genome Biology by Mark Boguski.

A nice collection of books and articles on computational and genome biology. It concludes with this anecdote:

Despite all of the recent books and biographies that have come out about the Human Genome Project, I think there are still many good stories to be told. One of them is the origin of the idea for whole-genome shotgun and assembly. I recall a GRRC (Genome Research Review Committee) review that took place in late 1996 or early 1997 where Jim Weber proposed a whole-genome shotgun approach. The review panel, at first, wanted to unceremoniously “NeRF” (Not Recommend for Funding) the grant but I convinced them that it deserved to be formally reviewed and scored, based on Jim’s pioneering reputation in the area of genetic polymorphism mapping and its impact on the positional cloning of human disease genes and the origins of whole-genome genotyping. After due deliberation, the GRRC gave the Weber application a non-fundable score (around 350 as I recall) largely on the basis of Weber’s inability to demonstrate that the “shotgun” data could be assembled effectively.

Some time later, I was giving a ride to Jim Weber who was in Bethesda for a meeting. He told me why his grant got a low score and asked me if I knew any computer scientists that could help him address the assembly problem. I suggested he talk with Gene Myers (I knew Gene and his interests well since, as one of the five authors of the BLAST algorithm, he was a not infrequent visitor to NCBI).

The following May, Weber and Myers submitted a “perspective” for publication in Genome Research entitled “Human whole-genome shotgun sequencing”. This article described computer simulations which showed that assembly was possible and was essentially a rebuttal to the negative review and low priority score that came out of the GRRC. The editors of Genome Research (including me at the time) sent the Weber/Myers article to Phil Green (a well-known critic of shotgun sequencing) for review. Phil’s review was extremely detailed and actually longer than the Weber/Myers paper itself! The editors convinced Phil to allow us to publish his critique entitled “Against a whole-genome shotgun” as a point-counterpoint feature alongside the Weber-Myers article in the journal.

The rest, as they say, is history, because only a short time later, Craig Venter (whose office at TIGR had requested FAX copies of both the point and counterpoint as soon as they were published) and Mike Hunkapiller announced their shotgun sequencing and assembly project and formed Celera. They hired Gene Myers to build the computational capabilities and assemble their shotgun data which was first applied to the Drosophila genome as practice for tackling a human genome which, as is now known, was Venter’s own. Three of my graduate students (Peter Kuehl, Jiong Zhang and Oxana Pickeral) and I participated in the Drosophila annotation “jamboree” (organized by Mark Adams of Celera and Gerry Rubin) working specifically on an analysis of the counterparts of human disease genes in the Drosophila genome. Other aspects of the Jamboree are described in a short book by one of the other participants, Michael Ashburner.

The same type of stories exist not only from the early days of computer science but since then as well. Stories that will capture the imaginations of potential CS majors as well as illuminate areas where computer science can or can’t be useful.

How many of those stories have you captured?

I first saw this in a tweet by Neil Saunders.

Overlap and the Tree of Life

Sunday, December 7th, 2014

I encountered a wonderful example of “overlap” in the markup sense today while reading about resolving conflicts in constructing a comprehensive tree of life.


The authors use a graph database which allows them to study various hypotheses on the resolutions of conflicts.

Their graph database, opentree-treemachine, is available on GitHub, as is the source to all the project’s software.

There’s a thought for Balisage 2015. Is the processing of overlapping markup a question of storing documents with overlapping markup in graph databases and then streaming the non-overlapping results of a query to an XML processor?

And visualizing overlapping results or alternative resolutions to overlapping results via a graph database.

The question of which overlapping syntax to use becoming a matter of convenience and the amount of information captured, as opposed to attempts to fashion syntax that cheats XML processors and/or developing new means for processing XML.

Perhaps graph databases can make overlapping markup in documents the default case just as overlap is the default case in documents (single tree documents being rare outliers).
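A toy version of that idea, to make it concrete. Everything here is an assumption made for illustration: the sample text, the node and edge layout, and the tag names. A real graph database would replace the two in-memory structures. The sentence layer and the line layer overlap each other, but a query for a single layer returns spans that nest cleanly and can be handed to an ordinary XML library.

```python
import xml.etree.ElementTree as ET

TEXT = "The tree grows. Its branches cross."

# Hypothetical graph: one text node plus span nodes for two layers.
# The second sentence starts inside the first line, so the layers overlap.
NODES = {
    "text": {"kind": "text", "value": TEXT},
    "s1": {"kind": "span", "layer": "sentence", "tag": "s", "start": 0, "end": 15},
    "s2": {"kind": "span", "layer": "sentence", "tag": "s", "start": 16, "end": 35},
    "l1": {"kind": "span", "layer": "line", "tag": "l", "start": 0, "end": 19},
    "l2": {"kind": "span", "layer": "line", "tag": "l", "start": 20, "end": 35},
}
EDGES = [(span_id, "annotates", "text") for span_id in ("s1", "s2", "l1", "l2")]


def overlaps(a, b):
    """True when two spans intersect without one containing the other."""
    intersect = a["start"] < b["end"] and b["start"] < a["end"]
    a_in_b = b["start"] <= a["start"] and a["end"] <= b["end"]
    b_in_a = a["start"] <= b["start"] and b["end"] <= a["end"]
    return intersect and not (a_in_b or b_in_a)


def spans_in_layer(layer):
    """The 'query': spans of one layer that annotate the text, in order."""
    ids = [src for src, rel, dst in EDGES if rel == "annotates" and dst == "text"]
    spans = [NODES[i] for i in ids if NODES[i]["layer"] == layer]
    return sorted(spans, key=lambda s: s["start"])


def layer_to_xml(layer):
    """Serialise one layer as well-formed XML.
    Assumes spans within a single layer do not overlap each other."""
    text = NODES["text"]["value"]
    root = ET.Element("doc")
    cursor, previous = 0, None
    for span in spans_in_layer(layer):
        gap = text[cursor:span["start"]]
        if previous is None:
            root.text = gap
        else:
            previous.tail = gap
        previous = ET.SubElement(root, span["tag"])
        previous.text = text[span["start"]:span["end"]]
        cursor = span["end"]
    if previous is None:
        root.text = text[cursor:]
    else:
        previous.tail = text[cursor:]
    return ET.tostring(root, encoding="unicode")


print(overlaps(NODES["s2"], NODES["l1"]))
print(layer_to_xml("sentence"))
print(layer_to_xml("line"))
```

The stored graph never has to pick a winner between the two hierarchies. Only the query result is constrained to a single tree.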

Remind me to send a note to Michael Sperberg-McQueen and friends about this idea.

BTW, the details of the article that led me down this path:

Synthesis of phylogeny and taxonomy into a comprehensive tree of life by Steven A. Smith, et al.


Reconstructing the phylogenetic relationships that unite all biological lineages (the tree of life) is a grand challenge of biology. However, the paucity of readily available homologous character data across disparately related lineages renders direct phylogenetic inference currently untenable. Our best recourse towards realizing the tree of life is therefore the synthesis of existing collective phylogenetic knowledge available from the wealth of published primary phylogenetic hypotheses, together with taxonomic hierarchy information for unsampled taxa. We combined phylogenetic and taxonomic data to produce a draft tree of life—the Open Tree of Life—containing 2.3 million tips. Realization of this draft tree required the assembly of two resources that should prove valuable to the community: 1) a novel comprehensive global reference taxonomy, and 2) a database of published phylogenetic trees mapped to this common taxonomy. Our open source framework facilitates community comment and contribution, enabling a continuously updatable tree when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminates significant gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point from which we can continue to improve through community contributions. Having a comprehensive tree of life will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change studies, agriculture, and genomics.

A project with a great deal of significance beyond my interest in overlap in markup documents. Highly recommended reading. The resolution of conflicts in trees here involves an evaluation of data, much as you would for merging in a topic map.

Unlike the authors, I see no difficulty in supertrees being rich enough with the underlying data to permit direct use of trees for resolution of conflicts. But you would have to design the trees from the start with those capabilities or have topic-map-like merging capabilities so you are not limited by early and necessarily preliminary data design decisions.


I first saw this in a tweet by Ross Mounce.

Programming for Biologists

Thursday, October 9th, 2014

Programming for Biologists by Ethan White.

From the post:

This is the website for Ethan White’s programming and database management courses designed for biologists. At the moment there are four courses being taught during Fall 2014.

The goal of these courses is to teach biologists how to use computers more effectively to make their research easier. We avoid a lot of the theory that is taught in introductory computer science classes in favor of covering more of the practical side of programming that is necessary for conducting research. In other words, the purpose of these courses is to teach you how to drive the car, not prepare you to be a mechanic.

Hmmm, less theory of engine design and more driving lessons? 😉

Despite my qualms about turn-key machine learning solutions, more people want to learn to drive a car than want to design an engine.

Should we teach topic maps the “right way” or should we teach them to drive?

I first saw this in a tweet by Christophe Lalanne.

Genomic Encyclopedia of Bacteria…

Friday, August 8th, 2014

Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains by Nikos C. Kyrpides, et al. (Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Göker M, et al. (2014) Genomic Encyclopedia of Bacteria and Archaea: Sequencing a Myriad of Type Strains. PLoS Biol 12(8): e1001920. doi:10.1371/journal.pbio.1001920)


Microbes hold the key to life. They hold the secrets to our past (as the descendants of the earliest forms of life) and the prospects for our future (as we mine their genes for solutions to some of the planet’s most pressing problems, from global warming to antibiotic resistance). However, the piecemeal approach that has defined efforts to study microbial genetic diversity for over 20 years and in over 30,000 genome projects risks squandering that promise. These efforts have covered less than 20% of the diversity of the cultured archaeal and bacterial species, which represent just 15% of the overall known prokaryotic diversity. Here we call for the funding of a systematic effort to produce a comprehensive genomic catalog of all cultured Bacteria and Archaea by sequencing, where available, the type strain of each species with a validly published name (currently~11,000). This effort will provide an unprecedented level of coverage of our planet’s genetic diversity, allow for the large-scale discovery of novel genes and functions, and lead to an improved understanding of microbial evolution and function in the environment.

While I am a standards advocate, I have to disagree with some of the claims for standards:

Accurate estimates of diversity will require not only standards for data but also standard operating procedures for all phases of data generation and collection [33],[34]. Indeed, sequencing all archaeal and bacterial type strains as a unified international effort will provide an ideal opportunity to implement international standards in sequencing, assembly, finishing, annotation, and metadata collection, as well as achieve consistent annotation of the environmental sources of these type strains using a standard such as minimum information about any (X) sequence (MixS) [27],[29]. Methods need to be rigorously challenged and validated to ensure that the results generated are accurate and likely reproducible, without having to reproduce each point. With only a few exceptions [27],[29], such standards do not yet exist, but they are in development under the auspices of the Genomics Standards Consortium (e.g., the M5 initiative) [35]. Without the vehicle of a grand-challenge project such as this one, adoption of international standards will be much less likely.

Some standardization will no doubt be beneficial, but for the data that is collected, a topic-map-informed approach where critical subjects are identified not by surface tokens but by key/value pairs would be much better.

In part because there is always legacy data and too little time and funding to backfit every change in present terminology to past names. Or should I say it hasn’t happened outside of one specialized chemical index that comes to mind.
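A rough sketch of what identifying subjects by key/value pairs could look like. The names and identifiers are invented placeholders and the merge rule (share any one pair) is the simplest one available, not a recommendation for real type-strain data. Note that the legacy name is never rewritten. It simply ends up attached to the same subject as the current one.

```python
def merge_subjects(records):
    """Merge records that share at least one identifying (key, value) pair.
    Names are carried along but never used to decide identity."""
    merged = []
    for record in records:
        ids = set(record["ids"])
        names = set(record["names"])
        remaining = []
        # Absorb every existing subject that shares an identifier with this record.
        for subject in merged:
            if subject["ids"] & ids:
                ids |= subject["ids"]
                names |= subject["names"]
            else:
                remaining.append(subject)
        remaining.append({"ids": ids, "names": names})
        merged = remaining
    return merged


# Hypothetical records: a legacy name and a current name for the same strain,
# plus one unrelated record.
RECORDS = [
    {"names": ["Examplea antiqua"], "ids": [("collection_no", "XYZ 0001")]},
    {"names": ["Examplea prima"],
     "ids": [("collection_no", "XYZ 0001"), ("taxon_id", "T-100")]},
    {"names": ["Otherus solus"], "ids": [("taxon_id", "T-200")]},
]

for subject in merge_subjects(RECORDS):
    print(sorted(subject["names"]), sorted(subject["ids"]))
```

A search on either name now reaches the same subject, which is the point: past terminology stays usable without anyone having to backfit it.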

Finding needles in haystacks:…

Sunday, July 6th, 2014

Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi by Conrad L. Schoch, et al. (Database (2014) 2014 : bau061 doi: 10.1093/database/bau061).


DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi.

Database URL:

If you are interested in projects to update and correct existing databases, this is the article for you.

Fungi may not be on your regular reading list but consider one aspect of the problem described:

It is projected that there are ~400 000 fungal names already in existence. Although only 100 000 are accepted taxonomically, it still makes updates to the existing taxonomic structure a continuous task. It is also clear that these named fungi represent only a fraction of the estimated total, 1–6 million fungal species (93–95).

I would say that computer science isn’t the only discipline where “naming things” is hard.


PS: The other lesson from this paper (and many others) is that semantic accuracy is neither easy nor cheap. Anyone who says differently is lying.

The Encyclopedia of Life v2:…

Saturday, May 10th, 2014

The Encyclopedia of Life v2: Providing Global Access to Knowledge About Life on Earth by Cynthia S. Parr, et al. (Biodiversity Data Journal 2: e1079 (29 Apr 2014) doi: 10.3897/BDJ.2.e1079)


The Encyclopedia of Life (EOL) aims to provide unprecedented global access to a broad range of information about life on Earth. It currently contains 3.5 million distinct pages for taxa and provides content for 1.3 million of those pages. The content is primarily contributed by EOL content partners (providers) that have a more limited geographic, taxonomic or topical scope. EOL aggregates these data and automatically integrates them based on associated scientific names and other classification information. EOL also provides interfaces for curation and direct content addition. All materials in EOL are either in the public domain or licensed under a Creative Commons license. In addition to the web interface, EOL is also accessible through an Application Programming Interface.

In this paper, we review recent developments added for Version 2 of the web site and subsequent releases through Version 2.2, which have made EOL more engaging, personal, accessible and internationalizable. We outline the core features and technical architecture of the system. We summarize milestones achieved so far by EOL to present results of the current system implementation and establish benchmarks upon which to judge future improvements.

We have shown that it is possible to successfully integrate large amounts of descriptive biodiversity data from diverse sources into a robust, standards-based, dynamic, and scalable infrastructure. Increasing global participation and the emergence of EOL-powered applications demonstrate that EOL is becoming a significant resource for anyone interested in biological diversity.

This section on the organization of the taxonomy for the Encyclopedia of Life v2 seems particularly relevant:

Resource documents made available by content partners define the text and multimedia being provided as well as the taxa to which the content refers, the associations between content and taxa, and the associations among taxa (i.e. taxonomies). Expert taxonomists often disagree about the best classification for a given group of organisms, and there is no universal taxonomy for partners to adhere to (Patterson et al. 2008, Rotman et al. 2012a, Yoon and Rose 2001). As an aggregator, EOL accepts all taxonomic viewpoints from partners and attempts to assign them to existing Taxon Pages, or create new Taxon Pages when necessary. A reconciliation algorithm uses incoming taxon information, previously indexed data, and assertions from our curators to determine the best aggregation strategy. (links omitted)

Integration of information without agreement on a single view of the information. (Have we heard this before?)

If you think of the taxon pages as proxies, it is easier to see the topic map aspects of this project.
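To make the proxy idea concrete, here is a minimal Python sketch of name-based aggregation. It is not EOL's reconciliation algorithm, which the excerpt does not spell out and which is surely more involved. The `TaxonPage` class, the `normalize` matching rule and the sample partner records are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonPage:
    """A proxy for one taxon; it keeps every partner's view side by side."""
    key: str
    names: set = field(default_factory=set)
    classifications: dict = field(default_factory=dict)  # partner -> parent taxon


def normalize(name: str) -> str:
    # Hypothetical matching rule: ignore case and extra whitespace.
    return " ".join(name.lower().split())


def aggregate(records):
    """Assign each partner record to an existing page, or create a new page."""
    pages = {}
    for partner, name, parent in records:
        key = normalize(name)
        page = pages.setdefault(key, TaxonPage(key))
        page.names.add(name)
        # Conflicting classifications are retained, not resolved.
        page.classifications[partner] = parent
    return pages


# Made-up partner records: (partner, scientific name, parent taxon).
records = [
    ("partner_a", "Puma concolor", "Felinae"),
    ("partner_b", "puma  concolor", "Felidae"),
    ("partner_a", "Panthera leo", "Pantherinae"),
]
pages = aggregate(records)
```

Both partners land on the same page even though they disagree about placement, which is the point: the page integrates without forcing a single view.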

Names are not (always) useful

Monday, April 21st, 2014

PhyloCode names are not useful for phylogenetic synthesis

From the post:

Which brings me to the title of this post. In the PhyloCode, taxonomic names are not hypothetical concepts that can be refuted or refined by data-driven tests. Instead, they are definitions involving specifiers (designated specimens) that are simply applied to source trees that include those specifiers. This is problematic for synthesis because if two source trees differ in topology, and/or they fail to include the appropriate specifiers, it may be impossible to answer the basic question I began with: do the trees share any clades (taxa) in common? If taxa are functions of phylogenetic topology, then there can be no taxonomic basis for meaningfully comparing source trees that either differ in topology, or do not permit the application of taxon definitions. (emphasis added)

If you substitute “names” for “taxa” then it is easy to see my point in Plato, Shiva and A Social Graph about nodes that are “abstract concept devoid of interpretation.” There is nothing to compare.
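A small Python sketch may help show why names that are functions of topology give nothing stable to compare. It assumes two made-up source trees written as nested tuples, treats a clade as simply the set of tips below a node, and simplifies a PhyloCode-style definition to "the smallest clade containing all specifiers". None of this comes from the post itself.

```python
def clades(tree):
    """Return the set of clades (frozensets of tip labels) in a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):  # a tip label
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


def apply_definition(tree, specifiers):
    """Smallest clade containing all specifiers, or None if one is missing."""
    wanted = frozenset(specifiers)
    candidates = [c for c in clades(tree) if wanted <= c]
    return min(candidates, key=len) if candidates else None


# Two hypothetical source trees that differ in topology.
tree_1 = ((("A", "B"), "C"), "D")
tree_2 = (("A", ("B", "C")), "D")

# Comparing by content is a plain set intersection.
shared = clades(tree_1) & clades(tree_2)

# The same definition picks out different contents in each tree.
name_in_tree_1 = apply_definition(tree_1, ["A", "B"])
name_in_tree_2 = apply_definition(tree_2, ["A", "B"])
```

Here the definition with specifiers A and B yields the clade {A, B} in the first tree and {A, B, C} in the second, and it yields nothing at all for a tree that lacks a specifier.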

This isn’t a new problem but a very old one that keeps being repeated.

For processing reasons it may be useful to act as though taxa (or names) are simply given. A digital or print index need not struggle to find a grounding for the terms it reports. For some purposes, that is completely unnecessary.

On the other hand, we should not forget the lack of grounding is purely a convenience for processing or other reasons. We can choose differently should an occasion merit it.

ZooKeys 50 (2010) Special Issue

Wednesday, January 29th, 2014

Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research by Lyubomir Penev, et al.

From the editorial:

The principles of Open Access greatly facilitate dissemination of information through the Web where it is freely accessed, shared and updated in a form that is accessible to indexing and data mining engines using Web 2.0 technologies. Web 2.0 turns the taxonomic information into a global resource well beyond the taxonomic community. A significant bottleneck in naming species is the requirement by the current Codes of biological nomenclature ruling that new names and their associated descriptions must be published on paper, which can be slow, costly and render the new information difficult to find. In order to make progress in documenting the diversity of life, we must remove the publishing impediment in order to move taxonomy “from a cottage industry into a production line” (Lane et al. 2008), and to make best use of new technologies warranting the fastest and widest distribution of these new results.

In this special edition of ZooKeys we present a practical demonstration of such a process. The issue opens with a forum paper from Penev et al. (doi: 10.3897/zookeys.50.538) that presents the landscape of semantic tagging and text enhancements in taxonomy. It describes how the content of the manuscript is enriched by semantic tagging and marking up of four exemplar papers submitted to the publisher in three different ways: (i) written in Microsoft Word and submitted as non-tagged manuscript (Stoev et al., doi: 10.3897/zookeys.50.504); (ii) generated from Scratchpads (Blagoderov et al., doi: 10.3897/zookeys.50.506 and Brake and Tschirnhaus, doi: 10.3897/zookeys.50.505); (iii) generated from an author’s database (Taekul et al., doi: 10.3897/zookeys.50.485). The latter two were submitted as XML-tagged manuscript. These examples demonstrate the suitability of the workflow to a range of possibilities that should encompass most current taxonomic efforts. To implement the aforementioned routes for XML mark up in prospective taxonomic publishing, a special software tool (Pensoft Mark Up Tool, PMT) was developed and its features were demonstrated in the current issue. The XML schema used was version #123 of TaxPub, an extension to the Document Type Definitions (DTD) of the US National Library of Medicine (NLM).

A second forum paper from Blagoderov et al. (doi: 10.3897/zookeys.50.539) sets out a workflow that describes the assembly of elements from a Scratchpad taxon page to export a structured XML file. The publisher receives the submission, automatically renders the file into the journal’s layout style as a PDF and transmits it to a selection of referees, based on the key words in the manuscript and the publisher’s database. Several steps, from the author’s decision to submit the manuscript to final publication and dissemination, are automatic. A journal editor first spends time on the submission when the referees’ reports are received, making the decision to publish, modify or reject the manuscript. If the decision is to publish, then PDF proofs are sent back to the author and, when verified, the paper is published both on paper and on-line, in PDF, HTML and XML formats. The original information is also preserved on the original Scratchpad where it may, in due course, be updated. A visitor arriving at the web site by tracing the original publication will be able to jump forward to the current version of the taxon page.

This sounds like the promise of SGML/XML made real doesn’t it?

See the rest of the editorial or ZooKeys 50 for a very good example of XML and semantics in action.

This is a long way from the “related” or “recent” article citations in most publisher interfaces. Thoughts on how to make that change?

A Semantic Web Example? Nearly a Topic Map?

Wednesday, January 29th, 2014

Morphological and Geographical Traits of the British Odonata by Gary D Powney, et al.


Trait data are fundamental for many aspects of ecological research, particularly for modeling species response to environmental change. We synthesised information from the literature (mainly field guides) and direct measurements from museum specimens, providing a comprehensive dataset of 26 attributes, covering the 43 resident species of Odonata in Britain. Traits included in this database range from morphological traits (e.g. body length) to attributes based on the distribution of the species (e.g. climatic restriction). We measured 11 morphometric traits from five adult males and five adult females per species. Using digital callipers, these measurements were taken from dry museum specimens, all of which were wild caught individuals. Repeated measures were also taken to estimate measurement error. The trait data are stored in an online repository, alongside R code designed to give an overview of the morphometric data, and to combine the morphometric data to the single value per trait per species data.

A great example of publishing data along with software to manipulate it.

I mention it here because the publisher, Pensoft, references the Semantic Web saying:

The Semantic Web could also be called a “linked Web” because most semantic enhancements are in fact provided through various kinds of links to external resources. The results of these linkages will be visualized in the HTML versions of the published papers through various cross-links within the text and more particularly through the Pensoft Taxon Profile (PTP). PTP is a web-based harvester that automatically links any taxon name mentioned within a text to external sources and creates a dynamic web-page for that taxon. PTP saves readers a great amount of time and effort by gathering for them the relevant information on a taxon from leading biodiversity sources in real time.

A substantial feature of the semantic Web is open data publishing, where not only analysed results, but original datasets can be published as citeable items so that the data authors may receive academic credit for their efforts. For more information, please visit our detailed Data Publishing Policies and Guidelines for Biodiversity Data.

When you view the article, you will find related resources displayed next to the article. A lot of related resources.

Of course it remains to every reader to assemble data across varying semantics but this is definitely a step in the right direction.


I first saw this in a tweet by S.K. Morgan Ernest.

Open Microscopy Environment

Tuesday, January 28th, 2014

Open Microscopy Environment

From the webpage:

OME develops open-source software and data format standards for the storage and manipulation of biological microscopy data. It is a joint project between universities, research establishments, industry and the software development community.

Where you will find:

OMERO: OMERO is client-server software for visualization, management and analysis of biological microscope images.

Bio-Formats: Bio-Formats is a Java library for reading and writing biological image files. It can be used as an ImageJ plugin, Matlab toolbox, or in your own software.

OME-TIFF Format: A TIFF-based image format that includes the OME-XML standard.

OME Data Model: A common specification for storing details of microscope set-up and image acquisition.

More data formats for sharing of information. And for integration with other data.

Not only does data continue to expand but so does the semantics associated with it.

We have “big data” tools for the data per se. Have you seen any tools capable of managing the diverse semantics of “big data?”

Me neither.

I first saw this in a tweet by Paul Groth.

Data sharing, OpenTree and GoLife

Monday, January 20th, 2014

Data sharing, OpenTree and GoLife

From the post:

NSF has released GoLife, the new solicitation that replaces both AToL and AVAToL. From the GoLife text:

The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

Data completeness, open data and data integration are key components of these proposals – inferring well-sampled trees that are linked with other types of data (molecular, morphological, ecological, spatial, etc) and made easily available to scientific and non-scientific users. The solicitation requires that trees published by GoLife projects are published in a way that allows them to be understood and re-used by Open Tree of Life and other projects:

Integration and standardization of data consistent with three AVAToL projects: Open Tree of Life, ARBOR, and Next Generation Phenomics is required. Other data should be made available through broadly accessible community efforts (i.e., specimen data through iDigBio, occurrence data through BISON, etc). (I corrected the URLs for ARBOR and Next Generation Phenomics)

What does it mean to publish data consistent with Open Tree of Life? We have a short page on data sharing with OpenTree, a publication coming soon (we will update this post when it comes out) and we will be releasing our new curation / validation tool for phylogenetic data in the next few weeks.

A great resource on the NSF GoLife proposal that I just posted about.

Some other references:

AToL – Assembling the Tree of Life

AVATOL – Assembling, Visualizing and Analyzing the Tree of Life

Be sure to contact the Open Tree of Life group if you are interested in the GoLife project.

Genealogy of Life (GoLife)

Monday, January 20th, 2014

Genealogy of Life (GoLife) NSF.

Full Proposal Deadline Date: March 26, 2014
Fourth Wednesday in March, Annually Thereafter


All of comparative biology depends on knowledge of the evolutionary relationships (phylogeny) of living and extinct organisms. In addition, understanding biodiversity and how it changes over time is only possible when Earth’s diversity is organized into a phylogenetic framework. The goals of the Genealogy of Life (GoLife) program are to resolve the phylogenetic history of life and to integrate this genealogical architecture with underlying organismal data.

The ultimate vision of this program is an open access, universal Genealogy of Life that will provide the comparative framework necessary for testing questions in systematics, evolutionary biology, ecology, and other fields. A further strategic integration of this genealogy of life with data layers from genomic, phenotypic, spatial, ecological and temporal data will produce a grand synthesis of biodiversity and evolutionary sciences. The resulting knowledge infrastructure will enable synthetic research on biological dynamics throughout the history of life on Earth, within current ecosystems, and for predictive modeling of the future evolution of life.

Projects submitted to this program should emphasize increased efficiency in contributing to a complete Genealogy of Life and integration of various types of organismal data with phylogenies.

This program also seeks to broadly train next generation, integrative phylogenetic biologists, creating the human resource infrastructure and workforce needed to tackle emerging research questions in comparative biology. Projects should train students for diverse careers by exposing them to the multidisciplinary areas of research within the proposal.

You may have noticed the emphasis on data integration:

to integrate this genealogical architecture with underlying organismal data.

comparative framework necessary for testing questions in systematics, evolutionary biology, ecology, and other fields

strategic integration of this genealogy of life with data layers from genomic, phenotypic, spatial, ecological and temporal data

synthetic research on biological dynamics

integration of various types of organismal data with phylogenies

next generation, integrative phylogenetic biologists

That sounds like a tall order! Particularly if your solution does not enable researchers to ask on what basis data was integrated as it was, and by whom.

If you can’t ask and answer those two questions, the more data and integration you mix together, the more fragile the integration structure will become.

I’m not trying to presume that such a project will use dynamic merging because it may well not. “Merging” in topic map terms may well be an operation ordered by a member of a group of curators. It is the capturing of the basis for that operation that makes it maintainable over a series of curators through time.
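As a sketch of what capturing that basis could look like, the Python below records each curator-ordered merge together with its reason and its author, so that both questions can be answered later. The `MergeRecord` and `MergeLog` names, the identifiers and the sample entry are hypothetical and are not part of any GoLife or Open Tree of Life tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MergeRecord:
    """One curator-ordered merge, with the reason it was made."""
    subjects: frozenset  # identifiers judged to denote the same subject
    basis: str           # on what basis were they integrated?
    curator: str         # by whom?
    decided: date


class MergeLog:
    def __init__(self):
        self._records = []

    def merge(self, subjects, basis, curator, decided):
        record = MergeRecord(frozenset(subjects), basis, curator, decided)
        self._records.append(record)
        return record

    def why(self, a, b):
        """Return every recorded merge that put a and b together."""
        return [r for r in self._records if {a, b} <= r.subjects]


log = MergeLog()
log.merge(
    {"source_a:123", "source_b:xyz"},  # hypothetical identifiers
    basis="same binomial and same type specimen in both sources",
    curator="curator_1",
    decided=date(2014, 1, 20),
)
answers = log.why("source_a:123", "source_b:xyz")
```

A later curator who doubts the merge can read the recorded basis and either confirm it or add a record that supersedes it.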

I first saw this at: Data sharing, OpenTree and GoLife, which I am about to post on but thought the NSF call merited a separate post as well.

A names backbone:…

Friday, November 22nd, 2013

A names backbone: a graph of taxonomy by Nicky Nicolson.

At first glance a taxonomy paper but as you look deeper:

Slide 34: Concepts layer: taxonomy as a graph

  • Names are nodes
  • Typed, directed relationships represent synonymy and taxonomic placement
  • Evidence for taxonomic assertions provided as references
  • …and again, standards-based import / export using TCS

Slide 35 shows a synonym_of relationship between two name nodes.

Slide 36 shows evidence attached to placement at one node and for the synonym_of link.

Slide 37 shows reuse of nodes to support “different taxonomic opinions.”

Slide 39 Persistent identification of concepts

We can re-create a sub-graph representing a concept at a particular point in time using:

  1. Name ID
  2. Classification
  3. State

Users can link to a stable state of a concept

We can provide a feed of what has changed since

I mention this item in part because Peter Neubauer (Neo4j) suggested in an email that, rather than “merging” nodes, subject sameness (my term, not his) could be represented as a relationship between nodes.

Much in the same way that synonym_of was represented in these slides.

And I like the documentation of the reason for synonymy.

The internal data format of Neo4j makes “merging” in the sense of creating one node to replace two or more other nodes impractical.

Perhaps replacing nodes with other nodes has practical limits?

Is “virtual merging” in your topic map future?
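One way to picture “virtual merging” is the plain Python sketch below. It is not Neo4j code and assumes nothing about Neo4j's API. Name nodes stay separate, each synonym_of edge carries its evidence, and a merged view is computed on demand by following those edges. The evidence string is a placeholder.

```python
from collections import defaultdict

# Name nodes stay separate; sameness is an edge that carries its evidence.
edges = [
    ("Felis concolor", "synonym_of", "Puma concolor", "placeholder reference"),
]


def virtual_merge(names, edges):
    """Group names connected by synonym_of edges without replacing any node."""
    neighbours = defaultdict(set)
    for source, relation, target, _evidence in edges:
        if relation == "synonym_of":
            neighbours[source].add(target)
            neighbours[target].add(source)

    groups, seen = [], set()
    for name in names:
        if name in seen:
            continue
        group, stack = set(), [name]
        while stack:  # walk everything reachable from this name
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups


names = ["Felis concolor", "Puma concolor", "Panthera leo"]
groups = virtual_merge(names, edges)
```

Removing or disputing an edge changes the computed groups without touching the nodes, and the reason for each grouping remains available for inspection.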

Announcing BioCoder

Wednesday, October 16th, 2013

Announcing BioCoder by Mike Loukides.

From the post:

We’re pleased to announce BioCoder, a newsletter on the rapidly expanding field of biology. We’re focusing on DIY bio and synthetic biology, but we’re open to anything that’s interesting.

Why biology? Why now? Biology is currently going through a revolution as radical as the personal computer revolution. Up until the mid-70s, computing was dominated by large, extremely expensive machines that were installed in special rooms and operated by people wearing white lab coats. Programming was the domain of professionals. That changed radically with the advent of microprocessors, the homebrew computer club, and the first generation of personal computers. I put the beginning of the shift in 1975, when a friend of mine built a computer in his dorm room. But whenever it started, the phase transition was thorough and radical. We’ve built a new economy around computing: we’ve seen several startups become gigantic enterprises, and we’ve seen several giants collapse because they couldn’t compete with the more nimble startups.

Bioinformatics and amateur genome exploration is a growing hobby area. Yes, hobby area.

For background, see: Playing with genes by David Smith.

Your bioinformatics skills, which you learned for cross-over use in other fields, could come in handy.

A couple of resources to get you started:


DIY Genomics

Seems like a ripe field for mining and organization.

There is no publication date set on Weaponized Viruses in a Nutshell.

Global Biodiversity Information Facility

Wednesday, October 9th, 2013

Global Biodiversity Information Facility

Some stats:

417,165,184 occurrences

1,426,888 species

11,976 data sets

578 data publishers

What lies at the technical heart of this beast?

Would you believe a PostgreSQL database and an embedded Apache Solr index?

Start with the Summary of the GBIF infrastructure. The details on PostgreSQL and Solr are under the Registry tab.

BTW, the system recognizes multiple identification systems and more are to be added.

Need to read more of the documents on that part of the system.

Data Visualization: Exploring Biodiversity

Saturday, May 25th, 2013

Data Visualization: Exploring Biodiversity by Sean Gonzalez.

From the post:

When you have a few hundred years worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don’t perfectly align over the years, and in some cases there is outright conflicting information. This data is important, it is our civilization’s best minds giving their all to capture and record the biological diversity of our planet. Unfortunately, as it stands today, if you or I were to decide we wanted to learn more, or if we wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career. Earlier this year an executive order was given which generally stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn’t necessarily directly apply to them, and has embarked to make their treasure of information more easily accessible. This is a laudable goal, but how do we actually go about accomplishing this? Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever increasing sources. However, we know this curation challenge can not be met by simply wrapping the data with a single structure or taxonomy. When we search and explore the BHL data we may not know precisely what we’re looking for, and we don’t want a scavenger hunt to ensue where we’re forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out…

People see relationships in the data differently, so when we go exploring one person may do better with a tree structure, others prefer a classic title/subject style search, or we may be interested in reference types and frequencies. Why we don’t think about it as one monolithic system is akin to discussing the number of Angels that fit on the head of a pin, we’ll never be able to test our theories. Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

What would you do beyond visualization?

A self-updating road map of The Cancer Genome Atlas

Friday, May 17th, 2013

A self-updating road map of The Cancer Genome Atlas by David E. Robbins, Alexander Grüneberg, Helena F. Deus, Murat M. Tanik and Jonas S. Almeida. (Bioinformatics (2013) 29 (10): 1333-1340. doi: 10.1093/bioinformatics/btt141)


Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.

Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.

Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at A video tutorial is available at

Curious how the granularity of required semantics and the uniformity of the underlying data set impact the choice of semantic approaches?

And does access to data files present different challenges than say access to research publications in the same field?


Saturday, April 20th, 2013

PhenoMiner: quantitative phenotype curation at the rat genome database by Stanley J. F. Laulederkind, et al. (Database (2013) 2013 : bat015 doi: 10.1093/database/bat015)


The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses >40 000 rat gene records as well as human and mouse orthologs, >2000 rat and 1900 human quantitative trait loci (QTLs) records and >2900 rat strain records. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. Recently, a project was initiated at RGD to incorporate quantitative phenotype data for rat strains, in addition to the currently existing qualitative phenotype data for rat strains, QTLs and genes. A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature. Concurrently, three of those ontologies needed extensive addition of new terms to move the curation forward. The curation interface development, as well as ontology development, was an ongoing process during the early stages of the PhenoMiner curation project.

Database URL:

The line:

A specialized curation tool was designed to generate manual annotations with up to six different ontologies/vocabularies used simultaneously to describe a single experimental value from the literature.

sounded relevant to topic maps.

Turns out to be five ontologies and the article reports:

The ‘Create Record’ page (Figure 4) is where the rest of the data for a single record is entered. It consists of a series of autocomplete text boxes, drop-down text boxes and editable plain text boxes. All of the data entered are associated with terms from five ontologies/vocabularies: RS, CMO, MMO, XCO and the optional MA (Mouse Adult Gross Anatomy Dictionary) (13)

Important to note that authoring does not require the user to make explicit the properties underlying any of the terms from the different ontologies.

Some users probably know that level of detail but what is important is the capturing of their knowledge of subject sameness.

A topic map extension/add-on to such a system could flesh out those bare terms to provide a basis for treating terms from different ontologies as terms for the same subjects.

That merging/mapping detail need not bother an author or casual user.

But it increases the odds that future data sets can be reliably integrated with this one.

And issues with the correctness of a mapping can be meaningfully investigated.

If it helps, think of correctness of mapping as accountability, for someone else.


Sunday, April 7th, 2013

Open PHACTS – Open Pharmacological Space

From the homepage:

Open PHACTS is building an Open Pharmacological Space in a 3-year knowledge management project of the Innovative Medicines Initiative (IMI), a unique partnership between the European Community and the European Federation of Pharmaceutical Industries and Associations (EFPIA).

The project is due to end in March 2014, and aims to deliver a sustainable service to continue after the project funding ends. The project consortium consists of leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements: 28 partners, including 9 pharmaceutical companies and 3 biotechs.

Source code has just appeared on GitHub: OpenPHACTS.

Important to different communities for different reasons. My interest isn’t the same as BigPharma. 😉

A project to watch as they navigate the thickets of vocabularies, ontologies and other semantically diverse information sources.

Visualizing Biological Data Using the SVGmap Browser

Thursday, April 4th, 2013

Visualizing Biological Data Using the SVGmap Browser by Casey Bergman.

From the post:

Early in 2012, Nuria Lopez-Bigas‘ Biomedical Genomics Group published a paper in Bioinformatics describing a very interesting tool for visualizing biological data in a spatial context called SVGmap. The basic idea behind SVGMap is (like most good ideas) quite straightforward – to plot numerical data on a pre-defined image to give biological context to the data in an easy-to-interpret visual form.

To do this, SVGmap takes as input an image in Scalable Vector Graphics (SVG) format where elements of the image are tagged with an identifier, plus a table of numerical data with values assigned to the same identifier as in the elements of the image. SVGMap then integrates these files using either a graphical user interface that runs in standard web browser or a command line interface application that runs in your terminal, allowing the user to display color-coded numerical data on the original image. The overall framework of SVGMap is shown below in an image taken from a post on the Biomedical Genomics Group blog.

svgmap image

We’ve been using SVGMap over the last year to visualize tissue-specific gene expression data in Drosophila melanogaster from the FlyAtlas project, which comes as one of the pre-configured “experiments” in the SVGMap web application.

More recently, we’ve also been using the source distribution of SVGMap to display information about the insertion preferences of transposable elements in a tissue-specific context, which has required installing and configuring a local instance of SVGMap and running it via the browser. The documentation for SVGMap is good enough to do this on your own, but it took a while for us to get a working instance the first time around. We ran into the same issues again the second time, so I thought I’d write up my notes for future reference and to help others get SVGMap up and running as fast as possible.
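The basic idea in the post, joining a table of values to tagged SVG elements by a shared identifier, can be sketched in a few lines of Python using only the standard library. This is not SVGMap's code or its interface. The tiny two-rectangle SVG, the `values` table and the white-to-red colour scale are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

# Hypothetical inputs: an SVG whose shapes carry ids, and values keyed by the same ids.
svg_text = (
    f'<svg xmlns="{SVG_NS}" width="120" height="60">'
    '<rect id="head" x="0" y="0" width="60" height="60"/>'
    '<rect id="thorax" x="60" y="0" width="60" height="60"/>'
    "</svg>"
)
values = {"head": 0.0, "thorax": 1.0}


def shade(value, low, high):
    """Map a value onto a white-to-red hex colour."""
    fraction = 0.0 if high == low else (value - low) / (high - low)
    other = round(255 * (1 - fraction))
    return f"#ff{other:02x}{other:02x}"


def colour_svg(svg_text, values):
    """Set the fill of every element whose id appears in the value table."""
    root = ET.fromstring(svg_text)
    low, high = min(values.values()), max(values.values())
    for element in root.iter():
        key = element.get("id")
        if key in values:
            element.set("fill", shade(values[key], low, high))
    return ET.tostring(root, encoding="unicode")


coloured = colour_svg(svg_text, values)
```

The identifier is doing all the work here: the image and the data table know nothing about each other except that they agree on what "head" and "thorax" name.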

Topic map interfaces aren’t required to take a particular form.

A drawing of a fly could be a topic map interface.

Useful for people studying flies, less useful (maybe) if you are mapping Lady Gaga discography.

What interface do you want to create for a topic map?

Biological Database of Images and Genomes

Wednesday, April 3rd, 2013

Biological Database of Images and Genomes: tools for community annotations linking image and genomic information by Andrew T Oberlin, Dominika A Jurkovic, Mitchell F Balish and Iddo Friedberg. (Database (2013) 2013 : bat016 doi: 10.1093/database/bat016)


Genomic data and biomedical imaging data are undergoing exponential growth. However, our understanding of the phenotype–genotype connection linking the two types of data is lagging behind. While there are many types of software that enable the manipulation and analysis of image data and genomic data as separate entities, there is no framework established for linking the two. We present a generic set of software tools, BioDIG, that allows linking of image data to genomic data. BioDIG tools can be applied to a wide range of research problems that require linking images to genomes. BioDIG features the following: rapid construction of web-based workbenches, community-based annotation, user management and web services. By using BioDIG to create websites, researchers and curators can rapidly annotate a large number of images with genomic information. Here we present the BioDIG software tools that include an image module, a genome module and a user management module. We also introduce a BioDIG-based website, MyDIG, which is being used to annotate images of mycoplasmas.

Database URL: BioDIG website:

BioDIG source code repository:

The MyDIG database:

Linking image data to genomic data. Sounds like associations to me.


Not to mention the heterogeneity of genomic data.

Imagine extending an image/genomic data association by additional genomic data under a different identification.
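A minimal sketch of that idea, in Python. It is not BioDIG's data model or API. The class names, the merge rule (two subjects are the same when they share at least one identifier) and every identifier below are hypothetical, chosen only to show how an existing image/genome association can pick up data that arrives under a different identification.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """A genomic subject (e.g. a gene) known under one or more identifiers."""
    identifiers: set = field(default_factory=set)


@dataclass
class Association:
    """Links an image to a genomic subject."""
    image_id: str
    subject: Subject


def merge_if_same(a: Subject, b: Subject) -> bool:
    """Merge b's identifiers into a when the two share at least one identifier.

    Assumption: a shared identifier is sufficient evidence of sameness.
    A real project would document and justify its own rule.
    """
    if a.identifiers & b.identifiers:
        a.identifiers |= b.identifiers
        return True
    return False


# Hypothetical identifiers, for illustration only.
gene = Subject({"locus:EXAMPLE_0001"})
link = Association(image_id="image-42", subject=gene)

# Additional genomic data arrives under a different identification,
# but it also records the locus tag we already hold.
incoming = Subject({"locus:EXAMPLE_0001", "protein:EXAMPLE_P1"})
merged = merge_if_same(gene, incoming)

# Data about an unrelated subject shares no identifier and is left alone.
unrelated = Subject({"locus:EXAMPLE_9999"})
merged_unrelated = merge_if_same(gene, unrelated)
```

After the merge, the association for image-42 can be reached through either identifier, without anyone touching the image annotation itself.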

Biodiversity Heritage Library (BHL)

Thursday, March 28th, 2013

Biodiversity Heritage Library (BHL)

Best described by their own “about” page:

The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized millions of pages of taxonomic literature, representing tens of thousands of titles and over 100,000 volumes.

The published literature on biological diversity has limited global distribution; much of it is available in only a few select libraries in the developed world. These collections are of exceptional value because the domain of systematic biology depends, more than any other science, upon historic literature. Yet, this wealth of knowledge is available only to those few who can gain direct access to significant library collections. Literature about the biota existing in developing countries is often not available within their own borders. Biologists have long considered that access to the published literature is one of the chief impediments to the efficiency of research in the field. Free global access to digital literature repatriates information about the earth’s species to all parts of the world.

The BHL consortium members digitize the public domain books and journals held within their collections. To acquire additional content and promote free access to information, the BHL has obtained permission from publishers to digitize and make available significant biodiversity materials that are still under copyright.

Because of BHL’s success in digitizing a significant mass of biodiversity literature, the study of living organisms has become more efficient. The BHL Portal allows users to search the corpus by multiple access points, read the texts online, or download select pages or entire volumes as PDF files.

The BHL serves texts with information on over a million species names. Using UBio’s taxonomic name finding tools, researchers can bring together publications about species and find links to related content in the Encyclopedia of Life. Because of its commitment to open access, BHL provides a range of services and APIs which allow users to harvest source data files and reuse content for research purposes.

Since 2009, the BHL has expanded globally. The European Commission’s eContentPlus program has funded the BHL-Europe project, with 28 institutions, to assemble the European language literature. Additionally, the Chinese Academy of Sciences (BHL-China), the Atlas of Living Australia (BHL-Australia), Brazil (through BHL-SciELO) and the Bibliotheca Alexandrina have created national or regional BHL nodes. Global nodes are organizational structures that may or may not develop their own BHL portals. It is the goal of BHL to share and serve content through the BHL Portal developed and maintained at the Missouri Botanical Garden. These projects will work together to share content, protocols, services, and digital preservation practices.

A truly remarkable effort!

Would you believe they have a copy of “Aristotle’s History of animals.” In ten books. Tr. by Richard Cresswell? For download as a PDF?

Tell me, how would you reconcile the terminology of Aristotle, or of Cresswell in translation for that matter, with modern terminology for both species and their features?

In order to enable navigation from this work to other works in the collection?

Moreover, how would you preserve that navigation for others to use?

Document-level granularity is better than not finding a document at all, but it is a far cry from being efficient.

BHL-Europe web portal opens up…

Thursday, March 28th, 2013

BHL-Europe web portal opens up the world’s knowledge on biological diversity

From the post:

The goal of the Biodiversity Heritage Library for Europe (BHL-Europe) project is to make published biodiversity literature accessible to anyone who’s interested. The project will provide a multilingual access point (12 languages) for biodiversity content through the BHL-Europe web portal with specific biological functionalities for search and retrieval and through the EUROPEANA portal. Currently BHL-Europe involves 28 major natural history museums, botanical gardens and other cooperating institutions.

BHL-Europe is a 3 year project, funded by the European Commission under the eContentplus programme, as part of the i2010 policy.

Unlimited access to biological diversity information

The libraries of the European natural history museums and botanical gardens collectively hold the majority of the world’s published knowledge on the discovery and subsequent description of biological diversity. However, digital access to this knowledge is difficult.

The BHL project, launched 2007 in the USA, is systematically attempting to address this problem. In May 2009 the ambitious and innovative EU project ‘Biodiversity Heritage Library for Europe’ (BHL-Europe) was launched. BHL-Europe is coordinated by the Museum für Naturkunde Berlin, Germany, and combines the efforts of 26 European and 2 American institutions. For the first time, the wider public, citizen scientists and decision makers will have unlimited access to this important source of information.

A project with enormous potential, although three (3) years seems a bit short.

Mentioned but without a link, the BHL project has digitized over 100,000 volumes, with information on more than one million species names.

Using molecular networks to assess molecular similarity

Friday, February 15th, 2013

Systems chemistry: Using molecular networks to assess molecular similarity by Bailey Fallon.

From the post:

In new research published in Journal of Systems Chemistry, Sijbren Otto and colleagues have provided the first experimental approach towards molecular networks that can predict bioactivity based on an assessment of molecular similarity.

Molecular similarity is an important concept in drug discovery. Molecules that share certain features such as shape, structure or hydrogen bond donor/acceptor groups may have similar properties that make them common to a particular target. Assessment of molecular similarity has so far relied almost exclusively on computational approaches, but Dr Otto reasoned that a measure of similarity might be obtained by interrogating the molecules in solution experimentally.

Important work for drug discovery but there are semantic lessons here as well:

Tests for similarity/sameness are domain specific.

Which means there are no universal tests for similarity/sameness.

Lacking universal tests for similarity/sameness, we should focus on developing documented and domain specific tests for similarity/sameness.

Domain specific tests provide quicker ROI than less useful and doomed universal solutions.

Documented domain specific tests may, no guarantees, enable us to find commonalities between domain measures of similarity/sameness.

But our conclusions will be based on domain experience and not projection from our domain onto other, less well-known domains.
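One way to picture documented, domain-specific tests is a small registry in which each domain carries its own test and a plain-language statement of what that test means. The Python sketch below is only that, a sketch. It is not the experimental method from the article (which measures similarity in solution, not in software). The registry layout, the function names, the 0.7 threshold and the sample inputs are all assumptions made for illustration. The chemistry entry uses the Tanimoto (Jaccard) coefficient on feature sets, a common computational measure of the kind the post says has dominated so far.

```python
from typing import Callable, Dict

# Each domain registers its own test together with its documentation.
REGISTRY: Dict[str, dict] = {}


def register(domain: str, description: str) -> Callable:
    """Decorator that records a similarity test and its documentation."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[domain] = {"description": description, "test": func}
        return func
    return decorator


@register("chemistry",
          "Tanimoto coefficient on feature sets is at least 0.7 "
          "(threshold is an assumption for this sketch)")
def similar_molecules(a: frozenset, b: frozenset) -> bool:
    union = a | b
    if not union:
        return False  # two empty feature sets tell us nothing
    return len(a & b) / len(union) >= 0.7


@register("taxonomy",
          "Species names are equal after trimming whitespace and ignoring case")
def same_species_name(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


def is_similar(domain: str, a, b) -> bool:
    """Apply the documented test for one domain; there is no universal fallback."""
    if domain not in REGISTRY:
        raise KeyError(f"no documented similarity test for domain {domain!r}")
    return REGISTRY[domain]["test"](a, b)


# Hypothetical inputs: feature sets stand in for molecular fingerprints.
chem_close = is_similar("chemistry", frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4, 5}))
chem_far = is_similar("chemistry", frozenset({1, 2}), frozenset({3, 4}))
names_match = is_similar("taxonomy", " Drosophila melanogaster", "drosophila melanogaster")
```

Asking for a domain with no registered test raises an error instead of guessing, which is the point: absent a documented test, there is no answer. Because every entry keeps its description next to its test, the descriptions can later be compared across domains to look for commonalities.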