Are Graph Databases Ready for Bioinformatics? by Christian Theil Have and Lars Juhl Jensen.
From the editorial:
Graphs are ubiquitous in bioinformatics and frequently consist of too many nodes and edges to represent in RAM. These graphs are thus stored in databases to allow for efficient queries using declarative query languages such as SQL. Traditional relational databases (e.g. MySQL and PostgreSQL) have long been used for this purpose and are based on decades of research into query optimization. Recently, NoSQL databases have caught a lot of attention due to their advantages in scalability. The term NoSQL is used to refer to schemaless databases such as key/value stores (e.g. Apache Cassandra), document stores (e.g. MongoDB) and graph databases (e.g. AllegroGraph, Neo4J, OpenLink Virtuoso), which do not fit within the traditional relational paradigm. Most NoSQL databases do not have a declarative query language. The widely used Neo4J graph database is an exception. Its query language Cypher is designed for expressing graph queries, but is still evolving.
Graph databases have so far seen only limited use within bioinformatics [Schriml et al., 2013]. To illustrate the pros and cons of using a graph database (exemplified by Neo4J v1.8.1) instead of a relational database (PostgreSQL v9.1) we imported into both the human interaction network from STRING v9.05 [Franceschini et al., 2013], which is an approximately scale-free network with 20,140 proteins and 2.2 million interactions. As all graph databases, Neo4J stores edges as direct pointers between nodes, which can thus be traversed in constant time. Because Neo4j uses the property graph model, nodes and edges can have properties associated with them; we use this for storing the protein names and the confidence scores associated with the interactions. In PostgreSQL, we stored the graph as an indexed table of node pairs, which can be traversed with either logarithmic or constant lookup complexity depending on the type of index used. On these databases we benchmarked the speed of Cypher and SQL queries for solving three bioinformatics graph processing problems: finding immediate neighbors and their interactions, finding the best scoring path between two proteins, and finding the shortest path between them. We have selected these three tasks because they illustrate well the strengths and weaknesses of graph databases compared to traditional relational databases.
Encouraging but also makes the case for improvements in graph database software.