Archive for the ‘Bio4j’ Category

Big Data – Genomics – Bio4j

Tuesday, November 12th, 2013

Berkeley Phylogenomics Group receives an NSF grant to develop a graph DB for Big Data challenges in genomics building on Bio4j

From the post:

The Sjölander Lab at the University of California, Berkeley, has recently been awarded a 250K US dollars EAGER grant from the National Science Foundation to build a graph database for Big Data challenges in genomics. Naturally, they’re building on Bio4j.

The project “EAGER: Towards a self-organizing map and hyper-dimensional information network for the human genome” aims to create a graph database of genome and proteome data for the human genome and related species to allow biologists and computational biologists to mine the information in gene family trees, biological networks and other graph data that cannot be represented effectively in relational databases. For these goals, they will develop on top of the pioneering graph-based bioinformatics platform Bio4j.

“We are excited to see how Bio4j is used by top research groups to build cutting-edge bioinformatics solutions” said Eduardo Pareja, Era7 Bioinformatics CEO. “To reach an even broader user base, we are pleased to announce that we now provide versions for both Neo4j and Titan graph databases, for which we have developed another layer of abstraction for the domain model using Blueprints.”

“EAGER stands for Early-concept Grants for Exploratory Research”, explained Professor Kimmen Sjölander, head of the Berkeley Phylogenomics Group: “NSF awards these grants to support exploratory work in its early stages on untested, but potentially transformative, research ideas or approaches”. “My lab’s focus is on machine learning methods for Big Data challenges in biology, particularly for graphical data such as gene trees, networks, pathways and protein structures. The limitations of relational database technologies for graph data, particularly BIG graph data, restrict scientists’ ability to get any real information from that data. When we decided to switch to a graph database, we did a lot of research into the options. When we found out about Bio4j, we knew we’d found our solution. The Bio4j team has made our development tasks so much easier, and we look forward to a long and fruitful collaboration in this open-source project”.

Always nice to see great projects get ahead!

Kudos to the Berkeley Phylogenomics Group!

Bio4j: Big Biological Data Pioneer

Thursday, November 15th, 2012

Bio4j: A pioneer graph based database for the integration of biological Big Data by Pablo Pareja Tobes.

If your favorite biologist or geneticist isn’t aware of Bio4j (unlikely, but it could happen), here are some slides you can pass along to them.

Not enough of the details to be frightening but enough on the potential for your friends to want to know more.

That would be a good principle for creating topic map presentations.

Not enough detail to frighten but enough of the promise to keep people interested.

Bio4j 0.8, some numbers

Saturday, October 20th, 2012

Bio4j 0.8, some numbers by Pablo Pareja Tobes.

Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

  • Number of Relationships: 717.484.649
  • Number of Nodes: 92.667.745
  • Relationship types: 144
  • Node types: 42
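
To put those figures in perspective, here is a minimal Python sketch (mine, not from Pablo's post) that parses the dot-separated counts and compares them with the round-number milestones and with the Bio4j 0.7 figures quoted further down this page. The helper name parse_count and the dictionary layout are made up for illustration.

```python
def parse_count(text: str) -> int:
    """Parse a count written with dots as thousands separators, e.g. '76.071.411'."""
    return int(text.replace(".", ""))


# Figures as quoted on this page for Bio4j 0.8 and Bio4j 0.7.
v08 = {"relationships": parse_count("717.484.649"), "nodes": parse_count("92.667.745")}
v07 = {"relationships": parse_count("530.642.683"), "nodes": parse_count("76.071.411")}

# Round-number milestones mentioned in the 0.8 post.
milestones = {"relationships": 1_000_000_000, "nodes": 100_000_000}

for key in ("relationships", "nodes"):
    progress = v08[key] / milestones[key]   # share of the milestone reached
    growth = v08[key] / v07[key] - 1        # relative growth since 0.7
    print(f"{key}: {progress:.1%} of the milestone, {growth:+.1%} since 0.7")

# Average number of relationships per node in 0.8.
avg_per_node = v08["relationships"] / v08["nodes"]
print(f"relationships per node: {avg_per_node:.2f}")
```

Run as-is, it puts relationships at a little under 72% of the one billion mark and nodes at a little under 93% of 100 million.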

If Pablo gets tired of his brilliant career in bioinformatics he can always run for office in the United States with claims like: “…we are quickly approaching the 1 billion relationships….” 😉

Still, a stunning achievement!

See Pablo’s post for more analysis.

Pass the project along to anyone with doubts about graph databases.

Bio4j 0.8 is here!

Saturday, October 13th, 2012

Bio4j 0.8 is here! by Pablo Pareja Tobes.

You will find “5.488.000 new proteins and 3.233.000 genes” and other improvements!

Whether you are interested in graph databases (Neo4j), bioinformatics or both, this is welcome news!

Neo4j and Bioinformatics

Saturday, August 11th, 2012

From the description:

Pablo Pareja will give an overview of Bio4j project, and then move to some of its recent applications.

  • BG7: a new system for bacterial genome annotation designed for NGS data
  • MG7: metagenomics + taxonomy integration
  • Evolutionary studies, transcriptional networks, network analysis..

It may just be me but the sound seems “faint.” Even when set to full volume, it is difficult to hear Pablo clearly.

I have tried this on two different computers with different OSes so I don’t think it is a problem on my end.

Your experience?

BTW, slides are here.

Neo4j and Bioinformatics [Webinar]

Monday, July 30th, 2012

Thursday August 9 10:00 PDT / 19:00 CEST

From the webpage:

The world of data is changing. Big Data and NOSQL are bringing new ways of understanding your data.

This opens a whole new world of possibilities for a wide range of fields, and bioinformatics is no exception. This paradigm provides bioinformaticians with a powerful and intuitive framework, to deal with biological data that is naturally interconnected.

Pablo Pareja will give an overview of Bio4j project, and then move to some of its recent applications.

  • BG7: a new system for bacterial genome annotation designed for NGS data
  • MG7: metagenomics + taxonomy integration
  • Evolutionary studies, transcriptional networks, network analysis..
  • Future directions

Speaker: Pablo Pareja, Project Leader of Bio4j

If you are thinking about “scale,” consider the current stats on Bio4j:

The current version of Bio4j includes:

  • Relationships: 530.642.683
  • Nodes: 76.071.411
  • Relationship types: 139
  • Node types: 38

With room to spare!

Bio4jExplorer, new features and design!

Monday, March 12th, 2012

Pablo Pareja Tobes writes:

I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.

Among the new features:

  • Node & Relationship Properties
  • Node & Relationship Data Source
  • Relationships Name Property

It may take time but even with “big data,” the source of data (as an aspect of validity or trust) is going to become a requirement.

Bio4j 0.7, some numbers

Monday, March 5th, 2012

Bio4j 0.7, some numbers by Pablo Pareja Tobes.

From the post:

There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself?

Today I’m going to show you some basic statistics about the different types of nodes and relationships Bio4j is made up of.

Just as a heads up, here are the general numbers of Bio4j 0.7 :

  • Number of Relationships: 530.642.683
  • Number of Nodes: 76.071.411
  • Relationship types: 139
  • Node types: 38

The numbers speak for themselves. More information at Pablo’s post.

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

Thursday, February 23rd, 2012

Pablo Pareja writes:

I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.

Even though it is normally defined for only two nodes given it can easily be extended for a set of nodes with an arbitrary size. This is a quite common scenario that can be found across multiple fields and taxonomy is one of them.

The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, – I couldn’t find any applicable implementation that was thought for more than just two nodes.
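
The extension from two nodes to a set can be sketched in a few lines. The Python below is not Pablo's implementation (his works against Bio4j's NCBI taxonomy nodes). It assumes a plain child-to-parent dictionary describing a rooted tree, and the function names and the toy taxonomy are illustrative only. It walks from the first node up to the root, then trims that path against each further node until only the shared ancestors remain.

```python
from typing import Dict, Iterable, List, Optional


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Return [node, parent(node), ..., root]; the root has no entry in `parent`."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(nodes: Iterable[str], parent: Dict[str, str]) -> Optional[str]:
    """Lowest common ancestor of any number of nodes in a rooted tree.

    A node counts as its own ancestor. Returns None for an empty input
    or when the nodes share no ancestor (they sit in different trees).
    """
    nodes = list(nodes)
    if not nodes:
        return None
    # Candidate ancestors, ordered from the deepest up to the root.
    candidates = path_to_root(nodes[0], parent)
    for node in nodes[1:]:
        ancestors = set(path_to_root(node, parent))
        # Keep only the candidates that are also ancestors of this node.
        candidates = [c for c in candidates if c in ancestors]
    return candidates[0] if candidates else None


# Toy taxonomy (child -> parent), for illustration only.
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
}

print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"], parent))
print(lowest_common_ancestor(
    ["Escherichia coli", "Salmonella enterica", "Bacillus subtilis"], parent))
```

Trimming one shared candidate list keeps the sketch short. On a graph the size of the NCBI taxonomy one would more likely work with node identifiers and stop climbing early, but the idea is the same.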

Important for its use with NCBI taxonomy nodes but another use case comes readily to mind.

What about overlapping markup?

Traditionally we represent markup elements as single nodes, despite their composition of start and end events for each “well-formed” element in the text stream.

But what if we represent start and end events as nodes in a graph with relationships both to each other and other nodes in the markup stream?

Can we then ask the question: which pair of start/end nodes is the ancestor of either a start or an end element?

If they have the same ancestor then we have the uninteresting case of well-formed markup.

But what if they don’t have the same ancestor? What can the common ancestor method tell us about the structure of the markup?

Definitely a research topic.

Bio4j: A pioneer graph based database…

Wednesday, February 1st, 2012

Bio4j: A pioneer graph based database for the integration of biological Big Data by Pablo Pareja Tobes.

Great slide deck by the principal developer for Bio4j.

Take a close look at slide 19 and tell me what it reminds you of?


Using Bio4j + Neo4j Graph-algo component…

Monday, January 2nd, 2012

Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths

From the post:

Today I managed to find some time to check out the Graph-algo component from Neo4j and after playing with it plus Bio4j a bit, I have to say it seems pretty cool.

For those who don’t know what I’m talking about, here you have the description you can find in Neo4j wiki:

This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.

The algorithm for finding the shortest path between two nodes caught my attention and I started to wonder how could I give it a try applying it to the data included in Bio4j.
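
For readers who have not met the idea before, here is a rough sketch of an unweighted shortest-path search, done as a breadth-first search in plain Python. It is not the Neo4j Graph-algo API and not the code from the post. The protein identifiers, the interaction list and the helper names are all invented for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def undirected(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency map in which every edge can be walked both ways."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph: Dict[str, Set[str]], start: str, goal: str) -> Optional[List[str]]:
    """Return one path with the fewest edges from start to goal, or None."""
    if start == goal:
        return [start]
    previous: Dict[str, Optional[str]] = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the back-pointers from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Made-up protein-protein interactions.
interactions = undirected([
    ("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
    ("P1", "P5"), ("P5", "P4"),
    ("P6", "P7"),
])

print(shortest_path(interactions, "P1", "P4"))
print(shortest_path(interactions, "P1", "P7"))
```

Here every interaction counts as one step. A search over weighted edges (confidence scores, say) would need something like Dijkstra's algorithm instead.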

Any suggestions for other data sets where shortest path would yield interesting results?

BTW, isn’t the shortest path an artifact of the basis for nearness between nodes? I am thinking that a shortest path computed between gene fragments on the basis of relatedness would differ from one computed on the basis of physical distance. (see: Nearness key in microbe DNA swaps: Proximity trumps relatedness in influencing how often bacteria pick up each other’s genes.)


Monday, October 10th, 2011

Bio4jExplorer: familiarize yourself with Bio4j nodes and relationships

From the post:

I just uploaded a new tool aimed to be used both as a reference manual and initial contact for Bio4j domain model: Bio4jExplorer

Bio4jExplorer allows you to:

  • Navigate through all nodes and relationships
  • Access the javadocs of any node or relationship
  • Graphically explore the neighbourhood of a node/relationship
  • Look up for the different indexes that may serve as an entry point for a node
  • Check incoming/outgoing relationships of a specific node
  • Check start/end nodes of a specific relationship

And take note:

For those interested on how this was done, on the server side I created an AWS SimpleDB database holding all the information about the model of Bio4j, i.e. everything regarding nodes, relationships, indexes… (here you can check the program used for creating this database using java aws sdk)

Meanwhile, in the client side I used Flare prefuse AS3 library for the graph visualization.

When people are this productive as well as a benefit to the community, I am deeply envious but glad for them (and the rest of us) at the same time. Simply must work harder. 😉

Bio4j – as an AWS snapshot

Thursday, June 23rd, 2011

Bio4j current release now available as an AWS snapshot

From the post:

For those using AWS (or willing to…) I just created a public snapshot containing the last version of Bio4j DB.

The snapshot details are the following:

  • Snapshot id: snap-25192d4c
  • Snapshot region: EU West (Ireland)
  • Snapshot size: 90 GB

The whole DB is under the folder ‘bio4jdb’.
In order to use it, just create a Bio4jManager instance and start navigating the graph!

Very cool!