Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j
Pablo Pareja writes:
I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science but it’s actually pretty simple. As its name says, it consists of finding the common ancestor for two different nodes which has the lowest level possible in the tree/graph.
Even though it is normally defined for only two nodes given it can easily be extended for a set of nodes with an arbitrary size. This is a quite common scenario that can be found across multiple fields and taxonomy is one of them.
The reason I’m talking about all this is because today I ran into the need to make use of such algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own, – I couldn’t find any applicable implementation that was thought for more than just two nodes.
Important for its use with NCBI taxonomy nodes but another use case comes readily to mind.
What about overlapping markup?
Traditionally we represent markup elements as single nodes, despite their composition of and events for each “well-formed” element in the text stream.
But what if we represent and events as nodes in a graph with relationships both to each other and other nodes in the markup stream?
Can we then ask the question, Which pair of / nodes are the ancestor of either a or element?
If they have the same ancestor then we have the uninteresting case of well-formed markup.
But what if they don’t have the same ancestor? What can the common ancestor method tell us about the structure of the markup?
Definitely a research topic.