Archive for the ‘Trees’ Category

Rollin’ Trees, yo

Sunday, January 11th, 2015

Rollin’ Trees, yo by Clark Feusier.

From the post:

I like trees. All kinds of trees — concrete and abstract. Redwoods, Oaks, search trees, decision trees, fruit trees, DOM trees, Christmas trees, and more.

They are powerful beyond common recognition. Oxygen, life, shelter, food, beauty, computational efficiency, and more are provided by trees when we interact with them in the right ways.

Don’t get offended when I say this:

you don’t like trees enough

Before I can make you feel bad about taking trees for granted, I need you to be very familiar with trees and their uses. Once you understand the tree, you will feel bad for not appreciating it enough. Then, you will start appreciating trees, as well as using them in the situations for which they are perfectly suited. Good.

Oh, I don’t mind using trees for “…situations for which they are perfectly suited.” What I object to is using trees to model texts, where outside of the artificial strictures of markup, are definitely not trees!

I suppose the simplest (and most common) case of non-tree like behavior for texts is where a sentence crosses page boundaries. If you think of the page as a container, then the markup for the start and end of a sentence “overlaps” the markers for the page boundary.

Another easy example is where a quotation starts the middle of one sentence and ends at the end of the following sentence. Any markup for the first sentence is going to “overlap” the markup for the start of the quotation.

For all that, this is a good review of trees and worth your time to read. Just don’t allow yourself to be limited to trees when thinking about texts.

I first saw this in a tweet by Anna Pawlicka.

Finger trees:…

Wednesday, July 23rd, 2014

Finger trees: a simple general-purpose data structure by Ralf Hinze and Ross Paterson.


We introduce 2-3 finger trees, a functional representation of a persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece. Representations achieving these bounds have appeared previously, but 2-3 finger trees are much simpler, as are the operations on them. Further, by defi ning the split operation in a general form, we obtain a general purpose data structure that can serve as a sequence, priority queue, search tree, priority search queue and more.

Before the Hinze and Paterson article you may want to read: 2-3 finger trees in ASCII by Jens Nicolay.

Note 2-3 finger trees go unmentioned in Purely Functional Data Structures by Chris Okasaki.

Other omissions of note?

Classification and regression trees

Tuesday, July 15th, 2014

Classification and regression trees by Wei-Yin Loh.


Classification and regression trees are machine-learningmethods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weakness in two examples. 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 14–23 DOI: 10.1002/widm.8.

A bit more challenging that CSV formats but also very useful.

I heard a joke many years ago but a then U.S. Assistant Attorney General who said:

To create a suspect list for a truck hijacking in New York, you choose files with certain name characteristics, delete the ones that are currently in prison and those that remain are your suspect list. (paraphrase)

If topic maps can represent any “subject” then they should be able to represent “group subjects” as well. We may know that our particular suspect is the member of a group, but we just don’t know which member of the group is our suspect.

Think of it as a topic map that evolves as more data/analysis is brought to the map and members of a group subject can be broken out into smaller groups or even individuals.

In fact, displaying summaries of characteristics of members of a group in response to classification/regression could well help with the subject analysis process. An interactive construction/mining of the topic map as it were.

Great paper whether you use it for topic map subject analysis or more traditional purposes.

Binary Search Trees (Clojure)

Thursday, April 10th, 2014

Data Structures in Clojure: Binary Search Trees by Max Countryman.

From the post:

Trees Everywhere

So far we have talked about two fundamental and pervasive data structures: linked lists and hash tables. Here again we discuss another important data structure and one that you will find is quite common: trees. Trees offer a powerful way of organizing data and approaching certain problems. In particular, searching and traversal. Whether you know it or not, you no doubt use trees in your programs today. For instance, Clojure’s vectors are backed by a special kind of tree!

Here we will construct our own tree, just like with our linked list and hash table implementations. Specifically, our tree will be a kind of tree known as a Binary Search Tree (BST). Often when someone says tree, they mean a BST.

We will look the basic structure of our tree, how we insert things into it, and how we find them again. Then we will explore traversing, and finally, removing nodes. At the end of this tutorial you will have a basic, functioning Binary Search Tree, which will be the basis for further explorations later on in this series.

Another installment by Max on data structures in Clojure.


Annual Christmas Tree Lecture (Knuth)

Tuesday, December 3rd, 2013

Computer Musing by Professor Donald E. Knuth.

From the webpage:

Professor Knuth will present his 19th Annual Christmas Tree Lecture on Monday, December 9, 2013 at 7:00 pm in NVIDIA Auditorium in the new Huang Engineering Center, 475 Via Ortega, Stanford University (map). The topic will be Planar Graphs and Ternary Trees. There is no admission charge or registration required. For those unable to come to Stanford, register for the live webinar broadcast.

No doubt heavy sledding but what better way to prepare for the holiday season?

Date: Monday, December 9, 2013

7 p.m. – 8 p.m. Pacific
10 p.m. – 11 p.m. Eastern

Constructing a true LCSH tree of a science and engineering collection

Monday, November 19th, 2012

Constructing a true LCSH tree of a science and engineering collection by Charles-Antoine Julien, Pierre Tirilly, John E. Leide and Catherine Guastavino.


The Library of Congress Subject Headings (LCSH) is a subject structure used to index large library collections throughout the world. Browsing a collection through LCSH is difficult using current online tools in part because users cannot explore the structure using their existing experience navigating file hierarchies on their hard drives. This is due to inconsistencies in the LCSH structure, which does not adhere to the specific rules defining tree structures. This article proposes a method to adapt the LCSH structure to reflect a real-world collection from the domain of science and engineering. This structure is transformed into a valid tree structure using an automatic process. The analysis of the resulting LCSH tree shows a large and complex structure. The analysis of the distribution of information within the LCSH tree reveals a power law distribution where the vast majority of subjects contain few information items and a few subjects contain the vast majority of the collection.

After a detailed analysis of records from the McGill University Libraries (204,430 topical authority records) and 130,940 bibliographic records (Schulich Science and Engineering Library), the authors conclude in part:

This revealed that the structure was large, highly redundant due to multiple inheritances, very deep, and unbalanced. The complexity of the LCSH tree is a likely usability barrier for subject browsing and navigation of the information collection.

For me the most compelling part of this research was the focus on LCSH as used and not as it imagines itself. Very interesting reading. A slow walk through the bibliography will interest those researching LCSH or classification more generally.

Demonstration of the power law with the use of LCSH makes one wonder about other classification systems as used.

Trees – A Primer

Monday, September 17th, 2012

Trees – A Primer by Jeremy Kun.

From the post:

This post comes in preparation for a post on decision trees (a specific type of tree used for classification in machine learning). While most mathematicians and programmers are familiar with trees, we have yet to discuss them on this blog. For completeness, we’ll give a brief overview of the terminology and constructions associated with trees, and describe a few common algorithms on trees. We will assume the reader has read our first primer on graph theory, which is a light assumption. Furthermore, we will use the terms node and vertex interchangeably, as mathematicians use the latter and computer scientists the former.

A nice introduction/refresher on tree data structures.

Parsing the Newick format in C using flex and bison

Friday, July 13th, 2012

Parsing the Newick format in C using flex and bison by Pierre Lindenbaum.

From the post:

The following post is my answer for this question on biostar “Newick 2 Json converter“.

The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .

The original question asked for a parser based on perl but here, I’ve implemented a C parser using flex/bison.

If that doesn’t grab your interest, consider the following from the Wikipedia article cited by Pierre on the Newick tree format:

In mathematics, Newick tree format (or Newick notation or New Hampshire tree format) is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick’s restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein’s PHYLIP package.[1]

Of interest both for conversion but also for the representation of graph-theoretical trees. About the same time as GML and other efforts on trees.

In case you are in Dover, Newick’s survives to this day. I don’t know if they are aware of the reason for their fame but you could mention it.

Extracting Conflict-free Information from Multi-labeled Trees

Friday, June 1st, 2012

Extracting Conflict-free Information from Multi-labeled Trees by Akshay Deepak, David Fernández-Baca, and Michelle M. McMahon.


A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation in MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved.

You may not agree with:

That is, for every MUL-tree $T$ there exists a singly-labeled tree that displays all the conflict-free quartets of $T$ — and possibly some other quartets as well. Motivated by this, we only view conflict-free quartet topologies as informative, and define the information content of a MUL-tree as the set of all conflict-free quartet topologies it implies.

Preferring to view conflicts as information content (I would) but each to his own.

I suspect “multi-labeled” trees are more common than one might expect.

Other examples?

Trees in the Database: Advanced Data Structures

Monday, March 5th, 2012

Trees in the Database: Advanced Data Structures

Lorenzo Alberton writes:

Despite the NoSQL movement trying to flag traditional databases as a dying breed, the RDBMS keeps evolving and adding new powerful weapons to its arsenal. In this talk we’ll explore Common Table Expressions (SQL-99) and how SQL handles recursion, breaking the bi-dimensional barriers and paving the way to more complex data structures like trees and graphs, and how we can replicate features from social networks and recommendation systems. We’ll also have a look at window functions (SQL:2003) and the advanced reporting features they make finally possible. The first part of this talk will cover several different techniques to model a tree data structure into a relational database: parent-child (adjacency list) model, materialized path, nested sets, nested intervals, hybrid models, Common Table Expressions. Then we’ll move one step forward and see how we can model a more complex data structure, i.e. a graph, with concrete examples from today’s websites. Starting from real-world examples of social networks’ and recommendation systems’ features, and with the help of some graph theory, this talk will explain how to represent and traverse a graph in the database. Finally, we will take a look at Window Functions and how they can be useful for data analytics and simple inline aggregations, among other things. All the examples have been tested on PostgreSQL >= 8.4.

Very impressive presentation!

Definitely makes me want to dust off my SQL installations and manuals for a closer look!

Compact Binary Relation Representations with Rich Functionality

Wednesday, January 18th, 2012

Compact Binary Relation Representations with Rich Functionality by Jérémy Barbay, Francisco Claude, Gonzalo Navarro.


Binary relations are an important abstraction arising in many data representation problems. The data structures proposed so far to represent them support just a few basic operations required to fit one particular application. We identify many of those operations arising in applications and generalize them into a wide set of desirable queries for a binary relation representation. We also identify reductions among those operations. We then introduce several novel binary relation representations, some simple and some quite sophisticated, that not only are space-efficient but also efficiently support a large subset of the desired queries.

Read the introduction (runs about two of the thirty-two pages) and tell me you aren’t interested. Go ahead.

Just the start of what looks like a very interesting line of research.

Tree Traversal in O(1) Space

Saturday, October 8th, 2011

Tree Traversal in O(1) Space by Sanjoy.

From the post:

I’ve been reading some texts recently, and came across a very interesting way to traverse a graph, called pointer reversal. The idea is this — instead of maintaining an explicit stack (of the places you’ve visited), you try to store the relevant information in the nodes themselves. One approach that works on directed graphs with two (outgoing) arcs per node is called the Deutsch-Schorr-Waite algorithm. This was later extended by Thorelli to work for directed graphs with an unknown number of (outgoing) arcs per node.

Implemented here for a tree, care to go for a more general graph?

Working with Trees in the Phyloinformatic Age

Monday, February 28th, 2011

Working with Trees in the Phyloinformatic Age by William H. Piel discusses the processing and display of phylogenetic trees.

The PhyloWidget that is mentioned in the slides can be found at: Phylowidget

If you are working on topic maps in bioinformatics, this should be on your list of works to review.

I was particularly amused by the advice given at the Phylowidget site in its step-by-step instructions:

* If you are a biologist: we recommend starting with the First Step and moving forwards.
* If you are a developer: we recommend starting with the Last Step and moving backwards.