Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 14, 2014

Flax Clade PoC

Filed under: Indexing,Solr,Taxonomy — Patrick Durusau @ 6:20 pm

Flax Clade PoC by Tom Mortimer.

From the webpage:

Flax Clade PoC is a proof-of-concept open source taxonomy management and document classification system, based on Apache Solr. In its current state it should be considered pre-alpha. As open-source software you are welcome to try, use, copy and modify Clade as you like. We would love to hear any constructive suggestions you might have.

Tom Mortimer tom@flax.co.uk


Taxonomies and document classification

Clade taxonomies have a tree structure, with a single top-level category (e.g. in the example data, “Social Psychology”). There is no distinction between parent and child nodes (except that the former has children) and the hierachical structure of the taxonomy is completely orthogonal from the node data. The structure may be freely edited.

Each node represents a category, which is represented by a set of “keywords” (words or phrases) which should be present in a document belonging to that category. Not all the keywords have to be present – they are joined with Boolean OR rather than AND. A document may belong to multiple categories, which are ranked according to standard Solr (TF-IDF) scoring. It is also possible to exclude certain keywords from categories.

Clade will also suggest keywords to add to a category, based on the content of the documents already in the category. This feature is currently slow as it uses the standard Solr MoreLikeThis component to analyse a large number of documents. We plan to improve this for a future release by writing a custom Solr plugin.

Documents are stored in a standard Solr index and are categorised dynamically as taxonomy nodes are selected. There is currently no way of writing the categorisation results to the documents in SOLR, but see below for how to export the document categorisation to an XML or CSV file.

A very interesting project!

I am particularly interested in the dynamic categorisation when nodes are selected.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress