Archive for December, 2010

The Brainy Learning Algorithms of Numenta

Friday, December 31st, 2010

The Brainy Learning Algorithms of Numenta

From the lead:

Jeff Hawkins has a track record at predicting the future. The founder of Palm and inventor of the PalmPilot, he spent the 1990s talking of a coming world in which we would all carry powerful computers in our pockets. “No one believed in it back then—people thought I was crazy,” he says. “Of course, I’m thrilled about how successful mobile computing is today.”

At his current firm, Numenta, Hawkins is working on another idea that seems to come out of left field: copying the workings of our own brains to build software that makes accurate snap decisions for today’s data-deluged businesses. He and his team have been working on their algorithms since 2005 and are finally preparing to release a version that is ready to be used in products. Numenta’s technology is aimed at a variety of applications, such as judging whether a credit card transaction is fraudulent, anticipating what a Web user will click next, or predicting the likelihood that a particular hospital patient will suffer a relapse.

In topic maps lingo, I would say the algorithms are developing parameters for subject recognition.

It would be interesting to see the development of parameters for subject recognition that could be sold or leased. As artifacts separate from software.

As far as I know, all searching software develops its own view from scratch, which seems pretty wasteful. Not to mention obtaining results of varying utility.


  1. Are there any search engines or appliances that don’t start from scratch when indexing? (1-2 pages, citations)
  2. What issues do current search engines present to the addition of subject recognition rules, data or results (from other software)? (3-5 pages, citations)
  3. What would you add to current search engines? Rules, results from other engines? Why? (3-5 pages, citations)

The R Journal, Issue 2/2

Friday, December 31st, 2010

The R Journal, Issue 2/2 has arrived!

Download complete issue.

Or Individual articles.

A number of topic map relevant papers are in this issue, ranging from stringr: modern, consistent string processing, Hadley Wickham; to the edgy cudaBayesreg: Bayesian Computation in CUDA, Adelino Ferreira da Silva; to a technique that started in the late 1950’s, The RecordLinkage Package: Detecting Errors in Data, Murat Sariyar and Andreas Borg.

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison – Post

Friday, December 31st, 2010

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase comparison

Not enough detail for decision making but a useful overview nonetheless.

Libraries and the Semantic Web (video)

Friday, December 31st, 2010

Libraries and the Semantic Web (video)

This is a very amusing video.

ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics

Friday, December 31st, 2010

ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics

Association for Computational Linguistics collection of over 19,100 (as of 2010-12-31) papers on computational linguistics.


Assume that you have this resource, CiteSeer, DBLP and several others.

  1. If copying of the underlying data isn’t possible/feasible, how would you search across these resources?
  2. What steps would you take to gather at least similar if not the same materials together? (Hint: The correct answer is not “use a topic map.” Or at least it isn’t a complete answer. What techniques would you use with a topic map?)
  3. What steps would you take to make improvements to mappings between resources or concepts available to other users?

RANLP 2011: Recent Advances In Natural Language Processing

Friday, December 31st, 2010

RANLP 2011: Recent Advances In Natural Language Processing Augusta SPA hotel, September 10-16, 2011, Hissar, Bulgaria

Call for Papers:

We invite papers reporting on recent advances in all aspects of Natural Language Processing (NLP). We encourage the representation of a broad range of areas including but not limited to the following: pragmatics, discourse, semantics, syntax, and the lexicon; phonetics, phonology, and morphology; mathematical models and complexity; text understanding and generation; multilingual NLP; machine translation, machine-aided translation, translation memory systems, translation aids and tools; corpus-based language processing; POS tagging; parsing; electronic dictionaries; knowledge acquisition; terminology; word-sense disambiguation; anaphora resolution; information retrieval; information extraction; text summarisation; term recognition; text categorisation; question answering; textual entailment; visualisation; dialogue systems; speech processing; computer-aided language learning; language resources; evaluation; and theoretical and application-oriented papers related to NLP.

Important Dates:

  • Conference paper submission notification: 11 April 2011
  • Conference paper submission deadline: 18 April 2011
  • Conference paper acceptance notification: 15 June 2011
  • Camera-ready versions of the conference papers: 20 July 2011

The proceedings from RANLP 2009 are typical for the conference.

Neo4J 1.2 – Released!

Thursday, December 30th, 2010

Neo4J 1.2 Released!

New features:

  • The Neo4j Server

    The Neo4j standalone server builds upon the RESTful API that was pre-released for Neo4j 1.1. The server provides a complete stand alone Neo4j graph database experience, making it easy to access Neo4j from any programming language or platform. Some of you have already provided great client libraries for languages such as Python, Ruby, PHP, the .Net stack and more. Links and further information about client libraries can be found at:

  • Neo4j High Availability

    The High Availability feature of Neo4j provides an easy way to set up a cluster of graph databases. This allows for read scalability and tolerates faults in any of the participating machines. Writes are allowed to any machine, but are synchronized with a slight delay across all of them.

    High Availability in Neo4j is still in quite an early stage of its evolution and thus still have a few limitations. While it provides scalability for read load, write operations are slightly slower. Adding new machines to a cluster still requires some manual work, and very large transactions cannot be transmitted across machines. These limitations will be addressed in the next version of Neo4j.

  • Some other noteworthy changes include:
    • Additional services for the Neo4j kernel can now be loaded during startup, or injected into a running instance. Examples of such additional services are the Neo4j shell server and the Neo4j management interface.
    • Memory footprint and read performance has been improved.
    • A new cache implementation has been added for high load, low latency workloads.
    • A new index API has been added that is more tightly integrated with the database. This new index API supports indexing relationships as well as nodes, and also supports indexing and querying multiple properties for each node or relationship. The old index API has been deprecated but remains available and will continue to receive bug fixes for a while.
    • The Neo4j shell supports performing path algorithm queries.
    • Built in automatic feedback to improve future versions of Neo4j. See:

Let me repeat part of that:

This new index API supports indexing relationships as well as nodes, and also supports indexing and querying multiple properties for each node or relationship.

Will be looking at the details on the indexing, more comments to follow.

Data Diligence: More Thoughts on Google Books’ Ngrams – Post

Thursday, December 30th, 2010

Data Diligence: More Thoughts on Google Books’ Ngrams

Matthew Hurst asks a number of interesting questions about the underlying data for Google Books’ Ngrams.

He illustrates that large amounts of data have the potential to be useful but, divorced from context or with only limited known context, they can be of limited utility.


  1. Spend at least 4-6 hours exploring (ok, playing) with Google Books’ Ngrams.
  2. Develop 3 or 4 questions you would like to answer with this data source.
  3. What additional information or context would you need to answer your questions in #2?

Inductive Logic Programming (and Martian Identifications)

Thursday, December 30th, 2010

Inductive Logic Programming: Theory and Methods Authors: Stephen Muggleton, Luc De Raedt


Inductive Logic Programming (ILP) is a new discipline which investigates the inductive construction of first-order clausal theories from examples and background knowledge. We survey the most important theories and methods of this new field. Firstly, various problem specifications of ILP are formalised in semantic settings for ILP, yielding a “model-theory” for ILP. Secondly, a generic ILP algorithm is presented. Thirdly, the inference rules and corresponding operators used in ILP are presented, resulting in a “proof-theory” for ILP. Fourthly, since inductive inference does not produce statements which are assured to follow from what is given, inductive inferences require an alternative form of justification. This can take the form of either probabilistic support or logical constraints on the hypothesis language. Information compression techniques used within ILP are presented within a unifying Bayesian approach to confirmation and corroboration of hypotheses. Also, different ways to constrain the hypothesis language, or specify the declarative bias are presented. Fifthly, some advanced topics in ILP are addressed. These include aspects of computational learning theory as applied to ILP, and the issue of predicate invention. Finally, we survey some applications and implementations of ILP. ILP applications fall under two different categories: firstly scientific discovery and knowledge acquisition, and secondly programming assistants.

A good survey of Inductive Logic Programming (ILP) if a bit dated. Feel free to suggest more recent surveys of the area.

As I mentioned under Mining Travel Resources on the Web Using L-Wrappers, the notion of interpretative domains is quite interesting.

I suspect, but cannot prove (at least at this point), that most useful mappings exist between closely related interpretative domains.

Closely related interpretative domains being composed of identifications of a subject that I will quickly recognize as alternative identifications.

Showing me a mapping that includes a Martian identification of my subject, which is not a closely related interpretative domain, is unlikely to be useful, at least to me. (I can’t speak for any potential Martians.)

How to Design Programs

Thursday, December 30th, 2010

How to Design Programs: An Introduction to Computing and Programming Authors: Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi (2003 version)

Update: see How to Design Programs, Second Edition.

Website includes the complete text.

The Amazon product description reads:

This introduction to programming places computer science in the core of a liberal arts education. Unlike other introductory books, it focuses on the program design process. This approach fosters a variety of skills–critical reading, analytical thinking, creative synthesis, and attention to detail–that are important for everyone, not just future computer programmers. The book exposes readers to two fundamentally new ideas. First, it presents program design guidelines that show the reader how to analyze a problem statement; how to formulate concise goals; how to make up examples; how to develop an outline of the solution, based on the analysis; how to finish the program; and how to test. Each step produces a well-defined intermediate product. Second, the book comes with a novel programming environment, the first one explicitly designed for beginners. The environment grows with the readers as they master the material in the book until it supports a full-fledged language for the whole spectrum of programming tasks. All the book’s support materials are available for free on the Web. The Web site includes the environment, teacher guides, exercises for all levels, solutions, and additional projects.

If we are going to get around to solving the hard subject identity problems in addition to those that are computationally convenient, there will need to be more collaboration across the liberal arts.

The Amazon page, How to Design Programs, is in error. I checked the ISBN numbers at: http://www.book The ISBN-13 works but the French, German and UK details point back to the 2001 printing. Bottom line: There is no 2008 edition of this work.

If you are interested, Matthias Felleisen, along with Robert Bruce Findler and Matthew Flatt, authored Semantics Engineering with PLT Redex in 2009. It sounds interesting, but the only review I saw was on Amazon.

Graph 500

Thursday, December 30th, 2010

Graph 500

From the website:

Data intensive supercomputer applications are increasingly important HPC workloads, but are ill-suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads.

Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks.

This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years.

What drew my attention to this site was the following quote in the IEEE article, Better Benchmarking for Supercomputers by Mark Anderson:

An “edge” here is a connection between two data points. For instance, when you buy Michael Belfiore’s Department of Mad Scientists from Amazon, one edge is the link in Amazon’s computer system between your user record and the Department of Mad Scientists database entry. One necessary but CPU-intensive job Amazon continually does is to draw connections between edges that enable it to say that 4 percent of customers who bought Belfiore’s book also bought Alex Abella’s Soldiers of Reason and 3 percent bought John Edwards’s The Geeks of War.

Within Amazon’s own system that work is CPU-intensive, but what if someone, say the U.S. government, wanted to map Amazon data to information it holds in various systems?

Can you say subject identity?
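
The “also bought” percentages in the quote can be sketched in a few lines of Python. This is only a minimal illustration, not Amazon’s method: the purchase edges, the customer ids and the function name also_bought are made up for the example.

```python
from collections import defaultdict


def also_bought(edges, title):
    """Return {other_title: percent of buyers of `title` who also bought it}."""
    # Group the (customer, title) edges by customer.
    baskets = defaultdict(set)
    for customer, item in edges:
        baskets[customer].add(item)

    # Keep only the customers who bought the title of interest.
    buyers = [items for items in baskets.values() if title in items]
    if not buyers:
        return {}

    # Count how many of those customers bought each other title.
    counts = defaultdict(int)
    for items in buyers:
        for other in items - {title}:
            counts[other] += 1
    return {other: 100.0 * n / len(buyers) for other, n in counts.items()}


# Hypothetical edges: each pair links a customer record to a book entry.
edges = [
    ("c1", "Department of Mad Scientists"),
    ("c1", "Soldiers of Reason"),
    ("c2", "Department of Mad Scientists"),
    ("c3", "Department of Mad Scientists"),
    ("c3", "Soldiers of Reason"),
    ("c4", "Department of Mad Scientists"),
    ("c4", "The Geeks of War"),
]
print(also_bought(edges, "Department of Mad Scientists"))
```

The counting works only because every edge uses the same customer ids and the same titles. Mapping those ids and titles to records held in some other system is where subject identity comes in.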

Mining Travel Resources on the Web Using L-Wrappers

Thursday, December 30th, 2010

Mining Travel Resources on the Web Using L-Wrappers Authors: Elvira Popescu, Amelia Bădică, and Costin Bădică


The work described here is part of an ongoing research on the application of general-purpose inductive logic programming, logic representation of wrappers (L-wrappers) and XML technologies (including the XSLT transformation language) to information extraction from the Web. The L-wrappers methodology is based on a sound theoretical approach and has already proved its efficacy on a smaller scale, in the area of collecting product information. This paper proposes the use of L-wrappers for tuple extraction from HTML in the domain of e-tourism. It also describes a method for translating L-wrappers into XSLT and illustrates it with the example of a real-world travel agency Web site.

Deeply interesting work in part due to the use of XSLT to extract tuples from HTML pages but also because a labeled ordered tree is used as an interpretive domain for patterns matched against the tree.

If the latter sounds familiar, it should: most data mining techniques specify a domain in which results (intermediate or otherwise) are going to be interpreted.

I will look around for other material on L-wrappers and inductive logic programming.
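
As a rough illustration of matching a pattern against a labeled ordered tree, the sketch below pulls (hotel, price) tuples out of a small well-formed HTML fragment using Python’s standard library. The markup, the field names and the hand-written pattern are assumptions for the example. The paper induces its wrappers with inductive logic programming and translates them to XSLT, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a travel agency page.
PAGE = """
<html><body>
  <table>
    <tr><td class="hotel">Hotel Alpha</td><td class="price">80</td></tr>
    <tr><td class="hotel">Hotel Beta</td><td class="price">120</td></tr>
  </table>
</body></html>
""".strip()


def extract_tuples(markup):
    """Return (hotel, price) tuples, one per table row, in document order."""
    root = ET.fromstring(markup)
    tuples = []
    # The "pattern": a tr node with one td labeled hotel and one labeled price.
    for row in root.iter("tr"):
        hotel = row.find("td[@class='hotel']")
        price = row.find("td[@class='price']")
        if hotel is not None and price is not None:
            tuples.append((hotel.text, int(price.text)))
    return tuples


print(extract_tuples(PAGE))
```

Real pages are rarely well-formed XML, so a production extractor would need an HTML parser or a clean-up step first.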

The Joy of Stats

Thursday, December 30th, 2010

The Joy Of Stats Available In Its Entirety

I am not sure that “…statistics are the sexiest subject around…” but if anyone could make it appear to be so, it would be Rosling.

Highly recommended for an entertaining account of statistics and data visualization.

You won’t learn the latest details but you will be left with an enthusiasm for incorporating such techniques in your next topic map.

BTW, does anyone know of a video editor/producer who would be interested in volunteering to film/produce The Joy of Topic Maps?

(I suppose the script would have to be written first. 😉 )

Setting Government Data Free With ScraperWiki

Wednesday, December 29th, 2010

Setting Government Data Free With ScraperWiki

reports on a video by Max Ogden illustrating the use of ScraperWiki to harvest government data.

If you are planning on adding government data to your topic map, this is a video you need to see.

SimpleGeo Makes Location Easy With Context and Places

Wednesday, December 29th, 2010

SimpleGeo Makes Location Easy With Context and Places reports on:

SimpleGeo Context takes a latitude and longitude and provides relevant contextual information such as weather, demographics, or neighborhood data for that co-ordinate.

SimpleGeo Places, which is a free database of business listings and Points of Interest (POI) that enables real-time community collaboration.

Interesting and free APIs that could add value to any topic map concerned with tourism or location information.

Trending Terms in Google’s Book Corpus – Post

Tuesday, December 28th, 2010

Trending Terms in Google’s Book Corpus

Matthew Hurst covers an interesting new tool at Google Book Corpus that allows tracking of terms over time.


  1. Pick at least 3 pairs of terms to track through this interface. (3-5 pages, no citations)
  2. Track only one term over 300 years of publication with one example of usage for every 30 years. (3-5 pages, citations)
  3. What similarity measures would you use to detect variation in the semantics of that term in a corpus covering 300 years? (3-5 pages, citations)

RuleML 2011 – International Symposium on Rules – Conference

Tuesday, December 28th, 2010

RuleML 2011 – International Symposium on Rules

From the website:

The International Symposium on Rules, RuleML, has evolved from an annual series of international workshops since 2002, international conferences in 2005 and 2006, and international symposia since 2007. In 2011 two instalments of the RuleML Symposium will take place. The first one will be held in conjunction with IJCAI 2011 (International Joint Conference on Artificial Intelligence) in Barcelona in July, and the second will be co-located with the Business Rule Forum to be held in late October-early November in North America including Challenge Award that this year will be dedicate to Rules and Ontologies.

Important Dates

  • Abstract submission: February 25, 2011 (11:59PM, UTC-12)
  • Paper submission: March 4, 2011 (11:59PM, UTC-12)
  • Notification of acceptance/rejection: March 31, 2011
  • Camera-ready copy due: April 15, 2011
  • RuleML-2011 dates: July 19-21, 2011


  • Rules and Automated Reasoning
  • Logic Programming and Non-monotonic Reasoning
  • Rules, Workflows and Business Processes
  • Rules, Agents and Norms
  • Rule-Based Distributed/Multi-Agent Systems
  • Rule-Based Policies, Reputation and Trust
  • Rule-based Event Processing and Reaction Rules
  • Fuzzy Rules and Uncertainty
  • Rule Transformation and Extraction
  • Vocabularies, Ontologies, and Business rules

A weighted similarity measure for non-conjoint rankings – Post

Monday, December 27th, 2010

A weighted similarity measure for non-conjoint rankings updates the recently published A Similarity Measure for Indefinite Rankings and provides an implementation.

Abstract from the article:

Ranked lists are encountered in research and daily life, and it is often of interest to compare these lists, even when they are incomplete or have only some members in common. An example is document rankings returned for the same query by different search engines. A measure of the similarity between incomplete rankings should handle non-conjointness, weight high ranks more heavily than low, and be monotonic with increasing depth of evaluation; but no measure satisfying all these criteria currently exists. In this article, we propose a new measure having these qualities, namely rank-biased overlap (RBO). The RBO measure is based on a simple probabilistic user model. It provides monotonicity by calculating, at a given depth of evaluation, a base score that is non-decreasing with additional evaluation, and a maximum score that is non-increasing. An extrapolated score can be calculated between these bounds if a point estimate is required. RBO has a parameter which determines the strength of the weighting to top ranks. We extend RBO to handle tied ranks and rankings of different lengths. Finally, we give examples of the use of the measure in comparing the results produced by public search engines, and in assessing retrieval systems in the laboratory.

There are also some comments about journal publishing delays that should serve as fair warning to other authors.
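
To make the abstract concrete, here is a minimal sketch of the core rank-biased overlap sum, truncated at a fixed depth. It assumes two rankings without ties or repeated items. It computes only the truncated sum, which is a lower bound on the full score, and not the article’s tighter bounds or its extrapolated score. The function name rbo_truncated is mine, not the authors’.

```python
def rbo_truncated(s, t, p=0.9, depth=None):
    """Truncated rank-biased overlap of rankings s and t.

    Computes (1 - p) * sum over d = 1..k of p**(d - 1) * A_d, where A_d is
    the fraction of items shared by the top-d prefixes of s and t.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")

    # Evaluate no deeper than the shorter list (or the requested depth).
    k = min(len(s), len(t))
    if depth is not None:
        k = min(k, depth)

    score = 0.0
    for d in range(1, k + 1):
        agreement = len(set(s[:d]) & set(t[:d])) / d
        score += p ** (d - 1) * agreement
    return (1.0 - p) * score


print(rbo_truncated(["a", "b", "c"], ["b", "a", "c"], p=0.5))
```

The parameter p sets how top-weighted the comparison is: smaller values concentrate the weight on the first few ranks. Because every term in the sum is non-negative, evaluating to a greater depth can only raise the truncated score, which matches the monotonicity the abstract asks for.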

Python 2.6 Graphics Cookbook – Review Forthcoming!

Monday, December 27th, 2010

Just a quick placeholder to say that I am reviewing Python 2.6 Graphics Cookbook

Python 2.6 Graphics Cookbook

I should have the review done in the next couple of weeks.

My primary interest is in the use of graphics with topic maps for visualization and analysis.

But, I don’t suppose familiarity with graphics in general ever hurt anyone. 😉

While you wait for the review, you might enjoy reading: Chapter 7 Combining Raster and Vector Pictures (free download).

Python Text Processing with NLTK2.0 Cookbook – Review Forthcoming!

Monday, December 27th, 2010

Just a quick placeholder to say that I am reviewing Python Text Processing with NLTK2.0 Cookbook

Python Text Processing

I should have the review done in the next couple of weeks.

In the longer term I will be developing a set of notes on the construction of topic maps using this toolkit.

While you wait for the review, you might enjoy reading: Chapter No.3 – Creating Custom Corpora (free download).


Monday, December 27th, 2010


From the website:

Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.

I had to look at the merge data widget.

Which is said to merge two data sets based on the values of selected attributes.

According to the documentation:

Merge Data widget is used to horizontally merge two data sets based on the values of selected attributes. On input, two data sets are required, A and B. The widget allows for selection of an attribute from each domain which will be used to perform the merging. When selected, the widget produces two outputs, A+B and B+A. The first output (A+B) corresponds to instances from input data A which are appended attributes from B, and the second output (B+A) to instances from B which are appended attributes from A.

The merging is done by the values of the selected (merging) attributes. For example, instances from from A+B are constructed in the following way. First, the value of the merging attribute from A is taken and instances from B are searched with matching values of the merging attributes. If more than a single instance from B is found, the first one is taken and horizontally merged with the instance from A. If no instance from B match the criterium, the unknown values are assigned to the appended attributes. Similarly, B+A is constructed.

Which illustrates the problem that topic maps solve rather neatly:

  1. How does a subsequent researcher reliably duplicate such a merger?
  2. How does a subsequent researcher reliably merge that data with other data?
  3. How do other researchers reliably merge that data with their own data?

Answer is: They can’t. Not enough information.

Question: How would you change the outcome for those three questions? In detail. (5-7 pages, citations)
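
For reference, the A+B behavior described in the documentation can be sketched in plain Python. This is not the widget’s actual code: rows are modeled as dictionaries, None stands in for the unknown value, attribute names in A and B are assumed to be distinct, and the function and attribute names are invented for the example.

```python
UNKNOWN = None  # stand-in for the widget's "unknown value"


def merge_a_plus_b(a_rows, b_rows, key_a, key_b):
    """Append attributes of the first matching B row to every A row."""
    # Keep only the first B row seen for each value of the merging attribute.
    first_match = {}
    for row in b_rows:
        first_match.setdefault(row[key_b], row)

    # Attributes of B to append (everything except the merging attribute).
    b_attrs = []
    for row in b_rows:
        for name in row:
            if name != key_b and name not in b_attrs:
                b_attrs.append(name)

    merged = []
    for row in a_rows:
        match = first_match.get(row[key_a], {})
        out = dict(row)
        for name in b_attrs:
            # No matching B row: the appended attribute stays unknown.
            out[name] = match.get(name, UNKNOWN)
        merged.append(out)
    return merged


a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}, {"id": 3, "name": "z"}]
b = [{"key": 1, "score": 10}, {"key": 1, "score": 99}, {"key": 2, "score": 20}]
print(merge_a_plus_b(a, b, "id", "key"))
```

Note what the output does not carry: which attributes were used for the match, why they were taken to identify the same thing, and the fact that the second B row with key 1 was silently dropped.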

TMVA Toolkit for Multivariate Data Analysis with ROOT

Monday, December 27th, 2010

TMVA Toolkit for Multivariate Data Analysis with ROOT

From the website:

The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated machine learning environment for the processing and parallel evaluation of multivariate classification and regression techniques. TMVA is specifically designed to the needs of high-energy physics (HEP) applications, but should not be restricted to these. The package includes:

TMVA consists of object-oriented implementations in C++ for each of these multivariate methods and provides training, testing and performance evaluation algorithms and visualization scripts. The MVA training and testing is performed with the use of user-supplied data sets in form of ROOT trees or text files, where each event can have an individual weight. The true event classification or target value (for regression problems) in these data sets must be known. Preselection requirements and transformations can be applied on this data. TMVA supports the use of variable combinations and formulas.


  1. Review TMVA documentation on one method in detail.
  2. Using a topic map, demonstrate supplementing that documentation with additional literature or examples.
  3. TMVA is not restricted to high energy physics but do you find citations of its use outside of high energy physics?


Monday, December 27th, 2010


From the website:

ROOT is a framework for data processing, born at CERN, at the heart of the research on high-energy physics.  Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.


  • Save data. You can save your data (and any C++ object) in a compressed binary form in a ROOT file.  The object format is also saved in the same file.  ROOT provides a data structure that is extremely powerful for fast access of huge amounts of data – orders of magnitude faster than any database.
  • Access data. Data saved into one or several ROOT files can be accessed from your PC, from the web and from large-scale file delivery systems used e.g. in the GRID.  ROOT trees spread over several files can be chained and accessed as a unique object, allowing for loops over huge amounts of data.
  • Process data. Powerful mathematical and statistical tools are provided to operate on your data.  The full power of a C++ application and of parallel processing is available for any kind of data manipulation.  Data can also be generated following any statistical distribution, making it possible to simulate complex systems.
  • Show results. Results are best shown with histograms, scatter plots, fitting functions, etc.  ROOT graphics may be adjusted real-time by few mouse clicks.  High-quality plots can be saved in PDF or other format.
  • Interactive or built application. You can use the CINT C++ interpreter or Python for your interactive sessions and to write macros, or compile your program to run at full speed. In both cases, you can also create a GUI.

Effective deployment of topic maps requires an understanding of how others identify their subjects.

Note that subjects in this context include not only the subjects in experimental data but also the detectors and programs used to analyze that data. (Think data preservation.)


  1. Review the documentation browser for ROOT.
  2. How would you integrate one or more of the years of RootTalk Digest into that documentation?
  3. What scopes would you create and how would you use them?
  4. How would you use a topic map to integrate subject specific content for data or analysis in ROOT?

Network Science – NetSci

Monday, December 27th, 2010

Warning: NetSci has serious issues with broken links.

Network Science – NetSci: An Extensive Set of Resources for Science in Drug Discovery

From the website:

Welcome to the Network Science website. This site is dedicated to the topics of pharmaceutical research and the use of advanced techniques in the discovery of new therapeutic agents. We endeavor to provide a comprehensive look at the industry and the tools that are in use to speed drug discovery and development.

I stumbled across this website while looking for computational chemistry resources.

Pharmaceutical research is rich in topic map type issues, from mapping across the latest reported findings in journal literature to matching those identifications to results in computational software.


  1. Develop a drug discovery account that illustrates how topic maps might or might not help in that process. (5-7 pages, citations)
  2. What benefits would a topic map bring to drug discovery and how would you illustrate those benefits for a grant application either to a pharmaceutical company or granting agency? (3-5 pages, citations)
  3. Where would you submit a grant application based on #2? (3-5 pages, citations) (Requires researching what activities in drug development are funded by particular entities.)
  4. Prepare a grant application based on the answer to #3. (length depends on grantor requirements)
  5. For extra credit, update and/or correct twenty (20) links from this site. (Check with me first, I will maintain a list of those already corrected.)

Data Management Slam Dunk – SPAM Warning

Monday, December 27th, 2010

The Data Management Slam Dunk: A Unified Integration Platform is a spam message that landed in my inbox today.

I have heard good things about Talend software but gibberish like:

There will never be a silver bullet for marshalling the increasing volumes of data, but at least there is one irrefutable truth: a unified data management platform can solve most of the problems that information managers encounter. In fact, by creating a centralized repository for data definitions, lineage, transformations and movements, companies can avoid many troubles before they occur.

makes me wonder if any of it is true.

Did you notice that the “irrefutable truth” is a sort of magic incantation?

If everything is dumped in one place, troubles just melt away.

It isn’t that simple.

The “presentation” never gives a clue as to how anyone would achieve these benefits in practice. It just keeps repeating the benefits and oh, that Talend is the way to get them.

Not quite as annoying as one of those belly-buster infomercials but almost.

I have been planning on reviewing the Talend software from a topic map perspective.

Suggestions of issues, concerns or particularly elegant parts that I should be aware of are most welcome.


Sunday, December 26th, 2010

Waffles Authors: Mike Gashler

From the website:

Waffles is a collection of command-line tools for performing machine learning tasks. These tools are divided into 4 script-friendly apps:

waffles_learn contains tools for supervised learning.
waffles_transform contains tools for manipulating data.
waffles_plot contains tools for visualizing data.
waffles_generate contains tools to generate certain types of data.

For people who prefer not to have to remember commands, waffles also includes a graphical tool called


which guides the user to generate a command that will perform the desired task.

While exploring the site I looked at the demo applications and found:

At some point, it seems, almost every scholar has an idea for starting a new journal that operates in some a-typical manner. This demo is a framework for the back-end of an on-line journal, to help get you started.

The “…operates in some a-typical manner” was close enough to the truth that I just had to laugh out loud.

Care to nominate your favorite software project that “…operates in some a-typical manner?”

Update: Almost a year later I revisited the site to find:

Michael S. Gashler. Waffles: A machine learning toolkit. Journal of Machine Learning Research, MLOSS 12:2383-2387, July 2011. ISSN 1532-4435.


Random Forests

Sunday, December 26th, 2010

Random Forests Authors: Leo Breiman, Adele Cutler

The home site for the Random Forests classification algorithm, with resources from its inventors, including the following philosophical note:

RF is an example of a tool that is useful in doing analyses of scientific data.

But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.

Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

I rather like that.

It is applicable to all the inferencing, machine learning, classification, and other tools you will see mentioned in this blog.

XML Schema Element Similarity Measures: A Schema Matching Context

Sunday, December 26th, 2010

XML Schema Element Similarity Measures: A Schema Matching Context Authors: Alsayed Algergawy, Richi Nayak, Gunter Saake


In this paper, we classify, review, and experimentally compare major methods that are exploited in the definition, adoption, and utilization of element similarity measures in the context of XML schema matching. We aim at presenting a unified view which is useful when developing a new element similarity measure, when implementing an XML schema matching component, when using an XML schema matching system, and when comparing XML schema matching systems.

I commend the entire paper for your reading but would draw your attention to one of the conclusions in particular:

Using a single element similarity measure is not sufficient to assess the similarity between XML schema elements. This necessitates the need to utilize several element measures exploiting both internal element features and external element relationships.

Does it seem plausible that a single subject similarity measure can work, but that it is better to use several?
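
To make the conclusion concrete, here is a minimal sketch of combining several element measures into one score. It is not taken from the paper: the three measures, the dictionary layout for an element, the numeric type family and the weights are all my own assumptions, chosen only to show two "internal" features (name, data type) alongside one "external" relationship (child elements).

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Internal feature: string similarity of element names, 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_similarity(a, b):
    """Internal feature: 1.0 for identical types, 0.5 for the same family."""
    numeric = {"xs:int", "xs:integer", "xs:decimal", "xs:float"}  # assumed family
    if a == b:
        return 1.0
    if a in numeric and b in numeric:
        return 0.5
    return 0.0


def children_similarity(a, b):
    """External relationship: Jaccard overlap of child element names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two leaf elements agree trivially
    return len(a & b) / len(a | b)


def combined_similarity(e1, e2, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of the three measures; the weights are illustrative only."""
    scores = (
        name_similarity(e1["name"], e2["name"]),
        type_similarity(e1["type"], e2["type"]),
        children_similarity(e1["children"], e2["children"]),
    )
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical elements from two schemas: different names, same structure.
author = {"name": "author", "type": "xs:string",
          "children": ["firstName", "lastName"]}
writer = {"name": "writer", "type": "xs:string",
          "children": ["firstName", "lastName"]}

print(name_similarity(author["name"], writer["name"]))
print(combined_similarity(author, writer))
```

The name measure alone scores "author" and "writer" poorly, while the combined score is pulled up by the matching type and children. Whether a weighted sum is the right way to combine measures is a separate question.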


  1. Compare this paper to any recent (last two years) paper on database schema similarity. What issues are the same, different, similar? (sorry, could not think of another word for it) (2-3 pages, citations)
  2. Create an annotated bibliography of ten (10) recent papers on XML or database schema similarity (excluding the papers in #1). (4-6 pages, citations)
  3. How would you use any of the similarity measures you have read about in a topic map? Or is similarity enough? (3-5 pages, no citations)

parf: Parallel Random Forest Algorithm

Saturday, December 25th, 2010

parf: Parallel Random Forest Algorithm

From the website:

The Random Forests algorithm is one of the best among the known classification algorithms, able to classify big quantities of data with great accuracy. Also, this algorithm is inherently parallelisable.

Originally, the algorithm was written in the programming language Fortran 77, which is obsolete and does not provide many of the capabilities of modern programming languages; also, the original code is not an example of “clear” programming, so it is very hard to employ in education. Within this project the program is adapted to Fortran 90. In contrast to Fortran 77, Fortran 90 is a structured programming language, legible — to researchers as well as to students.

The creator of the algorithm, Berkeley professor emeritus Leo Breiman, expressed a big interest in this idea in our correspondence. He has confirmed that no one has yet worked on a parallel implementation of his algorithm, and promised his support and help. Leo Breiman is one of the pioneers in the fields of machine learning and data mining, and a co-author of the first significant programs (CART – Classification and Regression Trees) in that field.

Well, while I was at it I decided to look around for any resources that might interest topic mappers in the new year. This one caught my eye.

Not much apparent activity so this might be one where a volunteer or two could make a real difference.
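
For anyone wondering why the website calls the algorithm "inherently parallelisable," the sketch below may help: each tree is trained on its own bootstrap sample and never needs to talk to the others. This is not parf's code or Breiman's. One-split stumps stand in for full trees, the function names are mine, and threads are used only to keep the example portable; a real implementation would spread the work over processes or machines.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def train_stump(job):
    """Train one tiny 'tree' (a single split) on a bootstrap sample."""
    data, seed = job
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]      # bootstrap: draw with replacement
    feature = rng.randrange(len(data[0][0]))       # random feature for this stump
    best = None
    for x, _ in sample:                            # try each sampled value as threshold
        t = x[feature]
        left = [y for xs, y in sample if xs[feature] <= t]
        right = [y for xs, y in sample if xs[feature] > t]
        if not left or not right:
            continue
        l_label, l_count = Counter(left).most_common(1)[0]
        r_label, r_count = Counter(right).most_common(1)[0]
        if best is None or l_count + r_count > best[0]:
            best = (l_count + r_count, t, l_label, r_label)
    if best is None:                               # no usable split: vote the majority
        majority = Counter(y for _, y in sample).most_common(1)[0][0]
        return (feature, 0.0, majority, majority)
    _, t, l_label, r_label = best
    return (feature, t, l_label, r_label)


def train_forest(data, n_trees=25, workers=4):
    """Each job is independent, so the pool can run them in any order."""
    jobs = [(data, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_stump, jobs))


def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: one feature, two classes.
data = [((i,), "low" if i < 5 else "high") for i in range(10)]
forest = train_forest(data)
print(predict(forest, (0,)), predict(forest, (9,)))
```

The only shared step is the vote at the end, which is why the training phase divides so cleanly.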


Saturday, December 25th, 2010

HaLoop Reported by Jack Park.

From the website:

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. However, these new platforms do not have built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph processing, model fitting, and so on.


Simply speaking, HaLoop = Ha, Loop:-) HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluate HaLoop on real queries and real datasets and find that, on average, HaLoop reduces query runtimes by 1.85 compared with Hadoop, and shuffles only 4% of the data between mappers and reducers compared with Hadoop.
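
To see what an "iterative program" looks like in map/reduce terms, here is a small sketch of my own: a PageRank-style computation written as a driver loop that re-runs a map step and a reduce step until the ranks stop changing. It uses plain Python and none of the Hadoop or HaLoop APIs. The function names, the damping factor, and the assumption that every page has at least one outgoing link are all mine.

```python
from collections import defaultdict


def map_phase(ranks, links):
    """Map: each page sends an equal share of its rank to the pages it links to."""
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)
        for target in outlinks:
            yield target, share


def reduce_phase(pairs, pages, damping=0.85):
    """Reduce: sum the shares arriving at each page and apply damping."""
    totals = defaultdict(float)
    for target, share in pairs:
        totals[target] += share
    n = len(pages)
    return {p: (1 - damping) / n + damping * totals[p] for p in pages}


def iterate(links, max_iters=100, tol=1e-6):
    """Driver loop: repeat map and reduce until the ranks barely change."""
    pages = sorted(links)                      # loop-invariant, built once
    ranks = {p: 1.0 / len(pages) for p in pages}
    passes = 0
    for passes in range(1, max_iters + 1):
        new_ranks = reduce_phase(map_phase(ranks, links), pages)
        change = sum(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tol:
            break
    return ranks, passes


# Made-up three-page graph; every page has at least one outgoing link.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, passes = iterate(links)
print(ranks, passes)
```

Note that `links` never changes between passes while `ranks` changes every time. In a plain MapReduce job chain the unchanging data would be reloaded and reshuffled on every pass, which is the sort of waste the loop-aware scheduling and caching described above are meant to reduce.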

Interesting project but svn reports the most recent commit was 2010-08-23 and the project wiki reflects the UsersManual was modified on 2010-09-04.

I will follow up with the owner and report back.

Update: 2010-12-26 – Email from the project owner advises of activity not reflected at the project site. Updates to appear in 2011-01. I will probably create another post and link back to this one and forward from this one.