Active Learning « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 22, 2013

Active learning, almost black magic

Filed under: Active Learning,Duke,Genetic Algorithms,Machine Learning — Patrick Durusau @ 6:53 pm

Active learning, almost black magic by Lars Marius Garshol.

From the post:

I’ve written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? It sounds like black magic, but it’s actually pretty simple.
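To make the comparator choice a little more concrete, here is a minimal Python sketch of two of the measures named above. It is not Duke's code: the function names, the normalisation of Levenshtein distance into a 0 to 1 similarity, and the use of whitespace tokens for the Dice coefficient are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient over sets of lower-cased whitespace tokens (an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))
```

Character-level measures such as Levenshtein tend to suit typos within a single value, while token-level measures such as Dice tolerate reordered or missing words, which is part of why choosing between them by hand is hard.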

I implemented a genetic algorithm that can set up a good configuration automatically. The genetic algorithm works by making lots of configurations, then removing the worst and making more of the best. The configurations that are kept are tweaked randomly, and the process is repeated over and over again. It’s dead simple, but it works fine. The problem is: how is the algorithm to know which configurations are the best? The obvious solution is to have test data that tells you which records should be linked, and which ones should not be linked.
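The loop described above can be sketched in a few lines of Python. This is an illustrative toy, not Duke's implementation: the configuration layout (two field weights plus a threshold), the `TEST_PAIRS` data, and the choice to keep survivors unchanged while filling the population with mutated copies are all assumptions.

```python
import random

# Hypothetical test data: (per-field similarity scores, should the pair be linked?)
TEST_PAIRS = [
    ((0.95, 0.90), True),
    ((0.90, 0.20), True),
    ((0.85, 0.60), True),
    ((0.30, 0.95), False),
    ((0.20, 0.10), False),
    ((0.40, 0.80), False),
]


def predict(config, scores):
    """A configuration is (weight_1, weight_2, threshold); link if the weighted sum passes."""
    *weights, threshold = config
    return sum(w * s for w, s in zip(weights, scores)) >= threshold


def fitness(config, pairs=TEST_PAIRS):
    """Fraction of test pairs the configuration classifies correctly."""
    return sum(predict(config, scores) == label for scores, label in pairs) / len(pairs)


def mutate(config, rng, step=0.1):
    """Tweak every value randomly, clamped to the range 0..1."""
    return tuple(min(1.0, max(0.0, v + rng.uniform(-step, step))) for v in config)


def evolve(generations=50, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.random() for _ in range(3)) for _ in range(population_size)]
    for _ in range(generations):
        # Remove the worst half, then make more of the best by mutating survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)
```

Everything here hinges on `fitness`, which needs pairs with known answers.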

But that leaves us with a bootstrapping problem. If you can create a set of test data big enough for this to work, and find all the correct links in that set, then you’re fine. But how do you find all the links? You can use Duke, but if you can set up Duke well enough to do that you don’t need the genetic algorithm. Can you do it in other ways? Maybe, but that’s hard work, quite possibly harder than just battling through the difficulties and creating a configuration.

So, what to do? For a year or so I was stuck here. I had something that worked, but it wasn’t really useful to anyone.

Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. Basically, the idea is to pick some record pairs that perhaps should be linked, and ask the user whether they should be linked or not. There’s an enormous number of pairs we could ask the user about, but most of these pairs provide very little information. The trick is to select those pairs which teach the algorithm the most.
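One common way to decide which pairs "teach the most" is to ask about the pairs on which a set of candidate configurations disagrees most. The Python sketch below illustrates that idea under stated assumptions: the configuration layout (field weights followed by a threshold), the `disagreement` score, and the function names are hypothetical, and the sketch does not claim to reproduce the method in the paper or in Duke.

```python
def predict(config, scores):
    """A configuration is (weight_1, ..., weight_n, threshold)."""
    *weights, threshold = config
    return sum(w * s for w, s in zip(weights, scores)) >= threshold


def disagreement(committee, scores):
    """0.0 when every configuration agrees, 1.0 when the votes are split evenly."""
    share_linking = sum(predict(c, scores) for c in committee) / len(committee)
    return 1.0 - abs(2 * share_linking - 1.0)


def most_informative(committee, candidates, k=1):
    """Return the k candidate pairs the committee disagrees about most."""
    ranked = sorted(candidates, key=lambda s: disagreement(committee, s), reverse=True)
    return ranked[:k]


# Hypothetical committee of four configurations and three candidate pairs,
# each pair represented by its per-field similarity scores.
committee = [(1.0, 0.0, 0.5), (0.0, 1.0, 0.5), (0.5, 0.5, 0.5), (0.5, 0.5, 0.9)]
candidates = [(0.95, 0.95), (0.10, 0.10), (0.90, 0.20)]
to_ask = most_informative(committee, candidates)
```

Here the first two candidates get a unanimous vote, so asking about them would change nothing, while the third splits the committee two against two and is the one worth showing to the user.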
…

This is great stuff.

Particularly since I have a training problem that lacks a training set.

Looking forward to trying this on “real-world problems” as Lars says.


April 18, 2012

DUALIST: Utility for Active Learning with Instances and Semantic Terms

Filed under: Active Learning,Bayesian Models,HCIR,Machine Learning — Patrick Durusau @ 6:08 pm

DUALIST: Utility for Active Learning with Instances and Semantic Terms

From the webpage:

DUALIST is an interactive machine learning system for quickly building classifiers for text processing tasks. It does so by asking “questions” of a human “teacher” in the form of both data instances (e.g., text documents) and features (e.g., words or phrases). It uses active learning and semi-supervised learning to build text-based classifiers at interactive speed.

(video demo omitted)

The goals of this project are threefold:

A practical tool to facilitate annotation/learning in text analysis projects.

A framework to facilitate research in interactive and multi-modal active learning. This includes enabling actual user experiments with the GUI (as opposed to simulated experiments, which are pervasive in the literature but sometimes inconclusive for use in practice) and exploring HCI issues, as well as supporting new dual supervision algorithms which are fast enough to be interactive, accurate enough to be useful, and might make more appropriate modeling assumptions than multinomial naive Bayes (the current underlying model).

A starting point for more sophisticated interactive learning scenarios that combine multiple “beyond supervised learning” strategies. See the proceedings of the recent ICML 2011 workshop on this topic.
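To give a feel for what "dual supervision" over a multinomial naive Bayes model might look like, here is a small Python sketch. It is not DUALIST's code: the class name, the `feature_boost` pseudo-count used to turn a labeled word into extra evidence for a class, and the smoothing choices are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class DualNaiveBayes:
    """Toy multinomial naive Bayes that accepts labeled documents and labeled words."""

    def __init__(self, classes, feature_boost=50.0, smoothing=1.0):
        self.classes = list(classes)
        self.feature_boost = feature_boost  # assumed pseudo-count for a labeled word
        self.smoothing = smoothing
        self.word_counts = {c: defaultdict(float) for c in self.classes}
        self.doc_counts = {c: 0 for c in self.classes}
        self.vocabulary = set()

    def label_instance(self, text, label):
        """The teacher labels a whole document."""
        self.doc_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def label_feature(self, word, label):
        """The teacher says a word is indicative of a class."""
        self.word_counts[label][word.lower()] += self.feature_boost
        self.vocabulary.add(word.lower())

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        log_scores = {}
        for c in self.classes:
            prior = (self.doc_counts[c] + 1) / (total_docs + len(self.classes))
            denom = sum(self.word_counts[c].values()) + self.smoothing * vocab_size
            score = math.log(prior)
            for word in text.lower().split():
                if word in self.vocabulary:  # ignore words never seen or labeled
                    count = self.word_counts[c].get(word, 0.0)
                    score += math.log((count + self.smoothing) / denom)
            log_scores[c] = score
        top = max(log_scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


model = DualNaiveBayes(["sports", "politics"])
model.label_feature("goal", "sports")
model.label_feature("senate", "politics")
probabilities = model.predict_proba("late goal wins match")
```

Labeling a single word touches every document that contains it, which is one reason feature queries can make a classifier useful after only a handful of answers.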

This could be quite useful for authoring a topic map across a corpus of materials, with interactive recognition of occurrences of subjects, etc.

Sponsored in part by the folks at DARPA. Unlike Al Gore, they did build the Internet.


February 29, 2012

Will the Circle Be Unbroken? Interactive Annotation!

Filed under: Active Learning,Annotation,Bayesian Data Analysis,Classification,Classifier,Machine Learning — Patrick Durusau @ 7:21 pm

I have to agree with Bob Carpenter, the title is a bit much:

Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

From the post:

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build a reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

Both the paper and Bob’s review merit close reading.


October 12, 2011

Active learning: far from solved

Filed under: Active Learning,Linguistics — Patrick Durusau @ 4:38 pm

Active learning: far from solved

From the post:

As Daniel Hsu and John Langford pointed out recently, there has been a lot of recent progress in active learning. This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it’s very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don’t tend to make much use of active learning and (b) why the problem is far from solved. (a will lead to b.)
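For readers who have not met the two baseline strategies mentioned above, the following Python sketch shows one common formulation of each: entropy of the predicted class distribution for query-by-uncertainty, and vote entropy for query-by-committee. The function names and the toy models are assumptions for illustration, and other scoring rules (margin, least confidence) are equally standard.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def query_by_uncertainty(pool, predict_proba):
    """QBU: pick the unlabeled item whose predicted class distribution is least certain."""
    return max(pool, key=lambda x: entropy(predict_proba(x)))


def query_by_committee(pool, committee):
    """QBC: pick the unlabeled item on which the committee's votes are most spread out."""
    def vote_entropy(x):
        votes = Counter(member(x) for member in committee)
        return entropy([n / len(committee) for n in votes.values()])
    return max(pool, key=vote_entropy)


# Hypothetical one-dimensional example: items are numbers between 0 and 1.
def toy_proba(x):
    return (x, 1 - x)  # pretend x is the probability of the positive class


toy_committee = [lambda x, t=t: x > t for t in (0.3, 0.6, 0.8)]  # three threshold classifiers

uncertain_pick = query_by_uncertainty([0.1, 0.5, 0.9], toy_proba)
committee_pick = query_by_committee([0.1, 0.7, 0.9], toy_committee)
```

Both strategies label only the items they select and ignore the rest of the pool.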

This is a deeply interesting article that could give rise to mini and major projects. I particularly like his point about not throwing away training data. No, you have to read the post for yourself. It’s not that long.
