Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

February 19, 2011

Intelligent Ruby + Machine Learning

Filed under: Artificial Intelligence,Learning Classifier,Machine Learning — Patrick Durusau @ 4:11 pm

Intelligent Ruby + Machine Learning

Entertaining slide deck arguing that more data beats both less data and more complex models.

That is its essential point, but it does conclude with useful references.

It also has examples that may increase your interest in “machine learning.”

February 17, 2011

Encog Java and DotNet Neural Network Framework

Filed under: .Net,Encog,Java,Machine Learning,Neural Networks,Silverlight — Patrick Durusau @ 6:56 am

Encog Java and DotNet Neural Network Framework

From the website:

Encog is an advanced neural network and machine learning framework. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using multithreaded resilient propagation. Encog can also make use of a GPU to further speed processing time. A GUI based workbench is also provided to help model and train neural networks. Encog has been in active development since 2008.

Encog is available for Java, .Net and Silverlight.

An important project for at least two reasons.

First, the obvious applicability to the creation of topic maps using machine learning techniques.

Second, it demonstrates that supporting Java, .Net and Silverlight isn’t, you know, all that weird.

The world is changing and becoming somewhat more interoperable.

Topic maps have a role to play in that process, both in terms of the semantic interoperability of the infrastructure and of the data it contains.

February 14, 2011

PyML

Filed under: Machine Learning — Patrick Durusau @ 1:44 pm

PyML is an interactive object-oriented framework for machine learning in Python

From the website:

PyML has been tested on Mac OS X and Linux. Some components are in C++ so it’s not automatically portable.

Here are some key features of “PyML”:

  • Classifiers: Support Vector Machines (SVM), nearest neighbor classifiers, ridge regression
  • Multi-class methods (one-against-one and one-against-rest)
  • Feature selection (filter methods, RFE, multiplicative update)
  • Model selection
  • Syntax for combining classifiers
  • Classifier testing (cross-validation, error rates, ROC curves, statistical test for comparing classifiers)

If you are running Mac OS X, please let me know what you think about this package.
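Here is what a minimal session looks like, following the PyML tutorial as best I recall it; the class and argument names are from memory and ‘iris.data’ is a placeholder, so check the current documentation before relying on this sketch:

# A minimal PyML session, reconstructed from the tutorial; names here are
# from memory and 'iris.data' is a placeholder file, so verify against
# the PyML documentation for your version.
from PyML import VectorDataSet, SVM

data = VectorDataSet('iris.data', labelsColumn=-1)  # last column holds labels
classifier = SVM()                                  # default SVM classifier
classifier.train(data)

# Cross-validation reports error rates, ROC curves and related statistics.
results = classifier.cv(data, numFolds=5)
print(results)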

Machine Learning Lectures (Video)

Filed under: Machine Learning — Patrick Durusau @ 10:34 am

Machine Learning by Andrew Ng, Stanford University, on iTunes.

These videos will not make the Saturday night movie schedule in most households but will definitely repay close study.

Augmented topic map authoring is a necessary part of creating topic maps for large data sets.

February 12, 2011

Pattern recognition and machine learning

Filed under: Machine Learning,Pattern Recognition — Patrick Durusau @ 5:26 pm

Pattern recognition and machine learning by Christopher M. Bishop was mentioned in Which Automatic Differentiation Tool for C/C++?, a post by Bob Carpenter.

I ran across another reference to it today that took me to a page with exercise solutions, corrections and other materials that will be of interest if you are using the book for a class or self-study.

See: PRML: Pattern Recognition and Machine Learning

I was impressed enough by the materials to go ahead and order a copy of it.

It is fairly long and I have to start up a blog on ODF (Open Document Format), so don’t expect a detailed summary any time soon.

February 8, 2011

Which Automatic Differentiation Tool for C/C++?

Which Automatic Differentiation Tool for C/C++?

OK, not immediately obvious why this is relevant to topic maps.

Nor are Bob Carpenter’s references:

I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

particularly useful.

I suspect the two book references are:

  • David J.C. MacKay, Information Theory, Inference, and Learning Algorithms
  • Christopher M. Bishop, Pattern Recognition and Machine Learning

but I haven’t asked, in part to illustrate the problem of resolving any entity reference. Both authors have written other books touching on the same subjects, so my guesses may or may not be correct.

Oh, relevance to topic maps. The technique of automatic differentiation is used in Hamiltonian Monte Carlo methods to generate gradients. Still not helpful? Isn’t to me either.

Ah, what about Bayesian models in IR? That made the light go on!
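And automatic differentiation itself is less exotic than it sounds. A minimal forward-mode sketch using dual numbers (my own illustration, not from Carpenter’s post):

# Forward-mode automatic differentiation with dual numbers: every value
# carries its derivative along, so gradients come out exact rather than
# approximated by finite differences.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

x = Dual(2.0, 1.0)        # seed the input's derivative with 1
f = x * x + x * 3         # f(x) = x^2 + 3x, so f'(2) = 2*2 + 3 = 7
print(f.value, f.deriv)   # 10.0 7.0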

I will be discussing ways to show more immediate relevance to topic maps, at least for some posts, in post #1000.

It isn’t as far away as you might think.

February 4, 2011

The Best Machine Learning Course on the Web

Filed under: Machine Learning — Patrick Durusau @ 9:06 am

The Best Machine Learning Course on the Web by Shubhendu Trivedi.

A bit dated (2008) and reported by Trivedi to be a very broad, survey-type course.

Can anyone suggest anything more recent, of equal quality?

February 3, 2011

PyBrain: The Python Machine Learning Library

PyBrain: The Python Machine Learning Library

From the website:

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms.

PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.

How is PyBrain different?

While there are a few machine learning libraries out there, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms for state-of-the-art research. We are constantly working on more and faster algorithms, developing new environments and improving usability.

What PyBrain can do

PyBrain, as its written-out name already suggests, contains algorithms for neural networks, for reinforcement learning (and the combination of the two), for unsupervised learning, and evolution. Since most of the current problems deal with continuous state and action spaces, function approximators (like neural networks) must be used to cope with the large dimensionality. Our library is built around neural networks in the kernel and all of the training methods accept a neural network as the to-be-trained instance. This makes PyBrain a powerful tool for real-life tasks.
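The easy-to-use claim holds up. Here is the XOR quick start much as PyBrain’s own documentation presents it (module paths from memory, so check the current docs):

# Build a small feed-forward network and train it on XOR with
# backpropagation, the canonical PyBrain quick start.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

net = buildNetwork(2, 3, 1)            # 2 inputs, 3 hidden units, 1 output
ds = SupervisedDataSet(2, 1)
for inp, target in [((0, 0), (0,)), ((0, 1), (1,)),
                    ((1, 0), (1,)), ((1, 1), (0,))]:
    ds.addSample(inp, target)

trainer = BackpropTrainer(net, ds)
for _ in range(1000):                  # each call is one training epoch
    trainer.train()

print(net.activate((0, 1)))            # should be close to 1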

Another tool kit to assist in the construction of topic maps.

And another likely contender for the Topic Map Competition!

MALLET: MAchine Learning for LanguagE Toolkit
Topic Map Competition (TMC) Contender?

MALLET: MAchine Learning for LanguagE Toolkit

From the website:

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to “features”, a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

In addition to classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields. These methods are implemented in an extensible system for finite state transducers.

Topic models are useful for analyzing large collections of unlabeled text. The MALLET topic modeling toolkit contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and Hierarchical LDA.

Many of the algorithms in MALLET depend on numerical optimization. MALLET includes an efficient implementation of Limited Memory BFGS, among many other optimization methods.

In addition to sophisticated Machine Learning applications, MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system of “pipes”, which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors.

An add-on package to MALLET, called GRMM, contains support for inference in general graphical models, and training of CRFs with arbitrary graphical structure.
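For a sense of what using it looks like, topic modeling in MALLET is a two-command affair: import the corpus, then train. A sketch driving MALLET’s documented command line from Python; the corpus/ directory and the output file names are placeholders:

# Import a directory of text files, then train an LDA topic model, using
# MALLET's documented command-line interface.
import subprocess

subprocess.run(["bin/mallet", "import-dir",
                "--input", "corpus/",            # one text file per document
                "--output", "corpus.mallet",
                "--keep-sequence",               # required for topic modeling
                "--remove-stopwords"], check=True)

subprocess.run(["bin/mallet", "train-topics",
                "--input", "corpus.mallet",
                "--num-topics", "20",
                "--output-topic-keys", "topic-keys.txt",   # top words per topic
                "--output-doc-topics", "doc-topics.txt"],  # topic mix per document
               check=True)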

Another tool to assist in the authoring of a topic map from a large data set.

It would be interesting but beyond the scope of the topic maps class, to organize a competition around several of the natural language processing packages.

To have a common data set, to be released on X date, with topic maps due, say, within 24 hours (there is a TV show with that in the title, or so I am told).

Will have to give that some thought.

Could be both interesting and entertaining.

January 31, 2011

Pseudo-Code: A New Definition

Filed under: Machine Learning,Sets,Subject Identity,Topic Maps — Patrick Durusau @ 7:24 am

How to Speed up Machine Learning using a Set-Oriented Approach

The detail article for Need faster machine learning? Take a set-oriented approach, which I mentioned in a separate post.

Well, somewhat more detail.

Gives new meaning to pseudo-code:

The application side becomes:

Computing the model:

Fetch “compute-model over data items”

Classifying new items:

Fetch “classify over data items”

I am reminded of the cartoon with two people at a blackboard, where one says, “I think you should be more explicit here in step two,” while the step in question reads, “Then a miracle occurs.”
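To make the miracle a bit more explicit, here is a toy version of the set-oriented idea in SQLite: the “model” is computed by the database engine as a single GROUP BY rather than by looping over records in application code. The schema and the spam/ham setup are my own illustration, not the article’s:

# "Compute the model" as a set operation: per-label word counts via GROUP BY,
# then "classify new items" as a join against the model table. The scoring
# here just sums co-occurrence counts, a stand-in for a real classifier.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training (doc_id INTEGER, word TEXT, label TEXT);
INSERT INTO training VALUES
  (1, 'cheap', 'spam'), (1, 'pills', 'spam'),
  (2, 'meeting', 'ham'), (2, 'agenda', 'ham'),
  (3, 'cheap', 'spam'), (3, 'meeting', 'spam');
""")

# The model: one set-oriented pass over the training data.
conn.execute("""
CREATE TABLE model AS
SELECT word, label, COUNT(*) AS n
FROM training
GROUP BY word, label
""")

# Classify a new document by joining its words against the model.
conn.execute("CREATE TABLE new_doc (word TEXT)")
conn.executemany("INSERT INTO new_doc VALUES (?)", [('cheap',), ('meeting',)])

for label, score in conn.execute("""
    SELECT m.label, SUM(m.n) AS score
    FROM new_doc d JOIN model m ON m.word = d.word
    GROUP BY m.label ORDER BY score DESC"""):
    print(label, score)

Classification becomes a join against the model table, which is exactly the sort of work a database engine is built to make fast.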

How about you?

January 29, 2011

Need faster machine learning? Take a set-oriented approach

Filed under: Machine Learning,Sets,Subject Identity — Patrick Durusau @ 5:00 pm

Need faster machine learning? Take a set-oriented approach.

Roger Magoulas, using not-small iron, reports:

The result: The training set was processed and the sample data set classified in six seconds. We were able to classify the entire 400,000-record data set in under six minutes — more than a four-orders-of-magnitude records processed per minute (26,000-fold) improvement. A process that would have run for days, in its initial implementation, now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

In other words, set-oriented machine learning makes for:

  • Handling larger and more diverse data sets
  • Applying machine learning to a larger set of problems
  • Faster turnarounds
  • Less risk
  • Better focus on a problem
  • Improved accuracy, greater understanding and more usable results
Seems to me sameness of subject representation is a classification task. Yes?

Going from days to minutes sounds attractive to me.

How about you?

    January 28, 2011

    Sofia-ML and Maui: Two Cool Machine Learning and Extraction libraries – Post

    Filed under: Extraction,Machine Learning — Patrick Durusau @ 7:21 am

    Sofia-ML and Maui: Two Cool Machine Learning and Extraction libraries

    Jeff Dalton reports on two software packages for text analysis.

    These are examples of just some of the tools that could be run on a corpus like the Afghan War Diaries.

    January 12, 2011

    Information Theory, Inference, and Learning Algorithms

    Filed under: Inference,Information Theory,Machine Learning — Patrick Durusau @ 11:46 am

Information Theory, Inference, and Learning Algorithms by David J.C. MacKay. The full text of the 2005 printing is available for download, and software is also available.

    From a review that I read (http://dx.doi.org/10.1145/1189056.1189063), MacKay treats machine learning as the other side of the coin from information theory.

    Take the time to visit MacKay’s homepage.

    There you will find his book Sustainable Energy – Without the Hot Air. Highly entertaining.

    January 10, 2011

    Walking Towards A Topic Map

    Filed under: Graphs,Machine Learning — Patrick Durusau @ 6:55 pm

Improving graph-walk-based similarity with reranking: Case studies for personal information management
Authors: Einat Minkov, William W. Cohen
Keywords: graph walk, learning, semistructured data, PIM

    Abstract:

    Relational or semistructured data is naturally represented by a graph, where nodes denote entities and directed typed edges represent the relations between them. Such graphs are heterogeneous, describing different types of objects and links. We represent personal information as a graph that includes messages, terms, persons, dates, and other object types, and relations like sent-to and has-term. Given the graph, we apply finite random graph walks to induce a measure of entity similarity, which can be viewed as a tool for performing search in the graph. Experiments conducted using personal email collections derived from the Enron corpus and other corpora show how the different tasks of alias finding, threading, and person name disambiguation can be all addressed as search queries in this framework, where the graph-walk-based similarity metric is preferable to alternative approaches, and further improvements are achieved with learning. While researchers have suggested to tune edge weight parameters to optimize the graph walk performance per task, we apply reranking to improve the graph walk results, using features that describe high-level information such as the paths traversed in the walk. High performance, together with practical runtimes, suggest that the described framework is a useful search system in the PIM domain, as well as in other semistructured domains. (emphasis in original)

    OK, so I lied. The title of the post isn’t the title of the article. Sue me. 😉

On the other hand, you will find that the authors use relatedness and similarity interchangeably (footnote 4), which I found rather odd.

    My point being that creation of a topic map can be viewed as a process of refinement.

    Based on some measure of similarity, you can decide that enough information has been identified or gathered together about a subject and simply stop.

There may well be additional information that could be refined out of a graph about a subject, but there is no rule that compels you to do so.
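To make the graph-walk part concrete, here is a toy sketch of walk-based similarity (my own illustration, far simpler than the paper’s typed-edge walks with learned reranking):

# Entities are nodes in a small email graph; a short random walk with
# restart from a query node induces a similarity score over the others.
from collections import defaultdict

edges = {
    "msg1": ["alice", "bob", "term:budget"],
    "msg2": ["alice", "carol", "term:budget"],
    "msg3": ["bob", "term:lunch"],
}
graph = defaultdict(set)
for msg, nodes in edges.items():
    for n in nodes:
        graph[msg].add(n)
        graph[n].add(msg)

def walk_similarity(start, steps=6, restart=0.3):
    """Approximate random-walk-with-restart scores by power iteration."""
    scores = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in scores.items():
            nxt[start] += restart * p              # jump back to the query
            for nb in graph[node]:                 # spread the rest evenly
                nxt[nb] += (1 - restart) * p / len(graph[node])
        scores = nxt
    return scores

# Which entities are most related to "alice"?
for node, s in sorted(walk_similarity("alice").items(), key=lambda kv: -kv[1]):
    if node != "alice":
        print(f"{node:12s} {s:.3f}")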

    Large Scale Data Mining Using Genetics-Based Machine Learning

    Filed under: Data Mining,Machine Learning — Patrick Durusau @ 3:16 pm

Large Scale Data Mining Using Genetics-Based Machine Learning
Authors: Jaume Bacardit, Xavier Llorà

    Tutorial on data mining with genetics-based machine learning algorithms.

    Usual examples of exploding information from genetics to high energy physics.

    While those are good examples, it really isn’t necessary to go there in order to get large scale data sets.

    Imagine constructing a network for all the entities and their relationships in a single issue of the New York Times.

That data isn’t as easy to acquire or to process as genetic databases or results from the Large Hadron Collider.

But that is a question of ease of access and processing, not of whether the data is large scale.

    The finance pages alone have listings for all the major financial institutions in the country. What about mapping their relationships to each other?

    Or for that matter, mapping the phone calls, emails and other communications between the stock trading houses? Broken down by subjects discussed.

As often as not, important problems have data that is difficult to acquire. That doesn’t make them any less important.

    January 9, 2011

    Center for Computational Analysis of Social and Organizational Systems (CASOS)

    Center for Computational Analysis of Social and Organizational Systems (CASOS)

Home of both ORA and AutoMap, but I thought it merited an entry of its own.

    Directed by Dr. Kathleen Carley:

    CASOS brings together computer science, dynamic network analysis and the empirical study of complex socio-technical systems. Computational and social network techniques are combined to develop a better understanding of the fundamental principles of organizing, coordinating, managing and destabilizing systems of intelligent adaptive agents (human and artificial) engaged in real tasks at the team, organizational or social level. Whether the research involves the development of metrics, theories, computer simulations, toolkits, or new data analysis techniques advances in computer science are combined with a deep understanding of the underlying cognitive, social, political, business and policy issues.

CASOS is a university wide center drawing on a group of world class faculty, students and research and administrative staff in multiple departments at Carnegie Mellon. CASOS fosters multi-disciplinary research in which students and faculty work with students and faculty in other universities as well as scientists and practitioners in industry and government. CASOS research leads the way in examining network dynamics and in linking social networks to other types of networks such as knowledge networks. This work has led to the development of new statistical toolkits for the collection and analysis of network data (Ora and AutoMap). Additionally, a number of validated multi-agent network models in areas as diverse as network evolution, bio-terrorism, covert networks, and organizational adaptation have been developed and used to increase our understanding of real socio-technical systems.

    CASOS research spans multiple disciplines and technologies. Social networks, dynamic networks, agent based models, complex systems, link analysis, entity extraction, link extraction, anomaly detection, and machine learning are among the methodologies used by members of CASOS to tackle real world problems.

    Definitely a group that bears watching by anyone interested in topic maps!

    January 3, 2011

    Vowpal Wabbit (Fast Online Learning)

    Filed under: Machine Learning,Subject Identity — Patrick Durusau @ 4:01 pm

    Vowpal Wabbit (Fast Online Learning) by John Langford.

    From the website:

    There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation.

    I rather like that!

    Suspect the same is true for subject identity recognition algorithms.

    People have fast ones that require little or no programming. 😉

    What it will take to replicate such intrinsically fast subject recognition algorithms in digital form remains a research question.

    December 27, 2010

    TMVA Toolkit for Multivariate Data Analysis with ROOT

    Filed under: HEP - High Energy Physics,Inference,Machine Learning — Patrick Durusau @ 2:22 pm

    TMVA Toolkit for Multivariate Data Analysis with ROOT

    From the website:

The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated machine learning environment for the processing and parallel evaluation of multivariate classification and regression techniques. TMVA is specifically designed to the needs of high-energy physics (HEP) applications, but should not be restricted to these. The package includes a variety of multivariate classification and regression methods.

    TMVA consists of object-oriented implementations in C++ for each of these multivariate methods and provides training, testing and performance evaluation algorithms and visualization scripts. The MVA training and testing is performed with the use of user-supplied data sets in form of ROOT trees or text files, where each event can have an individual weight. The true event classification or target value (for regression problems) in these data sets must be known. Preselection requirements and transformations can be applied on this data. TMVA supports the use of variable combinations and formulas.

    Questions:

    1. Review TMVA documentation on one method in detail.
    2. Using a topic map, demonstrate supplementing that documentation with additional literature or examples.
3. TMVA is not restricted to high energy physics, but can you find citations of its use outside of high energy physics?

    December 26, 2010

    Waffles

    Filed under: Inference,Machine Learning,Random Forests — Patrick Durusau @ 5:38 pm

Waffles
Author: Mike Gashler

    From the website:

    Waffles is a collection of command-line tools for performing machine learning tasks. These tools are divided into 4 script-friendly apps:

    waffles_learn contains tools for supervised learning.
    waffles_transform contains tools for manipulating data.
    waffles_plot contains tools for visualizing data.
    waffles_generate contains tools to generate certain types of data.

    For people who prefer not to have to remember commands, waffles also includes a graphical tool called

    waffles_wizard

    which guides the user to generate a command that will perform the desired task.

    While exploring the site I looked at the demo applications and:

    At some point, it seems, almost every scholar has an idea for starting a new journal that operates in some a-typical manner. This demo is a framework for the back-end of an on-line journal, to help get you started.

The “…operates in some a-typical manner” was close enough to the truth that I just had to laugh out loud.

    Care to nominate your favorite software project that “…operates in some a-typical manner?”


    Update: Almost a year later I revisited the site to find:

    Michael S. Gashler. Waffles: A machine learning toolkit. Journal of Machine Learning Research, MLOSS 12:2383-2387, July 2011. ISSN 1532-4435.

    Enjoy!

    December 9, 2010

    Mining of Massive Datasets – eBook

    Mining of Massive Datasets

Jeff Dalton, of Jeff’s Search Engine Caffè, reports a new data mining book by Anand Rajaraman and Jeffrey D. Ullman (yes, that Jeffrey D. Ullman, think “dragon book.”).

    A free eBook no less.

    Read Jeff’s post on your way to get a copy.

    Look for more comments as I read through it.

    Has anyone written a comparison of the recent search engine titles? Just curious.


    Update: New version out in hard copy and e-book remains available. See: Mining Massive Data Sets – Update

    November 26, 2010

    Infer.NET

    Filed under: Bioinformatics,Inference,Machine Learning — Patrick Durusau @ 11:02 am

    Infer.NET

    From the website:

    Infer.NET is a framework for running Bayesian inference in graphical models. It can also be used for probabilistic programming as shown in this video.

    You can use Infer.NET to solve many different kinds of machine learning problems, from standard problems like classification or clustering through to customised solutions to domain-specific problems. Infer.NET has been used in a wide variety of domains including information retrieval, bioinformatics, epidemiology, vision, and many others.

I should not have been surprised, but use of a .Net language is required to use Infer.NET.

I would appreciate comments from anyone who uses Infer.NET for inferencing to assist in the authoring of topic maps.

    November 1, 2010

    Introduction to Graphical Models for Data Mining

    Filed under: Data Mining,Graphs,Machine Learning — Patrick Durusau @ 4:32 pm

    Introduction to Graphical Models for Data Mining by Arindam Banerjee, Department of Computer Science and Engineering, University of Minnesota.

    Abstract:

    Graphical models for large scale data mining constitute an exciting development in statistical data analysis which has gained significant momentum in the past decade. Unlike traditional statistical models which often make `i.i.d.’ assumptions, graphical models acknowledge dependencies among variables of interest and investigate inference/prediction while taking into account such dependencies. In recent years, latent variable Bayesian networks, such as latent Dirichlet allocation, stochastic block models, Bayesian co-clustering, and probabilistic matrix factorization techniques have achieved unprecedented success in a variety of application domains including topic modeling and text mining, recommendation systems, multi-relational data analysis, etc. The tutorial will give a broad overview of graphical models, and discuss recent developments in the context of mixed-membership models, matrix analysis models, and their generalizations. The tutorial will present a balanced mix of models, inference/learning methods, and applications.

    Slides (pdf)
    Slides (ppt)

    If you plan on using data mining as a source for authoring topic maps, graphical models are on your reading list.
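The tutorial is slides rather than code, but one of the models it covers, latent Dirichlet allocation, is easy to try. A minimal sketch using the gensim library (my choice of tool, not the tutorial’s):

# Fit a two-topic LDA model to a toy corpus and print the topics.
from gensim import corpora, models

docs = [
    "topic maps merge subject identifiers",
    "bayesian networks model variable dependencies",
    "subject identity drives topic map merging",
    "graphical models encode conditional dependencies",
]
texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50)
for topic_id, words in lda.print_topics():
    print(topic_id, words)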

    Questions:

    1. Would you use the results of a Bayesian network to author an entry in a topic map? Why/why not? (2-3 pages, no citations)
    2. Would you use the results of a Bayesian network to author an entry in a library catalog? Why/why not? (2-3 pages, no citations)
    3. Do we attribute certainty to library catalog entries that are actually possible entries for a particular item? (discussion question)
    4. Examples of the use of Bayesian networks in classification for library catalogs?

    October 25, 2010

    Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis
Authors: Scott Doyle, Anant Madabhushi

    Abstract:

Supervised classifiers require manually labeled training samples to classify unlabeled objects. Active Learning (AL) can be used to selectively label only “ambiguous” samples, ensuring that each labeled sample is maximally informative. This is invaluable in applications where manual labeling is expensive, as in medical images where annotation of specific pathologies or anatomical structures is usually only possible by an expert physician. Existing AL methods use a single definition of ambiguity, but there can be significant variation among individual methods. In this paper we present a consensus of ambiguity (CoA) approach to AL, where only samples which are consistently labeled as ambiguous across multiple AL schemes are selected for annotation. CoA-based AL uses fewer samples than Random Learning (RL) while exploiting the variance between individual AL schemes to efficiently label training sets for classifier training. We use a consensus ratio to determine the variance between AL methods, and the CoA approach is used to train classifiers for three different medical image datasets: 100 prostate histopathology images, 18 prostate DCE-MRI patient studies, and 9,000 breast histopathology regions of interest from 2 patients. We use a Probabilistic Boosting Tree (PBT) to classify each dataset as either cancer or non-cancer (prostate), or high or low grade cancer (breast). Training is done using CoA-based AL, and is evaluated in terms of accuracy and area under the receiver operating characteristic curve (AUC). CoA training yielded between 0.01-0.05% greater performance than RL for the same training set size; approximately 5-10 more samples were required for RL to match the performance of CoA, suggesting that CoA is a more efficient training strategy.

The consensus of ambiguity (CoA) approach is trivially extensible to other kinds of image analysis. Intelligence photos, anyone?

    What intrigues me is extension of that approach to other types of data analysis.

    Such as having multiple AL schemes process textual data and follow the CoA approach on what to bounce to experts for annotation.
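As a back-of-the-envelope illustration of the consensus idea, the sketch below runs three generic uncertainty measures (stand-ins, not the schemes used in the paper) and forwards only the samples they all flag:

# Each "scheme" nominates its k most ambiguous samples from stand-in
# classifier outputs; only samples nominated by every scheme go to the
# expert for annotation.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(size=(20, 3))          # fake outputs: 20 samples, 3 classes
probs /= probs.sum(axis=1, keepdims=True)  # rows become class distributions

def least_confident(p):   # low top-class probability = ambiguous
    return 1 - p.max(axis=1)

def margin(p):            # small gap between the top two classes = ambiguous
    s = np.sort(p, axis=1)
    return 1 - (s[:, -1] - s[:, -2])

def entropy(p):           # high entropy = ambiguous
    return -(p * np.log(p)).sum(axis=1)

k = 5  # each scheme nominates its k most ambiguous samples
nominations = [set(np.argsort(score(probs))[-k:].tolist())
               for score in (least_confident, margin, entropy)]

consensus = set.intersection(*nominations)  # ambiguous under every scheme
print("send to the expert:", sorted(consensus))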

    Questions:

    1. What types of ambiguity would this approach miss?
    2. How would you apply this method to other data?
    3. How would you measure success/failure of application to other data?
    4. Design and apply this concept to specified data set. (project)

    October 21, 2010

    A Survey of Genetics-based Machine Learning

Filed under: Evolutionary,Learning Classifier,Machine Learning,Neural Networks — Patrick Durusau @ 5:15 am

A Survey of Genetics-based Machine Learning
Author: Tim Kovacs

    Abstract:

    This is a survey of the field of Genetics-based Machine Learning (GBML): the application of evolutionary algorithms to machine learning. We assume readers are familiar with evolutionary algorithms and their application to optimisation problems, but not necessarily with machine learning. We briefly outline the scope of machine learning, introduce the more specific area of supervised learning, contrast it with optimisation and present arguments for and against GBML. Next we introduce a framework for GBML which includes ways of classifying GBML algorithms and a discussion of the interaction between learning and evolution. We then review the following areas with emphasis on their evolutionary aspects: GBML for sub-problems of learning, genetic programming, evolving ensembles, evolving neural networks, learning classifier systems, and genetic fuzzy systems.

The author’s preprint has 322 references. Plus there are slides and bibliographies in BibTeX.

If you are interested in augmented topic map authoring using GBML, this would be a good starting place.

    Questions:

    1. Pick 3 subject areas. What arguments would you make in favor of GBML for augmenting authoring of a topic map for those subject areas?
    2. Same subject areas, but what arguments would you make against the use of GBML for augmenting authoring of a topic map for those subject areas?
    3. Design an experiment to test one of your arguments for and against GBML. (project, use of the literature encouraged)
    4. Convert the BibTeX formatted bibliographies into a topic map. (project)
