Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 28, 2011

Enhancing search results using machine learning

Filed under: Machine Learning,Search Data,Searching — Patrick Durusau @ 7:59 pm

Enhancing search results using machine learning by Emmanuel Espina

From the introduction:

To introduce the topic, let’s think about how users are accustomed to working with “information retrieval platforms” (I mean, search engines). The user enters your site, sees a little rectangular box with a button that reads “search” beside it, and figures out that he must think of some keywords to describe what he wants, write them in the search box, and hit search. Although we are all very used to this, a deeper analysis of how this procedure works leads to the conclusion that it is quite unintuitive. Before search engines, the act of “mentally extracting keywords” from concepts was not such a common activity.

It is natural to categorize things, to classify ideas or concepts, but extracting keywords is a different intellectual activity. While searching, the user must think like the search engine! The user must think “well, this machine will give me documents containing the words I am going to enter, so which words have the best chance of giving me what I want” (emphasis added)

Hmmmm, but prior to full-text search, users learned how to think like the indexers who created the index they were using. Indexers were a first line of defense against unbounded information as indexes covered particular resources and had mechanisms to account for changing terminology. Not to mention domain specific vocabularies that users could master.

A second line of defense was librarians, who not only mastered domain specific indexes but could also move from one specialized finding aid to another, collating information as they went. That ability to transition between finding aids has yet to be duplicated by automatic means, in part because it depends on the resources available in a particular library.

Do read the article to see how the author proposes to use machine learning to improve search results.

BTW, do you know of any sets of query responses that are publicly available?

August 24, 2011

RTextTools:…v.1.3 New Release

Filed under: Machine Learning,R — Patrick Durusau @ 6:55 pm

RTextTools: a machine learning library for text classification

From the post:

RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and it is highly recommended you upgrade to the latest version. Other changes include optimization of existing functions and improvements to the documentation.

Additionally, Duncan Temple Lang has graciously released Rstem on CRAN, meaning that the RTextTools package is now fully installable using the install.packages("RTextTools") command within R 2.13+. The repository at install.rtexttools.com will continue to work through the end of September.

From about the project:

RTextTools is a free, open source machine learning package for automatic text classification that makes it simple for both novice and advanced users to get started with supervised learning. The package includes nine algorithms for ensemble classification (SVM, SLDA, BOOSTING, BAGGING, RF, GLMNET, TREE, NNET, MAXENT), comprehensive analytics, and thorough documentation.
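
RTextTools is an R package, so the sketch below is only an analogue, but the ensemble idea it implements is easy to see in a few lines of Python with scikit-learn: train several different algorithms on the same document-term matrix and let them vote on each document’s label. The toy documents and labels are mine, purely for illustration; this is not the RTextTools API.

    # Ensemble text classification in the spirit of RTextTools
    # (hedged Python/scikit-learn analogue of the R package).
    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    docs = ["cheap meds online", "meeting moved to 3pm",
            "win a free cruise", "quarterly report attached"]
    labels = ["spam", "ham", "spam", "ham"]

    # "hard" voting takes the majority label across the classifiers
    ensemble = make_pipeline(
        TfidfVectorizer(),
        VotingClassifier([("maxent", LogisticRegression()),
                          ("svm", LinearSVC()),
                          ("tree", DecisionTreeClassifier())],
                         voting="hard"))
    ensemble.fit(docs, labels)
    print(ensemble.predict(["free meds, no meeting required"]))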

August 23, 2011

How Far You Can Get Using Machine Learning Black-Boxes [?]

Filed under: Machine Learning — Patrick Durusau @ 6:40 pm

How Far You Can Get Using Machine Learning Black-Boxes [?]

Abstract:

Supervised Learning (SL) is a machine learning research area which aims at developing techniques able to take advantage from labeled training samples to make decisions over unseen examples. Recently, a lot of tools have been presented in order to perform machine learning in a more straightforward and transparent manner. However, one problem that is increasingly present in most of the SL problems being solved is that, sometimes, researchers do not completely understand what supervised learning is and, more often than not, publish results using machine learning black-boxes. In this paper, we shed light over the use of machine learning black-boxes and show researchers how far they can get using these out-of-the-box solutions instead of going deeper into the machinery of the classifiers. Here, we focus on one aspect of classifiers namely the way they compare examples in the feature space and show how a simple knowledge about the classifier’s machinery can lift the results way beyond out-of-the-box machine learning solutions.

Not surprising that understanding how to use a tool leads to better results. A reminder, particularly one that illustrates how to better use a tool, is always welcome.
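
The paper’s point about how classifiers “compare examples in the feature space” is easy to demonstrate. A minimal sketch (my own toy data, not the paper’s experiments): an out-of-the-box RBF SVM computes distances that are dominated by whichever feature has the largest scale, so a noisy large-scale feature drowns out the informative one until the features are standardized.

    # A classifier's notion of distance in feature space is easy to
    # break: the noisy feature's huge scale dominates the RBF kernel
    # until we standardize. Toy illustration only.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    informative = rng.normal(size=200)            # carries the class signal
    noisy = rng.normal(scale=1000.0, size=200)    # huge scale, no signal
    X = np.column_stack([informative, noisy])
    y = (informative > 0).astype(int)

    black_box = SVC(kernel="rbf")
    informed = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    print("out of the box:", cross_val_score(black_box, X, y).mean())
    print("after scaling: ", cross_val_score(informed, X, y).mean())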

Modeling Social and Information Networks: Opportunities for Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 6:39 pm

Modeling Social and Information Networks: Opportunities for Machine Learning

The description:

Emergence of the web, social media and online social networking websites gave rise to detailed traces of human social activity. This offers many opportunities to analyze and model behaviors of millions of people. For example, we can now study “planetary scale” dynamics of a full Microsoft Instant Messenger network of 240 million people, with more than 255 billion exchanged messages per month. Many types of data, especially web and “social” data, come in the form of a network or a graph. This tutorial will cover several aspects of such network data: macroscopic properties of network data sets; statistical models for modeling large scale network structure of static and dynamic networks; properties and models of network structure and evolution at the level of groups of nodes and algorithms for extracting such structures. I will also present several applications and case studies of blogs, instant messaging, Wikipedia and web search. Machine learning as a topic will be present throughout the tutorial. The idea of the tutorial is to introduce the machine learning community to recent developments in the area of social and information networks that underpin the Web and other on-line media.

Very good tutorial on social and information networks. Almost 2.5 hours in length.

Slides.
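
The first topic in the tutorial, macroscopic properties of network data sets, is easy to poke at yourself. A hedged sketch with networkx, using a synthetic preferential-attachment graph as a stand-in for real social network data:

    # Degree distribution, one of the macroscopic network properties
    # the tutorial covers. Synthetic preferential-attachment graph as
    # a stand-in for real social-network data.
    import collections
    import networkx as nx

    G = nx.barabasi_albert_graph(n=10000, m=3, seed=42)
    counts = collections.Counter(d for _, d in G.degree())
    for degree in sorted(counts)[:10]:
        print(degree, counts[degree])
    # heavy tail: many low-degree nodes, a handful of hubs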

Mulan: A Java Library for Multi-Label Learning

Filed under: Java,Machine Learning — Patrick Durusau @ 6:38 pm

Mulan: A Java Library for Multi-Label Learning

From the website:

Mulan is an open-source Java library for learning from multi-label datasets. Multi-label datasets consist of training examples of a target function that has multiple binary target variables. This means that each item of a multi-label dataset can be a member of multiple categories or annotated by many labels (classes). This is actually the nature of many real world problems such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. An introduction on mining multi-label data is provided in (Tsoumakas et al., 2010).

Currently, the library includes a variety of state-of-the-art algorithms for performing the following major multi-label learning tasks:

  • Classification. This task is concerned with outputting a bipartition of the labels into relevant and irrelevant ones for a given input instance.
  • Ranking. This task is concerned with outputting an ordering of the labels, according to their relevance for a given data item.
  • Classification and ranking. A combination of the two tasks mentioned above.

In addition, the library offers the following features:

  • Feature selection. Simple baseline methods are currently supported.
  • Evaluation. Classes that calculate a large variety of evaluation measures through hold-out evaluation and cross-validation.
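
Mulan itself is a Java library, but the two core tasks above fit in a short Python sketch (an analogue, not Mulan’s API): a one-vs-rest wrapper gives each label its own binary classifier, thresholding the per-label scores yields the bipartition, and sorting them yields the ranking.

    # Multi-label bipartition and ranking, sketched with scikit-learn
    # as an analogue of Mulan's tasks (Mulan itself is Java).
    import numpy as np
    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, Y = make_multilabel_classification(n_samples=200, n_classes=5,
                                          n_labels=2, random_state=0)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

    scores = clf.predict_proba(X[:1])[0]        # one score per label
    print("bipartition:", (scores >= 0.5).astype(int))
    print("ranking:    ", np.argsort(-scores))  # labels, most relevant first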

August 22, 2011

Scaling Up Machine Learning, the Tutorial

Filed under: BigData,Machine Learning — Patrick Durusau @ 7:41 pm

Scaling Up Machine Learning, the Tutorial, KDD 2011 by Ron Bekkerman, Misha Bilenko and John Langford.

From the webpage:

This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., speech recognition and object recognition in vision).

The tutorial is based on (but not limited to) the material from our upcoming Cambridge U. Press edited book which is currently in production and will be available in December 2011.

The slides are informative and entertaining. Interested in seeing if the book is the same.

August 17, 2011

Machine Learning – Stanford Class

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 6:48 pm

Machine Learning – Stanford Class

From the course description:

This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). (iv) Reinforcement learning. The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

A free Stanford class on machine learning being taught by Professor Andrew Ng!

Over 200,000 people have viewed Professor Ng’s machine learning lectures on YouTube. Now you can participate and even get a certificate of accomplishment.

I am already planning to take the free Introduction to Artificial Intelligence class at Stanford so I can only hope they repeat Machine Learning next year.

Embracing Uncertainty: Applied Machine Learning Comes of Age

Filed under: Machine Learning,Recognition — Patrick Durusau @ 6:48 pm

Embracing Uncertainty: Applied Machine Learning Comes of Age

Christopher Bishop, Microsoft Research Cambridge, ICML 2011 Keynote.

Christopher reports the discovery that solving the problem of gesture controls isn’t one of tracking location, say of your arm from position to position.

Rather, it is a question of recognition at every frame, which makes the computation tractable on older hardware.

Which makes me wonder: how many other problems have we “viewed” in the most difficult way possible? Or where would viewing them as problems of recognition make previously intractable problems tractable? We won’t know unless we make the effort to ask.
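
To make the contrast concrete, here is a toy sketch (synthetic features, not the real depth features Christopher describes): a classifier labels every frame independently, with no state carried between frames, which is the recognition framing rather than the tracking one.

    # Recognition at every frame: classify each frame's feature
    # vector independently instead of tracking state across frames.
    # Synthetic "frame features" for illustration only.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(300, 20))                    # labeled frames
    y_train = (X_train[:, :2].sum(axis=1) > 0).astype(int)  # toy pose labels
    per_frame = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # at run time nothing is remembered from frame to frame
    for frame in rng.normal(size=(5, 20)):
        print(per_frame.predict(frame.reshape(1, -1)))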

August 8, 2011

Suicide Note Classification…ML Correct 78% of the time.

Filed under: Data Analysis,Data Mining,Machine Learning — Patrick Durusau @ 6:41 pm

Suicide Note Classification Using Natural Language Processing: A Content Analysis

Punch line (for the impatient):

…trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.

Abstract:

Suicide is the second leading cause of death among 25–34 year olds and the third leading cause of death among 15–25 year olds in the United States. In the Emergency Department, where suicidal patients often present, estimating the risk of repeated attempts is generally left to clinical judgment. This paper presents our second attempt to determine the role of computational algorithms in understanding a suicidal patient’s thoughts, as represented by suicide notes. We focus on developing methods of natural language processing that distinguish between genuine and elicited suicide notes. We hypothesize that machine learning algorithms can categorize suicide notes as well as mental health professionals and psychiatric physician trainees do. The data used are comprised of suicide notes from 33 suicide completers and matched to 33 elicited notes from healthy control group members. Eleven mental health professionals and 31 psychiatric trainees were asked to decide if a note was genuine or elicited. Their decisions were compared to nine different machine-learning algorithms. The results indicate that trainees accurately classified notes 49% of the time, mental health professionals accurately classified notes 63% of the time, and the best machine learning algorithm accurately classified the notes 78% of the time.

The researchers concede that the data set is small, but apparently it is the only one of its kind.

I mention the study here as a reason to consider using ML techniques in your next topic map project.

Merging the results from different ML algorithms re-creates the original topic maps use case (how do you merge indexes made by different indexers?), but that can’t be helped. More patterns to discover to use as the basis for merging rules!*

PS: I spotted this at Improbable Results: Machines vs. Professionals: Recognizing Suicide Notes.


* I wonder if we could apply the lessons from ensembles of classifiers to a situation where multiple classifiers are used by different projects? One part of me says that an ensemble is developed by a person or group that shares an implicit view of the data, and that shared view is what makes the ensemble workable.

Another part wants to say that no, the results of classifiers, whether programmed by the same group or different groups, should not make a difference. Well, other than having to “merge” the results of the classifiers, which happens with an ensemble anyway. In that case you might have to think about it more.

Hard to say. Will have to investigate further.
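
To make the merging question concrete, here is a minimal sketch of the simplest possible rule, a majority vote over classifiers that were built independently. The genuine/elicited labels echo the suicide note study above; the disagreement cases are exactly where richer merging rules would earn their keep.

    # Majority-vote merging of labels from independently built
    # classifiers; a minimal sketch of the "merge the indexers" idea.
    from collections import Counter

    def merge_by_vote(*labelings):
        merged = []
        for votes in zip(*labelings):
            label, count = Counter(votes).most_common(1)[0]
            merged.append(label if count > len(votes) / 2 else "DISPUTED")
        return merged

    a = ["genuine", "elicited", "genuine", "genuine"]
    b = ["genuine", "genuine", "genuine", "elicited"]
    c = ["elicited", "elicited", "genuine", "genuine"]
    print(merge_by_vote(a, b, c))
    # ['genuine', 'elicited', 'genuine', 'genuine']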

Data Mining: Professor Pier Luca Lanzi, Politecnico di Milano

Filed under: Data Mining,Genetic Algorithms,Machine Learning,Visualization — Patrick Durusau @ 6:27 pm

This post started with my finding the data mining slides at Slideshare (about 4 years old) and, after organizing those, deciding to check Professor Pier Luca Lanzi’s homepage for more recent material. I think you will find the material useful.

Pier Luca Lanzi – homepage

The professor is obviously interested in video games, a rapidly growing area of development and research.

Combining video games with data mining, that would be a real coup.

Data Mining Course page

Data Mining

Includes prior exams, video (2009 course), transparencies from all lectures.

Lecture slides on Data Mining and Machine Learning at Slideshare.

Not being a lemming, I don’t find “most viewed” a helpful sorting criterion.

I organized the data mining slides in course order (as nearly as I could determine, there are two #6 presentations and no #7 or #17 presentations):

00 Course Introduction

01 Data Mining

02 Machine Learning

03 The representation of data

04 Association rule mining

05 Association rules: advanced topics

06 Clustering: Introduction

06 Clustering: Partitioning Methods

08 Clustering: Hierarchical

09 Density-based, Grid-based, and Model-based Clustering

10 Introduction to Classification

11 Decision Trees

12 Classification Rules

13 Nearest Neighbor and Bayesian Classifiers

14 Evaluation

15 Data Exploration and Preparation

16 Classifiers Ensembles

18 Mining Data Streams

19 Text and Web Mining

Genetic Algorithms

Genetic Algorithms Course Notes

Bayesian Reasoning and Machine Learning

Filed under: Bayesian Models,Machine Learning — Patrick Durusau @ 6:26 pm

Bayesian Reasoning and Machine Learning by David Barber.

Whom this book is for

The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics that wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary.

The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect since modern applications are often so specialised as to require novel methods. The approach taken throughout is to first describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox.

The book is primarily aimed at final year undergraduates and graduates without significant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research level material.

The main page for the book and link to software.

David Barber’s homepage.

The book is due to be published by Cambridge University Press in the summer of 2011.

Machine Learning – Mathematicalmonk

Filed under: Machine Learning — Patrick Durusau @ 6:24 pm

Machine Learning – Mathematicalmonk

Engaging series of videos on machine learning. (159 as of 8 August 2011)

I have only watched the first five videos, but the videos are helpfully broken down into small segments (8 to 15 minutes), so you don’t have to commit to watching 30, 40 or 60 minutes of lecture at one time.

The lecturer has a very engaging style.

A style I would like to imitate for similar material on topic maps.

August 7, 2011

The quiet rise of Gaussian Belief Propagation (GaBP)

Filed under: GaBP,Machine Learning — Patrick Durusau @ 7:06 pm

The quiet rise of Gaussian Belief Propagation (GaBP) by Danny Bickson.

From the post:

Gaussian Belief Propagation is an inference method on a Gaussian graphical model which is related to solving a linear system of equations, one of the fundamental problems in computer science and engineering.  I have published my PhD thesis on applications of GaBP in 2008.

When I started working on GaBP, it was an absolutely useless algorithm with no documented applications.

Recently, I have been getting a lot of inquiries from people who are applying GaBP to real world problems. Some examples:

  • Carnegie Mellon graduate student Kyung-Ah Sohn, working with Eric Xing on a regression problem for finding causal genetic variants of gene expression, considered using GaBP for computing matrix inverses.
  • UCSC researcher Daniel Zerbino is using GaBP for smoothing genomic sequencing measurements with constraints.
  • UCSB graduate student Yun Teng is working on implementing GaBP as part of KDT (the knowledge discovery toolbox package).

Furthermore, I was very excited to find out today from Noam Koenigstein, a Tel Aviv University graduate, about the Microsoft Research Cambridge project called MatchBox, which is using Gaussian BP for collaborative filtering and is actually deployed at MS. Some examples from other conversations I have had:

  • An undisclosed Wall Street company (which asked to remain private) is using GaBP for parallel computation of linear regression on online stock market data.
  • A gas and oil company was considering using GaBP for computing the main diagonal of the inverse of a sparse matrix.

The MatchBox project is a recommender system that takes user choices into account, even ones in a current “session.”

Curious: to what extent are user preferences the same as, or different from, the way users identify subjects and the subjects they would identify?
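
Since GaBP is, at bottom, an iterative linear solver, it fits in a short sketch. This follows the message-update rules in Bickson’s thesis; it assumes A is symmetric and diagonally dominant (a standard sufficient condition for convergence) and uses dense loops for clarity rather than speed.

    # GaBP for solving A x = b; a minimal NumPy sketch of the update
    # rules from Bickson's thesis. Assumes A is symmetric and
    # diagonally dominant so the iteration converges.
    import numpy as np

    def gabp_solve(A, b, iters=50):
        n = len(b)
        P = np.zeros((n, n))   # precision message P[i, j] from i to j
        mu = np.zeros((n, n))  # mean messages
        for _ in range(iters):
            for i in range(n):
                for j in range(n):
                    if i == j or A[i, j] == 0.0:
                        continue
                    # aggregate everything node i knows, except j's message
                    P_excl = A[i, i] + P[:, i].sum() - P[j, i]
                    mu_excl = (b[i] + (P[:, i] * mu[:, i]).sum()
                               - P[j, i] * mu[j, i]) / P_excl
                    P[i, j] = -A[i, j] ** 2 / P_excl
                    mu[i, j] = P_excl * mu_excl / A[i, j]
        Pi = np.diag(A) + P.sum(axis=0)          # final marginal precisions
        return (b + (P * mu).sum(axis=0)) / Pi   # marginal means = solution

    A = np.array([[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(gabp_solve(A, b))        # should match the direct solve
    print(np.linalg.solve(A, b))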

August 6, 2011

Machine learning problem settings

Filed under: Hadoop,Machine Learning,MapReduce — Patrick Durusau @ 6:51 pm

Machine learning problem settings

From the post:

After a few successful Apache Mahout projects the goal of this lecture was to introduce students to some of the basic concepts and problems encountered today in a world where huge datasets are generally available and are easy to process with Apache Hadoop. As such the course is targeted at an entry level audience – thorough treatment of the mathematical background of latest machine learning technology is left to the machine learning research groups in Potsdam, at TU Berlin and the neural information processing group at TU.

Slides and exercises that will be useful alongside, or as a warm-up for, the Introduction to Artificial Intelligence – Stanford Class.

August 5, 2011

Sentiment Analysis: Machines Are Like Us

Filed under: Analytics,Artificial Intelligence,Classifier,Machine Learning — Patrick Durusau @ 7:07 pm

Sentiment Analysis: Machines Are Like Us

Interesting post but in particular for:

We are very aware of the importance of industry-specific language here at Brandwatch and we do our best to offer language analysis that specialises in industries as much as possible.

We constantly refine our language systems by adding newly trained classifiers (a classifier is the particular system used to detect and analyse the language of a query’s matches – which classifier should be used is determined upon query creation).

We have over 500 classifiers for different industries across the 17 languages we cover.

Did you catch that? Over 500 classifiers for different industries.

In other words, we don’t need a single classifier that does all the heavy lifting on entity recognition for building topic maps. We could, for example, train a classifier for use with all the journals in a field or sub-field. In astronomy, we don’t have to disambiguate all the various uses of “Venus” but can concentrate on the sense most likely to be found in a subset of the astronomy literature.

By using specialized classifiers, perhaps we can reduce the target for more generalized classifiers to a manageable size.
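
A hedged sketch of the “Venus” point (toy contexts, invented for illustration): a general-purpose classifier has to learn every sense of the name, while a classifier scoped to astronomy journals never faces most of them.

    # One general classifier must disambiguate every sense of "Venus";
    # a sub-domain classifier trained only on astronomy contexts would
    # skip most of this work. Toy training data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    contexts = [
        "venus transit across the solar disk",        # planet
        "cloud layers on venus trap infrared heat",   # planet
        "venus wins the semifinal in straight sets",  # athlete
        "marble statue of venus from the wreck",      # deity
    ]
    senses = ["planet", "planet", "athlete", "deity"]

    general = make_pipeline(TfidfVectorizer(), MultinomialNB())
    general.fit(contexts, senses)
    print(general.predict(["radar mapping of venus craters"]))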

Mahout: Hands on!

Filed under: Artificial Intelligence,Hadoop,Machine Learning,Mahout — Patrick Durusau @ 7:06 pm

Mahout: Hands on!

From the tutorial description at OSCON 2011:

Mahout is an open source machine learning library from Apache. At the present stage of development, it is evolving with a focus on collaborative filtering/recommendation engines, clustering, and classification.

There is no user interface, or a pre-packaged distributable server or installer. It is, at best, a framework of tools intended to be used and adapted by developers. The algorithms in this “suite” can be used in applications ranging from recommendation engines for movie websites to designing early warning systems in credit risk engines supporting the cards industry out there.

This tutorial aims at helping you set up Mahout to run on a Hadoop setup. The instructor will walk you through the basic idea behind each of the algorithms. Having done that, we’ll take a look at how it can be run on some of the large-sized datasets and how it can be used to solve real world problems.

If your site or smartphone app or viral facebook app collects data which you really want to use a lot more productively, this session is for you!

Not the only resource on Mahout you will want but an excellent place to start.

July 29, 2011

AZOrange – Machine Learning for QSAR Modeling

Filed under: AZOrange,Key-Value Stores,Machine Learning,QSAR — Patrick Durusau @ 7:41 pm

AZOrange – High performance open source machine learning for QSAR modeling in a graphical programming environment, by Jonna C. Stalring, Lars A. Carlsson, Pedro Almeida and Scott Boyer. Journal of Cheminformatics 2011, 3:28. doi:10.1186/1758-2946-3-28

Abstract:

Machine learning has a vast range of applications. In particular, advanced machine learning methods are routinely and increasingly used in quantitative structure activity relationship (QSAR) modeling. QSAR data sets often encompass tens of thousands of compounds and the size of proprietary, as well as public data sets, is rapidly growing. Hence, there is a demand for computationally efficient machine learning algorithms, easily available to researchers without extensive machine learning knowledge. In granting the scientific principles of transparency and reproducibility, Open Source solutions are increasingly acknowledged by regulatory authorities. Thus, an Open Source state-of-the-art high performance machine learning platform, interfacing multiple, customized machine learning algorithms for both graphical programming and scripting, to be used for large scale development of QSAR models of regulatory quality, is of great value to the QSAR community.

Project homepage: AZOrange (Ubuntu; I assume it compiles and runs on other *nix platforms. Since I run Ubuntu, I would need to set up another *nix distribution just for testing purposes.)

July 26, 2011

A machine learning toolbox for musician computer interaction

Filed under: Authoring Semantics,Authoring Topic Maps,Machine Learning — Patrick Durusau @ 6:27 pm

A machine learning toolbox for musician computer interaction

Abstract:

This paper presents the SARC EyesWeb Catalog, (SEC), a machine learning toolbox that has been specifically developed for musician-computer interaction. The SEC features a large number of machine learning algorithms that can be used in real-time to recognise static postures, perform regression and classify multivariate temporal gestures. The algorithms within the toolbox have been designed to work with any N-dimensional signal and can be quickly trained with a small number of training examples. We also provide the motivation for the algorithms used for the recognition of musical gestures to achieve a low intra-personal generalisation error, as opposed to the inter-personal generalisation error that is more common in other areas of human-computer interaction.

Recorded at: 11th International Conference on New Interfaces for Musical Expression. 30 May – 1 June 2011, Oslo, Norway. nime2011.org

The paper: A machine learning toolbox for musician computer interaction

The software: SARC EyesWeb Catalog [SEC]

Although written in the context of musician-computer interaction, the techniques described here could just as easily be applied to exploring or authoring a topic map. Or, for that matter, to exploring a data stream being presented to a user.

Imagine that one hand gives “focus” to some particular piece of data and the other hand “overlays” a query onto that data that then displays a portion of a topic map with that data as the organizing subject. Based on that result the data can be simply dumped back into the data stream or “saved” for further review and analysis.
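
As a hedged sketch of the recognition half of that scenario (made-up posture names and a made-up 6-dimensional sensor vector): the SEC’s setting, static postures from an N-dimensional signal with only a few training examples, is a natural fit for a nearest-neighbour classifier.

    # Static posture recognition from an N-dimensional signal with a
    # handful of training examples; hypothetical posture names and
    # sensor values, in the spirit of the SEC (not its API).
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X_train = np.array([
        [0.9, 0.1, 0.0, 0.2, 0.0, 0.1],   # "focus" hand shape
        [0.1, 0.8, 0.7, 0.0, 0.1, 0.0],   # "overlay query" hand shape
        [0.0, 0.1, 0.1, 0.9, 0.8, 0.2],   # "discard" hand shape
    ])
    postures = KNeighborsClassifier(n_neighbors=1)
    postures.fit(X_train, ["focus", "overlay", "discard"])

    frame = np.array([[0.85, 0.15, 0.05, 0.25, 0.0, 0.1]])
    print(postures.predict(frame))   # -> ['focus']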

July 25, 2011

Interesting Neural Network Papers at ICML 2011

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 6:39 pm

Interesting Neural Network Papers at ICML 2011 by Richard Socher.

Brief comments on eight (8) papers and the ICML 2011 conference.

Highly recommended, particularly if you are interested in neural networks and/or machine learning in connection with your topic maps.

The conference website: The 28th International Conference on Machine Learning, has pointers to the complete proceedings as well as videos of all Session A talks.

Kudos to the conference and its organizers for making materials from the conference available!

July 16, 2011

Python for brain mining:…

Filed under: Machine Learning,Parallel Programming,Parallelism,Python,Visualization — Patrick Durusau @ 5:42 pm

Python for brain mining: (neuro)science with state of the art machine learning and data visualization by Gaël Varoquaux.

Brief slide deck on three tools:

Mayavi: For 3-D visualizations.

scikit-learn, which we reported on at: scikits.learn machine learning in Python.

Joblib: running Python function as pipeline jobs.

All three look useful, although I suspect Joblib may be the one of more immediate interest.

Depends on your interests. Comments?
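
For the curious, the two Joblib features that make “pipeline jobs” cheap are transparent disk caching of function results and simple process-based parallelism. A minimal sketch (API as in recent joblib releases):

    # joblib in two moves: cache an expensive step to disk, then fan
    # it out over several inputs in parallel.
    from joblib import Memory, Parallel, delayed

    memory = Memory("./joblib_cache", verbose=0)

    @memory.cache
    def expensive_step(x):
        # recomputed only when x (or the function body) changes
        return x ** 2

    results = Parallel(n_jobs=2)(delayed(expensive_step)(x) for x in range(8))
    print(results)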

IPSN 2012

Filed under: Conferences,Machine Learning — Patrick Durusau @ 5:42 pm

IPSN 2012 : The 11th ACM/IEEE Conference on Information Processing in Sensor Networks Call for Papers.

April 2012 Beijing, China (conference dates not confirmed)

Important Dates:

Abstract deadline: Friday, October 07, 2011
Full papers due: Friday, October 14, 2011
Author notification: Friday, January 20, 2012
Camera ready due: Friday, March 01, 2012

Scope:

The International Conference on Information Processing in Sensor Networks (IPSN) is a leading, single-track, annual forum on research in wireless embedded sensing systems. IPSN brings together researchers from academia, industry, and government to present and discuss recent advances in both theoretical and experimental research. Its scope includes signal and image processing, information and coding theory, databases and information management, distributed algorithms, networks and protocols, wireless communications, collaborative objects and the Internet of Things, machine learning, and embedded systems design.

If you are designing a topic map for sensor network input this looks like a good conference to attend.

I haven’t had the good fortune to visit Beijing but April is reported to be a good month (Spring) to visit.

June 26, 2011

21st-Century Data Miners Meet 19th-Century Electrical Cables

Filed under: Data Mining,Machine Learning — Patrick Durusau @ 4:09 pm

21st-Century Data Miners Meet 19th-Century Electrical Cables by Cynthia Rudin, Rebecca J. Passonneau, Axinia Radeva, Steve Ierome, and Delfina F. Isaac, Computer, June 2011 (vol. 44 no. 6).

As they say, the past is never far behind. In this case, about 5% of the low-voltage cables in Manhattan were installed before 1930. The records of Consolidated Edison (ConEd) on its cabling, and the manholes that access it, vary in form and content and originate in different departments, starting in the 1880’s. Yes, the 1880’s, for those of you who think the 1980’s are ancient history.

From the article:

The text in trouble tickets is very irregular and thus challenging to process in its raw form. There are many spellings of each word–for instance, the term “service box” has at least 38 variations, including SB, S, S/B, S.B, S?B, S/BX, SB/X, S/XB, /SBX, S.BX, S&BX, S?BX, S BX, S/B/X, S BOX, SVBX, SERV BX, SERV-BOX, SERV/BOX, and SERVICE BOX.

Similar difficulties plagued determining the type of event from trouble tickets, etc.

Read the article for the details on how the researchers were successful at showing legacy data can assist in the maintenance of a current electrical grid.

I suspect that “service box” is used by numerous utilities and with similar divergences in its recording. A more general application written as a topic map would preserve all those variations and use them in searching other data records. It is this reuse of user analysis and data that makes topic maps so valuable.
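
Here is a minimal sketch of that reuse, with a plain dict standing in for a topic map’s name/variant structure (only a few of the article’s 38 variants shown): every observed spelling stays attached to one subject, and any record collection can then be searched through the same table.

    # Variant spellings preserved as names for one subject, then used
    # to find that subject in free-text records. A dict stands in for
    # a topic map's name/variant structure.
    subjects = {
        "service box": {"SB", "S/B", "S.B", "S/BX", "SERV BX",
                        "SERV-BOX", "SERVICE BOX"},  # 38 variants in all
    }
    variant_to_subject = {v: subj
                          for subj, names in subjects.items() for v in names}

    def subjects_in(ticket):
        text = ticket.upper()
        # naive containment; a real matcher would respect token boundaries
        return {subj for v, subj in variant_to_subject.items() if v in text}

    print(subjects_in("FOUND SMOKING S/B AT 53RD AND LEX"))
    # {'service box'}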

June 24, 2011

Good Machine Learning Blogs

Filed under: Machine Learning — Patrick Durusau @ 10:48 am

Good Machine Learning Blogs

Something to keep you off the streets for a while. 😉

June 23, 2011

Advanced Topics in Machine Learning

Advanced Topics in Machine Learning

A course by Andreas Krause and Daniel Golovin at Caltech. Lecture notes and readings; this will keep you entertained for some time.

Overview:

How can we gain insights from massive data sets?

Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In particular, in this course we will study:

  • Online learning: How can we learn when we cannot fit the training data into memory? We will cover no regret online algorithms; bandit algorithms; sketching and dimension reduction.
  • Active learning: How should we choose few expensive labels to best utilize massive unlabeled data? We will cover active learning algorithms, learning theory and label complexity.
  • Nonparametric learning on large data: How can we let complexity of classifiers grow in a principled manner with data set size? We will cover large-scale kernel methods; Gaussian process regression, classification, optimization and active set methods.
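
The online learning bullet above is the easiest to make concrete. A minimal sketch (my own toy stream, nothing from the course): scikit-learn’s SGDClassifier can consume one mini-batch at a time through partial_fit, so the full training set never has to fit in memory.

    # Online learning when the data cannot fit in memory: stream
    # mini-batches through partial_fit. Toy generated stream.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()
    classes = np.array([0, 1])   # label set must be declared up front
    rng = np.random.default_rng(0)

    for _ in range(100):                        # stream of mini-batches
        X = rng.normal(size=(32, 10))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        model.partial_fit(X, y, classes=classes)

    print(model.predict(rng.normal(size=(5, 10))))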

Why would a non-strong AI person list so much machine learning stuff?

Two reasons:

1) Machine learning techniques are incredibly useful in appropriate cases.

2) You have to understand machine learning to pick out the appropriate cases.

June 21, 2011

Andrew Ng – Machine Learning Materials

Filed under: Machine Learning — Patrick Durusau @ 7:10 pm

Andrew Ng – Machine Learning Materials

Course materials for the lectures Machine Learning – Andrew Ng – YouTube.

Thanks to @philipmlong for the pointer!

June 16, 2011

How do I start with Machine Learning?

Filed under: Machine Learning — Patrick Durusau @ 3:43 pm

How do I start with Machine Learning?

This question on Hacker News drew a number of useful responses.

Now to just get the equivalent number of high quality responses to: “How do I start with Topic Maps?”

June 14, 2011

OpenCV haartraining

Filed under: Machine Learning,Topic Maps — Patrick Durusau @ 10:24 am

OpenCV haartraining (Rapid Object Detection With A Cascade of Boosted Classifiers Based on Haar-like Features)

From the post:

The OpenCV library provides us a greatly interesting demonstration of face detection. Furthermore, it provides the programs (or functions) used to train classifiers for its face detection system, called HaarTraining, so that we can create our own object classifiers using these functions. It is interesting.

I am not sure about the “rapid” part in the title because the author points out he typically waits a week to check for results. 😉

I suppose it is all relative.

Assuming larger hardware resources, it occurred to me that face detection could be of interest to topic map authors or, more importantly, to people who buy topic maps or topic map services.

At some point, video surveillance will have to improve beyond the convenience store video showing a robbery in progress, to something more sophisticated.

It is all well and good to take video of everyone in the central parts of London, but other than spotting people about to commit a crime or recognizing someone who is a known person of interest, how useful is that?

Imagine a system that assists human reviewers with suggested matches, not only to identity records but also links to other individuals either seen in their presence or who intersect with them in other patterns, such as incoming passenger lists.

Hopefully this tutorial will spark your thinking on how to use topic maps with video recognition systems.
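
Training aside, the detection half is quick to try. A minimal sketch using OpenCV’s Python bindings and one of the pretrained frontal-face cascades that ship with it (the frame.jpg input is a placeholder for your own image):

    # Face detection with a pretrained Haar cascade via OpenCV's
    # Python API (assumes the opencv-python package). "frame.jpg"
    # is a placeholder input image.
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    image = cv2.imread("frame.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    for (x, y, w, h) in faces:   # one rectangle per detected face
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("frame_faces.jpg", image)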

June 11, 2011

MLComp

Filed under: Machine Learning — Patrick Durusau @ 12:41 pm

MLComp

Run your machine learning program on existing datasets to compare with other programs.

Or, run existing algorithms against your dataset.

Certainly an interesting idea for developing/testing machine learning algorithms, or for deciding which algorithms to use with particular datasets.

June 7, 2011

Machine Learning and Knowledge Discovery for Semantic Web

Filed under: Machine Learning,Semantic Web — Patrick Durusau @ 6:20 pm

Machine Learning and Knowledge Discovery for Semantic Web

Description:

Machine Learning and the Semantic Web cover conceptually different sides of the same story – the Semantic Web’s typical approach is top-down modeling of knowledge, proceeding down towards the data, while Machine Learning is an almost entirely data-driven, bottom-up approach trying to discover the structure in the data and express it in more abstract ways and rich knowledge formalisms. The talk will discuss possible interaction and usage of Machine Learning and Knowledge Discovery for the Semantic Web, with emphasis on ontology construction. In the second half of the talk we will take a look at some research using machine learning for the Semantic Web and demos of the corresponding prototype systems.

Slides.

The presentation runs 80+ minutes but three quick points:

First, the “Semi-Automatic Data-Driven Ontology Construction” tool (http://ontogen.ijs.si) could, from a slightly different point of view, be converted into a topic map authoring tool for working with data.

Second, the “jaguar” search example at 39:29 was particularly compelling. It definitely improves the usefulness of the search results but is still working at the document level. The document level is the wrong level for search, unless you just like wasting time repeating what other people have already done.

Third, there are lots of other tools and resources at: http://ailab.ijs.si/. I am going to be slowly mining this site but if you encounter something really interesting, please make a comment or drop me a note.

Definitely a group to watch.

June 5, 2011

MLDemos

Filed under: Machine Learning,Visualization — Patrick Durusau @ 3:22 pm

MLDemos

From the website:

During my PhD I’ve come across a number of machine learning algorithms for classification, regression and clustering. While there is a great number of libraries, source code and binaries for different algorithms, it is always difficult to get a good grasp of what they do. Moreover, one ends up spending a great amount of time just getting the algorithm to display the results in an understandable way. Change the algorithm and you will have to do the work all over again. Some people have tried, and succeeded, in combining several algorithms into a single multi-purpose library, making their libraries extremely useful (you will find many of their names in the acknowledgements below), but still they didn’t solve the problem of visualization and ease of use. Matlab is an easy answer to that, but while extremely easy to use for generating and displaying single instances of data processing (data, results, models), Matlab is annoyingly slow and cumbersome when it comes to creating an interactive GUI. While preparing the exercise sessions for the Applied Machine Learning class at EPFL, I decided to combine the code snippets, example programs and libraries I had at hand into a seamless application where the different algorithms could be compared and studied easily.

This is an awesome piece of work! You can change the parameters and get immediate feedback on the impact of those changes.

Some minor issues (Windows XP, version 0.3.2):

The files in the /help directory have “open with” information set to Adobe Acrobat.

As far as I can tell, the “help” files don’t appear under the help menu or elsewhere.

There is no base directory so files unpack into whatever directory is selected. Suggest creating mldemo directory as target.

Exiting MLDemos is treated as an error and generates an error report for Microsoft.

Documentation, however brief, about the various algorithms and their parameters would be a welcome addition. Perhaps keyed to one or more leading texts on machine learning. That sounds like something that should be contributed by an interested user doesn’t it? 😉
