Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

April 14, 2013

Data Mining with Weka

Filed under: Machine Learning,Weka — Patrick Durusau @ 3:59 am

Data Mining with Weka by Prof Ian H. Witten.

From the documentation:

The purpose of this study is to gain information to help design and implement the main WekaMOOC course.

If you are interested in Weka or helping with the development of a MOOC or both, this is an opportunity for you.

I am curious if MOOCs or at least mini-MOOCs are going to replace the extended infomercials touted as webinars.


Update already: On Ubuntu, install Weka manually (not with Aptitude), so you can set the JVM memory options at startup. I got JDBC error messages but Weka otherwise ran properly.

April 12, 2013

A First Encounter with Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 6:41 pm

A First Encounter with Machine Learning (PDF) by Max Welling, Professor at University of California, Irvine.

From the preface:

In winter quarter 2007 I taught an undergraduate course in machine learning at UC Irvine. While I had been teaching machine learning at a graduate level it became soon clear that teaching the same material to an undergraduate class was a whole new challenge. Much of machine learning is build upon concepts from mathematics such as partial derivatives, eigenvalue decompositions, multivariate probability densities and so on. I quickly found that these concepts could not be taken for granted at an undergraduate level. The situation was aggravated by the lack of a suitable textbook. Excellent textbooks do exist for this field, but I found all of them to be too technical for a first encounter with machine learning. This experience led me to believe there was a genuine need for a simple, intuitive introduction into the concepts of machine learning. A first read to wet the appetite so to speak, a prelude to the more technical and advanced textbooks. Hence, the book you see before you is meant for those starting out in the field who need a simple, intuitive explanation of some of the most useful algorithms that our field has to offer

This looks like a fun read!

Although I think an intuitive approach may be more important than merely serving as a prelude to more technical explanations.

In part because the machinery of technical explanations and its use may obscure fundamental “meta-questions” that are important.

For example, in Jeremy Kun’s Homology series, which I strongly recommend, the technical side of homology isn’t going to prepare a student to ask questions like:

How did data collection impact the features of the data now subject to homology calculations?

How did the modeling of features impact the outcome of homology calculations?

What features are missing that could impact the findings from homology calculations?

Persistent homology is important, but however well you learn the rules for applying it, those rules won’t answer the meta-questions surrounding its use.

An intuitive understanding of the technique and its limitations is as important as learning the latest computational details.

I first saw this at: Introductory Machine Learning Textbook by Ryan Swanstrom.

April 11, 2013

MS Machine Learning Summit [23 April 2013]

Filed under: Machine Learning,Microsoft — Patrick Durusau @ 2:29 pm

MS Machine Learning Summit

From the post:

The live broadcast of the Microsoft Research Machine Learning Summit will include keynotes from machine learning experts and enlightening discussions with leading scientific and academic researchers about approaches to challenges that are raised by the new era in machine learning. Watch it streamed live from Paris on April 23, 2013, 13:30–17:00 Greenwich Mean Time (09:30–13:00 Eastern Time, 06:30–10:00 Pacific Time) at http://MicrosoftMLS.com.

I would rather be in Paris but watching the live stream will be a lot cheaper!

April 7, 2013

Advances in Neural Information Processing Systems (NIPS)

Filed under: Decision Making,Inference,Machine Learning,Neural Networks,Neuroinformatics — Patrick Durusau @ 5:47 am

Advances in Neural Information Processing Systems (NIPS)

From the homepage:

The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

Links to videos from NIPS 2012 meetings are featured on the homepage. The topics are as wide ranging as the foundation’s description.

A tweet from Chris Diehl, wondering what to do with “old hardbound NIPS proceedings (NIPS 11)” led me to: Advances in Neural Information Processing Systems (NIPS) [Online Papers], which has the papers from 1987 to 2012 by volume and a search interface to the same.

Quite a remarkable collection just from a casual skim of some of the volumes.

Unless you need to fill bookshelf space, I suggest you bookmark the NIPS Online Papers.

March 30, 2013

2012 IPAM Graduate Summer School: Deep Learning, Feature Learning

Filed under: Deep Learning,Feature Learning,Machine Learning — Patrick Durusau @ 2:43 pm

2012 IPAM Graduate Summer School: Deep Learning, Feature Learning

OK, so they skipped the weekends!

Still have fifteen (15) days of video.

So if you don’t have a date for movie night…., 😉

March 23, 2013

Tensors and Their Applications…

Filed under: Linked Data,Machine Learning,Mathematics,RDF,Tensors — Patrick Durusau @ 6:36 pm

Tensors and Their Applications in Graph-Structured Domains by Maximilian Nickel and Volker Tresp. (Slides.)

Along with the slides, you will like the abstract and bibliography found at: Machine Learning on Linked Data: Tensors and their Applications in Graph-Structured Domains.

Abstract:

Machine learning has become increasingly important in the context of Linked Data as it is an enabling technology for many important tasks such as link prediction, information retrieval or group detection. The fundamental data structure of Linked Data is a graph. Graphs are also ubiquitous in many other fields of application, such as social networks, bioinformatics or the World Wide Web. Recently, tensor factorizations have emerged as a highly promising approach to machine learning on graph-structured data, showing both scalability and excellent results on benchmark data sets, while matching perfectly to the triple structure of RDF. This tutorial will provide an introduction to tensor factorizations and their applications for machine learning on graphs. By the means of concrete tasks such as link prediction we will discuss several factorization methods in-depth and also provide necessary theoretical background on tensors in general. Emphasis is put on tensor models that are of interest to Linked Data, which will include models that are able to factorize large-scale graphs with millions of entities and known facts or models that can handle the open-world assumption of Linked Data. Furthermore, we will discuss tensor models for temporal and sequential graph data, e.g. to analyze social networks over time.
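The “matching perfectly to the triple structure of RDF” point is concrete: each predicate becomes an adjacency slice of a three-way binary tensor over the entities, and factorization methods like those in the tutorial operate on that tensor. A minimal sketch, with triples invented for illustration:

```python
import numpy as np

# Hypothetical RDF-style triples: (subject, predicate, object)
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
    ("carol", "worksAt", "acme"),
]

entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
relations = sorted({p for _, p, _ in triples})
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

# One n x n adjacency slice per relation: X[k, i, j] = 1 iff (e_i, r_k, e_j)
n, m = len(entities), len(relations)
X = np.zeros((m, n, n))
for s, p, o in triples:
    X[r_idx[p], e_idx[s], e_idx[o]] = 1.0

print(X.shape)                   # (2, 4, 4): 2 relations over 4 entities
print(X[r_idx["knows"]].sum())   # 2.0: the two "knows" facts
```

Link prediction then amounts to scoring the zero entries of this tensor from a low-rank factorization.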

Devising a system to deal with the heterogeneous nature of linked data.

Just skimming the slides I could see, this looks very promising.

I first saw this in a tweet by Stefano Bertolo.


Update: I just got an email from Maximilian Nickel and he has altered the transition between slides. Working now!

From slide 53 forward is pure gold for topic map purposes.

Heavy sledding but let me give you one statement from the slides that should capture your interest:

Instance matching: Ranking of entities by their similarity in the entity-latent-component space.

Although written about linked data, the technique is not limited to linked data.

What is more, Maximilian offers proof that the technique scales!

Complex, configurable, scalable determination of subject identity!
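To make the instance-matching statement concrete: once a factorization embeds each entity as a row of a latent-component matrix, candidate matches for an entity are ranked by cosine similarity between rows. A toy sketch (the matrix below is invented, not the output of an actual factorization):

```python
import numpy as np

# Hypothetical entity-latent-component matrix, one row per entity
A = np.array([
    [0.90, 0.10, 0.00],   # "W. Shakespeare"
    [0.88, 0.12, 0.05],   # "William Shakespeare" (near-duplicate)
    [0.00, 0.20, 0.95],   # "Stratford-upon-Avon"
])
names = ["W. Shakespeare", "William Shakespeare", "Stratford-upon-Avon"]

def rank_matches(A, query_row):
    """Rank all other entities by cosine similarity to the query entity."""
    norms = np.linalg.norm(A, axis=1)
    sims = A @ A[query_row] / (norms * norms[query_row])
    return sorted(
        ((names[i], s) for i, s in enumerate(sims) if i != query_row),
        key=lambda t: -t[1],
    )

print(rank_matches(A, 0))  # the near-duplicate entity ranks first
```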

[Update: deleted note about issues with slides, which read: (Slides for ISWC 2012 tutorial, Chrome is your best bet. Even better bet, Chrome on Windows. Chrome on Ubuntu crashed every time I tried to go to slide #15. Windows gets to slide #46 before failing to respond. I have written to inquire about the slides.)]

March 22, 2013

Cloudera ML:…

Filed under: Cloudera,Clustering,Machine Learning — Patrick Durusau @ 10:57 am

Cloudera ML: New Open Source Libraries and Tools for Data Scientists by Josh Wills.

From the post:

Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.

…[details about clustering omitted]

If you were paying at least somewhat close attention, you may have noticed that the algorithms I’m describing above are essentially clever sampling techniques. With all of the hype surrounding big data, sampling has gotten a bit of a bad rap, which is unfortunate, since most of the work of a data scientist involves finding just the right way to turn a large data set into a small one. Of course, it usually takes a few hundred tries to find that right way, and Hadoop is a powerful tool for exploring the space of possible features and how they should be weighted in order to achieve our objectives.

Wherever possible, we want to minimize the amount of parameter tuning required for any model we create. At the very least, we should try to provide feedback on the quality of the model that is created by different parameter settings. For k-means, we want to help data scientists choose a good value of K, the number of clusters to create. In Cloudera ML, we integrate the process of selecting a value of K into the data sampling and cluster fitting process by allowing data scientists to evaluate multiple values of K during a single run of the tool and reporting statistics about the stability of the clusters, such as the prediction strength.

Finally, we want to investigate the anomalous events in our clustering: those points that don’t fit well into any of the larger clusters. Cloudera ML includes a tool for using the clusters that were identified by the scalable k-means algorithm to compute an assignment of every point in our large data set to a particular cluster center, including the distance from that point to its assigned center. This information is created via a MapReduce job that outputs a CSV file that can be analyzed interactively using Cloudera Impala or your preferred analytical application for processing data stored in Hadoop.

Cloudera ML is under active development, and we are planning to add support for pivot tables, Hive integration via HCatalog, and tools for building ensemble classifiers over the next few weeks. We’re eager to get feedback on bug fixes and things that you would like to see in the tool, either by opening an issue or a pull request on our github repository. We’re also having a conversation about training a new generation of data scientists next Tuesday, March 26th, at 2pm ET/11am PT, and I hope that you will be able to join us.
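The last step Wills describes, assigning every point in the full data set to its nearest cluster center along with the distance, is the easy part to sketch outside of MapReduce. A minimal numpy version (not Cloudera ML’s implementation):

```python
import numpy as np

def assign_to_centers(points, centers):
    """Return (center_index, distance) for each point, as in the CSV output
    described above: nearest-center assignment plus distance-to-center."""
    # Pairwise distances: points (n, d) vs centers (k, d) -> (n, k)
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(points)), idx]

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [9.0, 8.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
idx, dist = assign_to_centers(points, centers)
print(idx)   # [0 0 1 1]
print(dist)  # distance of each point to its assigned center
```

Large distances in the second column are exactly the “anomalous events” worth a second look.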

Another great project by Cloudera!

March 20, 2013

Large-Scale Learning with Less… [Less Precision Viable?]

Filed under: Algorithms,Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:32 pm

Large-Scale Learning with Less RAM via Randomization by Daniel Golovin, D. Sculley, H. Brendan McMahan, Michael Young.

Abstract:

We reduce the memory footprint of popular large-scale online learning methods by projecting our weight vector onto a coarse discrete set using randomized rounding. Compared to standard 32-bit float encodings, this reduces RAM usage by more than 50% during training and by up to 95% when making predictions from a fixed model, with almost no loss in accuracy. We also show that randomized counting can be used to implement per-coordinate learning rates, improving model quality with little additional RAM. We prove these memory-saving methods achieve regret guarantees similar to their exact variants. Empirical evaluation confirms excellent performance, dominating standard approaches across memory versus accuracy tradeoffs.
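The core trick in the paper, randomized rounding of weights onto a coarse grid, is easy to sketch: round up or down with probability proportional to the fractional remainder, so the rounded value is unbiased. A minimal sketch (the grid resolution here is arbitrary, not the encoding from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_round(w, eps=2.0 ** -6):
    """Round each weight to a multiple of eps, up or down at random so that
    E[rounded] == w (unbiased), allowing a much coarser encoding than
    a 32-bit float."""
    lower = np.floor(w / eps) * eps
    p_up = (w - lower) / eps              # probability of rounding up
    return lower + eps * (rng.random(w.shape) < p_up)

w = rng.normal(size=100_000)
w_r = randomized_round(w)
print(np.abs(w_r - w).max())   # never off by more than eps
print(w_r.mean() - w.mean())   # close to zero: the rounding is unbiased
```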

I mention this in part because topic map authoring can be assisted by the results of machine learning.

It is also a data point for the proposition that unlike their human masters, machines are too precise.

Perhaps it is the case that the vagueness of human reasoning has significant advantages over the disk grinding precision of our machines.

The question then becomes: How do we capture vagueness in a system where every point is either 0 or 1?

Not probability, because that can already be expressed, but vagueness, which I experience as something different.

Suggestions?

PS: Perhaps that is what makes artificial intelligence artificial. It is too precise. 😉

I first saw this in a tweet by Stefano Bertolo.

March 10, 2013

Bayesian Reasoning and Machine Learning (update)

Filed under: Bayesian Models,Machine Learning — Patrick Durusau @ 3:15 pm

Bayesian Reasoning and Machine Learning by David Barber.

I first posted about this work at: Bayesian Reasoning and Machine Learning in 2011.

The current draft (which corresponds to the Cambridge University Press hard copy) is dated January 9, 2013.

If you use the online version and have the funds, please order a hard copy to encourage the publisher to continue to make published texts available online.

March 1, 2013

Incremental association rule mining: a survey

Filed under: Association Rule Mining,Machine Learning — Patrick Durusau @ 5:33 pm

Incremental association rule mining: a survey by B. Nath, D. K. Bhattacharyya, A. Ghosh. (WIREs Data Mining Knowl Discov 2013. doi: 10.1002/widm.1086)

Abstract:

Association rule mining is a computationally expensive task. Despite the huge processing cost, it has gained tremendous popularity due to the usefulness of association rules. Several efficient algorithms can be found in the literature. This paper provides a comprehensive survey on the state-of-the-art algorithms for association rule mining, specially when the data sets used for rule mining are not static. Addition of new data to a data set may lead to additional rules or to the modification of existing rules. Finding the association rules from the whole data set may lead to significant waste of time if the process has started from the scratch. Several algorithms have been evolved to attend this important issue of the association rule mining problem. This paper analyzes some of them to tackle the incremental association rule mining problem.

Not suggesting that it is always a good idea to model association rules as “associations” in the topic map sense, but it is an important area of data mining.

The paper provides:

  • a taxonomy on the existing frequent itemset generation techniques and an analysis of their pros and cons,
  • a comprehensive review on the existing static and incremental rule generation techniques and their pros and cons, and
  • identification of several important issues and research challenges.

Some thirteen (13) pages and sixty-six (66) citations to the literature, so a good starting point for research in this area.

If you need a more basic starting point, consider: Association rule learning (Wikipedia).
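For readers starting from the basics, the two quantities behind every association rule, support and confidence, take only a few lines to compute. A toy sketch with invented transactions:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))        # 0.6
print(confidence({"diapers"}, {"beer"}))   # beer is in 3 of 4 diapers baskets
```

Incremental mining, the subject of the survey, is about updating counts like these when new transactions arrive, without rescanning the whole data set.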

February 26, 2013

AstroML: data mining and machine learning for Astronomy

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 1:53 pm

AstroML: data mining and machine learning for Astronomy by Jake Vanderplas, Alex Gray, Andrew Connolly and Zeljko Ivezic.

Description:

Python is currently being adopted as the language of choice by many astronomical researchers. A prominent example is in the Large Synoptic Survey Telescope (LSST), a project which will repeatedly observe the southern sky 1000 times over the course of 10 years. The 30,000 GB of raw data created each night will pass through a processing pipeline consisting of C++ and legacy code, stitched together with a python interface. This example underscores the need for astronomers to be well-versed in large-scale statistical analysis techniques in python. We seek to address this need with the AstroML package, which is designed to be a repository for well-tested data mining and machine learning routines, with a focus on applications in astronomy and astrophysics. It will be released in late 2012 with an associated graduate-level textbook, ‘Statistics, Data Mining and Machine Learning in Astronomy’ (Princeton University Press). AstroML leverages many computational tools already available in the python universe, including numpy, scipy, scikit-learn, pymc, healpy, and others, and adds efficient implementations of several routines more specific to astronomy. A main feature of the package is the extensive set of practical examples of astronomical data analysis, all written in python. In this talk, we will explore the statistical analysis of several interesting astrophysical datasets using python and astroML.

AstroML at Github:

AstroML is a Python module for machine learning and data mining built on numpy, scipy, scikit-learn, and matplotlib, and distributed under the 3-clause BSD license. It contains a growing library of statistical and machine learning routines for analyzing astronomical data in python, loaders for several open astronomical datasets, and a large suite of examples of analyzing and visualizing astronomical datasets.

The goal of astroML is to provide a community repository for fast Python implementations of common tools and routines used for statistical data analysis in astronomy and astrophysics, to provide a uniform and easy-to-use interface to freely available astronomical datasets. We hope this package will be useful to researchers and students of astronomy. The astroML project was started in 2012 to accompany the book Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray, to be published in early 2013.

The book, Statistics, Data Mining, and Machine Learning in Astronomy by Zeljko Ivezic, Andrew Connolly, Jacob VanderPlas, and Alex Gray, is not yet listed by Princeton University Press. 🙁

I have subscribed to their notice service and will post a note when it appears.

February 23, 2013

MLBase

Filed under: Machine Learning,MLBase — Patrick Durusau @ 4:09 pm

MLBase by Danny Bickson.

From the post:

Here is an interesting post I got from Ben Lorica, O’Reilly about MLbase: http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html.

It is a proof of concept machine learning library on top of Spark, with a custom declarative language called MQL.

Slated for release in August, 2013.

Suggest you digest Lorica’s post and the links therein.

February 22, 2013

Everything You Wanted to Know About Machine Learning…

Filed under: Machine Learning — Patrick Durusau @ 3:51 pm

Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part One)

Everything You Wanted to Know About Machine Learning, But Were Too Afraid To Ask (Part Two)

by Charles Parker.

From Part One:

Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning“. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.

It’s a great article because it’s a brilliant man with deep experience who is an excellent teacher writing for “the rest of us”, and writing about things we need to know. And he manages to cover a huge amount of ground in nine pages.

Now, while it’s very light reading for the academic literature, it’s fairly dense by other comparisons. Since so much of it is relevant to anyone trying to use BigML, I’m going to try to give our readers the Cliff’s Notes version right here in our blog, with maybe a few more examples and a little less academic terminology. Often I’ll be rephrasing Domingos, and I’ll indicate it where I’m quoting directly.

Perhaps not “everything” but certainly enough to spark an interest in knowing more!

My take away is: understanding machine learning, like understanding data, is critical to success with machine learning.

Not surprising but does get overlooked.

February 11, 2013

Label propagation in GraphChi

Filed under: Artificial Intelligence,Classifier,GraphChi,Graphs,Machine Learning — Patrick Durusau @ 4:12 pm

Label propagation in GraphChi by Danny Bickson.

From the post:

A few days ago I got a request from Jidong, from the Chinese Renren company, to implement label propagation in GraphChi. The algorithm is very simple and is described here: Zhu, Xiaojin, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.

The basic idea is that we start with a group of users that we have some information about the categories they are interested in. Following the weights in the social network, we propagate the label probabilities from the user seed node (the ones we have label information about) into the general social network population. After several iterations, the algorithm converges and the output is labels for the unknown nodes.
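The Zhu and Ghahramani iteration is short enough to sketch directly: repeatedly average neighbors’ label distributions along row-normalized edge weights, clamping the seed nodes each round. A minimal dense-matrix sketch (GraphChi’s version is out-of-core; this is only the math, on an invented toy graph):

```python
import numpy as np

# Symmetric weight matrix for a 4-node chain graph: 0-1-2-3
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Seed labels: node 0 is class A, node 3 is class B; nodes 1 and 2 unknown
Y = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]])
seeds = [0, 3]

T = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
F = Y.copy()
for _ in range(100):
    F = T @ F                 # propagate neighbors' label distributions
    F[seeds] = Y[seeds]       # clamp the seed nodes each iteration

print(F.round(3))  # node 1 leans to class A, node 2 to class B
```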

I assume there is more unlabeled data for topic maps than labeled data.

Depending upon your requirements, this could prove to be a useful technique for completing those unlabeled nodes.

February 10, 2013

Setting up Java GraphChi development environment…

Filed under: GraphChi,Graphs,Java,Machine Learning — Patrick Durusau @ 2:17 pm

Setting up Java GraphChi development environment – and running sample ALS by Danny Bickson.

From the post:

As you may know, our GraphChi collaborative filtering toolkit in C is becoming more and more popular. Recently, Aapo Kyrola did a great effort for porting GraphChi C into Java and implementing more methods on top of it.

In this blog post I explain how to setup GraphChi Java development environment in Eclipse and run alternating least squares algorithm (ALS) on a small subset of Netflix data.

Based on the level of user feedback I am going to receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.

If you are interested in more machine learning methods in Java, here’s your chance!

Not to mention your interest in graph based solutions.
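For the gist of the ALS algorithm the sample runs: it alternates two regularized least-squares solves, fixing the item factors to solve for the user factors and vice versa. A dense toy sketch in numpy (GraphChi works out-of-core on sparse ratings; this only shows the alternation, on an invented fully-observed matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy fully-observed rating matrix (users x items); real data is sparse
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])
k, lam = 2, 0.1                       # latent rank, ridge regularization
U = rng.normal(size=(3, k))           # user factors
V = rng.normal(size=(3, k))           # item factors

for _ in range(50):
    # Fix V, ridge-solve for U; then fix U, ridge-solve for V
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

print(np.abs(U @ V.T - R).max())  # small reconstruction error
```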

February 3, 2013

Case study: million songs dataset

Filed under: Data,Dataset,GraphChi,Graphs,Machine Learning — Patrick Durusau @ 6:58 pm

Case study: million songs dataset by Danny Bickson.

From the post:

A couple of days ago I wrote about the million songs dataset. Our man in London, Clive Cox from Rummble Labs, suggested we should implement rankings based on item similarity.

Thanks to Clive’s suggestion, we now have an implementation of Fabio Aiolli’s cost function as explained in the paper: A Preliminary Study for a Recommender System for the Million Songs Dataset, which is the winning method in this contest.

Following are detailed instructions on how to utilize GraphChi CF toolkit on the million songs dataset data, for computing user ratings out of item similarities. 

Just in case you need some data for practice with your GraphChi installation. 😉

Seriously, nice way to gain familiarity with the data set.

What value you extract from it is up to you.
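Item-similarity ranking of the kind mentioned above can be sketched with plain cosine similarity over a user-song play matrix. A toy version (Aiolli’s method uses an asymmetric similarity and a ranking-specific aggregation, omitted here; the matrix is invented):

```python
import numpy as np

# Binary user x song play matrix (5 users, 4 songs), invented for illustration
P = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Cosine similarity between song columns
norms = np.linalg.norm(P, axis=0)
S = (P.T @ P) / np.outer(norms, norms)
np.fill_diagonal(S, 0.0)

# Score unheard songs for user 0 by summed similarity to songs they played
user = P[0]
scores = S @ user
scores[user > 0] = -np.inf        # don't recommend what they already have
print(scores.argmax())            # → 2, the song co-played with user 0's songs
```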

February 1, 2013

Still Not a MOOC, but…

Filed under: Machine Learning — Patrick Durusau @ 8:04 pm

John Langford and Yann LeCun are teaching a large scale machine learning class at NYU that was announced to not be a MOOC. See: NYU Large Scale Machine Learning Class [Not a MOOC]

However, see: Remote large scale learning class participation.

John and Yann have arranged for the lectures and slides to be posted with a one-day delay, and there is a discussion forum if you are interested.

Still not a MOOC, but a wonderful opportunity for those of us who cannot attend in person.

January 26, 2013

Human Computation and Crowdsourcing

Filed under: Artificial Intelligence,Crowd Sourcing,Human Computation,Machine Learning — Patrick Durusau @ 1:42 pm

Announcing HCOMP 2013 – Conference on Human Computation and Crowdsourcing by Eric Horvitz.

From the conference website:

Where

Palm Springs, California
Venue information coming soon

When

November 7-9, 2013

Important Dates

All deadlines are 5pm Pacific time unless otherwise noted.

Papers

Submission deadline: May 1, 2013
Author rebuttal period: June 21-28
Notification: July 16, 2013
Camera Ready: September 4, 2013

Workshops & Tutorials

Proposal deadline: May 10, 2013
Notification: July 16, 2013
Camera Ready: September 4, 2013

Posters & Demonstrations

Submission deadline: July 25, 2013
Notification: August 26, 2013
Camera Ready: September 4, 2013

From the post:

Announcing HCOMP 2013, the Conference on Human Computation and Crowdsourcing, Palm Springs, November 7-9, 2013. Paper submission deadline is May 1, 2013. Thanks to the HCOMP community for bringing HCOMP to life as a full conference, following on the successful workshop series.

The First AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2013) will be held November 7-9, 2013 in Palm Springs, California, USA. The conference was created by researchers from diverse fields to serve as a key focal point and scholarly venue for the review and presentation of the highest quality work on principles, studies, and applications of human computation. The conference is aimed at promoting the scientific exchange of advances in human computation and crowdsourcing among researchers, engineers, and practitioners across a spectrum of disciplines. Papers submissions are due May 1, 2013 with author notification on July 16, 2013. Workshop and tutorial proposals are due May 10, 2013. Posters & demonstrations submissions are due July 25, 2013.

I suppose it had to happen.

Instead of asking adding machines for their opinions, someone would decide to ask the creators of adding machines for theirs.

I first saw this at: New AAAI Conference on Human Computation and Crowdsourcing by Shar Steed.

Machine Learning Cheat Sheet (for scikit-learn)

Filed under: Machine Learning,Scikit-Learn — Patrick Durusau @ 1:40 pm

Machine Learning Cheat Sheet (for scikit-learn) by Andreas Mueller.

From the post:

(Click for a larger version)

BTW, scikit-learn is doing a user survey.

Take a few minutes to contribute your feedback.

January 24, 2013

Scikit-Learn 0.13 released!

Filed under: Machine Learning,Scikit-Learn — Patrick Durusau @ 8:07 pm

Scikit-Learn 0.13 released! We want your feedback. by Andreas Mueller

From the post:

After a little delay, the team finished work on the 0.13 release of scikit-learn.

There is also a user survey that we launched in parallel with the release, to get some feedback from our users.

There is a list of changes and new features on the website.

Feedback (useful feedback) is a small price to pay for such a large amount of effort!

Download the new release and submit feedback.

On the next release you will be glad you did!

January 23, 2013

Assembling a Python Machine Learning Toolkit

Filed under: Machine Learning,Python — Patrick Durusau @ 7:40 pm

Assembling a Python Machine Learning Toolkit by Sujit Pal.

From the post:

I had been meaning to read Peter Harrington’s book Machine Learning In Action (MLIA) for a while now, and I finally finished reading it earlier this week (my review on Amazon is here). The book provides Python implementations of 8 of the 10 Top Algorithms in Data Mining listed in this paper (PDF). The math package used in the examples is Numpy, and the charts are built using Matplotlib.

In the past, the little ML work I have done has been in Java, because that was the language and ecosystem I knew best. However, given the experimental, iterative nature of ML work, its probably not the most ideal language to use. However, there are lots of options when it comes to languages for ML – over the last year, I have learned Octave (open-source version of MATLAB) for the Coursera Machine Learning class and R for the Coursera Statistics One and Computing for Data Analysis classes (still doing the second one). But because I know Python already, Python/Numpy looks easier to use than Octave, and Python/Matplotlib looks as simple as using R graphics. There is also the pandas package which provides R-like features, although I haven’t used it yet.

Looking around on the net, I find that many other people have reached similar conclusions – ie, that Python seems to be the way to go for initial prototyping work in ML. I wanted to set up a small toolbox of Python libraries that will allow me to do this also. I settled on an initial list of packages based on the Scipy Superpack, but since I am still on Mac OS (Snow Leopard) I could not use the script from there. There were some issues I had to work through to make this to work, so I document this here, so if you are in the same situation this may help you.

Unlike the Scipy Superpack, which seems to prefer versions that are often the bleeding edge development versions, I decided to stick to the latest stable release versions for each of the libraries. Here they are:

Sujit’s post will save you a few steps in assembling your Python machine learning toolkit.

Pass it on.
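Whatever versions you settle on, a quick check that the stack imports cleanly is worth keeping around. A minimal sketch (trimmed to three core packages; extend the tuple with whatever else your toolbox includes):

```python
# Verify the core scientific Python stack is installed and report versions
import importlib

for name in ("numpy", "scipy", "matplotlib"):
    try:
        mod = importlib.import_module(name)
        print(f"{name:12s} {mod.__version__}")
    except ImportError:
        print(f"{name:12s} MISSING")
```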

January 22, 2013

Mahout on Windows Azure…

Filed under: Azure Marketplace,Hadoop,Hortonworks,Machine Learning,Mahout — Patrick Durusau @ 2:42 pm

Mahout on Windows Azure – Machine Learning Using Microsoft HDInsight by Istvan Szegedi.

From the post:

Our last post was about Microsoft and Hortonworks joint effort to deliver Hadoop on Microsoft Windows Azure dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms relying on the Hadoop platform. Machine learning supports a wide range of use cases from email spam filtering to fraud detection to recommending books or movies, similar to Amazon.com features. These algorithms can be divided into three main categories: recommenders/collaborative filtering, categorization and clustering. More details about these algorithms can be read on Apache Mahout wiki.

Are you hearing Hadoop, Mahout, HBase, Hive, etc., as often as I am?

Does it make you wonder about Apache becoming the locus of transferable IT skills?

Something to think about as you are developing topic map ecosystems.

You can hand roll your own solutions.

Or build upon solutions that have widespread vendor support.

PS: Another great post from Istvan.

Prediction API – Machine Learning from Google

Filed under: Google Prediction,Machine Learning,Prediction,Topic Maps — Patrick Durusau @ 2:42 pm

Prediction API – Machine Learning from Google by Istvan Szegedi.

From the post:

One of the exciting APIs among the 50+ APIs offered by Google is the Prediction API. It provides pattern matching and machine learning capabilities like recommendations or categorization. The notion is similar to the machine learning capabilities that we can see in other solutions (e.g. in Apache Mahout): we can train the system with a set of training data and then the applications based on Prediction API can recommend (“predict”) what products the user might like or they can categorize spam, etc.

In this post we go through an example of how to categorize SMS messages – whether they are spam or valuable texts (“hams”).

Nice introduction to Google’s Prediction API.

A use case for topic map authoring would be to route content to appropriate experts for further evaluation.
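The train-then-predict workflow described above (label SMS messages as spam or ham, train, then classify new texts) can be sketched locally with a tiny hand-rolled Naive Bayes. This is an illustration of the idea, not the Prediction API itself, and the training messages are invented:

```python
import math
from collections import Counter

train = [
    ("win cash now", "spam"),
    ("free prize claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

# Per-class word counts and document counts (the "trained model")
counts = {"spam": Counter(), "ham": Counter()}
docs = Counter()
for text, label in train:
    counts[label].update(text.split())
    docs[label] += 1
vocab = {w for c in counts.values() for w in c}

def predict(text):
    """Multinomial Naive Bayes with add-one smoothing."""
    best, best_lp = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values())
        lp = math.log(docs[label] / len(train))          # class prior
        for w in text.split():
            lp += math.log((c[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("claim your free cash"))   # spam
print(predict("lunch tomorrow"))         # ham
```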

January 21, 2013

The #NIPS2012 Videos are out

Filed under: Decision Making,Inference,Machine Learning — Patrick Durusau @ 7:29 pm

The #NIPS2012 Videos are out by Igor Carron.

From the post:

Videolectures came through earlier than last year. woohoo! Presentations relevant to Nuit Blanche were featured earlier here. Videos for the presentations for the Posner Lectures, Invited Talks and Oral Sessions of the conference are here. Videos for the presentations for the different Workshops are here. Some videos are not available because the presenters have not given their permission to the good folks at Videolectures. If you know any of them, let them know the world is waiting.

Just in case Netflix is down. 😉

January 17, 2013

CS 229 Machine Learning – Final Projects, Autumn 2012

Filed under: Machine Learning — Patrick Durusau @ 7:25 pm

CS 229 Machine Learning – Final Projects, Autumn 2012

Two hundred and forty-five (245) final project reports in machine learning.

I started to provide a sampling but I would miss the one that would capture your interest.

BTW, yes, this is Andrew Ng’s Machine Learning course.

Machine Learning and Data Mining – Association Analysis with Python

Machine Learning and Data Mining – Association Analysis with Python by Marcel Caraciolo.

From the post:

Recently I’ve been working with recommender systems and association analysis. This last one, specially, is one of the most used machine learning algorithms to extract from large datasets hidden relationships.

The famous example in the study of association analysis is the story of baby diapers and beer. The story goes that a certain grocery store in the Midwest of the United States increased its beer sales by putting the beer near where the diapers were placed. In fact, what happened is that the association rules pointed out that men bought diapers and beer on Thursdays. So the store could profit by placing those products together, which would increase sales.

Association analysis is the task of finding interesting relationships in large data sets. These hidden relationships are then expressed as a collection of association rules and frequent item sets. Frequent item sets are simply collections of items that frequently occur together, and association rules suggest that a strong relationship exists between two items.
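The frequent-itemset and rule-confidence ideas in the excerpt can be sketched by brute force on the diapers-and-beer example. The baskets below are toy data, and real miners (Apriori, FP-growth, and the approach in Caraciolo's post) prune the search space instead of enumerating every combination.

```python
from itertools import combinations

baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    # Fraction of baskets containing every item in the set.
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

# Frequent item sets: combinations of 1 or 2 items above minimum support.
items = set().union(*baskets)
frequent = [
    frozenset(combo)
    for n in (1, 2)
    for combo in combinations(sorted(items), n)
    if support(frozenset(combo)) >= 0.5
]

def confidence(antecedent, consequent):
    # Of the baskets containing the antecedent, how many also
    # contain the consequent?
    return support(antecedent | consequent) / support(antecedent)

print(sorted(map(sorted, frequent)))
print(confidence({"diapers"}, {"beer"}))  # 2 of the 3 diaper baskets hold beer
```

Here {diapers, beer} survives as a frequent pair, and the rule diapers → beer has confidence 2/3, which is exactly the kind of output the post mines from larger data.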

When I think of associations in a topic map, I assume I am at least starting with the roles and the players of those roles.

As this post demonstrates, that may be overly optimistic on my part.

What if I discover an association but not its type or the roles in it? And yet I still want to preserve the discovery for later use?

An incomplete association as it were.

Suggestions?

January 16, 2013

RuleML 2013

Filed under: Conferences,Machine Learning,RuleML — Patrick Durusau @ 7:56 pm

RuleML 2013

Important Dates:

Abstract submission: Feb. 19, 2013
Paper submission: Feb. 20, 2013
Notification of acceptance/rejection: April 12, 2013
Camera-ready copy due: May 3, 2013
RuleML-2013 dates: July 11-13, 2013

From the call for papers:

The annual International Web Rule Symposium (RuleML) is an international conference on research, applications, languages and standards for rule technologies. RuleML is the leading conference for building bridges between academia and industry in the field of rules and its applications, especially as part of the semantic technology stack. It is devoted to rule-based programming and rule-based systems including production rules systems, logic programming rule engines, and business rules engines/business rules management systems; Semantic Web rule languages and rule standards (e.g., RuleML, SWRL, RIF, PRR, SBVR); Legal RuleML; rule-based event processing languages (EPLs) and technologies; hybrid rule-based methods; and research on inference rules, transformation rules, decision rules, production rules, and ECA rules.

The 7th International Symposium on Rules and the Web (RuleML 2013) will be held on July 11-13, 2013 just prior to the AAAI conference in the Seattle Metropolitan Area, Washington. Selected papers will be published in book form in the Springer Lecture Notes in Computer Science (LNCS) series.

Topics:

  • Rules and automated reasoning
  • Rule-based policies, reputation, and trust
  • Rule-based event processing and reaction rules
  • Rules and the web
  • Fuzzy rules and uncertainty
  • Logic programming and nonmonotonic reasoning
  • Non-classical logics and the web (e.g., modal and epistemic logics)
  • Hybrid methods for combining rules and statistical machine learning techniques (e.g., conditional random fields, PSL)
  • Rule transformation, extraction, and learning
  • Vocabularies, ontologies, and business rules
  • Rule markup languages and rule interchange formats
  • Rule-based distributed/multi-agent systems
  • Rules, agents, and norms
  • Rule-based communication, dialogue, and argumentation models
  • Vocabularies and ontologies for pragmatic primitives (e.g. speech acts and deontic primitives)
  • Pragmatic web reasoning and distributed rule inference / rule execution
  • Rules in online market research and online marketing
  • Applications of rule technologies in health care and life sciences
  • Legal rules and legal reasoning
  • Industrial applications of rules
  • Controlled natural language for rule encoding (e.g. SBVR, ACE, CLCE)
  • Standards activities related to rules
  • General rule topics

A number of those seem quite at home in a topic maps setting.

January 13, 2013

Foundations of Rule Learning [A Topic Map Parable]

Filed under: Data Mining,Machine Learning,Topic Maps — Patrick Durusau @ 8:14 pm

Foundations of Rule Learning by Authors: Johannes Fürnkranz, Dragan Gamberger, Nada Lavrač, ISBN: 978-3-540-75196-0 (Print) 978-3-540-75197-7 (Online).

From the Introduction:

Rule learning is not only one of the oldest but also one of the most intensively investigated, most frequently used, and best developed fields of machine learning. In more than 30 years of intensive research, many rule learning systems have been developed for propositional and relational learning, and have been successfully used in numerous applications. Rule learning is particularly useful in intelligent data analysis and knowledge discovery tasks, where the compactness of the representation of the discovered knowledge, its interpretability, and the actionability of the learned rules are of utmost importance for successful data analysis.

The aim of this book is to give a comprehensive overview of modern rule learning techniques in a unifying framework which can serve as a basis for future research and development. The book provides an introduction to rule learning in the context of other machine learning and data mining approaches, describes all the essential steps of the rule induction process, and provides an overview of practical systems and their applications. It also introduces a feature-based framework for rule learning algorithms which enables the integration of propositional and relational rule learning concepts.
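To give a feel for the rule induction process the book surveys, here is a OneR-style learner: for each attribute it builds "if attribute = value then class" rules and keeps the attribute with the fewest training errors. This is a deliberately minimal sketch on made-up data, not one of the book's algorithms; the book covers far richer propositional and relational methods.

```python
from collections import Counter

# Hypothetical training data: (attributes, class).
data = [
    ({"outlook": "sunny", "windy": "no"},  "yes"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rainy", "windy": "no"},  "yes"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
    ({"outlook": "sunny", "windy": "no"},  "yes"),
]

def one_r(examples):
    best_attr, best_rules, best_errors = None, None, len(examples) + 1
    for attr in examples[0][0]:
        # Majority class for each value of this attribute.
        by_value = {}
        for x, y in examples:
            by_value.setdefault(x[attr], Counter())[y] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(1 for x, y in examples if rules[x[attr]] != y)
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules, errors
    return best_attr, best_rules

attr, rules = one_r(data)
print(attr, rules)  # "windy" separates this toy data perfectly
```

Even in this trivial form, the output is a compact, interpretable rule set ("if windy = no then yes"), which is the property the authors stress for data analysis tasks.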

The topic map parable comes near the end of the introduction where the authors note:

The book is written by authors who have been working in the field of rule learning for many years and who themselves developed several of the algorithms and approaches presented in the book. Although rule learning is assumed to be a well-established field with clearly defined concepts, it turned out that finding a unifying approach to present and integrate these concepts was a surprisingly difficult task. This is one of the reasons why the preparation of this book took more than 5 years of joint work.

A good deal of discussion went into the notation to use. The main challenge was to define a consistent notational convention to be used throughout the book, because there is no generally accepted notation in the literature. The notation used is gently introduced throughout the book, and is summarized in Table I in a section on notational conventions immediately following this preface (pp. xi–xiii). We strongly believe that the proposed notation is intuitive. Its use enabled us to present different rule learning approaches in a unifying notation and terminology, hence advancing the theory and understanding of the area of rule learning.

Semantic diversity in rule learning was discovered and took five years to resolve.

Where n = all prior notations/terminologies, the solution was to create the n + 1 notation/terminology.

Understandable and certainly a major service to the rule learning community. The problem remains, how does one use the n + 1 notation/terminology to access prior (and forthcoming) literature in rule learning?

In its present form, the resolution of the prior notations and terminologies into the n + 1 terminology isn’t accessible to search, data, or bibliographic engines.

Not to mention that on the next survey of rule learning, its authors will have to duplicate the work already accomplished by these authors.

Something seems wrong about the inability to reuse the valuable work done by these authors, whether to improve current information systems or to avoid duplicating effort in the future.

Particularly since it is avoidable through the use of topic maps.


The link at the top of this post is the “new and improved site,” which has less sample content than Foundations for Rule Learning, apparently an old and not improved site.

I first saw this in a post by Gregory Piatetsky.

January 10, 2013

Machine Learning Throwdown: The Reckoning

Filed under: Machine Learning — Patrick Durusau @ 1:49 pm

Machine Learning Throwdown: The Reckoning by Charles Parker.

From the post:

As you, our faithful readers, know, we compared some machine learning services several months ago in our machine learning throwdown. In another recent blog post, we talked about the power of ensembles, and how your BigML models can be made into an even more powerful classifier when many of them are learned over samples of the data. With this in mind, we decided to re-run the performance tests from the fourth throwdown post using BigML ensembles as well as single BigML models.

You can see the results in an updated version of the throwdown details file. As you’ll be able to see, the ensembles of classifiers (Bagged BigML Classification/Regression Trees) almost always outperform their solo counterparts. In addition, if we update our “medal count” table tracking the competition among our three machine learning services, we see that the BigML ensembles now lead in the number of “wins” over all datasets:

Charles continues his comparison of machine learning services.

Charles definitely has a position. 😉

On the other hand, the evidence suggests a close look at your requirements, data and capabilities before defaulting to one solution or another.
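The mechanics behind the bagging result Charles reports are simple enough to sketch: train each base model on a bootstrap sample of the data and combine predictions by majority vote. The sketch below uses a trivial one-feature threshold stump as the base learner on made-up 1-D data; BigML bags full classification/regression trees, so this only illustrates the ensemble idea, not their service.

```python
import random
from collections import Counter

random.seed(0)

# Toy 1-D data: label is 1 above 0.5, with one noisy point.
X = [i / 20 for i in range(20)]
y = [1 if x > 0.5 else 0 for x in X]
y[3] = 1  # label noise

def fit_stump(xs, ys):
    # Base learner: threshold with the fewest errors on this sample.
    best_t, best_err = 0.0, len(xs) + 1
    for t in xs:
        err = sum((x > t) != bool(lab) for x, lab in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

stumps = []
for _ in range(25):
    # Bootstrap sample: draw indices with replacement.
    idx = [random.randrange(len(X)) for _ in X]
    stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

def bagged_predict(x, stumps):
    # Majority vote over all bootstrap-trained stumps.
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

print(bagged_predict(0.9, stumps), bagged_predict(0.1, stumps))
```

Because each stump sees a different resample, the noisy point only sways some of them, and the vote averages its influence away. That variance reduction is the usual explanation for the "ensembles almost always win" pattern in the throwdown results.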

January 8, 2013

NYU Large Scale Machine Learning Class [Not a MOOC]

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 11:46 am

NYU Large Scale Machine Learning Class by John Langford.

From the post:

Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.

This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers.

For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through that lens. More generally, I expect there is quite a bit of interest: figuring out how to use data to make predictions well is a topic of growing interest to many fields. In 2007, this was true, and demand is much stronger now. Yann and I also come from quite different viewpoints, so I’m looking forward to learning from him as well.

We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible.

Not a MOOC but videos of the lectures will be available. Details under development.

Note the request for suggestions on the class.
