Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 5, 2013

Machine Learning Surveys

Filed under: Machine Learning — Patrick Durusau @ 7:41 am

Machine Learning Surveys

According to the tweet that led me here:

http://mlsurveys.com a crowdsourced list of #machinelearning survey and tutorial papers organized by topics and publication years

Not a large set of papers (110 as of when I looked) but certainly a serviceable idea. The vetting/editorial mechanism isn’t clear.

I first saw this in a post by Olivier Grisel.

January 4, 2013

A List of Data Science and Machine Learning Resources

Filed under: Data Science,Machine Learning — Patrick Durusau @ 7:37 pm

A List of Data Science and Machine Learning Resources

From the post:

Every now and then I get asked for some help or for some pointers on a machine learning/data science topic. I tend to respond with links to resources by folks that I consider to be experts in the topic area. Over time my list has gotten a little larger so I decided to put it all together in a blog post. Since it is based mostly on the questions I have received, it is by no means complete, or even close to a complete list, but hopefully it will be of some use. Perhaps I will keep it updated, or even better yet, feel free to comment with anything you think might be of help.

Also, when I think of data science, I tend to focus on Machine Learning rather than the hardware or coding aspects. If you are looking for stuff on Hadoop, or R, or Python, sorry, there really isn’t anything here.

A bit more specific advice than “just do it,” which may be helpful to many readers.

The first resource is Professor Gilbert Strang’s video lectures on Linear Algebra.

Factoid: Strang’s Introduction to Linear Algebra, Fourth Edition (2009), lists new at Amazon for $60.49. The cheapest used copy goes for $53.90. Not bad for a textbook that is four years old this year.

I first saw this at: Free Online Resources: Bone Up on Your Data Science and Machine Learning by Angela Guess.

December 28, 2012

LIBOL 0.1.0

Filed under: Algorithms,Classification,Machine Learning — Patrick Durusau @ 7:44 pm

LIBOL 0.1.0

From the webpage:

LIBOL is an open-source library for large-scale online classification, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large-scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers. We also have made documents available for both beginners and advanced users. LIBOL is not only a machine learning tool, but also a comprehensive experimental platform for conducting online learning research.

In general, the existing online learning algorithms for linear classification tasks can be grouped into two major categories: (i) first order learning (Rosenblatt, 1958; Crammer et al., 2006), and (ii) second order learning (Dredze et al., 2008; Wang et al., 2012; Yang et al., 2009).

Example online learning algorithms in the first order learning category implemented in this library include:

• Perceptron: the classical online learning algorithm (Rosenblatt, 1958);

• ALMA: A New Approximate Maximal Margin Classification Algorithm (Gentile, 2001);

• ROMMA: the relaxed online maximum margin algorithms (Li and Long, 2002);

• OGD: the Online Gradient Descent (OGD) algorithms (Zinkevich, 2003);

• PA: Passive Aggressive (PA) algorithms (Crammer et al., 2006), one of state-of-the-art first order online learning algorithms;

Example algorithms in the second order online learning category implemented in this library include the following:

• SOP: the Second Order Perceptron (SOP) algorithm (Cesa-Bianchi et al., 2005);

• CW: the Confidence-Weighted (CW) learning algorithm (Dredze et al., 2008);

• IELLIP: online learning algorithms by improved ellipsoid method (Yang et al., 2009);

• AROW: the Adaptive Regularization of Weight Vectors (Crammer et al., 2009);

• NAROW: New variant of Adaptive Regularization (Orabona and Crammer, 2010);

• NHERD: the Normal Herding method via Gaussian Herding (Crammer and Lee, 2010)

• SCW: the recently proposed Soft Confidence-Weighted algorithms (Wang et al., 2012).

LIBOL is still being improved, drawing on feedback from practical users and on new research results.

More information can be found in our project website: http://libol.stevenhoi.org/

Consider this an early New Year’s present!
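For readers new to the area, the classical Perceptron update at the heart of the first-order family can be sketched in a few lines of Python (a toy illustration on made-up data, not LIBOL code):

```python
# Toy Perceptron (Rosenblatt, 1958): on each mistake, nudge the
# weight vector toward (or away from) the misclassified example.
def perceptron(examples, dim, epochs=10, lr=1.0):
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:        # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:       # mistake: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data: positive roughly when x0 + x1 > 1
data = [([0.0, 0.0], -1), ([1.0, 1.0], 1), ([0.2, 0.1], -1), ([0.9, 0.8], 1)]
w, b = perceptron(data, dim=2)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y > 0)
          for x, y in data))         # → True
```

LIBOL’s first-order algorithms refine exactly this mistake-driven loop; PA, for example, replaces the fixed learning rate with an adaptive step size.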

December 17, 2012

The Rewards of Ignoring Data

Filed under: Boosting,Machine Learning,Random Forests — Patrick Durusau @ 2:55 pm

The Rewards of Ignoring Data by Charles Parker.

From the post:

Can you make smarter decisions by ignoring data? It certainly runs counter to our mission, and sounds a little like an Orwellian dystopia. But as we’re going to see, ignoring some of your data some of the time can be a very useful thing to do.

Charlie does an excellent job of introducing the use of multiple models of data and includes deeper material:

There are fairly deep mathematical reasons for this, and ML scientist par excellence Robert Schapire lays out one of the most important arguments in the landmark paper “The Strength of Weak Learnability” in which he proves that a machine learning algorithm that performs only slightly better than randomly can be “boosted” into a classifier that is able to learn to an arbitrary degree of accuracy. For this incredible contribution (and for the later paper that gave us the AdaBoost algorithm), he and his colleague Yoav Freund earned the Gödel Prize for computer science theory, the only time the award has been given for a machine learning paper.

Not being satisfied, Charles demonstrates how you can create a random decision forest from your data.

Which is possible without reading the deeper material.
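Schapire’s result is concrete enough to sketch. Here is a minimal AdaBoost over one-dimensional threshold “stumps” in Python (a toy illustration with made-up data, not code from Charlie’s post):

```python
import math

# Minimal AdaBoost: combine weak threshold classifiers on 1-D data.
# Each stump predicts +1 above (or below) a threshold, -1 otherwise.
def stumps(xs):
    for t in sorted(set(xs)):
        for sign in (1, -1):
            yield lambda x, t=t, s=sign: s if x > t else -s

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                  # example weights
    ensemble = []                      # (alpha, stump) pairs
    for _ in range(rounds):
        best, best_err = None, 1.0
        for h in stumps(xs):           # pick the lowest weighted error
            err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
        if best_err == 0 or best_err >= 0.5:
            if best_err == 0:
                ensemble.append((1.0, best))
            break
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # reweight: mistakes get heavier, correct examples lighter
        w = [wi * math.exp(-alpha * y * best(x)) for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

xs = [1, 2, 3, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1]
clf = adaboost(xs, ys)
print([clf(x) for x in xs])            # → [-1, -1, -1, 1, 1, 1]
```

Each round fits a weak learner to a reweighted version of the data, which is exactly the “ignoring some of your data some of the time” that the post describes.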

December 9, 2012

BigMLer in da Cloud: Machine Learning made even easier [Amateur vs. Professional Models]

Filed under: Cloud Computing,Machine Learning,WWW — Patrick Durusau @ 5:19 pm

BigMLer in da Cloud: Machine Learning made even easier by Martin Prats.

From the post:

We have open-sourced BigMLer, a command line tool that will let you create predictive models much easier than ever before.

BigMLer wraps BigML’s API Python bindings to offer a high-level command-line script to easily create and publish Datasets and Models, create Ensembles, make local Predictions from multiple models, and simplify many other machine learning tasks. BigMLer is open sourced under the Apache License, Version 2.0.

“…will let you create predictive models much easier than ever before.”

Well…, true, but the amount of effort you invest in a predictive model bears on the usefulness of the model for a given purpose.

It is a great idea to create an easy “on ramp” to introduce machine learning. But it may lead some users to confuse “…easier than ever before” models with professionally crafted models.

An old friend confided their organization was about to write a classification system for a well-known subject. Exciting to think they will put all past errors to rest while adding new capabilities.

But in reality librarians have labored in such areas for centuries. It isn’t a good target for a start-up project. Particularly for those innocent of existing classification systems and the theory/praxis that drove their creation.

Librarians didn’t invent the Internet. If they had, we wouldn’t be searching for ways to curate information on the Internet, in a backwards compatible way.

December 5, 2012

The Elements of Statistical Learning (2nd ed.)

Filed under: Machine Learning,Mathematics,Statistical Learning,Statistics — Patrick Durusau @ 6:50 am

The Elements of Statistical Learning (2nd ed.) by Trevor Hastie, Robert Tibshirani and Jerome Friedman. (PDF)

The authors note in the preface to the first edition:

The field of Statistics is constantly challenged by the problems that science and industry brings to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of “data mining”; statistical and computational problems in biology and medicine have created “bioinformatics.” Vast amounts of data are being generated in many fields, and the statistician’s job is to make sense of it all: to extract important patterns and trends, and understand “what the data says.” We call this learning from data.

I’m sympathetic to that sentiment but with the caveat that it is our semantic expectations of the data that give it any meaning to be “learned.”

Data isn’t lurking outside our door with “meaning” captured separate and apart from us. Our fancy otherwise obscures our role in the origin of “meaning” that we attach to data. In part to bolster the claim that the “facts/data say….”

It is us who take up the gauge for our mute friends, facts/data, and make claims on their behalf.

If we recognized those as our claims, perhaps we would be more willing to listen to the claims of others. Perhaps.

I first saw this in a tweet by Michael Conover.

December 1, 2012

MOA Massively Online Analysis

Filed under: BigData,Data,Hadoop,Machine Learning,S4,Storm,Stream Analytics — Patrick Durusau @ 8:02 pm

MOA Massively Online Analysis : Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started, but you will need additional machine learning resources if the subject isn’t already familiar.

A small niggle about documentation. Many projects have files named “tutorial” or in this case “Tutorial1,” or “Manual.” Those files are easier to discover/save if the project name (and version?) is prepended to “tutorial” or “manual.” Thus “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.

November 23, 2012

Data Mining and Machine Learning in Astronomy

Filed under: Astroinformatics,Data Mining,Machine Learning — Patrick Durusau @ 11:30 am

Data Mining and Machine Learning in Astronomy by Nicholas M. Ball and Robert J. Brunner. (International Journal of Modern Physics D, Volume 19, Issue 07, pp. 1049-1106 (2010).)

Abstract:

We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

At fifty-eight (58) pages and three hundred and seventy-five references, this is a great starting place to learn about data mining and machine learning from an astronomy perspective!

And should yield new techniques or new ways to apply old ones to your data, with a little imagination.

Dates from 2010 so word of more recent surveys welcome!

November 21, 2012

Digesting Big Data [Egestion vs. Excretion]

Filed under: BigData,Machine Learning — Patrick Durusau @ 6:17 am

Digesting Big Data by Jos Verwoerd.

From the post:

Phase One: Ingestion.
The art of collecting data and storing it.

Phase Two: Digestion.
Digestion is processing your raw data into something that you can extract value from.

Phase Three: Absorption.
This stage is all about extracting insights from your data.

Phase Four: Assimilation.
In the fourth stage you want to put the insights to action.

Phase Five: Egestion.
This fifth phase runs parallel to all others and is about getting rid of the unwanted, unclean, unnecessary parts of your data, invalid insights and predictions at every step of the process.

An analogy to processing big data that I have not encountered before.

Jos points out that most PR is about phases one and two, but the business value payoff doesn’t come until phases three and four.

Interesting reading, and I learned a new word, “Egestion.”

From Biology Online: Egestion –

noun

The act or process of voiding or discharging undigested food as faeces.

Supplement

Egestion is the discharge or expulsion of undigested material (food) from a cell in case of unicellular organisms, and from the digestive tract via the anus in case of multicellular organisms.

It should not be confused with excretion, which is getting rid of waste formed from the chemical reaction of the body, such as in urine, sweat, etc.

Word origin: from Latin ēgerere, ēgest-, to carry out.
Related forms: egest (verb).

According to Webster’s, excrement doesn’t make such fine distinctions, covering waste of any origin.

November 13, 2012

Introduction to Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 2:52 pm

Introduction to Machine Learning by Alex Smola and S.V.N. Vishwanathan. (PDF file)

From the preface:

Since this is a textbook we biased our selection of references towards easily accessible work rather than the original references. While this may not be in the interest of the inventors of these concepts, it greatly simplifies access to those topics. Hence we encourage the reader to follow the references in the cited works should they be interested in finding out who may claim intellectual ownership of certain key ideas.

If you need a machine learning textbook with accessible references this is one to consider.

I first saw this in a tweet by Peter Skomoroch.

November 5, 2012

Topic Modeling Tool

Filed under: Latent Dirichlet Allocation (LDA),Machine Learning — Patrick Durusau @ 2:31 pm

Topic Modeling Tool

From the webpage:

A graphical user interface tool for Latent Dirichlet Allocation topic modeling.

A very easy tool for exploring the use of Latent Dirichlet Allocation topic modeling.

Of course, on non-Mac machines, there is no “Double-click” on the jar file to run it, so use:

java -jar TopicModelingTool.jar

Oh, and the documentation is missing the link to the test files, see:

http://code.google.com/p/topic-modeling-tool/downloads/list

  • testdata_news_music_2084docs.txt 13.3 MB
  • testdata_news_economy_2073docs.txt 13.0 MB
  • testdata_news_fuel_845docs.txt 5.3 MB
  • testdata_braininjury_10000docs.txt 9.6 MB

I used the testdata_news_music_2084docs.txt file, set to 100 topics with the default options; the learning process took 52 seconds and the complete process 66.056 seconds. Your mileage will vary, but fast enough for smallish data sets.

At least in a session, you can’t change the output directory.

I could see using this in a class to explore a body of material for creation of topic maps.

October 26, 2012

BigML creates a marketplace for Predictive Models

Filed under: Data,Machine Learning,Prediction,Predictive Analytics — Patrick Durusau @ 4:42 pm

BigML creates a marketplace for Predictive Models by Ajay Ohri.

From the post:

BigML has created a marketplace for selling Datasets and Models. This is a first (?) as the closest market for Predictive Analytics till now was Rapid Miner’s marketplace for extensions (at http://rapidupdate.de:8180/UpdateServer/faces/index.xhtml)

From http://blog.bigml.com/2012/10/25/worlds-first-predictive-marketplace/

SELL YOUR DATA

You can make your Dataset public. Mind you: the Datasets we are talking about are BigML’s fancy histograms. This means that other BigML users can look at your Dataset details and create new models based on this Dataset. But they can not see individual records or columns or use it beyond the statistical summaries of the Dataset. Your Source will remain private, so there is no possibility of anyone accessing the raw data.

SELL YOUR MODEL

Now, once you have created a great model, you can share it with the rest of the world. For free or at any price you set. Predictions are paid for in BigML Prediction Credits. The minimum price is ‘Free’ and the maximum price indicated is 100 credits.

Having a public, digital marketplace for data and data analysis has been proposed by many and attempted by more than just a few.

Data is bought and sold today, but not by the digital equivalent of small shop keepers. The shop keepers who changed the face of Europe.

Data is bought and sold today by the digital equivalent of the great feudal lords. Complete with castles (read silos).

Will BigML give rise to a new mercantile class?

Or just as importantly, will you be a member of it or bound to the estate of a feudal lord?

First Steps with NLTK

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 3:18 pm

First Steps with NLTK by Sujit Pal.

From the post:

Most of what I know about NLP is as a byproduct of search, ie, find named entities in (medical) text and annotating them with concept IDs (ie node IDs in our taxonomy graph). My interest in NLP so far has been mostly as a user, like using OpenNLP to do POS tagging and chunking. I’ve been meaning to learn a bit more, and I did take the Stanford Natural Language Processing class from Coursera. It taught me a few things, but still not enough for me to actually see where a deeper knowledge would actually help me. Recently (over the past month and a half), I have been reading the NLTK Book and the NLTK Cookbook in an effort to learn more about NLTK, the Natural Language Toolkit for Python.

This is not the first time I’ve been through the NLTK book, but it is the first time I have tried working out all the examples and (some of) the exercises (available on GitHub here), and I feel I now understand the material a lot better than before. I also realize that there are parts of NLP that I can safely ignore at my (user) level, since they are not either that baked out yet or because their scope of applicability is rather narrow. In this post, I will describe what I learned, where NLTK shines, and what one can do with it.

You will find the structured listing of links into the NLTK PyDocs very useful.

100 most popular Machine Learning talks at VideoLectures.Net

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 9:17 am

100 most popular Machine Learning talks at VideoLectures.Net by Davor Orlič.

A treasure trove of lectures on machine learning.

If there is a sort order to this collection (title, author, length, subject), it escapes me.

Even browsing you will find more than enough material to fill the coming weekend (and beyond).

October 24, 2012

Kaggle Digit Recognizer: A K-means attempt

Filed under: K-Means Clustering,K-Nearest-Neighbors,Machine Learning — Patrick Durusau @ 9:05 am

Kaggle Digit Recognizer: A K-means attempt by Mark Needham.

From the post:

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem – a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file which contains multiple rows each containing 784 pixel values representing a 28×28 pixel image as well as a label indicating which number that image actually represents.

One of the algorithms that we tried out for this problem was a variation on the k-means clustering one whereby we took the values at each pixel location for each of the labels and came up with an average value for each pixel.

The results of machine learning are likely to be direct or indirect input into your topic maps.

Useful evaluation of that input will depend on your understanding of machine learning.
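The per-label pixel averaging described in the post amounts to a nearest-centroid classifier, which can be sketched as follows (toy 4-pixel “images” stand in for the 784-pixel Kaggle rows):

```python
from collections import defaultdict

# Nearest-centroid: average the pixel vectors for each label, then
# classify a new image by the closest average.
def train(rows):                       # rows of (label, pixels)
    sums, counts = {}, defaultdict(int)
    for label, px in rows:
        if label not in sums:
            sums[label] = list(px)
        else:
            sums[label] = [a + b for a, b in zip(sums[label], px)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def classify(centroids, px):
    def dist2(a, b):                   # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda lab: dist2(centroids[lab], px))

rows = [(0, [9, 9, 0, 0]), (0, [8, 8, 1, 0]),   # "zeros": ink top-left
        (1, [0, 0, 9, 9]), (1, [1, 0, 8, 9])]   # "ones": ink bottom-right
cents = train(rows)
print(classify(cents, [7, 9, 0, 1]))   # → 0
```

The real competition data works the same way, just with 784-dimensional vectors and ten labels.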

October 23, 2012

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Filed under: Language,Law,Legal Informatics,Machine Learning,Semantics — Patrick Durusau @ 10:37 am

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Legal Informatics brings news of Dr. Adam Wyner’s paper, An Empirical Approach to the Semantic Representation of Laws, and quotes the abstract as:

To make legal texts machine processable, the texts may be represented as linked documents, semantically tagged text, or translated to formal representations that can be automatically reasoned with. The paper considers the latter, which is key to testing consistency of laws, drawing inferences, and providing explanations relative to input. To translate laws to a form that can be reasoned with by a computer, sentences must be parsed and formally represented. The paper presents the state-of-the-art in automatic translation of law to a machine readable formal representation, provides corpora, outlines some key problems, and proposes tasks to address the problems.

The paper originated at Project IMPACT.

If you haven’t looked at semantics and the law recently, this is a good opportunity to catch up.

I have only skimmed the paper and its references but am already looking for online access to early issues of Jurimetrics (a journal by the American Bar Association) that addressed such issues many years ago.

Should be fun to see what has changed and by how much. What issues remain and how they are viewed today.

October 20, 2012

Jubatus:…Realtime Analysis of Big Data [XLDB2012 Presentation]

Filed under: BigData,Jubatus,Machine Learning — Patrick Durusau @ 10:43 am

XLDB2012: Jubatus: Distributed Online Machine Learning Framework for Realtime Analysis of Big Data by Hiroyuki Makino.

I first pointed to Jubatus here.

The presentation reviews some impressive performance numbers and one technique that merits special mention.

Intermediate results are shared among the servers during processing to improve their accuracy. That may be common in distributed machine learning systems but it was the first mention I have encountered.

In parallel processing of topic maps, has anyone considered sharing merging information across servers?

October 10, 2012

Artificial Intelligence and Machine Learning [Mid-week present]

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:20 pm

Artificial Intelligence and Machine Learning (Research at Google)

I assume you have been good so far this week so time for a mid-week present!

As of today, a list of two hundred and forty-nine publications in artificial intelligence and machine learning from Google Research!

From the webpage:

Much of our work on language, speech, translation, and visual processing relies on Machine Learning and AI. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and we apply learning algorithms to generalize from that evidence to new cases of interest. Machine Learning at Google raises deep scientific and engineering challenges. Contrary to much of current theory and practice, the statistics of the data we observe shifts very rapidly, the features of interest change as well, and the volume of data often precludes the use of standard single-machine training algorithms. When learning systems are placed at the core of interactive services in a rapidly changing and sometimes adversarial environment, statistical models need to be combined with ideas from control and game theory, for example when using learning in auction algorithms.

Research at Google is at the forefront of innovation in Machine Learning with one of the most active groups working on virtually all aspects of learning, theory as well as applications, and a strong academic presence through technical talks and publications in major conferences and journals.

Don’t neglect your “real” work but either find a paper relevant to your “real” work or read one during lunch or on break.

You will be glad you did!

Machine Learning in Gradient Descent

Filed under: Machine Learning — Patrick Durusau @ 4:19 pm

Machine Learning in Gradient Descent by Ricky Ho.

From the post:

In Machine Learning, gradient descent is a very popular learning mechanism that is based on a greedy, hill-climbing approach.

Gradient Descent

The basic idea of Gradient Descent is to use a feedback loop to adjust the model based on the error it observes (between its predicted output and the actual output). The adjustment (notice that there are multiple model parameters and therefore should be considered as a vector) is pointing to a direction where the error is decreasing in the steepest sense (hence the term “gradient”).

A general introduction to a machine learning technique you are going to see fairly often.
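Ricky’s feedback loop translates directly into code. A minimal sketch, fitting y ≈ w·x + b by batch gradient descent on made-up data:

```python
# Batch gradient descent for simple linear regression:
# repeatedly step the parameters against the error gradient.
def fit(xs, ys, lr=0.05, steps=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w               # move opposite the gradient
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]                   # exactly y = 2x + 1
w, b = fit(xs, ys)
print(round(w, 2), round(b, 2))        # → 2.0 1.0
```

The two `grad_*` lines are the “error it observes”; the two update lines are the adjustment in the direction of steepest decrease.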

Explore Python, machine learning, and the NLTK library

Filed under: Machine Learning,NLTK,Python — Patrick Durusau @ 4:18 pm

Explore Python, machine learning, and the NLTK library by Chris Joakim (cjoakim@bellsouth.net), Senior Software Engineer, Primedia Inc.

From the post:

The challenge: Use machine learning to categorize RSS feeds

I was recently given the assignment to create an RSS feed categorization subsystem for a client. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into one of dozens of predefined subject areas. The content, navigation, and search functionality of the client website would be driven by the results of this daily automated feed retrieval and categorization.

The client suggested using machine learning, perhaps with Apache Mahout and Hadoop, as she had recently read articles about those technologies. Her development team and ours, however, are fluent in Ruby rather than Java™ technology. This article describes the technical journey, learning process, and ultimate implementation of a solution.

If a wholly automated publication process leaves you feeling uneasy, imagine the same system that feeds content to subject matter experts for further processing.

Think of it as processing raw ore on the way to finding diamonds and then deciding which ones get polished.
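To give the flavor of such a categorizer, here is a toy Naive Bayes text classifier in pure Python (hypothetical categories and documents, not Chris’s Ruby implementation):

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes for text categorization.
def train(docs):                       # docs: list of (category, text)
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for cat, text in docs:
        words = text.lower().split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(model, text):
    word_counts, cat_counts, vocab = model
    total_docs = sum(cat_counts.values())
    def log_prob(cat):
        lp = math.log(cat_counts[cat] / total_docs)   # prior
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[cat][w] + 1) / denom)  # Laplace smoothing
        return lp
    return max(cat_counts, key=log_prob)

model = train([("sports", "team wins the game"),
               ("sports", "player scores a goal"),
               ("tech", "new phone released today"),
               ("tech", "software update fixes bugs")])
print(classify(model, "the player wins"))   # → sports
```

A production feed categorizer adds tokenization, stemming, and far more training data, but the scoring step is the same shape.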

October 8, 2012

Are Expert Semantic Rules so 1980’s?

In The Geometry of Constrained Structured Prediction: Applications to Inference and Learning of Natural Language Syntax André Martins proposes advances in inferencing and learning for NLP processing. And it is important work for that reason.

But in his introduction to recent (and rapid) progress in language technologies, the following text caught my eye:

So, what is the driving force behind the aforementioned progress? Essentially, it is the alliance of two important factors: the massive amount of data that became available with the advent of the Web, and the success of machine learning techniques to extract statistical models from the data (Mitchell, 1997; Manning and Schütze, 1999; Schölkopf and Smola, 2002; Bishop, 2006; Smith, 2011). As a consequence, a new paradigm has emerged in the last couple of decades, which directs attention to the data itself, as opposed to the explicit representation of knowledge (Abney, 1996; Pereira, 2000; Halevy et al., 2009). This data-centric paradigm has been extremely fruitful in natural language processing (NLP), and came to replace the classic knowledge representation methodology which was prevalent until the 1980s, based on symbolic rules written by experts. (emphasis added)

Are RDF, Linked Data, topic maps, and other semantic technologies caught in a 1980’s “symbolic rules” paradigm?

Are we ready to make the same break that NLP did, what, thirty (30) years ago now?

To get started on the literature, consider André’s sources:

Abney, S. (1996). Statistical methods and linguistics. In The balancing act: Combining symbolic and statistical approaches to language, pages 1–26. MIT Press, Cambridge, MA.

A more complete citation: Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. (Link is to PDF of Abney’s paper.)

Pereira, F. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

I added a pointer to the Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences abstract for the article. You can see it at: Formal grammar and information theory: together again? (PDF file).

Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12.

I added a pointer to the Intelligent Systems, IEEE abstract for the article. You can see it at: The unreasonable effectiveness of data (PDF file).

The Halevy article doesn’t have an abstract per se but the ACM reports one as:

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data. [ACM]

That sounds like a challenge to me. You?

PS: I saw the pointer to this thesis at Christophe Lalanne’s A bag of tweets / September 2012

October 4, 2012

GATE, NLTK: Basic components of Machine Learning (ML) System

Filed under: Machine Learning,Natural Language Processing,NLTK — Patrick Durusau @ 4:03 pm

GATE, NLTK: Basic components of Machine Learning (ML) System by Krishna Prasad.

From the post:

I am currently building a Machine Learning system. In this blog I want to captures the elements of a machine learning system.

My definition of a Machine Learning System is to take voice or text inputs from a user and provide relevant information. And over a period of time, learn the user behavior and provide him better information. Let us hold on to this comment and dissect apart each element.

In the below example, we will consider only text input. Let us also assume that the text input will be a freeflowing English text.

  • As a 1st step, when someone enters a freeflowing text, we need to understand what is the noun, what is the verb, what is the subject and what is the predicate. For doing this we need a Parts of Speech analyzer (POS), for example “I want a Phone”. One of the components of Natural Language Processing (NLP) is POS.
  • For associating a relationship between a noun and a number, like “Phone greater than 20 dollars”, we need to run the sentence through a rule engine. The terminology used for this is Semantic Rule Engine
  • The 3rd aspect is the Ontology, where in each noun needs to translate to a specific product or a place. For example, if someone says “I want a Bike” it should translate as “I want a Bicycle” and it should interpret that the company that manufacture a bicycle is BSA, or a Trac. We typically need to build a Product Ontology
  • Finally if you have buying pattern of a user and his friends in the system, we need a Recommendation Engine to give the user a proper recommendation
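The pipeline in the list above (POS tagging, then semantic rules, then ontology lookup) can be sketched in miniature. This is a toy illustration, not the author's system: the lexicon and ontology below are made-up stand-ins for a real POS tagger (such as NLTK's) and a real product ontology.

```python
# Toy sketch of the POS -> rules -> ontology pipeline described above.
# The LEXICON and ONTOLOGY tables are hypothetical; a real system would
# use a trained POS tagger and a proper product ontology.

LEXICON = {
    "i": "PRON", "want": "VERB", "a": "DET",
    "phone": "NOUN", "bike": "NOUN",
    "greater": "ADJ", "than": "ADP", "20": "NUM", "dollars": "NOUN",
}

ONTOLOGY = {"bike": "bicycle", "phone": "smartphone"}  # noun -> canonical product

def pos_tag(sentence):
    """Look up each token's part of speech in the toy lexicon."""
    return [(tok, LEXICON.get(tok.lower(), "UNK")) for tok in sentence.split()]

def extract_request(sentence):
    """Trivial 'semantic rule': the first noun is the requested product,
    normalized through the ontology."""
    for tok, tag in pos_tag(sentence):
        if tag == "NOUN":
            word = tok.lower()
            return ONTOLOGY.get(word, word)
    return None

print(extract_request("I want a Bike"))   # -> bicycle
print(extract_request("I want a Phone"))  # -> smartphone
```

Each stage here is a dictionary lookup, but the division of labor mirrors the outlined system: tag, apply rules, normalize against the ontology.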

What would you add (or take away) to make the outlined system suitable as a topic map authoring assistant?

Feel free to add more specific requirements/capabilities.

I first saw this at DZone.

Scalable Machine Learning with Hadoop (most of the time)

Filed under: Hadoop,Machine Learning,Mahout — Patrick Durusau @ 1:44 pm

Scalable Machine Learning with Hadoop (most of the time) by Grant Ingersoll. (slides)

Grant’s slides from a presentation on machine learning with Hadoop in Taiwan!

Not quite like being there but still useful.

And a reminder that I need to get a copy of Taming Text!

October 3, 2012

Big Learning with Graphs by Joey Gonzalez (Video Lecture)

Filed under: GraphLab,Graphs,Machine Learning — Patrick Durusau @ 5:00 am

Big Learning with Graphs by Joey Gonzalez by Marti Hearst.

From the post:

For those of you who follow the latest developments in the Big Data technology stack, you’ll know that GraphLab is the hottest technology for processing huge graphs in fast time. We got to hear the algorithms behind GraphLab 2 even before the OSDI crowd! Check it out:

Slides.

GraphLab homepage.

For when you want to move up to parallel graph processing.

October 1, 2012

Scikit-learn 0.12 released

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 10:31 am

Scikit-learn 0.12 released by Andreas Mueller.

From the post:

Last night I uploaded the new version 0.12 of scikit-learn to PyPI. The updated website is also up and running, and development now starts towards 0.13.

The new release has some nifty new features (see whatsnew):

  • Multidimensional scaling
  • Multi-Output random forests (like these)
  • Multi-task Lasso
  • More loss functions for ensemble methods and SGD
  • Better text feature extraction

Even so, the majority of changes in this release are somewhat "under the hood".

Vlad developed and set up a continuous performance benchmark for the main algorithms during his Google Summer of Code. I am sure this will help improve performance.

There already has been a lot of work in improving performance, by Vlad, Immanuel, Gilles and others for this release.

Just in case you haven’t been keeping up with Scikit-learn.
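Two of the listed features can be exercised in a few lines. This is a generic sketch against the scikit-learn API, not code from the release announcement; the random data is only there to demonstrate the shapes involved.

```python
# Quick look at two features from this release line: multidimensional
# scaling and multi-output random forests.
import numpy as np
from sklearn.manifold import MDS
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(30, 5)   # 30 samples, 5 features
Y = rng.rand(30, 2)   # two regression targets at once

# Multidimensional scaling: embed the 5-D points into 2-D.
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)          # (30, 2)

# Multi-output random forest: one forest predicts both targets.
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, Y)
print(forest.predict(X).shape)  # (30, 2)
```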

Troll Detection with Scikit-Learn

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 9:52 am

Troll Detection with Scikit-Learn by Andreas Mueller.

I had thought that troll detection was one of those “field guide” sort of things:

troll dolls

After reading Andreas’ post, apparently not. 😉

From the post:

Cross-post from Peekaboo, Andreas Mueller's computer vision and machine learning blog. This post documents his experience in the Impermium Detecting Insults in Social Commentary competition, but the rest of the blog is well worth a read, especially for those interested in computer vision and the Python scikit-learn and scikit-image libraries.

Recently I entered my first Kaggle competition — for those who don't know it, Kaggle is a site running machine learning competitions. A data set and time frame are provided and the best submission gets a money prize, often something between $5,000 and $50,000.

I found the approach quite interesting and could definitely use a new laptop, so I entered Detecting Insults in Social Commentary.

My weapon of choice was Python with scikit-learn – for those who haven’t read my blog before: I am one of the core devs of the project and never shut up about it.

During the competition I was visiting Microsoft Research, so this is where most of my time and energy went, particularly at the end of the competition, as it was also the end of my internship. And there was also the scikit-learn release in between. Maybe I can spend a bit more time on the next competition.
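The core of an insult detector is a text classifier. The following is an illustrative bag-of-words naive Bayes sketch in pure Python, not the competition entry (that used scikit-learn's text feature extraction and tuned models); the training sentences are made up for the example.

```python
# Minimal bag-of-words naive Bayes classifier in the spirit of the
# insult-detection task. Training data is invented for illustration.
from collections import Counter
import math

train = [
    ("you are an idiot", 1),
    ("what a stupid comment", 1),
    ("shut up you fool", 1),
    ("great point thanks for sharing", 0),
    ("i agree with this analysis", 0),
    ("interesting read well written", 0),
]

# Per-class word counts for a unigram model with Laplace smoothing.
counts = {0: Counter(), 1: Counter()}
totals = {0: 0, 1: 0}
for text, label in train:
    for word in text.split():
        counts[label][word] += 1
        totals[label] += 1

vocab = {w for c in counts.values() for w in c}

def log_prob(text, label):
    """Log P(words | label) under the smoothed unigram model."""
    lp = 0.0
    for word in text.split():
        lp += math.log((counts[label][word] + 1) / (totals[label] + len(vocab)))
    return lp

def is_insult(text):
    return log_prob(text, 1) > log_prob(text, 0)

print(is_insult("you stupid fool"))         # -> True
print(is_insult("thanks for the analysis")) # -> False
```

A real entry replaces the word counts with TF-IDF features and the hand-rolled model with a tuned estimator, but the probabilistic skeleton is the same.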

September 30, 2012

Vowpal Wabbit, version 7.0

Filed under: Machine Learning,Vowpal Wabbit — Patrick Durusau @ 4:59 pm

Vowpal Wabbit, version 7.0

From the post:

A new version of VW is out. The primary changes are:

  1. Learning Reductions: I’ve wanted to get learning reductions working and we’ve finally done it. Not everything is implemented yet, but VW now supports direct:
    1. Multiclass Classification --oaa or --ect.
    2. Cost-Sensitive Multiclass Classification --csoaa or --wap.
    3. Contextual Bandit Classification --cb.
    4. Sequential Structured Prediction --searn or --dagger.

    In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning.

  2. Library interface: VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over time.
  3. Windows port: The priority of a Windows port jumped way up once we moved to Microsoft. The only feature which we know doesn't work at present is automatic backgrounding when in daemon mode.
  4. New update rule: Stephane visited us this summer, and we fixed the default online update rule so that it is unit invariant.

There are also many other small updates including some contributed utilities that aid the process of applying and using VW.

Plans for the near future involve improving the quality of various items above, and of course better documentation: several of the reductions are not yet well documented.
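VW's multiclass reductions read a plain-text format: an integer label in 1..k, a pipe, then named features with optional `:value` weights. Here is a small Python writer for that format; the file name and features are made up for illustration.

```python
# Write training data in Vowpal Wabbit's multiclass text format:
# "label | name:value name:value ...".
def vw_line(label, features):
    """Render one VW example: 'label | f1:v1 f2:v2 ...'."""
    feats = " ".join(f"{name}:{value:g}" for name, value in features.items())
    return f"{label} | {feats}"

examples = [
    (1, {"length": 12, "has_digits": 0}),
    (2, {"length": 7, "has_digits": 1}),
]

with open("train.vw", "w") as f:
    for label, feats in examples:
        f.write(vw_line(label, feats) + "\n")

print(vw_line(1, {"length": 12, "has_digits": 0}))
# -> 1 | length:12 has_digits:0
```

Training a one-against-all model on that file then looks like `vw --oaa 2 train.vw -f model.vw`.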

A good test for your understanding of a subject is your ability to explain it.

Writing good documentation for projects like Vowpal Wabbit would benefit the project. And demonstrate your chops with the software. Something to consider.

September 20, 2012

Interacting with Weka from Jython

Filed under: Machine Learning,Weka — Patrick Durusau @ 8:08 pm

Interacting with Weka from Jython by Christophe Lalanne.

From the post:

I discovered a lovely feature: You can use WEKA directly with Jython in a friendly interactive REPL.

There are days when I think I need more than multiple workspaces on multiple monitors. I need an extra set of hands and eyes. 😉

Enjoy!

September 16, 2012

Machine Learning: Genetic Algorithms in Javascript Part 2

Filed under: Genetic Algorithms,Machine Learning — Patrick Durusau @ 1:05 pm

Machine Learning: Genetic Algorithms in Javascript Part 2 by Burak Kanber.

From the post:

Today we’re going to revisit the genetic algorithm. If you haven’t read Genetic Algorithms Part 1 yet, I strongly recommend reading that now. This article will skip over the fundamental concepts covered in part 1 — so if you’re new to genetic algorithms you’ll definitely want to start there.

Just looking for the example?

The Problem

You’re a scientist who has recently been framed for murder by an evil company. Before you flee the lab you have an opportunity to steal 1,000 pounds (or kilograms!) of pure elements from the chemical warehouse; your plan is to later sell them and survive off of the earnings.

Given the weight and value of each element, which combination should you take to maximize the total value without exceeding the weight limit?

This is called the knapsack problem. The one above is a one-dimensional problem, meaning the only constraint is weight. We could complicate matters by also considering volume, but we need to start somewhere. Note that in our version of the problem only one piece of each element is available, and each piece has a fixed weight. There are some knapsack problems where you can take unlimited platinum or up to 3 pieces of gold or something like that, but here we only have one of each available to us.

Why is this problem tough to solve? We’ll be using 118 elements. The brute-force approach would require that we test 2^118, or about 3.3 × 10^35, different combinations of elements.
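A genetic algorithm sidesteps that 2^118 search by evolving a population of candidate knapsacks. The sketch below uses random stand-in weights and values rather than real chemical data, and the GA parameters are illustrative, not tuned.

```python
# Minimal genetic-algorithm sketch for the 0/1 knapsack problem above.
# Weights/values are random stand-ins; parameters are illustrative.
import random

random.seed(42)
N = 118                      # number of elements
CAPACITY = 1000              # pounds (or kilograms!)
weights = [random.uniform(1, 100) for _ in range(N)]
values = [random.uniform(1, 500) for _ in range(N)]

def fitness(genome):
    """Total value, or 0 if the knapsack is over capacity."""
    w = sum(wt for wt, bit in zip(weights, genome) if bit)
    v = sum(val for val, bit in zip(values, genome) if bit)
    return v if w <= CAPACITY else 0.0

def mutate(genome, rate=0.01):
    return [1 - bit if random.random() < rate else bit for bit in genome]

def crossover(a, b):
    cut = random.randrange(1, N)
    return a[:cut] + b[cut:]

# Start sparse so the initial genomes aren't all over capacity.
pop = [[1 if random.random() < 0.1 else 0 for _ in range(N)] for _ in range(100)]
for generation in range(200):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:20]     # simple truncation selection
    pop = survivors + [
        mutate(crossover(random.choice(survivors), random.choice(survivors)))
        for _ in range(80)
    ]

best = max(pop, key=fitness)
print(f"best value: {fitness(best):.0f}")
```

Each generation keeps the fittest candidates and breeds the rest from them, so the best-so-far value never decreases, while mutation keeps exploring combinations brute force could never enumerate.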

What if you have subject identity criteria of varying reliability? What is the best combination for the highest reliability?

To sharpen the problem: Your commanding officer has requested declaration of sufficient identity for a drone strike target.

Machine Learning: Genetic Algorithms Part 1 (Javascript)

Filed under: Genetic Algorithms,Javascript,Machine Learning — Patrick Durusau @ 12:37 pm

Machine Learning: Genetic Algorithms Part 1 (Javascript) by Burak Kanber.

From the post:

I like starting my machine learning classes with genetic algorithms (which we’ll abbreviate “GA” sometimes). Genetic algorithms are probably the least practical of the ML algorithms I cover, but I love starting with them because they’re fascinating and they do a good job of introducing the “cost function” or “error function”, and the idea of local and global optima — concepts both important and common to most other ML algorithms.

Genetic algorithms are inspired by nature and evolution, which is seriously cool to me. It’s no surprise, either, that artificial neural networks (“NN”) are also modeled from biology: evolution is the best general-purpose learning algorithm we’ve experienced, and the brain is the best general-purpose problem solver we know. These are two very important pieces of our biological existence, and also two rapidly growing fields of artificial intelligence and machine learning study. While I’m tempted to talk more about the distinction I make between the GA’s “learning algorithm” and the NN’s “problem solver” terminology, we’ll drop the topic of NNs altogether and concentrate on GAs… for now.

One phrase I used above is profoundly important: “general-purpose”. For almost any specific computational problem, you can probably find an algorithm that solves it more efficiently than a GA. But that’s not the point of this exercise, and it’s also not the point of GAs. You use the GA not when you have a complex problem, but when you have a complex problem of problems. Or you may use it when you have a complicated set of disparate parameters.
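The local-versus-global optimum distinction mentioned above is easy to see in miniature: greedy hill climbing on a two-peak landscape gets stuck on whichever peak is nearest, which is exactly the trap a population-based method like a GA is built to escape. The function here is invented for the illustration.

```python
# Hill climbing on a made-up cost landscape with a local maximum near
# x=2 (height 3) and the global maximum near x=8 (height 10).
def f(x):
    return 3 * max(0, 1 - abs(x - 2)) + 10 * max(0, 1 - abs(x - 8))

def hill_climb(x, step=0.1):
    """Greedily move while a neighbor improves f."""
    while True:
        if f(x + step) > f(x):
            x += step
        elif f(x - step) > f(x):
            x -= step
        else:
            return x

print(round(hill_climb(1.0), 1))  # climbs the nearby peak: 2.0 (local)
print(round(hill_climb(7.0), 1))  # happens to start near 8.0 (global)
```

Starting from x=1.0 the climber converges on the local peak and never sees the far better one; only the lucky starting point finds the global maximum. A GA keeps a whole population of starting points, which is why it trades per-problem efficiency for that kind of robustness.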

Off to a great start!
