Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 14, 2012

Kostas T. on How To Detect Twitter Trends

Filed under: Machine Learning,Tweets — Patrick Durusau @ 9:14 am

Kostas T. on How To Detect Twitter Trends by Marti Hearst.

From the post:

Have you ever wondered how Twitter computes its Trending Topics? Kostas T. is one of the wizards behind that, and today he shared some of the secrets with our class:

Be prepared to watch this more than once!

Sparks a number of ideas about how to track and analyze tweets.

September 12, 2012

Do You Just Talk About The Weather?

Filed under: Dataset,Machine Learning,Mahout,Weather Data — Patrick Durusau @ 9:24 am

After reading this post by Alex you will still just be talking about the weather, but you may have something interesting to say. 😉

Locating Mountains and More with Mahout and Public Weather Dataset by Alex Baranau

From the post:

Recently I was playing with Mahout and public weather dataset. In this post I will describe how I used Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!
Tools & Data

The data and tools used during this effort are: Apache Mahout project and public weather statistics dataset. Mahout is a machine learning library which provides a handful of machine learning tools. During this effort I used just a small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

What other questions could you explore with the weather data set?

The real power of “big data” access and tools may be that we no longer have to rely on the summaries of others.

Summaries still have a value-add, perhaps even more so when the original data is available for verification.

September 11, 2012

Web Data Extraction, Applications and Techniques: A Survey

Filed under: Data Mining,Extraction,Machine Learning,Text Extraction,Text Mining — Patrick Durusau @ 5:05 am

Web Data Extraction, Applications and Techniques: A Survey by Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner.

Abstract:

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc application domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the research efforts made in the field of Web Data Extraction. The fil rouge of our work is to provide a classification of existing approaches in terms of the applications for which they have been employed. This differentiates our work from other surveys devoted to classify existing approaches on the basis of the algorithms, techniques and tools they use.

We classified Web Data Extraction approaches into categories and, for each category, we illustrated the basic techniques along with their main variants.

We grouped existing applications in two main areas: applications at the Enterprise level and at the Social Web level. Such a classification relies on a twofold reason: on one hand, Web Data Extraction techniques emerged as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. On the other hand, Web Data Extraction techniques allow for gathering a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities of analyzing human behaviors on a large scale.

We discussed also about the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Comprehensive (> 50 pages) survey of web data extraction. Supplements and updates existing work through its focus on classifying web data extraction approaches by field of use.

Very likely to lead to adaptation of techniques from one field to another.

September 10, 2012

Learning Mahout : Classification

Filed under: Classification,Machine Learning,Mahout — Patrick Durusau @ 10:01 am

Learning Mahout : Classification by Sujit Pal.

From the post:

The final part covered in the MIA book is Classification. The popular algorithms available are Stochastic Gradient Descent (SGD), Naive Bayes and Complementary Naive Bayes, Random Forests and Online Passive Aggressive. There are other algorithms in the pipeline, as seen from the Classification section of the Mahout wiki page.

The MIA book has generic classification information and advice that will be useful for any algorithm, but it specifically covers SGD, Bayes and Naive Bayes (the last two via Mahout scripts). Of these, SGD and Random Forests are good for classification problems involving continuous variables and small to medium datasets, and the Naive Bayes family is good for problems involving text-like variables and medium to large datasets.

In general, a solution to a classification problem involves choosing the appropriate features for classification, choosing the algorithm, generating the feature vectors (vectorization), training the model and evaluating the results in a loop. You continue to tweak stuff in each of these steps until you get the results with the desired accuracy.
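That loop is compact enough to sketch. Below is a minimal illustration of the vectorize/train/evaluate cycle in Python with scikit-learn, not Mahout and not Sujit’s workflow; the documents, labels and split are invented placeholders:

# Minimal sketch of the vectorize -> train -> evaluate loop (scikit-learn,
# not Mahout). Documents and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

docs = ["cheap meds online", "meeting moved to 3pm", "win a free cruise",
        "quarterly report attached", "free prize inside", "lunch on Friday?"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = TfidfVectorizer()                 # vectorization step
X = vectorizer.fit_transform(docs)

model = SGDClassifier().fit(X[:4], labels[:4])  # train on the first four documents
print(model.predict(X[4:]))                     # evaluate on the held-out two

# In practice you loop: tweak the features, the algorithm and its parameters
# until the accuracy is acceptable.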

Sujit notes that classification is under rapid development. The classification material is likely to become dated.

Some additional resources to consider:

Mahout User List (subscribe)

Mahout Developer List (subscribe)

IRC: Mahout’s IRC channel is #mahout.

Mahout QuickStart

September 5, 2012

Machine Learning in All Languages: Introduction

Filed under: Javascript,Machine Learning,Perl,PHP,Ruby — Patrick Durusau @ 4:20 pm

Machine Learning in All Languages: Introduction by Burak Kanber.

From the post:

I love machine learning algorithms. I’ve taught classes and seminars and given talks on ML. The subject is fascinating to me, but like all skills fascination simply isn’t enough. To get good at something, you need to practice!

I also happen to be a PHP and Javascript developer. I’ve taught classes on both of these as well — but like any decent software engineer I have experience with Ruby, Python, Perl, and C. I just prefer PHP and JS. Before you flame PHP, I’ll just say that while it has its problems, I like it because it gets stuff done.

Whenever I say that Tidal Labs’ ML algorithms are in PHP, they look at me funny and ask me how it’s possible. Simple: it’s possible to write ML algorithms in just about any language. Most people just don’t care to learn the fundamentals well enough to write an algorithm from scratch. Instead, they rely on Python libraries to do the work for them, and end up not truly grasping what’s happening inside the black box.

Through this series of articles, I’ll teach you the fundamental machine learning algorithms in a variety of languages, including:

  • PHP
  • Javascript
  • Perl
  • C
  • Ruby

Just started so too soon to comment but thought it might be of interest.

September 2, 2012

Learning Mahout : Clustering

Filed under: Clustering,Machine Learning,Mahout — Patrick Durusau @ 6:01 pm

Learning Mahout : Clustering by Sujit Pal.

From the post:

The next section in the MIA book is Clustering. As with Recommenders, Mahout provides both in-memory and map-reduce versions of various clustering algorithms. However, unlike Recommenders, there are quite a few toolkits (like Weka or Mallet for example) which are more comprehensive than Mahout for small or medium sized datasets, so I decided to concentrate on the M/R implementations.

The full list of clustering algorithms available in Mahout at the moment can be found on its Wiki Page under the Clustering section. The ones covered in the book are K-Means, Canopy, Fuzzy K-Means, LDA and Dirichlet. All these algorithms expect data in the form of vectors, so the first step is to convert the input data into this format, a process known as vectorization. Essentially, clustering is the process of finding nearby points in n-dimensional space, where each vector represents a point in this space, and each element of a vector represents a dimension in this space.

It is important to choose the right vector format for the clustering algorithm. For example, one should use the SequentialAccessSparseVector for KMeans, since there is a lot of sequential access in the algorithm. Other possibilities are the DenseVector and the RandomAccessSparseVector formats. The input to a clustering algorithm is a SequenceFile containing key-value pairs of {IntWritable, VectorWritable} objects. Since the implementations are given, Mahout users would spend most of their time vectorizing the input (and thinking about what feature vectors to use, of course).

Once vectorized, one can invoke the appropriate algorithm either by calling the appropriate bin/mahout subcommand from the command line, or through a program by calling the appropriate Driver’s run method. All the algorithms require the initial centroids to be provided, and the algorithm iteratively perturbs the centroids until they converge. One can either guess randomly or use the Canopy clusterer to generate the initial centroids.

Finally, the output of the clustering algorithm can be read using the Mahout cluster dumper subcommand. To check the quality, take a look at the top terms in each cluster to see how “believable” they are. Another way to measure the quality of clusters is to measure the intercluster and intracluster distances. A lower spread of intercluster and intracluster distances generally imply “good” clusters. Here is code to calculate inter-cluster distance based on code from the MIA book.
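The inter-/intra-cluster distance checks are easy to sketch with plain numpy (an illustration of the metric only, not the MIA book’s Mahout code; the centroids and points are made up):

import numpy as np

# centroids: one row per cluster centre (made-up values)
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])

# inter-cluster: average pairwise distance between centroids
dists = [np.linalg.norm(centroids[i] - centroids[j])
         for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
print("mean inter-cluster distance:", np.mean(dists))

# intra-cluster: mean distance of a cluster's points to its own centroid
points = np.array([[0.1, -0.2], [0.3, 0.1], [-0.2, 0.2]])  # points assigned to cluster 0
print("cluster 0 intra-cluster distance:",
      np.mean(np.linalg.norm(points - centroids[0], axis=1)))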

Detailed walk through of two of the four case studies in Mahout In Action. This post and the book are well worth your time.

August 26, 2012

Patterns for research in machine learning

Filed under: Machine Learning — Patrick Durusau @ 11:03 am

Patterns for research in machine learning by S. M. Ali Eslami.

From the post:

There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:

Perhaps “good practices” or Harper’s incremental improvement. Either way, likely to be useful in topic maps research.

August 25, 2012

Algebraic Topology and Machine Learning

Filed under: Machine Learning,Topological Data Analysis,Topology — Patrick Durusau @ 2:58 pm

Algebraic Topology and Machine Learning – In conjunction with Neural Information Processing Systems (NIPS 2012)

September 16, 2012 – Submissions Due
October 7, 2012 – Acceptance Notices
December 7 or 8 (TBD), 2012, Lake Tahoe, Nevada, USA.

From the call for papers:

Topological methods and machine learning have long enjoyed fruitful interactions as evidenced by popular algorithms like ISOMAP, LLE and Laplacian Eigenmaps which have been borne out of studying point cloud data through the lens of topology/geometry. More recently several researchers have been attempting to understand the algebraic topological properties of data. Algebraic topology is a branch of mathematics which uses tools from abstract algebra to study and classify topological spaces. The machine learning community thus far has focused almost exclusively on clustering as the main tool for unsupervised data analysis. Clustering however only scratches the surface, and algebraic topological methods aim at extracting much richer topological information from data.

The goals of this workshop are:

  1. To draw the attention of machine learning researchers to a rich and emerging source of interesting and challenging problems.
  2. To identify problems of interest to both topologists and machine learning researchers and areas of potential collaboration.
  3. To discuss practical methods for implementing topological data analysis methods.
  4. To discuss applications of topological data analysis to scientific problems.

We also invite submissions in a variety of areas, at the intersection of algebraic topology and learning, that have witnessed recent activity. Areas of focus for submissions include but are not limited to:

  1. Statistical approaches to robust topological inference.
  2. Novel applications of topological data analysis to problems in machine learning.
  3. Scalable methods for topological data analysis.

NIPS2012 site. You will appreciate the “dramatization.” 😉

Put on your calendar and/or watch for papers!

Machine Learning [Andrew Ng]

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 2:44 pm

Machine Learning [Andrew Ng]

The machine learning course by Andrew Ng started up on 20 August 2012, so there is time to enroll and catch up.

From the post:

What Is Machine Learning?

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

About the Course

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

August 24, 2012

Learning Mahout : Collaborative Filtering [Recommend Your Preferences?]

Filed under: Collaboration,Filters,Machine Learning,Mahout — Patrick Durusau @ 3:52 pm

Learning Mahout : Collaborative Filtering by Sujit Pal.

From the post:

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the current (GA) 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique of predicting preferences given the preferences of others in the group. There are two main approaches – user based and item based. In case of user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In case of item-based recommendation, similarities between pairs of items are computed, then preferences predicted for the given user using a combination of the user’s current item preferences and the similarity matrix.
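For a feel of the item-based variant, here is a toy numpy sketch (not Mahout’s recommender classes; the rating matrix is invented):

import numpy as np

# rows = users, columns = items; 0 = unrated (invented data)
ratings = np.array([[5.0, 3.0, 0.0, 1.0],
                    [4.0, 0.0, 0.0, 1.0],
                    [1.0, 1.0, 5.0, 4.0]])

# cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
sim = ratings.T.dot(ratings) / np.outer(norms, norms)

# predict user 1's rating for item 2 as a similarity-weighted average
# of the items that user has already rated
user, item = 1, 2
rated = ratings[user] > 0
pred = np.dot(sim[item, rated], ratings[user, rated]) / np.sum(np.abs(sim[item, rated]))
print(pred)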

While you are working your way through this post, keep in mind: Collaborative filtering with GraphChi.

Question: What if you are an outlier?

Telephone marketing interviews with me get shortened by responses like: “X? Is that a TV show?”

How would you go about piercing the marketing veil to recommend your preferences?

Now that is a product to which even I might subscribe. (But don’t advertise on TV, I won’t see it.)

Foundations of Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 10:29 am

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar.

From the description:

This graduate-level textbook introduces fundamental concepts and methods in machine learning. It describes several important modern algorithms, provides the theoretical underpinnings of these algorithms, and illustrates key aspects for their application. The authors aim to present novel theoretical tools and concepts while giving concise proofs even for relatively advanced topics.

Foundations of Machine Learning fills the need for a general textbook that also offers theoretical details and an emphasis on proofs. Certain topics that are often treated with insufficient attention are discussed in more detail here; for example, entire chapters are devoted to regression, multi-class classification, and ranking. The first three chapters lay the theoretical foundation for what follows, but each remaining chapter is mostly self-contained. The appendix offers a concise probability review, a short introduction to convex optimization, tools for concentration bounds, and several basic properties of matrices and norms used in the book.

The book is intended for graduate students and researchers in machine learning, statistics, and related areas; it can be used either as a textbook or as a reference text for a research seminar.

Before I lay out $70 for a copy, would appreciate comments on how this differs from say Christopher M. Bishop’s Pattern Recognition and Machine Learning (2007, 2nd printing)? Five (5) years will make some difference, but how much?

August 21, 2012

Predictive Models: Build once, Run Anywhere

Filed under: Machine Learning,Prediction,Predictive Analytics — Patrick Durusau @ 2:59 pm

Predictive Models: Build once, Run Anywhere

From the post:

We have released a new version of our open source Python bindings. This new version aims at showing how the BigML API can be used to build predictive models capable of generating predictions locally or remotely. You can get full access to the code at Github and read the full documentation at Read the Docs.

The complete list of updates includes (drum roll, please):

Development Mode

We recently introduced a free sandbox to help developers play with BigML on smaller datasets without being concerned about credits. In the new Python bindings you can use BigML in development mode, and all datasets and models smaller than 1MB can be created for free:

from bigml.api import BigML

api = BigML(dev_mode=True)
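From there, the bindings’ documented create calls take you from a data file to a prediction. A sketch only: the file name and input fields are placeholders, and the call signatures are as I read them in the Read the Docs documentation, so check there before relying on them.

source = api.create_source('./data.csv')      # upload a local CSV as a source
dataset = api.create_dataset(source)          # turn the source into a dataset
model = api.create_model(dataset)             # build a predictive model
prediction = api.create_prediction(model, {'sepal length': 5.1, 'petal width': 0.2})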

A “sandbox” for your machine learning experiments!

August 13, 2012

Machine Learning Throwdown, [Part 1, 2, 3, 4, 5, 6 (complete)]

Filed under: Machine Learning — Patrick Durusau @ 3:52 pm

Machine Learning Throwdown, Part 1 – Introduction by Nick Wilson.

From the post:

Hi, I’m Nick the intern. The fine folks at BigML brought me on board for the summer to drink their coffee, eat their snacks, and compare their service to similar offerings from other companies. I have a fair amount of software engineering experience but limited machine learning skills beyond some introductory classes. Prior to beginning this internship, I had no experience with the services I am going to talk about. Since BigML aims to make machine learning easy for non-experts like myself, I believe I am in a great position to provide feedback on these types of services. But please, take what I say with a grain of salt. I’ll try to stay impartial but it’s not easy when BigML keeps dumping piles of money and BigML credits on my doorstep to ensure a favorable outcome.

From my time at BigML, it has become clear that everyone here is a big believer in the power of machine learning to extract value from data and build intelligent systems. Unfortunately, machine learning has traditionally had a high barrier to entry. The BigML team is working hard to change this; they want anyone to be able to gain valuable insights and predictive power from their data.

It turns out BigML is not the only player in this game. How does it stack up against the competition? This is the first in a series of blog posts where I compare BigML to a few other services offering machine learning capabilities. These services vary in multiple ways including the level of expertise required, the types of models that can be created, and the ease with which they can be integrated into your business.

You need to make decisions on services using your own data and requirements but Nick’s posts make as good a place to start as any.

Will be even more useful if the posts result in counter-posts on other blogs, not so much disputing trivia but in outlining their best approach as opposed to other best approaches.

Could be quite educational.

Series continues with:

Machine Learning Throwdown, Part 2 – Data Preparation

Machine Learning Throwdown, Part 3 – Models

Machine Learning Throwdown, Part 4 – Predictions

Machine Learning Throwdown, Part 5 – Miscellaneous

Machine Learning Throwdown, Part 6 – Summary

Series is now complete.

August 5, 2012

Machine Learning — Introduction

Filed under: Machine Learning — Patrick Durusau @ 4:28 am

Machine Learning — Introduction by Jeremy Kun.

These days an absolutely staggering amount of research and development work goes into the very coarsely defined field of “machine learning.” Part of the reason why it’s so coarsely defined is because it borrows techniques from so many different fields. Many problems in machine learning can be phrased in different but equivalent ways. While they are often purely optimization problems, such techniques can be expressed in terms of statistical inference, have biological interpretations, or have a distinctly geometric and topological flavor. As a result, machine learning has come to be understood as a toolbox of techniques as opposed to a unified theory.

It is unsurprising, then, that such a multitude of mathematics supports this diversified discipline. Practitioners (that is, algorithm designers) rely on statistical inference, linear algebra, convex optimization, and dabble in graph theory, functional analysis, and topology. Of course, above all else machine learning focuses on algorithms and data.

The general pattern, which we’ll see over and over again as we derive and implement various techniques, is to develop an algorithm or mathematical model, test it on datasets, and refine the model based on specific domain knowledge. The first step usually involves a leap of faith based on some mathematical intuition. The second step commonly involves a handful of established and well understood datasets (often taken from the University of California at Irvine’s machine learning database, and there is some controversy over how ubiquitous this practice is). The third step often seems to require some voodoo magic to tweak the algorithm and the dataset to complement one another.

It is this author’s personal belief that the most important part of machine learning is the mathematical foundation, followed closely by efficiency in implementation details. The thesis is that natural data has inherent structure, and that the goal of machine learning is to represent this and utilize it. To make true progress, one must represent and analyze structure abstractly. And so this blog will focus predominantly on mathematical underpinnings of the algorithms and the mathematical structure of data.

Jeremy is starting a series of posts on machine learning that should prove to be useful.

While I would disagree about “inherent structure[s]” in data, we do treat data as though that were the case. Careful attention to those structures, inherent or not, is the watchword of useful analysis.

August 2, 2012

Using machine learning to extract quotes from text

Filed under: Machine Learning,Text Mining — Patrick Durusau @ 2:44 pm

Using machine learning to extract quotes from text by Chase Davis.

From the post:

Since we launched our Politics Verbatim project a couple of years ago, I’ve been hung up on what should be a simple problem: How can we automate the extraction of quotes from news articles, so it doesn’t take a squad of bored-out-of-their-minds interns to keep track of what politicians say in the news?

You’d be surprised at how tricky this is. At first glance, it looks like something a couple of regular expressions could solve. Just find the text with quotes in it, then pull out the words in between! But what about “air quotes?” Or indirect quotes (“John said he hates cheeseburgers.”)? Suffice it to say, there are plenty of edge cases that make this problem harder than it looks.

When I took over management of the combined Center for Investigative Reporting/Bay Citizen technology team a couple of months ago, I encouraged everyone to have a personal project on the back burner – an itch they wanted to scratch either during slow work days or (in this case) on nights and weekends.

This is mine: the citizen-quotes project, an app that uses simple machine learning techniques to extract more than 40,000 quotes from every article that ran on The Bay Citizen since it launched in 2010. The goal was to build something that accounts for the limitations of the traditional method of solving quote extraction – regular expressions and pattern matching. And sure enough, it does a pretty good job.

Illustrates the application of machine learning to a non-trivial text analysis problem.
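The regular-expression baseline Chase mentions really is a couple of lines, which is also the quickest way to see where it breaks (a toy sketch, not the citizen-quotes code):

import re

text = 'John said he hates cheeseburgers. "I love them," countered Mary.'

# Naive approach: pull out anything between double quotes.
print(re.findall(r'"([^"]+)"', text))
# -> ['I love them,']  - the indirect quote is missed entirely, and "air
# quotes" or quoted titles would be matched as if they were speech.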

August 1, 2012

Practical machine learning tricks…

Filed under: Machine Learning,MapReduce — Patrick Durusau @ 2:03 pm

Practical machine learning tricks from the KDD 2011 best industry paper by David Andrzejewski.

From the post:

A machine learning research paper tends to present a newly proposed method or algorithm in relative isolation. Problem context, data preparation, and feature engineering are hopefully discussed to the extent required for reader understanding and scientific reproducibility, but are usually not the primary focus. Given the goals and constraints of the format, this can be seen as a reasonable trade-off: the authors opt to spend scarce "ink" on only the most essential (often abstract) ideas.

As a consequence, implementation details relevant to the use of the proposed technique in an actual production system are often not mentioned whatsoever. This aspect of machine learning is often left as "folk wisdom" to be picked up from colleagues, blog posts, discussion boards, snarky tweets, open-source libraries, or more often than not, first-hand experience.

Papers from conference "industry tracks" often deviate from this template, yielding valuable insights about what it takes to make machine learning effective in practice. This paper from Google on detecting "malicious" (ie, scam/spam) advertisements won best industry paper at KDD 2011 and is a particularly interesting example.

Detecting Adversarial Advertisements in the Wild

D. Sculley, Matthew Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou

http://research.google.com/pubs/archive/37195.pdf

At first glance, this might appear to be a "Hello-World" machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes on a set of bad ads versus a set of good ones. However this is apparently far from being the case – while Google is understandably shy about hard numbers, the paper mentions several issues which make this especially challenging and notes that this is a business-critical problem for Google.

The paper describes an impressive and pragmatic blend of different techniques and tricks. I've briefly described some of the highlights, but I would certainly encourage the interested reader to check out the original paper and presentation slides.

In addition to the original paper and slides, I would suggest having David’s comments at hand while you read the paper. Not to mention having access to a machine and online library at the same time.

There is much here to repurpose to assist you and your users.
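For contrast with the paper’s production system, the “Hello-World” baseline it starts from really is a few lines. A scikit-learn sketch, purely illustrative and nothing like Google’s classifier; the ad snippets and labels are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

ads = ["miracle weight loss pills", "free ipad click now",
       "reliable family sedan for sale", "two bedroom apartment downtown"]
labels = [1, 1, 0, 0]  # 1 = bad ad, 0 = good ad (invented)

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(ads), labels)
print(clf.predict(vec.transform(["lose weight with this free trial"])))
# The paper's point is that, at Google's scale and against adversaries,
# this baseline is nowhere near enough.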

July 31, 2012

Big Data Machine Learning: Patterns for Predictive Analytics

Filed under: Machine Learning,Predictive Analytics — Patrick Durusau @ 7:25 am

Big Data Machine Learning: Patterns for Predictive Analytics by Ricky Ho.

A DZone “refcard” and, as you might expect, a bit “slim” for covering predictive analytics. Still, printed in full color it would make a nice handout on predictive analytics for a general audience.

What would you add to make a “refcard” on a particular method?

Or for that matter, what would you include to make a “refcard” on popular government resources? Can you name all the fields on the campaign disclosure files? Thought not.

July 28, 2012

Exploring the Universe with Machine Learning

Filed under: Astroinformatics,Machine Learning — Patrick Durusau @ 6:59 pm

Exploring the Universe with Machine Learning by Bruce Berriman.

From the post:

A short while ago, I attended a webinar on the above topic by Alex Gray and Nick Ball. The traditional approach to analytics involves identifying which collections of data or collections of information follow sets of rules. Machine learning (ML) takes a very different approach by finding patterns and making predictions from large collections of data.

The post reviews the presentation, CANFAR + Skytree Webinar Presentation (video here).

Good way to broaden your appreciation for “big data.” Astronomy has been awash in “big data” for years.

July 27, 2012

Anaconda: Scalable Python Computing

Filed under: Anaconda,Data Analysis,Machine Learning,Python,Statistics — Patrick Durusau @ 10:19 am

Anaconda: Scalable Python Computing

Easy, Scalable Distributed Data Analysis

Anaconda is a distribution that combines the most popular Python packages for data analysis, statistics, and machine learning. It has several tools for a variety of types of cluster computations, including MapReduce batch jobs, interactive parallelism, and MPI.

All of the packages in Anaconda are built, tested, and supported by Continuum. Having a unified runtime for distributed data analysis makes it easier for the broader community to share code, examples, and best practices — without getting tangled in a mess of versions and dependencies.

Good way to avoid dependency issues!

On scaling, I am reminded of a developer who designed a Python application to require upgrading for “heavy” use. Much to their disappointment, Python scaled under “heavy” use with no need for an upgrade. 😉

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

July 19, 2012

GraphLab 2.1 [New Release]

Filed under: GraphLab,Graphs,Machine Learning,Networks — Patrick Durusau @ 10:43 am

GraphLab 2.1

A new release (July 10, 2012) of GraphLab!

From the webpage:

Overview

Designing and implementing efficient and provably correct parallel machine learning (ML) algorithms can be very challenging. Existing high-level parallel abstractions like MapReduce are often insufficiently expressive while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance.

The new GraphLab 2.1 features:

  • a new GraphLab 2 abstraction
  • Fully Distributed with HDFS integration
  • New toolkits
    • Collaborative Filtering
    • Clustering
    • Text Modeling
    • Computer Vision
  • Improved Build system and documentation

Go to http://graphlab.org for details.

If you want to get started with GraphLab today, download the source or clone us on Google code. We recommend cloning to get the latest features and bug fixes.

hg clone https://code.google.com/p/graphlabapi/

If you don't have mercurial (hg) you can get it from http://mercurial.selenic.com/.

I almost didn’t find the download link. It is larger than anything else on the page, white letters on a black background, at: http://www.graphlab.org. I kept looking for a drop down menu item, etc.

Shows even the clearest presentation can be missed by a user. 😉

Now to get this puppy running on my local box.

July 11, 2012

Learning From Data [Machine Learning]

Filed under: Machine Learning — Patrick Durusau @ 2:26 pm

Learning from Data lectures by Professor Yaser Abu-Mostafa.

Just the main topics, these are composed of sub-topics on the lecture page (above):

  • Bayesian Learning
  • Bias-Variance Tradeoff
  • Bin Model
  • Data Snooping
  • Ensemble Learning
  • Error Measures
  • Gradient Descent
  • Learning Curves
  • Learning Diagram
  • Learning Paradigms
  • Linear Classification
  • Linear Regression
  • Logistic Regression
  • Netflix Competition
  • Neural Networks
  • Nonlinear Transformation
  • Occam’s Razor
  • Overfitting
  • Radial Basis Functions
  • Regularization
  • Sampling Bias
  • Support Vector Machines
  • Validation
  • VC Dimension

The textbook by the same title: Learning from Data.

The lectures look like a good place to get lost. For days!

July 8, 2012

R Integration in Weka

Filed under: Data Mining,Machine Learning,R,Weka — Patrick Durusau @ 9:57 am

R Integration in Weka by Mark Hall.

From the post:

These days it seems like every man and his proverbial dog is integrating the open-source R statistical language with his/her analytic tool. R users have long had access to Weka via the RWeka package, which allows R scripts to call out to Weka schemes and get the results back into R. Not to be left out in the cold, Weka now has a brand new package that brings the power of R into the Weka framework.

Weka

In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka’s built-in package manager.

Here is a list of the functionality implemented:

  • Execution of arbitrary R scripts in Weka’s Knowledge Flow engine
  • Datasets into and out of the R environment
  • Textual results out of the R environment
  • Graphics out of R in png format for viewing inside of Weka and saving to files via the JavaGD graphics device for R
  • A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
  • A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library

The use of R appears to be spreading! (Oracle, SAP, Hadoop, just to name a few that come readily to mind.)

Where is it on your list of data mining tools?

I first saw this at DZone.

July 7, 2012

Natural Language Processing | Hub

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

July 6, 2012

BigMl 0.3.1 Release

Filed under: Machine Learning,Predictive Analytics,Python — Patrick Durusau @ 9:45 am

BigMl 0.3.1 Release

From the webpage:

An open source binding to BigML.io, the public BigML API

Downloads

BigML makes machine learning easy by taking care of the details required to add data-driven decisions and predictive power to your company. Unlike other machine learning services, BigML creates beautiful predictive models that can be easily understood and interacted with.

These BigML Python bindings allow you to interact with BigML.io, the API for BigML. You can use it to easily create, retrieve, list, update, and delete BigML resources (i.e., sources, datasets, models, and predictions).

There’s that phrase again, predictive models.

Don’t people read patent literature anymore? 😉 I don’t care for absurdist fiction so I tend to avoid it. People claim invention for having a patent lawyer write common art up in legal prose. Good for patent lawyers, bad for researchers and true inventors.

July 5, 2012

JMLR – Journal of Machine Learning Research

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 10:37 am

JMLR – Journal of Machine Learning Research

From the webpage:

The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online.

Starts with volume 1 in October of 2000 and continues to the present.

Special topics that call out articles from different issues and special issues are also listed.

A first rate collection of machine learning research.

June 28, 2012

Pig as Teacher

Filed under: Machine Learning,Pig — Patrick Durusau @ 6:31 pm

Russell Jurney summarizes machine learning using Pig at the Hadoop Summit:

Jimmy Lin’s sold out talk about Large Scale Machine Learning at Twitter (paper available) (slides available) described the use of Pig to train machine learning algorithms at scale using Hadoop. Interestingly, learning was achieved using a Pig UDF StoreFunc (documentation available). Some interesting, related work can be found by Ted Dunning on github (source available).

The emphasis isn’t on innovation per se but on using Pig to create workflows that include machine learning on large data sets.

Read in detail for the Pig techniques (which you can reuse elsewhere) and the machine learning examples.

June 15, 2012

Deep Learning Tutorials

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 1:29 pm

Deep Learning Tutorials

From the main page:

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence. See these course notes for a brief introduction to Machine Learning for AI and an introduction to Deep Learning algorithms.

Deep Learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

For more about deep learning algorithms, see for example:

The tutorials presented here will introduce you to some of the most important deep learning algorithms and will also show you how to run them using Theano. Theano is a python library that makes writing deep learning models easy, and gives the option of training them on a GPU.

The algorithm tutorials have some prerequisites. You should know some python, and be familiar with numpy. Since this tutorial is about using Theano, you should read over the Theano basic tutorial first. Once you’ve done that, read through our Getting Started chapter — it introduces the notation, and [downloadable] datasets used in the algorithm tutorials, and the way we do optimization by stochastic gradient descent.

The tutorial materials reflect the content of Yoshua Bengio’s Learning Algorithms (ITF6266) course.
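If you want a taste of what “running them using Theano” means before starting the tutorials, here is a minimal logistic-regression SGD step, assuming Theano’s symbolic API as covered in its basic tutorial; the data is random:

import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')
y = T.vector('y')
w = theano.shared(np.zeros(3), name='w')
b = theano.shared(0.0, name='b')

p = T.nnet.sigmoid(T.dot(X, w) + b)              # predicted probability
cost = T.nnet.binary_crossentropy(p, y).mean()
gw, gb = T.grad(cost, [w, b])

# one stochastic gradient descent step per call
train = theano.function([X, y], cost,
                        updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])

data = np.random.randn(8, 3)
targets = (data[:, 0] > 0).astype('float64')
print(train(data, targets))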

Part of the resources you will find at: Deep Learning … moving beyond shallow machine learning since 2006!. There is a gap between 2010 and 2012, with only a few entries, such as those in the blog, dated 2012. There has been a considerable amount of work in the meantime, so you might want to contribute to the site.

June 13, 2012

ML-Flex

Filed under: Machine Learning,ML-Flex — Patrick Durusau @ 10:46 am

ML-Flex by Stephen Piccolo.

From the webpage:

ML-Flex uses machine-learning algorithms to derive models from independent variables, with the purpose of predicting the values of a dependent (class) variable. For example, machine-learning algorithms have long been applied to the Iris data set, introduced by Sir Ronald Fisher in 1936, which contains four independent variables (sepal length, sepal width, petal length, petal width) and one dependent variable (species of Iris flowers = setosa, versicolor, or virginica). Deriving prediction models from the four independent variables, machine-learning algorithms can often differentiate between the species with near-perfect accuracy.

Machine-learning algorithms have been developed in a wide variety of programming languages and offer many incompatible ways of interfacing to them. ML-Flex makes it possible to interface with any algorithm that provides a command-line interface. This flexibility enables users to perform machine-learning experiments with ML-Flex as a harness while applying algorithms that may have been developed in different programming languages or that may provide different interfaces.
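The Iris example above is easy to reproduce outside ML-Flex if you want to see the near-perfect accuracy for yourself (scikit-learn here purely for illustration; ML-Flex itself drives external algorithms through their command-line interfaces):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # four independent variables, species as the class variable
idx = np.random.RandomState(0).permutation(len(iris.target))
train, test = idx[:100], idx[100:]

clf = DecisionTreeClassifier().fit(iris.data[train], iris.target[train])
print(clf.score(iris.data[test], iris.target[test]))  # typically close to 1.0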

ML-Flex is described at: jmlr.csail.mit.edu/papers/volume13/piccolo12a/piccolo12a.pdf

I don’t see any inconsistency between my interest in machine learning and thinking that users are the ultimate judges of semantics. Machine learning is a tool, much like indexes, concordances and other tools before it.

I first saw ML-Flex at KDnuggets.

June 11, 2012

Machine Learning in Java has never been easier! [Java App <-> BigML Rest API]

Filed under: Java,Machine Learning — Patrick Durusau @ 4:25 pm

Machine Learning in Java has never been easier!

From the post:

Java is by far one of the most popular programming languages. It’s on the top of the TIOBE index and thousands of the most robust, secure, and scalable backends have been built in Java. In addition, there are many wonderful libraries available that can help accelerate your project enormously. For example, most of BigML’s backend is developed in Clojure which runs on top of the Java Virtual Machine. And don’t forget the ever-growing Android market, with 850K new devices activated each day!

There are number of machine learning Java libraries available to help build smart data-driven applications. Weka is one of the more popular options. In fact, some of BigML’s team members were Weka users as far back as the late 90s. We even used it as part of the first BigML backend prototype in early 2011. Apache Mahout is another great Java library if you want to deal with bigger amounts of data. However in both cases you cannot avoid “the fun of running servers, installing packages, writing MapReduce jobs, and generally behaving like IT ops folks“. In addition you need to be concerned with selecting and parametrizing the best algorithm to learn from your data as well as finding a way to activate and integrate the model that you generate into your application.

Thus we are thrilled to announce the availability of the first Open Source Java library that easily connects any Java application with the BigML REST API. It has been developed by Javi Garcia, an old friend of ours. A few of the BigML team members have been lucky enough to work with Javi in two other companies in the past.

With this new library, in just a few lines of code you can create a predictive model and generate predictions for any application domain. From finding the best price for a new product to forecasting sales, creating recommendations, diagnosing malfunctions, or detecting anomalies.

It won’t be as easy as “…in just a few lines of code…” but it will, what’s the term, modularize the building of machine learning applications. Someone has to run and maintain the servers, apply security patches, and do backups, but it doesn’t have to be you.

Specialization, that’s the other term. So that team members can be really good at what they do, as opposed to sorta good at a number of things.

If you need a common example, consider documentation, most of which is written by developers when they can spare the time. Reads like it. Costs your clients time and money trying to get their developers to work with poor documentation.

Not to mention costing you time and money when the software is no longer totally familiar to any one person.

PS: As of today, June 11, 2012, Java is now #2 and C is #1 on the TIOBE list.

June 7, 2012

I Dream of “Jini”

Filed under: Environment,Machine Learning,Smart-Phones — Patrick Durusau @ 2:20 pm

The original title reads: Argus Labs Celebrates The Launch Of The Beta Version Of Jini, The App That Goes Beyond The Check-In, And Unveils 2012 Roadmap For The First Time. See what you think:

Argus Labs, a deep data, machine learning and mobile start-up operating out of Antwerp (Belgium), will celebrate the closed beta of the mobile application the night before LeWeb 2012 at Tiger-Tiger, Haymarket in London’s West-End. From 18th June, registered users will be able to download and start evaluating the first version of the intelligent application, called Jini.

Jini is a personal advisor that helps discover unknown relations and hyper-personalised opportunities. Jini feels best when helping the user out in serendipitous moments, or proposing things that respond to the affinity its user has with its environment. Having access to hot opportunities and continuously being ‘in the know’ means a user can boost the quality of offline life.

Jini aims to raise the bar for private social networks by going beyond the check-in, saving the user the effort of doing too many manual actions. Jini applies machine learning with ambient sensing technology, so that the user can focus exclusively on having an awesome social sharing and discovery experience on smart-phones.

During the London launch event users will be able to sign up and exclusively download the first beta release of the app. The number of beta users is limited, so be fast. Argus Labs love to pioneer and will also have some goodies in store for the first 250 beta-users of the app.

See the post for registration information.

I sense a contradiction in “…continuously being ‘in the know’ means a user can boost the quality of offline life.” How am I going to be ‘in the know’ if I am offline?

Still, I suspect there are opportunities here to merge diverse data sets to provide users with “hyper-personalized opportunities,” so long as it doesn’t interrupt one “hyper-personalized” situation to advise of another, potential “hyper-personalized” opportunity.

That would be like a phone call from an ex-girlfriend at an inopportune time. Bad joss.
