SML: Scalable Machine Learning « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 19, 2012

SML: Scalable Machine Learning

Filed under: Machine Learning — Patrick Durusau @ 8:35 pm

Alex Smola’s lectures on Scalable Machine Learning at Berkeley come with a wealth of supplemental materials.

Overview:

Scalable Machine Learning occurs when Statistics, Systems, Machine Learning and Data Mining are combined into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach methods which are going to power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.

Just to give you a taste for the content, the first set of lectures is on Hardware and covers:

Hardware
  Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers
  Bandwidth, latency and faults

Basic parallelization paradigms
  Trees, stars, rings, queues
  Hashing (consistent, proportional)
  Distributed hash tables and P2P

Storage
  RAID
  Google File System / HadoopFS
  Distributed (key, value) storage

Processing
  MapReduce
  Dryad
  S4 / stream processing

Structured access beyond SQL
  BigTable
  Cassandra
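To make one item from that outline concrete, here is a minimal sketch of consistent hashing in Python. It is my own illustration, not code from Alex's slides: the class name `ConsistentHashRing`, the node names and the `replicas` parameter (virtual points per node) are all assumptions made for the example. The idea is that keys and nodes are hashed onto the same ring and each key belongs to the first node point at or after it, so adding or removing a node only moves the keys adjacent to that node's points.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Hash a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Illustrative consistent-hash ring (hypothetical names throughout)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas      # virtual points per node (assumed value)
        self._positions = []          # sorted ring positions
        self._owner = {}              # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `replicas` virtual points for the node on the ring."""
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if pos in self._owner:    # skip the (very unlikely) collision
                continue
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        """Remove every virtual point that belongs to the node."""
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            if self._owner.get(pos) == node:
                del self._owner[pos]
                del self._positions[bisect.bisect_left(self._positions, pos)]

    def get_node(self, key: str) -> str:
        """Return the node owning the first point at or after the key."""
        if not self._positions:
            raise ValueError("ring is empty")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


if __name__ == "__main__":
    ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
    print(ring.get_node("user:42"))
```

With a plain `hash(key) % number_of_nodes` scheme, changing the number of nodes reshuffles almost every key. With the ring, removing `node-b` leaves every key that lived on `node-a` or `node-c` exactly where it was, which is the property that makes the technique attractive for distributed (key, value) storage.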

Each set of lectures was back to back (to reduce travel time for Smola).

Hardware influences our thinking and design choices so it was good to see the lectures starting with coverage of hardware.

There is an interesting point near the end of the first lecture about never using editors to create editorial data. Alex explains that query results were validated at one point by women in their twenties, so other perspectives on query results were not reflected in the results. He suggested getting users to provide data for search validation rather than using experts to label the data.

I would split his comments on editorial content into:

1. Editorial content from experts
2. Editorial content from users

I would put #1 in the same category as getting ontologists or linked data types to mark up data. It works for them, from their point of view, but that doesn’t mean it works for the users of the data.

On the other hand, #2, content from users about how they think about their data and what constitutes a good result, seems a lot more appealing to me.

I would say that Alex’s point isn’t to avoid editors but to choose one’s editors carefully, favoring the users who will be using the results of the searches. (And avoid the activity of labeling; there are better ways to get the needed data from users.)

That doesn’t work for a generalized search interface like Google but then a public ….., err, water trough is a public water trough.
