SML: Scalable Machine Learning
Alex Smola’s lectures on Scalable Machine Learning at Berkeley with a wealth of supplemental materials.
Overview:
Scalable Machine Learning occurs when Statistics, Systems, Machine Learning and Data Mining are combined into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach methods which are going to power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.
Just to give you a taste for the content, the first set of lectures is on Hardware and covers:
- Hardware
  - Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers
  - Bandwidth, latency and faults
- Basic parallelization paradigms
  - Trees, stars, rings, queues
  - Hashing (consistent, proportional)
  - Distributed hash tables and P2P
- Storage
  - RAID
  - Google File System / HadoopFS
  - Distributed (key, value) storage
- Processing
  - MapReduce
  - Dryad
  - S4 / stream processing
  - Structured access beyond SQL
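To make one of the listed techniques concrete, here is a minimal sketch of consistent hashing with virtual nodes. This is not code from the lectures; the class name, the MD5-based ring, and the replica count are my own illustrative choices.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Map a string to a point on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._ring = []           # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Insert `replicas` virtual nodes for this physical node.
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._ring, (point, node))

    def remove(self, node: str) -> None:
        # Filtering preserves sorted order.
        self._ring = [(p, n) for (p, n) in self._ring if n != node]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's point.
        point = _hash(key)
        idx = bisect.bisect(self._ring, (point, ""))
        if idx == len(self._ring):
            idx = 0  # wrap around the ring
        return self._ring[idx][1]
```

The property that matters at scale: when a node is added or removed, only the keys that mapped to that node move, instead of nearly all keys as with naive `hash(key) % num_nodes`.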
Each set of lectures was back to back (to reduce travel time for Smola).
Hardware influences our thinking and design choices, so it was good to see the lectures start with coverage of hardware.
An interesting point near the end of the first lecture: never use editors to create editorial data. Alex explains that at one point query results were validated by women in their twenties, so other perspectives on query results were not reflected in the results. He suggested getting users to provide data for search validation rather than having experts label the data.
I would split his comments on editorial content into:
- Editorial content from experts
- Editorial content from users
I would put #1 in the same category as getting ontologists or linked data types to mark up data. It works for them, and from their point of view, but that doesn't mean it works for the users of the data.
On the other hand, #2, content from users about how they think about their data and what constitutes a good result, seems a lot more appealing to me.
I would say that Alex's point isn't to avoid editors altogether but to choose one's editors carefully, favoring the users who will be using the results of the searches. (And to avoid explicit labeling; there are better ways to get the needed data from users.)
That doesn't work for a generalized search interface like Google, but then a public ….., err, water trough is a public water trough.