Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 2, 2014

Data Science Master

Open Source Data Science Master – The Plan by Fras and Sabine.

From the post:

Free!! education platforms have put some of the world’s most prestigious courses online in the last few years. This is our plan to use these and create our own custom open source data science Master.

Free online courses are selected to cover: Data Manipulation, Machine Learning & Algorithms, Programming, Statistics, and Visualization.

Be sure to take note of the prerequisites the authors completed before embarking on their course work.

No particular project component is suggested because the course work will suggest ideas.

What other choices would you suggest? Either for broader basics or specialization?

August 1, 2014

GraphLab Conference 2014 (Videos!)

Filed under: GraphLab,Graphs,Machine Learning — Patrick Durusau @ 1:45 pm

GraphLab Conference 2014 (Videos!)

Videos from the GraphLab Conference 2014 have been posted! Who needs to wait for a new season of Endeavour? 😉

(I included the duration times so you can squeeze these in between conference calls.)

Presentations, ordered by author’s last name.

Training Sessions on GraphLab Create

I first saw this in a tweet by xamat.

July 30, 2014

Awesome Machine Learning

Filed under: Data Analysis,Machine Learning,Visualization — Patrick Durusau @ 3:30 pm

Awesome Machine Learning by Joseph Misiti.

From the webpage:

A curated list of awesome machine learning frameworks, libraries and software (by language). Inspired by awesome-php. Other awesome lists can be found in the awesome-awesomeness list.

If you want to contribute to this list (please do), send me a pull request or contact me @josephmisiti

Not strictly limited to “machine learning” as it offers resources on data analysis, visualization, etc.

With a list of 576 resources, I am sure you will find something new!

July 28, 2014

Oryx 2:…

Filed under: Machine Learning,Spark — Patrick Durusau @ 6:56 pm

Oryx 2: Lambda architecture on Spark for real-time large scale machine learning

From the overview:

This is a redesign of the Oryx project as “Oryx 2.0”. The primary design goals are:

1. A more reusable platform for lambda-architecture-style designs, with batch, speed and serving layers

2. Make each layer usable independently

3. Fuller support for common machine learning needs

  • Test/train set split and evaluation
  • Parallel model build
  • Hyper-parameter selection

4. Use newer technologies like Spark and Streaming in order to simplify:

  • Remove separate in-core implementations for scale-down
  • Remove custom data transport implementation in favor of message queues like Apache Kafka
  • Use a ‘real’ streaming framework instead of reimplementing a simple one
  • Remove complex MapReduce-based implementations in favor of Apache Spark-based implementations

5. Support more input (i.e. not just CSV)

Initial import was three days ago if you are interested in being in on the beginning!
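
If the lambda-architecture vocabulary is new to you, here is a minimal, purely illustrative Python sketch of the batch/speed/serving layering. Nothing here is Oryx code; in a real deployment the events would arrive over a message queue like Kafka and the recomputation would be a Spark job.

```python
from collections import defaultdict

class BatchLayer:
    """Periodically recomputes a view from the full history (slow, accurate)."""
    def __init__(self):
        self.history = []

    def ingest(self, event):
        self.history.append(event)

    def recompute(self):
        counts = defaultdict(int)
        for item in self.history:   # full pass over all data, batch-style
            counts[item] += 1
        return dict(counts)

class SpeedLayer:
    """Applies cheap incremental updates between batch recomputations."""
    def __init__(self):
        self.delta = defaultdict(int)

    def update(self, event):
        self.delta[event] += 1

class ServingLayer:
    """Answers queries by merging the last batch view with the speed deltas."""
    def __init__(self, batch, speed):
        self.batch_view, self.batch, self.speed = {}, batch, speed

    def refresh(self):
        self.batch_view = self.batch.recompute()  # batch job lands...
        self.speed.delta.clear()                  # ...and deltas reset

    def query(self, item):
        return self.batch_view.get(item, 0) + self.speed.delta.get(item, 0)

batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(batch, speed)
for e in ["a", "b", "a"]:
    batch.ingest(e)
    speed.update(e)
print(serving.query("a"))   # 2, served entirely from the speed layer
serving.refresh()
print(serving.query("a"))   # still 2, now from the batch view
```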

July 19, 2014

What is deep learning, and why should you care?

Filed under: Deep Learning,Image Recognition,Machine Learning — Patrick Durusau @ 2:45 pm

What is deep learning, and why should you care? by Pete Warden.

From the post:


When I first ran across the results in the Kaggle image-recognition competitions, I didn’t believe them. I’ve spent years working with machine vision, and the reported accuracy on tricky tasks like distinguishing dogs from cats was beyond anything I’d seen, or imagined I’d see anytime soon. To understand more, I reached out to one of the competitors, Daniel Nouri, and he demonstrated how he used the Decaf open-source project to do so well. Even better, he showed me how he was quickly able to apply it to a whole bunch of other image-recognition problems we had at Jetpac, and produce much better results than my conventional methods.

I’ve never encountered such a big improvement from a technique that was largely unheard of just a couple of years before, so I became obsessed with understanding more. To be able to use it commercially across hundreds of millions of photos, I built my own specialized library to efficiently run prediction on clusters of low-end machines and embedded devices, and I also spent months learning the dark arts of training neural networks. Now I’m keen to share some of what I’ve found, so if you’re curious about what on earth deep learning is, and how it might help you, I’ll be covering the basics in a series of blog posts here on Radar, and in a short upcoming ebook.

Pete gives a brief sketch of “deep learning” and promises more posts and a short ebook to follow.

Along those same lines you will want to see:

Microsoft Challenges Google’s Artificial Brain With ‘Project Adam’ by Daniela Hernandez (WIRED).

If you want in depth (technical) coverage, see: Deep Learning…moving beyond shallow machine learning since 2006! The reading list and references here should keep you busy for some time.

BTW, on “…shallow machine learning…”: you do know the “Dark Ages” really weren’t “dark”? The name was coined in the Renaissance to cast the period after the Fall of Rome as a descent into darkness, so the Renaissance could style itself the return of “light.” See: Dark Ages (historiography).

Don’t overly credit characterizations of ages or technologies by later ages or newer technologies. They too will be found primitive and superstitious.

HOGWILD!

Filed under: Algorithms,Machine Learning — Patrick Durusau @ 2:21 pm

Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent by Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright.

Abstract:

Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called Hogwild! which allows processors access to shared memory with the possibility of over-writing each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild! achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild! outperforms alternative schemes that use locking by an order of magnitude. (emphasis in original)

From further in the paper:

Our second graph cut problem sought a multi-way cut to determine entity recognition in a large database of web data. We created a data set of clean entity lists from the DBLife website and of entity mentions from the DBLife Web Crawl [11]. The data set consists of 18,167 entities and 180,110 mentions and similarities given by string similarity. In this problem each stochastic gradient step must compute a Euclidean projection onto a simplex of dimension 18,167.

A 9X speedup on 10 cores. (Against Vowpal Wabbit.)

A must read paper.
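
To make the update scheme concrete, here is a toy Python sketch of the Hogwild! pattern: several workers apply sparse SGD steps to a shared weight vector with no locking. Python’s GIL means this shows the access pattern rather than a real speedup; the paper’s setting is multicore shared-memory code.

```python
import threading
import numpy as np

# Toy sparse logistic regression data: each example touches 5 of 500 features.
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 500
rows = [rng.choice(n_features, size=5, replace=False) for _ in range(n_samples)]
vals = [rng.normal(size=5) for _ in range(n_samples)]
w_true = rng.normal(size=n_features)
ys = [1.0 if v @ w_true[r] > 0 else -1.0 for r, v in zip(rows, vals)]

w = np.zeros(n_features)  # shared parameter vector, deliberately unlocked

def worker(indices, lr=0.1):
    for i in indices:
        r, v, y = rows[i], vals[i], ys[i]
        margin = y * (v @ w[r])
        grad = -y * v / (1.0 + np.exp(margin))  # logistic loss gradient
        # Hogwild! step: update only the touched coordinates, no lock.
        w[r] -= lr * grad

threads = [threading.Thread(target=worker, args=(range(t, n_samples, 4),))
           for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()

acc = np.mean([np.sign(v @ w[r]) == y for r, v, y in zip(rows, vals, ys)])
print(f"training accuracy: {acc:.3f}")
```

Because each example only modifies a handful of coordinates, collisions between workers are rare, which is exactly the sparsity condition the convergence analysis relies on.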

I first saw this in Nat Torkington’s Four short links: 15 July 2014. Nat says:

the algorithm that Microsoft credit with the success of their Adam deep learning system.

July 17, 2014

Scikit-learn 0.15 release

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 6:16 pm

Scikit-learn 0.15 release by Gaël Varoquaux.

From the post:

Highlights:

Quality— Looking at the commit log, there has been a huge amount of work to fix minor annoying issues.

Speed— There has been a huge effort put in making many parts of scikit-learn faster. Little details all over the codebase. We do hope that you’ll find that your applications run faster. For instance, we find that the worst case speed of Ward clustering is 1.5 times faster in 0.15 than 0.14. K-means clustering is often 1.1 times faster. KNN, when used in brute-force mode, got faster by a factor of 2 or 3.

Random Forest and various tree methods— The random forest and various tree methods are much much faster, use parallel computing much better, and use less memory. For instance, the picture on the right shows the scikit-learn random forest running in parallel on a fat Amazon node, and nicely using all the CPUs with little RAM usage.

Hierarchical agglomerative clustering— Complete linkage and average linkage clustering have been added. The benefit of these approaches compared to the existing Ward clustering is that they can take an arbitrary distance matrix.

Robust linear models— Scikit-learn now includes RANSAC for robust linear regression.

HMM are deprecated— We have been discussing for a long time removing HMMs, that do not fit in the focus of scikit-learn on predictive modeling. We have created a separate hmmlearn repository for the HMM code. It is looking for maintainers.

And much more— plenty of “minor things”, such as better support for sparse data, better support for multi-label data…
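
Two of those highlights are easy to try immediately. A rough sketch against the 0.15-era API (note that later releases renamed some of these arguments, e.g. `affinity` became `metric`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Robust regression: RANSAC ignores gross outliers a plain fit would chase.
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:10] += 40  # corrupt a few points
ransac = RANSACRegressor(LinearRegression()).fit(X, y)
print(ransac.estimator_.coef_)  # close to 3.0 despite the outliers

# Average-linkage clustering over an arbitrary precomputed distance matrix,
# something Ward linkage alone could not do.
points = rng.normal(size=(20, 3))
D = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
labels = AgglomerativeClustering(
    n_clusters=2, linkage="average", affinity="precomputed").fit_predict(D)
print(labels)
```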

Get thee to Scikit-learn!

July 15, 2014

Classification and regression trees

Filed under: Classification,Machine Learning,Regression,Trees — Patrick Durusau @ 3:47 pm

Classification and regression trees by Wei-Yin Loh.

Abstract:

Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weakness in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1, 14–23. DOI: 10.1002/widm.8

A bit more challenging than CSV formats but also very useful.
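
If you want to experiment alongside the paper, scikit-learn’s tree implementations make a convenient sandbox. A minimal sketch with toy data of my own (not the paper’s examples):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)

# Classification tree: unordered class labels, misclassification-style loss.
Xc = rng.uniform(size=(200, 2))
yc = (Xc[:, 0] + Xc[:, 1] > 1.0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(Xc, yc)
print(clf.predict([[0.9, 0.8], [0.1, 0.2]]))  # expect [1 0]

# Regression tree: continuous target, squared-error criterion,
# piecewise-constant prediction within each partition cell.
Xr = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
yr = np.sin(Xr).ravel() + rng.normal(scale=0.1, size=200)
reg = DecisionTreeRegressor(max_depth=4).fit(Xr, yr)
print(reg.predict([[np.pi / 2]]))  # near 1.0
```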

I heard a joke many years ago from a then U.S. Assistant Attorney General, who said:

To create a suspect list for a truck hijacking in New York, you choose files with certain name characteristics, delete the ones that are currently in prison and those that remain are your suspect list. (paraphrase)

If topic maps can represent any “subject” then they should be able to represent “group subjects” as well. We may know that our particular suspect is the member of a group, but we just don’t know which member of the group is our suspect.

Think of it as a topic map that evolves as more data/analysis is brought to the map and members of a group subject can be broken out into smaller groups or even individuals.

In fact, displaying summaries of characteristics of members of a group in response to classification/regression could well help with the subject analysis process. An interactive construction/mining of the topic map as it were.

Great paper whether you use it for topic map subject analysis or more traditional purposes.

July 14, 2014

CMU Machine Learning Summer School (2014)

Filed under: Machine Learning — Patrick Durusau @ 4:37 pm

CMU Machine Learning Summer School (2014)

From the webpage:

Machine Learning is a foundational discipline that forms the basis of much modern data analysis. It combines theory from areas as diverse as Statistics, Mathematics, Engineering, and Information Technology with many practical and relevant real life applications. The focus of the current summer school is big data analytics, distributed inference, scalable algorithms, and applications to the digital economy. The event is targeted at research students, IT professionals, and academics from all over the world.

This school is suitable for all levels, both for researchers without previous knowledge in Machine Learning, and those wishing to broaden their expertise in this area. That said, some background will prove useful. For a research student, the summer school provides a unique, high-quality, and intensive period of study. It is ideally suited for students currently pursuing, or intending to pursue, research in Machine Learning or related fields. Limited scholarships are available for students to cover accommodation, registration costs, and partial travel expenses.

Videos have been posted at YouTube!

Enjoy!

Quoc Le’s Lectures on Deep Learning

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 1:32 pm

Quoc Le’s Lectures on Deep Learning by Gaurav Trivedi.

From the post:

Dr. Quoc Le from the Google Brain project team (yes, the one that made headlines for creating a cat recognizer) presented a series of lectures at the Machine Learning Summer School (MLSS ’14) in Pittsburgh this week. This is my favorite lecture series from the event till now and I was glad to be able to attend them.

The good news is that the organizers have made available the entire set of video lectures in 4K for you to watch. But since Dr. Le did most of them on the board and did not provide any accompanying slides, I decided to put the contents of the lectures along with the videos here.

I like Gaurav’s “enhanced” version over the straight YouTube version.

I need to go back and look at the cat recognizer. Particularly if I can use it as a filter on a twitter stream. 😉

I first saw this in Nat Torkington’s Four short links: 14 July 2014.

July 7, 2014

Random Forests…

Filed under: Ensemble Methods,GPU,Machine Learning,Random Forests — Patrick Durusau @ 2:30 pm

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams by Diego Marron, Albert Bifet, Gianmarco De Francisci Morales.

Abstract:

Random Forests is a classical ensemble method used to improve the performance of single tree classifiers. It is able to obtain superior performance by increasing the diversity of the single classifiers. However, in the more challenging context of evolving data streams, the classifier has also to be adaptive and work under very strict constraints of space and time. Furthermore, the computational load of using a large number of classifiers can make its application extremely expensive. In this work, we present a method for building Random Forests that use Very Fast Decision Trees for data streams on GPUs. We show how this method can benefit from the massive parallel architecture of GPUs, which are becoming an efficient hardware alternative to large clusters of computers. Moreover, our algorithm minimizes the communication between CPU and GPU by building the trees directly inside the GPU. We run an empirical evaluation and compare our method to two well-known machine learning frameworks, VFML and MOA. Random Forests on the GPU are at least 300x faster while maintaining a similar accuracy.

The authors should get a special mention for honesty in research publishing. Figure 11 shows their GPU Random Forest algorithm seeming to scale almost constantly. The authors explain:

In this dataset MOA scales linearly while GPU Random Forests seems to scale almost constantly. This is an effect of the scale, as GPU Random Forests runs in milliseconds instead of minutes.

How fast/large are your data streams?

I first saw this in a tweet by Stefano Bertolo.

June 18, 2014

Drag-n-Drop Machine Learning?

Filed under: Azure Marketplace,Machine Learning,Microsoft — Patrick Durusau @ 9:34 am

Microsoft to provide drag-and-drop machine learning on Azure by Derrick Harris.

From the post:

Microsoft is stepping up its cloud computing game with a new service called Azure Machine Learning that lets users visually build and train machine learning models, and then publish APIs to insert those models into applications. The service, which will be available for public preview in July, is one of the first of its kind and the latest demonstration of Microsoft’s heavy investment in machine learning.

Azure Machine Learning will include numerous prebuilt model types and packages, including recommendation engines, decision trees, R packages and even deep neural networks (aka deep learning models), explained Joseph Sirosh, corporate vice president at Microsoft. The data that the models train on and analyze can reside in Azure or locally, and users are charged based on the number of API calls to their models and the amount of computing resources consumed running them.

The reason why there are so few data scientists today, Sirosh theorized, is that they need to know so many software tools and so much math and computer science just to experiment and build models. Actually deploying those models into production, especially at scale, opens up a whole new set of engineering challenges. Sirosh said Microsoft hopes Azure Machine Learning will open up advanced machine learning to anyone who understands the R programming language or, really, anyone with a respectable understanding of statistics.

“It’s also very simple. My high school son can build machine learning models and publish APIs,” he said.

Reducing the technical barriers to use machine learning is a great thing. However, if that also results in reducing the understanding of machine learning, its perils and pitfalls, that is also a very bad thing.

One of the strengths of the Weka courses taught by Prof. Ian H. Witten is that students learn that choices are made in machine learning algorithms that aren’t apparent to the casual user. And that data choices can make as much difference in outcomes as the algorithms used to process that data.

Use of software with no real understanding of its limitations isn’t new, but with Azure Machine Learning any challenge to an analysis will be met with the suggestion that you “…run the analysis yourself,” as though the speaker does not understand that a replicated bad result is still a bad result.

Be prepared to challenge data and means of analysis used in drag-n-drop machine learning drive-bys.

May 24, 2014

the HiggsML challenge

Filed under: Challenges,Machine Learning,Particle Physics — Patrick Durusau @ 2:29 pm

the HiggsML challenge

The challenge runs from May 12th to September 2014.

From the challenge:

In a nutshell, we provide a data set containing a mixture of simulated signal and background events, built from simulated events provided by the ATLAS collaboration at CERN. Competitors can use or develop any algorithm they want, and the one who achieves the best signal/background separation wins! Besides classical prizes for the winners, a special “HEP meets ML” prize will also be awarded with an invitation to CERN; we are also seeking to organise a NIPS workshop.

For this HEP challenge we deliberately picked one of the most recent and hottest playgrounds: the Higgs decaying into a pair of tau leptons. The first ATLAS results were made public in December 2013 in a CERN seminar, ATLAS sees Higgs boson decay to fermions. The simulated events that participants will have in their hands are the same that physicists used. Participants will be working in realistic conditions although we have simplified quite a bit the original problem so that it became tractable without any background in physics.

HEP physicist, even ATLAS physicists, who have experience with multivariate analysis, neural nets, boosted decision trees and the like are warmly encouraged to compete with machine learning experts.

The Laboratoire de l’Accélerateur Linéaire (LAL) is a French lab located in the vicinity of Paris. It is overseen by both the CNRS (IN2P3) and University Paris-Sud. It counts 330 employees (125 researchers and 205 engineers and technicians) and brings internationally recognized contributions to experimental Particle Physics, Accelerator Physics, Astroparticle Physics, and Cosmology.

Contact : for any question of general interest about the challenge, please consult and use the forum provided on the Kaggle web site. For private comments, we are also reachable at higgsml_at_lal.in2p3.fr.

Now there is a machine learning challenge for the summer!
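
For the curious, the challenge scores submissions with the approximate median significance (AMS). A small Python rendering of the formula as I recall it from the challenge documentation; treat the exact constants, especially the regularization term, as my assumption:

```python
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance (as I understand the challenge metric).

    s, b: weighted counts of true signal / background events your classifier
    selected. b_reg: a regularization term added to stabilize small b.
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

print(ams(s=300.0, b=9000.0))  # higher is better
```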

Not to mention more science being done on the basis of public data sets.

Be sure to forward this to both your local computer science and physics department.

May 23, 2014

Learning Everything About Anything (sort of)

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 2:52 pm

Meet the algorithm that can learn “everything about anything” by Derrick Harris.

From the post:

One of the more interesting projects is a system called LEVAN, which is short for Learn EVerything about ANything and was created by a group of researchers out of the Allen Institute for Artificial Intelligence and the University of Washington. One of them, Carlos Guestrin, is also co-founder and CEO of a data science startup called GraphLab. What’s really interesting about LEVAN is that it’s neither human-supervised nor unsupervised (like many deep learning systems), but what its creators call “webly supervised.”

(image omitted)

What that means, essentially, is that LEVAN uses the web to learn everything it needs to know. It scours Google Books Ngrams to learn common phrases associated with a particular concept, then searches for those phrases in web image repositories such as Google Images, Bing and Flickr. For example, LEVAN now knows that “heavyweight boxing,” “boxing ring” and “ali boxing” are all part of the larger concept of “boxing,” and it knows what each one looks like.

When I said “sort of” in the title I didn’t mean any disrespect for LEVAN. On the contrary, the researchers limiting LEVAN to Google Books Ngrams and images is a brilliant move. That limits LEVAN to the semantic debris that can be found in public image repositories, but depending upon your requirements, that may be more than sufficient.

The other upside is that despite a pending patent, sigh, the source code is available for research/academic purposes.

What data sets make useful limits for your AI/machine learning algorithm? Your application need not understand intercepted phone conversations, Barbara Walters, or popular music, if those are not in your requirements. Simplifying your AI problem may be the first step towards solving it.

May 18, 2014

12 Free (as in beer) Data Mining Books

12 Free (as in beer) Data Mining Books by Chris Leonard.

While all of these volumes could be shelved under “data mining” in a bookstore, I would break them out into smaller categories:

  • Bayesian Analysis/Methods
  • Data Mining
  • Data Science
  • Machine Learning
  • R
  • Statistical Learning

Didn’t want you to skip over Chris’ post because it was “just about data mining.” 😉

Check your hard drive to see what you are missing.

I first saw this in a tweet by Carl Anderson.

May 15, 2014

Distributed LIBLINEAR:

Filed under: Machine Learning,MPI,Spark,Virtual Machines — Patrick Durusau @ 10:23 am

Distributed LIBLINEAR: Libraries for Large-scale Linear Classification on Distributed Environments

From the webpage:

MPI LIBLINEAR is an extension of LIBLINEAR on distributed environments. The usage and the data format are the same as LIBLINEAR. Currently only two solvers are supported:

  • L2-regularized logistic regression (LR)
  • L2-regularized L2-loss linear SVM

NOTICE: This extension can only run on Unix-like systems. (We test it on Ubuntu 13.10.) Python and Matlab interfaces are not supported.

Spark LIBLINEAR is a Spark implementation based on LIBLINEAR and integrated with Hadoop distributed file system. This package is developed using Scala. Currently it supports the same two solvers as MPI LIBLINEAR.

If you are unfamiliar with LIBLINEAR:

LIBLINEAR is a linear classifier for data with millions of instances and features. It supports

  • L2-regularized classifiers
    L2-loss linear SVM, L1-loss linear SVM, and logistic regression (LR)
  • L1-regularized classifiers (after version 1.4)
    L2-loss linear SVM and logistic regression (LR)
  • L2-regularized support vector regression (after version 1.9)
    L2-loss linear SVR and L1-loss linear SVR.

Main features of LIBLINEAR include

  • Same data format as LIBSVM, our general-purpose SVM solver, and also similar usage
  • Multi-class classification: 1) one-vs-the rest, 2) Crammer & Singer
  • Cross validation for model selection
  • Probability estimates (logistic regression only)
  • Weights for unbalanced data
  • MATLAB/Octave, Java, Python, Ruby interfaces
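
If you want a feel for the two supported solvers on a single machine before going distributed, scikit-learn uses LIBLINEAR internally for comparable models. A rough sketch against a recent scikit-learn API (not the LIBLINEAR command line itself):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

# L2-regularized logistic regression (one of the two distributed solvers).
lr = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# L2-regularized L2-loss linear SVM (the other supported solver).
svm = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0).fit(X, y)

print(lr.score(X, y), svm.score(X, y))
```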

You will also find instructions for creating distributed environments using VirtualBox for both MPI LIBLINEAR and Spark LIBLINEAR. I am going to post on that separately to draw attention to it.

The phrase “standalone computer” is rapidly becoming a misnomer. Forward-looking algorithm designers and power users will begin gaining experience with the new distributed “normal” at every opportunity.

I first saw this in a tweet by Reynold Xin.

May 13, 2014

Bringing machine learning and compositional semantics together

Filed under: Machine Learning,Semantics — Patrick Durusau @ 6:24 pm

Bringing machine learning and compositional semantics together by Percy Liang and Christopher Potts.

Abstract:

Computational semantics has long been seen as a field divided between logical and statistical approaches, but this divide is rapidly eroding, with the development of statistical models that learn compositional semantic theories from corpora and databases. This paper presents a simple discriminative learning framework for defining such models and relating them to logical theories. Within this framework, we discuss the task of learning to map utterances to logical forms (semantic parsing) and the task of learning from denotations with logical forms as latent variables. We also consider models that use distributed (e.g., vector) representations rather than logical ones, showing that these can be seen as part of the same overall framework for understanding meaning and structural complexity.

My interest is in how computational semantics can illuminate issues in semantics. It has been my experience that the transition from natural language to more formal (and less robust) representations draws out semantic issues, such as ambiguity, that lurk unnoticed in natural language texts.

With right at seven pages of references, you will have no shortage of reading material on compositional semantics.

I first saw this in a tweet by Chris Brockett.

May 12, 2014

Enough Machine Learning to…

Filed under: Machine Learning,Python,Scikit-Learn — Patrick Durusau @ 6:38 pm

Enough Machine Learning to Make Hacker News Readable Again by Ned Jackson Lovely.

From the description:

It’s inevitable that online communities will change, and that we’ll remember the community with a fondness that likely doesn’t accurately reflect the former reality. We’ll explore how we can take a set of articles from an online community and winnow out the stuff we feel is unworthy. We’ll explore some of the machine learning tools that are just a “pip install” away, such as scikit-learn and nltk.

Ned recommends you start with the map I cover at: Machine Learning Cheat Sheet (for scikit-learn).

Great practice with scikit-learn. Following this as a general outline will develop your machine learning skills!

May 11, 2014

…Technology-Assisted Review in Electronic Discovery…

Filed under: Machine Learning,Spark — Patrick Durusau @ 7:16 pm

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery by Gordon V. Cormack & Maura R. Grossman.

Abstract:

Using a novel evaluation toolkit that simulates a human reviewer in the loop, we compare the effectiveness of three machine-learning protocols for technology-assisted review as used in document review for discovery in legal proceedings. Our comparison addresses a central question in the deployment of technology-assisted review: Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? On eight review tasks — four derived from the TREC 2009 Legal Track and four derived from actual legal matters — recall was measured as a function of human review effort. The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P<0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents. Among passive-learning methods, significantly less human review effort (P<0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of "stabilization" -- determining when training is adequate, and therefore may stop.

New acronym for me: TAR (technology-assisted review).

If you are interested in legal discovery, take special note that the authors have released a TAR evaluation toolkit.

This article and its references will repay a close reading several times over.
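
To make the winning protocol concrete, here is a minimal sketch of continuous active learning with relevance feedback. This is my illustration, not the authors’ toolkit; `oracle` (the human reviewer’s judgment), `seed_query`, and the loop parameters are all my inventions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_active_learning(docs, oracle, seed_query, batch=10, rounds=5):
    """Each round: train on everything reviewed so far, then send the
    top-scoring unreviewed documents to the human reviewer (the oracle)."""
    X = TfidfVectorizer().fit_transform(docs)
    # Seed the training set with a simple keyword search, per the paper.
    reviewed = {i: oracle(i) for i, d in enumerate(docs) if seed_query in d.lower()}
    for _ in range(rounds):
        idx = list(reviewed)
        labels = [reviewed[i] for i in idx]
        if len(set(labels)) < 2:
            break  # need both relevant and non-relevant examples to train
        model = LogisticRegression().fit(X[idx], labels)
        scores = model.predict_proba(X)[:, 1]
        candidates = [i for i in np.argsort(-scores) if i not in reviewed]
        for i in candidates[:batch]:   # relevance feedback: review the most
            reviewed[i] = oracle(i)    # likely relevant, not the most uncertain
    return reviewed
```

Swapping the selection rule from “highest score” to “closest to the decision boundary” turns this into the uncertainty-sampling variant the paper compares against.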

March 29, 2014

mtx:…

Filed under: Information Retrieval,Machine Learning — Patrick Durusau @ 6:35 pm

mtx: a swiss-army knife for information retrieval

From the webpage:

mtx is a command-line tool for rapidly trying new ideas in Information Retrieval and Machine Learning.

mtx is the right tool if you secretly wish you could:

  • play with Wikipedia-sized datasets on your laptop
  • do it interactively, like the boys whose data fits in Matlab
  • quickly test that too-good-to-be-true algorithm you see at SIGIR
  • try ungodly concoctions, like BM25-weighted PageRank over ratings
  • cache all intermediate results, so you never have to re-run a month-long job
  • use awk/perl to hack internal data structures half-way through a computation

mtx is made for Unix hackers. It is a shell tool, not a library or an application. It’s designed for interactive use and relies on your shell’s tab-completion and history features. For scripting it, I highly recommend this.

What do you have on your bootable USB stick? 😉

March 27, 2014

Apache Mahout, “…Ya Gotta Hit The Road”

Filed under: H20,Machine Learning,Mahout,MapReduce,Spark — Patrick Durusau @ 3:24 pm

The news in Derrick Harris’ “Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce” reminded me of a line from Tommy, “Just as the gypsy queen must do, ya gotta hit the road.”

From the post:

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xdata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of statistical computations — including deep learning models — on data stored in the Hadoop Distributed File System.

Support for multiple data frameworks is yet another reason to learn Mahout.

March 10, 2014

Data Science 101: Deep Learning Methods and Applications

Filed under: Data Science,Deep Learning,Machine Learning,Microsoft — Patrick Durusau @ 7:56 pm

Data Science 101: Deep Learning Methods and Applications by Daniel Gutierrez.

From the post:

Microsoft Research, the research arm of the software giant, is a hotbed of data science and machine learning research. Microsoft has the resources to hire the best and brightest researchers from around the globe. A recent publication is available for download (PDF): “Deep Learning: Methods and Applications” by Li Deng and Dong Yu, two prominent researchers in the field.

Deep sledding with twenty (20) pages of bibliography and pointers to frequently updated lists of resources (at page 8).

You did say you were interested in deep learning. Yes? 😉

Enjoy!

February 26, 2014

Open science in machine learning

Filed under: Dataset,Machine Learning,Open Science — Patrick Durusau @ 3:29 pm

Open science in machine learning by Joaquin Vanschoren, Mikio L. Braun, and Cheng Soon Ong.

Abstract:

We present OpenML and mldata, open science platforms that provide easy access to machine learning data, software and results to encourage further study and application. They go beyond the more traditional repositories for data sets and software packages in that they allow researchers to also easily share the results they obtained in experiments and to compare their solutions with those of others.

From 2 OpenML:

OpenML (http://openml.org) is a website where researchers can share their data sets, implementations and experiments in such a way that they can easily be found and reused by others. It offers a web API through which new resources and results can be submitted automatically, and is being integrated in a number of popular machine learning and data mining platforms, such as Weka, RapidMiner, KNIME, and data mining packages in R, so that new results can be submitted automatically. Vice versa, it enables researchers to easily search for certain results (e.g. evaluations of algorithms on a certain data set), to directly compare certain techniques against each other, and to combine all submitted data in advanced queries.

From 3 mldata:

mldata (http://mldata.org) is a community-based website for the exchange of machine learning data sets. Data sets can either be raw data files or collections of files, or use one of the supported file formats like HDF5 or ARFF in which case mldata looks at meta data contained in the files to display more information. Similar to OpenML, mldata can define learning tasks based on data sets, where mldata currently focuses on supervised learning data. Learning tasks identify which features are used for input and output and also which score is used to evaluate the functions. mldata also allows to create learning challenges by grouping learning tasks together, and lets users submit results in the form of predicted labels which are then automatically evaluated.

Interesting sites.

Does raise the question of who will index the indexers of datasets?

I first saw this in a tweet by Stefano Bertolo.

February 16, 2014

Stanford Spring Offerings

Filed under: Compilers,Machine Learning — Patrick Durusau @ 8:43 pm

Just a quick heads up on two Stanford Online courses that may be of interest:

Compilers, Alex Aiken, Monday, March 17, 2014

From the course page:

This course will discuss the major ideas used today in the implementation of programming language compilers, including lexical analysis, parsing, syntax-directed translation, abstract syntax trees, types and type checking, intermediate languages, dataflow analysis, program optimization, code generation, and runtime systems. As a result, you will learn how a program written in a high-level language designed for humans is systematically translated into a program written in low-level assembly more suited to machines. Along the way we will also touch on how programming languages are designed, programming language semantics, and why there are so many different kinds of programming languages.

The course lectures will be presented in short videos. To help you master the material, there will be in-lecture questions to answer, quizzes, and two exams: a midterm and a final. There will also be homework in the form of exercises that ask you to show a sequence of logical steps needed to derive a specific result, such as the sequence of steps a type checker would perform to type check a piece of code, or the sequence of steps a parser would perform to parse an input string. This checking technology is the result of ongoing research at Stanford into developing innovative tools for education, and we’re excited to be the first course ever to make it available to students.

An optional course project is to write a complete compiler for COOL, the Classroom Object Oriented Language. COOL has the essential features of a realistic programming language, but is small and simple enough that it can be implemented in a few thousand lines of code. Students who choose to do the project can implement it in either C++ or Java.

I hope you enjoy the course!

Machine Learning, Andrew Ng, Monday, March 3, 2014.

From the course page:

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.

Enjoy!

February 2, 2014

Data Workflows for Machine Learning:

Filed under: Machine Learning,Workflow — Patrick Durusau @ 4:32 pm

Data Workflows for Machine Learning: by Paco Nathan.

Excellent presentation on data workflows, at least if you think of them as being primarily from one machine or process to another. Hence the closing emphasis on PMML – Predictive Model Markup Language.

Although Paco alludes to the organizational/social side of data flow, that gets lost in the thicket of technical options.

For example, at slide 25, Paco talks about using Cascading to combine the workflows from multiple departments into an integrated app.

That is certainly within the capabilities of Cascading, but it does not address the social or organizational difficulties of getting it to happen.

One of the main problems in the recent U.S. health care exchange debacle was the interchange of data between two of the vendors.

I suppose in recent management lingo, no one took “ownership” of that problem. 😉

Data interchange isn’t new technical territory but failure to cooperate is as deadly to a data processing project as a melting CPU.

The technical side of data workflows is necessary for success, but so is avoiding any beaver dams across the data stream.

Dealt with any beavers lately?

January 28, 2014

ICML 2014

Filed under: Machine Learning — Patrick Durusau @ 5:50 pm

Volume 32: Proceedings of The 31st International Conference on Machine Learning. Edited by Xing, Eric P. and Jebara, Tony.

I count some eighty-five (85) papers, many with supplementary materials.

Enjoy!

I first saw this in a tweet by Mark Reid.

January 22, 2014

Want to win $1,000,000,000 (yes, that’s one billion dollars)?

Want to win $1,000,000,000 (yes, that’s one billion dollars)? by Ann Drobnis.

The offer is one billion dollars for picking the winners of every game in the NCAA men’s basketball tournament in the Spring of 2014.

Unfortunately, none of the news stories I saw had links back to any authentic information from Quicken Loans and Berkshire Hathaway about the offer.

After some searching I found: Win a Billion Bucks with the Quicken Loans Billion Dollar Bracket Challenge by Clayton Closson, on January 21, 2014 on the Quicken Loans blog. (As far as I can tell it is an authentic post on the QL website.)

From that post:

You could be America’s next billionaire if you’re the grand prize winner of the Quicken Loans Billion Dollar Bracket Challenge. You read that right: one billion. Not one million. Not one hundred million. Not five hundred million. One billion U.S. dollars.

All you have to do is pick a perfect tournament bracket for the upcoming 2014 tournament. That’s it. Guess all the winners of all the games correctly, and Quicken Loans, along with Berkshire Hathaway, will make you a billionaire. The official press release is below. The contest starts March 3, 2014, so we’ll soon have all the info on how and when to enter your perfect bracket.

Good luck, my friends. This is your chance to play in perhaps the biggest sweepstakes in U.S. history. It’s your chance for a billion.

Oh, and by the way, the 20 closest imperfect brackets will win a cool hundred grand to put toward their home (or new home). Plus, in conjunction with the sweepstakes, Quicken Loans will donate $1 million to Detroit and Cleveland nonprofits to help with education of inner city youth.

So, to recap: If you’re perfect, you’ll win a billion. If you’re not perfect, you could win $100,000. The entry period begins Monday, March 3, 2014 and runs until Wednesday, March 19, 2014. Stay tuned on how to enter.

Contest updates at: Facebook.com/QuickenLoans.

The odds against winning are absurd but this has all the markings of a big data project. Historical data, current data on the teams and players, models, prior outcomes to test your models, etc.
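
A quick back-of-the-envelope on just how absurd, assuming the standard 63-game bracket for the 64-team field (play-in games excluded):

```python
# 63 games decide the bracket, each a binary pick.
brackets = 2 ** 63
print(f"one in {brackets:,}")  # one in 9,223,372,036,854,775,808 for a coin flipper

# Even a skilled picker who is right 70% of the time per game still faces:
odds = 0.7 ** 63
print(f"about 1 chance in {1 / odds:,.0f}")  # roughly 1 in 5.7 billion
```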

I wonder if Watson likes basketball?

January 17, 2014

Petuum

Filed under: Hadoop,Machine Learning,Petuum — Patrick Durusau @ 7:14 pm

Petuum

From the homepage:

Petuum is a distributed machine learning framework. It takes care of the difficult system “plumbing work”, allowing you to focus on the ML. Petuum runs efficiently at scale on research clusters and cloud compute like Amazon EC2 and Google GCE.

A Bit More Details

Petuum provides essential distributed programming tools that minimize programmer effort. It has a distributed parameter server (key-value storage), a distributed task scheduler, and out-of-core (disk) storage for extremely large problems. Unlike general-purpose distributed programming platforms, Petuum is designed specifically for ML algorithms. This means that Petuum takes advantage of data correlation, staleness, and other statistical properties to maximize the performance for ML algorithms.

Plug and Play

Petuum comes with a fast and scalable parallel LASSO regression solver, as well as an implementation of topic model (Latent Dirichlet Allocation) and L2-norm Matrix Factorization – with more to be added on a regular basis. Petuum is fully self-contained, making installation a breeze – if you know how to use a Linux package manager and type “make”, you’re ready to use Petuum. No mucking around trying to find that Hadoop cluster, or (worse still) trying to install Hadoop yourself. Whether you have a single machine or an entire cluster, Petuum just works.

What’s Petuum anyway?

Petuum comes from “perpetuum mobile,” which is a musical style characterized by a continuous steady stream of notes. Paganini’s Moto Perpetuo is an excellent example. It is our goal to build a system that runs efficiently and reliably — in perpetual motion.

Musically inclined programmers? 😉

The bar for using Hadoop and machine learning gets lower by the day. At least in terms of details that can be mastered by code.

Which is how it should be. The creative work of choosing data, appropriate algorithms, and the like is left to human operators.

I first saw this at Danny Bickson’s Petuum – a new distributed machine learning framework from CMU (Eric Xing).

PS: Remember to register for the 3rd GraphLab Conference!

January 16, 2014

LxMLS 2013

Filed under: Conferences,Machine Learning — Patrick Durusau @ 2:11 pm

LxMLS 2013: 3rd Lisbon Machine Learning School (videos)

If you missed the lectures you can view them at techtalk.tv!

Now available:

Enjoy!

January 14, 2014

Algorithmic Music Discovery at Spotify

Filed under: Algorithms,Machine Learning,Matrix,Music,Music Retrieval,Python — Patrick Durusau @ 3:19 pm

Algorithmic Music Discovery at Spotify by Chris Johnson.

From the description:

In this presentation I introduce various Machine Learning methods that we utilize for music recommendations and discovery at Spotify. Specifically, I focus on Implicit Matrix Factorization for Collaborative Filtering, how to implement a small scale version using python, numpy, and scipy, as well as how to scale up to 20 Million users and 24 Million songs using Hadoop and Spark.

Among a number of interesting points, Chris points out differences between movie and music data.

One difference is that songs are consumed over and over again. Another is that users rate movies but “vote” by their streaming behavior on songs.*

Which leads to Chris’ main point: implicit matrix factorization. The source code page points to: Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren, and Chris Volinsky.

Scaling that process is represented in blocks for Hadoop and Spark.

* I suspect that “behavior” is more reliable than “ratings” from the same user, reasoning that ratings are more likely to be subject to social influences. I don’t have any research at my fingertips on that issue. Do you?
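
For readers who want the gist of implicit matrix factorization without watching the whole talk, here is a small dense-matrix sketch of the Hu/Koren/Volinsky alternating least squares. The toy data is my own; the point of the talk is scaling this same objective to millions of users with Hadoop and Spark.

```python
import numpy as np

def implicit_als(plays, factors=8, alpha=40.0, reg=0.1, iters=10):
    """ALS for implicit feedback (Hu, Koren, Volinsky).

    plays: users x items matrix of raw play counts (the implicit "votes").
    Preference p_ui = 1 if the user ever streamed the song; confidence
    c_ui = 1 + alpha * plays grows with how often they did.
    """
    n_users, n_items = plays.shape
    P = (plays > 0).astype(float)
    C = 1.0 + alpha * plays
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.01, size=(n_users, factors))
    V = rng.normal(scale=0.01, size=(n_items, factors))
    I = reg * np.eye(factors)

    for _ in range(iters):
        for u in range(n_users):        # solve each user against fixed items
            Cu = np.diag(C[u])
            U[u] = np.linalg.solve(V.T @ Cu @ V + I, V.T @ Cu @ P[u])
        for i in range(n_items):        # then each item against fixed users
            Ci = np.diag(C[:, i])
            V[i] = np.linalg.solve(U.T @ Ci @ U + I, U.T @ Ci @ P[:, i])
    return U, V

# Toy usage: 4 listeners, 5 songs, play counts as implicit feedback.
plays = np.array([[9, 0, 3, 0, 0],
                  [0, 7, 0, 1, 0],
                  [8, 0, 5, 0, 0],
                  [0, 6, 0, 2, 1]], dtype=float)
U, V = implicit_als(plays)
print((U @ V.T).round(2))  # recovered preference scores
```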
