Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 5, 2014

Rule-based Information Extraction is Dead!…

Filed under: Information Retrieval,Machine Learning,Topic Maps — Patrick Durusau @ 3:14 pm

Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! by Laura Chiticariu, Yunyao Li, and Frederick R. Reiss.

Abstract:

The rise of “Big Data” analytics over unstructured text has led to renewed interest in information extraction (IE). We surveyed the landscape of IE technologies and identified a major disconnect between industry and academia: while rule-based IE dominates the commercial world, it is widely regarded as dead-end technology by the academia. We believe the disconnect stems from the way in which the two communities measure the benefits and costs of IE, as well as academia’s perception that rule-based IE is devoid of research challenges. We make a case for the importance of rule-based IE to industry practitioners. We then lay out a research agenda in advancing the state-of-the-art in rule-based IE systems which we believe has the potential to bridge the gap between academic research and industry practice.

After demonstrating the disconnect between industry (rule-based) and academia (ML) approaches to information extraction, the authors propose:

Define standard IE rule language and data model.

If research on rule-based IE is to move forward in a principled way, the community needs a standard way to express rules. We believe that the NLP community can replicate the success of the SQL language in connecting data management research and practice. SQL has been successful largely due to: (1) expressivity: the language provides all primitives required for performing basic manipulation of structured data, (2) extensibility: the language can be extended with new features without fundamental changes to the language, (3) declarativity: the language allows the specification of computation logic without describing its control flow, thus allowing developers to code what the program should accomplish, rather than how to accomplish it.

On the contrary, both industry and academia would be better served by domain specific declarative languages (DSDLs).

I say “domain specific” because each domain has its own terms and semantics that are embedded in those terms. If we don’t want to repeat the chaos of owl:sameAs, we had better enable users to define and document the semantics they attach to terms, either as operators or as data.
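To make that concrete, here is a rough Python sketch of what I mean by declarative rules with documented semantics. Everything in it (the field names, patterns, and sample text) is invented for illustration; the point is that a user declares what to extract and what each term means to their community, while a generic engine decides how.

```python
import re

# Hypothetical declarative rule set: each rule names a field, gives a pattern,
# and documents the semantics the user attaches to that field.
RULES = [
    {
        "field": "monetary_amount",
        "pattern": r"\$\d+(?:,\d{3})*(?:\.\d{2})?",
        "semantics": "Nominal USD amount as printed; not inflation-adjusted.",
    },
    {
        "field": "gene_symbol",
        "pattern": r"\b[A-Z][A-Z0-9]{1,5}\b",
        "semantics": "HGNC-style gene symbol as used in this corpus only.",
    },
]

def extract(text, rules=RULES):
    """Generic engine: applies whatever rules are declared; knows nothing about the domain."""
    results = []
    for rule in rules:
        for match in re.finditer(rule["pattern"], text):
            results.append({
                "field": rule["field"],
                "value": match.group(0),
                "semantics": rule["semantics"],
            })
    return results

if __name__ == "__main__":
    sample = "BRCA1 screening was billed at $1,250.00 per patient."
    for hit in extract(sample):
        print(hit["field"], "=", hit["value"], "|", hit["semantics"])
```

Because the semantics ride along with every extracted value, two communities can later compare or map their rule sets instead of guessing what the other meant.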

A host of research problems open up when semantic domains are enabled to document the semantics of their data structures and data. How do semantic understandings evolve over time within a community? Rather difficult to answer if its semantics are never documented. What are the best ways to map between the documented semantics of different communities? Again, difficult to answer without pools of documented semantics of different communities.

Not to mention the development of IE and mapping languages, which share a common core value of documenting semantics and extracting information but have specific features for particular domains. There is no reason to expect or hope that a language designed for genomic research will have all the features needed for monetary arbitrage analysis.

Rather than seeking an “Ur” language for documenting semantics/extracting data, industry can demonstrate ROI and academia progress, with targeted, declarative languages that are familiar to members of individual domains.

I first saw this in a tweet by Kyle Wade Grove.

SDM 2014 Workshop on Heterogeneous Learning

Filed under: Conferences,Heterogeneous Data,Heterogeneous Programming,Machine Learning — Patrick Durusau @ 11:00 am

SDM 2014 Workshop on Heterogeneous Learning

Key Dates:

01/10/2014: Paper Submission
01/31/2014: Author Notification
02/10/2014: Camera Ready Paper Due

From the post:

The main objective of this workshop is to bring the attention of researchers to real problems with multiple types of heterogeneities, ranging from online social media analysis, traffic prediction, to the manufacturing process, brain image analysis, etc. Some commonly found heterogeneities include task heterogeneity (as in multi-task learning), view heterogeneity (as in multi-view learning), instance heterogeneity (as in multi-instance learning), label heterogeneity (as in multi-label learning), oracle heterogeneity (as in crowdsourcing), etc. In the past years, researchers have proposed various techniques for modeling a single type of heterogeneity as well as multiple types of heterogeneities.

This workshop focuses on novel methodologies, applications and theories for effectively leveraging these heterogeneities. Here we are facing multiple challenges. To name a few: (1) how can we effectively exploit the label/example structure to improve the classification performance; (2) how can we handle the class imbalance problem when facing one or more types of heterogeneities; (3) how can we improve the effectiveness and efficiency of existing learning techniques for large-scale problems, especially when both the data dimensionality and the number of labels/examples are large; (4) how can we jointly model multiple types of heterogeneities to maximally improve the classification performance; (5) how do the underlying assumptions associated with multiple types of heterogeneities affect the learning methods.

We encourage submissions on a variety of topics, including but not limited to:

(1) Novel approaches for modeling a single type of heterogeneity, e.g., task/view/instance/label/oracle heterogeneities.

(2) Novel approaches for simultaneously modeling multiple types of heterogeneities, e.g., multi-task multi-view learning to leverage both the task and view heterogeneities.

(3) Novel applications with a single or multiple types of heterogeneities.

(4) Systematic analysis regarding the relationship between the assumptions underlying each type of heterogeneity and the performance of the predictor;

Apologies but I saw this announcement too late for you to have a realistic opportunity to submit a paper. 🙁

Very unfortunate because the focus of the workshop is right up the topic map alley.

The main conference, which focuses on data mining, is April 24-26, 2014 in Philadelphia, Pennsylvania, USA.

I am very much looking forward to reading the papers from this workshop! (And looking for notice of next year’s workshop much earlier!)

December 31, 2013

Augur:…

Filed under: Bayesian Models,GPU,Machine Learning,Probabilistic Programming,Scala — Patrick Durusau @ 2:40 pm

Augur: a Modeling Language for Data-Parallel Probabilistic Inference by Jean-Baptiste Tristan, et al.

Abstract:

It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. In this paper, we present a probabilistic programming language and compiler for Bayesian networks designed to make effective use of data-parallel architectures such as GPUs. Our language is fully integrated within the Scala programming language and benefits from tools such as IDE support, type-checking, and code completion. We show that the compiler can generate data-parallel inference code scalable to thousands of GPU cores by making use of the conditional independence relationships in the Bayesian network.

A very good paper but the authors should highlight the caveat in the introduction:

We claim that many MCMC inference algorithms are highly data-parallel (Hillis & Steele, 1986; Blelloch, 1996) if we take advantage of the conditional independence relationships of the input model (e.g. the assumption of i.i.d. data makes the likelihood independent across data points).

(Where i.i.d. = Independent and identically distributed random variables.)

That assumption does allow for parallel processing, but users should be cautious about accepting assumptions about data.
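The caveat is easy to see in code. Here is a minimal numpy/scipy sketch (a toy Gaussian model with made-up data, nothing from Augur itself): under i.i.d. the log-likelihood is a sum of per-point terms, so it can be evaluated over chunks of data independently and then combined, which is exactly the structure a data-parallel backend exploits.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1_000_000)  # made-up i.i.d. data

def loglik_chunk(chunk, mu, sigma):
    # Per-point log-likelihood terms; independent across points under i.i.d.
    return norm.logpdf(chunk, loc=mu, scale=sigma).sum()

# "Data-parallel" evaluation: split the data, evaluate chunks independently,
# then add the partial sums. On a GPU each chunk (or each point) is a thread.
chunks = np.array_split(data, 8)
parallel_total = sum(loglik_chunk(c, mu=2.0, sigma=1.0) for c in chunks)

serial_total = loglik_chunk(data, mu=2.0, sigma=1.0)
print(np.allclose(parallel_total, serial_total))  # True: the sum decomposes
```

If your data are not i.i.d., the sum still computes, it just may not be the likelihood of your actual model, which is the caution above.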

The algorithms will still work, even if your assumptions about the data are incorrect.

But the answer you get may not be as useful as you would like.

I first saw this in a tweet by Stefano Bertolo.

Provable Algorithms for Machine Learning Problems

Provable Algorithms for Machine Learning Problems by Rong Ge.

Abstract:

Modern machine learning algorithms can extract useful information from text, images and videos. All these applications involve solving NP-hard problems in average case using heuristics. What properties of the input allow it to be solved efficiently? Theoretically analyzing the heuristics is very challenging. Few results were known.

This thesis takes a different approach: we identify natural properties of the input, then design new algorithms that provably works assuming the input has these properties. We are able to give new, provable and sometimes practical algorithms for learning tasks related to text corpus, images and social networks.

The first part of the thesis presents new algorithms for learning thematic structure in documents. We show under a reasonable assumption, it is possible to provably learn many topic models, including the famous Latent Dirichlet Allocation. Our algorithm is the first provable algorithms for topic modeling. An implementation runs 50 times faster than latest MCMC implementation and produces comparable results.

The second part of the thesis provides ideas for provably learning deep, sparse representations. We start with sparse linear representations, and give the first algorithm for dictionary learning problem with provable guarantees. Then we apply similar ideas to deep learning: under reasonable assumptions our algorithms can learn a deep network built by denoising autoencoders.

The final part of the thesis develops a framework for learning latent variable models. We demonstrate how various latent variable models can be reduced to orthogonal tensor decomposition, and then be solved using tensor power method. We give a tight sample complexity analysis for tensor power method, which reduces the number of samples required for learning many latent variable models.

In theory, the assumptions in this thesis help us understand why intractable problems in machine learning can often be solved; in practice, the results suggest inherently new approaches for machine learning. We hope the assumptions and algorithms inspire new research problems and learning algorithms.

Admittedly an odd notion, starting with the data rather than starting with an answer and working back towards the data, but it does happen. 😉

Given the performance improvements for LDA (50X), I anticipate this approach being applied to algorithms for “big data.”

I first saw this in a tweet by Chris Deihl.

December 28, 2013

Mining the Web to Predict Future Events

Filed under: Machine Learning,News,Prediction,Predictive Analytics — Patrick Durusau @ 11:30 am

Mining the Web to Predict Future Events by Kira Radinsky and Eric Horvitz.

Abstract:

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

The paper starts off well enough:

Mark Twain famously said that “the past does not repeat itself, but it rhymes.” In the spirit of this reflection, we develop and test methods for leveraging large-scale digital histories captured from 22 years of news reports from the New York Times (NYT) archive to make real-time predictions about the likelihoods of future human and natural events of interest. We describe how we can learn to predict the future by generalizing sets of specific transitions in sequences of reported news events, extracted from a news archive spanning the years 1986–2008. In addition to the news corpora, we leverage data from freely available Web resources, including Wikipedia, FreeBase, OpenCyc, and GeoNames, via the LinkedData platform [6]. The goal is to build predictive models that generalize from specific sets of sequences of events to provide likelihoods of future outcomes, based on patterns of evidence observed in near-term newsfeeds. We propose the methods as a means of generating actionable forecasts in advance of the occurrence of target events in the world.

But when it gets down to actual predictions, the experiment predicts:

  • Cholera following flooding in Bangladesh.
  • Riots following police shootings in immigrant/poor neighborhoods.

Both are generally true, but I don’t need 22 years’ worth of New York Times (NYT) archives to make those predictions.

Test offers of predictive advice by asking for specific predictions relevant to your enterprise. Also ask long-time staff to make their predictions. Compare the predictions.

Unless the automated solution is significantly better, reward the staff and drive on.

I first saw this in Nat Torkington’s Four short links: 26 December 2013.

December 10, 2013

Statistics, Data Mining, and Machine Learning in Astronomy:…

Filed under: Astroinformatics,Data Mining,Machine Learning,Statistics — Patrick Durusau @ 3:26 pm

Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data by Željko Ivezic, Andrew J. Connolly, Jacob T VanderPlas, Alexander Gray.

From the Amazon page:

As telescopes, detectors, and computers grow ever more powerful, the volume of data at the disposal of astronomers and astrophysicists will enter the petabyte domain, providing accurate measurements for billions of celestial objects. This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope. It serves as a practical handbook for graduate students and advanced undergraduates in physics and astronomy, and as an indispensable reference for researchers.

Statistics, Data Mining, and Machine Learning in Astronomy presents a wealth of practical analysis problems, evaluates techniques for solving them, and explains how to use various approaches for different types and sizes of data sets. For all applications described in the book, Python code and example data sets are provided. The supporting data sets have been carefully selected from contemporary astronomical surveys (for example, the Sloan Digital Sky Survey) and are easy to download and use. The accompanying Python code is publicly available, well documented, and follows uniform coding standards. Together, the data sets and code enable readers to reproduce all the figures and examples, evaluate the methods, and adapt them to their own fields of interest.

  • Describes the most useful statistical and data-mining methods for extracting knowledge from huge and complex astronomical data sets
  • Features real-world data sets from contemporary astronomical surveys
  • Uses a freely available Python codebase throughout
  • Ideal for students and working astronomers

Still in pre-release but if you want to order the Kindle version (or hardback) to be sent to me, I’ll be sure to put it on my list of items to blog about in 2014!

Or your favorite book on graphs, data analysis, etc, for that matter. 😉

December 8, 2013

Advances in Neural Information Processing Systems 26

Advances in Neural Information Processing Systems 26

The NIPS 2013 conference ended today.

All of the NIPS 2013 papers were posted today.

I count three hundred and sixty (360) papers.

From the NIPS Foundation homepage:

The Foundation: The Neural Information Processing Systems (NIPS) Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences.

The primary focus of the NIPS Foundation is the presentation of a continuing series of professional meetings known as the Neural Information Processing Systems Conference, held over the years at various locations in the United States, Canada and Spain.

Enjoy the proceedings collection!

I first saw this in a tweet by Benoit Maison.

December 7, 2013

The Society of Mind

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 2:40 pm

The Society of Mind by Marvin Minsky.

From the Prologue:

This book tries to explain how minds work. How can intelligence emerge from nonintelligence? To answer that, we’ll show that you can build a mind from many little parts, each mindless by itself.

I’ll call Society of Mind this scheme in which each mind is made of many smaller processes. These we’ll call agents. Each mental agent by itself can only do some simple thing that needs no mind or thought at all. Yet when we join these agents in societies — in certain very special ways — this leads to true intelligence.

There’s nothing very technical in this book. It, too, is a society — of many small ideas. Each by itself is only common sense, yet when we join enough of them we can explain the strangest mysteries of mind. One trouble is that these ideas have lots of cross-connections. My explanations rarely go in neat, straight lines from start to end. I wish I could have lined them up so that you could climb straight to the top, by mental stair-steps, one by one. Instead they’re tied in tangled webs.

Perhaps the fault is actually mine, for failing to find a tidy base of neatly ordered principles. But I’m inclined to lay the blame upon the nature of the mind: much of its power seems to stem from just the messy ways its agents cross-connect. If so, that complication can’t be helped; it’s only what we must expect from evolution’s countless tricks.

What can we do when things are hard to describe? We start by sketching out the roughest shapes to serve as scaffolds for the rest; it doesn’t matter very much if some of those forms turn out partially wrong. Next, draw details to give these skeletons more lifelike flesh. Last, in the final filling-in, discard whichever first ideas no longer fit.

That’s what we do in real life, with puzzles that seem very hard. It’s much the same for shattered pots as for the cogs of great machines. Until you’ve seen some of the rest, you can’t make sense of any part.

All 270 essays in 30 chapters of Minsky’s 1988 book by the same name.

To be read critically.

It is dated but a good representative of a time in artificial intelligence.

I first saw this in Nat Torkington’s Five Short Links for 6 December 2013.

November 27, 2013

Machine Learning Video Library

Filed under: Machine Learning — Patrick Durusau @ 2:27 pm

Machine Learning Video Library by Yaser Abu-Mostafa.

Snippets of lectures by Professor Abu-Mostafa listed by subject area and topics.

The main subject areas are:

  • Aggregation
  • Bayesian Learning
  • Bias-Variance Tradeoff
  • Bin Model
  • Data Snooping
  • Error Measures
  • Gradient Descent
  • Learning Curves
  • Learning Diagram
  • Learning Paradigms
  • Linear Classification
  • Linear Regression
  • Logistic Regression
  • Netflix Competition
  • Neural Networks
  • Nonlinear Transformation
  • Occam’s Razor
  • Overfitting
  • Radial Basis Functions
  • Regularization
  • Sampling Bias
  • Support Vector Machines
  • Validation
  • VC Dimension

The clips should come with a warning that viewing any segment may result in your watching the entire lecture, if not the entire class!

Just at random I watched Occam’s Razor, Definition and analysis.

A very lucid and entertaining lecture, complete with a theoretical basis for a postal scam. 😉

In the segments, Professor Abu-Mostafa refers to other lectures and topics previously covered. That will have you thinking about watching the lectures in order.

They may not be good substitutes for holiday favorites, but they are a pleasure to watch.

…Features from YouTube Videos…

Filed under: Data,Machine Learning,Multiview Learning — Patrick Durusau @ 1:30 pm

Released Data Set: Features Extracted from YouTube Videos for Multiview Learning by Omid Madani.

From the post:

“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”

The “duck test”.

Performance of machine learning algorithms, supervised or unsupervised, is often significantly enhanced when a variety of feature families, or multiple views of the data, are available. For example, in the case of web pages, one feature family can be based on the words appearing on the page, and another can be based on the URLs and related connectivity properties. Similarly, videos contain both audio and visual signals where in turn each modality is analyzed in a variety of ways. For instance, the visual stream can be analyzed based on the color and edge distribution, texture, motion, object types, and so on. YouTube videos are also associated with textual information (title, tags, comments, etc.). Each feature family complements others in providing predictive signals to accomplish a prediction or classification task, for example, in automatically classifying videos into subject areas such as sports, music, comedy, games, and so on.

We have released a dataset of over 100k feature vectors extracted from public YouTube videos. These videos are labeled by one of 30 classes, each class corresponding to a video game (with some amount of class noise): each video shows a gameplay of a video game, for teaching purposes for example. Each instance (video) is described by three feature families (textual, visual, and auditory), and each family is broken into subfamilies yielding up to 13 feature types per instance. Neither video identities nor class identities are released.

The concept of multiview learning is clear enough but the term was unfamiliar.

In that regard, you may want to read: A Survey on Multi-view Learning by Chang Xu, Dacheng Tao, Chao Xu.

Abstract:

In recent years, a great many methods of learning from multi-view data by considering the diversity of different views have been proposed. These views may be obtained from multiple sources or different feature subsets. In trying to organize and highlight similarities and differences between the variety of multi-view learning approaches, we review a number of representative multi-view learning algorithms in different areas and classify them into three groups: 1) co-training, 2) multiple kernel learning, and 3) subspace learning. Notably, co-training style algorithms train alternately to maximize the mutual agreement on two distinct views of the data; multiple kernel learning algorithms exploit kernels that naturally correspond to different views and combine kernels either linearly or non-linearly to improve learning performance; and subspace learning algorithms aim to obtain a latent subspace shared by multiple views by assuming that the input views are generated from this latent subspace. Though there is significant variance in the approaches to integrating multiple views to improve learning performance, they mainly exploit either the consensus principle or the complementary principle to ensure the success of multi-view learning. Since accessing multiple views is the fundament of multi-view learning, with the exception of study on learning a model from multiple views, it is also valuable to study how to construct multiple views and how to evaluate these views. Overall, by exploring the consistency and complementary properties of different views, multi-view learning is rendered more effective, more promising, and has better generalization ability than single-view learning.

Be forewarned that the survey runs 59 pages and has 9 1/2 pages of references. Not something you take home for a quick read. 😉
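For a feel of the simplest end of multi-view learning, here is a bare-bones Python sketch of “late fusion”: train one classifier per feature family and average the predicted probabilities. The views and data are invented; co-training, multiple kernel learning, and subspace learning from the survey are considerably more sophisticated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
y = rng.integers(0, 2, size=n)

# Invented feature families ("views") for the same instances,
# e.g. textual, visual, and auditory features of a video.
views = {
    "text":  rng.normal(size=(n, 20)) + y[:, None] * 0.5,
    "video": rng.normal(size=(n, 30)) + y[:, None] * 0.3,
    "audio": rng.normal(size=(n, 10)) + y[:, None] * 0.2,
}

train, test = slice(0, 400), slice(400, n)

# One model per view; the final score is the average of per-view probabilities.
probs = []
for name, X in views.items():
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    probs.append(clf.predict_proba(X[test])[:, 1])

combined = np.mean(probs, axis=0)
accuracy = ((combined > 0.5).astype(int) == y[test]).mean()
print(f"late-fusion accuracy: {accuracy:.2f}")
```

Each view alone carries a weak signal; averaging them is the “complementary principle” in its crudest form.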

November 23, 2013

SAMOA

Introducing SAMOA, an open source platform for mining big data streams by Gianmarco De Francisci Morales and Albert Bifet.

From the post:

https://github.com/yahoo/samoa

Machine learning and data mining are well established techniques in the world of IT and especially among web companies and startups. Spam detection, personalization and recommendations are just a few of the applications made possible by mining the huge quantity of data available nowadays. However, “big data” is not only about Volume, but also about Velocity (and Variety, 3V of big data).

The usual pipeline for modeling data (what “data scientists” do) involves taking a sample from production data, cleaning and preprocessing it to make it usable, training a model for the task at hand and finally deploying it to production. The final output of this process is a pipeline that needs to run periodically (and be maintained) in order to keep the model up to date. Hadoop and its ecosystem (e.g., Mahout) have proven to be an extremely successful platform to support this process at web scale.

However, no solution is perfect and big data is “data whose characteristics forces us to look beyond the traditional methods that are prevalent at the time”. The current challenge is to move towards analyzing data as soon as it arrives into the system, nearly in real-time.

For example, models for mail spam detection get outdated with time and need to be retrained with new data. New data (i.e., spam reports) comes in continuously and the model starts being outdated the moment it is deployed: all the new data is sitting without creating any value until the next model update. On the contrary, incorporating new data as soon as it arrives is what the “Velocity” in big data is about. In this case, Hadoop is not the ideal tool to cope with streams of fast changing data.

Distributed stream processing engines are emerging as the platform of choice to handle this use case. Examples of these platforms are Storm, S4, and recently Samza. These platforms join the scalability of distributed processing with the fast response of stream processing. Yahoo has already adopted Storm as a key technology for low-latency big data processing.

Alas, currently there is no common solution for mining big data streams, that is, for doing machine learning on streams on a distributed environment.

Enter SAMOA

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

After you get SAMOA installed, you may want to read: Distributed Decision Tree Learning for Mining Big Data Streams by Arinto Murdopo (thesis).

The nature of streaming data prevents SAMOA from offering the range of machine learning algorithms common in machine learning packages.

But if the SAMOA algorithms fit your use cases, what other test would you apply?
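SAMOA itself is Java and targets distributed stream engines, but the core idea in the quote, updating a model as new data arrives rather than retraining from scratch, can be sketched in a few lines of Python. The spam-like stream below is simulated and the learner is scikit-learn’s out-of-core SGD classifier, not anything from SAMOA.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier()
classes = np.array([0, 1])  # e.g. ham / spam

def next_batch(drift):
    """Simulated stream: the class boundary drifts over time."""
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + drift * X[:, 1] > 0).astype(int)
    return X, y

# The model is updated on each mini-batch as it arrives; no full retrain.
for t in range(50):
    X, y = next_batch(drift=t / 50.0)
    if t > 0:
        print(f"batch {t:02d} accuracy on new data: {model.score(X, y):.2f}")
    model.partial_fit(X, y, classes=classes)
```

Scoring each batch before learning from it is the standard “test-then-train” evaluation used for streams: the model is always judged on data it has not yet seen.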

November 13, 2013

Tackling some really tough problems…

Filed under: Machine Learning,Topological Data Analysis,Topology — Patrick Durusau @ 2:56 pm

Tackling some really tough problems with machine learning by Derrick Harris.

From the post:

Machine learning startup Ayasdi is partnering with two prominent institutions — Lawrence Livermore National Laboratory and the Texas Medical Center — to help advance some of their complicated data challenges. At LLNL, the company will collaborate on research in energy, climate change, medical technology, and national security, while its work with the Texas Medical Center will focus on translational medicine, electronic medical records and finding new uses for existing drugs.

Ayasdi formally launched in January after years researching its core technology, called topological data analysis. Essentially, the company’s software, called Iris, uses hundreds of machine learning algorithms to analyze up to tens of billions of data points and identify the relationships among them. The topological part comes from the way the results of this analysis are visually mapped into a network that places similar or tightly connected points near one another so users can easily spot collections of variables that appear to affect each other.

Tough problems:

At LLNL, the company will collaborate on research in energy, climate change, medical technology, and national security, while its work with the Texas Medical Center will focus on translational medicine, electronic medical records and finding new uses for existing drugs.

I would say so but that wasn’t the “tough” problem I was expecting.

The “tough” problem I had in mind was taking data with no particular topology and mapping it to a topology.

I ask because “similar or tightly connected points” depend upon a notion of “similarity” that is not inherent in most data points.

For example, how “similar” are you from a leaker by working in the same office? How does that “similarity” compare to the “similarity” of other relationships?


Original text (which I have corrected above):

I ask because “similar or tightly connected points” depend upon a notion of “distance” that is not inherent in most data points.

For example, how “near” or “far” are you from a leaker by working in the same office? How does that “near” or “far” compare to the nearness or farness of other relationships?

I corrected the original post to remove the implication of a metric distance.

November 12, 2013

Oryx [Alphaware]

Filed under: Cloudera,Machine Learning — Patrick Durusau @ 4:29 pm

Oryx [Alphaware] (Cloudera)

From the webpage:

The Oryx open source project provides simple, real-time large-scale machine learning infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering. It can continuously build models from a stream of data at large scale using Apache Hadoop‘s MapReduce. It also serves queries of those models in real-time via an HTTP REST API, and can update models approximately in response to new data. Models are exchanged in PMML format.

It is not a library, visualization tool, exploratory analytics tool, or environment. Oryx represents a unified continuation of the Myrrix and cloudera/ml projects.

Oryx should be considered alpha software; it may have bugs and will change in incompatible ways.

I’m sure management has forgotten about that incident where you tanked the production servers. Not to mention those beady-eyed government agents that slowly track you in a car when you grab lunch. 😉

Just teasing. Keep Oryx off the production servers and explore!

Sorry, no advice for the beady-eyed government agents.

Advantages of Different Classification Algorithms

Filed under: Classification,Machine Learning — Patrick Durusau @ 4:18 pm

What are the advantages of different classification algorithms? (Question on Quora.)

Useful answers follow.

Not a bad starting place for a set of algorithms you are likely to encounter on a regular basis. Either to become familiar with them and/or to work out stock criticisms of their use.

Enjoy!

I first saw this link at myNoSQL by Alex Popescu.

November 11, 2013

Day 14: Stanford NER…

Day 14: Stanford NER–How To Setup Your Own Name, Entity, and Recognition Server in the Cloud by Shekhar Gulati.

From the post:

I am not a huge fan of machine learning or natural text processing (NLP) but I always have ideas in mind which require them. The idea that I will explore during this post is the ability to build a real time job search engine using twitter data. Tweets will contain the name of the company which is offering a job, the location of the job, and name of the contact person at the company. This requires us to parse the tweet for Person, Location, and Organisation. This type of problem falls under Named Entity Recognition.

A continuation of Shekhar’s Learning 30 Technologies in 30 Days… but one that merits a special shout out.

In part because you can consume the entities that others “recognize” or you can be in control of the recognition process.

It isn’t easy but on the other hand, it isn’t free from hidden choices and selection biases.

I would prefer those were my hidden choices and selection biases, if you don’t mind. 😉

November 8, 2013

JML [Java Machine Learning]

Filed under: Java,Machine Learning — Patrick Durusau @ 5:48 pm

JML [Java Machine Learning] by Mingjie Qian.

From the webpage:

JML is a pure Java library for machine learning. The goal of JML is to make machine learning methods easy to use and speed up the code translation from MATLAB to Java. Tutorial-JML.pdf

Current version implements logistic regression, Maximum Entropy modeling (MaxEnt), AdaBoost, LASSO, KMeans, spectral clustering, Nonnegative Matrix Factorization (NMF), sparse NMF, Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA) (by Gibbs sampling based on LdaGibbsSampler.java by Gregor Heinrich), joint l_{2,1}-norms minimization, Hidden Markov Model (HMM), Conditional Random Field (CRF), etc. just for examples of implementing machine learning methods by using this general framework. The SVM package LIBLINEAR is also incorporated. I will try to add more important models such as Markov Random Field (MRF) to this package if I get the time:)

JML library’s another advantage is its complete independence from feature engineering, thus any preprocessed data could be run. For example, in the area of natural language processing, feature engineering is a crucial part for MaxEnt, HMM, and CRF to work well and is often embedded in model training. However, we believe that it is better to separate feature engineering and parameter estimation. On one hand, modularization could be achieved so that people can simply focus on one module without need to consider other modules; on the other hand, implemented modules could be reused without incompatibility concerns.

JML also provides implementations of several efficient, scalable, and widely used general purpose optimization algorithms, which are very important for machine learning methods be applicable on large scaled data, though particular optimization strategy that considers the characteristics of a particular problem is more effective and efficient (e.g., dual coordinate descent for bound constrained quadratic programming in SVM). Currently supported optimization algorithms are limited-memory BFGS, projected limited-memory BFGS (non-negative constrained or bound constrained), nonlinear conjugate gradient, primal-dual interior-point method, general quadratic programming, accelerated proximal gradient, and accelerated gradient descent. I would always like to implement more practical efficient optimization algorithms. (emphasis in original)

Something else “practical” for your weekend. 😉
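The separation of feature engineering from parameter estimation that JML argues for is easy to illustrate outside Java as well. A small Python sketch with invented toy data: the feature module and the estimator are separate components, and either can be swapped without touching the other.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # invented spam/ham labels

# Feature engineering and parameter estimation live in separate modules;
# either step can be replaced without changing the other.
pipeline = Pipeline([
    ("features", TfidfVectorizer()),   # swap for a different feature module
    ("model", LogisticRegression()),   # swap for a different estimator
])
pipeline.fit(docs, labels)
print(pipeline.predict(["free money pills"]))
```

The same modularity is what lets preprocessed data from any source be fed to any of JML’s learners.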

ParLearning 2014

ParLearning 2014 The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics.

Dates:

Workshop Paper Due: December 30, 2013
Author Notification: February 14, 2014
Camera-ready Paper Due: March 14, 2014
Workshop: May 23, 2014 Phoenix, AZ, USA

From the webpage:

Data-driven computing needs no introduction today. The case for using data for strategic advantages is exemplified by web search engines, online translation tools and many more examples. The past decade has seen 1) the emergence of multicore architectures and accelerators as GPGPUs, 2) widespread adoption of distributed computing via the map-reduce/hadoop eco-system and 3) democratization of the infrastructure for processing massive datasets ranging into petabytes by cloud computing. The complexity of the technological stack has grown to an extent where it is imperative to provide frameworks to abstract away the system architecture and orchestration of components for massive-scale processing. However, the growth in volume and heterogeneity in data seems to outpace the growth in computing power. A “collect everything” culture stimulated by cheap storage and ubiquitous sensing capabilities contribute to increasing the noise-to-signal ratio in all collected data. Thus, as soon as the data hits the processing infrastructure, determining the value of information, finding its rightful place in a knowledge representation and determining subsequent actions are of paramount importance. To use this data deluge to our advantage, a convergence between the field of Parallel and Distributed Computing and the interdisciplinary science of Artificial Intelligence seems critical. From application domains of national importance as cyber-security, health-care or smart-grid to providing real-time situational awareness via natural interface based smartphones, the fundamental AI tasks of Learning and Inference need to be enabled for large-scale computing across this broad spectrum of application domains.

Many of the prominent algorithms for learning and inference are notorious for their complexity. Adopting parallel and distributed computing appears as an obvious path forward, but the mileage varies depending on how amenable the algorithms are to parallel processing and secondly, the availability of rapid prototyping capabilities with low cost of entry. The first issue represents a wider gap as we continue to think in a sequential paradigm. The second issue is increasingly recognized at the level of programming models, and building robust libraries for various machine-learning and inferencing tasks will be a natural progression. As an example, scalable versions of many prominent graph algorithms written for distributed shared memory architectures or clusters look distinctly different from the textbook versions that generations of programmers have grown with. This reformulation is difficult to accomplish for an interdisciplinary field like Artificial Intelligence for the sheer breadth of the knowledge spectrum involved. The primary motivation of the proposed workshop is to invite leading minds from AI and Parallel & Distributed Computing communities for identifying research areas that require most convergence and assess their impact on the broader technical landscape.

Taking full advantage of parallel processing remains a distant goal. This workshop looks like a good concrete step towards that goal.

November 4, 2013

Crowdsourcing Multi-Label Classification for Taxonomy Creation

Filed under: Crowd Sourcing,Decision Making,Machine Learning,Taxonomy — Patrick Durusau @ 5:19 pm

Crowdsourcing Multi-Label Classification for Taxonomy Creation by Jonathan Bragg, Mausam and Daniel S. Weld.

Abstract:

Recent work has introduced CASCADE, an algorithm for creating a globally-consistent taxonomy by crowdsourcing microwork from many individuals, each of whom may see only a tiny fraction of the data (Chilton et al. 2013). While CASCADE needs only unskilled labor and produces taxonomies whose quality approaches that of human experts, it uses significantly more labor than experts. This paper presents DELUGE, an improved workflow that produces taxonomies with comparable quality using significantly less crowd labor. Specifically, our method for crowdsourcing multi-label classification optimizes CASCADE’s most costly step (categorization) using less than 10% of the labor required by the original approach. DELUGE’s savings come from the use of decision theory and machine learning, which allow it to pose microtasks that aim to maximize information gain.

An extension of work reported at Cascade: Crowdsourcing Taxonomy Creation.

While the reduction in required work is interesting, the ability to sustain more complex workflows looks like the more important result.

That will require developing workflows that can be optimized, at least for subject identification.

Or should I say validation of subject identification?

What workflow do you use for subject identification and/or validation of subject identification?

November 3, 2013

Shogun… 3.0.0

Filed under: Machine Learning — Patrick Durusau @ 5:04 pm

Shogun – A Large Scale Machine Learning Toolbox (3.0.0 release)

Highlights of the Shogun 3.0.0 release:

This release features 8 successful Google Summer of Code projects and it is the result of an incredible effort by our students. All projects come with very cool ipython-notebooks that contain background, code examples and visualizations. These can be found on our webpage!

    The projects are:

  • Gaussian Processes for binary classification [Roman Votjakov]
  • Sampling log-determinants for large sparse matrices [Soumyajit De]
  • Metric Learning via LMNN [Fernando Iglesias]
  • Independent Component Analysis (ICA) [Kevin Hughes]
  • Hashing Feature Framework [Evangelos Anagnostopoulos]
  • Structured Output Learning [Hu Shell]
  • A web-demo framework [Liu Zhengyang]

Other important changes are the change of our build-system to cmake and the addition of clone/equals methods to our base-class. In addition, you get the usual ton of bugfixes, new unit-tests, and new mini-features.

Features:

    In addition, the following features have been added:
    • Added method to importance sample the (true) marginal likelihood of a Gaussian Process using a posterior approximation.
    • Added a new class for classical probability distribution that can be sampled and whose log-pdf can be evaluated. Added the multivariate Gaussian with various numerical flavours.
    • Cross-validation framework works now with Gaussian Processes
    • Added nu-SVR for LibSVR class
    • Modelselection is now supported for parameters of sub-kernels of combined kernels in the MKL context. Thanks to Evangelos Anagnostopoulos
    • Probability output for multi-class SVMs is now supported using various heuristics. Thanks to Shell Xu Hu.
    • Added an “equals” method to all Shogun objects that recursively compares all registered parameters with those of another instance — up to a specified accuracy.
    • Added a “clone” method to all Shogun objects that creates a deep copy
    • Multiclass LDA. Thanks to Kevin Hughes.
    • Added a new datatype, complex128_t, for complex numbers. Math functions, support for SGVector/Matrix, SGSparseVector/Matrix, and serialization with Ascii and Xml files added. [Soumyajit De].
    • Added mini-framework for numerical integration in one variable. Implemented Gauss-Kronrod and Gauss-Hermite quadrature formulas.
    • Changed from configure script to CMake by Viktor Gal.
    • Add C++0x and C++11 cmake detection scripts
    • ND-Array typmap support for python and octave modular.

Toolbox machine learning lacks the bells and whistles of custom code but it is a great way to experiment with data and machine learning techniques.

Experimenting with data and techniques will help immunize you from the common frauds and deceptions using machine learning techniques.

Darrell Huff wrote How to Lie with Statistics in the 1950s.

Is there anything equivalent to that for machine learning? Given the technical nature of many of the techniques, a guide to what questions to ask, etc., could be a real boon. To one side of machine-learning-based discussions at least.

October 31, 2013

Data Preparation for Machine Learning using MySQL

Filed under: Machine Learning,MySQL — Patrick Durusau @ 8:25 pm

Data Preparation for Machine Learning using MySQL

From the post:

Most Machine Learning algorithms require data to be in a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form separated in a table for users, another for movies, and another for ratings. You can get it in machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):

Just in case you aren’t up to the Stinger level of SQL but still need to prepare data for machine learning.

Excellent tutorial on using MySQL for machine learning data preparation.
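If the users/movies/ratings example in the quote is hard to picture, here is the shape of the join. The schema and values are invented, and the code uses sqlite3 only so the sketch is self-contained; the SQL itself is what you would issue to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (userid INTEGER, name TEXT, age INTEGER);
CREATE TABLE movies  (movieid INTEGER, title TEXT, year INTEGER);
CREATE TABLE ratings (userid INTEGER, movieid INTEGER, rating REAL);
INSERT INTO users   VALUES (1, 'ann', 34), (2, 'bob', 27);
INSERT INTO movies  VALUES (10, 'Alien', 1979), (11, 'Up', 2009);
INSERT INTO ratings VALUES (1, 10, 4.5), (2, 11, 3.0), (1, 11, 5.0);
""")

# Join on the ids, keep features only (drop ids and names): one row per
# instance, one column per feature -- the tabular layout most ML tools expect.
rows = conn.execute("""
    SELECT u.age, m.year, r.rating
    FROM ratings r
    JOIN users  u ON u.userid  = r.userid
    JOIN movies m ON m.movieid = r.movieid
""").fetchall()

for row in rows:
    print(row)   # e.g. (34, 1979, 4.5)
```

Export the result with SELECT ... INTO OUTFILE (or a plain query plus a CSV writer) and you have the single tabular file the algorithms want.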

Machine learning for cancer classification – part 1

Filed under: Bioinformatics,Machine Learning — Patrick Durusau @ 8:13 pm

Machine learning for cancer classification – part 1 – preparing the data sets by Obi Griffith.

From the post:

I am planning a series of tutorials illustrating basic concepts and techniques for machine learning. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper I wrote. The gcrma step may require you to have as much as ~8gb of ram. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on linux or windows but package installation might differ slightly. The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let’s download and pre-process the training data.

Assuming you are ready to move beyond Iris data sets for practicing machine learning, this would be a good place to start.

October 30, 2013

MADlib

Filed under: Analytics,Machine Learning,MADlib,Mathematics,Statistics — Patrick Durusau @ 6:58 pm

MADlib

From the webpage:

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.

Until the Impala post called my attention to it, I didn’t realize that MADlib had been upgraded to 1.3 earlier in October!

Congratulations to MADlib!

Use MADlib Pre-built Analytic Functions….

Filed under: Analytics,Cloudera,Impala,Machine Learning,MADlib — Patrick Durusau @ 6:53 pm

How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.

From the post:

Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

Victor walks through several examples of data analytics but for those of you who want to cut to the chase:

This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.

Usual cautions apply: The results of your script or model may or may not have any resemblance to “facts” as experienced by others.

October 26, 2013

0xdata Releases Second Generation H2O…

Filed under: H20,Hadoop,Machine Learning — Patrick Durusau @ 8:01 pm

0xdata Releases Second Generation H2O, Big Data’s Fastest Open Source Machine Learning and Predictive Analytics Engine

From the post:

0xdata (www.0xdata.com), the open source machine learning and predictive analytics company for big data, today announced general availability of the latest release of H2O, the industry’s fastest prediction engine for big data users of Hadoop, R and Excel. H2O delivers parallel and distributed advanced algorithms on big data at speeds up to 100X faster than other predictive analytics providers.

The second generation H2O “Fluid Vector” release — currently in use at two of the largest insurance companies in the world, the largest provider of streaming video entertainment and the largest online real estate services company — delivers new levels of performance, ease of use and integration with R. Early H2O customers include Netflix, Trulia and Vendavo.

“We developed H2O to unlock the predictive power of big data through better algorithms,” said SriSatish Ambati, CEO and co-founder of 0xdata. “H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world.”

“Big data by itself is useless. It is only when you have big data plus big analytics that one has the capability to achieve big business impact. H2O is the platform for big analytics that we have found gives us the biggest advantage compared with other alternatives,” said Chris Pouliot, Director of Algorithms and Analytics at Netflix and advisor to 0xdata. “Our data scientists can build sophisticated models, minimizing their worries about data shape and size on commodity machines. Over the past year, we partnered with the talented 0xdata team to work with them on building a great product that will meet and exceed our algorithm needs in the cloud.”

From the H2O Github page:

H2O makes hadoop do math!
H2O scales statistics, machine learning and math over BigData. H2O is extensible and users can build blocks using simple math legos in the core.
H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts & experts can explore, munge, model and score datasets using a range of simple to advanced algorithms.
Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling.

Product Vision for first cut:

  • H2O, the Analytics Engine will scale Classification and Regression.
  • RandomForest, Generalized Linear Modeling (GLM), logistic regression, k-Means, available over R / REST/ JSON-API
  • Basic Linear Algebra as building blocks for custom algorithms
  • High predictive power of the models
  • High speed and scale for modeling and validation over BigData
  • Data Sources:
    • We read and write from/to HDFS, S3
    • We ingest data in CSV format from local and distributed filesystems (nfs)
    • A JDBC driver for SQL and DataAdapters for NoSQL datasources is in the roadmap. (v2)
  • Adhoc Data Analytics at scale via R-like Parser on BigData

Machine learning is not as ubiquitous as Excel, yet.

But like Excel, the quality of results depends on the skills of the user, not the technology.

Machine Learning And Analytics…

Filed under: Analytics,Machine Learning,Python — Patrick Durusau @ 4:10 pm

Machine Learning And Analytics Using Python For Beginners by Naveen Venkataraman.

From the post:

Analytics has been a major personal theme in 2013. I’ve recently taken an interest in machine learning after spending some time in analytics consulting. In this post, I’ll share a few tips for folks looking to get started with machine learning and data analytics.

Audience

The audience for this article is people who are looking to understand the basics of machine learning and those who are interested in developing analytics projects using python. A coding background is not required in order to read this article

Most resource postings list too many resources to consult.

Naveen lists a handful of resources and why you should use them.

October 22, 2013

Active learning, almost black magic

Filed under: Active Learning,Duke,Genetic Algorithms,Machine Learning — Patrick Durusau @ 6:53 pm

Active learning, almost black magic by Lars Marius Garshol.

From the post:

I’ve written Duke, an engine for figuring out which records represent the same thing. It works fine, but people find it difficult to configure correctly, which is not so strange. Getting the configurations right requires estimating probabilities and choosing between comparators like Levenshtein, Jaro-Winkler, and Dice coefficient. Can we get the computer to do something people cannot? It sounds like black magic, but it’s actually pretty simple.

I implemented a genetic algorithm that can set up a good configuration automatically. The genetic algorithm works by making lots of configurations, then removing the worst and making more of the best. The configurations that are kept are tweaked randomly, and the process is repeated over and over again. It’s dead simple, but it works fine. The problem is: how is the algorithm to know which configurations are the best? The obvious solution is to have test data that tells you which records should be linked, and which ones should not be linked.

But that leaves us with a bootstrapping problem. If you can create a set of test data big enough for this to work, and find all the correct links in that set, then you’re fine. But how do you find all the links? You can use Duke, but if you can set up Duke well enough to do that you don’t need the genetic algorithm. Can you do it in other ways? Maybe, but that’s hard work, quite possibly harder than just battling through the difficulties and creating a configuration.

So, what to do? For a year or so I was stuck here. I had something that worked, but it wasn’t really useful to anyone.

Then I came across a paper where Axel Ngonga described how to solve this problem with active learning. Basically, the idea is to pick some record pairs that perhaps should be linked, and ask the user whether they should be linked or not. There’s an enormous number of pairs we could ask the user about, but most of these pairs provide very little information. The trick is to select those pairs which teach the algorithm the most.

This is great stuff.

Particularly since I have a training problem that lacks a training set. 😉

Looking forward to trying this on “real-world problems” as Lars says.
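A toy version of the active-learning step, not Duke’s code and with invented match scores: from a pool of candidate record pairs, ask the user about the pairs the current configuration is least sure of, because those answers carry the most information.

```python
# Toy active learning: query the pairs the current model is most uncertain about.
# Scores are invented stand-ins for Duke-style match probabilities in [0, 1].
candidate_pairs = {
    ("acme inc", "acme incorporated"): 0.93,
    ("j. smith", "john smith"): 0.55,
    ("oslo", "oslo, norway"): 0.48,
    ("ibm", "apple"): 0.04,
}

def pick_queries(scored_pairs, k=2):
    # Uncertainty sampling: distance from 0.5 is small when the model can't decide.
    return sorted(scored_pairs, key=lambda p: abs(scored_pairs[p] - 0.5))[:k]

queries = pick_queries(candidate_pairs)
print("Ask the user about:", queries)
# Each answer becomes labeled training data for the next round of the
# genetic algorithm's configuration search.
```

Pairs scored near 0 or 1 teach the algorithm almost nothing; the borderline pairs are where a human answer actually moves the configuration.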

Titanic Machine Learning from Disaster (Kaggle Competition)

Filed under: Data Mining,Graphics,Machine Learning,Visualization — Patrick Durusau @ 4:34 pm

Titanic Machine Learning from Disaster (Kaggle Competition) by Andrew Conti.

From the post (and from the Kaggle page):

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle Getting Started Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning.

From Andrew’s post:

Goal for this Notebook:

Show a simple example of an analysis of the Titanic disaster in Python using a full complement of PyData utilities. This is aimed for those looking to get into the field or those who are already in the field and looking to see an example of an analysis done with Python.

This Notebook will show basic examples of:

Data Handling

  • Importing Data with Pandas
  • Cleaning Data
  • Exploring Data through Visualizations with Matplotlib

Data Analysis

  • Supervised Machine learning Techniques:
    • Logit Regression Model
    • Plotting results
  • Unsupervised Machine learning Techniques
    • Support Vector Machine (SVM) using 3 kernels
    • Basic Random Forest
    • Plotting results

Valuation of the Analysis

  • K-folds cross validation to valuate results locally
  • Output the results from the IPython Notebook to Kaggle

Required Libraries:

This is wicked cool!
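If you want a taste of the workflow before opening the notebook, here is a compressed sketch. It is not Andrew’s code; it assumes Kaggle’s train.csv with its usual columns and collapses his analysis to a single cleaned-feature random forest with cross-validation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumes Kaggle's train.csv with its usual columns (Survived, Pclass, Sex, Age, Fare).
df = pd.read_csv("train.csv")

# Minimal cleaning: fill missing ages, encode sex as 0/1.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = (df["Sex"] == "female").astype(int)

features = df[["Pclass", "Sex", "Age", "Fare"]]
target = df["Survived"]

# K-fold cross-validation of a basic random forest, as in the notebook's last step.
scores = cross_val_score(RandomForestClassifier(n_estimators=200), features, target, cv=5)
print("mean CV accuracy:", scores.mean())
```

The notebook does all of this with far more care (visualization, logit models, SVMs with several kernels), which is exactly why it is worth working through.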

I first saw this in Kaggle Titanic Contest Tutorial by Danny Bickson.

PS: Don’t miss Andrew Conti’s new homepage.

October 20, 2013

PredictionIO Guide

Filed under: Cascading,Hadoop,Machine Learning,Mahout,Scalding — Patrick Durusau @ 4:20 pm

PredictionIO Guide

From the webpage:

PredictionIO is an open source Machine Learning Server. It empowers programmers and data engineers to build smart applications. With PredictionIO, you can add the following features to your apps instantly:

  • predict user behaviors
  • offer personalized video, news, deals, ads and job openings
  • help users to discover interesting events, documents, apps and restaurants
  • provide impressive match-making services
  • and more….

PredictionIO is built on top of solid open source technology. We support Hadoop, Mahout, Cascading and Scalding natively.

PredictionIO looks interesting in general but especially its Item Similarity Engine.

From the Item Similarity: Overview:

People who like this may also like….

This engine tries to suggest N items that are similar to a targeted item. By being ‘similar’, it does not necessarily mean that the two items look alike, nor they share similar attributes. The definition of similarity is independently defined by each algorithm and is usually calculated by a distance function. The built-in algorithms assume that similarity between two items means the likelihood any user would like (or buy, view etc) both of them.

The example that comes to mind is merging all “shoes” from any store and using the resulting price “occurrences” to create a price range and average for each store.
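One way to read “similarity calculated by a distance function over user behavior” is plain item-item cosine similarity on a user-by-item matrix. A toy sketch with an invented matrix (PredictionIO’s built-in algorithms are more elaborate than this):

```python
import numpy as np

# Rows = users, columns = items; 1 means the user liked/bought/viewed the item.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Cosine similarity between item columns: items liked by the same users score high.
norms = np.linalg.norm(interactions, axis=0)
similarity = (interactions.T @ interactions) / np.outer(norms, norms)

target_item = 0
ranked = np.argsort(-similarity[target_item])
print("items most similar to item 0:", [i for i in ranked if i != target_item])
```

Note that nothing here looks at item attributes: two “shoes” from different stores end up similar only if the same users interact with both, which is the behavioral notion of similarity the overview describes.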

October 16, 2013

LIBMF: …

Filed under: Machine Learning,Matrix,Recommendation — Patrick Durusau @ 6:04 pm

LIBMF: A Matrix-factorization Library for Recommender Systems by Machine Learning Group at National Taiwan University.

From the webpage:

LIBMF is an open source tool for approximating an incomplete matrix using the product of two matrices in a latent space. Matrix factorization is commonly used in collaborative filtering. Main features of LIBMF include

  • In addition to the latent user and item features, we add user bias, item bias, and average terms for better performance.
  • LIBMF can be parallelized in a multi-core machine. To make our package more efficient, we use SSE instructions to accelerate the vector product operations.

    For a data set of 250M ratings, LIBMF takes less than eight minutes to converge to a reasonable level.

Download

The current release (Version 1.0, Sept 2013) of LIBMF can be obtained by downloading the zip file or tar.gz file.

Please read the COPYRIGHT notice before using LIBMF.

Documentation

The algorithms of LIBMF are described in the following paper.

Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems. Proceedings of ACM Recommender Systems 2013.

See README in the package for the practical use.

Being curious about what “practical use” would be in the README, 😉 I discovered a demo data set and basic instructions for use.

For the details of application for recommendations, see the paper.
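For a sense of what LIBMF approximates, here is a deliberately small SGD matrix-factorization sketch in Python: toy ratings, no bias terms, no parallelism. The paper above describes the real, fast, shared-memory parallel version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy incomplete rating matrix: rows = users, cols = items, 0 = unobserved.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
observed = np.argwhere(R > 0)

k, lr, reg = 2, 0.01, 0.05       # latent dimension, learning rate, regularization
P = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors

# Plain SGD over observed entries; LIBMF parallelizes this loop across cores.
for epoch in range(200):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 1))      # reconstructed matrix fills in the 0 entries
```

The product of the two small factor matrices supplies predictions for the unobserved cells, which is what makes the factorization useful for recommendation.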

