Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 15, 2013

DatumBox

Filed under: Machine Learning — Patrick Durusau @ 8:01 pm

DatumBox

From the webpage:

Datumbox offers a large number of off-the-shelf Classifiers and Natural Language Processing services which can be used in a broad spectrum of applications including: Sentiment Analysis, Topic Classification, Language Detection, Subjectivity Analysis, Spam Detection, Reading Assessment, Keyword and Text Extraction and more. All services are accessible via our powerful REST API which allows you to develop your own smart Applications in no time.

I am taking a machine learning course based on Weka and that may be why this service caught my eye.

Particularly the part that reads:

Datumbox eliminates the complex and time consuming process of designing and training Machine Learning models. Our service gives you access to classifiers that can be directly used in your software.

I would agree that designing a machine learning model from scratch would be a time-consuming task. And largely unnecessary for most applications, in light of the large number of machine learning models that are already available.

However, I’m not sure how any machine learning model is going to avoid training. At least not if it is going to provide you with meaningful results.

Still, it is a free service so I am applying for an API key and will report back with more details.
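A sketch of what calling such a REST classification service might look like, using only Python's standard library. The endpoint URL and parameter names here are assumptions to be checked against the DatumBox API documentation, and you would need your own API key:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameter names -- check the DatumBox API
# documentation for the real URL and fields, and to obtain an API key.
API_URL = "http://api.datumbox.com/1.0/SentimentAnalysis.json"

def build_request(text, api_key):
    """Encode a sentiment-analysis call as a form-encoded POST request."""
    payload = urllib.parse.urlencode({"api_key": api_key, "text": text})
    return urllib.request.Request(API_URL, data=payload.encode("utf-8"))

def classify(text, api_key):
    """POST the text and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        return json.load(resp)
```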

October 8, 2013

Splunk Enterprise 6

Filed under: Intelligence,Machine Learning,Operations,Splunk — Patrick Durusau @ 3:27 pm

Splunk Enterprise 6

The latest version of Splunk is described as:

Operational Intelligence for Everyone

Splunk Enterprise is the leading platform for real-time operational intelligence. It’s the easy, fast and secure way to search, analyze and visualize the massive streams of machine data generated by your IT systems and technology infrastructure—physical, virtual and in the cloud.

Splunk Enterprise 6 is our latest release and delivers:

  • Powerful analytics for everyone—at amazing speeds
  • Completely redesigned user experience
  • Richer developer environment to easily extend the platform

The current download page offers the Enterprise version free for 60 days. At the end of that period you can convert to a Free license or purchase an Enterprise license.

October 1, 2013

Recursive Deep Models for Semantic Compositionality…

Filed under: Machine Learning,Modeling,Semantic Vectors,Semantics,Sentiment Analysis — Patrick Durusau @ 4:12 pm

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts.

Abstract:

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.

You will no doubt want to see the webpage with the demo.

Along with possibly the data set and the code.

I was surprised by “fine-grained sentiment labels” meaning:

  1. Positive
  2. Somewhat positive
  3. Neutral
  4. Somewhat negative
  5. Negative

But then for many purposes, subject recognition on that level of granularity may be sufficient.
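Those five labels typically come from binning a continuous sentiment score. A small sketch, assuming a [0, 1] score range with equal-width cut points modeled on the Stanford release (an assumption worth verifying against the dataset's own documentation):

```python
def fine_grained_label(score):
    """Map a sentiment score in [0, 1] onto the five-point scale.

    The equal-width cut points (0.2, 0.4, 0.6, 0.8) are an assumption
    modeled on the Stanford Sentiment Treebank release."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    labels = ["negative", "somewhat negative", "neutral",
              "somewhat positive", "positive"]
    return labels[min(int(score * 5), 4)]  # min() keeps 1.0 in the top bin
```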

September 30, 2013

OTexts.org is launched

Filed under: Books,Forecasting,Logic,Machine Learning — Patrick Durusau @ 6:46 pm

OTexts.org is launched by Rob J Hyndman.

From the post:

The publishing platform I set up for my forecasting book has now been extended to cover more books and greater functionality. Check it out at www.otexts.org.

otexts.org

So far, we have three complete books:

  1. Forecasting: principles and practice, by Rob J Hyndman and George Athanasopoulos
  2. Statistical foundations of machine learning, by Gianluca Bontempi and Souhaib Ben Taieb
  3. Modal logic of strict necessity and possibility, by Evgeni Latinov

OTexts.org is looking for readers, authors and donors.

Saying you support open access is one thing.

Supporting open access by contributing content or funding is another.

September 29, 2013

Learning From Data

Filed under: Machine Learning — Patrick Durusau @ 3:43 pm

Learning From Data by Professor Yaser Abu-Mostafa.

Rather than being broken into smaller segments, these lectures are traditional lecture length.

Personally I prefer the longer lecture style over shorter snippets, such as were used for Learning from Data (an earlier version).

Lectures:

  • Lecture 1 (The Learning Problem)
  • Lecture 2 (Is Learning Feasible?)
  • Lecture 3 (The Linear Model I)
  • Lecture 4 (Error and Noise)
  • Lecture 5 (Training versus Testing)
  • Lecture 6 (Theory of Generalization)
  • Lecture 7 (The VC Dimension)
  • Lecture 8 (Bias-Variance Tradeoff)
  • Lecture 9 (The Linear Model II)
  • Lecture 10 (Neural Networks)
  • Lecture 11 (Overfitting)
  • Lecture 12 (Regularization)
  • Lecture 13 (Validation)
  • Lecture 14 (Support Vector Machines)
  • Lecture 15 (Kernel Methods)
  • Lecture 16 (Radial Basis Functions)
  • Lecture 17 (Three Learning Principles)
  • Lecture 18 (Epilogue)

Enjoy!

September 25, 2013

Machine Learning: The problem is…

Filed under: Machine Learning,Weka — Patrick Durusau @ 2:07 pm

I am watching the Data Mining with Weka videos and Prof. Ian Witten observed that Weka makes machine learning easy but:

The problem is understanding what it is that you have done.

That’s really the rub isn’t it? You loaded data, the program ran without crashing, some output was displayed.

All well and good but does it mean anything?

Or does your boss tell you what a data set will show after you complete machine learning on it?

Not to single out machine learning, because there are any number of ways to “cook” data long before it reaches the machine learning step.

Take survey data, for example, where you ask some group of people for their responses.

A quick scan of the survey methodology article at Wikipedia will make you realize what services like Survey Monkey are really for.

I’ve heard the arguments that there is no money to do a survey correctly, so middle management makes up questions that lead to the desired result. Business decisions are then justified on that type of survey data.

Collecting data and running machine learning algorithms are vital day to day activities in data science.

Even if you plan to fool others, don’t be fooled yourself. Develop a critical outlook and the questions that should be asked of data sets, depending upon their point of origin.
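One such question can even be automated: shuffle your labels and re-run the pipeline. If the score barely drops, the model is learning noise and the original result never meant anything. A minimal sketch with a toy one-feature nearest-centroid classifier (the data here is synthetic):

```python
import random

def nearest_centroid_accuracy(X, y):
    """Train a one-feature nearest-centroid classifier and report
    its training accuracy -- deliberately the simplest model possible."""
    centroids = {}
    for label in set(y):
        pts = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = sum(pts) / len(pts)
    correct = sum(
        1 for x, lab in zip(X, y)
        if min(centroids, key=lambda c: abs(x - centroids[c])) == lab
    )
    return correct / len(y)

random.seed(42)
# The feature genuinely predicts the label...
X = [random.gauss(0, 1) for _ in range(200)] + \
    [random.gauss(3, 1) for _ in range(200)]
y = [0] * 200 + [1] * 200
real_acc = nearest_centroid_accuracy(X, y)

# ...but with shuffled labels, accuracy should fall to roughly chance.
y_shuffled = y[:]
random.shuffle(y_shuffled)
shuffled_acc = nearest_centroid_accuracy(X, y_shuffled)
```

If `shuffled_acc` came out close to `real_acc`, that would be the warning sign.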

PS: Do you know of any courses on “data skepticism?” That would make a great course title. 😉

September 24, 2013

A Course in Machine Learning (book)

Filed under: Machine Learning — Patrick Durusau @ 4:11 pm

A Course in Machine Learning by Hal Daumé III.

From the webpage:

Machine learning is the study of algorithms that learn from data and experience. It is applied in a vast variety of application areas, from medicine to advertising, from military to pedestrian. Any area in which you need to make sense of data is a potential consumer of machine learning.

CIML is a set of introductory materials that covers most major aspects of modern machine learning (supervised learning, unsupervised learning, large margin methods, probabilistic modeling, learning theory, etc.). Its focus is on broad applications with a rigorous backbone. A subset can be used for an undergraduate course; a graduate course could probably cover the entire material and then some.

You may obtain the written materials by purchasing a print copy ($55), by downloading the entire book, or by downloading individual chapters below. If you find the electronic version of the book useful and would like to donate a small amount to support further development, that’s always appreciated! The current version is 0.9 (the “beta” pre-release).

Have you noticed that the quality of materials on the Internet is increasing? At least in some domains.

If you want to look at individual chapters:

  1. Front Matter
  2. Decision Trees
  3. Geometry and Nearest Neighbors
  4. The Perceptron
  5. Machine Learning in Practice
  6. Beyond Binary Classification
  7. Linear Models
  8. Probabilistic Modeling
  9. Neural Networks
  10. Kernel Methods
  11. Learning Theory
  12. Ensemble Methods
  13. Efficient Learning
  14. Unsupervised Learning
  15. Expectation Maximization
  16. Semi-Supervised Learning
  17. Graphical Models
  18. Online Learning
  19. Structured Learning
  20. Bayesian Learning
  21. Back Matter

Code and datasets are said to be coming soon.

I first saw this at: A Course in Machine Learning (free book).

September 21, 2013

Search Rules using Mahout’s Association Rule Mining

Filed under: Machine Learning,Mahout,Searching — Patrick Durusau @ 2:05 pm

Search Rules using Mahout’s Association Rule Mining by Sujit Pal.

This work came about based on a conversation with one of our domain experts, who was relaying a conversation he had with one of our clients. The client was looking for ways to expand the query based on terms already in the query – for example, if a query contained “cattle” and “neurological disorder”, then we should also serve results for “bovine spongiform encephalopathy”, also known as “mad cow disease”.

We do semantic search, which involves annotating words and phrases in documents with concepts from our taxonomy. One view of an annotated document is the bag of concepts view, where a document is modeled as a sparsely populated array of scores, each position corresponding to a concept. One way to address the client’s requirement would be to do Association Rule Mining on the concepts, looking for significant co-occurrences of a set of concepts per document across the corpus.

The data I used to build this proof-of-concept came from one of my medium sized indexes, and contains 12,635,756 rows and 342,753 unique concepts. While Weka offers the Apriori algorithm, I suspect that it won’t be able to handle this data volume. Mahout is probably a better fit, and it offers the FPGrowth algorithm running on Hadoop, so that’s what I used. This post describes the things I had to do to prepare my data for Mahout, run the job with Mahout on the Amazon Elastic MapReduce (EMR) platform, then post process the data to get useful information out of it.
(…)

I don’t know that I would call these “search rules” but they would certainly qualify as input into defining merging rules.

Particularly if I were mining domain literature, where co-occurrences of terms are likely to share the same semantics. Not always, but likely. The likelihood of semantic sameness is something you can sample for and develop confidence measures about.
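The core idea can be sketched with naive pairwise counting; Mahout's FPGrowth does the same job efficiently at the 12-million-row scale Sujit describes. The concept names below are illustrative, not taken from his index:

```python
from collections import Counter
from itertools import combinations

def mine_pairs(transactions, min_support=2, min_confidence=0.6):
    """Naive pairwise association mining: find concept pairs that
    co-occur often, and report confidence(a -> b) for each direction."""
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue
        rules[(a, b)] = n / item_counts[a]   # confidence of a -> b
        rules[(b, a)] = n / item_counts[b]   # confidence of b -> a
    return {r: c for r, c in rules.items() if c >= min_confidence}

# Each "transaction" is the bag of concepts annotated on one document.
docs = [
    {"cattle", "neurological disorder", "bse"},
    {"cattle", "bse"},
    {"cattle", "pasture"},
    {"bse", "neurological disorder"},
]
rules = mine_pairs(docs)
```

High-confidence rules like ("neurological disorder" → "bse") are exactly the candidates for query expansion, or for merging rules.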

August 29, 2013

Data Mining with Weka [Free MOOC]

Filed under: Data Mining,Machine Learning,Weka — Patrick Durusau @ 6:25 pm

Data Mining with Weka

From the webpage:

Welcome to the free online course Data Mining with Weka

This 5 week MOOC will introduce data mining concepts through practical experience with the free Weka tool.

The course will start September 9, 2013, with enrolments now open.

An opportunity to both keep your mind in shape and learn something useful.

The need for people with data intuition who also know machine learning is increasing.

Are you going to be the pro from Dover or not?

August 27, 2013

Classification of handwritten digits

Filed under: Machine Learning,Mathematica — Patrick Durusau @ 6:21 pm

Classification of handwritten digits

From the post:

In this blog post I show some experiments with algorithmic recognition of images of handwritten digits.

I followed the algorithm described in Chapter 10 of the book “Matrix Methods in Data Mining and Pattern Recognition” by Lars Elden.

The algorithm described uses the so called thin Singular Value Decomposition (SVD).

An interesting introduction to a traditional machine learning exercise.

Not to mention the use of Mathematica, a standard tool for mathematical analysis.

You do know they have a personal version for home use? The list price as of today is $295 for a copy.

August 18, 2013

Distributed Machine Learning with Spark using MLbase

Filed under: Machine Learning,MLBase,Spark — Patrick Durusau @ 1:01 pm

Apache Spark: Distributed Machine Learning with Spark using MLbase by Ameet Talwalkar and Evan Sparks.

From the description:

In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods.

Useful links:

http://mlbase.org

http://spark-project.org/

http://incubator.apache.org/projects/spark.html

Suggestion: When you make a video of a presentation, don’t include members of the audience eating (pizza in this case). It’s distracting.

From: http://mlbase.org

  • MLlib: A distributed low-level ML library written directly against the Spark runtime that can be called from Scala and Java. The current library includes common algorithms for classification, regression, clustering and collaborative filtering, and will be included as part of the Spark v0.8 release.
  • MLI: An API / platform for feature extraction and algorithm development that introduces high-level ML programming abstractions. MLI is currently implemented against Spark, leveraging the kernels in MLlib when possible, though code written against MLI can be executed on any runtime engine supporting these abstractions. MLI includes more extensive functionality and has a faster development cycle than MLlib. It will be released in conjunction with MLlib as a separate project.
  • ML Optimizer: This layer aims to simplify ML problems for End Users by automating the task of model selection. The optimizer solves a search problem over feature extractors and ML algorithms included in MLI. This component is under active development.

The goal of this project, to make machine learning easier for developers and end users is a laudable one.

And it is the natural progression of a technology from being experimental to common use.

On the other hand, I am uneasy about the weight users will put on results, while not understanding biases or uncertainties that are cooked into the data or algorithms.

I don’t think there is a solution to the bias/uncertainty problem other than to become more knowledgeable about machine learning.

Not that you will win an argument with an end user who keeps pointing to a result as though it were untouched by human biases.

But you may be able to better avoid such traps for yourself and your clients.

August 15, 2013

Learning the meaning behind words

Filed under: Machine Learning,Meaning — Patrick Durusau @ 6:55 pm

Learning the meaning behind words by Tomas Mikolov, Ilya Sutskever, and Quoc Le, Google Knowledge.

From the post:

Today computers aren’t very good at understanding human language, and that forces people to do a lot of the heavy lifting—for example, speaking “searchese” to find information online, or slogging through lengthy forms to book a trip. Computers should understand natural language better, so people can interact with them more easily and get on with the interesting parts of life.

While state-of-the-art technology is still a ways from this goal, we’re making significant progress using the latest machine learning and natural language processing techniques. Deep learning has markedly improved speech recognition and image classification. For example, we’ve shown that computers can learn to recognize cats (and many other objects) just by observing large amounts of images, without being trained explicitly on what a cat looks like. Now we apply neural networks to understanding words by having them “read” vast quantities of text on the web. We’re scaling this approach to datasets thousands of times larger than what has been possible before, and we’ve seen a dramatic improvement of performance — but we think it could be even better. To promote research on how machine learning can apply to natural language problems, we’re publishing an open source toolkit called word2vec that aims to learn the meaning behind words.

Word2vec uses distributed representations of text to capture similarities among concepts. For example, it understands that Paris and France are related the same way Berlin and Germany are (capital and country), and not the same way Madrid and Italy are. This chart shows how well it can learn the concept of capital cities, just by reading lots of news articles — with no human supervision:
(…)

Google has open sourced the code for word2vec.

I wonder how this would perform on all the RFCs?

Or all of the papers at Citeseer?
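The capital-city behavior comes from plain vector arithmetic over the learned embeddings. A toy sketch, with hand-made 3-d vectors standing in for real word2vec output (which would be hundreds of dimensions learned from text):

```python
import numpy as np

# Hand-made toy embedding; real word2vec vectors are learned, not written.
vec = {
    "paris":   np.array([1.0, 1.0, 0.0]),
    "france":  np.array([1.0, 0.0, 0.0]),
    "berlin":  np.array([0.0, 1.0, 1.0]),
    "germany": np.array([0.0, 0.0, 1.0]),
    "rome":    np.array([0.9, 1.0, 0.1]),
}

def analogy(a, b, c, vocab):
    """Solve a : b :: c : ? by vector arithmetic, as word2vec does:
    return the word whose vector is nearest to vec(b) - vec(a) + vec(c)."""
    target = vec[b] - vec[a] + vec[c]
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vec[w], target))
```

With these toy vectors, france : paris :: germany : ? resolves to berlin, mirroring the capital/country relation in the post.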

August 4, 2013

Web Scale? Or do you want to try for human scale?

Filed under: Data Mining,Machine Learning,Ontology — Patrick Durusau @ 4:41 pm

How often have you heard the claim that this or that technology is “web scale”?

How big is “web scale?”

Visit http://www.worldwidewebsize.com/ to get an estimate of the size of the Web.

As of today, the estimated number of indexed web pages for Google is approximately 47 billion pages.

How does that compare, say to scholarly literature?

Would you believe 1 trillion pages of scholarly journal literature?

An incomplete inventory (Fig. 1), divided into biological, social, and physical sciences, contains 400, 200, and 65 billion pages, respectively (see supplemental data*).

Or better with an image:

[chart: indexed web pages vs. pages of scholarly journal literature]

I didn’t bother putting in the trillion page data but for your information, the indexed Web is < 5% of all scholarly journal literature.
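The arithmetic behind the "< 5%" figure is simple:

```python
# Back-of-the-envelope: Google's index vs. scholarly journal literature.
indexed_web = 47e9    # indexed pages, per the worldwidewebsize.com estimate
scholarly = 1e12      # roughly a trillion pages of journal literature
fraction = indexed_web / scholarly
print(f"The indexed web is {fraction:.1%} of scholarly journal literature")
```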

Nor did I try to calculate the data that Chicago is collecting every day with 10,000 video cameras.

Is your app ready to step up to human scale information retrieval?

*Advancing science through mining libraries, ontologies, and communities by JA Evans, A. Rzhetsky. J Biol Chem. 2011 Jul 8;286(27):23659-66. doi: 10.1074/jbc.R110.176370. Epub 2011 May 12.

August 2, 2013

Apache Mahout 0.8!

Filed under: Machine Learning,Mahout — Patrick Durusau @ 3:08 pm

Apache Mahout 0.8.

From the homepage:

Mahout currently has

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)
  • A vibrant community
  • and more cool stuff to come this summer thanks to Google Summer of Code

If you are interested in Mahout, be sure to read the notes on future plans.

As the project moves towards a 1.0 release, the community is working to clean up and/or remove parts of the code base that are under-supported or that underperform as well as to better focus the energy and contributions on key algorithms that are proven to scale in production and have seen wide-spread adoption. To this end, in the next release, the project is planning on removing support for the following algorithms unless there is sustained support and improvement of them before the next release.

If you see an algorithm you need, best to step up or support someone else stepping up to support and improve the existing code.

July 30, 2013

RTextTools: A Supervised Learning Package for Text Classification

Filed under: Classification,Machine Learning,R — Patrick Durusau @ 2:17 pm

RTextTools: A Supervised Learning Package for Text Classification by Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, and Wouter van Atteveldt.

Abstract:

Social scientists have long hand-labeled texts to create datasets useful for studying topics from congressional policymaking to media reporting. Many social scientists have begun to incorporate machine learning into their toolkits. RTextTools was designed to make machine learning accessible by providing a start-to-finish product in less than 10 steps. After installing RTextTools, the initial step is to generate a document term matrix. Second, a container object is created, which holds all the objects needed for further analysis. Third, users can use up to nine algorithms to train their data. Fourth, the data are classified. Fifth, the classification is summarized. Sixth, functions are available for performance evaluation. Seventh, ensemble agreement is conducted. Eighth, users can cross-validate their data. Finally, users write their data to a spreadsheet, allowing for further manual coding if required.
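RTextTools is an R package, but the first few steps (build a document-term matrix, train, classify) translate directly. A rough Python/scikit-learn sketch under made-up miniature "bills", not the package's own API or data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for hand-labeled congressional bill texts.
docs = [
    "farm subsidies crop insurance",
    "crop rotation farm aid",
    "missile defense budget",
    "military defense spending",
]
labels = ["agriculture", "agriculture", "defense", "defense"]

# Step 1: document-term matrix; steps 2-4: container, train, classify.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
prediction = model.predict(["farm crop aid"])[0]
```

The remaining RTextTools steps (ensemble agreement, cross-validation, writing results back out) have scikit-learn analogues as well.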

Another software package that comes with a sample data set!

The congressional bills example reminds me of a comment by Trey Grainger in Building a Real-time, Big Data Analytics Platform with Solr.

Trey makes the point that “document” in Solr depends on how you define document. Which enables processing/retrieval at a much lower level than a traditional “document.”

If the congressional bills were broken down at a clause level, would the results be different?

Not something I am going to pursue today but will appreciate comments and suggestions if you have seen that tried in other contexts.

July 25, 2013

Classification accuracy is not enough

Filed under: Classification,Machine Learning,Music — Patrick Durusau @ 4:41 pm

Classification accuracy is not enough by Bob L. Sturm.

From the post:

Finally published is my article, Classification accuracy is not enough: On the evaluation of music genre recognition systems. I made it completely open access and free for anyone.

Some background: In my paper Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?, I perform three different experiments to determine how well two state-of-the-art systems for music genre recognition are recognizing genre. In the first experiment, I find the two systems are consistently making extremely bad misclassifications. In the second experiment, I find the two systems can be fooled by such simple transformations that they cannot possibly be listening to the music. In the third experiment, I find their internal models of the genres do not match how humans think the genres sound. Hence, it appears that the systems are not recognizing genre in the least. However, this seems to contradict the fact that they achieve extremely good classification accuracies, and have been touted as superior solutions in the literature. Turns out, Classification accuracy is not enough!

(…)

I look closely at what kinds of mistakes the systems make, and find they all make very poor yet “confident” mistakes. I demonstrate the latter by looking at the decision statistics of the systems. There is little difference for a system between making a correct classification, and an incorrect one. To judge how poor the mistakes are, I test with humans whether the labels selected by the classifiers describe the music. Test subjects listen to a music excerpt and select between two labels which they think was given by a human. Not one of the systems fooled anyone. Hence, while all the systems had good classification accuracies, good precisions, recalls, and F-scores, and confusion matrices that appeared to make sense, a deeper evaluation shows that none of them are recognizing genre, and thus that none of them are even addressing the problem. (They are all horses, making decisions based on irrelevant but confounded factors.)

(…)

If you have ever wondered what a detailed review of classification efforts would look like, you need wonder no longer!

Bob’s Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing? is thirty-six (36) pages that examines efforts at music genre recognition (MGR) in detail.

I would highly recommend this paper as a demonstration of good research technique.

July 8, 2013

Mahout – Unofficial 0.8 Release

Filed under: Machine Learning,Mahout — Patrick Durusau @ 3:29 pm

Mahout – Unofficial 0.8 Release (an email from Grant Ingersoll).

From the post:

A _preview_ of release artifacts for 0.8 are at https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.

This is not an official release. I will call a vote in a day or two, pending feedback on this thread, so please review/test.

A _preview_ of the release notes are at https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

In case you are interested in contributing comments pre-release.

June 22, 2013

Machine Learning Cheat Sheet [Suggestions for a better one]

Filed under: Algorithms,Machine Learning — Patrick Durusau @ 3:39 pm

Machine Learning Cheat Sheet (pdf)

If you need to memorize machine learning formulas for an exam, this might be the very thing.

On the other hand, if you are sitting at your console, you are likely to have online or hard copy references with this formula and more detailed information.

A more generally helpful machine learning cheat sheet would include some common cases where each algorithm has been successful. Perhaps even some edge cases you are unlikely to think about.

The algorithms are rarely in question. Proper application, well, that’s an entirely different story.

I first saw this in a tweet by Siah.

Dlib C++ Library [New Release]

Filed under: Machine Learning — Patrick Durusau @ 3:26 pm

Dlib C++ Library

From the webpage:

A major design goal of this portion of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms. Towards this end, dlib takes a generic programming approach using C++ templates. In particular, each algorithm is parameterized to allow a user to supply either one of the predefined dlib kernels (e.g. RBF operating on column vectors), or a new user defined kernel. Moreover, the implementations of the algorithms are totally separated from the data on which they operate. This makes the dlib implementation generic enough to operate on any kind of data, be it column vectors, images, or some other form of structured data. All that is necessary is an appropriate kernel.
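The same design idea, an algorithm parameterized by a user-supplied kernel, can be sketched in a few lines of Python. dlib does this with C++ templates; this is an illustration of the pattern, not dlib's API:

```python
import numpy as np

def rbf(gamma):
    """One predefined kernel; any function k(x, z) will do."""
    return lambda x, z: np.exp(-gamma * np.sum((x - z) ** 2))

class KernelRidge:
    """Ridge regression parameterized by a user-supplied kernel,
    mirroring dlib's parameterize-on-kernel design in plain Python."""
    def __init__(self, kernel, lam=1e-3):
        self.kernel, self.lam = kernel, lam

    def fit(self, X, y):
        # Gram matrix of the training points under the chosen kernel.
        K = np.array([[self.kernel(a, b) for b in X] for a in X])
        self.X = X
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(X)), y)
        return self

    def predict(self, x):
        return sum(a * self.kernel(xi, x) for a, xi in zip(self.alpha, self.X))
```

Swapping `rbf(gamma)` for any other kernel function changes the behavior without touching the algorithm, which is the design goal the dlib page describes.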

New features in 18.3:

  • Machine Learning:
    • Added the svr_linear_trainer, a tool for solving large scale support vector regression problems.
    • Added a tool for working with BIO and BILOU style sequence taggers/segmenters. This is the new sequence_segmenter object and its associated structural_sequence_segmentation_trainer object.
    • Added a python interface to some of the machine learning tools. These include the svm_c_trainer, svm_c_linear_trainer, svm_rank_trainer, and structural_sequence_segmentation_trainer objects as well as the cca() routine.
  • Added point_transform_projective and find_projective_transform().
  • Added a function for numerically integrating arbitrary functions: the new integrate_function_adapt_simpson() routine, contributed by Steve Taylor.
  • Added jet(), a routine for coloring images with the jet color scheme.

This looks interesting. Lots of good references, etc.

I first saw this in a tweet by Mxlearn.

June 21, 2013

The LION Way

Filed under: Interface Research/Design,Machine Learning — Patrick Durusau @ 5:43 pm

The LION Way: Machine Learning plus Intelligent Optimization by Roberto Battiti and Mauro Brunato.

From the introduction:

Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. The LION way is about increasing the automation level and connecting data directly to decisions and actions. More power is directly in the hands of decision makers in a self-service manner, without resorting to intermediate layers of data scientists. LION is a complex array of mechanisms, like the engine in an automobile, but the user (driver) does not need to know the inner-workings of the engine in order to realize tremendous benefits. LION’s adoption will create a prairie fire of innovation which will reach most businesses in the next decades. Businesses, like plants in wildfire-prone ecosystems, will survive and prosper by adapting and embracing LION techniques, or they risk being transformed from giant trees to ashes by the spreading competition.

The questions to be asked in the LION paradigm are not about mathematical goodness models but about abundant data, expert judgment of concrete options (examples of success cases), interactive definition of success criteria, at a level which makes a human person at ease with his mental models. For example, in marketing, relevant data can describe the money allocation and success of previous campaigns, in engineering they can describe experiments about motor designs (real or simulated) and corresponding fuel consumption.

OK, the “…prairie fire of innovation…” stuff is a bit over the top but it’s promoting a paradigm.

And I’m not unsympathetic to making tools easier for users to use.

Although, I must confess that people who choose a “self-service” model for complex information processing are likely to get the results they deserve (but don’t want).

Like most people I can “type” after a fashion. I don’t look at the keyboard and do use all ten fingers. But, compared to a professional typist of my youth, I am not even an entry level typist. A professional typist could produce far more error free content in a couple of hours than I can all day.

Odd how “self-service” works out to putting more of a burden on the user for a poorer result.

The book is free and worth a read.

I first saw this at KDNuggets.

June 16, 2013

Music Information Research Based on Machine Learning

Filed under: Machine Learning,Music,Music Retrieval — Patrick Durusau @ 3:38 pm

Music Information Research Based on Machine Learning by Masataka Goto and Kazuyoshi Yoshii.

From the webpage:

Music information research is gaining a lot of attention after 2000 when the general public started listening to music on computers in daily life. It is widely known as an important research field, and new researchers are continually joining the field worldwide. Academically, one of the reasons many researchers are involved in this field is that the essential unresolved issue is the understanding of complex musical audio signals that convey content by forming a temporal structure while multiple sounds are interrelated. Additionally, there are still appealing unresolved issues that have not been touched yet, and the field is a treasure trove of research topics that could be tackled with state-of-the-art machine learning techniques.

This tutorial is intended for an audience interested in the application of machine learning techniques to such music domains. Audience members who are not familiar with music information research are welcome, and researchers working on music technologies are likely to find something new to study.

First, the tutorial serves as a showcase of music information research. The audience can enjoy and study many state-of-the-art demonstrations of music information research based on signal processing and machine learning. This tutorial highlights timely topics such as active music listening interfaces, singing information processing systems, web-related music technologies, crowdsourcing, and consumer-generated media (CGM).

Second, this tutorial explains the music technologies behind the demonstrations. The audience can learn how to analyze and understand musical audio signals, process singing voices, and model polyphonic sound mixtures. As a new approach to advanced music modeling, this tutorial introduces unsupervised music understanding based on nonparametric Bayesian models.

Third, this tutorial provides a practical guide to getting started in music information research. The audience can try available research tools such as music feature extraction, machine learning, and music editors. Music databases and corpora are then introduced. As a hint towards research topics, this tutorial also discusses open problems and grand challenges that the audience members are encouraged to tackle.

In the future, music technologies, together with image, video, and speech technologies, are expected to contribute toward all-around media content technologies based on machine learning.

Download tutorial slides.

Always nice to start the week with something different.

I first saw this in a tweet by Masataka Goto.
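To get a concrete feel for the “music feature extraction” the tutorial mentions, here is a minimal numpy-only sketch (my own illustration, not code from the tutorial) computing the spectral centroid, one of the simplest audio features used in music information retrieval:

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Frequency-weighted mean of the magnitude spectrum --
    a crude 'brightness' feature used in music information retrieval."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# A 440 Hz sine wave should have its centroid at roughly 440 Hz.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(round(spectral_centroid(tone, sr)))  # ~440
```

Real toolkits extract dozens of such features (MFCCs, chroma, onset strength) and feed them to the machine learning models the tutorial surveys.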

June 12, 2013

NYU Large Scale Machine Learning Class Notes

Filed under: Machine Learning — Patrick Durusau @ 1:41 pm

NYU Large Scale Machine Learning Class Notes by John Langford.

John has posted the class notes from the large scale machine learning class he co-taught with Yann LeCun.

Catch the videos here.

June 10, 2013

Mahout for R Users

Filed under: Machine Learning,Mahout,R — Patrick Durusau @ 2:31 pm

Mahout for R Users by Simon Raper.

From the post:

I have a few posts coming up on Apache Mahout so I thought it might be useful to share some notes. I came at it as primarily an R coder with some very rusty Java and C++ somewhere in the back of my head so that will be my point of reference. I’ve also included at the bottom some notes for setting up Mahout on Ubuntu.

What is Mahout?

A machine learning library written in Java that is designed to be scalable, i.e. run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable (they fit the map-reduce paradigm and therefore can run on Hadoop). Using Mahout you can do clustering, recommendation, prediction etc. on huge datasets by increasing the number of CPUs it runs over. Any job that you can split up into little jobs that can be done at the same time is going to see vast improvements in performance when parallelized.

Like R it’s open source and free!

So why use it?

Should be obvious from the last point. The parallelization trick brings data and tasks that were once beyond the reach of machine learning suddenly into view. But there are other virtues. Java’s strictly object orientated approach is a catalyst to clear thinking (once you get used to it!). And then there is a much shorter path to integration with web technologies. If you are thinking of a product rather than just a one off piece of analysis then this is a good way to go.

Large data sets have been in the news of late. 😉

Are you ready to apply machine learning techniques to large data sets?

And will you be familiar enough with the techniques to spot computational artifacts?

Can’t say for sure but more knowledge of and practice with Mahout might help with those questions.
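The “split a job into little jobs” idea behind Mahout’s scalability is just the map-reduce paradigm. A toy word count, sketched in Python rather than Mahout’s Java (the map calls run sequentially here, but each one is independent and could run on a separate machine):

```python
from collections import Counter
from functools import reduce

def mapper(chunk):
    """Map step: each chunk is counted independently -- these
    calls could run on separate machines."""
    return Counter(chunk.split())

def reducer(a, b):
    """Reduce step: partial counts are merged pairwise."""
    return a + b

chunks = ["mahout scales machine learning",
          "machine learning on hadoop",
          "mahout runs on hadoop"]
total = reduce(reducer, map(mapper, chunks))
print(total["machine"])  # merged count across all chunks
```

The same mapper/reducer shape is what Mahout hands to Hadoop, just with matrix and vector operations instead of word counts.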

June 5, 2013

Crowdsourcing + Machine Learning…

Filed under: Crowd Sourcing,Machine Learning,Manuscripts — Patrick Durusau @ 9:20 am

Crowdsourcing + Machine Learning: Nicholas Woodward at TCDL by Ben W. Brumfield.

I was so impressed by Nicholas Woodward’s presentation at TCDL this year that I asked him if I could share “Crowdsourcing + Machine Learning: Building an Application to Convert Scanned Documents to Text” on this blog.

Hi. My name is Nicholas Woodward, and I am a Software Developer for the University of Texas Libraries. Ben Brumfield has been so kind as to offer me an opportunity to write a guest post on his blog about my approach for transcribing large scanned document collections that combines crowdsourcing and computer vision. I presented my application at the Texas Conference on Digital Libraries on May 7th, 2013, and the slides from the presentation are available on TCDL’s website. The purpose of this post is to introduce my approach along with a test collection and preliminary results. I’ll conclude with a discussion on potential avenues for future work.

Before we delve into algorithms for computer vision and what-not, I’d first like to say a word about the collection used in this project and why I think it’s important to look for new ways to complement crowdsourcing transcription. The Guatemalan National Police Historical Archive (or AHPN, in Spanish) contains the records of the Guatemalan National Police from 1882-2005. It is estimated that AHPN contains more than 80 million pages of documents (8,000 linear meters) such as handwritten journals and ledgers, birth certificate and marriage license forms, identification cards and typewritten letters. To date, the AHPN staff have processed and digitized approximately 14 million pages of the collection, and they are publicly available in a digital repository that was developed by UT Libraries.

While unique for its size, AHPN is representative of an increasingly common problem in the humanities and social sciences. The nature of the original documents precludes any economical OCR solution on the scanned images (See below), and the immense size of the collection makes page-by-page transcription highly impractical, even when using a crowdsourcing approach. Additionally, the collection does not contain sufficient metadata to support browsing via commonly used traits, such as titles or authors of documents.

A post at the intersection of many of my interests!

Imagine pushing this just a tad further to incorporate management of subject identity, whether visible to the user or not.

Trends in Machine Learning [SciPy]

Filed under: Machine Learning,Python — Patrick Durusau @ 8:11 am

Trends in Machine Learning by Olivier Grisel.

Slides from presentation at Paris DataGeeks 2013.

Focus is on Python and SciPy.

Covers probabilistic programming and deep learning, with links at the end.

Good way to check your currency on machine learning with Python.

May 13, 2013

How to Build a Text Mining, Machine Learning….

Filed under: Document Classification,Machine Learning,R,Text Mining — Patrick Durusau @ 3:51 pm

How to Build a Text Mining, Machine Learning Document Classification System in R! by Timothy D’Auria.

From the description:

We show how to build a machine learning document classification system from scratch in less than 30 minutes using R. We use a text mining approach to identify the speaker of unmarked presidential campaign speeches. Applications in brand management, auditing, fraud detection, electronic medical records, and more.

Well made video introduction to R and text mining.
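The video builds its system in R, but the core idea (bag-of-words counts per speaker, scored with a smoothed Naive Bayes) fits in a few lines of Python. A toy sketch with made-up campaign snippets:

```python
from collections import Counter
import math

def train(docs):
    """docs: list of (speaker, text). Returns per-speaker word counts."""
    model = {}
    for speaker, text in docs:
        model.setdefault(speaker, Counter()).update(text.lower().split())
    return model

def classify(model, text, alpha=1.0):
    """Naive Bayes with add-one smoothing over the shared vocabulary."""
    vocab = set(w for counts in model.values() for w in counts)
    scores = {}
    for speaker, counts in model.items():
        total = sum(counts.values()) + alpha * len(vocab)
        scores[speaker] = sum(
            math.log((counts[w] + alpha) / total)
            for w in text.lower().split())
    return max(scores, key=scores.get)

docs = [("A", "we will cut taxes and cut spending"),
        ("A", "lower taxes grow the economy"),
        ("B", "we will invest in schools and teachers"),
        ("B", "education funding for schools")]
model = train(docs)
print(classify(model, "cut taxes"))  # vocabulary points to speaker A
```

Unmarked speeches get attributed to whichever speaker’s word distribution makes them least surprising, which is essentially what the video does with real speech corpora.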

May 3, 2013

Deep learning made easy

Filed under: Artificial Intelligence,Deep Learning,Machine Learning,Sparse Data — Patrick Durusau @ 1:06 pm

Deep learning made easy by Zygmunt Zając.

From the post:

As usual, there’s an interesting competition at Kaggle: The Black Box. It’s connected to ICML 2013 Workshop on Challenges in Representation Learning, held by the deep learning guys from Montreal.

There are a couple benchmarks for this competition and the best one is unusually hard to beat – fewer than a fourth of those taking part managed to do so. We’re among them. Here’s how.

The key ingredient in our success is a recently developed secret Stanford technology for deep unsupervised learning, called sparse filtering. Actually, it’s not secret. It’s available at Github, and has one or two very appealing properties. Let us explain.

The main idea of deep unsupervised learning, as we understand it, is feature extraction. One of the most common applications is in multimedia. The reason for that is that multimedia tasks, for example object recognition, are easy for humans, but difficult for the computers*.

Geoff Hinton from Toronto talks about two ends of the spectrum in machine learning: one is statistics and getting rid of noise, the other one – AI, or the things that humans are good at but computers are not. Deep learning proponents say that deep, that is, layered, architectures are the way to solve AI kind of problems.

The idea might have something to do with an inspiration from how the brain works. Each layer is supposed to extract higher-level features, and these features are supposed to be more useful for the task at hand.

Rather, say that layered architectures are observed to mimic human results.

Just as a shovel mimics and exceeds a human hand for digging.

But you would not say operation of a shovel gives us insight into the operation of a human hand.

Or would you?
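For the curious, the sparse filtering objective is compact enough to sketch in numpy. This is my reading of the Ngiam et al. paper, not the Github code itself: take a soft absolute value of the learned features, normalize each feature (row) to unit norm, then each example (column), and minimize the L1 sum.

```python
import numpy as np

def sparse_filtering_objective(features):
    """Sparse filtering objective (Ngiam et al., 2011), as I read it:
    soft absolute value, per-feature (row) normalization, per-example
    (column) normalization, then the L1 sum to be minimized."""
    eps = 1e-8
    f = np.sqrt(features ** 2 + eps)                   # soft absolute value
    f = f / np.linalg.norm(f, axis=1, keepdims=True)   # each feature unit-norm
    f = f / np.linalg.norm(f, axis=0, keepdims=True)   # each example unit-norm
    return float(f.sum())

rng = np.random.default_rng(1)
raw = rng.normal(size=(16, 50))  # 16 learned features x 50 examples
print(sparse_filtering_objective(raw))
```

The appeal the post alludes to: the only hyperparameter is the number of features, since the normalizations remove any dependence on feature scale.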

April 30, 2013

GraphLab Workshop 2013 (Update)

Filed under: Conferences,GraphLab,Machine Learning — Patrick Durusau @ 2:46 pm

GraphLab Workshop 2013 Confirmed Agenda

You probably already have your plane tickets and hotel reservation, but have you registered for GraphLab Workshop 2013?

Not just a select few graph databases for comparison but:

We have secured talks and demos about the hottest graph processing systems out there: GraphLab (CMU/UW), Pregel (Google), Giraph (Facebook), Cassovary (Twitter), Grappa (UW), Combinatorial BLAS (LBNL/UCSB), AllegroGraph (Franz), Neo4j, Titan (Aurelius), DEX (Sparsity Technologies), YarcData and others!

Registration.

2013 Graphlab Workshop on Large Scale Machine Learning
Sessions Events LLC
Monday, July 1, 2013 from 8:00 AM to 7:00 PM (PDT)
San Francisco, CA

I know, I know, 8 AM is an unholy time to be anywhere (other than on your way home) on the West Coast.

Just pull an all-dayer for a change. 😉

Expecting to see lots of posts and tweets from the conference!

April 24, 2013

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices

Filed under: Graphic Processors,Graphs,Machine Learning,R,Sparse Data,Sparse Matrices — Patrick Durusau @ 7:05 pm

Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices by Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S. Schreiber.

Abstract:

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks.

In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.

Your mileage may vary but the paper reports that for PageRank, Presto is 40X faster than Hadoop and 15X faster than Spark.

Unfortunately I can’t point you to any binary or source code for Presto.

Still, the description is an interesting one at a time of rapid development of computing power.
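Since Presto’s code isn’t available, here is what the PageRank computation itself looks like as a plain numpy power iteration, the kind of matrix formulation the paper argues array languages express naturally (dense here for clarity; Presto’s contribution is doing this with distributed sparse matrices):

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Power iteration on a column-stochastic transition matrix.
    adj[i, j] = 1 if page j links to page i."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=0)
    out_degree[out_degree == 0] = 1       # guard against division by zero
    transition = adj / out_degree         # normalize each column
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * transition @ rank
    return rank

# Tiny 4-page web: pages 1-3 all link to page 0; page 0 links to page 1.
adj = np.array([[0, 1, 1, 1],
                [1, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)
ranks = pagerank(adj)
print(ranks.argmax())  # page 0 collects the most rank
```

Expressing the algorithm as matrix-vector products is exactly why R-style array languages fit it better than hand-written MapReduce jobs.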

April 17, 2013

List of Machine Learning APIs

Filed under: Machine Learning,Programming — Patrick Durusau @ 3:39 pm

List of Machine Learning APIs

From the post:

Wikipedia defines Machine Learning as “a branch of artificial intelligence that deals with the construction and study of systems that can learn from data.”

Below is a compilation of APIs that have benefited from Machine Learning in one way or another, we truly are living in the future so strap into your rocketship and prepare for blastoff.

Interesting collection.

Worth reviewing for “assists” to human curators working with data.

And a rich hunting ground for head to head competitions against human curated data.

I first saw this at Alex Popescu’s List of Machine Learning APIs.

