Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 5, 2015

Hopping on the Deep Learning Bandwagon

Filed under: Classification,Deep Learning,Machine Learning,Music — Patrick Durusau @ 3:51 pm

Hopping on the Deep Learning Bandwagon by Yanir Seroussi.

From the post:

I’ve been meaning to get into deep learning for the last few years. Now, the stars have finally aligned and I have the time and motivation to work on a small project that will hopefully improve my understanding of the field. This is the first in a series of posts that will document my progress on this project.

As mentioned in a previous post on getting started as a data scientist, I believe that the best way of becoming proficient at solving data science problems is by getting your hands dirty. Despite being familiar with high-level terminology and having some understanding of how it all works, I don’t have any practical experience applying deep learning. The purpose of this project is to fix this experience gap by working on a real problem.

The problem: Inferring genre from album covers

Deep learning has been very successful at image classification. Therefore, it makes sense to work on an image classification problem for this project. Rather than using an existing dataset, I decided to make things a bit more interesting by building my own dataset. Over the last year, I’ve been running BCRecommender – a recommendation system for Bandcamp music. I’ve noticed that album covers vary by genre, though it’s hard to quantify exactly how they vary. So the question I’ll be trying to answer with this project is how accurately can genre be inferred from Bandcamp album covers?

As the goal of this project is to learn about deep learning rather than make a novel contribution, I didn’t do a comprehensive search to see whether this problem has been addressed before. However, I did find a recent post by Alexandre Passant that describes his use of Clarifai’s API to tag the content of Spotify album covers (identifying elements such as men, night, dark, etc.), and then using these tags to infer the album’s genre. Another related project is Karayev et al.’s Recognizing image style paper, in which the authors classified datasets of images from Flickr and Wikipedia by style and art genre, respectively. In all these cases, the results are pretty good, supporting my intuition that the genre inference task is feasible.

Yanir continues this adventure into deep learning with: Learning About Deep Learning Through Album Cover Classification. And you will want to look over his list of Deep Learning Resources.

Yanir’s observation that the goal of the project was “…to learn about deep learning rather than make a novel contribution…” is an important one.

The techniques and lessons you learn may be known to others but they will be new to you.

November 2, 2015

Interactive visual machine learning in spreadsheets

Filed under: Interface Research/Design,Machine Learning,Spreadsheets,Visualization — Patrick Durusau @ 7:59 am

Interactive visual machine learning in spreadsheets by Advait Sarkar, Mateja Jamnik, Alan F. Blackwell, Martin Spott.


BrainCel is an interactive visual system for performing general-purpose machine learning in spreadsheets, building on end-user programming and interactive machine learning. BrainCel features multiple coordinated views of the model being built, explaining its current confidence in predictions as well as its coverage of the input domain, thus helping the user to evolve the model and select training examples. Through a study investigating users’ learning barriers while building models using BrainCel, we found that our approach successfully complements the Teach and Try system [1] to facilitate more complex modelling activities.

To assist users in building machine learning models in spreadsheets:

The user should be able to critically evaluate the quality, capabilities, and outputs of the model. We present “BrainCel,” an interface designed to facilitate this. BrainCel enables the end-user to understand:

  1. How their actions modify the model, through visualisations of the model’s evolution.
  2. How to identify good training examples, through a colour-based interface which “nudges” the user to attend to data where the model has low confidence.
  3. Why and how the model makes certain predictions, through a network visualisation of the k-nearest neighbours algorithm; a simple, consistent way of displaying decisions in an arbitrarily high-dimensional space.
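
To make the third point concrete, here is a minimal sketch of k-nearest neighbours with a confidence value attached to each prediction. The data, the column meanings, and the choice of "fraction of neighbours that agree" as the confidence measure are all assumptions for illustration; the paper may well compute confidence differently.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return (label, confidence) for `query` using k-nearest neighbours.

    `train` is a list of (features, label) pairs. Confidence is taken here
    to be the fraction of the k neighbours that vote for the winning label.
    That is an illustrative choice, not necessarily BrainCel's measure.
    """
    # Sort training rows by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / k


# Hypothetical spreadsheet rows: (height_cm, weight_kg) -> category.
rows = [
    ((150.0, 50.0), "small"),
    ((152.0, 53.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 80.0), "large"),
    ((182.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

confident = knn_predict(rows, (151.0, 51.0))   # deep inside one cluster
uncertain = knn_predict(rows, (167.0, 67.0))   # between the two clusters
print(confident, uncertain)
```

The second query sits between the two groups, so its neighbours disagree and the confidence drops. A row like that is the kind of low-confidence data the colour-based interface would presumably "nudge" the user to label.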

A great example of going where users are spending their time, spreadsheets, as opposed to originating new approaches to data they already possess.

To get a deeper understanding of Sarkar’s approach to users via spreadsheets as an interface, see also:

Spreadsheet interfaces for usable machine learning by Advait Sarkar.


In the 21st century, it is common for people of many professions to have interesting datasets to which machine learning models may be usefully applied. However, they are often unable to do so due to the lack of usable tools for statistical non-experts. We present a line of research into using the spreadsheet — already familiar to end-users as a paradigm for data manipulation — as a usable interface which lowers the statistical and computing knowledge barriers to building and using these models.

Teach and Try: A simple interaction technique for exploratory data modelling by end users by Advait Sarkar, Alan F Blackwell, Mateja Jamnik, Martin Spott.


The modern economy increasingly relies on exploratory data analysis. Much of this is dependent on data scientists – expert statisticians who process data using statistical tools and programming languages. Our goal is to offer some of this analytical power to end-users who have no statistical training through simple interaction techniques and metaphors. We describe a spreadsheet-based interaction technique that can be used to build and apply sophisticated statistical models such as neural networks, decision trees, support vector machines and linear regression. We present the results of an experiment demonstrating that our prototype can be understood and successfully applied by users having no professional training in statistics or computing, and that the experience of interacting with the system leads them to acquire some understanding of the concepts underlying exploratory statistical modelling.

Sarkar doesn’t mention it, but while non-expert users lack skills with machine learning tools, they do have expertise with their own data and domain. That data/domain expertise is more difficult to communicate to an expert user than machine learning techniques are to the non-expert.

Comparison of machine learning expert vs. domain data expert analysis lies in the not too distant and interesting future.

I first saw this in a tweet by Felienne Hermans.

October 30, 2015

Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]

Filed under: Feature Learning,Machine Learning,MySQL — Patrick Durusau @ 3:24 pm

Deep Feature Synthesis: Towards Automating Data Science Endeavors by James Max Kanter and Kalyan Veeramachaneni.


In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submissions score.
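
As a rough illustration of what "follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path" could mean, here is a toy sketch. It is not the authors' Deep Feature Synthesis implementation; the tables, the `customer_id` link, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical relational data: a customers table and an orders table,
# linked by customer_id (the relationship the sketch follows).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 5.0},
]

# Functions applied to the base field once the link has been followed.
AGGREGATES = {"sum": sum, "mean": mean, "max": max, "count": len}


def synthesize_features(parents, children, key, field, child_name):
    """For each parent row, aggregate `field` over its linked child rows."""
    features = {}
    for parent in parents:
        # Follow the relational link from the parent row to its child rows.
        values = [c[field] for c in children if c[key] == parent[key]]
        features[parent[key]] = {
            f"{name}({child_name}.{field})": fn(values)
            for name, fn in AGGREGATES.items()
        }
    return features


features = synthesize_features(customers, orders, "customer_id", "amount", "orders")
print(features[1])
```

Every generated feature in this sketch depends on the `customer_id` link between the two tables already being present in the data.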

The most common phrase I saw in headlines about this paper included some variation on: MIT algorithm replaces human intuition or words to that effect. For example, MIT developing a system that replaces human intuition for big data analysis siliconAngle, An Algorithm May Be Better Than Humans at Breaking Down Big Data Newsweek, Is an MIT algorithm better than human intuition? Christian Science Monitor, and A new AI algorithm can outperform human intuition The World Weekly, just to name a few.

Being the generous sort of reviewer that I am, ;-), I am going to assume that the reporters who wrote about the imperiled status of human intuition either didn’t read the article or were working from a poorly written press release.

The error is not buried in a deeply mathematical or computational part of the paper.

Take a look at the second, fourth and seventh paragraphs of the introduction to see if you can spot the error:

To begin with, we observed that many data science problems, such as the ones released by KAGGLE, and competitions at conferences (KDD cup, IJCAI, ECML) have a few common properties. First, the data is structured and relational, usually presented as a set of tables with relational links. Second, the data captures some aspect of human interactions with a complex system. Third, the presented problem attempts to predict some aspect of human behavior, decisions, or activities (e.g., to predict whether a customer will buy again after a sale [IJCAI], whether a project will get funded by donors [KDD Cup 2014], or even where a taxi rider will choose to go [ECML]). [Second paragraph of introduction]

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition. While recent developments in deep learning and automated processing of images, text, and signals have enabled significant automation in feature engineering for those data types, feature engineering for relational and human behavioral data remains iterative, human-intuition driven, and challenging, and hence, time consuming. At the same time, because the efficacy of a machine learning algorithm relies heavily on the input features [1], any replacement for a human must be able to engineer them acceptably well. [Fourth paragraph of introduction]

With these components in place, we present the Data Science Machine — an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling. Most parameters of the system are optimized automatically, in pursuit of good general purpose performance. [Seventh paragraph of introduction]

Have you spotted the problem yet?

In the second paragraph the authors say:

First, the data is structured and relational, usually presented as a set of tables with relational links.

In the fourth paragraph the authors say:

Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition.

In the seventh paragraph the authors say:

…an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features…

That is the first time I have ever heard relational database tables and links called raw data.

Human intuition was baked into the data by the construction of the relational tables and links between them, before the Data Science Machine was ever given the data.

The Data Science Machine is wholly and solely dependent upon the human intuition already baked into the relational database data to work at all.

The researchers say as much in the seventh paragraph, unless you think data spontaneously organizes itself into relational tables. Spontaneous relational tables?

If you doubt that human intuition (decision making) is involved in the creation of relational tables, take a quick look at: A Quick-Start Tutorial on Relational Database Design.

This isn’t to take anything away from Kanter and Veeramachaneni. Their Data Science Machine builds upon human intuition captured in relational databases. That is no mean feat. Human intuition should be captured and used to augment machine learning whenever possible.

That isn’t the same as “replacing” human intuition.

PS: Please forward to any news outlet/reporter who has been repeating false information about “deep feature synthesis.”

I first saw this in a tweet by Kirk Borne.

October 29, 2015

How to build and run your first deep learning network

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 3:30 pm

How to build and run your first deep learning network by Pete Warden.

From the post:

When I first became interested in using deep learning for computer vision I found it hard to get started. There were only a couple of open source projects available, they had little documentation, were very experimental, and relied on a lot of tricky-to-install dependencies. A lot of new projects have appeared since, but they’re still aimed at vision researchers, so you’ll still hit a lot of the same obstacles if you’re approaching them from outside the field.

In this article — and the accompanying webcast — I’m going to show you how to run a pre-built network, and then take you through the steps of training your own. I’ve listed the steps I followed to set up everything toward the end of the article, but because the process is so involved, I recommend you download a Vagrant virtual machine that I’ve pre-loaded with everything you need. This VM lets us skip over all the installation headaches and focus on building and running the neural networks.

I have been unable to find the posts that were to follow in this series.

Even by itself this will be enough to get you going on deep learning but the additional posts would be nice.

Pointers anyone?

October 28, 2015

Model-Based Machine Learning

Filed under: Machine Learning,Modeling — Patrick Durusau @ 8:36 pm

Model-Based Machine Learning by John Winn and Christopher Bishop with Thomas Diethe.

From How can machine learning solve my problem? (first chapter):

In this book we look at machine learning from a fresh perspective which we call model-based machine learning. This viewpoint helps to address all of these challenges, and makes the process of creating effective machine learning solutions much more systematic. It is applicable to the full spectrum of machine learning techniques and application domains, and will help guide you towards building successful machine learning solutions without requiring that you master the huge literature on machine learning.

The core idea at the heart of model-based machine learning is that all the assumptions about the problem domain are made explicit in the form of a model. In fact a model is just made up of this set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable. For example, in the next chapter we build a model to help us solve a simple murder mystery. The assumptions of the model include the list of suspected culprits, the possible murder weapons, and the tendency for particular weapons to be preferred by different suspects. This model is then used to create a model-specific algorithm to solve the specific machine learning problem. Model-based machine learning can be applied to pretty much any problem, and its general-purpose approach means you don’t need to learn a huge number of machine learning algorithms and techniques.

So why do the assumptions of the model play such a key role? Well it turns out that machine learning cannot generate solutions purely from data alone. There are always assumptions built into any algorithm, although usually these assumptions are far from explicit. Different algorithms correspond to different sets of assumptions, and when the assumptions are implicit the only way to decide which algorithm is likely to give the best results is to compare them empirically. This is time-consuming and inefficient, and it requires software implementations of all of the algorithms being compared. And if none of the algorithms tried gives good results it is even harder to work out how to create a better algorithm.
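
The murder mystery mentioned in the excerpt lends itself to a small worked example. The sketch below writes the assumptions down explicitly (who the suspects are, how likely each is beforehand, which weapon each tends to prefer) and lets Bayes' rule do the rest. All names and numbers are hypothetical and are not taken from the book.

```python
# Explicit model assumptions; every name and number here is hypothetical.

# Prior belief about who the culprit is: P(suspect).
prior = {"suspect_a": 0.4, "suspect_b": 0.6}

# Tendency of each suspect to prefer each weapon: P(weapon | suspect).
weapon_given_suspect = {
    "suspect_a": {"revolver": 0.8, "dagger": 0.2},
    "suspect_b": {"revolver": 0.3, "dagger": 0.7},
}


def posterior(observed_weapon):
    """Bayes' rule: P(suspect | weapon) is proportional to
    P(weapon | suspect) * P(suspect), normalised over all suspects."""
    joint = {
        s: prior[s] * weapon_given_suspect[s][observed_weapon] for s in prior
    }
    evidence = sum(joint.values())
    return {s: p / evidence for s, p in joint.items()}


belief = posterior("revolver")
print(belief)
```

Changing an assumption (a different prior, a different weapon preference) changes the answer in a way you can trace, which is the point of making the assumptions explicit.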

Four chapters are complete now and four more are coming.

Not a fast read but has a great deal of promise, particularly if readers are honest about their assumptions when modeling problems.

It is an opportunity to examine your assumptions about data in your organization and assumptions about your organization. Those assumptions will have as much if not more impact on your project than assumptions cooked into your machine learning.

October 25, 2015

What a Deep Neural Network thinks about your #selfie

Filed under: Image Processing,Image Recognition,Machine Learning,Neural Networks — Patrick Durusau @ 8:02 pm

What a Deep Neural Network thinks about your #selfie by Andrej Karpathy.

From the post:

Convolutional Neural Networks are great: they recognize things, places and people in your personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things. But once in a while these powerful visual recognition models can also be warped for distraction, fun and amusement. In this fun experiment we’re going to do just that: We’ll take a powerful, 140-million-parameter state-of-the-art Convolutional Neural Network, feed it 2 million selfies from the internet, and train it to classify good selfies from bad ones. Just because it’s easy and because we can. And in the process we might learn how to take better selfies 🙂

A must read for anyone interested in deep neural networks and image recognition!

Selfies provide abundant and amusing data to illustrate neural network techniques that are being used every day.

Andrej provides numerous pointers to additional materials and references on neural networks. Good thing considering how much interest his post is going to generate!

October 10, 2015

AI vs. Taxpayer (so far, taxpayer wins)

Filed under: Evolutionary,Genetic Algorithms,Machine Learning — Patrick Durusau @ 7:19 am

Computer Scientists Wield Artificial Intelligence to Battle Tax Evasion by Lynnley Browning.

From the post:

When federal authorities want to ferret out abusive tax shelters, they send an army of forensic accountants, auditors and lawyers to burrow into suspicious tax returns.

Analyzing mountains of filings and tracing money flows through far-flung subsidiaries is notoriously difficult; even if the Internal Revenue Service manages to unravel a major scheme, it typically does so only years after its emergence, by which point a fresh dodge has often already replaced it.

But what if that needle-in-a-haystack quest could be done routinely, and quickly, by a computer? Could the federal tax laws — 74,608 pages of legal gray areas and welters of credits, deductions and exemptions — be accurately rendered in an algorithm?

“We see the tax code as a calculator,” said Jacob Rosen, a researcher at the Massachusetts Institute of Technology who focuses on the abstract representation of financial transactions and artificial intelligence techniques. “There are lots of extraordinarily smart people who take individual parts of the tax code and recombine them in complex transactions to construct something not intended by the law.”

A recent paper by Mr. Rosen and four other computer scientists — two others from M.I.T. and two at the Mitre Corporation, a nonprofit technology research and development organization — demonstrated how an algorithm could detect a certain type of known tax shelter used by partnerships.

I had to chuckle when I read:

“There are lots of extraordinarily smart people who take individual parts of the tax code and recombine them in complex transactions to construct something not intended by the law.”

It would be more accurate to say: “…something not intended by the tax policy wonks at the IRS.”

Or as Justice Sutherland said in Gregory v. Helvering (1935):

The legal right of a taxpayer to decrease the amount of what otherwise would be his taxes, or altogether to avoid them, by means which the law permits, cannot be doubted.

Gregory v. Helvering isn’t much comfort because Sutherland also found against the taxpayer in that case on a “not intended by the law” basis.

If you read the paper, though, you will realize taxpayers are still well ahead of any AI:

Drawbacks are that currently SCOTE has a very simplified view of transactions, audit points and law.

Should we revisit the Turing test?

Perhaps a series of tax code tests, 1040A, 1040 long form, corporate reorganization, each one more complex than the one before.

Pitch the latest AIs against tax professionals?

October 9, 2015

Machine Learning for Developers

Filed under: Machine Learning,Programming — Patrick Durusau @ 10:51 am

Machine Learning for Developers by Mike de Waard.

From the webpage:

Most developers these days have heard of machine learning, but when trying to find an ‘easy’ way into this technique, most people find themselves getting scared off by the abstractness of the concept of Machine Learning and terms as regression, unsupervised learning, Probability Density Function and many other definitions. If one switches to books there are books such as An Introduction to Statistical Learning with Applications in R and Machine Learning for Hackers who use programming language R for their examples.

However R is not really a programming language in which one writes programs for everyday use such as is done with for example Java, C#, Scala etc. This is why in this blog machine learning will be introduced using Smile, a machine learning library that can be used both in Java and Scala. These are languages that most developers have seen at least once during their study or career.

The first section ‘The global idea of machine learning’ contains all important concepts and notions you need to know about to get started with the practical examples that are described in the section ‘Practical Examples’. The section practical examples is inspired by the examples from the book Machine Learning for Hackers. Additionally the book Machine Learning in Action was used for validation purposes.

The second section Practical examples contains examples for various machine learning (ML) applications, with Smile as ML library.

Note that in this blog, ‘new’ definitions are hyperlinked such that if you want, you can read more regarding that specific topic, but you are not obliged to do this in order to be able to work through the examples.

A great resource for developers who need an introduction to machine learning.

But an “introduction only.” The practical examples are quite useful but there are only seven (7) of them.

If you like this, look at the resources Grant Ingersoll has collected at: Getting started with open source machine learning and Andrew Ng’s Machine Learning online course in particular.

The nuances of data that can “fool” or lead to unexpected results from machine learning algorithms appear to be largely unexplored or at least not widely discussed.

As machine learning becomes more prevalent, assisting users in obtaining expected answers is going to be a very marketable skill.

September 26, 2015

Writing “Python Machine Learning”

Filed under: Books,Machine Learning,Python — Patrick Durusau @ 8:46 pm

Writing “Python Machine Learning” by Sebastian Raschka.

From the post:

It’s been about time. I am happy to announce that “Python Machine Learning” was finally released today! Sure, I could just send an email around to all the people who were interested in this book. On the other hand, I could put down those 140 characters on Twitter (minus what it takes to insert a hyperlink) and be done with it. Even so, writing “Python Machine Learning” really was quite a journey for a few months, and I would like to sit down in my favorite coffeehouse once more to say a few words about this experience.

A delightful tale for those of us who have authored books and an inspiration (with some practical suggestions) for anyone who hopes to write a book.

Sebastian’s productivity hints will ring familiar for those with similar habits and bear study by those who hope to become more productive.

Sebastian never comes out and says it but his writing approach breaks each stage of the book into manageable portions. It is far easier to say (and do) “write an outline” than to “write the complete and fixed outline for an almost 500 page book.”

If the task is too large (the complete and immutable outline), you won’t build up enough momentum to make a reasonable start.

After reading Sebastian’s post, what book are you thinking about writing?

September 22, 2015

Coursera Specialization in Machine Learning:…

Filed under: Machine Learning — Patrick Durusau @ 7:29 pm

Coursera Specialization in Machine Learning: A New Way to Learn Machine Learning by Emily Fox.

From the post:

Machine learning is transforming how we experience the world as intelligent applications have become more pervasive over the past five years. Following this trend, there is an increasing demand for ML experts. To help meet this demand, Carlos and I were excited to team up with our colleagues at the University of Washington and Dato to develop a Coursera specialization in Machine Learning. Our goal is to avoid the standard prerequisite-heavy approach used in other ML courses. Instead, we motivate concepts through intuition and real-world applications, and solidify concepts with a very hands-on approach. The result is a self-paced, online program targeted at a broad audience and offered through Coursera with the first course available today.

Change how people learn about machine learning?

Do they mean to depart from simply replicating static textbook content in another medium?

Horrors! (NOT!)

Education has been evolving since the earliest days online and will continue to do so.

Still, it is encouraging to see people willing to admit to being different.


I first saw this in a tweet by Dato.

September 21, 2015

Python & R codes for Machine Learning

Filed under: Machine Learning,Python,R — Patrick Durusau @ 7:54 pm

While I am thinking about machine learning, I wanted to mention: Cheatsheet – Python & R codes for common Machine Learning Algorithms by Manish Saraswat.

From the post:

In his famous book – Think and Grow Rich, Napolean Hill narrates story of Darby, who after digging for a gold vein for a few years walks away from it when he was three feet away from it!

Now, I don’t know whether the story is true or false. But, I surely know of a few Data Darby around me. These people understand the purpose of machine learning, its execution and use just a set 2 – 3 algorithms on whatever problem they are working on. They don’t update themselves with better algorithms or techniques, because they are too tough or they are time consuming.

Like Darby, they are surely missing from a lot of action after reaching this close! In the end, they give up on machine learning by saying it is very computation heavy or it is very difficult or I can’t improve my models above a threshold – what’s the point? Have you heard them?

Today’s cheat sheet aims to change a few Data Darby’s to machine learning advocates. Here’s a collection of 10 most commonly used machine learning algorithms with their codes in Python and R. Considering the rising usage of machine learning in building models, this cheat sheet is good to act as a code guide to help you bring these machine learning algorithms to use. Good Luck!

Here’s a very good idea! Whether you want to learn these algorithms or a new Emacs mode. 😉

Sure, you can always look up the answer but that breaks your chain of thought, over and over again.


Machine-Learning-Cheat-Sheet [Cheating Machine Learning?]

Filed under: Machine Learning — Patrick Durusau @ 7:25 pm

Machine-Learning-Cheat-Sheet by Frank Dai.

From the Preface:

This cheat sheet contains many classical equations and diagrams on machine learning, which will help you quickly recall knowledge and ideas in machine learning.

This cheat sheet has three significant advantages:

1. Strong typed. Compared to programming languages, mathematical formulas are weakly typed. For example, X can be a set, a random variable, or a matrix. This causes difficulty in understanding the meaning of formulas. In this cheat sheet, I try my best to standardize symbols used, see section §.

2. More parentheses. In machine learning, authors are prone to omit parentheses, brackets and braces, this usually causes ambiguity in mathematical formulas. In this cheat sheet, I use parentheses(brackets and braces) at where they are needed, to make formulas easy to understand.

3. Less thinking jumps. In many books, authors are prone to omit some steps that are trivial in his option. But it often makes readers get lost in the middle way of derivation.

Two other advantages of this “cheat-sheet” are that it resides on Github and is written using the Springer LaTeX template.

Neural networks can be easily fooled (see Deep Neural Networks are Easily Fooled:…), so the question becomes: how easy is it to fool the machine learning algorithms summarized by Frank Dai?

Or to put it another way, if I know the machine learning algorithm most likely to be used, what steps, if any, can I take to shape data to influence the likely outcome?

Excluding outright false data because that would be too easily detected and possibly trip too many alarms.

The more you know about how an algorithm can be cheated, the safer you will be in evaluating the machine learning results of others.
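
As a small sketch of what "shaping" data might look like, assume the algorithm is a plain linear classifier and that its weights are known. Both are strong assumptions, and the model and data point below are hypothetical. Each feature is moved by a small amount in whichever direction raises the score, and the prediction flips even though no single value changes much.

```python
def score(weights, bias, x):
    """Linear decision score; the predicted class is 1 when the score is positive."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def nudge(weights, x, epsilon):
    """Shift each feature by at most `epsilon` in the direction that raises the score."""
    return [
        xi + epsilon * (1 if w > 0 else -1 if w < 0 else 0)
        for w, xi in zip(weights, x)
    ]


# Hypothetical model and data point, for illustration only.
weights, bias = [2.0, -1.0, 0.5], -1.0
original = [0.5, 0.6, 0.4]
shaped = nudge(weights, original, epsilon=0.2)

before = score(weights, bias, original)  # negative: class 0
after = score(weights, bias, shaped)     # positive: class 1
print(before, after)
```

The total shift in the score is `epsilon` times the sum of the absolute weights, so the more features a model weighs, the less each one has to move.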

I first saw this in a tweet by Kirk Borne.

September 20, 2015

Announcing Spark 1.5

Filed under: Machine Learning,Spark — Patrick Durusau @ 7:13 pm

Announcing Spark 1.5 by Reynold Xin and Patrick Wendell.

From the post:

Today we are happy to announce the availability of Apache Spark’s 1.5 release! In this post, we outline the major development themes in Spark 1.5 and some of the new features we are most excited about. In the coming weeks, our blog will feature more detailed posts on specific components of Spark 1.5. For a comprehensive list of features in Spark 1.5, you can also find the detailed Apache release notes below.

Many of the major changes in Spark 1.5 are under-the-hood changes to improve Spark’s performance, usability, and operational stability. Spark 1.5 ships major pieces of Project Tungsten, an initiative focused on increasing Spark’s performance through several low-level architectural optimizations. The release also adds operational features for the streaming component, such as backpressure support. Another major theme of this release is data science: Spark 1.5 ships several new machine learning algorithms and utilities, and extends Spark’s new R API.

One interesting tidbit is that in Spark 1.5, we have crossed the 10,000 mark for JIRA number (i.e. more than 10,000 tickets have been filed to request features or report bugs). Hopefully the added digit won’t slow down our development too much!

It’s time to upgrade your Spark installation again!


September 19, 2015

10 Misconceptions about Neural Networks [Update to car numberplate game?]

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 8:20 pm

10 Misconceptions about Neural Networks by Stuart Reid.

From the post:

Neural networks are one of the most popular and powerful classes of machine learning algorithms. In quantitative finance neural networks are often used for time-series forecasting, constructing proprietary indicators, algorithmic trading, securities classification and credit risk modelling. They have also been used to construct stochastic process models and price derivatives. Despite their usefulness neural networks tend to have a bad reputation because their performance is “temperamental”. In my opinion this can be attributed to poor network design owing to misconceptions regarding how neural networks work. This article discusses some of those misconceptions.

The car numberplate game was a game where passengers in a car, usually children, would compete to find license plates from different states (in the US). That was prior to children being entombed in intellectual isolation bubbles with iPads, Gameboys, DVD players and wireless access, while riding.

Hard to believe but some people used to look outside the vehicle in which they were riding. Now of course what little attention they have is captured by cellphones and not other occupants of the same vehicle.

Rather than rail against that trend, may I suggest we update the car numberplate game to “mistakes about neural networks?”

Using Stuart’s post as a baseline, send a text message to each passenger pointing to Stuart’s post and requesting a count of the number of “mistakes about neural networks” they can find in an hour.

Personally I would put popular media off limits for post-high school players to keep the scores under four digits.

When discussing the scores, after sharing browsing histories, each player has to analyze the claimed error and match it to one on Stuart’s list.

I realize that will require full bandwidth communication with others in your physical presence but with practice, that won’t seem so terribly odd.

I first saw this in a tweet by Kirk Borne.

September 14, 2015

Getting started with open source machine learning

Filed under: Machine Learning,Open Source — Patrick Durusau @ 8:45 pm

Getting started with open source machine learning by Grant Ingersoll.

From the post:

Despite all the flashy headlines from Musk and Hawking on the impending doom to be visited on us mere mortals by killer robots from the skies, machine learning and artificial intelligence are here to stay. More importantly, machine learning (ML) is quickly becoming a critical skill for developers to enhance their applications and their careers, better understand data, and to help users be more effective.

What is machine learning? It is the use of both historical and current data to make predictions, organize content, and learn patterns about data without being explicitly programmed to do so. This is typically done using statistical techniques that look for significant events like co-occurrences and anomalies in the data and then factoring in their likelihood into a model that is queried at a later time to provide a prediction for some new piece of data.

Common machine learning tasks include classification (applying labels to items), clustering (grouping items automatically), and topic detection. It is also commonly used in natural language processing. Machine learning is increasingly being used in a wide variety of use cases, including content recommendation, fraud detection, image analysis and ecommerce. It is useful across many industries and most popular programming languages have at least one open source library implementing common ML techniques.

Reflecting the broader push in software towards open source, there are now many vibrant machine learning projects available to experiment with as well as a plethora of books, articles, tutorials, and videos to get you up to speed. Let’s look at a few projects leading the way in open source machine learning and a few primers on related ML terminology and techniques.

Grant rounds up a starting list of primers and projects if you need an introduction to machine learning.


September 10, 2015

Spark Release 1.5.0

Filed under: Data Frames,GraphX,Machine Learning,R,Spark,Streams — Patrick Durusau @ 1:42 pm

Spark Release 1.5.0

From the post:

Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page.

You can consult JIRA for the detailed changes. We have curated a list of high level changes here:

Time for your Fall Spark Upgrade!


August 9, 2015

Machine Learning and Human Bias: An Uneasy Pair

Filed under: Bias,Machine Learning — Patrick Durusau @ 10:45 am

Machine Learning and Human Bias: An Uneasy Pair by Jason Baldridge.

From the post:

“We’re watching you.” This was the warning that the Chicago Police Department gave to more than 400 people on its “Heat List.” The list, an attempt to identify the people most likely to commit violent crime in the city, was created with a predictive algorithm that focused on factors including, per the Chicago Tribune, “his or her acquaintances and their arrest histories – and whether any of those associates have been shot in the past.”

Algorithms like this obviously raise some uncomfortable questions. Who is on this list and why? Does it take race, gender, education and other personal factors into account? When the prison population of America is overwhelmingly Black and Latino males, would an algorithm based on relationships disproportionately target young men of color?

There are many reasons why such algorithms are of interest, but the rewards are inseparable from the risks. Humans are biased, and the biases we encode into machines are then scaled and automated. This is not inherently bad (or good), but it raises the question: how do we operate in a world increasingly consumed with “personal analytics” that can predict race, religion, gender, age, sexual orientation, health status and much more.

Jason’s post is a refreshing step back from the usual “machine learning isn’t biased like people are,” sort of stance.

Of course machine learning is biased, always biased. The algorithms are biased themselves, to say nothing of the programmers who inexactly converted those algorithms into code. It would not be much of an algorithm if it could not vary its results based on its inputs. That’s discrimination no matter how you look at it.

The difference is that discrimination is acceptable in some cases and not in others. One imagines that only women are eligible for birth control pill prescriptions. That’s a reasonable discrimination. Other bases for discrimination, not so much.

And machine learning is further biased by the data we choose to input to the already biased implementation of a biased algorithm.

That isn’t a knock on machine learning but a caveat: when confronted with a machine learning result, look behind the result to the data, the implementation of the algorithm and the algorithm itself before taking serious action based on the result.

Of course, the first question I would ask is: “Why is this person showing me this result and what do they expect me to do based on it?”

That they are trying to help me on my path to becoming self-actualized isn’t my first reaction.


July 8, 2015

Here’s Why Elon Musk Is Wrong About AI

Filed under: Artificial Intelligence,Machine Learning — Patrick Durusau @ 4:46 pm

Here’s Why Elon Musk Is Wrong About AI by Chris V. Nicholson.

From the post:

Nothing against Elon Musk, but the campaign he’s leading against AI is an unfortunate distraction from the true existential threats to humanity: global warming and nuclear proliferation.

Last year was the hottest year on record. We humans as a whole are just a bunch of frogs in a planet-sized pot of boiling water. We’re cooking ourselves with coal and petroleum, pumping carbon dioxide into the air. Smart robots should be the least of our worries.

Pouring money into AI ethics research is the wrong battle to pick because a) it can’t be won, b) it shouldn’t be fought, and c) to survive, humans must focus on other, much more urgent, issues. In the race to destroy humanity, other threats are much better bets than AI.

Not that I disagree with Nicholson: there are much more important issues to worry about than rogue AI. But his post overlooks one critical aspect of the argument by Musk.

Musk has said to the world that he’s worried about AI and, more importantly, he has $7 Million+ for anyone who worries about it with him.

Your choices are:

  1. Ignore Musk because building an artificial intelligence when we don’t understand human intelligence seems too remote to be plausible, or
  2. Agree with Musk and if you are in a research group, take a chance on a part of $7 Million in grants.

I am firmly in the #1 camp because I have better things to do with my time than attend UFO-type meetings. Unfortunately, there are a lot of people in the #2 camp. It just depends on how much money is being offered.

There are any number of research projects that legitimately push the boundaries of knowledge. Unfortunately the government and others also fund projects that are wealth re-distribution programs for universities, hotels, transportation, meeting facilities and the like.

PS: There is a lot of value in the programs being explored under the misnomer of “artificial intelligence.” I don’t have an alternative moniker to suggest but it needs one.

June 29, 2015

A Critical Review of Recurrent Neural Networks for Sequence Learning

Filed under: Computer Science,Machine Learning,Neural Networks — Patrick Durusau @ 1:03 pm

A Critical Review of Recurrent Neural Networks for Sequence Learning by Zachary C. Lipton.


Countless learning tasks require awareness of time. Image captioning, speech synthesis, and video game playing all require that a model generate sequences of outputs. In other domains, such as time series prediction, video analysis, and music information retrieval, a model must learn from sequences of inputs. Significantly more interactive tasks, such as natural language translation, engaging in dialogue, and robotic control, often demand both.

Recurrent neural networks (RNNs) are a powerful family of connectionist models that capture time dynamics via cycles in the graph. Unlike feedforward neural networks, recurrent networks can process examples one at a time, retaining a state, or memory, that reflects an arbitrarily long context window. While these networks have long been difficult to train and often contain millions of parameters, recent advances in network architectures, optimization techniques, and parallel computation have enabled large-scale learning with recurrent nets.

Over the past few years, systems based on state of the art long short-term memory (LSTM) and bidirectional recurrent neural network (BRNN) architectures have demonstrated record-setting performance on tasks as varied as image captioning, language translation, and handwriting recognition. In this review of the literature we synthesize the body of research that over the past three decades has yielded and reduced to practice these powerful models. When appropriate, we reconcile conflicting notation and nomenclature. Our goal is to provide a mostly self-contained explication of state of the art systems, together with a historical perspective and ample references to the primary research.
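The "retaining a state, or memory" idea in the abstract is easier to see in code. What follows is a minimal sketch of a plain recurrent layer, not anything taken from Lipton's paper: the weight names (W_xh, W_hh, b), the tanh nonlinearity and the toy numbers are all assumptions chosen for illustration.

```python
import math


def rnn_step(x, h, W_xh, W_hh, b):
    """One step of a plain recurrent layer: new_h = tanh(W_xh x + W_hh h + b)."""
    new_h = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h[k] for k in range(len(h)))
        new_h.append(math.tanh(total))
    return new_h


def run_sequence(xs, W_xh, W_hh, b):
    """Feed inputs one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)  # the state starts empty
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h


# Hypothetical toy weights: 1 input unit, 2 hidden units.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]

# The two sequences differ only in their first element.
final_a = run_sequence([[1.0], [0.0], [0.0]], W_xh, W_hh, b)
final_b = run_sequence([[0.0], [0.0], [0.0]], W_xh, W_hh, b)
print(final_a, final_b)
```

The two final states differ even though the last two inputs are identical, which is the memory at work. The LSTM and BRNN architectures surveyed in the paper are considerably more elaborate than this.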

Lipton begins with an all too common lament:

The literature on recurrent neural networks can seem impenetrable to the uninitiated. Shorter papers assume familiarity with a large body of background literature. Diagrams are frequently underspecified, failing to indicate which edges span time steps and which don’t. Worse, jargon abounds while notation is frequently inconsistent across papers or overloaded within papers. Readers are frequently in the unenviable position of having to synthesize conflicting information across many papers in order to understand but one. For example, in many papers subscripts index both nodes and time steps. In others, h simultaneously stands for link functions and a layer of hidden nodes. The variable t simultaneously stands for both time indices and targets, sometimes in the same equation. Many terrific breakthrough papers have appeared recently, but clear reviews of recurrent neural network literature are rare.

Unfortunately, Lipton gives no pointers to where the variant practices occur, leaving the reader forewarned but not forearmed.

Still, this is a survey paper with seventy-three (73) references over thirty-three (33) pages, so I assume you will encounter various notation practices if you follow the references and current literature.

Capturing variations in notation, along with where they have been seen, won’t win the Turing Award but may improve the CS field overall.

June 22, 2015

Learning to Execute

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 3:18 pm

Learning to Execute by Wojciech Zaremba and Ilya Sutskever.


Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.

Code to replicate the experiments:

A step towards generation of code that conforms to coding standards?
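To make the addition task concrete, here is a rough sketch of how character-level training pairs might be generated, easiest first. It is not the authors' code: the function names, the "a+b" string format and the plain easy-to-hard schedule are my assumptions. Note that the abstract reports conventional curriculum learning of this kind was ineffective on its own; the improved variant from the paper is not reproduced here.

```python
import random


def addition_example(num_digits, rng):
    """Return (source, target) strings for adding two num_digits-digit numbers."""
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    a = rng.randint(low, high)
    b = rng.randint(low, high)
    return f"{a}+{b}", str(a + b)


def curriculum_batches(max_digits, batches_per_level, batch_size, seed=0):
    """Yield (num_digits, batch) pairs, moving from short operands to long ones."""
    rng = random.Random(seed)
    for num_digits in range(1, max_digits + 1):
        for _ in range(batches_per_level):
            batch = [addition_example(num_digits, rng) for _ in range(batch_size)]
            yield num_digits, batch


batches = list(curriculum_batches(max_digits=3, batches_per_level=2, batch_size=4))
for num_digits, batch in batches:
    print(num_digits, batch[0])
```

A sequence-to-sequence model would read each source string one character at a time and be trained to emit the target string. Setting max_digits to 9 gives the operand size mentioned in the abstract.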

I first saw this in a tweet by samin.

June 20, 2015

Real-time Trainable Neural Network (on a chip)

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 8:07 pm

Real-time Trainable Neural Network

From the webpage:

The architecture of the CogniMem™ chip makes it the most practical implementation of a Radial Basis Function classifier with autonomous adaptive learning capabilities.

The Radial Basis Function is a classifier capable of representing complex nonlinear decision spaces using hyperspheres with adaptable radii. It is widely used for face recognition and other image recognition applications, function approximation, time series prediction, novelty detection.


The CogniMem Advantage: Upon receipt of an input vector, all the cognitive memories holding a previously learned vector calculate their distance to the input vector and evaluate immediately if it falls in their similarity domain. If so, the “firing” cells are ready to output their response in an orderly fashion giving the way to the cell which holds the smallest distance. If no cell fires and a teaching command is issued, the next available cell automatically learns the vector. Also, if a teaching command conflicts with the category of a firing cell, the latter automatically corrects itself by reducing its influence field.

This autonomous learning and recognition behavior pertains to the unique CogniMem parallel architecture and a patented Search and Sort process.
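The learning rule described above is simple enough to sketch in software. This is a toy illustration under my own assumptions, not CogniMem's implementation: the class name, the Euclidean distance, the max_radius default, and the choice to commit a new cell whenever no firing cell already holds the taught category are all hypothetical.

```python
import math


class HypersphereClassifier:
    """Toy classifier made of cells: hyperspheres with a center, radius and category."""

    def __init__(self, max_radius=5.0):
        self.max_radius = max_radius
        self.cells = []

    def firing_cells(self, vector):
        """Return (distance, cell) pairs for cells containing the vector, nearest first."""
        hits = []
        for cell in self.cells:
            distance = math.dist(cell["center"], vector)
            if distance < cell["radius"]:
                hits.append((distance, cell))
        hits.sort(key=lambda hit: hit[0])
        return hits

    def classify(self, vector):
        """Answer with the category of the nearest firing cell, or None if none fires."""
        hits = self.firing_cells(vector)
        return hits[0][1]["category"] if hits else None

    def learn(self, vector, category):
        """Teach one labeled vector."""
        recognized = False
        for distance, cell in self.firing_cells(vector):
            if cell["category"] == category:
                recognized = True
            else:
                # Conflict: shrink the influence field so the vector falls outside it.
                cell["radius"] = distance
        if not recognized:
            # Commit a new cell (assumption: cap its radius at the distance
            # to the nearest cell of a different category).
            radius = self.max_radius
            for cell in self.cells:
                if cell["category"] != category:
                    radius = min(radius, math.dist(cell["center"], vector))
            self.cells.append(
                {"center": list(vector), "radius": radius, "category": category}
            )


clf = HypersphereClassifier(max_radius=5.0)
clf.learn([0.0, 0.0], "A")
clf.learn([3.0, 0.0], "B")  # conflicts with the "A" cell, which shrinks
print(clf.classify([1.0, 0.0]), clf.classify([2.5, 0.0]), clf.classify([100.0, 100.0]))
```

The chip evaluates all cells in parallel in hardware; the loop here does the same comparisons one after another.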

The website has a wealth of information and modules start at $175 per unit.

I first saw this in a tweet by Kirk Borne.

June 18, 2015

I’m a bird watcher, I’m a bird watcher, here comes one now…

Filed under: Image Recognition,Image Understanding,Machine Learning — Patrick Durusau @ 4:51 pm

New website can identify birds using photos

From the post:

In a breakthrough for computer vision and for bird watching, researchers and bird enthusiasts have enabled computers to achieve a task that stumps most humans—identifying hundreds of bird species pictured in photos.

The bird photo identifier, developed by the Visipedia research project in collaboration with the Cornell Lab of Ornithology, is available for free at:

Results will be presented by researchers from Cornell Tech and the California Institute of Technology at the Computer Vision and Pattern Recognition (CVPR) conference in Boston on June 8, 2015.

Called Merlin Bird Photo ID, the identifier is capable of recognizing 400 of the most commonly encountered birds in the United States and Canada.

“It gets the bird right in the top three results about 90% of the time, and it’s designed to keep improving the more people use it,” said Jessie Barry at the Cornell Lab of Ornithology. “That’s truly amazing, considering that the computer vision community started working on the challenge of bird identification only a few years ago.”
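"Right in the top three results" is a top-k accuracy figure. Here is a small sketch of how such a number is computed, assuming each photo yields a ranked list of guesses. The function name and the bird labels are made up for illustration and have nothing to do with Merlin's actual code or data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of examples whose true label appears among the first k guesses."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need one ranked list per true label")
    if not true_labels:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked guesses for four photos, best guess first.
ranked = [
    ["robin", "sparrow", "finch", "wren"],
    ["crow", "raven", "jay", "grackle"],
    ["heron", "egret", "crane", "ibis"],
    ["wren", "finch", "sparrow", "robin"],
]
truth = ["sparrow", "grackle", "heron", "robin"]

print(top_k_accuracy(ranked, truth, k=1))
print(top_k_accuracy(ranked, truth, k=3))
```

Top-3 accuracy is always at least as high as top-1 accuracy, so a "90% in the top three" claim says less than "90% correct" would.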

The perfect website for checking photos of birds taken on summer vacation and an impressive feat of computer vision.

The more the service is used, the better it gets. Upload your vacation bird pics today!

June 3, 2015

Neural Networks and Deep Learning

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 2:02 pm

Neural Networks and Deep Learning by Michael Nielsen.

From the webpage:

Neural Networks and Deep Learning is a free online book. The book will teach you about:

  • Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data
  • Deep learning, a powerful set of techniques for learning in neural networks

Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. This book will teach you the core concepts behind neural networks and deep learning.

The book is currently an incomplete beta draft. More chapters will be added over the coming months. For now, you can:

Michael starts off with a task that we all mastered as small children, recognizing hand written digits. Along the way, you will learn not just the mechanics of how the characters are recognized but why neural networks work the way they do.

Great introductory material to pass along to a friend.

June 2, 2015

Data Science on Spark

Filed under: BigData,Machine Learning,Spark — Patrick Durusau @ 2:43 pm

Databricks Launches MOOC: Data Science on Spark by Ameet Talwalkar and Anthony Joseph.

From the post:

For the past several months, we have been working in collaboration with professors from the University of California Berkeley and University of California Los Angeles to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs will launch in June on the edX platform!

The first course, called Introduction to Big Data with Apache Spark, begins today [June 1, 2015] and teaches students about Apache Spark and performing data analysis. The second course, called Scalable Machine Learning, will begin on June 29th and will introduce the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines, and provides hands-on experience using Spark. Both courses will be freely available on the edX MOOC platform, and edX Verified Certificates are also available for a fee.

Both courses are available for free on the edX website, and you can sign up for them today:

  1. Introduction to Big Data with Apache Spark
  2. Scalable Machine Learning

It is our mission to enable data scientists and engineers around the world to leverage the power of Big Data, and an important part of this mission is to educate the next generation.

If you believe in the wisdom of crowds, some 80K students had enrolled as of yesterday.

So, what are you waiting for?


May 30, 2015

Announcing KeystoneML

Filed under: Government,Machine Learning,Spark — Patrick Durusau @ 6:23 pm

Announcing KeystoneML

From the post:

We’ve written about machine learning pipelines in this space in the past. At the AMPLab Retreat this week, we released (live, on stage!) KeystoneML, a software framework designed to simplify the construction of large scale, end-to-end, machine learning pipelines in Apache Spark. KeystoneML is alpha software, but we’re releasing it now to get feedback from users and to collect more use cases.

Included in the package is a type-safe API for building robust pipelines and example operators used to construct them in the domains of natural language processing, computer vision, and speech. Additionally, we’ve included and linked to several scalable and robust statistical operators and machine learning algorithms which can be reused by many workflows.

Also included in the code are several example pipelines that demonstrate how to use the software to reproduce recent academic results in computer vision, natural language processing, and speech processing….

In case you don’t have plans for the rest of the weekend! 😉

Emmett McQuinn’s post, Amazon Machine Learning is not for your average developer – yet, is worth keeping in mind, but that doesn’t mean you have to remain an “average” developer.

You can wait for a cookie cutter solution from Amazon or you can get ahead of the curve. Your call.

Web Page Structure, Without The Semantic Web

Could a Little Startup Called Diffbot Be the Next Google?

From the post:

Diffbot founder and CEO Mike Tung started the company in 2009 to fix a problem: there was no easy, automated way for computers to understand the structure of a Web page. A human looking at a product page on an e-commerce site, or at the front page of a newspaper site, knows right away which part is the headline or the product name, which part is the body text, which parts are comments or reviews, and so forth.

But a Web-crawler program looking at the same page doesn’t know any of those things, since these elements aren’t described as such in the actual HTML code. Making human-readable Web pages more accessible to software would require, as a first step, a consistent labeling system. But the only such system to be seriously proposed, Tim Berners-Lee’s Semantic Web, has long floundered for lack of manpower and industry cooperation. It would take a lot of people to do all the needed markup, and developers around the world would have to adhere to the Resource Description Framework prescribed by the World Wide Web Consortium.

Tung’s big conceptual leap was to dispense with all that and attack the labeling problem using computer vision and machine learning algorithms—techniques originally developed to help computers make sense of edges, shapes, colors, and spatial relationships in the real world. Diffbot runs virtual browsers in the cloud that can go to a given URL; suck in the page’s HTML, scripts, and style sheets; and render it just as it would be shown on a desktop monitor or a smartphone screen. Then edge-detection algorithms and computer-vision routines go to work, outlining and measuring each element on the page.

Using machine-learning techniques, this geometric data can then be compared to frameworks or “ontologies”—patterns distilled from training data, usually by humans who have spent time drawing rectangles on Web pages, painstakingly teaching the software what a headline looks like, what an image looks like, what a price looks like, and so on. The end result is a marked-up summary of a page’s important parts, built without recourse to any Semantic Web standards.

The irony here, of course, is that much of the information destined for publication on the Web starts out quite structured. The WordPress content-management system behind Xconomy’s site, for example, is built around a database that knows exactly which parts of this article should be presented as the headline, which parts should look like body text, and (crucially, to me) which part is my byline. But these elements get slotted into a layout designed for human readability—not for parsing by machines. Given that every content management system is different and that every site has its own distinctive tags and styles, it’s hard for software to reconstruct content types consistently based on the HTML alone.

There are several themes here that are relevant to topic maps.

First, it is true that most data starts with some structure, styles if you will, before it is presented for user consumption. Imagine an authoring application that automatically, and unknown to its user, records metadata that can then provide semantics for its data.

Second, the structure recognition approach used by Diffbot is promising in the large and should be promising in the small as well. Local documents of a particular type are unlikely to have the variance of documents across the web, meaning that with far less effort you can build recognition systems that enable more powerful searching of local document repositories.

Third, and perhaps most importantly, while the results may not be 100% accurate, the question for any such project should be: how much accuracy is required? If I am mining social commentary blogs, a 5% error rate on recognition of speakers might be acceptable, because for popular threads or speakers, those errors are going to be quickly corrected. As for unpopular threads or authors no one follows, do errors there come under no harm/no foul?

Highly recommended for reading/emulation.

May 22, 2015

Deep Learning (MIT Press Book) – Update

Filed under: Deep Learning,Machine Learning — Patrick Durusau @ 8:17 pm

Deep Learning (MIT Press Book) by Yoshua Bengio, Ian Goodfellow and Aaron Courville.

I last mentioned this book last August and wanted to point out that a new draft appeared on 19/05/2015.

Typos and opportunities for improvement still exist! Now is your chance to help the authors make this a great book!


The Unreasonable Effectiveness of Recurrent Neural Networks

Filed under: Machine Learning,Neural Networks — Patrick Durusau @ 8:08 pm

The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.

From the post:

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”

By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?

I try to blog or reblog about worthy posts by others but every now and again, I encounter a post that is stunning in its depth and usefulness.

This post by Andrej Karpathy is one of the stunning ones.

In addition to covering RNNs in general, he takes the reader on a tour of “Fun with RNNs.”

Which covers the application of RNNs to:

  • A Paul Graham generator
  • Shakespeare
  • Wikipedia
  • Algebraic Geometry (Latex)
  • Linux Source Code

Along with source code, Andrej provides a list of further reading.

What’s your example of using RNNs?

May 21, 2015

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

Filed under: Humor,Machine Learning,Music — Patrick Durusau @ 2:01 pm

Machine-Learning Algorithm Mines Rap Lyrics, Then Writes Its Own

From the post:

The ancient skill of creating and performing spoken rhyme is thriving today because of the inexorable rise in the popularity of rapping. This art form is distinct from ordinary spoken poetry because it is performed to a beat, often with background music.

And the performers have excelled. Adam Bradley, a professor of English at the University of Colorado has described it in glowing terms. Rapping, he says, crafts “intricate structures of sound and rhyme, creating some of the most scrupulously formal poetry composed today.”

The highly structured nature of rap makes it particularly amenable to computer analysis. And that raises an interesting question: if computers can analyze rap lyrics, can they also generate them?

Today, we get an affirmative answer thanks to the work of Eric Malmi at the University of Aalto in Finland and a few pals. These guys have trained a machine-learning algorithm to recognize the salient features of a few lines of rap and then choose another line that rhymes in the same way on the same topic. The result is an algorithm that produces rap lyrics that rival human-generated ones for their complexity of rhyme.

The review is a fun read but I rather like the original paper title as well: DopeLearning: A Computational Approach to Rap Lyrics Generation by Eric Malmi, Pyry Takala, Hannu Toivonen, Tapani Raiko, Aristides Gionis.


Writing rap lyrics requires both creativity, to construct a meaningful and an interesting story, and lyrical skills, to produce complex rhyme patterns, which are the cornerstone of a good flow. We present a method for capturing both of these aspects. Our approach is based on two machine-learning techniques: the RankSVM algorithm, and a deep neural network model with a novel structure. For the problem of distinguishing the real next line from a randomly selected one, we achieve an 82 % accuracy. We employ the resulting prediction method for creating new rap lyrics by combining lines from existing songs. In terms of quantitative rhyme density, the produced lyrics outperform best human rappers by 21 %. The results highlight the benefit of our rhyme density metric and our innovative predictor of next lines.

You should also visit BattleBot (a rap engine):

BattleBot is a rap engine which allows you to “spit” any line that comes to your mind after which it will respond to you with a selection of rhyming lines found among 0.5 million lines from existing rap songs. The engine is based on a slightly improved version of the Raplyzer algorithm and the eSpeak speech synthesizer.

You can try out BattleBot simply by hitting “Spit” or “Random”. The latter will randomly pick a line among the whole database of lines and find the rhyming lines for that. The underlined part shows approximately the rhyming part of a result. To understand better, why it’s considered as a rhyme, you can click on the result, see the phonetic transcriptions of your line and the result, and look for matching vowel sequences starting from the end.
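The "matching vowel sequences starting from the end" idea can be sketched in a few lines. Be warned that this is a crude stand-in and not the Raplyzer algorithm: BattleBot compares phonetic transcriptions from eSpeak, while the sketch below compares written vowels, which misjudges plenty of English rhymes. The function names are my own.

```python
def vowel_sequence(line):
    """Return the written vowels of a line, in order (a stand-in for phonemes)."""
    return [ch for ch in line.lower() if ch in "aeiou"]


def matching_vowels_from_end(line_a, line_b):
    """Count how many vowels match when both lines are read backwards from the end."""
    count = 0
    for va, vb in zip(reversed(vowel_sequence(line_a)), reversed(vowel_sequence(line_b))):
        if va != vb:
            break
        count += 1
    return count


print(matching_vowels_from_end("the cat in the hat", "he sat on the mat"))
print(matching_vowels_from_end("hello world", "night sky"))
```

A longer match suggests a longer rhyme, which is roughly what the underlined part of a BattleBot result is showing. Swapping vowel_sequence for a real phonetic transcription would be the obvious next step.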

BTW, the MIT review concludes with:

What’s more, this and other raps generated by DeepBeat have a rhyming density significantly higher than any human rapper. “DeepBeat outperforms the top human rappers by 21% in terms of length and frequency of the rhymes in the produced lyrics,” they point out.

I can’t help but wonder when DeepBeat is going to hit the charts! 😉

May 20, 2015

H2O 3.0

Filed under: H20,Machine Learning — Patrick Durusau @ 3:03 pm

H2O 3.0

From the webpage:

Why H2O?

H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFast™ Scoring Engine.

Get H2O!

What is H2O?

H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

  • Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like Apache™ Hadoop® and Spark™ to give customers the flexibility to solve their most challenging data problems.
  • Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
  • Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere
  • Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
  • Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Note the caveat near the bottom of the page:

With H2O, you can:

  • Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
  • Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

The operative word being “can.” Your results with H2O depend upon your knowledge of machine learning, knowledge of your data and the effort you put into using H2O, among other things.
