Archive for the ‘H20’ Category

H2O 3.0

Wednesday, May 20th, 2015

H20 3.0

From the webpage:

Why H2O?

H2O is for data scientists and business analysts who need scalable and fast machine learning. H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use. H2O speaks the language of data science with support for R, Python, Scala, Java and a robust REST API. Smart business applications are powered by H2O’s NanoFastTM Scoring Engine.

Get H2O!

What is H2O?

H2O makes it possible for anyone to easily apply math and predictive analytics to solve today’s most challenging business problems. It intelligently combines unique features not currently found in other machine learning platforms including:

  • Best of Breed Open Source Technology – Enjoy the freedom that comes with big data science powered by OpenSource technology. H2O leverages the most popular OpenSource products like ApacheTM Hadoop® and SparkTM to give customers the flexibility to solve their most challenging data problems.
  • Easy-to-use WebUI and Familiar Interfaces – Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environ- ments like R, Java, Scala, Python, JSON, and through our powerful APIs.
  • Data Agnostic Support for all Common Database and File Types – Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere
  • Massively Scalable Big Data Analysis – Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
  • Real-time Data Scoring – Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Note the caveat near the bottom of the page:

With H2O, you can:

  • Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
  • Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

The operative word being “can.” Your results with H2O depend upon your knowledge of machine learning, knowledge of your data and the effort you put into using H2O, among other things.

H2O World 2014

Friday, January 2nd, 2015

H2O World 2014

From the H2O homepage:

H2O is for data scientists and application developers who need fast, in-memory scalable machine learning for smarter applications. H2O is an open source parallel processing engine for machine learning. Unlike traditional analytics tools, H2O provides a combination of extraordinary math, a high performance parallel architecture, and unrivaled ease of use.

Videos and docs from two days of presentations on H2O.

I first saw this in Video: H2O Talks by Trevor Hastie and John Chambers by Joseph Rickert.

Apache Mahout, “…Ya Gotta Hit The Road”

Thursday, March 27th, 2014

The news in Derrick Harris’ “Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce” reminded of a line from Tommy, “Just as the gypsy queen must do, ya gotta hit the road.”

From the post:

Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

H2O was developed separately by a startup called 0xadata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of types of statisical computations — including deep learning models — on data stored in the Hadoop Distributed File System.

Support for multiple data frameworks is yet another reason to learn Mahout.

0xdata Releases Second Generation H2O…

Saturday, October 26th, 2013

0xdata Releases Second Generation H2O, Big Data’s Fastest Open Source Machine Learning and Predictive Analytics Engine

From the post:

0xdata (, the open source machine learning and predictive analytics company for big data, today announced general availability of the latest release of H2O, the industry’s fastest prediction engine for big data users of Hadoop, R and Excel. H2O delivers parallel and distributed advanced algorithms on big data at speeds up to 100X faster than other predictive analytics providers.

The second generation H2O “Fluid Vector” release — currently in use at two of the largest insurance companies in the world, the largest provider of streaming video entertainment and the largest online real estate services company — delivers new levels of performance, ease of use and integration with R. Early H2O customers include Netflix, Trulia and Vendavo.

“We developed H2O to unlock the predictive power of big data through better algorithms,” said SriSatish Ambati, CEO and co-founder of 0xdata. “H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world.”

“Big data by itself is useless. It is only when you have big data plus big analytics that one has the capability to achieve big business impact. H2O is the platform for big analytics that we have found gives us the biggest advantage compared with other alternatives,” said Chris Pouliot, Director of Algorithms and Analytics at Netflix and advisor to 0xdata. “Our data scientists can build sophisticated models, minimizing their worries about data shape and size on commodity machines. Over the past year, we partnered with the talented 0xdata team to work with them on building a great product that will meet and exceed our algorithm needs in the cloud.”

From the H2O Github page:

H2O makes hadoop do math!
H2O scales statistics, machine learning and math over BigData. H2O is extensible and users can build blocks using simple math legos in the core.
H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts & & experts can explore, munge, model and score datasets using a range of simple to advanced algorithms.
Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling

Product Vision for first cut:

  • H2O, the Analytics Engine will scale Classification and Regression.
  • RandomForest, Generalized Linear Modeling (GLM), logistic regression, k-Means, available over R / REST/ JSON-API
  • Basic Linear Algebra as building blocks for custom algorithms
  • High predictive power of the models
  • High speed and scale for modeling and validation over BigData
  • Data Sources:
    • We read and write from/to HDFS, S3
    • We ingest data in CSV format from local and distributed filesystems (nfs)
    • A JDBC driver for SQL and DataAdapters for NoSQL datasources is in the roadmap. (v2)
  • Adhoc Data Analytics at scale via R-like Parser on BigData

Machine learning is not as ubiquitous as Excel, yet.

But like Excel, the quality of results depends on the skills of the user, not the technology.