Archive for October, 2013

Data Preparation for Machine Learning using MySQL

Thursday, October 31st, 2013

Data Preparation for Machine Learning using MySQL

From the post:

Most Machine Learning algorithms require data to be in a single text file in tabular format, with each row representing a full instance of the input dataset and each column one of its features. For example, imagine data in normal form, separated into a table for users, another for movies, and another for ratings. You can get it into machine-learning-ready format in this way (i.e., joining by userid and movieid and removing ids and names):
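The elided SQL can be approximated. A minimal sketch of the join the post describes, here using Python’s sqlite3 (the table and column names are my invention, not the post’s):

```python
import sqlite3

# Hypothetical normal-form tables mirroring the post's example: users,
# movies, ratings, flattened into one ML-ready row per rating.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users  (userid INTEGER, name TEXT, age INTEGER);
CREATE TABLE movies (movieid INTEGER, title TEXT, year INTEGER);
CREATE TABLE ratings(userid INTEGER, movieid INTEGER, rating REAL);
INSERT INTO users   VALUES (1, 'ann', 34), (2, 'bob', 27);
INSERT INTO movies  VALUES (10, 'Alien', 1979), (11, 'Brazil', 1985);
INSERT INTO ratings VALUES (1, 10, 4.5), (2, 11, 3.0);
""")

# Join on userid/movieid and keep only feature columns (drop ids and names).
rows = cur.execute("""
SELECT u.age, m.year, r.rating
FROM ratings r
JOIN users  u ON u.userid  = r.userid
JOIN movies m ON m.movieid = r.movieid
""").fetchall()

for row in rows:
    print(row)
```

Each output row is one training instance: features plus the rating, with the join keys gone.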

Just in case you aren’t up to the Stinger level of SQL but still need to prepare data for machine learning.

Excellent tutorial on using MySQL for machine learning data preparation.

Delivering on Stinger:…

Thursday, October 31st, 2013

Delivering on Stinger: a Phase 3 Progress Update by Arun Murthy.

From the post:

With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem for not just extremely large-scale, batch queries but also for low-latency, human-interactive queries.

Many of you have heard of Project Stinger already, but for those who have not, Stinger is a community-facing roadmap laid out to improve Hive’s performance 100x and bring true interactive query to Hadoop. You can read more at

We’ve gotten really excited lately as we’ve started to piece together the performance gains brought on by the past 9 months of hard work, including more than 700 closed Hive JIRAs and the launch of Apache Tez, which moves Hadoop beyond batch into a truly interactive big data platform.

I won’t replicate the performance graphics but I can hint that 200x improvements are worth your attention.

That’s right. 200x improvement in query performance.

Don’t take my word for it, read Arun’s post.

Machine learning for cancer classification – part 1

Thursday, October 31st, 2013

Machine learning for cancer classification – part 1 – preparing the data sets by Obi Griffith.

From the post:

I am planning a series of tutorials illustrating basic concepts and techniques for machine learning. We will try to build a classifier of relapse in breast cancer. The analysis plan will follow the general pattern (simplified) of a recent paper I wrote. The gcrma step may require you to have as much as ~8 GB of RAM. I ran this tutorial on a Mac Pro (Snow Leopard) with R 3.0.2 installed. It should also work on Linux or Windows but package installation might differ slightly. The first step is to prepare the data sets. We will use GSE2034 as a training data set and GSE2990 as a test data set. These are both data sets making use of the Affymetrix U133A platform (GPL96). First, let’s download and pre-process the training data.
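To see what “preparing the data sets” produces, here is a toy version of the reshaping step in Python (the probe values, sample names, and relapse labels below are made up; the real tutorial does this in R on gcrma-normalized data):

```python
import csv, io

# Hypothetical miniature of a GEO series matrix: probes in rows, samples in
# columns. This sketches only the reshaping step (samples as rows, probes as
# feature columns) that most classifiers expect.
series_matrix = """\
probe\tGSM1\tGSM2\tGSM3
1007_s_at\t7.1\t6.8\t7.4
1053_at\t5.2\t5.9\t5.1
"""

reader = csv.reader(io.StringIO(series_matrix), delimiter="\t")
header = next(reader)
samples = header[1:]
features = {}                       # probe -> one value per sample
for row in reader:
    features[row[0]] = [float(v) for v in row[1:]]

# Transpose: one feature vector per sample.
X = [[features[p][i] for p in features] for i, _ in enumerate(samples)]
y = {"GSM1": 0, "GSM2": 1, "GSM3": 0}   # hypothetical relapse labels
print(samples, X, y)
```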

Assuming you are ready to move beyond Iris data sets for practicing machine learning, this would be a good place to start.

List of NoSQL Databases (150 at present count)

Wednesday, October 30th, 2013

List of NoSQL Databases (150 at present count)

A tweet by John Troon pointed me to the current NoSQL listing at with 150 entries.

Is there a betting pool on how many more will appear by May 1, 2014?

Just curious.


MADlib

Wednesday, October 30th, 2013

MADlib

From the webpage:

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.

Until the Impala post called my attention to it, I didn’t realize that MADlib had been upgraded to 1.3 earlier in October!

Congratulations to MADlib!

Use MADlib Pre-built Analytic Functions….

Wednesday, October 30th, 2013

How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.

From the post:

Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

Victor walks through several examples of data analytics but for those of you who want to cut to the chase:

This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.
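The shared-nothing aggregation idea behind those UDAs can be sketched in a few lines of Python: each partition computes partial sufficient statistics, a merge step combines them, and a finalize step produces the model. The function names are illustrative, not Impala’s actual API:

```python
# Shared-nothing linear regression (1-D least squares) as a UDA-style
# partial/merge/finalize, the pattern the post describes for Impala.

def partial(rows):
    # rows: iterable of (x, y); accumulate the sums least squares needs
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1; sx += x; sy += y; sxx += x * x; sxy += x * y
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    # combining partial states is just element-wise addition
    return tuple(u + v for u, v in zip(a, b))

def finalize(state):
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Two "partitions", as if scanned by two nodes; data lies on y = 2x + 1.
p1 = partial([(0, 1), (1, 3)])
p2 = partial([(2, 5), (3, 7)])
slope, intercept = finalize(merge(p1, p2))
print(slope, intercept)
```

Because merge is associative and commutative, the nodes can scan their partitions in parallel and combine in any order, which is what makes the approach scale.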

Usual cautions apply: The results of your script or model may or may not have any resemblance to “facts” as experienced by others.

Hortonworks Sandbox Version 2.0

Wednesday, October 30th, 2013

Hortonworks Sandbox Version 2.0

From the web page:

Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

Sandbox comes with:

Component Version
Apache Hadoop 2.2.0
Apache Hive 0.12.0
Apache HCatalog 0.12.0
Apache HBase 0.96.0
Apache ZooKeeper 3.4.5
Apache Pig 0.12.0
Apache Sqoop 1.4.4
Apache Flume 1.4.0
Apache Oozie 4.0.0
Apache Ambari 1.4.1
Apache Mahout 0.8.0
Hue 2.3.0

If you check the same listing at the Hortonworks page, you will see that Hue lacks a hyperlink. I had forgotten why until I ran the link down. 😉


International chemical identifier for reactions (RInChI)

Wednesday, October 30th, 2013

International chemical identifier for reactions (RInChI) by Guenter Grethe, Jonathan M Goodman and Chad HG Allen. (Journal of Cheminformatics 2013, 5:45 doi:10.1186/1758-2946-5-45)

Abstract:

The IUPAC International Chemical Identifier (InChI) provides a method to generate a unique text descriptor of molecular structures. Building on this work, we report a process to generate a unique text descriptor for reactions, RInChI. By carefully selecting the information that is included and by ordering the data carefully, different scientists studying the same reaction should produce the same RInChI. If differences arise, these are most likely the minor layers of the InChI, and so may be readily handled. RInChI provides a concise description of the key data in a chemical reaction, and will help enable the rapid searching and analysis of reaction databases.

The line from the abstract:

By carefully selecting the information that is included and by ordering the data carefully, different scientists studying the same reaction should produce the same RInChI.

sounds good in theory but doubtful in practice.

The authors did, however, test a set of reactions from three different publishers, some 2,900 RInChIs in all, and they were able to quickly eliminate duplicates, etc.

A project to watch as larger data sets are tested with the goal of encoding the same reactions the same way.
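The ordering idea at the heart of RInChI can be illustrated in a few lines: if participants are serialized in a canonical (sorted) order, two scientists listing the same species in different orders still produce the same key. This is only a sketch of the principle, not the real RInChI layer format:

```python
# Canonical reaction key: sort each side's InChI strings before joining,
# so listing order no longer matters. Separator choices are invented here.

def reaction_key(reactants, products):
    return "<>".join([
        "!".join(sorted(reactants)),
        "!".join(sorted(products)),
    ])

a = reaction_key(
    ["InChI=1S/H2O/h1H2", "InChI=1S/CO2/c2-1-3"],
    ["InChI=1S/CH2O3/c2-1(3)4/h(H2,2,3,4)"],
)
b = reaction_key(  # same reaction, reactants listed in the other order
    ["InChI=1S/CO2/c2-1-3", "InChI=1S/H2O/h1H2"],
    ["InChI=1S/CH2O3/c2-1(3)4/h(H2,2,3,4)"],
)
print(a == b)
```

The hard part the paper actually tackles, deciding which information belongs in the key at all, is exactly where the “in practice” doubts come in.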

The RInChI Project

Hadoop Weekly – October 28, 2013

Tuesday, October 29th, 2013

Hadoop Weekly – October 28, 2013 by Joe Crobak.

A weekly blog post that tracks all things in the Hadoop ecosystem.

I will keep posting on Hadoop things of particular interest for topic maps but will also be pointing to this blog for those who want/need more Hadoop coverage.

the /unitedstates project

Tuesday, October 29th, 2013

the /unitedstates project

From the webpage:

/unitedstates is a shared commons of data and tools for the United States. Made by the public, used by the public.

There you will find:

bill-nicknames Tiny spreadsheet of common nicknames for bills and laws.

citation Stand-alone legal citation detector. Text in, citations out.

congress-legislators Detailed data on members of Congress, past and present.

congress Scrapers and parsers for the work of Congress, all day, every day.

glossary A public domain glossary for the United States.

licensing Policy guidelines for the licensing of US government information.

uscode Parser for the US Code.

wish-list Post ideas for new projects.

Can you guess what the #1 wish on the project list is?

Campaign finance donor de-duplicator

Useful Unix/Linux One-Liners for Bioinformatics

Tuesday, October 29th, 2013

Useful Unix/Linux One-Liners for Bioinformatics by Stephen Turner.

From the post:

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some “standardized” file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I’ve put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number of unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I use every day and examples culled from other sources listed at the top of the page.
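For comparison, the post’s most advanced one-liner recast in stdlib Python, on a tiny inline FASTQ (in FASTQ, the second line of each four-line record is the sequence):

```python
from collections import Counter

# Total reads, unique reads, % unique, most abundant sequence and its count.
fastq = """\
@r1
ACGT
+
IIII
@r2
ACGT
+
IIII
@r3
TTTT
+
IIII
"""

seqs = fastq.splitlines()[1::4]       # every 4th line, starting at line 2
counts = Counter(seqs)
total = len(seqs)
unique = len(counts)
top_seq, top_n = counts.most_common(1)[0]
print(total, unique, round(100 * unique / total, 1), top_seq, top_n)
```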

What one-liners do you have lying about?

For what data sets?

A Checklist for Creating Data Products

Tuesday, October 29th, 2013

A Checklist for Creating Data Products by Zach Gemignani.

From the post:

Are you sitting on a gold mine, if only you could transform your unique data into a valuable, monetizable data product?

Over the years, we’ve worked with dozens of clients to create applications that refine data and package the results in a form users will love. We often talk with product managers early in the conception phase to help define the target market and end-user needs, even before designing interfaces for presenting and visualizing the data.

In the process, we’ve learned a few lessons and gathered a bunch of useful resources. Download our Checklist for Product Managers of Data Solutions. It is divided into four sections:

  1. Audience: Understand the people who need your data
  2. Data: Define and enhance the data for your solution
  3. Design: Craft an application that solves problems
  4. Delivery: Transition from application to profitable product

Zach and friends have done a good job packing this one page checklist with helpful hints.

No turn-key solution to riches but may spark some ideas that will move you closer to a viable data product.

Graph Triumphalism

Tuesday, October 29th, 2013

The Next Battle Ground for the Titans of Tech by Charles Silver.

From the post:

To win this galactic battle for dominance of Web 3.0, the victorious titan must find a way to move the entire tech world off of relational databases — which have been the foundation of computing since the 1970s — and onto graph databases, the key to semantic computing. The reason: Relational databases, though revolutionary way back when, are not up to the job of managing today’s Big Data. There are two huge, insurmountable issues preventing this:

  • Data integration. Relational databases (basically, all that stuff in silos) are finicky. They come in many forms, from many sources, and don’t play well with others. While search engines can find data containing specific keywords, they can’t do much of anything with it.
  • Intelligent “thinking.” While it’s impossible for computers to reason or form concepts using relational databases, they can do exactly that with linked data in graph databases. Semantic search engines can connect related data, forming a big picture out of small pieces, Star Trek-like.

This is exactly what users want and need. Consumers, marketers, advertisers, researchers, defense experts, financiers, medical researchers, astrophysicists, everyone who uses search engines (that’s everyone) wants to type in questions and get clear, accurate, complete answers, fast, that relate to them. If they’re shopping (for insurance, red shoes, DIY drones), they want where-to-get-it resources, ratings and more. Quite a wish list. Yet chunks of it are already happening.

I really like graph databases. I really do.

But to say relational databases = silos, with the implication that graph databases != silos, is just wrong.

A relational or graph database (or any other kind of information system) will look like a silo if you don’t know the semantics of its structure and the data inside.

Technology doesn’t make silos; users who don’t disclose or document the semantics of their data structures and data create silos.

Some technologies make it easier to disclose semantics than others, but it is always a user’s choice that is responsible for the creation of a data silo.

And no, graphs don’t make it possible for computers to “…reason or form concepts….” That’s just silly.

Law of Conservation of Intelligence: You can’t obtain more intelligence from a system than was designed into it.

PS: I know, I’m cheating because I did not define “intelligence.” At least I am aware I didn’t define it. 😉

Theory and Applications for Advanced Text Mining

Monday, October 28th, 2013

Theory and Applications for Advanced Text Mining edited by Shigeaki Sakurai.

From the post:

Book chapters include:

  • Survey on Kernel-Based Relation Extraction by Hanmin Jung, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song
  • Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents by Hidenao Abe
  • Text Clumping for Technical Intelligence by Alan Porter and Yi Zhang
  • A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining by Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino
  • Ontology Learning Using Word Net Lexical Expansion and Text Mining by Hiep Luong, Susan Gauch and Qiang Wang
  • Automatic Compilation of Travel Information from Texts: A Survey by Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa
  • Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques by Masaomi Kimura
  • Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools by David Campos, Sergio Matos and Jose Luis Oliveira
  • Toward Computational Processing of Less Resourced Languages: Primarily Experiments for Moroccan Amazigh Language by Fadoua Ataa Allah and Siham Boulaknadel

Download the book or the chapters at:

Is it just me, or have more data mining/analysis books been appearing as open texts alongside traditional print publication than, say, five years ago?

Series: The Neophyte’s Guide to Scala

Monday, October 28th, 2013

Series: The Neophyte’s Guide to Scala

From the post:

Daniel Westheide (@kaffeecoder on Twitter) has written a wonderful series of blog posts about Scala, including Akka towards the end. The individual articles are:

This series was published between Nov 21, 2012 and Apr 3, 2013 and Daniel has aggregated all content including an EPUB download here. Big kudos, way to go!


Applying the Big Data Lambda Architecture

Sunday, October 27th, 2013

Applying the Big Data Lambda Architecture by Michael Hausenblas.

From the article:

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

  • The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
  • To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
  • The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
  • The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird’s eye view, the lambda architecture has three major components that interact with new data coming in and respond to queries, which in this article are driven from the command line:

The goal of the article:

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple command-line user interface. The guts, however, use the lambda architecture.
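The three components can be miniaturized in Python: a batch view precomputed over the master dataset, a speed layer holding only what arrived since the last batch run, and a query that merges both at serving time. The acquaintanceship data below is invented for illustration:

```python
# Lambda architecture in miniature: batch view + speed view, merged at query
# time. In USN terms: who do I know, and from which network?

batch_view = {"alice": {"bob": ["twitter"], "carol": ["linkedin"]}}
speed_view = {"alice": {"bob": ["real-life"]}}   # arrived since last batch run

def query(person):
    # copy the batch result, then overlay anything the speed layer has seen
    merged = {k: list(v) for k, v in batch_view.get(person, {}).items()}
    for friend, networks in speed_view.get(person, {}).items():
        merged.setdefault(friend, []).extend(networks)
    return merged

print(query("alice"))
```

When the next batch run completes, its output absorbs the speed layer’s data and the speed view is discarded, which is what makes the batch layer robust against human mistakes.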

Something a bit challenging for the start of the week. 😉

Big Data Modeling with Cassandra

Sunday, October 27th, 2013

Big Data Modeling with Cassandra by Mat Brown.

From the abstract:

When choosing the right data store for an application, developers face a trade-off between scalability and programmer-friendliness. With the release of version 3 of the Cassandra Query Language, Cassandra provides a uniquely attractive combination of both, exposing robust and intuitive data modeling capabilities while retaining the scalability and availability of a distributed, masterless data store.

This talk will focus on practical data modeling and access in Cassandra using CQL3. We’ll cover nested data structures; different types of primary keys; and the many shapes your tables can take. There will be a particular focus on understanding the way Cassandra stores and accesses data under the hood, to better reason about designing schemas for performant queries. We’ll also cover the most important (and often unexpected) differences between ACID databases and distributed data stores like Cassandra.

Mat Brown ( is a software engineer at Rap Genius, a platform for annotating and explaining the world’s text. Mat is the author of Cequel, a Ruby object/row mapper for Cassandra, as well as Elastictastic, an object/document mapper for ElasticSearch, and Sunspot, a Ruby model integration layer for Solr.

Mat covers Cassandra’s limitations without being pressed. Not unknown, but not common either.

Migrating a relational schema directly to Cassandra is a bad idea. (paraphrase)

Mat examines the internal data structures that influence how you should model data in Cassandra.

At 17:40, he shows how the data structure is represented internally.

The internal representation drives schema design.
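A toy model of that internal layout, following the usual description of Cassandra’s storage engine (one wide row per partition key, cells kept sorted by clustering column, so slices within a partition are cheap and cross-partition scans are not):

```python
import bisect

# Illustrative model of Cassandra's wide-row layout, not its actual code.
storage = {}   # partition key -> sorted list of (clustering key, value)

def insert(pk, ck, value):
    # cells stay sorted by clustering key as they arrive
    bisect.insort(storage.setdefault(pk, []), (ck, value))

def slice_query(pk, start, end):
    # efficient: one partition, a contiguous slice of one sorted row
    row = storage.get(pk, [])
    return [(ck, v) for ck, v in row if start <= ck <= end]

insert("user1", 3, "event-c")
insert("user1", 1, "event-a")
insert("user1", 2, "event-b")
print(slice_query("user1", 1, 2))
```

Seen this way, the schema-design advice follows directly: choose the partition key so queries land in one partition, and the clustering columns so they read a contiguous slice.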

You may also like Cequel by the presenter.

PS: I suspect that if considered carefully, the internal representation of data in most databases drives the advice given by tech support.

PubMed Commons

Sunday, October 27th, 2013

PubMed Commons

From the webpage:

PubMed Commons is a system that enables researchers to share their opinions about scientific publications. Researchers can comment on any publication indexed by PubMed, and read the comments of others. PubMed Commons is a forum for open and constructive criticism and discussion of scientific issues. It will thrive with high quality interchange from the scientific community. PubMed Commons is currently in a closed pilot testing phase, which means that only invited participants can add and view comments in PubMed.

Just in case you are looking for a place to practice your data skepticism skills.

In closed beta now, but when it opens up, pick an article in a field that interests you or one at random.

Just my suggestion, but aim for very high quality comments and check your analysis with others.

A record of to-the-point, non-shrill, substantive comments might be a nice addition to your data skeptic resume. (Under papers re-written/retracted.)

Tiny Data: Rapid development with Elasticsearch

Sunday, October 27th, 2013

Tiny Data: Rapid development with Elasticsearch by Leslie Hawthorn.

From the post:

Today we’re pleased to bring you the story of the creation of SeeMeSpeak, a Ruby application that allows users to record gestures for those learning sign language. Florian Gilcher, one of the organizers of the Berlin Elasticsearch User Group, participated in a hackathon last weekend with three friends, resulting in this brand new open source project using Elasticsearch on the back end. (Emphasis in original.)


Sadly, there are almost no good learning resources for sign language on the internet. If material is available, licensing is a hassle, or both the licensing and the material are poorly documented. Documenting sign language yourself is also hard, because producing and collecting videos is difficult. You need third-party recording tools, video conversion and manual categorization. That’s a sad state in a world where every notebook has a usable camera built in!

Our idea was to leverage modern browser technologies to provide an easy recording function and a quick interface to categorize the recorded words. The result is SeeMeSpeak.

Two lessons here:

  1. Data does not have to be “big” in order to be important.
  2. Browsers are very close to being the default UI for users.

Lucene Image Retrieval LIRE

Sunday, October 27th, 2013

Lucene Image Retrieval LIRE by Mathias Lux.

From the post:

Today I gave a talk on LIRE at the ACM Multimedia conference in the open source software competition, currently taking place in Barcelona. It gave me the opportunity to present a local installation of the LIRE Solr plugin and the possibilities thereof. Find the slides of the talk at slideshare: LIRE presentation at the ACM Multimedia Open Source Software Competition 2013

The Solr plugin itself is fully functional for Solr 4.4 and the source is available at There is a markdown document explaining what can be done with the plugin and how to actually install it. Basically it can do content based search, content based re-ranking of text searches and brings along a custom field implementation & sub linear search based on hashing.
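The hashing trick behind sub-linear search can be sketched with coarse quantization as the hash function, so only images sharing a bucket are ever compared. LIRE’s actual scheme is more elaborate; this is just the shape of the idea:

```python
from collections import defaultdict

# Hash feature vectors into buckets; search touches one bucket, not the
# whole collection. Image names and features here are invented.

def feature_hash(vec, step=0.5):
    # quantize each dimension; nearby vectors tend to share a hash
    return tuple(int(v // step) for v in vec)

index = defaultdict(list)          # hash bucket -> image ids

def add(image_id, vec):
    index[feature_hash(vec)].append(image_id)

def candidates(vec):
    return index.get(feature_hash(vec), [])

add("sunset.jpg", [0.9, 0.1])
add("beach.jpg",  [0.8, 0.2])      # lands in the same bucket as sunset.jpg
add("forest.jpg", [0.1, 0.9])
print(candidates([0.85, 0.15]))
```

A real system would then re-rank the small candidate set with the full (expensive) distance function, which is where the content-based re-ranking mentioned above comes in.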

There is a demo site as well.

See also: LIRE: open source image retrieval in Java.

If you plan on capturing video feeds from traffic cams or other sources, to link up with other data, image recognition is in your future.

You can start with a no-bid research contract or with LIRE and Lucene.

Your call.

Erlang – Concurrent Language for Concurrent World

Sunday, October 27th, 2013

Erlang – Concurrent Language for Concurrent World by Zvi Avraham.

If you need to forward a “why Erlang” to a programmer, this set of slides should be near the top of your list.

It includes this quote from Joe Armstrong:

“The world is concurrent… I could not drive the car if I did not understand concurrency…”

Which makes me wonder: Do all the drivers I have seen in Atlanta understand concurrency?

That would really surprise me. 😉

The point is that systems should be concurrent by their very nature, like the world around us.

Users should object when systems exhibit sequential behavior.
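Erlang’s model, roughly approximated with Python threads and mailboxes (no state shared between “processes”, only messages):

```python
import queue, threading

# Isolated worker that communicates only through its mailbox, Erlang-style.

def worker(mailbox, results):
    while True:
        msg = mailbox.get()
        if msg is None:          # poison pill: stop the process
            break
        results.put(msg * 2)     # do some work, reply with a message

mailbox, results = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(mailbox, results))
t.start()
for n in (1, 2, 3):
    mailbox.put(n)               # sends never block on the worker's pace
mailbox.put(None)
t.join()
out = [results.get() for _ in range(3)]
print(out)
```

Erlang makes this the default and cheap (millions of processes); the Python analogue only illustrates the messaging discipline, not the scale.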

An In-Depth Look at Modern Database Systems

Sunday, October 27th, 2013

An In-Depth Look at Modern Database Systems by C. Mohan.

Abstract:

This tutorial is targeted at a broad set of database systems and applications people. It is intended to let the attendees better appreciate what is really behind the covers of many of the modern database systems (e.g., NoSQL and NewSQL systems), going beyond the hype associated with these open source, commercial and research systems. The capabilities and limitations of such systems will be addressed. Modern extensions to decades old relational DBMSs will also be described. Some application case studies will also be presented.

This is a revised version of a tutorial presented first at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy in August 2013. This is also a follow up to my EDBT2013 keynote talk “History Repeats Itself: Sensible and NonsenSQL Aspects of the NoSQL Hoopla” (see the paper at

Latest Bibliography.

The one thing I have not found for this tutorial is a video!

While highly enjoyable (from my perspective), a detailed analysis of the database platforms and the ideas they missed or incorporated would be even more valuable.

It is one thing to say generally that an idea was missed and quite another to obtain agreement on that point.

A series of workshops documenting the intellectual history of databases would go a long way to hastening progress, as opposed to proliferation of wheels.

Lectures on scientific computing with Python

Sunday, October 27th, 2013

Lectures on scientific computing with Python by J.R. Johansson.

From the webpage:

A set of lectures on scientific computing with Python, using IPython notebooks.

Read-only versions of the lectures:

To debunk pitches, proposals, articles, demos, etc., you will need to know, among other things, how scientific computing should be done.

Scientific computing is a very large field so take this as a starting point, not a destination.

Trouble at the lab [Data Skepticism]

Sunday, October 27th, 2013

Trouble at the lab, Oct. 19, 2013, The Economist.

From the web page:

“I SEE a train wreck looming,” warned Daniel Kahneman, an eminent psychologist, in an open letter last year. The premonition concerned research on a phenomenon known as “priming”. Priming studies suggest that decisions can be influenced by apparently irrelevant actions or events that took place just before the cusp of choice. They have been a boom area in psychology over the past decade, and some of their insights have already made it out of the lab and into the toolkits of policy wonks keen on “nudging” the populace.

Dr Kahneman and a growing number of his colleagues fear that a lot of this priming research is poorly founded. Over the past few years various researchers have made systematic attempts to replicate some of the more widely cited priming experiments. Many of these replications have failed. In April, for instance, a paper in PLoS ONE, a journal, reported that nine separate experiments had not managed to reproduce the results of a famous study from 1998 purporting to show that thinking about a professor before taking an intelligence test leads to a higher score than imagining a football hooligan.

The idea that the same experiments always get the same results, no matter who performs them, is one of the cornerstones of science’s claim to objective truth. If a systematic campaign of replication does not lead to the same results, then either the original research is flawed (as the replicators claim) or the replications are (as many of the original researchers on priming contend). Either way, something is awry.

The numbers will make you a militant data skeptic:

  • Original results could be duplicated for only 6 out of 53 landmark cancer studies.
  • A drug company could reproduce only a quarter of 67 “seminal studies.”
  • An NIH official estimates that at least three-quarters of published biomedical findings would be hard to reproduce.
  • Three-quarters of published papers in machine learning are bunk due to overfitting.

Those and more examples await you in this article from The Economist.
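Overfitting, at least, is easy to demonstrate: a model that memorizes noise looks perfect on the data it was fit to and useless on fresh data. A minimal sketch with made-up coin-flip labels:

```python
import random

# Labels are pure noise, yet a memorizing "classifier" scores 100% on its
# own training set -- and chance-level on an independent sample.
random.seed(0)
train = [(i, random.randint(0, 1)) for i in range(100)]
test  = [(i, random.randint(0, 1)) for i in range(100)]

memory = dict(train)                  # memorize every training example

def predict(x):
    return memory.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc  = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)
```

Any paper that reports only the training-set number is reporting the 100%, not the coin flip.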

As the sub-heading for the article reads:

Scientists like to think of science as self-correcting. To an alarming degree, it is not

You may not mind misrepresenting facts to others, but do you want other people misrepresenting facts to you?

Do you have a professional data critic/skeptic on call?

Modeling data with functional programming in R

Sunday, October 27th, 2013

Modeling data with functional programming in R by Brian Lee Rowe.

From the post:

As some of you know, I’ve been writing a book (to be published by CRC Press/Chapman & Hall and released in late 2014) for the past year and a half. It’s one of those books that spans multiple disciplines so is both unique and also niche. In essence it’s a manifesto of sorts on using functional programming for mathematical modeling and analysis, which is based on my R package lambda.r. It spans the lambda calculus, traditional mathematical analysis, and set theory to 1) develop a mathematical model for the R language, and 2) show how to use this formalism to prove the equivalence of programs to their underlying model. I try to keep the book focused on applications, so I discuss financial trading systems, some NLP/document classification, and also web analytics.

The book started off as a more practical affair, but one thing that I’ve learned through this process is how to push ideas to the limit. So now it delves into quite a bit of theory, which makes it a more compelling read. In some ways it reminds me of ice climbing, where you’re scaling a waterfall and really pushing yourself in ways you didn’t think possible. Three chapters into the process, and it’s been that same combination of exhilarating challenge that results in conflicting thoughts racing through your head: “Holy crap — what am I doing?!” versus “This is so fun — wheeeee!” versus “I can’t believe I did it!”

Brian says some of the images, proofs and examples need work but that should not diminish your reading of the draft.

Do take the time to return comments while you are reading the draft.

Rowe – Modeling data with functional programming.

I first saw this in a tweet from StatFact.

Pitch Advice For Entrepreneurs

Saturday, October 26th, 2013

Pitch Advice For Entrepreneurs: LinkedIn’s Series B Pitch to Greylock by Reid Hoffman.

From the post:

At Greylock, my partners and I are driven by one guiding mission: always help entrepreneurs. It doesn’t matter whether an entrepreneur is in our portfolio, whether we’re considering an investment, or whether we’re casually meeting for the first time.

Entrepreneurs often ask me for help with their pitch decks. Because we value integrity and confidentiality at Greylock, we never share an entrepreneur’s pitch deck with others. What I’ve honorably been able to do, however, is share the deck I used to pitch LinkedIn to Greylock for a Series B investment back in 2004.

This past May was the 10th anniversary of LinkedIn, and while reflecting on my entrepreneurial journey, I realized that no one gets to see the presentation decks for successful companies. This gave me an idea: I could help many more entrepreneurs by making the deck available not just to the Greylock network of entrepreneurs, but to everyone.

Today, I share the Series B deck with you, too. It has many stylistic errors — and a few substantive ones, too — that I would now change having learned more, but I realized that it still provides useful insights for entrepreneurs and startup participants outside of the Greylock network, particularly across three areas of interest:

  • how entrepreneurs should approach the pitch process
  • the evolution of LinkedIn as a company
  • the consumer internet landscape in 2004 vs. today

Read, digest, and then read again.

I first saw this in a tweet by Tim O’Reilly.

0xdata Releases Second Generation H2O…

Saturday, October 26th, 2013

0xdata Releases Second Generation H2O, Big Data’s Fastest Open Source Machine Learning and Predictive Analytics Engine

From the post:

0xdata, the open source machine learning and predictive analytics company for big data, today announced general availability of the latest release of H2O, the industry’s fastest prediction engine for big data users of Hadoop, R and Excel. H2O delivers parallel and distributed advanced algorithms on big data at speeds up to 100X faster than other predictive analytics providers.

The second generation H2O “Fluid Vector” release — currently in use at two of the largest insurance companies in the world, the largest provider of streaming video entertainment and the largest online real estate services company — delivers new levels of performance, ease of use and integration with R. Early H2O customers include Netflix, Trulia and Vendavo.

“We developed H2O to unlock the predictive power of big data through better algorithms,” said SriSatish Ambati, CEO and co-founder of 0xdata. “H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world.”

“Big data by itself is useless. It is only when you have big data plus big analytics that one has the capability to achieve big business impact. H2O is the platform for big analytics that we have found gives us the biggest advantage compared with other alternatives,” said Chris Pouliot, Director of Algorithms and Analytics at Netflix and advisor to 0xdata. “Our data scientists can build sophisticated models, minimizing their worries about data shape and size on commodity machines. Over the past year, we partnered with the talented 0xdata team to work with them on building a great product that will meet and exceed our algorithm needs in the cloud.”

From the H2O Github page:

H2O makes hadoop do math!
H2O scales statistics, machine learning and math over BigData. H2O is extensible and users can build blocks using simple math legos in the core.
H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts & experts can explore, munge, model and score datasets using a range of simple to advanced algorithms.
Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling.

Product Vision for first cut:

  • H2O, the Analytics Engine, will scale Classification and Regression.
  • RandomForest, Generalized Linear Modeling (GLM), logistic regression and k-Means, available over R / REST / JSON-API
  • Basic Linear Algebra as building blocks for custom algorithms
  • High predictive power of the models
  • High speed and scale for modeling and validation over BigData
  • Data Sources:
    • We read and write from/to HDFS, S3
    • We ingest data in CSV format from local and distributed filesystems (nfs)
    • A JDBC driver for SQL and DataAdapters for NoSQL datasources are in the roadmap. (v2)
  • Ad hoc Data Analytics at scale via R-like Parser on BigData
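For readers who haven’t met the algorithms in that list, here is the flavor of one of them, k-means clustering, as a toy pure-Python sketch. This is not H2O’s API (H2O runs these algorithms distributed over Hadoop-scale data); it only shows the kind of computation being scaled.

```python
import random

# Toy 1-D k-means: alternate between assigning points to their nearest
# center and moving each center to the mean of its assigned points.
def kmeans(points, k, iters=20, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # move each center to its cluster mean (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, 2))  # two clusters, one near 1.0 and one near 10.0
```

H2O’s contribution is not the algorithm itself but running it in parallel, without sampling, across a cluster.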

Machine learning is not as ubiquitous as Excel, yet.

But like Excel, the quality of results depends on the skills of the user, not the technology.

Entity Discovery using Mahout CollocDriver

Saturday, October 26th, 2013

Entity Discovery using Mahout CollocDriver by Sujit Pal.

From the post:

I spent most of last week trying out various approaches to extract “interesting” phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout’s CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.

The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we had to do was hook them up to produce text from the various file formats. Here is my code:

Think of this as winnowing the chaff that your human experts would otherwise read.

A possible next step would be to decorate the candidate “interesting” phrases with additional information before being viewed by your expert(s).
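To make the rescoring step concrete, here is a small pure-Python sketch of Dunning’s log-likelihood ratio, the statistic behind Mahout’s CollocDriver, applied to bigram counts. This is an illustration of the measure only, not Sujit’s actual Tika/Mahout pipeline, and the corpus is made up; counts at sequence boundaries are approximated from unigram totals.

```python
import math
from collections import Counter

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # The term used by Dunning's G^2: xlogx of the total minus
    # the sum of xlogx of the parts.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table,
    the same statistic Mahout uses to rescore candidate ngrams."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def score_bigrams(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1  # total bigram events
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = unigrams[a] - k11  # a followed by something other than b
        k21 = unigrams[b] - k11  # b preceded by something other than a
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores

scores = score_bigrams("new york is big but new york is far".split())
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

Bigrams whose parts co-occur more often than chance predicts ("new york" here) score far above incidental pairs like "is big", which is exactly the winnowing effect described above.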

Center for Language and Speech Processing Archives

Saturday, October 26th, 2013

Center for Language and Speech Processing Archives

Archived seminars from the Center for Language and Speech Processing (CLSP) at Johns Hopkins University.

I mentioned recently that Chris Callison-Burch is digitizing these videos and posting them to Vimeo. (Say Good-Bye to iTunes: > 400 NLP Videos)

Unfortunately, Vimeo offers only primitive sorting (by upload date, etc.).

Works if you are a Kim Kardashian fan. One tweet, photo or video is as meaningful (sic) as another.

Works less well if you are looking for specific and useful content.

CLSP offers searching “by speaker, year, or keyword from title, abstract, bio.”


Machine Learning And Analytics…

Saturday, October 26th, 2013

Machine Learning And Analytics Using Python For Beginners by Naveen Venkataraman.

From the post:

Analytics has been a major personal theme in 2013. I’ve recently taken an interest in machine learning after spending some time in analytics consulting. In this post, I’ll share a few tips for folks looking to get started with machine learning and data analytics.


The audience for this article is people who are looking to understand the basics of machine learning and those who are interested in developing analytics projects using Python. A coding background is not required to read this article.

Most resource postings list too many resources to consult.

Naveen lists a handful of resources and why you should use them.
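In that spirit, here is about the smallest possible analytics exercise in Python: ordinary least squares on a handful of made-up points, with no libraries at all. Real projects would reach for numpy, pandas and scikit-learn, but the underlying idea fits in a dozen lines.

```python
# Fit a straight line y = m*x + b by minimizing squared error.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var          # covariance over variance of x
    return slope, my - slope * mx

# Hours studied vs. exam score (made-up data)
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

m, b = fit_line(hours, score)
print(f"score = {m:.1f} * hours + {b:.1f}")
```

Once the closed-form version makes sense, the scikit-learn equivalent (`LinearRegression().fit(...)`) is a one-liner, which is a good first project for a beginner.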