Deep Feature Synthesis: Towards Automating Data Science Endeavors by James Max Kanter and Kalyan Veeramachaneni.
In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submissions score.
The most common phrase I saw in headlines about this paper was some variation on “MIT algorithm replaces human intuition,” or words to that effect. For example: “MIT developing a system that replaces human intuition for big data analysis” (SiliconANGLE), “An Algorithm May Be Better Than Humans at Breaking Down Big Data” (Newsweek), “Is an MIT algorithm better than human intuition?” (Christian Science Monitor), and “A new AI algorithm can outperform human intuition” (The World Weekly), just to name a few.
Being the generous sort of reviewer that I am, ;-), I am going to assume that the reporters who wrote about the imperiled status of human intuition either didn’t read the article or were working from a poorly written press release.
The error is not buried in a deeply mathematical or computational part of the paper.
Take a look at the second, fourth and seventh paragraphs of the introduction to see if you can spot the error:
To begin with, we observed that many data science problems, such as the ones released by KAGGLE, and competitions at conferences (KDD cup, IJCAI, ECML) have a few common properties. First, the data is structured and relational, usually presented as a set of tables with relational links. Second, the data captures some aspect of human interactions with a complex system. Third, the presented problem attempts to predict some aspect of human behavior, decisions, or activities (e.g., to predict whether a customer will buy again after a sale [IJCAI], whether a project will get funded by donors [KDD Cup 2014], or even where a taxi rider will choose to go [ECML]). [Second paragraph of introduction]
Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition. While recent developments in deep learning and automated processing of images, text, and signals have enabled significant automation in feature engineering for those data types, feature engineering for relational and human behavioral data remains iterative, human-intuition driven, and challenging, and hence, time consuming. At the same time, because the efficacy of a machine learning algorithm relies heavily on the input features, any replacement for a human must be able to engineer them acceptably well. [Fourth paragraph of introduction]
With these components in place, we present the Data Science Machine — an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features to be used for predictive modeling. Most parameters of the system are optimized automatically, in pursuit of good general purpose performance. [Seventh paragraph of introduction]
Have you spotted the problem yet?
In the second paragraph the authors say:
First, the data is structured and relational, usually presented as a set of tables with relational links.
In the fourth paragraph the authors say:
Transforming raw data into features is often the part of the process that most heavily involves humans, because it is driven by intuition.
In the seventh paragraph the authors say:
…an automated system for generating predictive models from raw data. It starts with a relational database and automatically generates features…
That is the first time I have ever heard relational database tables and links called raw data.
Human intuition was baked into the data by the construction of the relational tables and links between them, before the Data Science Machine was ever given the data.
The Data Science Machine is wholly and solely dependent upon the human intuition already baked into the relational database data to work at all.
The researchers say as much in the seventh paragraph, unless you think data spontaneously organizes itself into relational tables. Spontaneous relational tables?
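To make the dependency concrete, here is a minimal sketch of relationship-following feature generation. It is my own toy illustration, not the authors' code: the tables, the `RELATIONSHIP` mapping, and the `synthesize_features` function are all hypothetical names. Notice what the function needs before it can compute anything: tables that someone has already designed, and a declared link between them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: two tables that a person has already designed.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 5.0},
]

# The human-supplied part: which column links child rows to parent rows.
RELATIONSHIP = {"parent_key": "customer_id", "child_key": "customer_id"}

# Functions applied along the relationship (an illustrative selection).
AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}


def synthesize_features(parents, children, relationship, child_name, field):
    """Aggregate one child field up to each parent row along a declared link."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[relationship["child_key"]]].append(row[field])

    features = []
    for parent in parents:
        values = grouped.get(parent[relationship["parent_key"]], [])
        feats = dict(parent)
        for name, fn in AGGREGATIONS.items():
            # mean and max are undefined on an empty group, so skip them there.
            if values or name in ("sum", "count"):
                feats[f"{name}({child_name}.{field})"] = fn(values)
        features.append(feats)
    return features


features = synthesize_features(customers, orders, RELATIONSHIP, "orders", "amount")
for row in features:
    print(row)
```

Remove `RELATIONSHIP`, or hand the function a single undifferentiated dump of records, and there is no path to follow. The automation begins only after a person has decided what counts as a customer, what counts as an order, and how the two connect.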
If you doubt that human intuition (decision making) is involved in the creation of relational tables, take a quick look at: A Quick-Start Tutorial on Relational Database Design.
This isn’t to take anything away from Kanter and Veeramachaneni. Their Data Science Machine builds upon human intuition captured in relational databases. That is no mean feat. Human intuition should be captured and used to augment machine learning whenever possible.
That isn’t the same as “replacing” human intuition.
PS: Please forward to any news outlet/reporter who has been repeating false information about “deep feature synthesis.”
I first saw this in a tweet by Kirk Borne.