The Parable of Google Flu: Traps in Big Data Analysis by David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani.
From the article:
In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?
The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5-7) and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories (8). We explore two issues that contributed to GFT's mistakes (big data hubris and algorithm dynamics) and offer lessons for moving forward in the big data age.
Highly recommended reading for big data advocates.
Not that I doubt the usefulness of big data, but I do doubt its usefulness in the absence of an analyst who understands the data.
Did you catch the aside about documentation?
There are multiple challenges to replicating GFT's original algorithm. GFT has never documented the 45 search terms used, and the examples that have been released appear misleading (14) (SM). Google does provide a service, Google Correlate, which allows the user to identify search data that correlate with a given time series; however, it is limited to national level data, whereas GFT was developed using correlations at the regional level (13). The service also fails to return any of the sample search terms reported in GFT-related publications (13, 14).
Document your analysis and understanding of data. Or you can appear in a sequel to Google Flu. Not really where I want my name to appear. You?
I first saw this in a tweet by Edward Tufte.