Constructing Case-Control Studies With Hadoop by Josh Wills.
From the post:
San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.
Case-Control Studies
A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.
Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.
Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.
Great walk through on constructing a case-control study, including the use of the Apache Giraph library.