Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events
From the post:
Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.
Background: Adverse Drug Events
An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of ADEs lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1–5 days and costs the hospital up to $9,000.
Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together lead to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.
Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. The FDA makes a well-formatted sample of the reports available for download from its website, to the benefit of data scientists everywhere.
Methodology
Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:
For All Drugs/Reactions:
                 | Reaction = Rj | Reaction != Rj | Total
Drug = Di        | A             | B              | A + B
Drug != Di       | C             | D              | C + D
Total            | A + C         | B + D          | A + B + C + D
Based on the values in the cells of the table, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if the drug and the reaction were independent.
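To make this concrete, here is a minimal Python sketch (not from the original post) that computes two standard disproportionality measures, the proportional reporting ratio (PRR) and the reporting odds ratio (ROR), from the four cell counts; the function name and the example counts are hypothetical.

```python
def disproportionality(a, b, c, d):
    """a: reports with drug Di and reaction Rj
       b: reports with drug Di, without reaction Rj
       c: reports without drug Di, with reaction Rj
       d: reports with neither."""
    # Proportional reporting ratio: how often Rj is reported with Di,
    # relative to how often it is reported with all other drugs.
    prr = (a / (a + b)) / (c / (c + d))
    # Reporting odds ratio: the odds of Rj given Di over the odds
    # of Rj given any other drug.
    ror = (a * d) / (b * c)
    return prr, ror

# Hypothetical counts: a drug-reaction pair looks suspicious when
# both measures are well above 1.
print(disproportionality(20, 980, 100, 98900))  # ~ (19.8, 20.2)
```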
For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper “Empirical Bayes screening for multi-item associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses, and analyzing defects in automobiles.
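The full MGPS model places a two-component gamma mixture prior on the relative reporting ratio and fits it with Expectation Maximization; purely as an illustration of the shrinkage idea, the sketch below collapses the mixture to a single gamma prior (all parameter values are hypothetical and not from the paper or the post).

```python
# Simplified sketch of gamma-Poisson shrinkage (NOT the full MGPS
# model, which uses a two-component gamma mixture fit by EM).
# With N ~ Poisson(E * lam) and lam ~ Gamma(alpha, beta), the
# posterior mean of the relative ratio lam is (alpha + N) / (beta + E).

def shrunken_ratio(n, e, alpha=1.0, beta=1.0):
    """n: observed count for a drug-reaction combination
       e: expected count if drugs and reactions were independent
       alpha, beta: hypothetical gamma prior parameters (mean 1)."""
    return (alpha + n) / (beta + e)

# A raw ratio of 3 / 0.5 = 6 based on only 3 reports is pulled
# toward the prior mean of 1, reflecting the thin evidence.
print(3 / 0.5, shrunken_ratio(3, 0.5))  # 6.0 vs ~2.67
```

This is the practical appeal of the estimator: combinations observed only a handful of times are not allowed to dominate the rankings on chance alone.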
Apologies for the long setup:
Solving the Hard Problem with Apache Hadoop
At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.
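To see where the explosion comes from, consider a small sketch (not from the post; the report layout is assumed) that enumerates the triples contributed by a single report. A report listing d drugs and r reactions yields C(d, 2) × r drug-drug-reaction triples, and those counts multiply across millions of reports.

```python
from itertools import combinations

# Sketch: enumerate the (drug, drug, reaction) triples contributed
# by one report. Drug names here are purely illustrative.

def triples(report_drugs, report_reactions):
    for d1, d2 in combinations(sorted(set(report_drugs)), 2):
        for rxn in report_reactions:
            yield (d1, d2, rxn)

# A hypothetical report with 4 drugs and 3 reactions produces
# C(4, 2) * 3 = 18 triples to count.
report = (["warfarin", "aspirin", "lisinopril", "metformin"],
          ["bleeding", "dizziness", "nausea"])
print(sum(1 for _ in triples(*report)))  # 18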
The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.
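The post does not show the pipeline itself; as one plausible shape for the counting step, here is a Hadoop Streaming style mapper and reducer in Python, assuming (hypothetically) one tab-separated report per line with semicolon-delimited drug and reaction lists.

```python
#!/usr/bin/env python
# mapper.py -- sketch of the counting step as a Hadoop Streaming job.
# Assumed (hypothetical) input: one report per line, two tab-separated
# fields holding ';'-delimited drug and reaction lists. Emits one
# line per (drug, drug, reaction) triple with a count of 1.
import sys
from itertools import combinations

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue
    drugs = sorted(set(fields[0].split(";")))
    reactions = fields[1].split(";")
    for d1, d2 in combinations(drugs, 2):
        for rxn in reactions:
            print("%s|%s|%s\t1" % (d1, d2, rxn))
```

And the matching reducer:

```python
#!/usr/bin/env python
# reducer.py -- sums the counts for each triple. Hadoop sorts the
# mapper output by key, so all lines for one triple arrive together.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(count)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```

Because the shuffle phase groups every occurrence of a triple onto one reducer, the per-triple counts come out in a single pass over the data, and adding machines divides the work roughly evenly.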
Where have I heard about combinatorial explosions before?
If you think about it, semantic environments (except for artificial ones) are inherently noisy, and the signals we look for to trigger merging may be hard to find.
Semantic environments like Cyc are noise-free, but they are also not the semantic environments in which most data exists and in which we have to make decisions.
Questions: To what extent are “clean” semantic environments artifacts of adapting to the capacities of existing hardware/software? What aspects of the then-current hardware/software would you point to in making that case?