Tall Big Data, Wide Big Data by Luis Apiolaza.
From the post:
After attending two one-day workshops last week I spent most days paying attention to (well, at least listening to) presentations at this biostatistics conference. Most presenters were R users—although Genstat, Matlab and SAS fans were also present—and not once did I hear “I can’t deal with the current size of my data sets”. However, there were some complaints about the speed of R, particularly when dealing with simulations or some genomic analyses.
Some people worried about the size of coming datasets; nevertheless, that worry cut across statistical packages or, more precisely, it went beyond statistical software. How will we be able to even store the data from something like the Square Kilometer Array, let alone analyze it?
Luis makes a distinction between “tall data” (a large data set but few predictors per item) and “wide data” (a small data set but a large number of predictors per item). I am not certain that sampling works with wide data.
Sampling wide data is a question that can be settled by experimentation. Takers?
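One way to start such an experiment is to compare how well row sampling preserves simple estimates in the two shapes. The sketch below is purely illustrative—its data sizes, random generator, and function names are all assumptions, not anything from Luis’s post—but it shows the intuition: sampling 10% of a tall matrix still leaves thousands of rows per predictor, while sampling 10% of a wide matrix leaves only a handful of rows spread across thousands of predictors, so the worst per-predictor error tends to be much larger.

```python
# Illustrative experiment (hypothetical sizes): does row sampling preserve
# per-predictor estimates as well for wide data as for tall data?
import random

random.seed(42)

def make_data(n_rows, n_cols):
    """Matrix of i.i.d. uniform values on [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(n_cols)]
            for _ in range(n_rows)]

def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def worst_mean_error(rows, frac):
    """Largest per-column error when estimating column means from a row sample."""
    sample = random.sample(rows, max(2, int(len(rows) * frac)))
    full, sub = column_means(rows), column_means(sample)
    return max(abs(f - s) for f, s in zip(full, sub))

tall = make_data(10_000, 5)   # tall: many items, few predictors
wide = make_data(50, 2_000)   # wide: few items, many predictors

print("tall, 10% row sample, worst error:", worst_mean_error(tall, 0.10))
print("wide, 10% row sample, worst error:", worst_mean_error(wide, 0.10))
```

With a 10% sample, the tall case averages over ~1,000 rows per predictor, while the wide case averages over only 5 rows and then takes the maximum error across 2,000 predictors, so the wide error is reliably larger. That asymmetry is the crux of the question about sampling wide data.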