Archive for the ‘Reservoir Sampling’ Category

Algorithms Every Data Scientist Should Know: Reservoir Sampling

Sunday, April 28th, 2013

Algorithms Every Data Scientist Should Know: Reservoir Sampling by Josh Wills.

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Beaker image
Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.

Whether reservoir sampling will be useful to you depends on your test for subject identity.

I tend to think of subject identity as being very precise but that isn’t necessarily the case.

Or should I say that precision of subject identity is a matter of requirements?

For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.

Thinking of any test for subject identity being located on a continuum of subject identification. Where the notion of “precision” itself is up for definition.

Russia’s warnings on one of the Boston Marathon bombers, a warning that used his name as he did, not as captured by the US intelligence community, was a case of mistaken level of precision.

Mostly likely the result of an analyst schooled in an English-only curriculum.