Starting Data Analysis with Assumptions

Why you don’t get taxis in Singapore when it rains? by Zafar Anjum.

From the post:

It is common experience that when it rains, it is difficult to get a cab in Singapore - even when you try to call one in or use your smartphone app to book one.

Why does it happen? What could be the reason behind it?

Most people would think that this unavailability of taxis during rain is because of high demand for cab services.

Well, Big Data has a very surprising answer for you, as astonishing as it was for researcher Oliver Senn.

When Senn was first given his assignment to compare two months of weather satellite data with 830 million GPS records of 80 million taxi trips, he was a little disappointed. “Everyone in Singapore knows it’s impossible to get a taxi in a rainstorm,” says Senn, “so I expected the data to basically confirm that assumption.” As he sifted through the data related to a vast fleet of more than 16,000 taxicabs, a strange pattern emerged: it appeared that many taxis weren’t moving during rainstorms. In fact, the GPS records showed that when it rained (a frequent occurrence in this tropical island state), many drivers pulled over and didn’t pick up passengers at all.

Senn did discover the reason for the pattern in the data, and that underlying cause is now being addressed.

The first question should have been: Is this a big data problem?

True, Senn had lots of data to crunch, but that isn’t necessarily an indicator of a big data problem.

Interviews with a few taxi drivers would have dispelled the original assumption of high demand for taxis. They would also have led to the cause of the patterns Senn recognized.

That is, the patterns were a symptom, not a cause.

I first saw this in So you want to be a (big) data hero? by Vinnie Mirchandani.

