Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set by Daniel Alabi and Sweet Song, MongoDB Summer Interns.
From the post:
Flights Dataset Overview
This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.
The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight from 1987, but we narrowed down our project to focus on the most recent available data for the past year (April 2012-March 2013).
We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.
To get started, we wanted to answer a few basic questions concerning the dataset:
- When is the best time of day/day of week/time of year to fly to minimize delays?
- What types of planes suffer the most delays? How old are these planes?
- How often does a delay cascade into other flight delays?
- What was the effect of Hurricane Sandy on air transportation in New York? How quickly did the state return to normal?
A series of blog posts to watch!
I thought the comment:
We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.
was remarkably honest.
The Department of Transportation Table/Field guide reveals that the fields are mostly populated by codes, IDs, and date/time values: values that lend themselves to easy aggregation.
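To see why coded fields make aggregation easy, here is a minimal sketch of the kind of grouping the first question (best day of week to fly) calls for. The field names (`dayOfWeek`, `arrDelay`) and the sample documents are my own assumptions for illustration, not taken from the post or the DOT field guide. The `pipeline` literal shows roughly how such a grouping could be written with the aggregation framework's `$group`, `$avg`, `$sum`, and `$sort` operators; the plain-Python function computes the same result in memory so the example runs without a database.

```python
from collections import defaultdict

# Hypothetical flight documents; field names and values are assumptions.
flights = [
    {"carrier": "AA", "dayOfWeek": 1, "arrDelay": 10},
    {"carrier": "UA", "dayOfWeek": 1, "arrDelay": 30},
    {"carrier": "AA", "dayOfWeek": 2, "arrDelay": 0},
    {"carrier": "DL", "dayOfWeek": 2, "arrDelay": 10},
    {"carrier": "UA", "dayOfWeek": 5, "arrDelay": 45},
]

# Roughly how the same grouping might look as an aggregation pipeline.
# It is only a data structure here; nothing is sent to a server.
pipeline = [
    {"$group": {
        "_id": "$dayOfWeek",
        "avgDelay": {"$avg": "$arrDelay"},
        "flights": {"$sum": 1},
    }},
    {"$sort": {"avgDelay": 1}},
]


def average_delay_by(docs, key, value):
    """Group docs by `key`, average `value`, and sort by ascending average."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of delays, count]
    for doc in docs:
        bucket = totals[doc[key]]
        bucket[0] += doc[value]
        bucket[1] += 1
    rows = [
        {"_id": k, "avgDelay": total / count, "flights": count}
        for k, (total, count) in totals.items()
    ]
    return sorted(rows, key=lambda row: row["avgDelay"])


result = average_delay_by(flights, "dayOfWeek", "arrDelay")
for row in result:
    print(row)
```

Because the grouping key is a small integer code, there is nothing to clean or normalize before grouping, which is the sense in which these fields are "easy". Free-text fields would need far more preparation.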
Looking forward to harder aggregation examples as this series develops.