Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 14, 2014

Model building with the iris data set for Big Data

Filed under: BigData,Data — Patrick Durusau @ 7:09 pm

Model building with the iris data set for Big Data by Joseph Rickert.

From the post:

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

  • It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.
  • The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
  • There are interesting things to learn from the data set. (This exercise from Kane and Emerson for example)
  • The data set is tidy, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes)

Joseph reviews what may become the iris data set of “big data,” airline data.

Its variables:

Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) – 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes

Source: http://stat-computing.org/dataexpo/2009/the-data.html

Waiting for the data set to download. Lots of questions suggest themselves. For example, variation or lack thereof in the use of fields 25-29.

Enjoy!

I first saw this in a tweet by David Smith.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress