IMDB Top 100K Movies Analysis in Depth Part 1 by Bugra Akyildiz.
IMDB Top 100K Movies Analysis in Depth Part 2
IMDB Top 100K Movies Analysis in Depth Part 3
IMDB Top 100K Movies Analysis in Depth Part 4
From part 1:
The data is from IMDB and includes the 100,042 most-voted movies from 1950 to 2013. (I know why 100,000 is there, but I have no idea how the extra 42 movies got squeezed in. Instead of blaming my web scraping skills, I blame the universe.)
I chose the number of votes as the metric for ordering the movies because the information about a movie (title, certificate, outline, director and so on) is generally more likely to be complete for movies with a high number of votes. Moreover, IMDB uses the number of votes as a metric to determine its ranking, so the number of votes also correlates with the rating. Further, nearly everybody has at least some idea of the IMDB Top 250 or IMDB Top 1000, which are ordered by the ratings computed by IMDB.
Although the data is quite rich in terms of basic information, only year, rating and votes are complete for all of the movies. Only ~80% of the movies have runtime information (in minutes). The categories are about 90% complete, which could be considered good, but the certificate information is the sparsest (only ~25% of the movies have it).
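As a rough illustration of how per-field completeness figures like the ones above could be computed, here is a minimal Python sketch. The sample records and field names are invented for illustration and are not taken from the author's scraped data; a missing value is assumed to be stored as None or an empty string.

```python
# Hypothetical sample records; the field names are assumptions, not the author's schema.
movies = [
    {"title": "A", "year": 1994, "rating": 9.2, "votes": 1000, "runtime": 142, "certificate": "R"},
    {"title": "B", "year": 1972, "rating": 9.1, "votes": 900, "runtime": 175, "certificate": None},
    {"title": "C", "year": 2008, "rating": 8.9, "votes": 800, "runtime": None, "certificate": None},
    {"title": "D", "year": 1957, "rating": 8.8, "votes": 700, "runtime": 96, "certificate": ""},
]


def completeness(records, fields):
    """Return the fraction of records with a non-missing value for each field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    result = {}
    for field in fields:
        # Count records where the field is present and neither None nor empty.
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = present / total
    return result


if __name__ == "__main__":
    for field, share in completeness(movies, ["year", "runtime", "certificate"]).items():
        print(f"{field}: {share:.0%} complete")
```

Fields with low completeness, like certificate in the quoted figures, would be poor candidates for any analysis that needs full coverage.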
This post aims to explore different aspects of the data (categories and rating) and also to surface useful information (the best movie in terms of rating or votes for each year).
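The "best movie for each year" lookup mentioned above can be sketched in a few lines of Python. This is only one possible approach under stated assumptions: the records and field names are made up for illustration, and ties are resolved by keeping the first movie encountered.

```python
# Hypothetical sample records; titles, years and numbers are invented for illustration.
sample = [
    {"title": "A", "year": 1994, "rating": 9.2, "votes": 1000},
    {"title": "B", "year": 1994, "rating": 8.8, "votes": 1500},
    {"title": "C", "year": 2008, "rating": 8.9, "votes": 800},
    {"title": "D", "year": 2008, "rating": 7.5, "votes": 300},
]


def best_per_year(records, key):
    """Return a dict mapping each year to the record with the highest value of `key`."""
    best = {}
    for record in records:
        year = record["year"]
        # Keep the first record seen for a year unless a strictly better one appears.
        if year not in best or record[key] > best[year][key]:
            best[year] = record
    return best


if __name__ == "__main__":
    for metric in ("rating", "votes"):
        winners = best_per_year(sample, metric)
        for year in sorted(winners):
            print(metric, year, winners[year]["title"])
```

Running the same function with "rating" and with "votes" shows that the two metrics can pick different winners for the same year.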
An interesting analysis of the Internet Movie Database (IMDB) that incorporates other sources, such as revenue data and information on actors’ and actresses’ ages and heights.
Suggestions on other data to include or representation techniques?
I first saw this in a tweet by Gregory Piatetsky.