Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 4, 2014

How Netflix Reverse Engineered Hollywood [+ Perry Mason Mystery]

Filed under: BigData,Data Analysis,Data Mining,Web Scrapers — Patrick Durusau @ 4:47 pm

How Netflix Reverse Engineered Hollywood by Alexis C. Madrigal.

From the post:

If you use Netflix, you’ve probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it’s absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s?

If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of “personalized genres” need to be to describe the entire Hollywood universe?

This idle wonder turned to rabid fascination when I realized that I could capture each and every microgenre that Netflix’s algorithm has ever created.

Through a combination of elbow grease and spam-level repetition, we discovered that Netflix possesses not several hundred genres, or even several thousand, but 76,897 unique ways to describe types of movies.

There are so many that just loading, copying, and pasting all of them took the little script I wrote more than 20 hours.

We’ve now spent several weeks understanding, analyzing, and reverse-engineering how Netflix’s vocabulary and grammar work. We’ve broken down its most popular descriptions, and counted its most popular actors and directors.

To my (and Netflix’s) knowledge, no one outside the company has ever assembled this data before.

What emerged from the work is this conclusion: Netflix has meticulously analyzed and tagged every movie and TV show imaginable. They possess a stockpile of data about Hollywood entertainment that is absolutely unprecedented. The genres that I scraped and that we caricature above are just the surface manifestation of this deeper database.

If you like data mining war stories in detail, then you will love this post by Alexis.

Along the way you will learn about:

  • UBot Studio – Web scraping.
  • AntConc – Linguistic software.
  • Exploring other information to infer tagging practices.
  • More details about Netflix genres in general terms.

Be sure to read to the end to pick up on the Perry Mason mystery.

The Perry Mason mystery:

Netflix’s Favorite Actors (by number of genres)

  1. Raymond Burr (who played Perry Mason)
  2. Bruce Willis
  3. George Carlin
  4. Jackie Chan
  5. Andy Lau
  6. Robert De Niro
  7. Barbara Hale (also on Perry Mason)
  8. Clint Eastwood
  9. Elvis Presley
  10. Gene Autry

Question: Why is Raymond Burr in more genres than any other actor?
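
For anyone who wants to poke at the question, here is a minimal sketch of how a "by number of genres" ranking like the one above might be computed from a list of scraped genre names. It is not Alexis's method. The sample strings are invented for illustration, and the assumption that an actor appears after the word "starring" at the end of a genre name is mine, as are the names `SAMPLE_GENRES` and `actors_by_genre_count`.

```python
import re
from collections import Counter

# Invented sample genre names for illustration only; these are NOT
# taken from the scraped Netflix data described in the post.
SAMPLE_GENRES = [
    "Mysteries starring Raymond Burr",
    "Courtroom Dramas starring Raymond Burr",
    "Action Movies starring Bruce Willis",
    "Emotional Fight-the-System Documentaries",
]

# Assumption: the actor's name follows the word "starring" and runs
# to the end of the genre name.
STARRING = re.compile(r"\bstarring (.+)$")


def actors_by_genre_count(genres):
    """Return (actor, number_of_genres) pairs, most genres first."""
    counts = Counter()
    for genre in set(genres):  # count each distinct genre name once
        match = STARRING.search(genre)
        if match:
            counts[match.group(1).strip()] += 1
    return counts.most_common()


if __name__ == "__main__":
    for actor, n in actors_by_genre_count(SAMPLE_GENRES):
        print(f"{actor}: {n}")
```

A count like this only says who appears in the most genre names. It says nothing about why, which is the mystery.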

Some additional reading for this post: Selling Blue Elephants

Just as a preview, the “Blue Elephants” book/site is about selling what consumers want to buy, not what you think is a world-saving idea. Those are different. Sometimes very different.

I first saw this in a tweet by Gregory Piatetsky.
