From the webpage:
The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interaction of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information of the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.
The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.
The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.
A great dataset, but one you aren’t going to see unless you have a university email account.
When it took my regular Yahoo! login and I accepted the license agreement, I thought I was in. Not a chance!
No open data at Yahoo!
Why Yahoo! would have such a restriction, particularly in light of the progress towards open data, is a complete mystery.
To be honest, even if I heard Yahoo!’s “reasons,” I doubt I would find them convincing.
If you have a university email address, good for you, download and use the data.
If you don’t have a university email address, can you ping me with the email of a decision maker at Yahoo! who can void this no-open-data policy?