Archive for the ‘Yahoo!’ Category

Yahoo News Feed dataset, version 1.0 (1.5TB) – Sorry, No Open Data At Yahoo!

Thursday, January 14th, 2016

R10 – Yahoo News Feed dataset, version 1.0 (1.5TB)

From the webpage:

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B lines (1.5TB bzipped) of user-news item interaction data, collected by recording the user-news item interaction of about 20M users from February 2015 to May 2015. In addition to the interaction data, we are providing the demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information of the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.

The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.

The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.

A great data set but one you aren’t going to see unless you have a university email account.

I thought that once it took my regular Yahoo! login and I accepted the license agreement, I was in. Not a chance!

No open data at Yahoo!

Why Yahoo! would have such a restriction, particularly in light of the progress towards open data, is a complete mystery.

To be honest, even if I heard Yahoo!’s “reasons,” I doubt I would find them convincing.

If you have a university email address, good for you, download and use the data.

If you don’t have a university email address, can you ping me with the email of a decision maker at Yahoo! who can void this no open data policy?


Death of Yahoo Directory

Sunday, October 26th, 2014

Progress Report: Continued Product Focus by Jay Rossiter, SVP, Cloud Platform Group.

From the post:

At Yahoo, focus is an important part of accomplishing our mission: to make the world’s daily habits more entertaining and inspiring. To achieve this focus, we have sunset more than 60 products and services over the past two years, and redirected those resources toward products that our users care most about and are aligned with our vision. With even more smart, innovative Yahoos focused on our core products – search, communications, digital magazines, and video – we can deliver the best for our users.

Directory: Yahoo was started nearly 20 years ago as a directory of websites that helped users explore the Internet. While we are still committed to connecting users with the information they’re passionate about, our business has evolved and at the end of 2014 (December 31), we will retire the Yahoo Directory. Advertisers will be upgraded to a new service; more details to be communicated directly.

Understandable but sad. Think of indexing a book that expanded as rapidly as the Internet over the last twenty (20) years. Especially if the content might or might not have any resemblance to already existing content.

The Internet remains in serious need of a curated means to access quality information. Almost any search returns links ranging from high to questionable quality.

Imagine if Yahoo segregated the top 500 computer science publishers, archives, societies, departments, and blogs into a block of searchable content. (The 500 number is wholly arbitrary; it could be some other number.) Users would pre-qualify themselves as interested in computer science materials and create a market segment for advertising purposes.

Users would get less trash in their results and advertisers would have pre-qualified targets.
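To make the idea concrete, here is a minimal sketch of restricting a result list to a curated set of sources. Everything in it is an assumption for illustration: the three domains stand in for the (arbitrary) 500, and the names `CURATED_DOMAINS`, `in_curated_set` and `curated_results` are hypothetical, not anything Yahoo offers.

```python
from urllib.parse import urlsplit

# Hypothetical curated list; a real one would hold the ~500 vetted sources.
CURATED_DOMAINS = {"acm.org", "arxiv.org", "ieee.org"}


def in_curated_set(url, curated=CURATED_DOMAINS):
    """Return True if the URL's host is a curated domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in curated)


def curated_results(urls, curated=CURATED_DOMAINS):
    """Keep only results from curated sources, preserving the original ranking."""
    return [u for u in urls if in_curated_set(u, curated)]
```

Matching on the parsed host rather than on the raw URL string keeps a page such as `http://spam.example.com/acm.org` from slipping into the curated block.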

A pre-curated search set might mean you would miss an important link, but realistically, few people read beyond the first twenty (20) links anyway. An analysis of search logs at PubMed shows that 80% of users choose a link from the first twenty results.
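If you have click logs of your own, a figure like that is easy to check. A rough sketch, assuming each log entry has already been reduced to the 1-based rank of the result the user clicked; the function name `share_within_top_k` is hypothetical and the sample ranks are invented for illustration, not taken from PubMed:

```python
def share_within_top_k(clicked_ranks, k=20):
    """Fraction of clicks whose 1-based result rank is at most k."""
    if not clicked_ranks:
        return 0.0
    within = sum(1 for rank in clicked_ranks if rank <= k)
    return within / len(clicked_ranks)


# Invented sample: ten clicks, two of them beyond the first twenty results.
sample_ranks = [1, 3, 2, 25, 7, 1, 40, 12, 5, 18]
print(share_within_top_k(sample_ranks))  # 0.8
```

The cutoff `k` is a parameter because, as noted, twenty is a convenient number rather than a law; the share may well vary by domain.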

In theory you may have > 10,000 “hits” but queuing all of those up for serving to a user is a waste of time.

Suspect it varies by domain but twenty (20) high quality “hits” from curated content would be a far cry from average search results now.

I first saw this in Greg Linden’s Quick Links for Wednesday, October 01, 2014.

One Hundred Million…

Wednesday, June 25th, 2014

One Hundred Million Creative Commons Flickr Images for Research by David A. Shamma.

From the post:

Today the photograph has transformed again. From the old world of unprocessed rolls of C-41 sitting in a fridge 20 years ago to sharing photos on the 1.5” screen of a point and shoot camera 10 years back. Today the photograph is something different. Photos automatically leave their capture (and formerly captive) devices to many sharing services. There are a lot of photos. A back of the envelope estimation reports 10% of all photos in the world were taken in the last 12 months, and that was calculated three years ago. And of these services, Flickr has been a great repository of images that are free to share via Creative Commons.

On Flickr, photos, their metadata, their social ecosystem, and the pixels themselves make for a vibrant environment for answering many research questions at scale. However, scientific efforts outside of industry have relied on various sized efforts of one-off datasets for research. At Flickr and at Yahoo Labs, we set out to provide something more substantial for researchers around the globe.

[image omitted]

Today, we are announcing the Flickr Creative Commons dataset as part of Yahoo Webscope’s datasets for researchers. The dataset, we believe, is one of the largest public multimedia datasets that has ever been released—99.3 million images and 0.7 million videos, all from Flickr and all under Creative Commons licensing.

The dataset (about 12GB) consists of a photo_id, a jpeg url or video url, and some corresponding metadata such as the title, description, title, camera type, title, and tags. Plus about 49 million of the photos are geotagged! What’s not there, like comments, favorites, and social network data, can be queried from the Flickr API.

The good news doesn’t stop there, the 100 million photos have been analyzed for standard features as well!


Yahoo! Search Blog moved!

Sunday, June 23rd, 2013

Just in case you noticed that Yahoo! Search Blog has moved and left you an incorrect forwarding address:

FYI: The correct address is:

Streaming IN Hadoop: Yahoo! release Storm-YARN

Saturday, June 15th, 2013

Streaming IN Hadoop: Yahoo! release Storm-YARN by Jim Walker.

From the post:

Over the past year, customers have told us they want to store all their data in one place and interact with it in multiple ways… they want to use Hadoop, but in order to do so, it needs to extend beyond batch. It also needs to be interactive and real-time (among others).

This is the entire principle behind YARN, which together with others in the community, Arun Murthy and the team at Hortonworks have been working on for more than 5 years! The YARN based architecture of Hadoop 2.0 is hugely significant and we have been working closely with many partners to incorporate it into their applications.

Storm-YARN Released as Open Source

Yahoo! has been testing Hadoop 2 and its YARN-based architecture for quite some time. All the while they have worked on the convergence of the streaming framework Storm with Hadoop. This work has resulted in a YARN based version of Storm that will radically improve performance and resource management for streaming.

The release blog post from Yahoo.

Processing of data, even big data, is approaching “interactive and real-time,” although I suspect definitions of those terms vary. What is “interactive” for an automated trader might be too fast for a human trader.

What I haven’t seen is concurrent development on the handling of the semantics of big data.

After the initial hysteria over the scope of NSA snooping, except for cases where the NSA was given the identity of a suspect (and not always then), was its data gathering of any use?

In topic map terms, the semantic impedance between the data systems was too great for useful manipulation of the data sets as one.

Streaming in Hadoop is welcome news, but until we can robustly manage the semantics of data in streams, much gold is going to pass uncollected from streams.

Introducing BOSS Geo – the next chapter for BOSS

Friday, September 28th, 2012

Introducing BOSS Geo – the next chapter for BOSS

From the post:

Today, the Yahoo! BOSS team is thrilled to announce BOSS Geo, new additions to our Search API that’s designed to help foster innovation in the search industry. BOSS Geo, comprised of two popular services – PlaceFinder and PlaceSpotter – now offers powerful, new geo services to BOSS developers.

Geo is increasingly important in today’s always-on, mobile world and adding features like these have been among the most requested we’ve received from our developers. With mobile devices becoming more pervasive, users everywhere want to be able to quickly pull up relevant geo information like maps or addresses. By adding PlaceFinder and PlaceSpotter to BOSS, we’re arming developers with rich new tools for driving more valuable and personalized interactions with their users.

PlaceFinder – Geocoding made simple

PlaceFinder is a geocoder (and reverse geocoder) service. The service helps developers convert an address into a latitude/longitude and alternatively, if you provide a latitude/longitude it can resolve it to an address. Whether you are building a check-in service or want to show an address on a map, we’ve got you covered. PlaceFinder already powers several popular applications like foursquare, which uses it to power check-ins on their mobile application. BOSS PlaceFinder offers tiered pricing and one simple monthly bill.

(graphics omitted)

PlaceSpotter – Adding location awareness to your content

The PlaceSpotter API (formerly known as PlaceMaker) allows developers to take a piece of content, pull rich information about the locations mentioned and provide meaning to those locations. A news article is no longer just text but has rich, meaningful geographical information associated with it. For instance, the next time your users are reading a review of a cool new coffee shop in the Mission neighborhood in San Francisco, they can discover another article about a hip new bakery in the same neighborhood. Learn more on the new PlaceSpotter service.

What information would you merge using address data as a link point?

Amsterdam (Netherlands) is included. Perhaps sexual preferences in multiple languages, keyed to your cell phone’s location? (Or does that exist already?)


We intend to shut down the current free versions of PlaceFinder and PlaceMaker on November 17, 2012.

Development using YQL tables will still be available.

Yahoo! Search Scientists Break New Ground on Search Results

Saturday, March 17th, 2012

Yahoo! Search Scientists Break New Ground on Search Results

From the post:

Understanding a person’s intent when searching on the web is critical to the quality of search results offered and at Yahoo! Search, the science team is constantly working to refine our technology and provide people with more relevant answers, not links, to their search query.

Recently, Yahoo! Search scientists built a new search platform from the ground up with machine learning technology that improves Yahoo!’s vertical intent triggering system and, as a result, our ability to better anticipate the needs of the individual user as he or she searches online. With this new platform, our search algorithm has the ability to adapt to what users are really interested in, by continuously monitoring how they engage with the search results. The system then continuously and automatically improves itself to provide the most engaging web search experience.

This technology was recently launched for news and movie search queries, two categories that tested extremely well with the technology. For example, with breaking news search terms constantly changing, humans can’t instantly track which queries are now breaking news stories. The intended result for a user can change for the same search query on a daily or even hourly basis. The technology can determine what the users are looking for and bring it to the top. And the key results that may have been at the top this morning, can be moved to the middle of the search results page at the end of day if user behavior determines other content is now more relevant.

Based on the positive feedback we’ve received in testing this platform for news and movie searches, we plan to roll out this new technology to support shopping, local, travel and mobile searches in the coming months, as well as other experiences across the Yahoo! network.

There wasn’t enough information in the post to evaluate the claims of improvement. I tried to post a comment asking when more details will appear, but it was with Firefox on Ubuntu, so it may not have taken.

If you know what Yahoo! has done differently and can say what it is, please do. I am sure we would all like to know.

As you know, enabling users to state their intent seems like a better strategy to me. At least better than simply running the numbers like a network ratings sweep. It works, but only just.