Improving Twitter search with real-time human computation by Edwin Chen.
From the post:
Before we delve into the details, here’s an overview of how the system works.
(1) First, we monitor for which search queries are currently popular.
Behind the scenes: we run a Storm topology that tracks statistics on search queries.
For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.
(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.
Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon’s Mechanical Turk service, and then polls Mechanical Turk for a response.
For example: as soon as we notice “Big Bird” spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.
(3) Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.
Let’s now explore the first two sections above in more detail.
….
The post is quite awesome and I suggest you read it in full.
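To make the three steps concrete, here is a rough sketch of my own; it is not from the post. The real system uses a Storm topology, a Thrift API and Mechanical Turk, and I do not reproduce any of those here. The class name, the thresholds (`history`, `ratio`, `min_count`) and the `judge` callable are all assumptions picked for illustration.

```python
from collections import defaultdict, deque


class SpikeDetector:
    """Flags a query when its count in the current window far exceeds its recent average."""

    def __init__(self, history=24, ratio=5.0, min_count=50):
        self.history = history      # past windows remembered per query (assumed value)
        self.ratio = ratio          # multiple of the baseline that counts as a spike (assumed)
        self.min_count = min_count  # ignore queries with very little traffic (assumed)
        self.past = defaultdict(lambda: deque(maxlen=history))
        self.current = defaultdict(int)

    def record(self, query):
        """Count one search for this query in the current window."""
        self.current[query.lower()] += 1

    def close_window(self):
        """End the current window and return the queries that spiked in it."""
        spiking = []
        for query, count in self.current.items():
            past = self.past[query]
            baseline = sum(past) / len(past) if past else 0.0
            if count >= self.min_count and count > self.ratio * max(baseline, 1.0):
                spiking.append(query)
        # Roll history forward; queries not seen in this window get a zero.
        for query in set(self.past) | set(self.current):
            self.past[query].append(self.current.get(query, 0))
        self.current = defaultdict(int)
        return spiking


def annotate(spiking_queries, judge):
    """Ask a judge (any callable standing in for a person) to label each spiking query."""
    return {q: judge(q) for q in spiking_queries}


detector = SpikeDetector()
for _ in range(3):              # three quiet windows, one search each
    detector.record("big bird")
    detector.close_window()
for _ in range(200):            # then a sudden burst
    detector.record("Big Bird")
spiking = detector.close_window()

# The lambda is a canned answer standing in for a human judge.
labels = annotate(spiking, judge=lambda q: {"category": "politics"})
print(spiking)  # ['big bird']
print(labels)   # {'big bird': {'category': 'politics'}}
```

The interesting part is not the arithmetic but the hand-off: the machine only notices that something changed, and a person says what it means.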
This resonates with a recent comment about Lotus Agenda.
The short version: a user creates a thesaurus in Agenda, and that thesaurus enriches the user’s searches. The user supplies the semantics that enhance the searches.
In the Twitter case, human reviewers supply semantics to enhance the searches.
In both cases, Agenda and Twitter, humans are supplying semantics to enhance the searches.
I emphasize “supplying semantics” as a contrast to mechanistic searches that rely only on matching text.
Mechanistic searches can be quite valuable but they pale beside searches where semantics have been “supplied.”
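A toy illustration of the difference, again my own sketch and not code from Agenda or Twitter. The documents and the single thesaurus entry are made up; the point is only that a human-supplied mapping lets a search find text the literal query never mentions.

```python
def literal_search(query, docs):
    """Mechanistic search: return documents that contain the query text."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]


def enriched_search(query, docs, thesaurus):
    """Search with supplied semantics: also match terms a human related to the query."""
    terms = [query.lower()] + [t.lower() for t in thesaurus.get(query.lower(), [])]
    return [d for d in docs if any(t in d.lower() for t in terms)]


# Made-up documents for illustration.
docs = [
    "Big Bird trends after the debate",
    "Sesame Street funding becomes a campaign issue",
    "New Dora the Explorer episode airs",
]

# A human-supplied mapping (hypothetical entry).
thesaurus = {"big bird": ["sesame street"]}

print(literal_search("Big Bird", docs))
print(enriched_search("Big Bird", docs, thesaurus))
```

The literal search returns only the first document; the enriched search also returns the second, because a person said the two terms belong together.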
The Twitter experience is an important clue.
The answer to semantics for searches lies somewhere between “ask an expert” (you get his/her semantics) and “ask all of us” (too many answers to be useful).
More to follow.