Search Algorithms with Google Director of Research Peter Norvig
From the post:
As you will see in the transcript below, this discussion focused on the use of artificial intelligence algorithms in search. Peter outlines for us the approach used by Google on a number of interesting search problems, and how they view search problems in general. This is fascinating reading for those of you who want to get a deeper understanding of how search is evolving and the technological approaches that are driving it. The types of things that are detailed in this interview include:
- The basic approach used to build Google Translate
- The process Google uses to test and implement algorithm updates
- How voice driven search works
- The methodology being used for image recognition
- How Google views speed in search
- How Google views the goals of search overall
Some of the particularly interesting tidbits include:
- Teaching automated translation systems vocabulary and grammar rules is not a viable approach. There are too many exceptions, and language changes and evolves rapidly. Google Translate uses a data-driven approach of finding millions of real-world translations on the web and learning from them. (The first sketch after this list illustrates the idea.)
- Chrome will auto translate foreign language websites for you on the fly (if you want it to).
- Google tests tens of thousands of algorithm changes per year, and makes one to two actual changes every day.
- Testing is layered, starting with a panel of users comparing current and proposed results, perhaps a spin through the usability lab at Google, and finally a live test with a small subset of actual Google users. (The second sketch after this list shows that last step.)
- Google Voice Search relies on 230 billion real-world search queries to learn all the different ways that people articulate given words. So people no longer need to train their speech recognition for their own voice, as Google has enough real-world examples to make that step unnecessary. (The third sketch after this list shows the role the query log plays.)
- Google Image search allows you to drag and drop images onto the search box, and it will try to figure out what the image is for you. I show a screenshot of an example below. I LOVE that feature!
- Google is obsessed with speed. As Peter says, “you want the answer before you’re done thinking of the question”. Expressed from a productivity perspective, if you don’t have the answer that soon, your flow of thought will be interrupted.
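
To make the data-driven translation point concrete, here is a toy, word-level sketch in Python: count how words are actually rendered in aligned example pairs and reuse the most frequent rendering. This is only the core intuition, not Google Translate's actual pipeline; the tiny corpus, the 1:1 word alignment, and the function names are my own assumptions.

```python
from collections import Counter, defaultdict

# Toy illustration of the data-driven idea: instead of hand-coding grammar,
# count how words are actually translated in a corpus of aligned sentence
# pairs and pick the most frequent rendering. Real systems model phrases,
# reordering, and context; this keeps only the intuition.

def build_translation_table(aligned_pairs):
    """aligned_pairs: list of (source_words, target_words), assumed 1:1 aligned."""
    counts = defaultdict(Counter)
    for source_words, target_words in aligned_pairs:
        for src, tgt in zip(source_words, target_words):
            counts[src][tgt] += 1
    # For each source word keep the translation seen most often in real data.
    return {src: tgts.most_common(1)[0][0] for src, tgts in counts.items()}

def translate(sentence, table):
    return [table.get(word, word) for word in sentence]

# Tiny made-up "parallel corpus": word-aligned Spanish/English pairs.
corpus = [
    (["la", "casa", "blanca"], ["the", "house", "white"]),
    (["la", "casa", "verde"], ["the", "house", "green"]),
    (["el", "perro", "blanco"], ["the", "dog", "white"]),
]

table = build_translation_table(corpus)
print(translate(["la", "casa", "verde"], table))  # ['the', 'house', 'green']
```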
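And a minimal sketch of the last testing layer, a live test on a small slice of users. Nothing here reflects Google's real experiment framework, which the interview does not detail; the percentage, the hashing scheme, and the ranker stubs are assumptions for illustration.

```python
import hashlib

# Hypothetical sketch of a live test: expose a proposed ranking change to a
# small, stable slice of users and leave everyone else on the current
# algorithm, then compare the metrics from the two groups.

EXPERIMENT_PERCENT = 1  # e.g. 1% of users see the candidate algorithm

def bucket(user_id: str, experiment_name: str) -> str:
    """Deterministically assign a user to 'experiment' or 'control'."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    slot = int(digest, 16) % 100  # stable value in 0..99 per user/experiment
    return "experiment" if slot < EXPERIMENT_PERCENT else "control"

def current_ranker(query):    # placeholder for the production algorithm
    return [f"current result for {query!r}"]

def candidate_ranker(query):  # placeholder for the proposed change
    return [f"candidate result for {query!r}"]

def rank_results(user_id, query):
    if bucket(user_id, "proposed-ranking-v2") == "experiment":
        return candidate_ranker(query)  # the change being evaluated
    return current_ranker(query)        # today's production algorithm

print(rank_results("user-42", "peter norvig"))
```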
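Finally, a sketch of the role the query log plays in voice search: given several ways the recognizer might have heard a query, prefer the one real users actually type most often. The log counts and candidate strings below are made up, and real systems combine this with acoustic scores rather than using the log alone.

```python
# Stand-in for the billions of real queries mentioned in the interview.
QUERY_LOG_COUNTS = {
    "weather in new york": 120_000,
    "whether in new york": 150,
    "wether in new york": 5,
}

def pick_transcription(candidates):
    """Choose the acoustic candidate best supported by the query log."""
    return max(candidates, key=lambda q: QUERY_LOG_COUNTS.get(q, 0))

heard = ["weather in new york", "whether in new york", "wether in new york"]
print(pick_transcription(heard))  # "weather in new york"
```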
Reading the interview, it occurred to me that perhaps, just perhaps, in authoring semantic applications, whether Semantic Web or Topic Maps, we have been overly concerned with “correctness.” More so on the logic side, where applications fall over when they encounter outliers, but precision is also the enemy of large-scale production of topic maps.
What if we took a tack from Google’s data-driven approach and used it to find mappings between data structures and the terms in those structures? I know automated techniques have been used for preliminary mapping of schemas before. What I am suggesting is that we capture the basis for the mapping, so we can improve or change it.
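To make that concrete, here is a minimal sketch of what capturing the basis for a mapping might look like: the record carries its evidence (shared values, co-occurring fields) along with its conclusion, so a later pass can add to it or overturn it rather than start from scratch. The FieldMapping structure, field names, and scoring weights are illustrative assumptions, not an existing tool.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of "capture the basis for the mapping": store not just
# the conclusion (field A maps to field B) but the observations behind it.

@dataclass
class FieldMapping:
    source_field: str    # e.g. "POL_NO" in system A
    target_field: str    # e.g. "insurance_policy_number" in system B
    confidence: float    # current belief in the mapping
    evidence: list = field(default_factory=list)  # why we believe it

def propose_mapping(source_field, target_field, shared_values, co_fields):
    """Record a candidate mapping together with the observations behind it."""
    ev = [
        {"kind": "shared_values", "count": len(shared_values), "sample": shared_values[:3]},
        {"kind": "co_occurring_fields", "fields": co_fields},
    ]
    # Toy scoring rule: more shared values and more related fields, more belief.
    confidence = min(1.0, 0.1 * len(shared_values) + 0.05 * len(co_fields))
    return FieldMapping(source_field, target_field, confidence, ev)

m = propose_mapping("POL_NO", "insurance_policy_number",
                    shared_values=["PN-1001", "PN-1002", "PN-1003", "PN-1004"],
                    co_fields=["insured_name", "premium", "effective_date"])
print(m.confidence, m.evidence)  # later runs add evidence instead of starting over
```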
Although there are more than 70 names for “insurance policy number” in information systems, I suspect that within a given domain those names stand in relationships to other subjects that would help refine the mining of those terms over time. Rather than making mining/mapping a “run it again, Sam” type of event, capturing that information could improve our odds at other mappings.
Depending on the domain, how accurate does it need to be? Particularly since we can build feedback into the systems so that as users encounter errors, those are corrected and cascade back to other users. Places users don’t visit may be wrong, but if no one visits, what difference does it make?
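Here is a sketch of that feedback loop, assuming a shared mapping store: one user's correction replaces a bad mapping, and every later lookup sees the fix. The MappingStore class and its methods are hypothetical, just enough to show the cascade.

```python
# Hypothetical shared mapping store: a user's correction lowers or replaces a
# bad mapping for everyone else, instead of each site re-running the mining.

class MappingStore:
    def __init__(self):
        self.mappings = {}  # source_field -> (target_field, confidence)

    def publish(self, source_field, target_field, confidence):
        self.mappings[source_field] = (target_field, confidence)

    def report_error(self, source_field, corrected_target):
        """A user hit a wrong mapping; the fix cascades to later lookups."""
        self.mappings[source_field] = (corrected_target, 0.9)

    def lookup(self, source_field):
        return self.mappings.get(source_field)

store = MappingStore()
store.publish("POL_NO", "purchase_order_number", 0.4)    # a wrong guess
store.report_error("POL_NO", "insurance_policy_number")  # one user corrects it
print(store.lookup("POL_NO"))  # every later user now sees the correction
```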
Very compelling interview and I suggest you read it in full.