"Regular Expression Matching with a Trigram Index, or How Google Code Search Worked" by Russ Cox.
In the summer of 2006, I was lucky enough to be an intern at Google. At the time, Google had an internal tool called gsearch that acted as if it ran grep over all the files in the Google source tree and printed the results. Of course, that implementation would be fairly slow, so what gsearch actually did was talk to a bunch of servers that kept different pieces of the source tree in memory: each machine did a grep through its memory and then gsearch merged the results and printed them. Jeff Dean, my intern host and one of the authors of gsearch, suggested that it would be cool to build a web interface that, in effect, let you run gsearch over the world’s public source code. I thought that sounded fun, so that’s what I did that summer. Due primarily to an excess of optimism in our original schedule, the launch slipped to October, but on October 5, 2006 we did launch (by then I was back at school but still a part-time intern).
I built the earliest demos using Ken Thompson’s Plan 9 grep, because I happened to have it lying around in library form. The plan had been to switch to a “real” regexp library, namely PCRE, probably behind a newly written, code reviewed parser, since PCRE’s parser was a well-known source of security bugs. The only problem was my then-recent discovery that none of the popular regexp implementations – not Perl, not Python, not PCRE – used real automata. This was a surprise to me, and even to Rob Pike, the author of the Plan 9 regular expression library. (Ken was not yet at Google to be consulted.) I had learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob’s and Ken’s code. The idea that you wouldn’t use the guaranteed linear time algorithm had never occurred to me. But it turned out that Rob’s code in particular used an algorithm only a few people had ever known, and the others had forgotten about it years earlier. We launched with the Plan 9 grep code; a few years later I did replace it, with RE2.
Russ covers inverted indexes, trigrams, and regular expressions, points to working code, and shows how to run the code searcher locally, for example over the Linux source code.
It is an extremely useful introduction to indexes and regexes.
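To give a feel for the inverted-index and trigram ideas mentioned above, here is a minimal Python sketch. It is not Russ's code: the function names (`trigrams`, `build_index`, `search_literal`) and the sample documents are made up for illustration, and it only handles literal search strings, whereas the article goes further and derives trigram queries from full regular expressions. The idea it shows is the two-step lookup: use the index to narrow down candidate documents, then confirm each candidate with a real match.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Build an inverted index: trigram -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(name)
    return index

def search_literal(index, docs, query):
    """Return sorted names of documents that contain the literal string query.

    Assumption for this sketch: the query is a plain string, not a regex.
    """
    grams = trigrams(query)
    if grams:
        # A matching document must contain every trigram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Queries shorter than 3 characters have no trigrams: scan everything.
        candidates = set(docs)
    # The index only narrows the search; confirm each candidate with a real match.
    pattern = re.compile(re.escape(query))
    return sorted(name for name in candidates if pattern.search(docs[name]))

# Hypothetical sample corpus.
docs = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    return 0",
    "c.txt": "Google Code Search",
}
index = build_index(docs)
print(search_literal(index, docs, "return 0"))
```

The confirmation step matters because the trigram filter can produce false positives: a document can contain every trigram of the query without containing the query itself, so the index alone is not enough to answer the search.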