Indexing The World Wide Web: The Journey So Far by Abhishek Das and Ankit Jain.
Abstract:
In this chapter, we describe the key indexing components of today’s web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We will finish with some thoughts on information organization for the newly emerging data-forms.
A non-trivial survey of the attempts at indexing the web and the issues they raise. It will take a while to digest, but it looks like a very good starting place for working out what to try next.