Building a complete Tweet index by Yi Zhuang.
Since it is Easter Sunday in many religious traditions, what could be more inspirational than “…a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms”?
From the post:
Today [11/8/2014], we are pleased to announce that Twitter now indexes every public Tweet since 2006.
Since that first simple Tweet over eight years ago, hundreds of billions of Tweets have captured everyday human experiences and major historical events. Our search engine excelled at surfacing breaking news and events in real time, and our search index infrastructure reflected this strong emphasis on recency. But our long-standing goal has been to let people search through every Tweet ever published.
This new infrastructure enables many use cases, providing comprehensive results for entire TV and sports seasons, conferences (#TEDGlobal), industry discussions (#MobilePayments), places, businesses and long-lived hashtag conversations across topics, such as #JapanEarthquake, #Election2012, #ScotlandDecides, #HongKong, #Ferguson and many more. This change will be rolling out to users over the next few days.
In this post, we describe how we built a search service that efficiently indexes roughly half a trillion documents and serves queries with an average latency of under 100ms.
The most important factors in our design were:
- Modularity: Twitter already had a real-time index (an inverted index containing about a week’s worth of recent Tweets). We shared source code and tests between the two indices where possible, which created a cleaner system in less time.
- Scalability: The full index is more than 100 times larger than our real-time index and grows by several billion Tweets a week. Our fixed-size real-time index clusters are non-trivial to expand; adding capacity requires re-partitioning and significant operational overhead. We needed a system that expands in place gracefully.
- Cost effectiveness: Our real-time index is fully stored in RAM for low latency and fast updates. However, using the same RAM technology for the full index would have been prohibitively expensive.
- Simple interface: Partitioning is unavoidable at this scale. But we wanted a simple interface that hides the underlying partitions so that internal clients can treat the cluster as a single endpoint.
- Incremental development: The goal of “indexing every Tweet” was not achieved in one quarter. The full index builds on previous foundational projects. In 2012, we built a small historical index of approximately two billion top Tweets, developing an offline data aggregation and preprocessing pipeline. In 2013, we expanded that index by an order of magnitude, evaluating and tuning SSD performance. In 2014, we built the full index with a multi-tier architecture, focusing on scalability and operability.
…
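To make the “inverted index” mentioned under Modularity a little more concrete, here is a minimal sketch in Java: each term maps to a posting list of the tweet IDs that contain it, and an AND query intersects posting lists. This is purely illustrative — Twitter’s actual real-time index is built on heavily customized Lucene and is far more sophisticated than this toy class.

```java
// A minimal, illustrative inverted index: term -> sorted set of tweet IDs.
import java.util.*;

public class TinyInvertedIndex {
    // Posting lists keyed by term. TreeSet keeps tweet IDs ordered,
    // which makes intersecting posting lists straightforward.
    private final Map<String, TreeSet<Long>> postings = new HashMap<>();

    // Tokenize a tweet and add its ID to the posting list of each term.
    public void index(long tweetId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(tweetId);
        }
    }

    // AND query: intersect the posting lists of all query terms.
    public Set<Long> search(String... terms) {
        Set<Long> result = null;
        for (String term : terms) {
            Set<Long> ids = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.index(1L, "just setting up my twttr");
        idx.index(2L, "setting up the full index");
        System.out.println(idx.search("setting", "up")); // prints [1, 2]
    }
}
```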
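Likewise, the “simple interface” point — one endpoint hiding many partitions — usually comes down to a scatter-gather layer: fan the query out to every partition in parallel, gather the partial results, and merge them before returning. The sketch below shows the shape of that idea only; the names (PartitionedSearchService, PartitionSearcher) are hypothetical and are not Twitter’s API, and the merge step just sorts IDs newest-first where a real service would rank by relevance and handle per-partition timeouts.

```java
// Sketch of a single client-facing search() call that hides the partitions:
// scatter the query to each partition in parallel, gather, merge, truncate.
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class PartitionedSearchService {
    // One searcher per partition; in practice each would be a remote shard.
    interface PartitionSearcher {
        List<Long> search(String query, int limit);
    }

    private final List<PartitionSearcher> partitions;
    private final ExecutorService pool;

    PartitionedSearchService(List<PartitionSearcher> partitions) {
        this.partitions = partitions;
        this.pool = Executors.newFixedThreadPool(partitions.size());
    }

    // Single entry point: clients never see the individual partitions.
    public List<Long> search(String query, int limit) throws Exception {
        List<Callable<List<Long>>> tasks = partitions.stream()
                .map(p -> (Callable<List<Long>>) () -> p.search(query, limit))
                .collect(Collectors.toList());
        List<Long> merged = new ArrayList<>();
        for (Future<List<Long>> f : pool.invokeAll(tasks)) {
            merged.addAll(f.get());
        }
        // Newest-first merge as a stand-in for real relevance ranking.
        merged.sort(Comparator.reverseOrder());
        return merged.subList(0, Math.min(limit, merged.size()));
    }

    public static void main(String[] args) throws Exception {
        PartitionSearcher p1 = (q, n) -> Arrays.asList(3L, 1L);
        PartitionSearcher p2 = (q, n) -> Arrays.asList(4L, 2L);
        PartitionedSearchService svc =
                new PartitionedSearchService(Arrays.asList(p1, p2));
        System.out.println(svc.search("earthquake", 3)); // prints [4, 3, 2]
        svc.pool.shutdown();
    }
}
```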
If you are interested in the challenges of scaling search, this is a must-read post!
Kudos to Twitter Engineering!
PS: Of course, all we need now is a complete index of Hillary Clinton’s emails. The NSA probably has a copy.
You know, the NSA could keep the same name, National Security Agency, and take over providing backups and verification for all email and web traffic, including the cloud. We would have to work out who could request copies, but that would resolve the issue of backing up the Internet rather neatly. No more deleted emails, tweets, etc.
That would be a useful function, as opposed to harvesting phone data on the premise that it might prove useful at some point in the future, despite not having proved useful in the past.