Where else but High Scalability would you find a “how-to” article like this one? Complete with guide and source code.
Good DIY project for the weekend.
Major social networking platforms like Facebook and Twitter have developed their own architectures for handling the need for real-time analytics on huge amounts of data. However, not every company has the need or resources to build their own Twitter-like solution.
In this example we have taken the same Twitter/Facebook-like blueprint, and made it simple enough for developers to implement. We have taken the following approach in our implementation:
- Use an in-memory data grid (XAP) for real-time processing of the stream data.
- Use a Big Data database (Cassandra) for storing the historical data and managing the trend analytics.
- Use Cloudify (cloudifysource.org) for managing and automating the deployment on a private or public cloud.
The example demonstrates a simple case of word count analytics. It uses Spring Social to plug in to real Twitter feeds. The solution is designed to cope efficiently with receiving and processing a large volume of tweets. First, we partition the tweets so that we can process them in parallel, but we have to decide how to partition them efficiently. Partitioning by user might not be sufficiently balanced, since a few very active users could overload their partitions. Therefore we decided to partition by the tweet ID, which we assume to be globally unique.
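To make the partitioning idea concrete, here is a minimal sketch in Python. It is not the article's code: the function name `partition_for`, the modulo routing rule, and the sample IDs are all assumptions made for illustration, and the routing a real data grid applies may differ.

```python
from collections import Counter


def partition_for(tweet_id: int, num_partitions: int) -> int:
    """Pick a partition for a tweet from its ID (hypothetical modulo routing)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return tweet_id % num_partitions


# Made-up sample: 100 sequential IDs routed across 4 partitions.
tweet_ids = range(1000, 1100)
load = Counter(partition_for(tweet_id, 4) for tweet_id in tweet_ids)

# Each partition receives the same number of tweets for this sample.
print(sorted(load.items()))  # [(0, 25), (1, 25), (2, 25), (3, 25)]
```

Routing on a unique ID spreads the load regardless of who is tweeting, whereas routing on the user would send every tweet from a very active account to the same partition.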
Then we need to persist and process the data with low latency, and for this we store the tweets in memory.
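As a rough illustration of keeping tweets in memory and counting words as they arrive, the following Python sketch models one partition as a plain dictionary plus a counter. The class `InMemoryPartition`, its methods, and the whitespace tokenization are hypothetical; a real in-memory data grid would add replication, eviction, and write-behind to the historical store.

```python
from collections import Counter


class InMemoryPartition:
    """Hypothetical single partition: raw tweets plus a running word count."""

    def __init__(self) -> None:
        self.tweets: dict[int, str] = {}  # tweet ID -> tweet text
        self.word_counts: Counter = Counter()

    def write(self, tweet_id: int, text: str) -> bool:
        """Store a tweet and update the counts; ignore duplicate IDs."""
        if tweet_id in self.tweets:
            return False
        self.tweets[tweet_id] = text
        # Naive tokenization: lowercase and split on whitespace.
        self.word_counts.update(text.lower().split())
        return True


def merge_counts(partitions: list) -> Counter:
    """Combine the per-partition counts into one global word count."""
    total: Counter = Counter()
    for partition in partitions:
        total.update(partition.word_counts)
    return total


# Usage with made-up tweets spread over two partitions.
p0, p1 = InMemoryPartition(), InMemoryPartition()
p0.write(1000, "Big data is big")
p1.write(1001, "Data grids keep data in memory")
print(merge_counts([p0, p1]).most_common(2))  # [('data', 3), ('big', 2)]
```

Because each partition only updates its own counter, writes need no coordination across partitions; the merge step runs only when a global view is requested.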
Automated harvesting of tweets has real potential, even with clear text transmission. Or perhaps because of it.