Twitter Weather Radar – Test Data for Language Analytics by Nicholas Hartman.
From the post:
Today we’d like to share with you some fun charts that have come out of our internal linguistics research efforts. Specifically, studying weather events by analyzing social media traffic from Twitter.
We do not specialize in social media and most of our data analytics work focuses on the internal operations of leading organizations. Why then would we bother playing around with Twitter data? In short, because it’s good practice. Twitter data mimics a lot of the challenges we face when analyzing the free text streams generated by complex processes. Specifically:
- High Volume: The analysis represented here is looking at around 1 million tweets a day. In the grand scheme of things, that’s not a lot but we’re intentionally running the analysis on a small server. That forces us to write code that rapidly assesses what’s relevant to the question we’re trying to answer and what’s not. In this case the raw tweets were quickly tested live on receipt with about 90% of them discarded. The remaining 10% were passed on to the analytics code.
- Messy Language: A lot of text analytics exercises I’ve seen published use books and news articles as their testing ground. That’s fine if you’re trying to write code to analyze books or news articles, but most of the world’s text is not written with such clean and polished prose. The types of text we encounter (e.g., worklogs from an IT incident management system) are full of slang, incomplete sentences and typos. Our language code needs to be good at determining the messages contained within this messy text.
- Varying Signal to Noise: The incoming stream of tweets will always contain a certain percentage of data that isn’t relevant to the item we’re studying. For example, if a band member from One Direction tweets something even tangentially related to what some code is scanning for, the dataset can be suddenly overwhelmed with a lot of off-topic tweets. Real-world data similarly has a lot of unexpected noise.
In the exercise below, tweets from Twitter’s streaming API JSON stream were scanned in near real-time for their ability to 1) be pinpointed to a specific location and 2) provide potential details on local weather conditions. The vast majority of tweets passing through our code failed to meet both of these conditions. The tweets that remained were analyzed to determine the type of precipitation being discussed.
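The post does not publish its code, so what follows is only a minimal sketch of how such a two-condition filter might look, not the author's method. It assumes the input is one JSON tweet per line with `text`, `coordinates` and `place` fields in the style of Twitter's streaming API. The keyword lists and the function names (`locate`, `classify`, `scan`) are hypothetical. A cheap regular-expression test runs on the raw line first, so most tweets are discarded before any JSON parsing takes place.

```python
import json
import re
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical vocabulary; the original post does not list its keywords.
PRECIP_TERMS = {
    "rain": ("rain", "raining", "drizzle", "downpour"),
    "snow": ("snow", "snowing", "flurries", "blizzard"),
    "hail": ("hail", "hailing"),
    "sleet": ("sleet",),
}

# One pattern per precipitation type, plus a combined pattern for the pre-filter.
TYPE_PATTERNS = {
    kind: re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    for kind, terms in PRECIP_TERMS.items()
}
ALL_TERMS = re.compile(
    r"\b(" + "|".join(t for terms in PRECIP_TERMS.values() for t in terms) + r")\b",
    re.IGNORECASE,
)


def locate(tweet: dict) -> Optional[str]:
    """Return a location label if the tweet can be pinned to a place."""
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude, latitude
        return f"{lat:.2f},{lon:.2f}"
    place = tweet.get("place")
    if place and place.get("full_name"):
        return place["full_name"]
    return None


def classify(text: str) -> Optional[str]:
    """Return the first precipitation type whose keywords appear in the text."""
    for kind, pattern in TYPE_PATTERNS.items():
        if pattern.search(text):
            return kind
    return None


def scan(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (location, precipitation type) for tweets meeting both conditions."""
    for line in lines:
        # Cheap test on the raw line: skip most tweets without parsing them.
        if not ALL_TERMS.search(line):
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed record
        if not isinstance(tweet, dict):
            continue
        location = locate(tweet)
        kind = classify(tweet.get("text") or "")
        if location and kind:
            yield location, kind
```

Keyword matching of this kind is deliberately crude. It will not cope with the slang, typos and off-topic bursts described above, but it shows where a more capable language model would slot in: as a replacement for `classify`.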
An interesting reminder that data to test your data mining/analytics is never far away.
If not Twitter, pick one of the numerous email archives or open data datasets.
The post doesn’t offer any substantial technical details, but then you need to work those out for yourself.