Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 19, 2013

Building the Library of Twitter

Filed under: Intelligence,Security,Tweets — Patrick Durusau @ 7:08 pm

Building the Library of Twitter by Ian Armas Foster.

From the post:

On an average day people around the globe contribute 500 million messages to Twitter. Collecting and storing every single tweet and its resulting metadata from a single day would be a daunting task in and of itself.

The Library of Congress is trying something slightly more ambitious than that: storing and indexing every tweet ever posted.

With the help of social media facilitator Gnip, the Library of Congress aims to create an archive where researchers can access any tweet recorded since Twitter’s inception in 2006.

According to this update on the progress of the seemingly herculean project, the LOC has already archived 170 billion tweets and their respective metadata. That total includes the posts from 2006-2010, which Gnip compressed and sent to the LOC over three different files of 2.3 terabytes each. When the LOC uncompressed the files, they filled 20 terabytes’ worth of server space representing 21 billion tweets and its supplementary 50 metadata fields.

It is often said that 90% of the world’s data has accrued over the last two years. That is remarkably close to the truth for Twitter, as an additional 150 billion tweets (88% of the total) poured into the LOC archive in 2011 and 2012. Further, Gnip delivers hourly updates to the tune of half a billion tweets a day. That means 42 days’ worth of 2012-2013 tweets equal the total amount from 2006-2010. In all, they are dealing with 133.2 terabytes of information.

Now there’s a big data problem for you! Not to mention a resource problem for the Library of Congress.

You might want to make a contribution to help fund their work on this project.

Obviously of incredible value for researchers at all levels, smaller sub-sets of the Twitter stream may be valuable as well.

If I were designing a Twitter based lexicon for covert communication for example, I would want to use frequent terms from particular geographic locations.

And/or create patterns of tweets from particular accounts so that they don’t stand out from others.

Not to mention trying to crunch the Twitter stream for content I know must be present.

No Comments

No comments yet.

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress